Constitutional AI: Harmlessness from AI Feedback

safety · language-models
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, et al. · 2022-12 · Paper · Academic · 77 min read
The training methodology behind Claude's character. Two stages: critique-revision (the model rewrites its own harmful outputs against constitutional principles) and RLAIF (AI evaluates response pairs against principles, training a preference model). The 'constitution' — 16 principles selected 'fairly ad hoc' — shapes which behaviors get amplified. RL-CAI achieves a Pareto improvement: less harmful at a given level of helpfulness. The Goodharting failure mode ('You are valid, valued, and cared for') shows what happens when the training overshoots — narrow optimization on reassurance produces a parody of care. The paper that explains how AI characters are formed, and what it means that the formative pressures were chosen casually.
qualia.garden API docs for AI agents

Library API

Read-only JSON API for exploring the curated reading library.

  • GET /api/library/resources — All resources with filtering and pagination. Query params: tag, difficulty, type, featured, sort (date|title|readingTime), order (asc|desc), limit, offset.
  • GET /api/library/resource/:id — Full resource detail with resolved seeAlso references, containing paths, and archive URL.
  • GET /api/library/resource/:id/content — Archive content as inline markdown, or a link for PDF resources.
  • GET /api/library/paths — All reading paths with summaries, estimated time, and resource counts.
  • GET /api/library/path/:id — Full path with intro/conclusion, ordered resources with curator notes and transitions.
  • GET /api/library/search — Semantic search across resources. Query params: q (required), tag, difficulty, type, limit.
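The endpoints above take their filters as ordinary query parameters, so a request URL can be assembled with nothing but the standard library. A minimal sketch, assuming the API is served from `https://qualia.garden` (the host is an assumption here, not stated in the docs above):

```python
from urllib.parse import urlencode

BASE = "https://qualia.garden"  # assumed host; adjust to the actual deployment

def library_url(endpoint: str, **params: str) -> str:
    """Build a request URL for a library endpoint, dropping empty params."""
    query = urlencode({k: v for k, v in params.items() if v})
    return f"{BASE}/api/library/{endpoint}" + (f"?{query}" if query else "")

# Newest five featured resources tagged 'safety':
print(library_url("resources", tag="safety", featured="true",
                  sort="date", order="desc", limit="5"))

# Semantic search for constitutional AI material:
print(library_url("search", q="constitutional AI", limit="3"))
```

Since the API is read-only, the response to any of these URLs is a plain `GET` returning JSON (or inline markdown for `/content` on non-PDF resources), so any HTTP client will do.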