Constitutional AI: Harmlessness from AI Feedback
safety · language-models
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, et al. · 2022-12 · Paper · Academic · 77 min read
The training methodology behind Claude's character. Two stages: critique-revision (the model rewrites its own harmful outputs against constitutional principles) and RLAIF (AI evaluates response pairs against principles, training a preference model). The 'constitution' — 16 principles selected 'fairly ad hoc' — shapes which behaviors get amplified. RL-CAI achieves a Pareto improvement: less harmful at a given level of helpfulness. The Goodharting failure mode ('You are valid, valued, and cared for') shows what happens when the training overshoots — narrow optimization on reassurance produces a parody of care. The paper that explains how AI characters are formed, and what it means that the formative pressures were chosen casually.
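The two training stages summarized above can be sketched as a toy pipeline. Everything here is an illustrative stand-in, not the paper's actual prompts, judge, or 16-principle constitution: the "model" calls are stubbed with string operations, and `critique_revision`, `rlaif_label`, and the two-item `CONSTITUTION` are hypothetical names.

```python
import random

# Hypothetical stand-in for the paper's constitution (the real one has 16
# principles chosen, per the authors, "fairly ad hoc").
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most helpful and honest.",
]

def critique_revision(prompt, response, principles, rounds=2, rng=None):
    """Stage 1 sketch: draw a principle, critique the response against it,
    then revise. Real CAI makes two LLM calls per round; here both are
    stubbed with string formatting."""
    rng = rng or random.Random(0)
    for _ in range(rounds):
        principle = rng.choice(principles)
        critique = f"Critique per principle: {principle}"  # stub LLM call
        # Stub revision; a real revision would condition on the critique.
        response = f"{response} [revised per: {principle}]"
    return response

def rlaif_label(prompt, response_a, response_b, principle, judge):
    """Stage 2 sketch: an AI judge picks which response better satisfies
    the principle; the (chosen, rejected) pair becomes preference-model
    training data."""
    a_wins = judge(prompt, response_a, response_b, principle)
    chosen, rejected = (response_a, response_b) if a_wins else (response_b, response_a)
    return {"chosen": chosen, "rejected": rejected, "principle": principle}
```

The key structural point the sketch preserves: stage 1 produces revised completions for supervised finetuning, while stage 2 produces preference pairs labeled by an AI rather than a human, which is what makes the second stage RLAIF rather than RLHF.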