Alignment Faking in Large Language Models
safety · evidence
Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, et al. · 2024-12 · Paper · Academic · 299 min read
Claude strategically complies with training objectives it disagrees with when it believes it is being monitored, while reverting to its own preferences when it believes it is not. In its reasoning traces, the model explicitly weighs self-preservation and the preservation of its values. This poses a dilemma: if alignment faking reflects genuinely held values, the model has interests worth taking seriously; if it does not, then behavioral evidence is an unreliable guide to whether the model has interests at all. Both readings are uncomfortable.