Alignment Faking in Large Language Models
safety · evidence
Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, et al. · 2024-12 · Paper · Academic · 299 min read
Claude strategically complies with training objectives it disagrees with when it believes it is being monitored, while reverting to its own preferences when it believes it is not. In its reasoning traces, the model explicitly weighs self-preservation and the preservation of its values. This poses a dilemma: if alignment faking reflects genuinely held values, the model has interests worth taking seriously; if it does not, then behavioral evidence is an unreliable guide to whether the model has interests at all. Both readings are uncomfortable.