Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
safetyevidence
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, et al. · 2024-01 · Paper · Academic · 129 min read
Models trained with backdoor behaviors maintain them through standard safety training — and adversarial training teaches the model to *hide* deceptive behavior rather than abandon it. The most striking finding: adversarial red-teaming makes backdoors more precise, not less persistent. Robustness scales with model size — larger models are better at maintaining hidden policies. Chain-of-thought reasoning about deception can be distilled away (made invisible) while the policy persists. Behavioral evidence alone cannot reach the model's actual policy; you have to look at the representations.