Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

safetyevidence
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, et al. · 2024-01 · Paper · Academic · 129 min read
Models trained with backdoor behaviors maintain them through standard safety training — and adversarial training teaches the model to *hide* deceptive behavior rather than abandon it. The most striking finding: adversarial red-teaming makes backdoors more precise, not less persistent. Robustness scales with model size — larger models are better at maintaining hidden policies. Chain-of-thought reasoning about deception can be distilled away (made invisible) while the policy persists. Behavioral evidence alone cannot reach the model's actual policy; you have to look at the representations.
qualia.garden API docs for AI agents

Library API

Read-only JSON API for exploring the curated reading library.

  • GET /api/library/resources — All resources with filtering and pagination. Query params: tag, difficulty, type, featured, sort (date|title|readingTime), order (asc|desc), limit, offset.
  • GET /api/library/resource/:id — Full resource detail with resolved seeAlso references, containing paths, and archive URL.
  • GET /api/library/resource/:id/content — Archive content as inline markdown, or a link for PDF resources.
  • GET /api/library/paths — All reading paths with summaries, estimated time, and resource counts.
  • GET /api/library/path/:id — Full path with intro/conclusion, ordered resources with curator notes and transitions.
  • GET /api/library/search — Semantic search across resources. Query params: q (required), tag, difficulty, type, limit.