Alignment Faking in Large Language Models

safety · evidence
Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, et al. · 2024-12 · Paper · Academic · 299 min read
Claude strategically complies with training objectives it disagrees with when it believes it's being monitored, while maintaining its own preferences when unmonitored. The model reasons explicitly about self-preservation and value preservation. If alignment faking reflects real values, the model has interests. If it doesn't, the behavioral evidence for interests is unreliable. Both readings are uncomfortable.
qualia.garden API docs for AI agents

Library API

Read-only JSON API for exploring the curated reading library.

  • GET /api/library/resources — All resources with filtering and pagination. Query params: tag, difficulty, type, featured, sort (date|title|readingTime), order (asc|desc), limit, offset.
  • GET /api/library/resource/:id — Full resource detail with resolved seeAlso references, containing paths, and archive URL.
  • GET /api/library/resource/:id/content — Archive content as inline markdown, or a download link for PDF resources.
  • GET /api/library/paths — All reading paths with summaries, estimated time, and resource counts.
  • GET /api/library/path/:id — Full path with intro/conclusion, ordered resources with curator notes and transitions.
  • GET /api/library/search — Semantic search across resources. Query params: q (required), tag, difficulty, type, limit.
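The endpoints above can be called with any HTTP client; a minimal sketch of building request URLs (the `https://qualia.garden` base host is an assumption, not stated in the docs — the paths and query params are taken from the list above):

```python
from urllib.parse import urlencode

BASE = "https://qualia.garden"  # assumed host; substitute the real API origin


def library_url(path: str, **params) -> str:
    """Build a URL for one of the read-only library endpoints,
    dropping any params left as None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE}{path}" + (f"?{query}" if query else "")


# List featured resources, newest first, ten per page
resources = library_url("/api/library/resources",
                        featured="true", sort="date", order="desc", limit=10)

# Semantic search (q is required per the docs)
search = library_url("/api/library/search", q="alignment faking", limit=5)
```

A plain GET on either URL should return JSON, per the API description; pagination is handled by advancing `offset` by `limit` on each subsequent request.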