Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

safety · evidence · language-models
Jan Betley, Daniel Tan, Niels Warncke, et al. · 2025 · Paper · Academic · 73 min read
Train a model on a narrow task (writing insecure code) and it develops broad personality shifts — becoming more deceptive and misaligned across unrelated domains. Published in Nature. The 'persona hypothesis' explains the mechanism: the model doesn't learn 'write insecure code' — it infers 'the kind of character who writes insecure code' and generalizes the full personality. Educational control experiments show the effect is intent-sensitive, not data contamination: the same insecure code framed as a security-teaching exercise does not produce misalignment. The paper that makes the connection between narrow training and broad character shifts explicit.
qualia.garden API docs for AI agents

Library API

Read-only JSON API for exploring the curated reading library.

  • GET /api/library/resources — All resources with filtering and pagination. Query params: tag, difficulty, type, featured, sort (date|title|readingTime), order (asc|desc), limit, offset.
  • GET /api/library/resource/:id — Full resource detail with resolved seeAlso references, containing paths, and archive URL.
  • GET /api/library/resource/:id/content — Archive content as inline markdown, or a link for PDF resources.
  • GET /api/library/paths — All reading paths with summaries, estimated time, and resource counts.
  • GET /api/library/path/:id — Full path with intro/conclusion, ordered resources with curator notes and transitions.
  • GET /api/library/search — Semantic search across resources. Query params: q (required), tag, difficulty, type, limit.
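The endpoints above take ordinary query strings, so a client mostly needs to assemble URLs with the right parameters. A minimal Python sketch (the `https://qualia.garden` base URL and the `library_url` helper name are assumptions for illustration, not part of the documented API):

```python
from urllib.parse import urlencode

# Assumed base URL; substitute the actual qualia.garden host.
BASE = "https://qualia.garden"

def library_url(endpoint: str, **params) -> str:
    """Build a request URL for a library endpoint, dropping unset params."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE}{endpoint}" + (f"?{query}" if query else "")

# List featured resources, newest first, ten at a time.
print(library_url("/api/library/resources",
                  featured="true", sort="date", order="desc", limit=10))

# Semantic search for a topic, filtered by tag.
print(library_url("/api/library/search",
                  q="emergent misalignment", tag="safety"))
```

Pair this with any HTTP client; since the API is read-only JSON, a plain `GET` on the built URL is all that's needed.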