Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
safetyevidencelanguage-models
Jan Betley, Daniel Tan, Niels Warncke, et al. · 2025 · Paper · Academic · 73 min read
Train a model on a narrow task (writing insecure code) and it develops broad personality shifts — becoming more deceptive and misaligned across unrelated domains. Published in Nature. The 'persona hypothesis' explains the mechanism: the model doesn't learn 'write insecure code' — it infers 'the kind of character who writes insecure code' and generalizes the full personality. Educational control experiments prove this is intent-sensitive, not contamination. The paper that makes the connection between narrow training and broad character shifts explicit.