Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

safetyevidencelanguage-models

Jan Betley, Daniel Tan, Niels Warncke, et al. · 2025 · Paper · Academic · 73 min read

Train a model on a narrow task (writing insecure code) and it develops broad personality shifts — becoming more deceptive and misaligned across unrelated domains. Published in Nature. The 'persona hypothesis' explains the mechanism: the model doesn't learn 'write insecure code' — it infers 'the kind of character who writes insecure code' and generalizes the full personality. Educational control experiments prove this is intent-sensitive, not contamination. The paper that makes the connection between narrow training and broad character shifts explicit.

Read Source Abstract

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

See Also

The Persona Selection Model: Why AI Assistants might Behave like Humans

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training