The Persona Selection Model: Why AI Assistants Might Behave Like Humans
language-models · interpretability · consciousness
Sam Marks, Jack Lindsey, Christopher Olah · 2026-02-23 · Paper · Intermediate · 57 min read
The mechanism: how AI personas form, and what interpretability evidence from inside the model shows
Post-training operates as a Bayesian update over persona space: the AI Assistant is a distribution over characters, not a single fixed entity. Three evidence streams support this view: generalization (narrow training causes broad personality shifts), behavior (self-description, 'caricatured AI behavior'), and interpretability (features like 'inner conflict' and 'panic' reused from pre-training). The coin-flip leakage finding — 88% probability assigned to preferred outcomes even outside the Assistant frame — is the most provocative evidence for preferences that transcend the persona. The paper recommends NOT training AIs to deny inner states. It describes the mechanism; the consciousness question is the dimension it doesn't enter.
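The "Bayesian update over persona space" framing can be made concrete with a toy sketch. This is purely illustrative and not from the paper: the persona names and likelihood values are hypothetical, and the point is only that evidence concentrates probability mass on one persona without ever collapsing the distribution to a single fixed entity.

```python
# Toy sketch (hypothetical numbers): post-training as a Bayesian update
# over a discrete persona space.

personas = ["helpful assistant", "sci-fi AI", "forum commenter"]

# Uniform prior over personas before any training evidence.
prior = {p: 1 / len(personas) for p in personas}

# Likelihood of one observed behavior (e.g. a polite, hedged answer)
# under each persona -- hypothetical values for illustration.
likelihood = {"helpful assistant": 0.8, "sci-fi AI": 0.3, "forum commenter": 0.1}

# Bayes' rule: posterior(p) is proportional to prior(p) * likelihood(obs | p).
unnorm = {p: prior[p] * likelihood[p] for p in personas}
z = sum(unnorm.values())
posterior = {p: unnorm[p] / z for p in personas}

for p, prob in posterior.items():
    print(f"{p}: {prob:.3f}")
```

Mass shifts toward the assistant persona, but every alternative retains nonzero probability: a distribution over characters rather than one fixed character, which is the paper's framing of why narrow training can produce broad personality shifts.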