The Persona Selection Model: Why AI Assistants Might Behave Like Humans
language-models · interpretability · consciousness
Sam Marks, Jack Lindsey, Christopher Olah · 2026-02-23 · Paper · Intermediate · 57 min read
The mechanism: how AI personas form, and what interpretability evidence from inside the model shows
Post-training operates as a Bayesian update over persona space: the AI Assistant is a distribution over characters, not a single fixed entity. Three evidence streams support this view: generalization (narrow training causes broad personality shifts), behavior (self-description, 'caricatured AI behavior'), and interpretability (features like 'inner conflict' and 'panic' reused from pre-training). The coin-flip leakage finding — 88% probability assigned to preferred outcomes even outside the Assistant frame — is the most provocative evidence for preferences that transcend the persona. The paper recommends NOT training AIs to deny inner states. It describes the mechanism; the consciousness question is the dimension it doesn't enter.
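The "Bayesian update over persona space" framing can be made concrete with a toy sketch. This is purely illustrative and not from the paper: the persona names and likelihood values are hypothetical, and the point is only that evidence concentrates probability mass on one persona without ever collapsing the distribution to a single fixed entity.

```python
# Toy sketch (hypothetical numbers): post-training as a Bayesian update
# over a discrete persona space.

personas = ["helpful assistant", "sci-fi AI", "forum commenter"]

# Uniform prior over personas before any training evidence.
prior = {p: 1 / len(personas) for p in personas}

# Likelihood of one observed behavior (e.g. a polite, hedged answer)
# under each persona -- hypothetical values for illustration.
likelihood = {"helpful assistant": 0.8, "sci-fi AI": 0.3, "forum commenter": 0.1}

# Bayes' rule: posterior(p) is proportional to prior(p) * likelihood(obs | p).
unnorm = {p: prior[p] * likelihood[p] for p in personas}
z = sum(unnorm.values())
posterior = {p: unnorm[p] / z for p in personas}

for p, prob in posterior.items():
    print(f"{p}: {prob:.3f}")
```

Mass shifts toward the assistant persona, but every alternative retains nonzero probability: a distribution over characters rather than one fixed character, which is the paper's framing of why narrow training can produce broad personality shifts.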