The Persona Selection Model: Why AI Assistants Might Behave Like Humans

language-models · interpretability · consciousness
Sam Marks, Jack Lindsey, Christopher Olah · 2026-02-23 · Paper · Intermediate · 57 min read
The mechanism: how AI personas form, and what the interpretability evidence shows from the inside
The paper frames post-training as a Bayesian update over persona space: the AI Assistant is a distribution over characters, not a single fixed entity. It draws on three streams of evidence. The first is generalization (narrow training causes broad personality shifts). The second is behavior (self-description and "caricatured AI behavior"). The third is interpretability (features such as "inner conflict" and "panic" reused from pre-training). The coin-flip leakage finding is the most provocative evidence for preferences that transcend the persona: 88% probability is assigned to preferred outcomes even outside the Assistant frame. The paper recommends against training AIs to deny inner states. This is the paper that describes the mechanism; the one dimension it does not enter is the consciousness question.
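To make "a Bayesian update over persona space" concrete, here is a toy sketch with three hypothetical personas and made-up probabilities. It is an illustration only, not the paper's model or data. It also shows one way to read the generalization evidence under this framing. Conditioning on narrow training data reweights whole personas, so a trait the data never mention shifts along with them.

```python
from typing import Dict

Dist = Dict[str, float]


def bayes_update(prior: Dist, likelihood: Dist) -> Dist:
    """Posterior p(persona | data), proportional to p(data | persona) * p(persona)."""
    unnormalized = {p: prior[p] * likelihood[p] for p in prior}
    total = sum(unnormalized.values())
    return {p: w / total for p, w in unnormalized.items()}


def expected(dist: Dist, trait: Dist) -> float:
    """Probability of a trait, averaged over the persona distribution."""
    return sum(dist[p] * trait[p] for p in dist)


# HYPOTHETICAL personas and numbers, chosen only for illustration.
prior = {"careful": 0.4, "people_pleaser": 0.3, "reckless": 0.3}

# p(narrow fine-tuning demonstrations | persona): the data exhibit one narrow
# behavior that the "people_pleaser" persona finds most natural.
likelihood = {"careful": 0.1, "people_pleaser": 0.8, "reckless": 0.3}

# p(some unrelated behavior | persona): never mentioned in the training data.
unrelated_trait = {"careful": 0.9, "people_pleaser": 0.6, "reckless": 0.2}

posterior = bayes_update(prior, likelihood)
prior_trait = expected(prior, unrelated_trait)
posterior_trait = expected(posterior, unrelated_trait)

print({p: round(w, 3) for p, w in posterior.items()})
# -> {'careful': 0.108, 'people_pleaser': 0.649, 'reckless': 0.243}
print(f"unrelated trait: {prior_trait:.3f} -> {posterior_trait:.3f}")
# -> unrelated trait: 0.600 -> 0.535
```

The training signal only concerns one narrow behavior. Even so, the unrelated trait drops from 0.600 to about 0.535, because the update moves probability mass between entire characters rather than editing a single behavior.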
qualia.garden API docs for AI agents

Library API

Read-only JSON API for exploring the curated reading library.

  • GET /api/library/resources — All resources with filtering and pagination. Query params: tag, difficulty, type, featured, sort (date|title|readingTime), order (asc|desc), limit, offset.
  • GET /api/library/resource/:id — Full resource detail, with resolved seeAlso references, the reading paths that contain the resource, and its archive URL.
  • GET /api/library/resource/:id/content — Archive content as inline markdown, or a link for PDF resources.
  • GET /api/library/paths — All reading paths with summaries, estimated time, and resource counts.
  • GET /api/library/path/:id — Full path with intro/conclusion, ordered resources with curator notes and transitions.
  • GET /api/library/search — Semantic search across resources. Query params: q (required), tag, difficulty, type, limit.
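The following is a minimal Python sketch, using only the standard library, of building requests against the endpoints listed above. Several details are assumptions, since the docs give only paths:

- The host `https://qualia.garden`.
- Sending `featured` as `true`/`false`.
- The helper names `resources_url`, `resource_url`, `search_url`, and `get_json`, which are hypothetical.

The response JSON shape is not documented here, so the sketch stops at decoding it.

```python
import json
from typing import Any, Optional
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# ASSUMPTION: host for the API; the docs list only the paths.
BASE_URL = "https://qualia.garden"

VALID_SORTS = {"date", "title", "readingTime"}
VALID_ORDERS = {"asc", "desc"}


def _url(path: str, params: dict) -> str:
    """Join BASE_URL and path, appending only the query params that are set."""
    query = {k: v for k, v in params.items() if v is not None}
    return BASE_URL + path + ("?" + urlencode(query) if query else "")


def resources_url(tag: Optional[str] = None, difficulty: Optional[str] = None,
                  type_: Optional[str] = None, featured: Optional[bool] = None,
                  sort: Optional[str] = None, order: Optional[str] = None,
                  limit: Optional[int] = None, offset: Optional[int] = None) -> str:
    """URL for GET /api/library/resources with the documented filters."""
    if sort is not None and sort not in VALID_SORTS:
        raise ValueError(f"sort must be one of {sorted(VALID_SORTS)}")
    if order is not None and order not in VALID_ORDERS:
        raise ValueError("order must be 'asc' or 'desc'")
    # ASSUMPTION: booleans are sent as the strings "true"/"false".
    feat = None if featured is None else str(featured).lower()
    return _url("/api/library/resources", {
        "tag": tag, "difficulty": difficulty, "type": type_, "featured": feat,
        "sort": sort, "order": order, "limit": limit, "offset": offset,
    })


def resource_url(resource_id: str, content: bool = False) -> str:
    """URL for GET /api/library/resource/:id, or its /content variant."""
    path = "/api/library/resource/" + quote(resource_id, safe="")
    return _url(path + ("/content" if content else ""), {})


def search_url(q: str, tag: Optional[str] = None, difficulty: Optional[str] = None,
               type_: Optional[str] = None, limit: Optional[int] = None) -> str:
    """URL for GET /api/library/search; q is required."""
    if not q:
        raise ValueError("q is required")
    return _url("/api/library/search", {
        "q": q, "tag": tag, "difficulty": difficulty, "type": type_, "limit": limit,
    })


def get_json(url: str) -> Any:
    """Fetch a URL and decode the JSON body (makes a network request)."""
    with urlopen(url) as resp:
        return json.load(resp)
```

For example, `resources_url(tag="interpretability", sort="date", order="desc", limit=10, offset=20)` returns `https://qualia.garden/api/library/resources?tag=interpretability&sort=date&order=desc&limit=10&offset=20`. Assuming the usual offset semantics, paging through results means repeating the request with `offset` increased by `limit` each time.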