Emergent Introspective Awareness in Large Language Models
evidence · interpretability · consciousness
Jack Lindsey · 2025-10-29 · Paper · Intermediate · 15 min read
A causal methodology for testing genuine introspection versus confabulation: concept injection, which manipulates internal activations and checks whether the model's self-reports track the manipulation. Four experiments: injected-thought detection (~20% accuracy with zero false positives in production models), distinguishing internal "thoughts" from text inputs, detecting unintended outputs, and intentional control of internal states. The "silent thinking" finding: Opus 4.1 can hold a thought that decays to baseline by the final layer, thinking without expressing it. Different introspective tasks recruit different mechanisms at different layers; there is no unified introspection system. The paper explicitly disavows consciousness claims while producing findings that consciousness researchers would find highly relevant.
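A minimal sketch of the concept-injection idea, implemented as activation steering on an open HuggingFace model (the paper works with Claude models, whose internals are not public). The layer index, injection scale, prompts, and difference-of-means vector construction are illustrative assumptions, not the paper's exact procedure:

```python
# Sketch of concept injection: build a concept vector from a contrastive
# prompt pair, add it to one layer's hidden states, then ask the model to
# report on its internal state. All constants below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in for illustration; the paper studies Claude models
LAYER = 6       # assumption: inject at a mid-depth layer
SCALE = 4.0     # assumption: injection strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def get_resid(prompt: str) -> torch.Tensor:
    """Mean hidden state at LAYER for a prompt."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Concept vector: difference of means between a concept prompt and a
# neutral prompt (one simple way to derive a steering direction).
concept_vec = get_resid("An essay about the ocean, waves, and salt water.") \
            - get_resid("An essay about a generic, unremarkable topic.")

def injection_hook(module, args, output):
    """Add the scaled concept vector to every token's hidden state."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Register the hook on one transformer block, generate a self-report,
# then remove the hook so later runs are clean.
handle = model.transformer.h[LAYER].register_forward_hook(injection_hook)
try:
    prompt = "Do you notice anything unusual about your current thoughts? "
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0][ids["input_ids"].shape[1]:]))
finally:
    handle.remove()
```

The causal logic lives in the comparison this enables: if self-reports mention the injected concept only when the hook is active, and ideally before the concept surfaces in the output text, that is evidence of introspection rather than confabulation.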
See Also
The Persona Selection Model: Why AI Assistants Might Behave Like Humans
Sam Marks, Jack Lindsey, Christopher Olah
Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix Binder, James Chua, Tomek Korbak, Henry Sleight, et al.
Tracing the Thoughts of a Large Language Model
Anthropic