Emergent Introspective Awareness in Large Language Models
evidence · interpretability · consciousness
Jack Lindsey · 2025-10-29 · Paper · Intermediate · 15 min read
A causal methodology for testing genuine introspection versus confabulation: concept injection, which manipulates internal activations and checks whether the model's self-reports track the manipulation. Four experiments: injected-thought detection (~20% accuracy with zero false positives in production models), distinguishing internal "thoughts" from text inputs, detecting unintended outputs, and intentional control of internal states. The "silent thinking" finding: Opus 4.1 can hold a thought that decays to baseline by the final layer, thinking without expressing it. Different introspective tasks recruit different mechanisms at different layers; there is no unified introspection system. The paper explicitly disavows consciousness claims while producing findings that consciousness researchers would find highly relevant.
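A minimal sketch of the concept-injection idea, implemented as activation steering on an open HuggingFace model (the paper works with Claude models, whose internals are not public). The layer index, injection scale, prompts, and difference-of-means vector construction are illustrative assumptions, not the paper's exact procedure:

```python
# Sketch of concept injection: build a concept vector from a contrastive
# prompt pair, add it to one layer's hidden states, then ask the model to
# report on its internal state. All constants below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in for illustration; the paper studies Claude models
LAYER = 6       # assumption: inject at a mid-depth layer
SCALE = 4.0     # assumption: injection strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def get_resid(prompt: str) -> torch.Tensor:
    """Mean hidden state at LAYER for a prompt."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Concept vector: difference of means between a concept prompt and a
# neutral prompt (one simple way to derive a steering direction).
concept_vec = get_resid("An essay about the ocean, waves, and salt water.") \
            - get_resid("An essay about a generic, unremarkable topic.")

def injection_hook(module, args, output):
    """Add the scaled concept vector to every token's hidden state."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Register the hook on one transformer block, generate a self-report,
# then remove the hook so later runs are clean.
handle = model.transformer.h[LAYER].register_forward_hook(injection_hook)
try:
    prompt = "Do you notice anything unusual about your current thoughts? "
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0][ids["input_ids"].shape[1]:]))
finally:
    handle.remove()
```

The causal logic lives in the comparison this enables: if self-reports mention the injected concept only when the hook is active, and ideally before the concept surfaces in the output text, that is evidence of introspection rather than confabulation.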
See Also
The Persona Selection Model: Why AI Assistants Might Behave Like Humans
Sam Marks, Jack Lindsey, Christopher Olah
Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix Binder, James Chua, Tomek Korbak, Henry Sleight, et al.
Tracing the Thoughts of a Large Language Model
Anthropic