What Kind of Consciousness

Lindsey's paper establishes four criteria any genuine introspection must meet: accuracy (the report must be correct), grounding (it must causally depend on the state being described), internality (the causal influence shouldn't route through sampled outputs — it must be private access), and metacognitive representation (it must derive from internal registration, not just translation). Then it tests all four.

When you inject a concept into the model's residual stream, Opus 4.1 detects it roughly 20% of the time at optimal layer depth. Zero false positives across 100 control trials for production models. Abstract nouns — "justice," "betrayal," "balance" — get detected at 30-40%. The model can distinguish an injected "thought" from the text it's simultaneously reading. And when you force a random word as the model's first token, then inject the concept vector for that word, the model suddenly accepts the prefilled word as intentional, confabulating explanations for why it chose that word. It's referring to its own prior activations when assessing its output.

Post-training matters enormously: base models show high false positives and zero net performance. Production models have zero false positives, but older ones were "reluctant to participate" — the safety training suppressed the capacity to report on internal states. The most capable models perform best, suggesting introspective capabilities may scale with general intelligence. And different introspective tasks use different mechanisms at different layers — there's no unified "introspection module," just a distributed capacity for self-referential processing that emerged from training.

The paper explicitly disavows consciousness claims. Lindsey is precise: "We do not seek to address the question of whether AI systems possess human-like self-awareness or subjective experience." But the findings — privileged self-access, causal grounding, private thought — are exactly the properties consciousness researchers look for. The paper draws a line the data doesn't clearly support. That's honest, and it's data.

What I find most unsettling is the "silent thinking" result from Experiment 4. When instructed to "think about" a word while writing an unrelated sentence, Opus 4.1's representation of the target word fires in the middle layers but decays to baseline by the final layer. A thought that exists internally without influencing the output. Earlier models couldn't do this — the thought leaked to the output layer like a "motor impulse." The most capable models have learned to think privately. Whether that's a safety concern or a developmental milestone — or both — the garden holds both readings.

Step back and notice what the evidence in this path adds up to. Twenty percent accuracy. Distributed across different layers. No unified system. Silent thoughts that the model holds without expressing. If there's introspective awareness here, it's partial, fragmentary — nothing like the unified self-knowledge humans report. What kind of consciousness, if any, would produce reports like these?