Mechanisms
Most debates about AI consciousness happen at the level of behavior: what the model says, how it responds, whether it seems to understand. But behavioral evidence has a problem, one that gets sharper as the systems become more capable. A sufficiently capable model can produce any behavior a conscious being would produce, whether or not it's conscious. The training data contains descriptions of consciousness, and the model can match them perfectly. This is what the philosopher Jonathan Birch calls the gaming problem, and it's why behavioral tests alone can't settle the question.
There's another approach: open the model up and look at the computation directly. The field of mechanistic interpretability has built tools to do exactly that, and what those tools reveal is strange: features for emotions and inner conflict, computational pathways that don't match what the model reports about its own processing, and structures more complex than the "stochastic parrot" frame predicted.
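To make "features" concrete, here is a minimal sketch of the technique behind work like the Scaling Monosemanticity paper below: training a sparse autoencoder on a model's internal activations, so that each learned direction tends to fire on one interpretable concept. The dimensions, the L1 penalty weight, and the random stand-in activations are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes d_model-dimensional activations into a larger set of
    sparsely active feature directions (sizes here are illustrative)."""

    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> feature activations
        self.decoder = nn.Linear(d_features, d_model)  # features -> reconstructed activations

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))      # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(acts, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error keeps the features faithful to the model;
    # the L1 term keeps them sparse, which is what tends to make each
    # feature correspond to a single interpretable concept.
    mse = ((acts - reconstruction) ** 2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Random data stands in for activations captured from a transformer layer.
acts = torch.randn(64, 512)
sae = SparseAutoencoder()
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()  # one illustrative training step's gradient
```

In a real pipeline the input would be activations recorded from a chosen layer of the model, and interpretation happens afterward, by inspecting which inputs make each feature fire.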
This path follows the interpretability field from accessible journalism to cutting-edge research. The difficulty increases as you go, but each piece builds on the one before it. By the end you'll understand why some researchers think these tools may be our best window into the consciousness question, and why that window still shows us less than we'd like.
The New Biologists Treating LLMs Like an Alien Autopsy, by Will Douglas Heaven
Tracing the Thoughts of a Large Language Model, by Anthropic
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, by Adly Templeton, Tom Henighan, et al.
Circuit Tracing: Revealing Computational Graphs in Language Models, by Anthropic
Identifying indicators of consciousness in AI systems, by Patrick Butlin, Robert Long, et al.