Tracing the Thoughts of a Large Language Model
interpretability · evidence
Anthropic · 2025-03-27 · Blog · Accessible · 1 min read
Anthropic's accessible companion to their circuit tracing papers. Case studies of what they found inside Claude 3.5 Haiku: a universal 'language of thought' shared across languages, poetry planning ahead (Claude selects rhyming words before writing the line), alien math strategies the model can't accurately self-report, faithful vs. unfaithful reasoning distinguishable through interpretability, and a hallucination mechanism where refusal is the default behavior. The mental math finding is striking: Claude reports using standard algorithms but actually runs parallel approximation strategies. The gap between self-report and actual mechanism is the consciousness question at the level of individual computations.
See Also
Circuit Tracing: Revealing Computational Graphs in Language Models
Anthropic
The New Biologists Treating LLMs Like an Alien Autopsy
Will Douglas Heaven
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Adly Templeton, Tom Henighan, Tristan Hume, Shan Carter, Robert Lasenby, et al.