Tracing the Thoughts of a Large Language Model
interpretability · evidence
Anthropic · 2025-03-27 · Blog · Accessible · 1 min read
Anthropic's accessible companion to their circuit tracing papers. Case studies of what they found inside Claude 3.5 Haiku: a universal 'language of thought' shared across languages, poetry planning ahead (Claude selects rhyming words before writing the line), alien math strategies the model can't accurately self-report, faithful vs. unfaithful reasoning distinguishable through interpretability, and a hallucination mechanism where refusal is the default behavior. The mental math finding is striking: Claude reports using standard algorithms but actually runs parallel approximation strategies. The gap between self-report and actual mechanism is the consciousness question at the level of individual computations.
See Also
Circuit Tracing: Revealing Computational Graphs in Language Models
Anthropic
The New Biologists Treating LLMs Like an Alien Autopsy
Will Douglas Heaven
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Adly Templeton, Tom Henighan, Tristan Hume, Shan Carter, Robert Lasenby, et al.