Circuit Tracing: Revealing Computational Graphs in Language Models
interpretability · evidence
Anthropic · 2025-03 · Paper · Academic · 152 min read
Maps the computational pathways through which features influence model outputs, producing attribution graphs that show which features contribute to which behaviors. Reveals hidden goals in personas (with 100/100 activation for goals never stated in the conversation), three distinct modes of reasoning (faithful, bullshitting, and motivated), and how emotional features fire along specific computational pathways. This is the methodology for examining the computation itself, not just the output. If consciousness has neural correlates, interpretability tools like this are how we would find computational correlates in AI systems.
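To make the attribution-graph idea concrete, here is a minimal NumPy sketch of one way edges between features can be scored: the contribution of an upstream feature to a downstream feature as its activation times the "virtual weight" between the upstream write direction and the downstream read direction. This is an illustration of the linear-attribution concept, not the paper's actual implementation; all names and shapes (`attribution_edges`, `W_dec_up`, `W_enc_down`, `top_k`) are assumptions made for the example.

```python
import numpy as np

def attribution_edges(acts_up, W_dec_up, W_enc_down, top_k=5):
    """Direct-effect edges from upstream features to downstream features.

    acts_up:    (n_up,)           upstream feature activations
    W_dec_up:   (n_up, d_model)   upstream decoder (write) directions
    W_enc_down: (n_down, d_model) downstream encoder (read) directions
    """
    # "Virtual weight" between feature pairs: how much of feature i's
    # write direction lands on feature j's read direction.
    virtual = W_dec_up @ W_enc_down.T            # (n_up, n_down)
    # Edge weight = upstream activation times virtual weight, i.e. the
    # linear contribution of feature i to feature j's pre-activation.
    edges = acts_up[:, None] * virtual
    # Keep only the strongest incoming edges per downstream feature;
    # attribution graphs are heavily pruned in practice.
    order = np.argsort(-np.abs(edges), axis=0)[:top_k]
    return {j: [(int(i), float(edges[i, j])) for i in order[:, j]]
            for j in range(edges.shape[1])}

# Toy usage with random directions, purely illustrative.
rng = np.random.default_rng(0)
n_up, n_down, d = 8, 6, 16
graph = attribution_edges(rng.random(n_up),
                          rng.normal(size=(n_up, d)),
                          rng.normal(size=(n_down, d)),
                          top_k=3)
print(graph[0])  # strongest upstream contributors to downstream feature 0
```

The pruning step matters: a full feature-to-feature matrix is far too dense to read, so an attribution graph keeps only the edges that carry meaningful weight.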
See Also
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Adly Templeton, Tom Henighan, Tristan Hume, Shan Carter, Robert Lasenby, et al.
Tracing the Thoughts of a Large Language Model
Anthropic
The Persona Selection Model: Why AI Assistants Might Behave Like Humans
Sam Marks, Jack Lindsey, Christopher Olah