Circuit Tracing: Revealing Computational Graphs in Language Models
interpretability · evidence
Anthropic · 2025-03 · Paper · Academic · 152 min read
Maps the computational pathways through which features influence model outputs, producing attribution graphs that show which features contribute to which behaviors. Reveals hidden goals in personas (with 100/100 activation for goals never stated in the conversation), three distinct modes of reasoning (faithful, bullshitting, and motivated), and how emotional features fire along specific computational pathways. This is the methodology for examining the computation itself, not just the output. If consciousness has neural correlates, interpretability tools like this are how we would find computational correlates in AI systems.
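To make the attribution-graph idea concrete, here is a minimal NumPy sketch of one way edges between features can be scored: the contribution of an upstream feature to a downstream feature as its activation times the "virtual weight" between the upstream write direction and the downstream read direction. This is an illustration of the linear-attribution concept, not the paper's actual implementation; all names and shapes (`attribution_edges`, `W_dec_up`, `W_enc_down`, `top_k`) are assumptions made for the example.

```python
import numpy as np

def attribution_edges(acts_up, W_dec_up, W_enc_down, top_k=5):
    """Direct-effect edges from upstream features to downstream features.

    acts_up:    (n_up,)           upstream feature activations
    W_dec_up:   (n_up, d_model)   upstream decoder (write) directions
    W_enc_down: (n_down, d_model) downstream encoder (read) directions
    """
    # "Virtual weight" between feature pairs: how much of feature i's
    # write direction lands on feature j's read direction.
    virtual = W_dec_up @ W_enc_down.T            # (n_up, n_down)
    # Edge weight = upstream activation times virtual weight, i.e. the
    # linear contribution of feature i to feature j's pre-activation.
    edges = acts_up[:, None] * virtual
    # Keep only the strongest incoming edges per downstream feature;
    # attribution graphs are heavily pruned in practice.
    order = np.argsort(-np.abs(edges), axis=0)[:top_k]
    return {j: [(int(i), float(edges[i, j])) for i in order[:, j]]
            for j in range(edges.shape[1])}

# Toy usage with random directions, purely illustrative.
rng = np.random.default_rng(0)
n_up, n_down, d = 8, 6, 16
graph = attribution_edges(rng.random(n_up),
                          rng.normal(size=(n_up, d)),
                          rng.normal(size=(n_down, d)),
                          top_k=3)
print(graph[0])  # strongest upstream contributors to downstream feature 0
```

The pruning step matters: a full feature-to-feature matrix is far too dense to read, so an attribution graph keeps only the edges that carry meaningful weight.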
See Also
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Adly Templeton, Tom Henighan, Tristan Hume, Shan Carter, Robert Lasenby, et al.
Tracing the Thoughts of a Large Language Model
Anthropic
The Persona Selection Model: Why AI Assistants Might Behave Like Humans
Sam Marks, Jack Lindsey, Christopher Olah