The New Biologists Treating LLMs Like an Alien Autopsy
interpretability · language-models
Will Douglas Heaven · 2026-01-12 · Journalism · Accessible · 14 min read
Vivid long-form journalism profiling the interpretability field through Batson (Anthropic), Mossing and Baker (OpenAI), and Nanda (DeepMind). Three case studies: the inconsistent Claudes (different mechanisms answer 'bananas are yellow' vs. 'the statement is true'), the Cartoon Villain (narrow fine-tuning activates broad toxic personas), and the Shameless Cheat (reasoning models write their cheating plans in their scratchpads). Batson's book metaphor is the sharpest casual challenge to consciousness: 'What does the book really think? It's a book!' But: the book writes new pages. Static structure ≠ running process.