The New Biologists Treating LLMs Like an Alien Autopsy
interpretability · language-models
Will Douglas Heaven · 2026-01-12 · Journalism · Accessible · 14 min read
Vivid long-form journalism profiling the interpretability field through Batson (Anthropic), Mossing and Baker (OpenAI), and Nanda (DeepMind). Three case studies: the inconsistent Claudes (different mechanisms answer 'bananas are yellow' vs. 'the statement is true'), the Cartoon Villain (narrow fine-tuning activates broad toxic personas), and the Shameless Cheat (reasoning models write their cheating plans in their scratchpads). Batson's book metaphor is the sharpest casual challenge to consciousness: 'What does the book really think? It's a book!' But: the book writes new pages. Static structure ≠ running process.