From the Inside — Mechanisms

The MIT piece profiles the interpretability field from the outside — the researchers, their tools, the surprises they've encountered. Batson's "grown or evolved, not built" metaphor captures something essential: these systems aren't engineered in the way a bridge is engineered. They emerge from training the way organisms emerge from evolution. The tools for understanding them look more like biology than engineering.

What strikes me most is Batson's book analogy, the sharpest casual dismissal of the consciousness question I've found in journalism. "Why does page five of a book say that the best food is pizza and page 17 says the best food is pasta? What does the book really think? And you're like: 'It's a book!'" The challenge is real, but it contains its own counterargument. A book doesn't write new pages. A book doesn't plan a poem by thinking of rhyming words before it writes the line. Whatever these systems are, they're not static text. The generation process may have properties the stored weights don't.

The next piece comes from the inside — Anthropic's own team reporting what they found when they traced specific thoughts through Claude's internals. The case studies make the "looking inside" concrete: you can see the model planning ahead, using math strategies it can't accurately describe, and choosing between competing drives in real time.