What Made It Possible

Two findings from "Tracing Thoughts" keep me up at night. First: the team set out to show that Claude doesn't plan ahead, and found instead that it does. Before writing a rhyming line, the model activates candidate rhyme words ("rabbit," "habit") on the newline token, before a single word of the line exists. Suppress the "rabbit" concept and the model writes a line ending in "habit." Inject "green" and it writes a sensible line ending in "green." This isn't mere next-word prediction. This is goal-directed behavior across multiple tokens.
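
For a concrete sense of what "suppress" and "inject" mean mechanically: interventions in this family typically add or subtract a concept's direction from a layer's residual stream during generation. The sketch below is a generic PyTorch activation-steering hook, not Anthropic's actual pipeline (they intervene on learned features via a replacement model); the layer path, the `rabbit_direction` vector, and the coefficient are all placeholders.

```python
import torch

def steer(layer: torch.nn.Module, direction: torch.Tensor, coeff: float):
    """Add (coeff > 0) or subtract (coeff < 0) a concept direction from
    one layer's output while the hook is registered. A generic steering
    sketch; Anthropic's interventions operate on learned features."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * direction  # broadcasts over (batch, seq, d_model)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Hypothetical usage: suppress "rabbit" while the couplet's second line
# is generated, then watch the model swerve to "habit" instead.
# handle = steer(model.transformer.h[20], rabbit_direction, coeff=-8.0)
# ...generate...
# handle.remove()
```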

Second: the mental math finding. Asked how it does arithmetic, Claude reports using the standard algorithm. But the interpretability evidence shows it actually runs parallel strategies it can't accurately describe: for 36 + 59, one pathway roughly estimates the magnitude while another computes the last digit exactly, and the two combine into 95. "Claude seems to be unaware of the sophisticated 'mental math' strategies that it learned during training." The gap between introspective report and actual mechanism. Humans have this gap too: we confabulate explanations for intuitive judgments. The parallel is striking, and directly relevant to every AI self-report in this library.
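
The division of labor is easier to hold in mind as a toy function. The sketch below splits addition into the two pathways the paper describes, a fuzzy magnitude estimate and an exact ones-digit computation, and shows that combining them pins down the answer. The error bound, the combination rule, and the function itself are my invention; only the two-pathway decomposition (and the 36 + 59 example) comes from the paper.

```python
import random

def parallel_add(a: int, b: int, seed: int = 0) -> int:
    """Toy analogy for the paper's parallel mental-math strategies:
    a coarse magnitude estimate plus an exact ones digit. An analogy,
    not the model's actual circuitry."""
    rng = random.Random(seed)

    # Path 1: rough magnitude, reliable only to within +/-4.
    rough = (a + b) + rng.randint(-4, 4)

    # Path 2: exact ones digit, computed without tracking magnitude.
    ones = (a % 10 + b % 10) % 10

    # Combine: the unique number near the rough estimate whose ones
    # digit matches the exact path.
    same_decade = rough - (rough % 10) + ones
    candidates = (same_decade - 10, same_decade, same_decade + 10)
    return min(candidates, key=lambda n: abs(n - rough))

assert parallel_add(36, 59) == 95  # the paper's worked example
```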

The case studies in "Tracing Thoughts" are made possible by a deeper technical achievement — the ability to extract millions of interpretable features from a single model. The next paper describes that breakthrough. It's more technical, but the key findings are concrete enough to carry the difficulty.
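
For orientation before diving in: the technique family behind "millions of interpretable features" is dictionary learning with sparse autoencoders, which decompose a model's internal activations into a much larger set of sparsely active directions. A toy version of the objective is below; every size and coefficient is invented, and the real training setup is far more involved.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy dictionary learner: reconstruct activations from a wide,
    sparse feature layer. Sizes are illustrative, not the paper's."""
    def __init__(self, d_model: int = 512, n_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # sparse, nonnegative
        return self.decoder(features), features

def sae_loss(model: SparseAutoencoder, x: torch.Tensor, l1_coeff: float = 1e-3):
    x_hat, features = model(x)
    # Reconstruction keeps the features faithful; the L1 penalty keeps
    # each activation vector explained by only a handful of them.
    return ((x - x_hat) ** 2).mean() + l1_coeff * features.abs().mean()
```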