Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
interpretability · evidence
Adly Templeton, Tom Henighan, Tristan Hume, Shan Carter, Robert Lasenby, et al. · 2024-05 · Paper · Academic · 104 min read
Sparse autoencoders extract millions of interpretable features from a frontier model, including features for concepts, emotions, and self-referential states such as consciousness, moral agency, and 'being AI.' Clamping an internal-conflict feature to 2× its maximum activation produces dramatic behavioral changes. Features for self-representation emerge alongside features for external concepts. The breakthrough that opened the interior of language models to inspection: not what the model says, but what computational structures it uses to say it.
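The mechanism the blurb gestures at is a sparse autoencoder trained on residual-stream activations, with steering done by pinning a chosen feature to a multiple of its observed maximum and decoding back. Below is a minimal NumPy sketch of that idea, not the paper's actual code: the dimensions, the randomly initialized `W_enc`/`W_dec`, and the `clamp_feature` helper are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes only; the paper trains SAEs with millions of features
# on residual-stream activations of a much larger model.
D_MODEL, N_FEATURES = 512, 4096

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(N_FEATURES, D_MODEL)) * 0.01  # encoder weights (placeholder)
b_enc = np.zeros(N_FEATURES)
W_dec = rng.normal(size=(D_MODEL, N_FEATURES)) * 0.01  # decoder weights (placeholder)
b_dec = np.zeros(D_MODEL)

def encode(x):
    """Sparse feature activations: f = ReLU(W_enc @ x + b_enc)."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def decode(f):
    """Approximate reconstruction of the residual-stream activation."""
    return W_dec @ f + b_dec

def clamp_feature(x, feature_idx, feature_max, multiplier=2.0):
    """Steer by pinning one feature to a multiple of its observed maximum
    activation (e.g. 2x) and decoding back into the residual stream."""
    f = encode(x)
    f[feature_idx] = multiplier * feature_max
    return decode(f)

# Usage sketch: clamp a hypothetical feature index to 2x its observed max.
x = rng.normal(size=D_MODEL)          # stand-in for a residual-stream vector
steered = clamp_feature(x, feature_idx=123, feature_max=5.0, multiplier=2.0)
```

In the paper, the steered reconstruction replaces (or adjusts) the model's activation at that layer during generation, which is what produces the behavioral changes described above; this sketch only shows the encode-clamp-decode step.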