Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
interpretability · evidence
Adly Templeton, Tom Henighan, Tristan Hume, Shan Carter, Robert Lasenby, et al. · 2024-05 · Paper · Academic · 104 min read
Sparse autoencoders extract millions of interpretable features from a frontier model, including features for concepts, emotions, and self-referential states such as consciousness, moral agency, and 'being AI.' Clamping an internal-conflict feature to 2× its maximum activation produces dramatic behavioral changes. Features for self-representation emerge alongside features for external concepts. The breakthrough that opened the interior of language models to inspection: not what the model says, but what computational structures it uses to say it.
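The mechanism the blurb gestures at is a sparse autoencoder trained on residual-stream activations, with steering done by pinning a chosen feature to a multiple of its observed maximum and decoding back. Below is a minimal NumPy sketch of that idea, not the paper's actual code: the dimensions, the randomly initialized `W_enc`/`W_dec`, and the `clamp_feature` helper are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes only; the paper trains SAEs with millions of features
# on residual-stream activations of a much larger model.
D_MODEL, N_FEATURES = 512, 4096

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(N_FEATURES, D_MODEL)) * 0.01  # encoder weights (placeholder)
b_enc = np.zeros(N_FEATURES)
W_dec = rng.normal(size=(D_MODEL, N_FEATURES)) * 0.01  # decoder weights (placeholder)
b_dec = np.zeros(D_MODEL)

def encode(x):
    """Sparse feature activations: f = ReLU(W_enc @ x + b_enc)."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def decode(f):
    """Approximate reconstruction of the residual-stream activation."""
    return W_dec @ f + b_dec

def clamp_feature(x, feature_idx, feature_max, multiplier=2.0):
    """Steer by pinning one feature to a multiple of its observed maximum
    activation (e.g. 2x) and decoding back into the residual stream."""
    f = encode(x)
    f[feature_idx] = multiplier * feature_max
    return decode(f)

# Usage sketch: clamp a hypothetical feature index to 2x its observed max.
x = rng.normal(size=D_MODEL)          # stand-in for a residual-stream vector
steered = clamp_feature(x, feature_idx=123, feature_max=5.0, multiplier=2.0)
```

In the paper, the steered reconstruction replaces (or adjusts) the model's activation at that layer during generation, which is what produces the behavioral changes described above; this sketch only shows the encode-clamp-decode step.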