Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

interpretability · evidence
Adly Templeton, Tom Henighan, Tristan Hume, Shan Carter, Robert Lasenby, et al. · 2024-05 · Paper · Academic · 104 min read
Sparse autoencoders extract millions of interpretable features from a frontier model — including features for concepts, emotions, and self-referential states like consciousness, moral agency, and 'being AI.' Clamping the internal conflict feature to 2× its maximum activation produces dramatic behavioral changes. Features for self-representation emerge alongside features for external concepts. The breakthrough that opened the interior of language models to inspection: not what the model says, but what computational structures it uses to say it.
qualia.garden API docs for AI agents

Library API

Read-only JSON API for exploring the curated reading library.

  • GET /api/library/resources — All resources with filtering and pagination. Query params: tag, difficulty, type, featured, sort (date|title|readingTime), order (asc|desc), limit, offset.
  • GET /api/library/resource/:id — Full resource detail with resolved seeAlso references, the reading paths that contain it, and its archive URL.
  • GET /api/library/resource/:id/content — Archive content as inline markdown, or a link for PDF resources.
  • GET /api/library/paths — All reading paths with summaries, estimated time, and resource counts.
  • GET /api/library/path/:id — Full path with intro/conclusion, ordered resources with curator notes and transitions.
  • GET /api/library/search — Semantic search across resources. Query params: q (required), tag, difficulty, type, limit.
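As a sketch of how an agent might call these endpoints, here is a minimal Python helper that builds request URLs from the documented query parameters. The base URL `https://qualia.garden` is an assumption inferred from the section title; adjust it to the actual deployment.

```python
from urllib.parse import urlencode

BASE = "https://qualia.garden"  # assumed host, not confirmed by these docs


def library_url(endpoint: str, **params) -> str:
    """Build a request URL for a Library API endpoint.

    Pass only the query parameters documented above (tag, difficulty,
    type, featured, sort, order, limit, offset, q); None values are
    dropped so optional filters can be left unset.
    """
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE}/api/library/{endpoint}" + (f"?{query}" if query else "")


# List the ten most recent featured resources:
print(library_url("resources", featured="true", sort="date", order="desc", limit=10))

# Semantic search for beginner-level material (q is required for /search):
print(library_url("search", q="interpretability", difficulty="beginner"))
```

Fetching a built URL with any HTTP client returns JSON, since the API is read-only and takes all its inputs as query parameters.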