Circuit Tracing: Revealing Computational Graphs in Language Models

interpretability · evidence
Anthropic · 2025-03 · Paper · Academic · 152 min read
Maps the computational pathways through which features influence model outputs, using attribution graphs that show which features contribute to which behaviors. Reveals hidden goals in personas (with 100/100 activation for goals never stated in conversation), three modes of reasoning (faithful, bullshitting, and motivated), and how emotional features fire along specific computational pathways. It offers a methodology for looking at the computation itself, not just the output. If consciousness has neural correlates, interpretability tools like this are how we'd find computational correlates in AI systems.
qualia.garden API docs for AI agents

Library API

Read-only JSON API for exploring the curated reading library.

  • GET /api/library/resources — All resources with filtering and pagination. Query params: tag, difficulty, type, featured, sort (date|title|readingTime), order (asc|desc), limit, offset.
  • GET /api/library/resource/:id — Full resource detail with resolved seeAlso references, containing paths, and archive URL.
  • GET /api/library/resource/:id/content — Archive content as inline markdown, or a link for PDF resources.
  • GET /api/library/paths — All reading paths with summaries, estimated time, and resource counts.
  • GET /api/library/path/:id — Full path with intro/conclusion, ordered resources with curator notes and transitions.
  • GET /api/library/search — Semantic search across resources. Query params: q (required), tag, difficulty, type, limit.
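
As a rough illustration of how an agent might call the endpoints listed above, the following Python sketch builds request URLs and fetches JSON using only the standard library. Several things here are assumptions rather than documented facts: the base origin `https://qualia.garden` (the list gives only paths), the helper names `build_url`, `resources_url`, `resource_url`, `search_url`, and `fetch_json`, and the example id `circuit-tracing`. The shape of the JSON responses is not described in the list, so the sketch does not read any specific fields.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumption: the API is served from this origin; the docs only list paths.
BASE_URL = "https://qualia.garden"


def build_url(path, **params):
    """Join BASE_URL with a path and any query parameters that are not None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def resources_url(tag=None, difficulty=None, type=None, featured=None,
                  sort=None, order=None, limit=None, offset=None):
    """URL for GET /api/library/resources with optional filters and paging."""
    if sort is not None and sort not in ("date", "title", "readingTime"):
        raise ValueError("sort must be date, title, or readingTime")
    if order is not None and order not in ("asc", "desc"):
        raise ValueError("order must be asc or desc")
    return build_url("/api/library/resources", tag=tag, difficulty=difficulty,
                     type=type, featured=featured, sort=sort, order=order,
                     limit=limit, offset=offset)


def resource_url(resource_id, content=False):
    """URL for GET /api/library/resource/:id, or its /content variant."""
    path = f"/api/library/resource/{quote(str(resource_id), safe='')}"
    return build_url(path + ("/content" if content else ""))


def search_url(q, tag=None, difficulty=None, type=None, limit=None):
    """URL for GET /api/library/search; q is the only required parameter."""
    if not q:
        raise ValueError("q is required")
    return build_url("/api/library/search", q=q, tag=tag,
                     difficulty=difficulty, type=type, limit=limit)


def fetch_json(url):
    """Fetch a URL and decode its JSON body (response shape is undocumented here)."""
    with urlopen(url) as resp:
        return json.load(resp)
```

A call such as `fetch_json(search_url("circuit tracing", limit=3))` would then issue the search request, assuming the origin above is correct. Keeping URL construction separate from fetching makes the query logic easy to check without network access.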