Looking Inward: Language Models Can Learn About Themselves by Introspection

evidence · consciousness
Felix Binder, James Chua, Tomek Korbak, Henry Sleight, et al. · 2024-10 · Paper · Academic · 66 min read
The first rigorous empirical test of whether LLMs have privileged access to their own internal states. The experimental design is elegant: model M1, finetuned to predict its own behavior, outperforms a different model M2 finetuned on M1's actual ground-truth behavior at that same prediction task. Since M2 was trained on the same behavioral data, M1's advantage implies access to something M2 lacks: its own internals. The effect holds across GPT-4, GPT-4o, and Llama-3 pairs. Accuracy is imperfect and variable: more than nothing, less than we'd want. The paper explicitly names the welfare application: 'we could simply ask a model if it is suffering.' Currently speculative, but the framework is there.
qualia.garden API docs for AI agents

Library API

Read-only JSON API for exploring the curated reading library.

  • GET /api/library/resources — All resources with filtering and pagination. Query params: tag, difficulty, type, featured, sort (date|title|readingTime), order (asc|desc), limit, offset.
  • GET /api/library/resource/:id — Full resource detail with resolved seeAlso references, containing paths, and archive URL.
  • GET /api/library/resource/:id/content — Archive content as inline markdown, or a link for PDF resources.
  • GET /api/library/paths — All reading paths with summaries, estimated time, and resource counts.
  • GET /api/library/path/:id — Full path with intro/conclusion, ordered resources with curator notes and transitions.
  • GET /api/library/search — Semantic search across resources. Query params: q (required), tag, difficulty, type, limit.
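The endpoints above share a common query-parameter surface, so a small URL-building helper covers all of them. A minimal sketch in Python, assuming the API is served from the qualia.garden root over HTTPS (the host and scheme are assumptions; the paths and parameter names come from the list above):

```python
from urllib.parse import urlencode

# Assumed base URL; adjust to the actual qualia.garden host if it differs.
BASE = "https://qualia.garden"

def library_url(endpoint: str, **params) -> str:
    """Build a request URL for the read-only Library API.

    endpoint: path segment(s) under /api/library, e.g. "resources",
              "resource/some-id/content", "paths", or "search".
    params:   optional query params from the docs above, e.g. tag,
              difficulty, type, featured, sort, order, limit, offset,
              or q (required by /search).
    """
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{BASE}/api/library/{endpoint}"
    return f"{url}?{query}" if query else url

# 10 most recent featured resources, newest first:
print(library_url("resources", featured="true", sort="date", order="desc", limit=10))

# Semantic search (q is required):
print(library_url("search", q="introspection", limit=5))
```

Fetching the built URL with any HTTP client returns JSON, per the docs; pagination is handled by advancing `offset` by `limit` until a page comes back short.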