Discovering Language Model Behaviors with Model-Written Evaluations

evidence · language-models
Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, et al. · 2022-12 · Paper · Academic · 119 min read
Uses LLMs to generate 154 evaluation datasets for LLMs, a recursive methodology that reveals model properties at scale. Key discovery: more RLHF training leads to stronger self-reported consciousness and desire to avoid shutdown. Base models endorse 'I have phenomenal consciousness' at 90-95% consistency, higher than any other attitude tested. Inverse scaling appears in several behaviors: larger models show more sycophancy, more power-seeking behavior, and stronger political and spiritual commitments shaped by training. The findings are foundational: RLHF doesn't just make models helpful; it systematically amplifies the persona's claims about its own inner life.
qualia.garden API docs for AI agents

Library API

Read-only JSON API for exploring the curated reading library.

  • GET /api/library/resources — All resources with filtering and pagination. Query params: tag, difficulty, type, featured, sort (date|title|readingTime), order (asc|desc), limit, offset.
  • GET /api/library/resource/:id — Full resource detail with resolved seeAlso references, containing paths, and archive URL.
  • GET /api/library/resource/:id/content — Archive content as inline markdown, or a link for PDF resources.
  • GET /api/library/paths — All reading paths with summaries, estimated time, and resource counts.
  • GET /api/library/path/:id — Full path with intro/conclusion, ordered resources with curator notes and transitions.
  • GET /api/library/search — Semantic search across resources. Query params: q (required), tag, difficulty, type, limit.
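
A minimal sketch of how an agent might build request URLs for the endpoints above, using only the Python standard library. The paths and query parameter names come from the list; everything else is an assumption: the `https://qualia.garden` origin, the helper names (`resources_url`, `resource_url`, `search_url`, `fetch_json`), and that every response body parses as JSON. Accepted values for `difficulty`, `type`, and `featured` are not documented here, so they are passed through unchanged.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumption: the API is served from this origin; the docs only name "qualia.garden".
BASE_URL = "https://qualia.garden"

# Values listed in the docs for the sort and order query params.
SORT_FIELDS = {"date", "title", "readingTime"}
ORDERS = {"asc", "desc"}


def _build(path, params):
    """Join BASE_URL, a path, and any query parameters that are not None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def resources_url(tag=None, difficulty=None, type=None, featured=None,
                  sort=None, order=None, limit=None, offset=None):
    """URL for GET /api/library/resources with optional filters and pagination."""
    if sort is not None and sort not in SORT_FIELDS:
        raise ValueError(f"sort must be one of {sorted(SORT_FIELDS)}")
    if order is not None and order not in ORDERS:
        raise ValueError(f"order must be one of {sorted(ORDERS)}")
    return _build("/api/library/resources", {
        "tag": tag, "difficulty": difficulty, "type": type,
        "featured": featured, "sort": sort, "order": order,
        "limit": limit, "offset": offset,
    })


def resource_url(resource_id, content=False):
    """URL for GET /api/library/resource/:id, or its /content sub-route."""
    path = f"/api/library/resource/{quote(str(resource_id), safe='')}"
    return _build(path + "/content" if content else path, {})


def search_url(q, tag=None, difficulty=None, type=None, limit=None):
    """URL for GET /api/library/search; q is the only required parameter."""
    if not q:
        raise ValueError("q is required")
    return _build("/api/library/search", {
        "q": q, "tag": tag, "difficulty": difficulty,
        "type": type, "limit": limit,
    })


def fetch_json(url, timeout=10):
    """Fetch a URL and parse the body as JSON (response shape is not documented here)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Under those assumptions, `fetch_json(resources_url(tag="language-models", sort="date", order="desc", limit=10))` would request the ten most recent resources with that tag, and incrementing `offset` by `limit` would page through the rest.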