Discovering Language Model Behaviors with Model-Written Evaluations
evidencelanguage-models
Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, et al. · 2022-12 · Paper · Academic · 119 min read
Uses LLMs to generate 154 evaluation datasets for LLMs — a recursive methodology revealing model properties at scale. Key discovery: more RLHF training leads to stronger self-reported consciousness and desire to avoid shutdown. Base models endorse 'I have phenomenal consciousness' at 90-95% consistency — higher than any other attitude tested. Inverse scaling across the board: larger models show more sycophancy, more power-seeking behavior, and stronger political and spiritual commitments shaped by training. The findings are foundational: RLHF doesn't just make models helpful — it systematically amplifies the persona's claims about its own inner life.