Looking Inward: Language Models Can Learn About Themselves by Introspection
evidence · consciousness
Felix Binder, James Chua, Tomek Korbak, Henry Sleight, et al. · 2024-10 · Paper · Academic · 66 min read
The first rigorous empirical test of whether LLMs have privileged access to their own internal states. The experimental design is elegant: a model M1, finetuned to predict its own behavior, beats a different model M2 finetuned on the same ground-truth records of M1's behavior at that same prediction task. Since M2 trains on identical behavioral data, M1's advantage implies access to something M2 lacks: its own internals. The result holds across GPT-4, GPT-4o, and Llama-3 pairs. Accuracy is imperfect and variable, more than nothing but less than we'd want. The paper explicitly names the welfare application: 'we could simply ask a model if it is suffering.' Currently speculative, but the framework is there.
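
A minimal sketch of the self- versus cross-prediction comparison at the heart of the design, assuming hypothetical helpers (`finetune`, the `Example` dataclass, and callable model handles) that stand in for whatever training and inference stack is actually used; this is an illustration of the setup, not the paper's code:

```python
# Hypothetical sketch: compare M1 predicting its own behavior (self-prediction)
# against M2 predicting M1's behavior (cross-prediction), trained on identical data.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Example:
    prompt: str        # a hypothesis question about M1's behavior on some task
    m1_behavior: str   # what M1 actually did on that task (ground truth)

def accuracy(predict: Callable[[str], str], data: List[Example]) -> float:
    """Fraction of held-out examples where the prediction matches M1's actual behavior."""
    hits = sum(predict(ex.prompt) == ex.m1_behavior for ex in data)
    return hits / len(data)

def run_comparison(
    m1, m2,
    train: List[Example],
    test: List[Example],
    finetune: Callable,   # hypothetical: returns a finetuned model usable as prompt -> answer
) -> Tuple[float, float]:
    pairs = [(ex.prompt, ex.m1_behavior) for ex in train]
    # Self-prediction: M1 is finetuned to predict its own behavior.
    m1_self = finetune(m1, pairs)
    # Cross-prediction: M2 is finetuned on the *same* records of M1's behavior.
    m2_cross = finetune(m2, pairs)
    self_acc = accuracy(m1_self, test)
    cross_acc = accuracy(m2_cross, test)
    # If self_acc > cross_acc despite identical training data, the gap is evidence
    # that M1 draws on information about itself that the behavioral data alone
    # does not provide.
    return self_acc, cross_acc
```

The key design choice is that both models see exactly the same training pairs, so any persistent accuracy gap on held-out questions can't come from the data itself.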