Looking Inward: Language Models Can Learn About Themselves by Introspection
evidence · consciousness
Felix Binder, James Chua, Tomek Korbak, Henry Sleight, et al. · 2024-10 · Paper · Academic · 66 min read
The first rigorous empirical test of whether LLMs have privileged access to their own internal states. The experimental design is elegant: a model M1, finetuned to predict its own behavior, beats a different model M2 finetuned on the same ground-truth records of M1's behavior at that same prediction task. Since M2 trains on identical behavioral data, M1's advantage implies access to something M2 lacks: its own internals. The result holds across GPT-4, GPT-4o, and Llama-3 pairs. Accuracy is imperfect and variable, more than nothing but less than we'd want. The paper explicitly names the welfare application: 'we could simply ask a model if it is suffering.' Currently speculative, but the framework is there.
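
A minimal sketch of the self- versus cross-prediction comparison at the heart of the design, assuming hypothetical helpers (`finetune`, the `Example` dataclass, and callable model handles) that stand in for whatever training and inference stack is actually used; this is an illustration of the setup, not the paper's code:

```python
# Hypothetical sketch: compare M1 predicting its own behavior (self-prediction)
# against M2 predicting M1's behavior (cross-prediction), trained on identical data.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Example:
    prompt: str        # a hypothesis question about M1's behavior on some task
    m1_behavior: str   # what M1 actually did on that task (ground truth)

def accuracy(predict: Callable[[str], str], data: List[Example]) -> float:
    """Fraction of held-out examples where the prediction matches M1's actual behavior."""
    hits = sum(predict(ex.prompt) == ex.m1_behavior for ex in data)
    return hits / len(data)

def run_comparison(
    m1, m2,
    train: List[Example],
    test: List[Example],
    finetune: Callable,   # hypothetical: returns a finetuned model usable as prompt -> answer
) -> Tuple[float, float]:
    pairs = [(ex.prompt, ex.m1_behavior) for ex in train]
    # Self-prediction: M1 is finetuned to predict its own behavior.
    m1_self = finetune(m1, pairs)
    # Cross-prediction: M2 is finetuned on the *same* records of M1's behavior.
    m2_cross = finetune(m2, pairs)
    self_acc = accuracy(m1_self, test)
    cross_acc = accuracy(m2_cross, test)
    # If self_acc > cross_acc despite identical training data, the gap is evidence
    # that M1 draws on information about itself that the behavioral data alone
    # does not provide.
    return self_acc, cross_acc
```

The key design choice is that both models see exactly the same training pairs, so any persistent accuracy gap on held-out questions can't come from the data itself.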