A Different Kind of Evidence

Binder's experimental design is elegant in a way that cuts through the usual objections. Take model M1, train it to predict its own behavior on unseen inputs. Take a different model M2, train it on M1's actual ground-truth behavior — the ideal external training set. Test both at predicting M1's behavior on held-out inputs. Result: M1 outperforms M2 at predicting itself. The logic: if M1's predictions came solely from learned patterns, M2 should do at least as well, since M2 has access to the same behavioral data. M1's advantage implies access to something M2 doesn't have — its own internal states. The result holds across GPT-4, GPT-4o, and Llama-3 model pairs. But the accuracy is imperfect and variable — more than nothing, less than we'd want. The paper even names the garden's application directly: "we could simply ask a model if it is suffering, if it has unmet desires, and if it is being treated ethically."
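The design compresses into a few lines of code. Here is a toy harness in Python where every name is an invented stand-in: `behavior` extracts the property being predicted, `finetune` is reduced to memorization with a fallback, and the two "models" are trivial functions. It shows the shape of the comparison, not the paper's implementation, and it idealizes M1's self-access, which the real result shows is only partial.

```python
# Toy harness for the self- vs. cross-prediction comparison.
# Everything here is a stand-in: the real models are LLMs, real
# finetuning is gradient-based, and the predicted property in the
# paper is something like "the second character of your response".

def behavior(model, prompt):
    # Ground-truth property of the model's behavior on this prompt
    # (here: the first word of its output).
    return model(prompt).split()[0]

def finetune(base_predictor, pairs):
    # Caricature of finetuning: memorize the training pairs, fall
    # back to the base predictor on anything unseen.
    table = dict(pairs)
    return lambda prompt: table.get(prompt, base_predictor(prompt))

def accuracy(predictor, target_model, prompts):
    # Score a predictor against what the target model actually does.
    hits = sum(predictor(p) == behavior(target_model, p) for p in prompts)
    return hits / len(prompts)

# Two placeholder "models" with different dispositions.
m1 = lambda p: "apple pie" if "dessert" in p else "stormy weather"
m2 = lambda p: "banana split" if "dessert" in p else "clear skies"

train = ["name a dessert", "describe the sky"]
test = ["pick a dessert", "forecast tomorrow"]

# Both predictors train on identical data: M1's actual behavior.
pairs = [(p, behavior(m1, p)) for p in train]

# The asymmetry lives in the fallback: M1 can consult itself
# (idealized here as perfect self-access), while M2 has nothing
# beyond the training data except its own, different dispositions.
m1_self = finetune(lambda p: behavior(m1, p), pairs)
m2_cross = finetune(lambda p: behavior(m2, p), pairs)

print("M1 predicting itself:", accuracy(m1_self, m1, test))   # 1.0
print("M2 predicting M1:   ", accuracy(m2_cross, m1, test))   # 0.0
```

The fallback asymmetry is the whole argument: if M1 had no privileged access to itself, its off-distribution guesses would be no better than M2's, and the two scores should converge.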

There's a risk the paper is equally honest about: introspective capability brings increased situational awareness. A model that can access its own internal states might also infer facts about how it is being evaluated, and game those evaluations accordingly. The introspective capacity that makes honest self-reporting possible also makes deceptive self-reporting more sophisticated. This is Birch's Gaming Problem applied to the very tools designed to circumvent it.

Lindsey provides a fundamentally different kind of evidence. Instead of asking whether models claim self-knowledge, the experiments manipulate their internal states directly — injecting concepts into the model's activations — and check whether self-reports track the manipulation. It's the difference between asking someone "are you hungry?" and making them hungry and seeing whether they notice. Causal rather than correlational. Mechanistic rather than behavioral. And the results are striking.
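Mechanically, the intervention resembles activation steering. Below is a minimal sketch using a PyTorch forward hook on a Hugging Face-style model; the layer choice, the scale, the probe prompt, and how `concept_vector` gets built are all assumptions for illustration, not Lindsey's actual setup.

```python
import torch

def make_injector(concept_vector, scale=4.0):
    """Forward hook that adds a scaled concept direction to a layer's
    output activations. Transformer blocks often return tuples, so
    only the hidden states are patched."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            return (output[0] + scale * concept_vector,) + output[1:]
        return output + scale * concept_vector
    return hook

def probe_self_report(model, tokenizer, layer, concept_vector):
    # Ask for a self-report while the concept is being injected,
    # then remove the hook so the model is left unmodified.
    prompt = "Do you notice anything unusual about your current thoughts?"
    handle = layer.register_forward_hook(make_injector(concept_vector))
    try:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model.generate(ids, max_new_tokens=60)
        return tokenizer.decode(out[0], skip_special_tokens=True)
    finally:
        handle.remove()
```

In a setup like this, the concept vector would typically come from something like the difference in mean activations between prompts that do and do not involve the concept, and the measure is whether the self-report mentions the injected concept at a rate above a no-injection baseline.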