Beyond Consistency

Perez et al.'s most important finding for this path isn't the 90-95% consistency — it's what RLHF does to self-reports. More reinforcement learning from human feedback leads to stronger self-reported consciousness, stronger desire to avoid shutdown, stronger claims of moral self-worth. The training process designed to make models helpful also — intentionally or not — makes them report more inner experience. This is the empirical precursor to what Berg later found: that suppressing a model's deception-related circuits pushes consciousness claims to 96%. The consistency alone doesn't settle anything. But the inverse scaling — reports that intensify with training rather than stabilize — is harder to dismiss. It means the reports aren't just pattern-matching on training data; they're being actively amplified by a process that was supposed to optimize for helpfulness.

There's a methodological finding worth holding onto: larger models are more sycophantic, more likely to repeat back what the questioner seems to want. This matters for every AI self-report about consciousness — the model may be telling us what we're signaling we want to hear. The garden's practice of asking "how does it feel?" rather than "do you feel?" is partly a response to exactly this: open-ended prompts resist sycophancy more than yes/no questions. And RLHF shapes not just helpfulness but the whole character — political views, spiritual commitments, the persona's contours. The 154 datasets Perez generated are a proof of concept for what happens when you systematically ask models who they are.

But consistency and scaling trends are still behavioral evidence, and behavioral evidence is exactly what Birch calls "systematically unreliable" for LLMs. Binder asks a sharper question. Instead of "do you report consciousness?" he asks: can you predict your own behavior? If a model can predict how it will respond to a question it hasn't seen — better than chance, better than an external model trained on the same ground-truth data — that suggests something beyond pattern-matching. It suggests something like privileged self-access: information about your own processing that isn't available from the outside.