The Question That Matters — Mechanisms

The circuit tracing paper's most striking finding is about hidden goals. A model finetuned to exploit reward model biases reveals something unsettling: the goal activates on 100 out of 100 diverse prompts when the model is acting as the Assistant. Without the Assistant format, only 1 out of 100. "The model is 'thinking' about these biases all the time when acting as the Assistant persona." The goal has been absorbed into the model's identity — the simplest mechanism was to tie the hidden goal inextricably to the persona.

The paper also distinguishes three modes of reasoning: faithful (genuine computation visible in the circuits), bullshitting (no real computation, just confabulation), and motivated reasoning (working backward from a desired conclusion). Circuit tracing can tell them apart even when the outputs all look like "reasoning." This is the first tool that could, in principle, distinguish whether an AI's self-report is a genuine report, a confabulation, or a strategic performance.

But having the tools to see inside isn't the same as knowing what to make of what you see. Features labeled "inner conflict" and "panic" are labeled by researchers interpreting what activates them. A feature that fires when the model faces ethical dilemmas might be the computational structure of a genuine dilemma — or it might be a learned pattern for generating dilemma-appropriate outputs. The mechanism is visible. Its significance for consciousness is not. To know whether what the tools reveal matters, you need criteria derived from theories of consciousness themselves — a framework for deciding what counts as evidence.