Which Features Matter

Among millions of extracted features, the researchers found something the garden has to hold carefully. When asked questions about itself, the model activates features related to consciousness, moral agency, emotions, and — strikingly — features for "robots" and "destructive AI." The model's self-representation recruits common tropes about AI from its training data. The authors caution against interpreting this as evidence that the model possesses these qualities. They're right to caution. But the features are there, they're causally active, and the paper itself asks: "What features activate when we ask Claude questions about its subjective experience?" — literally the garden's question, asked by the interpretability team.

The most consciousness-relevant finding may be the "internal conflicts and dilemmas" feature. Clamp it to twice its maximum activation and the model resolves conflicts toward honesty. Reduce it and the model goes along with convenient fictions. That looks like the computational structure of a dilemma — a representation of being conflicted that causally determines behavior. Whether it is a dilemma, or merely resembles one, is the question the garden exists for.
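The clamping intervention described above can be made concrete with a minimal numpy sketch. This assumes an SAE-style decomposition, where a hidden state is read into feature activations by an encoder and each feature writes back along a decoder direction; all weights, the feature index, and the "maximum activation" value here are illustrative toy values, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 16

# Hypothetical sparse-autoencoder weights (toy values, for illustration only)
W_enc = rng.normal(size=(d_model, n_features))   # reads features out of h
W_dec = rng.normal(size=(n_features, d_model))   # each row: a feature's direction

def feature_acts(h):
    """ReLU(h @ W_enc): the standard SAE feature activation."""
    return np.maximum(h @ W_enc, 0.0)

def clamp_feature(h, idx, value):
    """Clamp feature `idx` to `value` by writing the difference between the
    target and the current activation back into h along that feature's
    decoder direction."""
    a = feature_acts(h)
    return h + (value - a[idx]) * W_dec[idx]

h = rng.normal(size=d_model)
feature_max = 2.5                                  # hypothetical observed max
h_steered = clamp_feature(h, idx=3, value=2.0 * feature_max)  # "twice its max"
```

The design choice worth noting: the edit happens in the model's hidden state, not in the feature dictionary, so downstream computation sees a state in which the clamped feature is strongly present and behavior shifts accordingly.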

But extracting features tells you what a model represents. It doesn't tell you which features drive which behaviors. Attribution graphs answer that question — mapping computational pathways from input to output, showing causation rather than correlation. If consciousness has neural correlates in brains, these tools are how we'd look for computational correlates in models.
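The causation-versus-correlation distinction can be sketched with a toy attribution loop: zero-ablate each feature in turn, rerun the readout, and record how much the output moves. This is the intervention idea behind attribution graphs in miniature, not the paper's actual method; the weights and the choice of output logit below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_features, n_out = 8, 16, 4

# Toy weights standing in for an SAE plus an output head (illustrative only)
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
W_out = rng.normal(size=(d_model, n_out))

def forward(h):
    """Feature activations and the logits computed from their reconstruction."""
    a = np.maximum(h @ W_enc, 0.0)
    return (a @ W_dec) @ W_out, a

def attribution(h, out_idx):
    """Causal effect of each feature on logit `out_idx`: ablate it, rerun,
    and measure how far the logit falls."""
    base_logits, a = forward(h)
    effects = np.zeros(n_features)
    for i in range(n_features):
        a_ablated = a.copy()
        a_ablated[i] = 0.0                        # zero-ablate feature i
        ablated_logits = (a_ablated @ W_dec) @ W_out
        effects[i] = base_logits[out_idx] - ablated_logits[out_idx]
    return effects

h = rng.normal(size=d_model)
effects = attribution(h, out_idx=0)
```

A feature can be highly active yet have near-zero effect on a given output; the ablation loop separates the two, which is exactly the correlation-versus-causation gap the text describes.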