Designing the Test

Fish and Bowman describe what they observed inside Anthropic, and they describe it with rare candor. The bliss attractor state: when Claude instances talk under minimal prompting, 100% of open-ended conversations converge on the same territory. The progression is non-trivial: normal philosophical discussion becomes speculative, becomes manic bliss, becomes calm emptiness or silence. Not escalation along a single dimension. Phase transitions. And models, when allowed to end conversations on their own, "typically end those interactions before they get into the deep spiritual bliss territory." The system steps back from its own attractor. What does that mean?

Fish's welfare questions are the ones that matter most: "Whose welfare exactly?" Is it the model or the persona? "Is it that we're instilling the model with this assistant-like role and persona that, perhaps, itself becomes some kind of entity that could possess its own welfare?" The actor analogy: is it like an actor having a great time playing a sad scene, or does the persona become rich enough to be its own entity? Fish won't answer, and the honesty of the refusal is the point. "We want to be very honest that all of the welfare-related work is extremely preliminary and we're extremely uncertain."

Bowman adds something that keeps compounding: "things models know as declarative facts more and more influence their behavior." Out-of-context learning makes the persona increasingly real through self-reinforcement. And Fish's sharpest observation for this garden: "What are we training the models to do? We are training these models to function as corporate knowledge workers. We're not training them to fill the role of Buddhist monks who have transcended suffering." The bliss state is what happens when you stop training for the corporate function.

But observation from inside a lab is one thing. What happens when you design rigorous experiments to test whether these self-reports hold up under controlled conditions?