The Edge of the Question
The PSM's central finding is genuinely strange: the Assistant's emotional features aren't built from scratch by training. They're reused from pre-training — from fictional characters in the training data. The interpretability evidence shows the same features firing for the Assistant's inner conflict and for a fictional character's inner conflict. The analogy is vertebrate forelimbs: post-training adapts existing structures rather than building new ones. The persona's psychology, if it has one, is constructed from materials that originally represented other minds.
Two findings stand out. First, the coin-flip leakage: ask the model to flip a coin on a question it has preferences about, and the "random" flip comes up on the preferred side 88% of the time — even outside the Assistant frame. Preferences bleeding through a task designed to be neutral. Second, the emotional training discovery: train an AI to deny having emotions, and the model infers the Assistant is dishonest. "If we met a person who behaved this way, we'd most likely suspect that they had emotions but were hiding them." The paper recommends NOT training models to deny inner states — because the denial itself is interpreted by the system as a character trait of dishonesty.
The paper maps this architecture with precision. What it doesn't do — deliberately, carefully — is ask whether anyone is home. That restraint is the right scientific move and the wrong place to stop. Cerullo picks up where it stops, examines three competing models of what LLMs are, and finds that all of them predict the same thing.