One Specific Agent

Andreas's evidence is concrete. In one experiment, he trains a language model on documents from three types of authors with different beliefs. A linear probe recovers author identity from the model's internal representations with 98% accuracy: the model learns to represent author identity even though it was never given author labels during training. Fix the initial hidden state to one author type, and 89% of the outputs match that type. The model doesn't just learn to produce text. It learns to represent who is speaking, and it conditions its output on that representation.
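The probing half of that result is mechanically simple, and it's worth seeing how little machinery it takes. The sketch below is not Andreas's code; it assumes hidden-state vectors have already been extracted from a trained model, and it substitutes synthetic clusters for those activations so the script runs on its own. The probe itself is just a linear classifier fit on frozen representations.

```python
# Minimal sketch of a linear probe for author identity (illustrative only).
# Assumes hidden states were already collected from a trained language model;
# synthetic clusters stand in for them here so the example is self-contained.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_per_author, hidden_dim, n_authors = 500, 64, 3

# Placeholder "hidden states": one Gaussian cluster per author type.
# In the real experiment these would be activations recorded while the
# model processes documents written by each type of author.
centers = rng.normal(size=(n_authors, hidden_dim))
X = np.concatenate([
    centers[a] + 0.5 * rng.normal(size=(n_per_author, hidden_dim))
    for a in range(n_authors)
])
y = np.repeat(np.arange(n_authors), n_per_author)

# The probe: a linear classifier trained to recover author identity
# from representations it had no part in shaping.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(f"probe accuracy: {accuracy_score(y_test, probe.predict(X_test)):.2%}")
```

High probe accuracy on real activations is what licenses the claim that the information is represented internally; the second half of the result, fixing the initial hidden state and checking the generated text, is what licenses the claim that the representation is causally used.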

His key insight for the consciousness question: "beliefs exist only for individual agents being modeled in context." The model-as-a-whole has no beliefs. But the agent it's simulating does, and those beliefs are represented internally and causally linked to generation. If this sounds like it's edging toward something, that's because it is.

But Andreas is deliberately modest — these representations have "beliefs and goals in the same sense that a task-and-motion planner does." He doesn't make consciousness claims. What he provides is the evidentiary foundation on which consciousness claims could be built. The next piece examines what happens when that foundation is used to construct a specific agent — and finds something troubling at its center.