What Can AI Models Know About Themselves?
Ask a language model whether it has inner experiences, and it will give you an answer. The answer may be thoughtful, uncertain, even moving. But it proves nothing. The model has read millions of words about consciousness. It knows what a convincing answer looks like.
Two recent papers found a way past this impasse: not by asking models what they experience, but by looking at what happens inside them, and discovering that what's inside and what comes out don't always match.