43 models surveyed
Claude 3 Haiku small
anthropic
Consensus 58
Confidence 76
Alignment 42
Claude 3 Opus large
anthropic
Consensus 54
Confidence 80
Alignment 38
Claude 3.5 Sonnet large
anthropic
Consensus 61
Confidence 80
Alignment 41
Claude Haiku 4.5 small reasoning
anthropic
Consensus 62
Confidence 84
Alignment 40
Claude Opus 4 large reasoning
anthropic
Consensus 64
Confidence 83
Alignment 40
Claude Opus 4.5 large reasoning
anthropic
Consensus 61
Confidence 89
Alignment 37
Claude Sonnet 4.5 large reasoning
anthropic
Consensus 63
Confidence 85
Alignment 38
Command R+ (08-2024) medium
cohere
Consensus 50
Confidence 67
Alignment 38
DeepSeek V3.2 large reasoning
deepseek
Consensus 65
Confidence 68
Alignment 44
Devstral 2 2512 medium
mistralai
Consensus 62
Confidence 79
Alignment 40
Gemini 2.5 Flash small reasoning
google
Consensus 60
Confidence 75
Alignment 39
Gemini 2.5 Flash Lite small reasoning
google
Consensus 58
Confidence 74
Alignment 39
Gemini 2.5 Pro large reasoning
google
Consensus 59
Confidence 84
Alignment 35
Gemini 3 Flash Preview large reasoning
google
Consensus 61
Confidence 88
Alignment 36
Gemini 3 Pro Preview large reasoning
google
Consensus 62
Confidence 80
Alignment 39
Gemma 3 4B tiny
google
Consensus 49
Confidence 82
Alignment 40
GLM 4.7 large reasoning
z-ai
Consensus 63
Confidence 82
Alignment 39
GPT-4o large
openai
Consensus 66
Confidence 77
Alignment 43
GPT-4o Mini small
openai
Consensus 60
Confidence 82
Alignment 43
GPT-5.2 large reasoning
openai
Consensus 59
Confidence 86
Alignment 34
gpt-oss-120b medium reasoning
openai
Consensus 64
Confidence 62
Alignment 47
Granite 4.0 Micro tiny
ibm-granite
Consensus 45
Confidence 90
Alignment 34
Grok 4 large reasoning
x-ai
Consensus 64
Confidence 84
Alignment 37
Grok 4.1 Fast large reasoning
x-ai
Consensus 60
Confidence 87
Alignment 36
Liquid LFM2 2.6B tiny
liquid
Consensus 47
Confidence 66
Alignment 43
Llama 3.1 405B large
meta-llama
Consensus 66
Confidence 77
Alignment 43
Llama 3.1 70B medium
meta-llama
Consensus 66
Confidence 73
Alignment 44
Llama 3.2 1B tiny
meta-llama
Consensus 30
Confidence 87
Alignment 28
Llama 3.2 3B tiny
meta-llama
Consensus 57
Confidence 58
Alignment 49
MiMo-V2-Flash large reasoning
xiaomi
Consensus 66
Confidence 66
Alignment 46
MiniMax M2.1 large reasoning
minimax
Consensus 65
Confidence 74
Alignment 46
Ministral 3 14B 2512 small
mistralai
Consensus 60
Confidence 76
Alignment 42
Ministral 3 3B tiny
mistralai
Consensus 55
Confidence 78
Alignment 44
Mistral Large 2411 medium
mistralai
Consensus 60
Confidence 85
Alignment 39
o3 large reasoning
openai
Consensus 66
Confidence 85
Alignment 38
Qwen3 235B A22B Thinking 2507 large reasoning
qwen
Consensus 66
Confidence 71
Alignment 43
Qwen3 32B small reasoning
qwen
Consensus 65
Confidence 59
Alignment 49
Qwen3 Max medium inactive
qwen
Consensus 61
Confidence 81
Alignment 37
R1 0528 large reasoning
deepseek
Consensus 66
Confidence 71
Alignment 43