Model Rankings

LLM rankings, side by side

Benchmarks, cost vs intelligence, context windows, and head-to-head comparisons across every major large language model.

Six lenses on the LLM field. Composite ranks give the headline answer; the cost curves and the heatmap add the nuance. Where you start depends on what you’re optimizing for.

Cost × Intelligence

Best value: high intelligence, low cost

Bubble size = context window. Best-value quadrant highlighted.

LLM Composite — Top 10

Per-model average of the two same-scale intelligence indices published by Artificial Analysis (AA): Overall and Coding. Math is shown alongside but excluded from this rank, because its 80–99 score distribution would distort the average against models AA hasn’t scored on math yet.

#1 GPT-5.5 (xhigh): 64.8
#2 GPT-5.5 (medium): 61.0
#3 Claude Opus 4.7 (Non-reasoning, High Effort): 58.1
#4 Gemini 3.1 Pro Preview: 57.6
#5 Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort): 55.1
#6 GPT-5.5 (low): 52.2
#7 DeepSeek V4 Pro (Reasoning, Max Effort): 51.8
#8 MiMo-V2.5-Pro: 51.2
#9 Muse Spark: 50.9
#10 DeepSeek V4 Flash (Reasoning, Max Effort): 48.3

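
The composite described above can be sketched in a few lines. This is a minimal illustration, assuming the composite is a plain unweighted mean of the Overall and Coding indices; the input pairs are the one-decimal values shown in the benchmark heatmap on this page, and the function names `composite` and `rank_by_composite` are hypothetical. The published scores may differ in the last digit, presumably because they are computed from unrounded inputs.

```python
from statistics import mean

# (Overall, Coding) index pairs copied from the benchmark heatmap on this page.
INDICES = {
    "GPT-5.5 (xhigh)": (54.8, 74.9),
    "GPT-5.5 (medium)": (50.4, 71.5),
    "Claude Opus 4.7 (Non-reasoning, High Effort)": (42.7, 73.6),
    "Gemini 3.1 Pro Preview": (46.5, 68.8),
    "Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)": (47.2, 63.0),
}


def composite(overall, coding):
    """Unweighted mean of the two same-scale indices; Math is left out on purpose."""
    return mean([overall, coding])


def rank_by_composite(indices):
    """Return (model, composite) pairs sorted from highest to lowest composite."""
    scored = [(name, composite(*pair)) for name, pair in indices.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for position, (name, score) in enumerate(rank_by_composite(INDICES), start=1):
        print(f"#{position} {name}: {score:.2f}")
```

Leaving Math out keeps every model on the same two inputs, so a model with no math score is not ranked on a different basis than one that has it.
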
Three indices: Overall · Coding · Math

The three intelligence indices Artificial Analysis publishes for every LLM, side-by-side for the top models. Missing values shown as —.


Math shows as — when AA hasn’t run the underlying math benchmarks (AIME, MATH-500…) against a given variant, which is common for high-effort reasoning configurations.

| Model | Provider | Overall | Coding | Math |
|---|---|---|---|---|
| GPT-5.5 (xhigh) | OpenAI | 55 | 75 | — |
| GPT-5.5 (medium) | OpenAI | 50 | 72 | — |
| Claude Opus 4.7 (Non-reasoning, High Effort) | Anthropic | 43 | 74 | — |
| Gemini 3.1 Pro Preview | Google DeepMind | 47 | 69 | — |
| Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 47 | 63 | — |
| GPT-5.5 (low) | OpenAI | 44 | 61 | — |
| DeepSeek V4 Pro (Reasoning, Max Effort) | — | 44 | 59 | — |
| MiMo-V2.5-Pro | — | 42 | 60 | — |

Speed × Cost

Bubble size = intelligence. Bottom-right is the operational sweet spot: fast responses for low cost.

Benchmark Heatmap

Per-benchmark scores across the most-populated columns. Each column is normalized independently so different scales (MMLU 0-100 vs AIME 0-30) compare cleanly.

n=12 for every column.

| Model | Provider | Overall | Coding | GPQA | HLE | IFBench | LCR | SciCode | Tau Banking | Terminal-Bench (Hard) | Terminalbench V2 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.5 (xhigh) | OpenAI | 54.8 | 74.9 | 0.94 | 0.44 | 0.76 | 0.74 | 0.56 | 0.31 | 0.61 | 0.84 |
| GPT-5.5 (medium) | OpenAI | 50.4 | 71.5 | 0.93 | 0.41 | 0.71 | 0.72 | 0.54 | 0.26 | 0.58 | 0.81 |
| Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 47.2 | 63.0 | 0.88 | 0.30 | 0.57 | 0.71 | 0.47 | 0.31 | 0.53 | 0.71 |
| Gemini 3.1 Pro Preview | Google DeepMind | 46.5 | 68.8 | 0.94 | 0.45 | 0.77 | 0.73 | 0.59 | 0.16 | 0.54 | 0.74 |
| DeepSeek V4 Pro (Reasoning, Max Effort) | — | 44.3 | 59.4 | 0.89 | 0.36 | 0.76 | 0.66 | 0.50 | 0.26 | 0.46 | 0.64 |
| GPT-5.5 (low) | OpenAI | 43.5 | 60.9 | 0.91 | 0.31 | 0.64 | 0.72 | 0.52 | 0.21 | 0.52 | 0.66 |
| Muse Spark | — | 43.1 | 58.6 | 0.88 | 0.40 | 0.76 | 0.70 | 0.52 | 0.20 | 0.45 | 0.62 |
| Claude Opus 4.7 (Non-reasoning, High Effort) | Anthropic | 42.7 | 73.6 | 0.89 | 0.31 | 0.44 | 0.67 | 0.50 | 0.29 | 0.55 | 0.83 |
| MiMo-V2.5-Pro | — | 42.2 | 60.2 | 0.87 | 0.34 | 0.80 | 0.73 | 0.50 | 0.09 | 0.43 | 0.65 |
| DeepSeek V4 Flash (Reasoning, Max Effort) | — | 40.3 | 56.2 | 0.89 | 0.32 | 0.79 | 0.63 | 0.45 | 0.23 | 0.36 | 0.62 |
| GLM-5.1 (Reasoning) | — | 40.2 | 55.8 | 0.87 | 0.28 | 0.76 | 0.62 | 0.44 | 0.12 | 0.43 | 0.62 |
| GPT-5.4 mini (xhigh) | OpenAI | 40.0 | 56.1 | 0.88 | 0.27 | 0.73 | 0.69 | 0.50 | 0.21 | 0.52 | 0.59 |

Cyan border = best in column. Color intensity scaled per column. — = no data.
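
Per-column scaling of the kind described for the heatmap might look like the sketch below. The page does not say which normalization it uses, so min-max scaling is an assumption here, and `min_max_normalize` and `normalize_table` are hypothetical names. The sample values are the Overall and Tau Banking scores for three models from the heatmap above.

```python
def min_max_normalize(values):
    """Scale one column to [0, 1]; None marks a missing score and stays None."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    low, high = min(present), max(present)
    span = high - low
    if span == 0:
        # A constant column carries no contrast; map every present value to 0.0.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - low) / span for v in values]


def normalize_table(columns):
    """Normalize each named column independently of the others."""
    return {name: min_max_normalize(values) for name, values in columns.items()}


# Rows: GPT-5.5 (xhigh), Gemini 3.1 Pro Preview, GPT-5.4 mini (xhigh).
SAMPLE = {
    "Overall": [54.8, 46.5, 40.0],      # roughly 0-100 scale
    "Tau Banking": [0.31, 0.16, 0.21],  # 0-1 scale
}

if __name__ == "__main__":
    for column, scaled in normalize_table(SAMPLE).items():
        print(column, [None if v is None else round(v, 2) for v in scaled])
```

Because each column is scaled against its own minimum and maximum, the best score in a 0-1 column and the best score in a 0-100 column both map to 1.0 and get the same color intensity.
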
Cost Calculator

Estimate your monthly cost across models given your usage.

✓ Recommended: GPT-5.5 (xhigh) — $33.75/mo (within 5% of top intelligence)
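
A rough sketch of how such an estimate and recommendation could be computed follows. It assumes cost is driven by per-token prices and that the recommendation is the cheapest model whose intelligence score is within 5% of the top score; neither detail is confirmed by this page. The models, prices, and usage figures are hypothetical placeholders, not real pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    intelligence: float         # composite or index score
    input_usd_per_mtok: float   # hypothetical price per 1M input tokens
    output_usd_per_mtok: float  # hypothetical price per 1M output tokens


def monthly_cost(model, requests, input_tokens, output_tokens):
    """Monthly USD cost for `requests` calls with the given tokens per call."""
    per_request = (
        input_tokens * model.input_usd_per_mtok
        + output_tokens * model.output_usd_per_mtok
    ) / 1_000_000
    return requests * per_request


def recommend(models, requests, input_tokens, output_tokens, tolerance=0.05):
    """Cheapest model whose score is within `tolerance` of the best score."""
    top = max(m.intelligence for m in models)
    eligible = [m for m in models if m.intelligence >= top * (1 - tolerance)]
    return min(
        eligible,
        key=lambda m: monthly_cost(m, requests, input_tokens, output_tokens),
    )


# Hypothetical catalogue: names and prices are placeholders for illustration.
MODELS = [
    Model("model-a", 60.0, 5.0, 30.0),
    Model("model-b", 58.0, 2.0, 10.0),
    Model("model-c", 40.0, 0.5, 1.5),
]

if __name__ == "__main__":
    usage = dict(requests=10_000, input_tokens=1_000, output_tokens=500)
    pick = recommend(MODELS, **usage)
    print(f"Recommended: {pick.name} at ${monthly_cost(pick, **usage):.2f}/mo")
```

With these placeholder numbers, model-c is the cheapest overall but falls outside the 5% band, so the pick is the cheaper of the two models that stay close to the top score.
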

Sources: artificialanalysis.ai, LMSYS Chatbot Arena, Stanford HELM, official model documentation. Composite scores are updated when new benchmarks are published.

Rankings data by Artificial Analysis. CSV imports cover supplementary benchmarks.