Model Rankings

LLM rankings, side by side

Benchmarks, cost vs intelligence, context windows, and head-to-head comparisons across every major large language model.

Six lenses on the LLM field: composite ranks for the headline answer, cost curves and the heatmap for the nuance. Where you start depends on what you're optimizing for.

Cost × Intelligence

Best value: high intelligence, low cost

Bubble size = context window. Best-value quadrant highlighted.
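Deriving the highlighted quadrant takes only a few lines. A minimal sketch, assuming "best value" means at-or-above-median intelligence at at-or-below-median cost; the intelligence values come from the Top 10 list below, but the prices are illustrative placeholders, not AA's published rates.

```python
from statistics import median

# (composite intelligence, $ per 1M blended tokens).
# Intelligence values from the Top 10 list on this page; prices are
# PLACEHOLDERS for illustration, not AA's published rates.
models = {
    "GPT-5.5 (xhigh)": (59.7, 21.00),
    "Gemini 3.1 Pro Preview": (56.4, 9.00),
    "Claude Opus 4.7 (Adaptive Reasoning, Max Effort)": (54.9, 30.00),
    "GPT-5.4 mini (xhigh)": (50.2, 3.50),
}

iq_cut = median(iq for iq, _ in models.values())
cost_cut = median(cost for _, cost in models.values())

# Best-value quadrant: at least median intelligence, at most median cost.
best_value = [name for name, (iq, cost) in models.items()
              if iq >= iq_cut and cost <= cost_cut]
print(best_value)  # with these placeholder prices: Gemini 3.1 Pro Preview
```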

LLM Composite — Top 10

Average of AA's same-scale intelligence indices (Overall and Coding) per model. Math is shown side by side at right but excluded from this rank: its 80–99 distribution would inflate composites for math-scored models and skew comparisons against models AA hasn't scored on math yet. A sketch of the averaging follows the list.

#1 GPT-5.5 (xhigh) · 59.7
#2 GPT-5.5 (medium) · 56.5
#3 Gemini 3.1 Pro Preview · 56.4
#4 Claude Opus 4.7 (Adaptive Reasoning, Max Effort) · 54.9
#5 GPT-5.5 (low) · 51.5
#6 Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) · 51.3
#7 Claude Opus 4.6 (Adaptive Reasoning, Max Effort) · 50.5
#8 GPT-5.4 mini (xhigh) · 50.2
#9 Muse Spark · 49.8
#10 MiMo-V2.5-Pro · 49.6
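
The rule in the caption above reduces to a two-value mean. A minimal sketch, assuming the indices arrive as plain floats; the spot-checks use rows from the heatmap further down the page, and small offsets elsewhere in the list (e.g. 56.4 vs a naive 56.35 for Gemini) presumably reflect AA publishing the indices at higher precision than the tables show.

```python
def composite(overall: float, coding: float) -> float:
    """Composite score: mean of AA's Overall and Coding indices.

    Math is deliberately excluded (see the caption above): its 80-99
    scale would inflate composites only for math-scored models.
    """
    return round((overall + coding) / 2, 1)

# Spot-checks against the heatmap rows further down the page:
assert composite(57.3, 52.5) == 54.9  # Claude Opus 4.7 -> rank #4
assert composite(48.9, 51.5) == 50.2  # GPT-5.4 mini (xhigh) -> rank #8
```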
Three indices: Overall · Coding · Math

The three intelligence indices Artificial Analysis publishes for every LLM, side-by-side for the top models. Missing values shown as —.


Math shows as — when AA hasn't run the underlying math benchmarks (AIME, MATH-500…) against a given variant, which is common for high-effort reasoning configurations.

GPT-5.5 (xhigh), OpenAI · Overall 60 · Coding 59 · Math —
GPT-5.5 (medium), OpenAI · Overall 57 · Coding 56 · Math —
Gemini 3.1 Pro Preview, Google DeepMind · Overall 57 · Coding 56 · Math —
Claude Opus 4.7 (Adaptive Reasoning, Max Effort), Anthropic · Overall 57 · Coding 53 · Math —
GPT-5.5 (low), OpenAI · Overall 51 · Coding 52 · Math —
Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort), Anthropic · Overall 52 · Coding 51 · Math —
Claude Opus 4.6 (Adaptive Reasoning, Max Effort), Anthropic · Overall 53 · Coding 48 · Math —
GPT-5.4 mini (xhigh), OpenAI · Overall 49 · Coding 52 · Math —
Speed × Cost

Bubble size = intelligence. Bottom-right is the operational sweet spot: fast responses for low cost.

Benchmark Heatmap

Per-benchmark scores across the most-populated columns. Each column is normalized independently so different scales (MMLU 0–100 vs AIME 0–30) compare cleanly; a minimal normalization sketch follows the table.

Model · Overall · Coding · GPQA · HLE · IFBench · LCR · SciCode · Terminal-Bench (Hard) · τ²-Bench · Creative (n = 12 per column; Creative n = 2)

GPT-5.5 (xhigh), OpenAI · 60.2 · 59.1 · 0.94 · 0.44 · 0.76 · 0.74 · 0.56 · 0.61 · 0.94 · —
Claude Opus 4.7 (Adaptive Reasoning, Max Effort), Anthropic · 57.3 · 52.5 · 0.91 · 0.40 · 0.59 · 0.70 · 0.55 · 0.52 · 0.89 · —
Gemini 3.1 Pro Preview, Google DeepMind · 57.2 · 55.5 · 0.94 · 0.45 · 0.77 · 0.73 · 0.59 · 0.54 · 0.96 · 8.50
GPT-5.5 (medium), OpenAI · 56.7 · 56.2 · 0.93 · 0.41 · 0.71 · 0.72 · 0.54 · 0.58 · 0.92 · —
MiMo-V2.5-Pro · 53.8 · 45.5 · 0.87 · 0.34 · 0.80 · 0.73 · 0.50 · 0.43 · 0.94 · —
Claude Opus 4.6 (Adaptive Reasoning, Max Effort), Anthropic · 53.0 · 48.1 · 0.90 · 0.37 · 0.53 · 0.71 · 0.52 · 0.46 · 0.92 · —
Muse Spark · 52.1 · 47.5 · 0.88 · 0.40 · 0.76 · 0.70 · 0.52 · 0.46 · 0.92 · —
Qwen3.6 Max Preview · 51.8 · 44.9 · 0.89 · 0.29 · 0.77 · 0.70 · 0.47 · 0.44 · 0.96 · —
Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort), Anthropic · 51.7 · 50.9 · 0.88 · 0.30 · 0.57 · 0.71 · 0.47 · 0.53 · 0.76 · 8.70
GPT-5.5 (low), OpenAI · 50.8 · 52.1 · 0.91 · 0.31 · 0.64 · 0.72 · 0.52 · 0.52 · 0.84 · —
Claude Opus 4.5 (Reasoning), Anthropic · 49.7 · 47.8 · 0.87 · 0.28 · 0.58 · 0.74 · 0.50 · 0.47 · 0.90 · —
GPT-5.4 mini (xhigh), OpenAI · 48.9 · 51.5 · 0.88 · 0.27 · 0.73 · 0.69 · 0.50 · 0.52 · 0.83 · —
Cyan border = best in column. Color intensity scaled per column. — = no data.
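
Per-column normalization, as promised above, is the standard min-max rescale. A minimal sketch, assuming each column maps to [0, 1] and None marks the "— = no data" cells; the example values echo the HLE column but add one missing cell for illustration.

```python
def normalize_column(values):
    """Min-max rescale one benchmark column to [0, 1].

    None marks "no data" cells and is passed through untouched, so
    columns on different scales (MMLU 0-100, AIME 0-30, ...) become
    directly comparable.
    """
    scored = [v for v in values if v is not None]
    lo, hi = min(scored), max(scored)
    span = (hi - lo) or 1.0  # guard against a constant column
    return [None if v is None else (v - lo) / span for v in values]

# Illustrative values echoing the HLE column, plus one missing cell:
print(normalize_column([0.44, 0.45, 0.41, 0.27, None]))
# -> [0.944..., 1.0, 0.777..., 0.0, None]; the 1.0 cell is the
#    best-in-column entry that gets the cyan border.
```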
Cost Calculator

Estimate your monthly cost across models given your usage.

✓ Recommended: Gemini 3.1 Pro Preview — $13.50/mo (within 5% of top intelligence)
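
For readers who want the calculator's arithmetic outside the page: a minimal sketch, assuming usage is given as monthly input/output tokens and prices as dollars per million tokens. The prices below are placeholders rather than AA's published rates, so the winner differs from the live recommendation above; the "within 5% of top intelligence" rule is the one the page states.

```python
def monthly_cost(in_tok: int, out_tok: int,
                 in_price: float, out_price: float) -> float:
    """Monthly USD cost given token usage and $-per-1M-token prices."""
    return (in_tok * in_price + out_tok * out_price) / 1e6

# (composite intelligence, $/1M input, $/1M output).
# PLACEHOLDER prices for illustration, not AA's published rates.
catalog = {
    "GPT-5.5 (xhigh)":        (59.7, 10.0, 30.0),
    "Gemini 3.1 Pro Preview": (56.4,  2.0,  8.0),
    "GPT-5.4 mini (xhigh)":   (50.2,  0.5,  2.0),
}

IN_TOK, OUT_TOK = 3_000_000, 1_000_000  # example monthly usage

# Recommendation rule from the page: cheapest model whose intelligence
# is within 5% of the best score.
top = max(iq for iq, _, _ in catalog.values())
eligible = {name: monthly_cost(IN_TOK, OUT_TOK, pin, pout)
            for name, (iq, pin, pout) in catalog.items()
            if iq >= 0.95 * top}
name, cost = min(eligible.items(), key=lambda kv: kv[1])
print(f"Recommended: {name} at ${cost:.2f}/mo")
```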

Sources: artificialanalysis.ai, LMSYS Chatbot Arena, Stanford HELM, official model documentation. Composite scores updated when new benchmarks publish. Read our methodology →

Rankings data by Artificial Analysis. CSV imports cover supplementary benchmarks.