/04

— Model rankings —

LLM rankings, side by side

Benchmarks, cost vs intelligence, context windows, and head-to-head comparisons across every major large language model.

Charts All Models Compare

Six lenses on the LLM field. Composite ranks for the headline answer; cost-curves and the heatmap for the nuance. Where you start depends on what you’re optimizing for.

Cost × Intelligence

Best value: high intelligence, low cost

Bubble size = context window. Best-value quadrant highlighted.

LLM Composite — Top 10

Average of AA’s same-scale intelligence indices (Overall and Coding) per model. Math is shown side-by-side at right but excluded from this rank — its 80–99 distribution would distort the average against models AA hasn’t scored on math yet.

#1GPT-5.5 (xhigh)

59.7—

#2GPT-5.5 (medium)

56.5—

#3Gemini 3.1 Pro Preview

56.4—

#4Claude Opus 4.7 (Adaptive Reasoning, Max Effort)

54.9—

#5GPT-5.5 (low)

51.5—

#6Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)

51.3—

#7Claude Opus 4.6 (Adaptive Reasoning, Max Effort)

50.5—

#8GPT-5.4 mini (xhigh)

50.2—

#9Muse Spark

49.8—

#10MiMo-V2.5-Pro

49.6—

Three indices: Overall · Coding · Math

The three intelligence indices Artificial Analysis publishes for every LLM, side-by-side for the top models. Missing values shown as —.

Overall

Coding

Math

Math shows as —when AA hasn’t run the underlying math benchmarks (AIME, MATH-500…) against a given variant — common for high-effort reasoning configurations.

GPT-5.5 (xhigh)

OpenAI

Overall

Coding

Math

—

GPT-5.5 (medium)

OpenAI

Overall

Coding

Math

—

Gemini 3.1 Pro Preview

Google DeepMind

Overall

Coding

Math

—

Claude Opus 4.7 (Adaptive Reasoning, Max Effort)

Anthropic

Overall

Coding

Math

—

GPT-5.5 (low)

OpenAI

Overall

Coding

Math

—

Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)

Anthropic

Overall

Coding

Math

—

Claude Opus 4.6 (Adaptive Reasoning, Max Effort)

Anthropic

Overall

Coding

Math

—

GPT-5.4 mini (xhigh)

OpenAI

Overall

Coding

Math

—

Speed × Cost

Bubble size = intelligence. Bottom-right is the operational sweet spot: fast responses for low cost.

Benchmark Heatmap

Per-benchmark scores across the most-populated columns. Each column is normalized independently so different scales (MMLU 0-100 vs AIME 0-30) compare cleanly.

Model	Overall n=12	Coding n=12	GPQA n=12	HLE n=12	IFBench n=12	LCR n=12	SciCode n=12	Terminal-Bench (Hard) n=12	τ²-Bench n=12	Creative n=2
GPT-5.5 (xhigh) OpenAI	60.2	59.1	0.94	0.44	0.76	0.74	0.56	0.61	0.94	—
Claude Opus 4.7 (Adaptive Reasoning, Max Effort) Anthropic	57.3	52.5	0.91	0.40	0.59	0.70	0.55	0.52	0.89	—
Gemini 3.1 Pro Preview Google DeepMind	57.2	55.5	0.94	0.45	0.77	0.73	0.59	0.54	0.96	8.50
GPT-5.5 (medium) OpenAI	56.7	56.2	0.93	0.41	0.71	0.72	0.54	0.58	0.92	—
MiMo-V2.5-Pro	53.8	45.5	0.87	0.34	0.80	0.73	0.50	0.43	0.94	—
Claude Opus 4.6 (Adaptive Reasoning, Max Effort) Anthropic	53.0	48.1	0.90	0.37	0.53	0.71	0.52	0.46	0.92	—
Muse Spark	52.1	47.5	0.88	0.40	0.76	0.70	0.52	0.46	0.92	—
Qwen3.6 Max Preview	51.8	44.9	0.89	0.29	0.77	0.70	0.47	0.44	0.96	—
Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) Anthropic	51.7	50.9	0.88	0.30	0.57	0.71	0.47	0.53	0.76	8.70
GPT-5.5 (low) OpenAI	50.8	52.1	0.91	0.31	0.64	0.72	0.52	0.52	0.84	—
Claude Opus 4.5 (Reasoning) Anthropic	49.7	47.8	0.87	0.28	0.58	0.74	0.50	0.47	0.90	—
GPT-5.4 mini (xhigh) OpenAI	48.9	51.5	0.88	0.27	0.73	0.69	0.50	0.52	0.83	—

Cyan border = best in column. Color intensity scaled per column. — = no data.

Cost Calculator

Estimate your monthly cost across models given your usage.

Tokens per day: 100,000

Output ratio: 25% out / 75% in

✓ Recommended: Gemini 3.1 Pro Preview — $13.50/mo (within 5% of top intelligence)

Sources: artificialanalysis.ai, LMSYS Chatbot Arena, Stanford HELM, official model documentation. Composite scores updated when new benchmarks publish. Read our methodology →

Rankings data by Artificial Analysis. CSV imports cover supplementary benchmarks.