
LLM rankings, side by side
Benchmarks, cost vs intelligence, context windows, and head-to-head comparisons across every major large language model.
Six lenses on the LLM field. Composite ranks give the headline answer; the cost curves and the heatmap give the nuance. Where you start depends on what you’re optimizing for.
Best value: high intelligence, low cost
Bubble size = context window. Best-value quadrant highlighted.
Average of the same-scale intelligence indices (Overall and Coding) that Artificial Analysis (AA) publishes per model. Math is shown side by side at right but excluded from this rank: its scores cluster in the 80–99 range, which would distort the average against models AA hasn’t scored on math yet.
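As a rough illustration of the composite rank, the sketch below averages the Overall and Coding indices for four models, using values copied from the table on this page. The function name `composite` and the choice to skip a missing index instead of treating it as zero are assumptions for the example, not the site's documented method.

```python
from statistics import mean

# (overall, coding) index values copied from the table on this page.
scores = {
    "GPT-5.5 (xhigh)": (54.8, 74.9),
    "Gemini 3.1 Pro Preview": (46.5, 68.8),
    "Claude Opus 4.7 (Non-reasoning, High Effort)": (42.7, 73.6),
    "Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)": (47.2, 63.0),
}


def composite(overall, coding):
    """Mean of the same-scale indices; Math is deliberately left out.

    Assumption: an index that is None (not yet scored) is skipped.
    """
    present = [v for v in (overall, coding) if v is not None]
    return mean(present) if present else None


# Highest composite first.
ranked = sorted(scores, key=lambda m: composite(*scores[m]), reverse=True)
for name in ranked:
    print(f"{composite(*scores[name]):6.2f}  {name}")
```

Note that a strong Coding index can lift a model above one with a higher Overall index, which is why the composite order can differ from the Overall column alone.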
The three intelligence indices Artificial Analysis publishes for every LLM, side-by-side for the top models. Missing values shown as —.
Math shows as — when AA hasn’t run the underlying math benchmarks (AIME, MATH-500…) against a given variant, which is common for high-effort reasoning configurations.
Bubble size = intelligence. Bottom-right is the operational sweet spot: fast responses for low cost.
Per-benchmark scores across the most-populated columns. Each column is normalized independently so different scales (MMLU 0-100 vs AIME 0-30) compare cleanly.
| Model | Overall (n=12) | Coding (n=12) | GPQA (n=12) | HLE (n=12) | IFBench (n=12) | LCR (n=12) | SciCode (n=12) | Tau Banking (n=12) | Terminal-Bench (Hard) (n=12) | Terminalbench V2 1 (n=12) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.5 (xhigh) OpenAI | 54.8 | 74.9 | 0.94 | 0.44 | 0.76 | 0.74 | 0.56 | 0.31 | 0.61 | 0.84 |
| GPT-5.5 (medium) OpenAI | 50.4 | 71.5 | 0.93 | 0.41 | 0.71 | 0.72 | 0.54 | 0.26 | 0.58 | 0.81 |
| Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) Anthropic | 47.2 | 63.0 | 0.88 | 0.30 | 0.57 | 0.71 | 0.47 | 0.31 | 0.53 | 0.71 |
| Gemini 3.1 Pro Preview Google DeepMind | 46.5 | 68.8 | 0.94 | 0.45 | 0.77 | 0.73 | 0.59 | 0.16 | 0.54 | 0.74 |
| DeepSeek V4 Pro (Reasoning, Max Effort) | 44.3 | 59.4 | 0.89 | 0.36 | 0.76 | 0.66 | 0.50 | 0.26 | 0.46 | 0.64 |
| GPT-5.5 (low) OpenAI | 43.5 | 60.9 | 0.91 | 0.31 | 0.64 | 0.72 | 0.52 | 0.21 | 0.52 | 0.66 |
| Muse Spark | 43.1 | 58.6 | 0.88 | 0.40 | 0.76 | 0.70 | 0.52 | 0.20 | 0.45 | 0.62 |
| Claude Opus 4.7 (Non-reasoning, High Effort) Anthropic | 42.7 | 73.6 | 0.89 | 0.31 | 0.44 | 0.67 | 0.50 | 0.29 | 0.55 | 0.83 |
| MiMo-V2.5-Pro | 42.2 | 60.2 | 0.87 | 0.34 | 0.80 | 0.73 | 0.50 | 0.09 | 0.43 | 0.65 |
| DeepSeek V4 Flash (Reasoning, Max Effort) | 40.3 | 56.2 | 0.89 | 0.32 | 0.79 | 0.63 | 0.45 | 0.23 | 0.36 | 0.62 |
| GLM-5.1 (Reasoning) | 40.2 | 55.8 | 0.87 | 0.28 | 0.76 | 0.62 | 0.44 | 0.12 | 0.43 | 0.62 |
| GPT-5.4 mini (xhigh) OpenAI | 40.0 | 56.1 | 0.88 | 0.27 | 0.73 | 0.69 | 0.50 | 0.21 | 0.52 | 0.59 |
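The heatmap's per-column normalization can be sketched as follows. The page says only that each column is normalized independently; min-max scaling to the 0 to 1 range is an assumption made here for illustration, and `normalize_column` is a hypothetical helper name. The sample values are the first four rows of the Overall and HLE columns above.

```python
def normalize_column(values):
    """Min-max scale one column to [0, 1]; None (a missing score) stays None."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    lo, hi = min(present), max(present)
    if hi == lo:
        # A constant column carries no ranking signal; map it to 0.0.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - lo) / (hi - lo) for v in values]


# First four rows of two columns that sit on different raw scales.
overall = [54.8, 50.4, 47.2, 46.5]   # index on a 0-100 scale
hle = [0.44, 0.41, 0.30, 0.45]       # fraction on a 0-1 scale

norm_overall = normalize_column(overall)
norm_hle = normalize_column(hle)

for raw_o, n_o, raw_h, n_h in zip(overall, norm_overall, hle, norm_hle):
    print(f"Overall {raw_o:5.1f} -> {n_o:.2f}    HLE {raw_h:.2f} -> {n_h:.2f}")
```

Because each column is scaled against its own minimum and maximum, a cell's shade reflects where a model stands within that benchmark, not the raw magnitude of the score.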
Estimate your monthly cost across models given your usage.
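A minimal sketch of that kind of estimate, assuming the per-million-token pricing that most LLM APIs use. The function name, the usage figures, and both prices are placeholders for illustration; none of them come from this page or from any provider's price list.

```python
def monthly_cost(requests_per_day, input_tokens, output_tokens,
                 usd_per_m_input, usd_per_m_output, days=30):
    """Estimated monthly spend in USD for one model.

    input_tokens and output_tokens are per-request averages;
    prices are USD per one million tokens (an assumed pricing model).
    """
    per_request_micro_usd = (input_tokens * usd_per_m_input
                             + output_tokens * usd_per_m_output)
    return requests_per_day * days * per_request_micro_usd / 1_000_000


# Hypothetical usage: 1,000 requests a day, 2,000 input and 500 output tokens each.
# Hypothetical prices: $1.00 per million input tokens, $4.00 per million output tokens.
estimate = monthly_cost(1_000, 2_000, 500, usd_per_m_input=1.0, usd_per_m_output=4.0)
print(f"${estimate:,.2f} per month")
```

Running the same usage profile through each model's actual published prices gives a like-for-like comparison; output tokens usually dominate for reasoning-heavy configurations because they produce longer responses.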
Sources: artificialanalysis.ai, LMSYS Chatbot Arena, Stanford HELM, official model documentation. Composite scores are updated when new benchmarks publish.
Rankings data by Artificial Analysis. CSV imports cover supplementary benchmarks.