Independent-style analysis of frontier AI models and the APIs that serve them, across the three numbers that decide every deployment: intelligence (higher is better), output speed in tokens per second (higher is better), and blended price per million tokens (lower is better).
Frontier models: 31 · Labs tracked: 16 · Evals in the index: 9 · Snapshot: June 11, 2026
Intelligence
Higher is better. Intelligence Index · best model per lab
1. Claude Fable 5 (with fallback): 65
2. GPT-5.5 (xhigh): 60
3. Gemini 3.1 Pro Preview: 57
4. Qwen3.7 Max: 57
5. MiniMax-M3: 55
6. Grok 4.3 (high): 53
7. Muse Spark: 52
8. DeepSeek V4 Pro (Max): 52
9. Nemotron 3 Ultra: 48
Speed
Higher is better. Median output tokens per second · best per lab
1. Qwen3.7 Max: 172
2. Nemotron 3 Ultra: 145
3. Grok 4.3 (high): 144
4. Nova 2.0 Pro Preview (medium): 124
5. Gemini 3.1 Pro Preview: 110
6. Mistral Medium 3.5: 66
7. GLM-5.1: 63
8. Claude Fable 5 (with fallback): 60
9. Kimi K2.6: 60
Price
Lower is better. USD per 1M tokens, 3:1 blended · cheapest per lab
1. MiniMax-M3: $0.53
2. DeepSeek V4 Pro (Max): $0.54
3. Nemotron 3 Ultra: $1.10
4. MiMo-V2.5-Pro: $1.35
5. Grok 4.3 (high): $1.56
6. Kimi K2.6: $1.71
7. GLM-5.1: $2.15
8. Mistral Medium 3.5: $3.00
9. Nova 2.0 Pro Preview (medium): $3.44
Model comparison summary
Every model we track, ranked by Intelligence Index.
| # | Model | Creator | Released | Context | Intelligence | Coding | Agentic | Speed (t/s) | Blended $/1M | Latency |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Fable 5 (with fallback) | Anthropic | Jun 2026 | 1M | 65 | 62 | 81 | 60 | $21.88 | 108s |
| 2 | | Anthropic | May 2026 | 1M | 61 | 57 | 78 | 58 | $10.94 | 56s |
| 3 | GPT-5.5 (xhigh) | OpenAI | Apr 2026 | 922k | 60 | 59 | 74 | 50 | $11.25 | 86s |
| 4 | | Anthropic | Apr 2026 | 1M | 57 | 53 | 71 | 44 | $10.94 | 21s |
| 5 | Gemini 3.1 Pro Preview | Google | Feb 2026 | 1M | 57 | 56 | 59 | 110 | $4.50 | 26s |
| 6 | Qwen3.7 Max | Alibaba | May 2026 | 1M | 57 | 50 | 67 | 172 | $3.75 | 17s |
| 7 | | | May 2026 | 1M | 55 | 45 | 70 | 145 | $3.38 | 21s |
| 8 | MiniMax-M3 | MiniMax | Jun 2026 | 1M | 55 | 43 | 69 | 43 | $0.53 | 49s |
| 9 | | OpenAI | Feb 2026 | 400k | 54 | 53 | 61 | 67 | $4.81 | 86s |
| 10 | | Alibaba | Jun 2026 | 1M | 53 | 47 | 65 | 53 | $0.59 | 40s |
| 11 | Grok 4.3 (high) | xAI | Apr 2026 | 1M | 53 | 41 | 66 | 144 | $1.56 | 14s |
| 12 | Muse Spark | Meta | Apr 2026 | 262k | 52 | 48 | 62 | — | — | — |
| 13 | | Anthropic | Feb 2026 | 1M | 52 | 51 | 63 | 48 | $6.56 | 139s |
| 14 | DeepSeek V4 Pro (Max) | DeepSeek | Apr 2026 | 1M | 52 | 48 | 67 | 46 | $0.54 | 97s |
| 15 | | MiniMax | Mar 2026 | 205k | 50 | 42 | 62 | 42 | $0.53 | 61s |
| 16 | | OpenAI | Mar 2026 | 400k | 49 | 52 | 59 | 149 | $1.69 | 7.1s |
| 17 | Nemotron 3 Ultra | NVIDIA | Jun 2026 | 262k | 48 | 38 | 57 | 145 | $1.10 | 17s |
| 18 | | DeepSeek | Apr 2026 | 1M | 47 | 39 | 61 | 89 | $0.18 | 64s |
| 19 | | Alibaba | Feb 2026 | 262k | 45 | 41 | 56 | 52 | $1.35 | 63s |
| 20 | GLM-5.1 | Z AI | Apr 2026 | 200k | 44 | 36 | 66 | 63 | $2.15 | 1.8s |
| 21 | Kimi K2.6 | Kimi | Apr 2026 | 256k | 43 | 38 | 59 | 60 | $1.71 | 2.4s |
| 22 | | | Apr 2026 | 256k | 39 | 39 | 41 | 36 | — | 50s |
| 23 | Mistral Medium 3.5 | Mistral | Apr 2026 | 256k | 39 | 35 | 53 | 66 | $3.00 | 33s |
| 24 | Nova 2.0 Pro Preview (medium) | Amazon | Nov 2025 | 256k | 36 | 30 | 47 | 124 | $3.44 | 30s |
| 25 | MiMo-V2.5-Pro | Xiaomi | Apr 2026 | 1M | 36 | 37 | 51 | 45 | $1.35 | 3.6s |
| 26 | | OpenAI | Aug 2025 | 131k | 33 | 29 | 38 | 336 | $0.26 | 6.9s |
| 27 | | Anthropic | Oct 2025 | 200k | 31 | 30 | 33 | 88 | $2.19 | 0.9s |
| 28 | | Upstage | Apr 2026 | 128k | 26 | 13 | 35 | — | — | — |
| 29 | | OpenAI | Aug 2025 | 131k | 25 | 19 | 28 | 222 | $0.09 | 9.8s |
| 30 | | MBZUAI Institute of Foundation Models | Dec 2025 | 262k | 24 | 16 | 15 | — | — | — |
| — | | OpenAI | Apr 2026 | 922k | — | — | — | — | — | — |
Intelligence Index
Composite of 10 evaluations spanning reasoning, knowledge, math, coding, and agentic tool use. Higher is better.
Incorporates GPQA Diamond, Humanity's Last Exam, AIME 2025, LiveCodeBench, SciCode, IFBench, Terminal-Bench Hard, τ²-Bench, and more.
Intelligence vs. Price
Intelligence Index against blended USD per 1M tokens (3:1 input:output, log scale).
Up and to the left wins: more intelligence per dollar. Models without public API pricing are excluded.
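One way to read the 3:1 blend is as a weighted average of list prices in which input tokens count three times as much as output tokens. The sketch below assumes that weighted-average reading; the function name and the example prices are hypothetical and are not taken from the snapshot.

```python
def blended_price(input_usd_per_m: float, output_usd_per_m: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average of input and output list prices, in USD per 1M tokens."""
    total_weight = input_weight + output_weight
    return (input_weight * input_usd_per_m
            + output_weight * output_usd_per_m) / total_weight


# Hypothetical list prices: $2.00 per 1M input tokens, $8.00 per 1M output tokens.
example = blended_price(2.00, 8.00)
print(f"${example:.2f} per 1M tokens")  # (3 * 2.00 + 1 * 8.00) / 4 = 3.50
```

Because output tokens carry only a quarter of the weight, a workload that generates far more than it reads will cost more per token than the blended figure suggests.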
Intelligence vs. Output Speed
Intelligence Index against median output tokens per second.
Up and to the right wins: smart and fast. Speed is the median across providers serving each model.
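"Up and to the right wins" amounts to a Pareto comparison: a model is on the frontier if no other model is at least as good on both axes and strictly better on one. A minimal sketch follows, using the five models that appear in both the Intelligence and Speed best-per-lab lists of this snapshot; the `Point`, `dominates`, and `pareto_frontier` names are illustrative, not part of any published tooling.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    intelligence: float  # Intelligence Index, higher is better
    speed: float         # median output tokens per second, higher is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (a.intelligence >= b.intelligence and a.speed >= b.speed
            and (a.intelligence > b.intelligence or a.speed > b.speed))


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [
    Point("Claude Fable 5 (with fallback)", 65, 60),
    Point("Gemini 3.1 Pro Preview", 57, 110),
    Point("Qwen3.7 Max", 57, 172),
    Point("Grok 4.3 (high)", 53, 144),
    Point("Nemotron 3 Ultra", 48, 145),
]

frontier = pareto_frontier(points)
print([p.model for p in frontier])
# ['Claude Fable 5 (with fallback)', 'Qwen3.7 Max']
```

On this subset, only the highest-scoring model and the fastest model survive; every other entry is matched or beaten on both axes by Qwen3.7 Max.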
Frontier Intelligence Over Time
Intelligence Index by release date. The dashed line tracks the running frontier.
Claude Fable 5 set the current frontier on June 9, 2026 — two days before this snapshot.
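The running frontier is a cumulative maximum over releases sorted by date. The sketch below uses the highest Intelligence Index per release month from the comparison table; month granularity is a simplification chosen here, since the table lists months rather than exact dates, and the function name is illustrative.

```python
from itertools import accumulate

# (release month as "YYYY-MM", highest Intelligence Index released that month),
# read off the model comparison table in this snapshot.
releases = [
    ("2026-04", 60), ("2025-08", 33), ("2026-06", 65), ("2025-10", 31),
    ("2025-11", 36), ("2026-02", 57), ("2025-12", 24), ("2026-03", 50),
    ("2026-05", 61),
]


def running_frontier(points: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Return (month, best score seen so far) in chronological order."""
    ordered = sorted(points)  # "YYYY-MM" strings sort chronologically
    months = [month for month, _ in ordered]
    best_so_far = accumulate((score for _, score in ordered), max)
    return list(zip(months, best_so_far))


for month, best in running_frontier(releases):
    print(month, best)
```

Months whose best release scores below the existing frontier (October and December 2025, March 2026) leave the line flat.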
Coding Index
Composite of coding evaluations (LiveCodeBench, SciCode, Terminal-Bench Hard). Higher is better.
Agentic Index
Tool calling and long-horizon agent tasks (τ²-Bench Telecom, Terminal-Bench Hard). Higher is better.
Intelligence Breakdown
Individual evaluation scores (0–100) behind the Intelligence Index. Darker is better, normalized per column.
| Model | GPQA Diamond | Humanity's Last Exam | SciCode | IFBench | Terminal-Bench Hard | τ²-Bench Telecom | AA-LCR (Long Context) | CritPt | MMMU-Pro |
|---|---|---|---|---|---|---|---|---|---|
| | 92.6 | 53.3 | 60.2 | 63.5 | 62.9 | 98.5 | 70.0 | 28.6 | — |
| | 92.0 | 45.7 | 53.5 | 62.2 | 58.3 | 94.4 | 67.7 | 20.9 | — |
| | 93.5 | 44.3 | 56.1 | 75.9 | 60.6 | 93.9 | 74.3 | 27.1 | 79.9 |
| | 91.4 | 39.6 | 54.5 | 58.6 | 51.5 | 88.6 | 70.3 | 12.0 | 78.8 |
| | 94.1 | 44.7 | 58.9 | 77.1 | 53.8 | 95.6 | 72.7 | 17.7 | 82.4 |
| | 92.3 | 38.1 | 48.8 | 80.5 | 50.8 | 94.7 | 69.0 | 13.4 | — |
| | 92.2 | 41.0 | 53.1 | 76.3 | 40.9 | 95.3 | 69.3 | 13.1 | 84.3 |
| | 92.9 | 37.1 | 45.4 | 82.9 | 42.4 | 88.9 | 74.0 | 3.7 | 79.9 |
| | 91.5 | 39.9 | 53.2 | 75.4 | 53.0 | 86.0 | 74.0 | 16.9 | 78.5 |
| | 90.0 | 33.4 | 45.5 | 78.0 | 47.0 | 93.0 | 65.0 | 9.1 | 44.8 |
| | 90.1 | 35.0 | 47.3 | 81.3 | 37.9 | 97.7 | 64.3 | 8.0 | 78.1 |
| | 88.4 | 39.9 | 51.5 | 75.9 | 45.5 | 91.5 | 69.7 | 11.3 | 80.5 |
| | 87.5 | 30.0 | 46.8 | 56.6 | 53.0 | 75.7 | 70.7 | 3.1 | 73.3 |
| | 88.8 | 35.9 | 50.0 | 76.5 | 46.2 | 96.2 | 66.3 | 12.9 | — |
| | 87.4 | 28.1 | 47.0 | 75.7 | 39.4 | 84.8 | 68.7 | 0.6 | — |
| | 87.5 | 26.6 | 49.9 | 73.3 | 52.3 | 83.3 | 69.3 | 10.0 | 73.3 |
| | 86.7 | 26.6 | 39.9 | 81.4 | 36.4 | 83.3 | 67.0 | 3.1 | — |
| | 89.4 | 32.1 | 44.9 | 79.2 | 35.6 | 95.0 | 63.0 | 7.1 | — |
| | 89.3 | 27.3 | 42.0 | 78.8 | 40.9 | 95.6 | 65.7 | 1.7 | 77.3 |
| | 83.9 | 25.6 | 36.1 | 52.0 | 35.6 | 97.1 | 44.3 | 0.0 | — |
| | 78.8 | 18.2 | 39.5 | 44.3 | 37.9 | 93.9 | 57.7 | 1.4 | — |
| | 85.7 | 22.7 | 43.4 | 75.6 | 36.4 | 59.9 | 62.0 | 1.4 | 73.4 |
| | 74.8 | 12.8 | 39.6 | 68.8 | 33.3 | 94.2 | 61.0 | 0.0 | 64.9 |
| | 78.5 | 8.9 | 42.7 | 79.0 | 24.2 | 92.7 | 54.3 | 0.0 | 64.5 |
| | 76.2 | 13.3 | 39.1 | 42.7 | 35.6 | 72.5 | 35.0 | 1.1 | — |
| | 78.2 | 18.5 | 38.9 | 69.0 | 23.5 | 65.8 | 50.7 | 1.1 | — |
| | 64.6 | 4.3 | 34.4 | 42.0 | 27.3 | 32.5 | 43.7 | 0.0 | 55.1 |
| | 72.4 | 10.1 | 24.7 | 71.2 | 7.6 | 86.3 | 27.0 | 0.0 | — |
| | 68.8 | 9.8 | 34.4 | 65.1 | 10.6 | 60.2 | 30.7 | 1.4 | — |
| | 71.3 | 9.5 | 33.0 | 62.8 | 6.8 | 25.4 | 52.7 | 0.0 | — |
| | — | — | — | — | — | — | — | 30.6 | — |
AIME 2025 and LiveCodeBench are retired for newer models and excluded here; MMMU-Pro applies to multimodal-evaluated models only.
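The "normalized per column" shading can be approximated with min-max scaling inside each evaluation column, so the best score in a column maps to 1.0 (darkest) and the worst to 0.0. Min-max scaling is an assumption made here; the page does not state which normalization it uses. Missing scores (shown as "—") are carried through as `None`.

```python
from typing import Optional


def normalize_column(values: list[Optional[float]]) -> list[Optional[float]]:
    """Min-max scale one column to [0, 1]; None marks a missing score."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    lo, hi = min(present), max(present)
    if hi == lo:
        # Every available score is identical: shade them all the same.
        return [None if v is None else 1.0 for v in values]
    return [None if v is None else (v - lo) / (hi - lo) for v in values]


# First five MMMU-Pro entries from the breakdown table; "—" becomes None.
mmmu_pro = [None, None, 79.9, 78.8, 82.4]
shades = normalize_column(mmmu_pro)
print([None if s is None else round(s, 2) for s in shades])
```

Scaling each column separately keeps a compressed benchmark such as GPQA Diamond from looking uniformly dark next to a wide-ranging one such as CritPt.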
Omniscience Index
Knowledge reliability from -100 to 100: correct answers score positive, hallucinated ones negative.
A negative score means the model hallucinates more than it knows. Declining to answer scores zero — most models would rather guess.
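A simple scoring rule consistent with that description is the share of correct answers minus the share of hallucinated ones, scaled to the range -100 to 100, with declined questions contributing nothing. The formula below is an assumption for illustration and may differ from the exact published definition; the function name and the counts are hypothetical.

```python
def omniscience_score(correct: int, incorrect: int, declined: int) -> float:
    """Score in [-100, 100]: +1 per correct answer, -1 per wrong one, 0 per abstention."""
    total = correct + incorrect + declined
    if total == 0:
        raise ValueError("at least one question is required")
    return 100.0 * (correct - incorrect) / total


# Hypothetical model A: guesses often, and is wrong more often than right.
print(omniscience_score(correct=30, incorrect=50, declined=20))  # -20.0
# Hypothetical model B: same knowledge, but declines instead of guessing.
print(omniscience_score(correct=30, incorrect=0, declined=70))   # 30.0
```

Under this rule, the two hypothetical models know the same 30 answers, yet the one that abstains when unsure scores 50 points higher.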
GDPval Elo — Real-World Work
Elo from blind pairwise comparisons on real economically valuable work tasks, with web and shell access.
Higher is better. Judged across occupations from software engineering to financial analysis.
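For readers unfamiliar with Elo, the sketch below shows the classic online update after one pairwise comparison. It illustrates the general technique only: the page does not say how its ratings are fitted, and leaderboards often fit all comparisons jointly instead. The `k` factor of 32 and the starting ratings are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings; score_a is 1.0 for an A win, 0.5 for a tie, 0.0 for a loss."""
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Two hypothetical models start level; a judge prefers A's output.
print(update_elo(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```

A win against a much higher-rated opponent moves the ratings more than a win against an equal, because the expected score is lower.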
ITBench — SRE Incident Analysis
Average precision at full recall diagnosing live Kubernetes incidents. Higher is better.
Models investigate real cluster telemetry to find root causes. Even the frontier tops out below 0.5 — ops work is far from solved.
Output Tokens Used to Run the Intelligence Suite
Total tokens generated answering the full evaluation suite, split into answer and reasoning tokens.
Reasoning-heavy models can burn 20–40× more tokens thinking than answering — which is exactly why blended price alone undersells true cost.
Cost to Run the Intelligence Suite
USD to complete every evaluation in the Intelligence Index, including reasoning tokens. Lower is better.
The spread is real: the same suite costs $19 on gpt-oss-20B and over $4,600 on the priciest frontier models.
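The cost of a run follows from token counts and list prices, with reasoning tokens billed at the output rate. A minimal sketch, in which every token count and price is hypothetical and the function name is illustrative:

```python
def suite_cost_usd(input_tokens: int, answer_tokens: int, reasoning_tokens: int,
                   input_usd_per_m: float, output_usd_per_m: float) -> float:
    """Total USD for a run; reasoning tokens are billed as output tokens."""
    output_tokens = answer_tokens + reasoning_tokens
    return (input_tokens * input_usd_per_m
            + output_tokens * output_usd_per_m) / 1_000_000


# Hypothetical run: 10M input tokens, 2M answer tokens, and 30x as many
# reasoning tokens as answer tokens, at $1 input / $4 output per 1M tokens.
with_reasoning = suite_cost_usd(10_000_000, 2_000_000, 60_000_000, 1.0, 4.0)
without_reasoning = suite_cost_usd(10_000_000, 2_000_000, 0, 1.0, 4.0)
print(with_reasoning, without_reasoning)  # 258.0 18.0
```

In this made-up case the reasoning tokens account for more than 90% of the bill, which is the gap that per-token prices alone do not show.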
Output Speed
Median output tokens per second across providers serving each model. Higher is better.
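Taking the median across providers keeps one unusually fast or slow host from skewing a model's figure. A small sketch with hypothetical provider names and measurements:

```python
from statistics import mean, median

# Hypothetical tokens-per-second measurements for one model on three providers.
provider_speeds = {"provider-a": 58.0, "provider-b": 61.5, "provider-c": 140.0}

model_speed = median(provider_speeds.values())
naive_average = mean(provider_speeds.values())
print(model_speed, naive_average)  # 61.5 86.5
```

The single fast provider pulls the mean up by 25 tokens per second, while the median still reflects what most providers deliver.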
Latency — Time to First Answer Token
Seconds from request to first answer token, including reasoning time. Lower is better.
Max-effort reasoning modes pay for their scores in wait time: the smartest configurations routinely think for one to two minutes.
Pricing — Input and Output
USD per 1M tokens by direction. Lower is better.
Output tokens typically cost 2–4× input. Reasoning tokens bill as output, so thinking models multiply effective price.
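To make the multiplication concrete: if a model emits a given number of reasoning tokens for every answer token, the price paid per visible answer token is the output price times one plus that ratio. The sketch below assumes reasoning and answer tokens are billed at the same output rate, as stated above; the function name, price, and ratio are hypothetical.

```python
def effective_output_price(output_usd_per_m: float,
                           reasoning_per_answer_token: float) -> float:
    """USD per 1M visible answer tokens once hidden reasoning tokens are billed too."""
    if reasoning_per_answer_token < 0:
        raise ValueError("ratio must be non-negative")
    return output_usd_per_m * (1.0 + reasoning_per_answer_token)


# Hypothetical: $4.00 per 1M output tokens, 20 reasoning tokens per answer token.
print(effective_output_price(4.00, 20.0))  # 84.0
```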
Context Window
Maximum input tokens per request.
Openness Index
Weights availability plus transparency of methodology and training data, 0–100.
Only models with published openness scores shown. K2 Think V2 and Nemotron 3 Ultra lead; most frontier labs publish nothing.
Methodology: indices are composites of public evaluations run independently with standardized prompts; speed and latency are medians measured across API providers over the trailing 72 hours. Benchmark data: Artificial Analysis (artificialanalysis.ai) public snapshot, June 11, 2026. Prices are list API prices and change frequently. Company logos identify the respective model creators.