Glsrm Benchmarks · June 11, 2026

The model race, measured.


Independent-style analysis of frontier AI models and the APIs that serve them, across the three numbers that decide every deployment: intelligence (higher is better), output speed in tokens per second (higher is better), and blended price per million tokens (lower is better).

31 frontier models · 16 labs tracked · 9 evals in the index · Snapshot: June 11, 2026

Intelligence

Higher is better

Intelligence Index · best model per lab

  • 1. Claude Fable 5 (with fallback): 65
  • 2. GPT-5.5 (xhigh): 60
  • 3. Gemini 3.1 Pro Preview: 57
  • 4. Qwen3.7 Max: 57
  • 5. MiniMax-M3: 55
  • 6. Grok 4.3 (high): 53
  • 7. Muse Spark: 52
  • 8. DeepSeek V4 Pro (Max): 52
  • 9. Nemotron 3 Ultra: 48

Speed

Higher is better

Median output tokens per second · each lab's top-ranked model

  • 1. Qwen3.7 Max: 172
  • 2. Nemotron 3 Ultra: 145
  • 3. Grok 4.3 (high): 144
  • 4. Nova 2.0 Pro Preview (medium): 124
  • 5. Gemini 3.1 Pro Preview: 110
  • 6. Mistral Medium 3.5: 66
  • 7. GLM-5.1: 63
  • 8. Claude Fable 5 (with fallback): 60
  • 9. Kimi K2.6: 60

Price

Lower is better

USD per 1M tokens, 3:1 blended · each lab's top-ranked model

  • 1. MiniMax-M3: $0.53
  • 2. DeepSeek V4 Pro (Max): $0.54
  • 3. Nemotron 3 Ultra: $1.10
  • 4. MiMo-V2.5-Pro: $1.35
  • 5. Grok 4.3 (high): $1.56
  • 6. Kimi K2.6: $1.71
  • 7. GLM-5.1: $2.15
  • 8. Mistral Medium 3.5: $3.00
  • 9. Nova 2.0 Pro Preview (medium): $3.44

Model comparison summary

Every model we track, ranked by Intelligence Index.

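# | Model | Creator | Released | Context | Intelligence | Coding | Agentic | Speed (t/s) | Blended $/1M | Latency
1 | Claude Fable 5 (with fallback) | Anthropic | Jun 2026 | 1M | 65 | 62 | 81 | 60 | $21.88 | 108s
2 | Claude Opus 4.8 (max) | Anthropic | May 2026 | 1M | 61 | 57 | 78 | 58 | $10.94 | 56s
3 | GPT-5.5 (xhigh) | OpenAI | Apr 2026 | 922k | 60 | 59 | 74 | 50 | $11.25 | 86s
4 | Claude Opus 4.7 (max) | Anthropic | Apr 2026 | 1M | 57 | 53 | 71 | 44 | $10.94 | 21s
5 | Gemini 3.1 Pro Preview | Google | Feb 2026 | 1M | 57 | 56 | 59 | 110 | $4.50 | 26s
6 | Qwen3.7 Max | Alibaba | May 2026 | 1M | 57 | 50 | 67 | 172 | $3.75 | 17s
7 | Gemini 3.5 Flash | Google | May 2026 | 1M | 55 | 45 | 70 | 145 | $3.38 | 21s
8 | MiniMax-M3 | MiniMax | Jun 2026 | 1M | 55 | 43 | 69 | 43 | $0.53 | 49s
9 | GPT-5.3 Codex (xhigh) | OpenAI | Feb 2026 | 400k | 54 | 53 | 61 | 67 | $4.81 | 86s
10 | Qwen3.7 Plus | Alibaba | Jun 2026 | 1M | 53 | 47 | 65 | 53 | $0.59 | 40s
11 | Grok 4.3 (high) | xAI | Apr 2026 | 1M | 53 | 41 | 66 | 144 | $1.56 | 14s
12 | Muse Spark | Meta | Apr 2026 | 262k | 52 | 48 | 62 | - | - | -
13 | Claude Sonnet 4.6 (max) | Anthropic | Feb 2026 | 1M | 52 | 51 | 63 | 48 | $6.56 | 139s
14 | DeepSeek V4 Pro (Max) [Open] | DeepSeek | Apr 2026 | 1M | 52 | 48 | 67 | 46 | $0.54 | 97s
15 | MiniMax-M2.7 [Open] | MiniMax | Mar 2026 | 205k | 50 | 42 | 62 | 42 | $0.53 | 61s
16 | GPT-5.4 mini (xhigh) | OpenAI | Mar 2026 | 400k | 49 | 52 | 59 | 149 | $1.69 | 7.1s
17 | Nemotron 3 Ultra [Open] | NVIDIA | Jun 2026 | 262k | 48 | 38 | 57 | 145 | $1.10 | 17s
18 | DeepSeek V4 Flash (Max) [Open] | DeepSeek | Apr 2026 | 1M | 47 | 39 | 61 | 89 | $0.18 | 64s
19 | Qwen3.5 397B A17B [Open] | Alibaba | Feb 2026 | 262k | 45 | 41 | 56 | 52 | $1.35 | 63s
20 | GLM-5.1 [Open] | Z AI | Apr 2026 | 200k | 44 | 36 | 66 | 63 | $2.15 | 1.8s
21 | Kimi K2.6 [Open] | Kimi | Apr 2026 | 256k | 43 | 38 | 59 | 60 | $1.71 | 2.4s
22 | Gemma 4 31B [Open] | Google | Apr 2026 | 256k | 39 | 39 | 41 | 36 | - | 50s
23 | Mistral Medium 3.5 [Open] | Mistral | Apr 2026 | 256k | 39 | 35 | 53 | 66 | $3.00 | 33s
24 | Nova 2.0 Pro Preview (medium) | Amazon | Nov 2025 | 256k | 36 | 30 | 47 | 124 | $3.44 | 30s
25 | MiMo-V2.5-Pro [Open] | Xiaomi | Apr 2026 | 1M | 36 | 37 | 51 | 45 | $1.35 | 3.6s
26 | gpt-oss-120b (high) [Open] | OpenAI | Aug 2025 | 131k | 33 | 29 | 38 | 336 | $0.26 | 6.9s
27 | Claude 4.5 Haiku | Anthropic | Oct 2025 | 200k | 31 | 30 | 33 | 88 | $2.19 | 0.9s
28 | Solar Pro 3 | Upstage | Apr 2026 | 128k | 26 | 13 | 35 | - | - | -
29 | gpt-oss-20B (high) [Open] | OpenAI | Aug 2025 | 131k | 25 | 19 | 28 | 222 | $0.09 | 9.8s
30 | K2 Think V2 [Open] | MBZUAI Institute of Foundation Models | Dec 2025 | 262k | 24 | 16 | 15 | - | - | -
- | GPT-5.5 Pro (xhigh) | OpenAI | Apr 2026 | 922k | - | - | - | - | - | -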

Intelligence Index

Composite of 10 evaluations spanning reasoning, knowledge, math, coding, and agentic tool use. Higher is better.

[Bar chart: Intelligence Index for the 30 ranked models, from Claude Fable 5 (with fallback) at 65 down to K2 Think V2 at 24. Values match the Intelligence column of the model comparison summary.]

Incorporates GPQA Diamond, Humanity's Last Exam, AIME 2025, LiveCodeBench, SciCode, IFBench, Terminal-Bench Hard, τ²-Bench, and more.

Intelligence vs. Price

Intelligence Index against blended USD per 1M tokens (3:1 input:output, log scale).

[Scatter plot: Intelligence Index (y-axis, 0 to 70) against price in USD per 1M tokens, blended 3:1 (x-axis, log scale, $0.1 to $20). The most attractive quadrant is the upper left. The 26 models with public API pricing are plotted.]

Up and to the left wins: more intelligence per dollar. Models without public API pricing are excluded.
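One way to make "up and to the left" concrete is a Pareto frontier: the models that no other model matches or beats on both intelligence and price. The sketch below is illustrative only, not the site's method. The `Model` type and `pareto_frontier` function are hypothetical names, and the sample rows are a subset copied from the model comparison summary.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    intelligence: int  # Intelligence Index, higher is better
    price: float       # blended USD per 1M tokens, lower is better


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Return the models that no other model dominates on both axes."""
    frontier = []
    for m in models:
        dominated = any(
            o.intelligence >= m.intelligence
            and o.price <= m.price
            and (o.intelligence > m.intelligence or o.price < m.price)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m.price)


# A subset of rows from the model comparison summary.
sample = [
    Model("Claude Fable 5 (with fallback)", 65, 21.88),
    Model("Claude Opus 4.8 (max)", 61, 10.94),
    Model("GPT-5.5 (xhigh)", 60, 11.25),
    Model("Qwen3.7 Max", 57, 3.75),
    Model("Gemini 3.1 Pro Preview", 57, 4.50),
    Model("MiniMax-M3", 55, 0.53),
    Model("DeepSeek V4 Flash (Max)", 47, 0.18),
    Model("gpt-oss-20B (high)", 25, 0.09),
]

for m in pareto_frontier(sample):
    print(f"{m.name}: {m.intelligence} at ${m.price:.2f}")
```

Within this subset, GPT-5.5 (xhigh) and Gemini 3.1 Pro Preview drop out because another model scores at least as high for less money. Running the same function over the full table may give a different frontier.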

Intelligence vs. Output Speed

Intelligence Index against median output tokens per second.

[Scatter plot: Intelligence Index (y-axis, 0 to 70) against output speed in tokens per second (x-axis, 0 to 300). The most attractive quadrant is the upper right. The 27 models with measured speed are plotted.]

Up and to the right wins: smart and fast. Speed is the median across providers serving each model.
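As a rough illustration of why a median is used here, the sketch below aggregates per-provider throughput for one model. The provider names and numbers are invented for the example, and `model_speed` is a hypothetical helper, not part of any published tooling.

```python
from statistics import mean, median

# Hypothetical throughput measurements (tokens per second) for one model,
# one entry per API provider. These numbers are made up for illustration.
provider_speeds = {
    "provider_a": 58.0,
    "provider_b": 61.5,
    "provider_c": 143.0,  # one unusually fast deployment
}


def model_speed(speeds_by_provider: dict[str, float]) -> float:
    """Median across providers, so a single outlier does not set the headline number."""
    return median(speeds_by_provider.values())


print(model_speed(provider_speeds))    # median of the three providers
print(mean(provider_speeds.values()))  # the mean is pulled up by the outlier
```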

Frontier Intelligence Over Time

Intelligence Index by release date. The dashed line tracks the running frontier.

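[Line chart: Intelligence Index (y-axis, 20 to 70) by release date (x-axis, Aug 25 to Jun 26), with a dashed "Frontier" line. Labeled models: gpt-oss-120b (high), Nova 2.0 Pro Preview (medium), GPT-5.3 Codex (xhigh), Gemini 3.1 Pro Preview, Claude Opus 4.7 (max), GPT-5.5 (xhigh), Claude Opus 4.8 (max), Claude Fable 5 (with fallback).]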

Claude Fable 5 set the current frontier on June 9, 2026 — two days before this snapshot.
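A running frontier of this kind can be sketched as a cumulative maximum over releases in chronological order. The example below assumes month-level release dates (the only granularity the comparison table gives) and uses a subset of models; `running_frontier` is a hypothetical name.

```python
from itertools import accumulate

# (release month, Intelligence Index) for a few tracked models,
# taken from the model comparison summary.
releases = [
    ("2025-08", 33),  # gpt-oss-120b (high)
    ("2025-10", 31),  # Claude 4.5 Haiku
    ("2025-11", 36),  # Nova 2.0 Pro Preview (medium)
    ("2026-02", 54),  # GPT-5.3 Codex (xhigh)
    ("2026-02", 57),  # Gemini 3.1 Pro Preview
    ("2026-04", 60),  # GPT-5.5 (xhigh)
    ("2026-05", 61),  # Claude Opus 4.8 (max)
    ("2026-06", 65),  # Claude Fable 5 (with fallback)
]


def running_frontier(points: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Pair each release with the best score seen up to and including it."""
    ordered = sorted(points)  # by month, then by score within a month
    best = accumulate((score for _, score in ordered), max)
    return [(month, b) for (month, _), b in zip(ordered, best)]


for month, frontier in running_frontier(releases):
    print(month, frontier)
```

Claude 4.5 Haiku (31) does not move the line in this subset: the frontier stays at 33 until Nova 2.0 Pro Preview arrives.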

Coding Index

Composite of coding evaluations (LiveCodeBench, SciCode, Terminal-Bench Hard). Higher is better.

[Bar chart: Coding Index for the 30 ranked models, from Claude Fable 5 (with fallback) at 62 down to Solar Pro 3 at 13. Values match the Coding column of the model comparison summary.]

Agentic Index

Tool calling and long-horizon agent tasks (τ²-Bench, Terminal-Bench). Higher is better.

[Bar chart: Agentic Index for the 30 ranked models, from Claude Fable 5 (with fallback) at 81 down to K2 Think V2 at 15. Values match the Agentic column of the model comparison summary.]

Intelligence Breakdown

Individual evaluation scores (0–100) behind the Intelligence Index. Darker is better, normalized per column.

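Model | GPQA Diamond | Humanity's Last Exam | SciCode | IFBench | Terminal-Bench Hard | τ²-Bench Telecom | AA-LCR (Long Context) | CritPt | MMMU-Pro
Claude Fable 5 (with fallback) | 92.6 | 53.3 | 60.2 | 63.5 | 62.9 | 98.5 | 70.0 | 28.6 | -
Claude Opus 4.8 (max) | 92.0 | 45.7 | 53.5 | 62.2 | 58.3 | 94.4 | 67.7 | 20.9 | -
GPT-5.5 (xhigh) | 93.5 | 44.3 | 56.1 | 75.9 | 60.6 | 93.9 | 74.3 | 27.1 | 79.9
Claude Opus 4.7 (max) | 91.4 | 39.6 | 54.5 | 58.6 | 51.5 | 88.6 | 70.3 | 12.0 | 78.8
Gemini 3.1 Pro Preview | 94.1 | 44.7 | 58.9 | 77.1 | 53.8 | 95.6 | 72.7 | 17.7 | 82.4
Qwen3.7 Max | 92.3 | 38.1 | 48.8 | 80.5 | 50.8 | 94.7 | 69.0 | 13.4 | -
Gemini 3.5 Flash | 92.2 | 41.0 | 53.1 | 76.3 | 40.9 | 95.3 | 69.3 | 13.1 | 84.3
MiniMax-M3 | 92.9 | 37.1 | 45.4 | 82.9 | 42.4 | 88.9 | 74.0 | 3.7 | 79.9
GPT-5.3 Codex (xhigh) | 91.5 | 39.9 | 53.2 | 75.4 | 53.0 | 86.0 | 74.0 | 16.9 | 78.5
Qwen3.7 Plus | 90.0 | 33.4 | 45.5 | 78.0 | 47.0 | 93.0 | 65.0 | 9.1 | 44.8
Grok 4.3 (high) | 90.1 | 35.0 | 47.3 | 81.3 | 37.9 | 97.7 | 64.3 | 8.0 | 78.1
Muse Spark | 88.4 | 39.9 | 51.5 | 75.9 | 45.5 | 91.5 | 69.7 | 11.3 | 80.5
Claude Sonnet 4.6 (max) | 87.5 | 30.0 | 46.8 | 56.6 | 53.0 | 75.7 | 70.7 | 3.1 | 73.3
DeepSeek V4 Pro (Max) | 88.8 | 35.9 | 50.0 | 76.5 | 46.2 | 96.2 | 66.3 | 12.9 | -
MiniMax-M2.7 | 87.4 | 28.1 | 47.0 | 75.7 | 39.4 | 84.8 | 68.7 | 0.6 | -
GPT-5.4 mini (xhigh) | 87.5 | 26.6 | 49.9 | 73.3 | 52.3 | 83.3 | 69.3 | 10.0 | 73.3
Nemotron 3 Ultra | 86.7 | 26.6 | 39.9 | 81.4 | 36.4 | 83.3 | 67.0 | 3.1 | -
DeepSeek V4 Flash (Max) | 89.4 | 32.1 | 44.9 | 79.2 | 35.6 | 95.0 | 63.0 | 7.1 | -
Qwen3.5 397B A17B | 89.3 | 27.3 | 42.0 | 78.8 | 40.9 | 95.6 | 65.7 | 1.7 | 77.3
GLM-5.1 | 83.9 | 25.6 | 36.1 | 52.0 | 35.6 | 97.1 | 44.3 | 0.0 | -
Kimi K2.6 | 78.8 | 18.2 | 39.5 | 44.3 | 37.9 | 93.9 | 57.7 | 1.4 | -
Gemma 4 31B | 85.7 | 22.7 | 43.4 | 75.6 | 36.4 | 59.9 | 62.0 | 1.4 | 73.4
Mistral Medium 3.5 | 74.8 | 12.8 | 39.6 | 68.8 | 33.3 | 94.2 | 61.0 | 0.0 | 64.9
Nova 2.0 Pro Preview (medium) | 78.5 | 8.9 | 42.7 | 79.0 | 24.2 | 92.7 | 54.3 | 0.0 | 64.5
MiMo-V2.5-Pro | 76.2 | 13.3 | 39.1 | 42.7 | 35.6 | 72.5 | 35.0 | 1.1 | -
gpt-oss-120b (high) | 78.2 | 18.5 | 38.9 | 69.0 | 23.5 | 65.8 | 50.7 | 1.1 | -
Claude 4.5 Haiku | 64.6 | 4.3 | 34.4 | 42.0 | 27.3 | 32.5 | 43.7 | 0.0 | 55.1
Solar Pro 3 | 72.4 | 10.1 | 24.7 | 71.2 | 7.6 | 86.3 | 27.0 | 0.0 | -
gpt-oss-20B (high) | 68.8 | 9.8 | 34.4 | 65.1 | 10.6 | 60.2 | 30.7 | 1.4 | -
K2 Think V2 | 71.3 | 9.5 | 33.0 | 62.8 | 6.8 | 25.4 | 52.7 | 0.0 | -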
GPT-5.5 Pro (xhigh) | 30.6 (single score listed; its benchmark column is not indicated)

AIME 2025 and LiveCodeBench are retired for newer models and excluded here; MMMU-Pro applies to multimodal-evaluated models only.
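The "normalized per column" shading could be produced by scaling each benchmark column independently. The sketch below assumes min-max scaling, which the page does not specify; `normalize_columns` is a hypothetical name, and the sample values are GPQA Diamond and CritPt scores for three models from the table above.

```python
def normalize_columns(rows: list[list[float]]) -> list[list[float]]:
    """Min-max scale each column to [0, 1]; 1.0 would be the darkest cell."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [
            (v - lo) / (hi - lo) if hi > lo else 0.0  # guard constant columns
            for v, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


# Columns: GPQA Diamond, CritPt.
scores = [
    [92.6, 28.6],  # Claude Fable 5 (with fallback)
    [93.5, 27.1],  # GPT-5.5 (xhigh)
    [64.6, 0.0],   # Claude 4.5 Haiku
]
shades = normalize_columns(scores)
for row in shades:
    print([round(v, 2) for v in row])
```

Scaling per column is what lets a 28.6 on CritPt render as dark as a 93.5 on GPQA Diamond: each cell is compared only against other models on the same benchmark.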

Omniscience Index

Knowledge reliability from -100 to 100: correct answers score positive, hallucinated ones negative.

[Bar chart: Omniscience Index per model, axis from -60 to 40.]

  • Claude Fable 5 (with fallback): 40
  • Gemini 3.1 Pro Preview: 33
  • Claude Opus 4.8 (max): 27
  • Claude Opus 4.7 (max): 26
  • Gemini 3.5 Flash: 23
  • GPT-5.5 (xhigh): 20
  • Grok 4.3 (high): 18
  • Qwen3.7 Max: 14
  • Claude Sonnet 4.6 (max): 12
  • GPT-5.3 Codex (xhigh): 10
  • Muse Spark: 4
  • Qwen3.7 Plus: 2
  • MiniMax-M3: 1
  • MiniMax-M2.7: 1
  • Nemotron 3 Ultra: -1
  • Claude 4.5 Haiku: -8
  • DeepSeek V4 Pro (Max): -10
  • Kimi K2.6: -10
  • GPT-5.4 mini (xhigh): -19
  • GLM-5.1: -21
  • DeepSeek V4 Flash (Max): -23
  • Qwen3.5 397B A17B: -30
  • K2 Think V2: -34
  • Mistral Medium 3.5: -36
  • MiMo-V2.5-Pro: -38
  • Gemma 4 31B: -45
  • Nova 2.0 Pro Preview (medium): -48
  • gpt-oss-120b (high): -50
  • Solar Pro 3: -54
  • gpt-oss-20B (high): -64

A negative score means the model hallucinates more than it knows. Declining to answer scores zero — most models would rather guess.
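The scoring rule described here can be sketched as follows. The exact formula is not given on this page, so the version below is an assumption: each correct answer counts +1, each hallucinated answer -1, each abstention 0, and the total is scaled to the -100 to 100 range. `omniscience_score` is a hypothetical name.

```python
def omniscience_score(correct: int, incorrect: int, abstained: int) -> float:
    """Assumed rule: +1 per correct, -1 per incorrect, 0 per abstention, scaled to [-100, 100]."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no questions scored")
    return 100.0 * (correct - incorrect) / total


# Same number of correct answers, different behavior on the unknowns.
print(omniscience_score(correct=30, incorrect=50, abstained=20))  # guesses often
print(omniscience_score(correct=30, incorrect=0, abstained=70))   # declines instead
```

Under this rule, a model that declines when unsure outscores one with the same knowledge that guesses and misses.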

GDPval Elo — Real-World Work

Elo from blind pairwise comparisons on real economically valuable work tasks, with web and shell access.

[Bar chart: GDPval Elo per model, axis from 0 to 2000.]

  • Claude Fable 5 (with fallback): 1932
  • Claude Opus 4.8 (max): 1890
  • GPT-5.5 (xhigh): 1769
  • Claude Opus 4.7 (max): 1753
  • Claude Sonnet 4.6 (max): 1676
  • MiniMax-M3: 1670
  • Gemini 3.5 Flash: 1656
  • DeepSeek V4 Pro (Max): 1554
  • Qwen3.7 Max: 1541
  • Qwen3.7 Plus: 1518
  • MiniMax-M2.7: 1505
  • Grok 4.3 (high): 1495
  • GLM-5.1: 1492
  • GPT-5.3 Codex (xhigh): 1481
  • GPT-5.4 mini (xhigh): 1438
  • Muse Spark: 1417
  • DeepSeek V4 Flash (Max): 1388
  • Nemotron 3 Ultra: 1379
  • Kimi K2.6: 1327
  • Gemini 3.1 Pro Preview: 1314
  • MiMo-V2.5-Pro: 1295
  • Qwen3.5 397B A17B: 1190
  • Mistral Medium 3.5: 1168
  • Claude 4.5 Haiku: 1135
  • Gemma 4 31B: 1113
  • Nova 2.0 Pro Preview (medium): 973
  • gpt-oss-120b (high): 947
  • Solar Pro 3: 675
  • gpt-oss-20B (high): 647
  • K2 Think V2: 607

Higher is better. Judged across occupations from software engineering to financial analysis.
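To read an Elo gap as a head-to-head preference rate, the conventional logistic Elo formula can be applied. Whether GDPval uses the standard 400-point scale is an assumption here, so treat the output as indicative only; `expected_win_rate` is a hypothetical helper.

```python
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


# Ratings from the GDPval Elo chart.
fable, opus_48, k2_think = 1932, 1890, 607

print(f"{expected_win_rate(fable, opus_48):.3f}")   # a 42-point gap is a modest edge
print(f"{expected_win_rate(fable, k2_think):.4f}")  # a 1325-point gap is near-certain
```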

ITBench — SRE Incident Analysis

Average precision at full recall diagnosing live Kubernetes incidents. Higher is better.

[Bar chart: ITBench average precision per model, axis from 0.00 to 0.40.]

  • Claude Opus 4.7 (max): 0.47
  • GPT-5.5 (xhigh): 0.46
  • Qwen3.7 Max: 0.42
  • Gemini 3.5 Flash: 0.40
  • Claude Sonnet 4.6 (max): 0.40
  • DeepSeek V4 Pro (Max): 0.38
  • Gemma 4 31B: 0.37
  • GPT-5.4 mini (xhigh): 0.35
  • Qwen3.5 397B A17B: 0.34
  • Grok 4.3 (high): 0.33
  • DeepSeek V4 Flash (Max): 0.32
  • Gemini 3.1 Pro Preview: 0.30
  • MiniMax-M2.7: 0.27

Models investigate real cluster telemetry to find root causes. Even the frontier tops out below 0.5 — ops work is far from solved.

Output Tokens Used to Run the Intelligence Suite

Total tokens generated answering the full evaluation suite, split into answer and reasoning tokens.

[Stacked bar chart: answer tokens plus reasoning tokens per model, axis from 0M to 250M. Totals:]

  • DeepSeek V4 Flash (Max): 241M
  • GPT-5.4 mini (xhigh): 235M
  • Claude Sonnet 4.6 (max): 198M
  • DeepSeek V4 Pro (Max): 187M
  • Claude Opus 4.8 (max): 112M
  • Claude Opus 4.7 (max): 112M
  • Qwen3.7 Plus: 111M
  • Nemotron 3 Ultra: 103M
  • Qwen3.7 Max: 97M
  • MiniMax-M3: 91M
  • Mistral Medium 3.5: 90M
  • Grok 4.3 (high): 88M
  • MiniMax-M2.7: 87M
  • Qwen3.5 397B A17B: 86M
  • gpt-oss-120b (high): 78M
  • GPT-5.3 Codex (xhigh): 77M
  • GPT-5.5 (xhigh): 75M
  • Gemini 3.5 Flash: 73M
  • gpt-oss-20B (high): 61M
  • Gemini 3.1 Pro Preview: 57M
  • Gemma 4 31B: 39M
  • Nova 2.0 Pro Preview (medium): 36M

Reasoning-heavy models can burn 20–40× more tokens thinking than answering — which is exactly why blended price alone undersells true cost.
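The effect of that multiplier on price can be sketched directly. The function below assumes reasoning tokens are billed at the output rate and that the ratio of reasoning to answer tokens is known; `price_per_answer_token` is a hypothetical name, and the $2.00 list price is an invented example, not a figure from this page.

```python
def price_per_answer_token(output_price: float, reasoning_per_answer: float) -> float:
    """USD per 1M answer tokens once billed reasoning tokens are counted.

    output_price: list price in USD per 1M output tokens.
    reasoning_per_answer: reasoning tokens generated per answer token.
    """
    return output_price * (1.0 + reasoning_per_answer)


list_price = 2.00  # hypothetical USD per 1M output tokens
for ratio in (0, 20, 40):
    print(ratio, price_per_answer_token(list_price, ratio))
```

At a 20× ratio the price per useful answer token is 21 times the list price, and at 40× it is 41 times.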

Cost to Run the Intelligence Suite

USD to complete every evaluation in the Intelligence Index, including reasoning tokens. Lower is better.

[Bar chart: cost in USD to run the Intelligence Index suite per model, axis from $0 to $5,000.]

  • Claude Opus 4.7 (max): $5,117
  • Claude Opus 4.8 (max): $4,686
  • Claude Sonnet 4.6 (max): $4,206
  • GPT-5.5 (xhigh): $3,357
  • GPT-5.3 Codex (xhigh): $1,572
  • Gemini 3.5 Flash: $1,552
  • GPT-5.4 mini (xhigh): $1,354
  • Qwen3.7 Max: $1,203
  • Mistral Medium 3.5: $1,001
  • Gemini 3.1 Pro Preview: $892
  • MiMo-V2.5-Pro: $633
  • GLM-5.1: $618
  • Kimi K2.6: $505
  • Nova 2.0 Pro Preview (medium): $467
  • Qwen3.5 397B A17B: $418
  • Grok 4.3 (high): $395
  • Nemotron 3 Ultra: $387
  • MiniMax-M3: $308
  • DeepSeek V4 Pro (Max): $268
  • Claude 4.5 Haiku: $246
  • Qwen3.7 Plus: $209
  • MiniMax-M2.7: $176
  • DeepSeek V4 Flash (Max): $113
  • gpt-oss-120b (high): $67
  • gpt-oss-20B (high): $19
  • Muse Spark: $0
  • Gemma 4 31B: $0
  • Solar Pro 3: $0
  • K2 Think V2: $0

The spread is real: the same suite costs $19 on gpt-oss-20B and over $4,600 on the priciest frontier models.
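A quick sanity check on these totals is to multiply the output tokens a model generated by its output price. This gives only a floor, since input tokens are ignored. The sketch assumes the labels in the pricing chart are output prices; `output_cost_floor` is a hypothetical helper.

```python
def output_cost_floor(output_tokens_millions: float, output_price_per_million: float) -> float:
    """Lower bound on suite cost in USD: output tokens only, input tokens ignored."""
    return output_tokens_millions * output_price_per_million


# Token totals and prices as listed on this page.
# Claude Opus 4.7 (max): 112M output tokens at an assumed $25 per 1M output tokens.
print(output_cost_floor(112, 25))
# DeepSeek V4 Flash (Max): 241M output tokens at an assumed $0.28 per 1M output tokens.
print(round(output_cost_floor(241, 0.28), 2))
```

Both floors come in below the reported suite costs ($5,117 and $113), as expected if input tokens account for the remainder.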

Output Speed

Median output tokens per second across providers serving each model. Higher is better.

[Bar chart: median output speed in tokens per second for the 27 models with measured speed, from gpt-oss-120b (high) at 336 down to Gemma 4 31B at 36. Values match the Speed (t/s) column of the model comparison summary.]

Latency — Time to First Answer Token

Seconds from request to first answer token, including reasoning time. Lower is better.

[Bar chart: time to first answer token for the 27 models with measured latency, from Claude 4.5 Haiku at 0.9s up to Claude Sonnet 4.6 (max) at 139s. Values match the Latency column of the model comparison summary.]

Max-effort reasoning modes pay for their scores in wait time: the smartest configurations routinely think for one to two minutes.
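Latency and output speed combine into the wait a user actually experiences. A minimal sketch, assuming the answer streams at the median output speed once the first answer token arrives; the 600-token answer length is an arbitrary example, and `seconds_to_full_answer` is a hypothetical helper.

```python
def seconds_to_full_answer(latency_s: float, speed_tps: float, answer_tokens: int) -> float:
    """Time to first answer token plus streaming time for the rest of the answer."""
    return latency_s + answer_tokens / speed_tps


# Latency and speed figures from the model comparison summary.
print(seconds_to_full_answer(108, 60, 600))            # Claude Fable 5 (with fallback)
print(round(seconds_to_full_answer(0.9, 88, 600), 1))  # Claude 4.5 Haiku
```

For short answers the wait is dominated by latency rather than streaming speed, which is why the max-effort configurations feel slow even when their tokens-per-second figure is respectable.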

Pricing — Input and Output

USD per 1M tokens by direction. Lower is better.

[Bar chart: input price and output price per model, axis from $0 to $50. Labeled values in USD per 1M tokens, in chart order:]

  • Claude Fable 5 (with fallback): $50
  • GPT-5.5 (xhigh): $30
  • Claude Opus 4.8 (max): $25
  • Claude Opus 4.7 (max): $25
  • Claude Sonnet 4.6 (max): $15
  • GPT-5.3 Codex (xhigh): $14
  • Gemini 3.1 Pro Preview: $12
  • Qwen3.7 Max: $7.5
  • Nova 2.0 Pro Preview (medium): $10
  • Gemini 3.5 Flash: $9
  • Mistral Medium 3.5: $7.5
  • Claude 4.5 Haiku: $5
  • GLM-5.1: $4.4
  • Kimi K2.6: $4
  • GPT-5.4 mini (xhigh): $4.5
  • Grok 4.3 (high): $2.5
  • Qwen3.5 397B A17B: $3.6
  • MiMo-V2.5-Pro: $2.7
  • Nemotron 3 Ultra: $2.6
  • Qwen3.7 Plus: $1.16
  • DeepSeek V4 Pro (Max): $0.87
  • MiniMax-M3: $1.2
  • MiniMax-M2.7: $1.2
  • gpt-oss-120b (high): $0.6
  • DeepSeek V4 Flash (Max): $0.28
  • gpt-oss-20B (high): $0.2

Output tokens typically cost 2–4× input. Reasoning tokens bill as output, so thinking models multiply effective price.
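The 3:1 blended price used throughout this page can be sketched as a weighted average of input and output prices, assuming the ratio weights tokens three parts input to one part output. Both function names are hypothetical. The Qwen3.7 Max example treats the $7.5 chart label as its output price, which is an assumption.

```python
def blended_price(input_price: float, output_price: float,
                  input_parts: int = 3, output_parts: int = 1) -> float:
    """Weighted average price per 1M tokens at the given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


def implied_input_price(blended: float, output_price: float,
                        input_parts: int = 3, output_parts: int = 1) -> float:
    """Invert blended_price to recover the input price."""
    total_parts = input_parts + output_parts
    return (blended * total_parts - output_parts * output_price) / input_parts


# Qwen3.7 Max: blended $3.75, assumed output price $7.5 per 1M tokens.
qwen_input = implied_input_price(3.75, 7.5)
print(qwen_input)                      # implied input price
print(blended_price(qwen_input, 7.5))  # round-trips to the blended figure
```

Under these assumptions the implied input price is $2.50, so output costs 3× input for this model, inside the 2–4× range stated above.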

Context Window

Maximum input tokens per request.

[Bar chart: context window for all 31 models, from 1M (13 models) down to Solar Pro 3 at 128k. Values match the Context column of the model comparison summary.]

Openness Index

Weights availability plus transparency of methodology and training data, 0–100.

[Bar chart: Openness Index per model, axis from 0 to 80.]

  • K2 Think V2: 89
  • Nemotron 3 Ultra: 83
  • DeepSeek V4 Pro (Max): 50
  • DeepSeek V4 Flash (Max): 50
  • Qwen3.5 397B A17B: 39
  • Gemma 4 31B: 39
  • gpt-oss-120b (high): 39
  • gpt-oss-20B (high): 39
  • Mistral Medium 3.5: 33
  • MiniMax-M2.7: 22
  • Claude 4.5 Haiku: 11

Only models with published openness scores shown. K2 Think V2 and Nemotron 3 Ultra lead; most frontier labs publish nothing.


Methodology: indices are composites of public evaluations run independently with standardized prompts; speed and latency are medians measured across API providers over the trailing 72 hours. Benchmark data: Artificial Analysis (artificialanalysis.ai) public snapshot, June 11, 2026. Prices are list API prices and change frequently. Company logos identify the respective model creators.