Glsrm Benchmarks · June 11, 2026

The model race, measured.

Independent-style analysis of frontier AI models and the APIs that serve them, across the three numbers that decide every deployment: intelligence (higher is better), output speed in tokens per second (higher is better), and blended price per million tokens (lower is better).

Intelligence

Higher is better

Intelligence Index · best model per lab

1	Claude Fable 5 (with fallback)	65
2	GPT-5.5 (xhigh)	60
3	Gemini 3.1 Pro Preview	57
4	Qwen3.7 Max	57
5	MiniMax-M3	55
6	Grok 4.3 (high)	53
7	Muse Spark	52
8	DeepSeek V4 Pro (Max)	52
9	Nemotron 3 Ultra	48

Speed

Higher is better

Median output tokens per second · each lab's highest-Intelligence model

1	Qwen3.7 Max	172
2	Nemotron 3 Ultra	145
3	Grok 4.3 (high)	144
4	Nova 2.0 Pro Preview (medium)	124
5	Gemini 3.1 Pro Preview	110
6	Mistral Medium 3.5	66
7	GLM-5.1	63
8	Claude Fable 5 (with fallback)	60
9	Kimi K2.6	60

Price

Lower is better

USD per 1M tokens, 3:1 input:output blend · each lab's highest-Intelligence model

1	MiniMax-M3	$0.53
2	DeepSeek V4 Pro (Max)	$0.54
3	Nemotron 3 Ultra	$1.10
4	MiMo-V2.5-Pro	$1.35
5	Grok 4.3 (high)	$1.56
6	Kimi K2.6	$1.71
7	GLM-5.1	$2.15
8	Mistral Medium 3.5	$3.00
9	Nova 2.0 Pro Preview (medium)	$3.44

Model comparison summary

Every model we track, ranked by Intelligence Index.

#	Model	Creator	Released	Context	Intelligence	Coding	Agentic	Speed (t/s)	Blended $/1M	Latency
1	Claude Fable 5 (with fallback)	Anthropic	Jun 2026	1M	65	62	81	60	$21.88	108s
2	Claude Opus 4.8 (max)	Anthropic	May 2026	1M	61	57	78	58	$10.94	56s
3	GPT-5.5 (xhigh)	OpenAI	Apr 2026	922k	60	59	74	50	$11.25	86s
4	Claude Opus 4.7 (max)	Anthropic	Apr 2026	1M	57	53	71	44	$10.94	21s
5	Gemini 3.1 Pro Preview	Google	Feb 2026	1M	57	56	59	110	$4.50	26s
6	Qwen3.7 Max	Alibaba	May 2026	1M	57	50	67	172	$3.75	17s
7	Gemini 3.5 Flash	Google	May 2026	1M	55	45	70	145	$3.38	21s
8	MiniMax-M3	MiniMax	Jun 2026	1M	55	43	69	43	$0.53	49s
9	GPT-5.3 Codex (xhigh)	OpenAI	Feb 2026	400k	54	53	61	67	$4.81	86s
10	Qwen3.7 Plus	Alibaba	Jun 2026	1M	53	47	65	53	$0.59	40s
11	Grok 4.3 (high)	xAI	Apr 2026	1M	53	41	66	144	$1.56	14s
12	Muse Spark	Meta	Apr 2026	262k	52	48	62	—	—	—
13	Claude Sonnet 4.6 (max)	Anthropic	Feb 2026	1M	52	51	63	48	$6.56	139s
14	DeepSeek V4 Pro (Max) [Open]	DeepSeek	Apr 2026	1M	52	48	67	46	$0.54	97s
15	MiniMax-M2.7 [Open]	MiniMax	Mar 2026	205k	50	42	62	42	$0.53	61s
16	GPT-5.4 mini (xhigh)	OpenAI	Mar 2026	400k	49	52	59	149	$1.69	7.1s
17	Nemotron 3 Ultra [Open]	NVIDIA	Jun 2026	262k	48	38	57	145	$1.10	17s
18	DeepSeek V4 Flash (Max) [Open]	DeepSeek	Apr 2026	1M	47	39	61	89	$0.18	64s
19	Qwen3.5 397B A17B [Open]	Alibaba	Feb 2026	262k	45	41	56	52	$1.35	63s
20	GLM-5.1 [Open]	Z AI	Apr 2026	200k	44	36	66	63	$2.15	1.8s
21	Kimi K2.6 [Open]	Kimi	Apr 2026	256k	43	38	59	60	$1.71	2.4s
22	Gemma 4 31B [Open]	Google	Apr 2026	256k	39	39	41	36	—	50s
23	Mistral Medium 3.5 [Open]	Mistral	Apr 2026	256k	39	35	53	66	$3.00	33s
24	Nova 2.0 Pro Preview (medium)	Amazon	Nov 2025	256k	36	30	47	124	$3.44	30s
25	MiMo-V2.5-Pro [Open]	Xiaomi	Apr 2026	1M	36	37	51	45	$1.35	3.6s
26	gpt-oss-120b (high) [Open]	OpenAI	Aug 2025	131k	33	29	38	336	$0.26	6.9s
27	Claude 4.5 Haiku	Anthropic	Oct 2025	200k	31	30	33	88	$2.19	0.9s
28	Solar Pro 3	Upstage	Apr 2026	128k	26	13	35	—	—	—
29	gpt-oss-20B (high) [Open]	OpenAI	Aug 2025	131k	25	19	28	222	$0.09	9.8s
30	K2 Think V2 [Open]	MBZUAI Institute of Foundation Models	Dec 2025	262k	24	16	15	—	—	—
—	GPT-5.5 Pro (xhigh)	OpenAI	Apr 2026	922k	—	—	—	—	—	—

Intelligence Index

Composite of 10 evaluations spanning reasoning, knowledge, math, coding, and agentic tool use. Higher is better.

Incorporates GPQA Diamond, Humanity's Last Exam, AIME 2025, LiveCodeBench, SciCode, IFBench, Terminal-Bench Hard, τ²-Bench, and more.

Intelligence vs. Price

Intelligence Index against blended USD per 1M tokens (3:1 input:output, log scale).

Up and to the left wins: more intelligence per dollar. Models without public API pricing are excluded.
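
The 3:1 input:output blend on the price axis can be read as a weighted average in which the input price counts three times and the output price once. A minimal sketch under that reading; the function name and the example prices are illustrative and not taken from any model's price list:

```python
def blended_price(input_usd_per_m: float, output_usd_per_m: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average of input and output list prices (USD per 1M tokens)."""
    total_weight = input_weight + output_weight
    return (input_weight * input_usd_per_m
            + output_weight * output_usd_per_m) / total_weight


# Hypothetical prices: $2.00 per 1M input tokens, $12.00 per 1M output tokens.
print(f"${blended_price(2.00, 12.00):.2f}")  # (3 * 2 + 12) / 4 -> $4.50
```

Because input is weighted three times as heavily, a cheap input price pulls the blended figure well below the output price.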

Intelligence vs. Output Speed

Intelligence Index against median output tokens per second.

Up and to the right wins: smart and fast. Speed is the median across providers serving each model.
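
One way to make "up and to the right wins" concrete is a Pareto frontier: a model is on it when no other model is at least as smart and at least as fast, and strictly better on one of the two. A rough sketch using a handful of rows from the comparison table; the helper names are made up for illustration, and this is not necessarily how the chart itself is drawn:

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    intelligence: int  # Intelligence Index
    speed: int         # median output tokens per second


# A subset of rows from the comparison table.
MODELS = [
    Model("Claude Fable 5 (with fallback)", 65, 60),
    Model("Claude Opus 4.8 (max)", 61, 58),
    Model("GPT-5.5 (xhigh)", 60, 50),
    Model("Gemini 3.1 Pro Preview", 57, 110),
    Model("Qwen3.7 Max", 57, 172),
    Model("Gemini 3.5 Flash", 55, 145),
    Model("Grok 4.3 (high)", 53, 144),
    Model("GPT-5.4 mini (xhigh)", 49, 149),
    Model("gpt-oss-120b (high)", 33, 336),
    Model("gpt-oss-20B (high)", 25, 222),
]


def dominates(a: Model, b: Model) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (a.intelligence >= b.intelligence and a.speed >= b.speed
            and (a.intelligence > b.intelligence or a.speed > b.speed))


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep only the models that no other model dominates."""
    return [m for m in models if not any(dominates(other, m) for other in models)]


for m in pareto_frontier(MODELS):
    print(f"{m.name}: intelligence {m.intelligence}, {m.speed} t/s")
```

On these rows the frontier is Claude Fable 5, Qwen3.7 Max, and gpt-oss-120b; every other model listed is matched or beaten on both axes by one of those three.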

Frontier Intelligence Over Time

Intelligence Index by release date. The dashed line tracks the running frontier.

Claude Fable 5 set the current frontier on June 9, 2026 — two days before this snapshot.
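
The running frontier is a cumulative maximum over releases in date order. A small sketch, restricted to the models in the comparison table and to month granularity (the table gives release months only), so it will not reproduce the chart exactly:

```python
from itertools import accumulate

# Highest Intelligence Index among tracked models released in each month,
# read off the comparison table. Months with no tracked release are omitted.
BEST_BY_MONTH = [
    ("Aug 2025", 33),
    ("Oct 2025", 31),
    ("Nov 2025", 36),
    ("Dec 2025", 24),
    ("Feb 2026", 57),
    ("Mar 2026", 50),
    ("Apr 2026", 60),
    ("May 2026", 61),
    ("Jun 2026", 65),
]

scores = [score for _, score in BEST_BY_MONTH]
frontier = list(accumulate(scores, max))  # running maximum

for (month, score), best in zip(BEST_BY_MONTH, frontier):
    print(f"{month}: best release {score}, frontier {best}")
```

The frontier only moves when a month's best release beats everything before it, which is why it stays flat through October and December 2025 and March 2026.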

Coding Index

Composite of coding evaluations (LiveCodeBench, SciCode, Terminal-Bench Hard). Higher is better.

Agentic Index

Tool calling and long-horizon agent tasks (τ²-Bench, Terminal-Bench Hard). Higher is better.

Intelligence Breakdown

Individual evaluation scores (0–100) behind the Intelligence Index. Darker is better, normalized per column.

Model	GPQA Diamond	Humanity's Last Exam	SciCode	IFBench	Terminal-Bench Hard	τ²-Bench Telecom	AA-LCR (Long Context)	CritPt	MMMU-Pro
Claude Fable 5 (with fallback)	92.6	53.3	60.2	63.5	62.9	98.5	70.0	28.6	—
Claude Opus 4.8 (max)	92.0	45.7	53.5	62.2	58.3	94.4	67.7	20.9	—
GPT-5.5 (xhigh)	93.5	44.3	56.1	75.9	60.6	93.9	74.3	27.1	79.9
Claude Opus 4.7 (max)	91.4	39.6	54.5	58.6	51.5	88.6	70.3	12.0	78.8
Gemini 3.1 Pro Preview	94.1	44.7	58.9	77.1	53.8	95.6	72.7	17.7	82.4
Qwen3.7 Max	92.3	38.1	48.8	80.5	50.8	94.7	69.0	13.4	—
Gemini 3.5 Flash	92.2	41.0	53.1	76.3	40.9	95.3	69.3	13.1	84.3
MiniMax-M3	92.9	37.1	45.4	82.9	42.4	88.9	74.0	3.7	79.9
GPT-5.3 Codex (xhigh)	91.5	39.9	53.2	75.4	53.0	86.0	74.0	16.9	78.5
Qwen3.7 Plus	90.0	33.4	45.5	78.0	47.0	93.0	65.0	9.1	44.8
Grok 4.3 (high)	90.1	35.0	47.3	81.3	37.9	97.7	64.3	8.0	78.1
Muse Spark	88.4	39.9	51.5	75.9	45.5	91.5	69.7	11.3	80.5
Claude Sonnet 4.6 (max)	87.5	30.0	46.8	56.6	53.0	75.7	70.7	3.1	73.3
DeepSeek V4 Pro (Max)	88.8	35.9	50.0	76.5	46.2	96.2	66.3	12.9	—
MiniMax-M2.7	87.4	28.1	47.0	75.7	39.4	84.8	68.7	0.6	—
GPT-5.4 mini (xhigh)	87.5	26.6	49.9	73.3	52.3	83.3	69.3	10.0	73.3
Nemotron 3 Ultra	86.7	26.6	39.9	81.4	36.4	83.3	67.0	3.1	—
DeepSeek V4 Flash (Max)	89.4	32.1	44.9	79.2	35.6	95.0	63.0	7.1	—
Qwen3.5 397B A17B	89.3	27.3	42.0	78.8	40.9	95.6	65.7	1.7	77.3
GLM-5.1	83.9	25.6	36.1	52.0	35.6	97.1	44.3	0.0	—
Kimi K2.6	78.8	18.2	39.5	44.3	37.9	93.9	57.7	1.4	—
Gemma 4 31B	85.7	22.7	43.4	75.6	36.4	59.9	62.0	1.4	73.4
Mistral Medium 3.5	74.8	12.8	39.6	68.8	33.3	94.2	61.0	0.0	64.9
Nova 2.0 Pro Preview (medium)	78.5	8.9	42.7	79.0	24.2	92.7	54.3	0.0	64.5
MiMo-V2.5-Pro	76.2	13.3	39.1	42.7	35.6	72.5	35.0	1.1	—
gpt-oss-120b (high)	78.2	18.5	38.9	69.0	23.5	65.8	50.7	1.1	—
Claude 4.5 Haiku	64.6	4.3	34.4	42.0	27.3	32.5	43.7	0.0	55.1
Solar Pro 3	72.4	10.1	24.7	71.2	7.6	86.3	27.0	0.0	—
gpt-oss-20B (high)	68.8	9.8	34.4	65.1	10.6	60.2	30.7	1.4	—
K2 Think V2	71.3	9.5	33.0	62.8	6.8	25.4	52.7	0.0	—
GPT-5.5 Pro (xhigh)	—	—	—	—	—	—	—	30.6	—

AIME 2025 and LiveCodeBench are retired for newer models and excluded here; MMMU-Pro applies to multimodal-evaluated models only.
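
"Normalized per column" for the shading could be implemented as min-max scaling within each evaluation, so the best score in a column maps to 1.0 (darkest) and the worst to 0.0. That choice is an assumption, since the page does not say which normalization it uses; the function name is illustrative:

```python
def shade(column: dict[str, float]) -> dict[str, float]:
    """Min-max scale one evaluation column to 0..1 (1.0 = darkest cell)."""
    low, high = min(column.values()), max(column.values())
    span = high - low
    if span == 0:
        return {name: 0.0 for name in column}  # all scores equal: no contrast
    return {name: (score - low) / span for name, score in column.items()}


# A few GPQA Diamond scores from the breakdown table, including the
# column's highest (94.1) and lowest (64.6) values.
gpqa = {
    "Gemini 3.1 Pro Preview": 94.1,
    "GPT-5.5 (xhigh)": 93.5,
    "Claude Fable 5 (with fallback)": 92.6,
    "Kimi K2.6": 78.8,
    "Claude 4.5 Haiku": 64.6,
}

for name, value in shade(gpqa).items():
    print(f"{name}: {value:.2f}")
```

Scaling each column separately keeps a low-scoring evaluation such as CritPt readable next to a saturated one such as GPQA Diamond.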

Omniscience Index

Knowledge reliability from -100 to 100: correct answers score positive, hallucinated ones negative.

A negative score means the model hallucinates more than it knows. Declining to answer scores zero — most models would rather guess.
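
The scoring rule can be sketched as a signed accuracy: +1 for a correct answer, -1 for a hallucinated one, 0 for declining, averaged and scaled to the -100 to 100 range. The equal weighting and the function name are assumptions for illustration; the published index may weight things differently.

```python
def omniscience_score(correct: int, incorrect: int, abstained: int) -> float:
    """Signed accuracy on a -100..100 scale (assumed equal +1 / -1 weighting)."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no questions scored")
    return 100.0 * (correct - incorrect) / total


# Hypothetical counts over 100 questions, same number of correct answers.
print(omniscience_score(correct=40, incorrect=50, abstained=10))  # -10.0
print(omniscience_score(correct=40, incorrect=10, abstained=50))  # 30.0
```

Both hypothetical models know the same 40 answers; the one that abstains instead of guessing ends up 40 points higher.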

GDPval Elo — Real-World Work

Elo from blind pairwise comparisons on real economically valuable work tasks, with web and shell access.

Higher is better. Judged across occupations from software engineering to financial analysis.
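
For readers unfamiliar with Elo, the textbook update from a single pairwise comparison looks like the sketch below. The K-factor of 32 and the starting ratings are assumptions, and a leaderboard like this one may fit all comparisons jointly rather than update one at a time:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool,
           k: float = 32.0) -> tuple[float, float]:
    """Return both ratings after one blind comparison between A and B."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two hypothetical models start level; the judge prefers A's output.
print(update(1000.0, 1000.0, a_won=True))  # (1016.0, 984.0)
```

A 400-point gap corresponds to roughly a 10:1 expected preference, so an upset win over a much higher-rated model moves both ratings far more than an expected result does.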

ITBench — SRE Incident Analysis

Average precision at full recall diagnosing live Kubernetes incidents. Higher is better.

Models investigate real cluster telemetry to find root causes. Even the frontier tops out below 0.5 — ops work is far from solved.

Output Tokens Used to Run the Intelligence Suite

Total tokens generated answering the full evaluation suite, split into answer and reasoning tokens.

Reasoning-heavy models can burn 20–40× more tokens thinking than answering — which is exactly why blended price alone undersells true cost.

Cost to Run the Intelligence Suite

USD to complete every evaluation in the Intelligence Index, including reasoning tokens. Lower is better.

The spread is real: the same suite costs $19 on gpt-oss-20B and over $4,600 on the priciest frontier models.

Output Speed

Median output tokens per second across providers serving each model. Higher is better.

Latency — Time to First Answer Token

Seconds from request to first answer token, including reasoning time. Lower is better.

Max-effort reasoning modes pay for their scores in wait time: the smartest configurations routinely think for one to two minutes.

Pricing — Input and Output

USD per 1M tokens by direction. Lower is better.

Output tokens typically cost 2–4× input. Reasoning tokens bill as output, so thinking models multiply effective price.
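
Because reasoning tokens are billed at the output rate, per-request cost scales with how much a model thinks, not just how much it answers. A minimal sketch with hypothetical token counts and prices (none of these numbers come from a specific model):

```python
def request_cost_usd(input_tokens: int, answer_tokens: int, reasoning_tokens: int,
                     input_usd_per_m: float, output_usd_per_m: float) -> float:
    """Cost of one request when reasoning tokens are billed as output."""
    billed_output = answer_tokens + reasoning_tokens
    return (input_tokens * input_usd_per_m
            + billed_output * output_usd_per_m) / 1_000_000


# Hypothetical request: 3,000 input tokens and a 1,000-token answer,
# priced at $2.00 / $12.00 per 1M input / output tokens.
no_thinking = request_cost_usd(3_000, 1_000, 0, 2.00, 12.00)
heavy_thinking = request_cost_usd(3_000, 1_000, 20_000, 2.00, 12.00)  # 20x the answer

print(f"${no_thinking:.4f} vs ${heavy_thinking:.4f}")  # $0.0180 vs $0.2580
```

In this example the list prices are identical, yet the request that spends 20 times its answer length on reasoning costs about 14 times as much.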

Context Window

Maximum input tokens per request.

Openness Index

Weights availability plus transparency of methodology and training data, 0–100.

Only models with published openness scores shown. K2 Think V2 and Nemotron 3 Ultra lead; most frontier labs publish nothing.

Methodology: indices are composites of public evaluations run independently with standardized prompts; speed and latency are medians measured across API providers over the trailing 72 hours. Benchmark data: Artificial Analysis (artificialanalysis.ai) public snapshot, June 11, 2026. Prices are list API prices and change frequently. Company logos identify the respective model creators.