Glsrm Agents · June 11, 2026

Coding agents, measured end-to-end.

Real software-engineering tasks, run through complete agent stacks (harness plus model) and scored on what actually ships: solve rate across three task suites (higher is better), plus dollars per task, tokens consumed, and wall-clock minutes (lower is better for all three).

Coding Agent Index

Higher is better

Equal-weight mean of the three suites · top configurations

1. Claude Code · Opus 4.8 (max): 77.2
2. Claude Code · Opus 4.8 (medium): 67.2
3. Claude Code · Opus 4.7 (max): 66.6
4. Codex · GPT-5.5 (xhigh): 65.3
5. Opencode · Opus 4.7 (medium): 64.6
6. Cursor CLI · Composer 2.5: 62.9
7. Cursor CLI · Composer 2.5 Fast: 62.9
8. Cursor CLI · Opus 4.7 (medium): 61.1
9. Codex · GPT-5.5 (medium): 60.4

Cost per Task

Lower is better

Mean USD per completed task · cheapest configurations

1. Cursor CLI · Composer 2.5: $0.07
2. Cursor CLI · Composer 2: $0.07
3. Claude Code · DeepSeek V4 Pro (high): $0.35
4. Cursor CLI · Composer 2.5 Fast: $0.44
5. Claude Code · Kimi K2.6: $0.76
6. Claude Code · Sonnet 4.6 (medium): $1.02
7. Claude Code · Opus 4.7 (medium): $1.24
8. Claude Code · Opus 4.6 (medium): $1.27
9. Cursor CLI · Opus 4.7 (medium): $1.47

Time per Task

Lower is better

Mean wall-clock minutes per task · fastest configurations

1. Claude Code · Opus 4.7 (medium): 5.8m
2. Cursor CLI · GPT-5.5 (medium): 6.2m
3. Cursor CLI · Composer 2.5 Fast: 6.7m
4. Codex · GPT-5.4 (medium): 6.9m
5. Claude Code · Opus 4.6 (medium): 7.0m
6. Codex · GPT-5.5 (medium): 7.1m
7. Cursor CLI · GPT-5.4 (medium): 7.6m
8. Gemini CLI · Gemini 3.1 Pro (high): 7.6m
9. Cursor CLI · Opus 4.7 (medium): 7.8m

Coding agent comparison summary

Every harness × model configuration we track, ranked by Coding Agent Index.

#	Agent	Model	Lab	Index	SWE-Pro	Term-Bench	Atlas QnA	$/Task	Tokens/Task	Time/Task	Turns/Task
1	Claude Code (Anthropic)	Claude Opus 4.8 (max)	Anthropic	77.2	69.8	79.4	82.5	$4.62	9.4M	17.7m	134
2	Claude Code (Anthropic)	Claude Opus 4.8 (medium)	Anthropic	67.2	49.8	75.0	76.8	$1.60	3.4M	8.8m	63
3	Claude Code (Anthropic)	Claude Opus 4.7 (max)	Anthropic	66.6	44.9	73.8	81.0	$4.14	11.2M	13.8m	97
4	Codex (OpenAI)	GPT-5.5 (xhigh)	OpenAI	65.3	30.9	84.1	80.8	$4.33	9.3M	8.7m	96
5	Opencode (Opencode)	Claude Opus 4.7 (medium)	Anthropic	64.6	40.2	74.6	79.0	$1.82	4.4M	9.7m	43
6	Cursor CLI (Cursor)	Composer 2.5	Cursor	62.9	49.2	66.9	72.5	$0.07	2.8M	9.3m	101
7	Cursor CLI (Cursor)	Composer 2.5 Fast	Cursor	62.9	49.2	66.9	72.5	$0.44	3.1M	6.7m	101
8	Cursor CLI (Cursor)	Claude Opus 4.7 (medium)	Anthropic	61.1	34.4	70.6	78.4	$1.47	2.9M	7.8m	61
9	Codex (OpenAI)	GPT-5.5 (medium)	OpenAI	60.4	26.2	75.8	79.1	$2.21	5.4M	7.1m	73
10	Claude Code (Anthropic)	Claude Opus 4.7 (medium)	Anthropic	59.8	36.4	71.4	71.7	$1.24	3.3M	5.8m	35
11	Cursor CLI (Cursor)	GPT-5.5 (medium)	OpenAI	57.8	24.9	73.4	75.0	$1.61	2.8M	6.2m	69
12	Codex (OpenAI)	GPT-5.4 (medium)	OpenAI	53.5	18.4	69.8	72.4	$2.09	4.9M	6.9m	70
13	Claude Code (Anthropic)	Qwen3.7 Plus (thinking)	Alibaba	53.1	22.9	64.7	71.8	$4.98	6.0M	10.5m	126
14	Claude Code (Anthropic)	GLM-5.1	Z AI	52.7	19.8	65.1	73.2	$2.26	8.9M	21.6m	99
15	Cursor CLI (Cursor)	GPT-5.4 (medium)	OpenAI	52.2	18.9	64.7	72.9	$1.53	3.8M	7.6m	36
16	Claude Code (Anthropic)	Claude Opus 4.6 (medium)	Anthropic	51.3	11.8	70.2	71.9	$1.27	4.2M	7.0m	38
17	Claude Code (Anthropic)	Kimi K2.6	Kimi	50.5	27.3	64.3	59.8	$0.76	7.3M	41.5m	111
18	Claude Code (Anthropic)	DeepSeek V4 Pro (high)	DeepSeek	50.2	18.0	64.7	67.8	$0.35	6.2M	18.0m	101
19	Claude Code (Anthropic)	Claude Sonnet 4.6 (medium)	Anthropic	49.4	14.9	63.1	70.3	$1.02	4.3M	9.2m	47
20	Cursor CLI (Cursor)	Composer 2	Cursor	48.5	12.2	64.3	68.9	$0.07	3.3M	8.7m	44
21	Gemini CLI (Google)	Gemini 3.1 Pro (high)	Google	43.0	15.1	68.3	45.6	$1.60	3.2M	7.6m	44

Coding Agent Index

Equal-weight mean of three real-world suites: hard repository tasks, agentic terminal work, and codebase Q&A. Higher is better.

Composed of SWE-Bench-Pro-Hard-AA (150 tasks), Terminal-Bench v2 (84 tasks), and SWE-Atlas-QnA (124 tasks). The agent is the unit of measurement — the same model lands differently in different harnesses.
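
As a rough illustration of the arithmetic, the sketch below computes the index from three suite scores and contrasts it with a task-weighted mean, which the index does not use. The function and key names are hypothetical; the scores are the Claude Code · Opus 4.8 (max) row from the summary table.

```python
# Suite sizes as stated above; used only for the task-weighted comparison.
SUITE_TASKS = {"swe_pro": 150, "term_bench": 84, "atlas_qna": 124}


def coding_agent_index(scores: dict[str, float]) -> float:
    """Equal-weight mean: each suite counts one third, regardless of its size."""
    return sum(scores.values()) / len(scores)


def task_weighted_mean(scores: dict[str, float]) -> float:
    """For contrast only: weights each suite by its number of tasks."""
    total_tasks = sum(SUITE_TASKS.values())
    return sum(scores[name] * n for name, n in SUITE_TASKS.items()) / total_tasks


opus_48_max = {"swe_pro": 69.8, "term_bench": 79.4, "atlas_qna": 82.5}
print(round(coding_agent_index(opus_48_max), 1))   # 77.2, matching the table
print(round(task_weighted_mean(opus_48_max), 1))   # 76.5, what task weighting would give
```

Equal weighting keeps the 150-task suite from dominating the headline number, at the price of each Terminal-Bench task counting for more than each SWE-Bench-Pro task.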

SWE-Bench-Pro-Hard-AA

Solve rate on 150 hard code generation tasks in real repositories, %.

Terminal-Bench v2

Solve rate on 84 agentic terminal tasks in a live shell, %.

SWE-Atlas-QnA

Rubric score on 124 codebase Q&A tasks, %.

The Harness Effect — Claude Opus 4.7 (medium)

The same model, three harnesses. The scaffold alone moves the Coding Agent Index.

Identical model weights and settings; only the harness changes. Prompting, context management, and tooling are worth real points.

Harness Spread

Points of Index

Index points between a model's best and worst harness, for every model run in 2+ harnesses

1. Claude Opus 4.7 (medium), 3 harnesses: 4.8
2. GPT-5.5 (medium), 2 harnesses: 2.6
3. GPT-5.4 (medium), 2 harnesses: 1.4
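
The spread figures can be reproduced from the summary table with a sketch like the one below. It assumes the spread is taken on unrounded index values, which yields the 1.4 shown for GPT-5.4 (the rounded table values, 53.5 and 52.2, would give 1.3). The function name and row layout are hypothetical.

```python
from collections import defaultdict

# (harness, model, SWE-Pro, Term-Bench, Atlas QnA), copied from the summary table.
ROWS = [
    ("Opencode", "Claude Opus 4.7 (medium)", 40.2, 74.6, 79.0),
    ("Cursor CLI", "Claude Opus 4.7 (medium)", 34.4, 70.6, 78.4),
    ("Claude Code", "Claude Opus 4.7 (medium)", 36.4, 71.4, 71.7),
    ("Codex", "GPT-5.5 (medium)", 26.2, 75.8, 79.1),
    ("Cursor CLI", "GPT-5.5 (medium)", 24.9, 73.4, 75.0),
    ("Codex", "GPT-5.4 (medium)", 18.4, 69.8, 72.4),
    ("Cursor CLI", "GPT-5.4 (medium)", 18.9, 64.7, 72.9),
]


def harness_spread(rows):
    """Best-minus-worst index per model, for models run in two or more harnesses."""
    index_by_model = defaultdict(list)
    for _harness, model, swe, term, qna in rows:
        index_by_model[model].append((swe + term + qna) / 3)  # unrounded index
    return {
        model: round(max(values) - min(values), 1)
        for model, values in index_by_model.items()
        if len(values) >= 2
    }


for model, spread in harness_spread(ROWS).items():
    print(f"{model}: {spread}")
```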

Cost per Task

Mean USD to complete one task, across all three suites. Lower is better.

Measured mean spend per task at list API prices, cache discounts included. The spread is the story: the priciest configuration ($4.98) costs roughly 70× the cheapest ($0.07).
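
A minimal sketch of that accounting, assuming a three-way token split (fresh input, cache reads, output) and cache reads billed at 10% of the input rate. The prices and token counts below are invented for illustration and are not any provider's list prices; cache-write surcharges, which some providers charge, are ignored here.

```python
from dataclasses import dataclass

CACHE_READ_DISCOUNT = 0.10  # cache reads bill at 10% of the input rate


@dataclass
class TokenMix:
    fresh_input: int  # uncached input tokens per task
    cache_read: int   # input tokens served from the prompt cache
    output: int       # generated tokens


def cost_per_task(mix: TokenMix, input_usd_per_mtok: float, output_usd_per_mtok: float) -> float:
    """USD for one task, given per-million-token prices."""
    input_cost = mix.fresh_input * input_usd_per_mtok
    cache_cost = mix.cache_read * input_usd_per_mtok * CACHE_READ_DISCOUNT
    output_cost = mix.output * output_usd_per_mtok
    return (input_cost + cache_cost + output_cost) / 1_000_000


# Hypothetical mix and prices, chosen only to make the arithmetic easy to follow.
mix = TokenMix(fresh_input=200_000, cache_read=3_000_000, output=40_000)
print(round(cost_per_task(mix, input_usd_per_mtok=5.0, output_usd_per_mtok=25.0), 2))  # 3.5
```

In this made-up mix, cache reads are more than 90% of the tokens but well under half of the bill, which is why the token split matters as much as the total.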

Coding Agent Index vs. Cost per Task

Capability against mean USD per task (log scale).

Up and to the left wins: more solved tasks per dollar. Open-weight models power most of the value corner.
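
One way to make "up and to the left" concrete is a Pareto frontier: keep only the configurations that no other configuration beats on both cost and index. The sketch below uses a subset of rows from the summary table; the function name is hypothetical.

```python
def pareto_frontier(configs):
    """configs: (name, usd_per_task, index) tuples.

    Returns the names that are not dominated, cheapest first. A configuration is
    dominated if another one costs no more and scores at least as high.
    """
    frontier = []
    best_index = float("-inf")
    # Walk from cheapest to priciest; at equal cost, look at the higher index first.
    for name, _cost, index in sorted(configs, key=lambda c: (c[1], -c[2])):
        if index > best_index:
            frontier.append(name)
            best_index = index
    return frontier


# Subset of the summary table: (configuration, $/task, index).
CONFIGS = [
    ("Claude Code · Opus 4.8 (max)", 4.62, 77.2),
    ("Claude Code · Opus 4.8 (medium)", 1.60, 67.2),
    ("Codex · GPT-5.5 (xhigh)", 4.33, 65.3),
    ("Opencode · Opus 4.7 (medium)", 1.82, 64.6),
    ("Cursor CLI · Composer 2.5", 0.07, 62.9),
    ("Claude Code · DeepSeek V4 Pro (high)", 0.35, 50.2),
    ("Claude Code · Qwen3.7 Plus (thinking)", 4.98, 53.1),
]

print(pareto_frontier(CONFIGS))
```

On this subset the frontier is Cursor CLI · Composer 2.5, then Claude Code · Opus 4.8 (medium), then Claude Code · Opus 4.8 (max); every other listed configuration is beaten on both axes by one of those three.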

Token Usage per Task

Mean tokens consumed per task, split into fresh input, cache reads, and output.

Agents read far more than they write — cache reads are the overwhelming majority of tokens everywhere. Output is the thin dark sliver on top.

Cache Hit Rate

Share of context reads served from prompt cache, %. Higher is better for cost.

Harnesses that keep context stable cache better. Every point of hit rate is money: cached reads bill at a tenth of the fresh-input price.
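
To see why each point of hit rate is money, the sketch below computes the hit rate from a token split and the blended price actually paid per million input tokens. It assumes the 10% cache-read rate stated above; the function names, token counts, and the $5.00 input price are hypothetical.

```python
def cache_hit_rate(fresh_input: int, cache_read: int) -> float:
    """Share of context reads served from the prompt cache, as a fraction."""
    return cache_read / (fresh_input + cache_read)


def effective_input_rate(hit_rate: float, input_usd_per_mtok: float,
                         cache_discount: float = 0.10) -> float:
    """Blended USD per million input tokens at a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return input_usd_per_mtok * ((1.0 - hit_rate) + hit_rate * cache_discount)


PRICE = 5.0  # hypothetical input price, USD per million tokens
print(cache_hit_rate(fresh_input=200_000, cache_read=3_000_000))
for rate in (0.80, 0.95):
    print(rate, round(effective_input_rate(rate, PRICE), 3))
```

At this made-up price, moving from an 80% to a 95% hit rate cuts the blended input rate from about $1.40 to about $0.73 per million tokens, close to half.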

Coding Agent Index vs. Total Tokens

Capability against mean total tokens per task (millions).

Reading more of the repo correlates with solving more of it — but the best harnesses get more index per token read.
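
A simple way to compare harnesses on this axis is index points per million tokens. The sketch below applies that ratio to the three Claude Opus 4.7 (medium) rows in the summary table; the metric name is our own shorthand, not a published measure.

```python
# (harness, index, mean tokens per task in millions), Claude Opus 4.7 (medium) rows.
OPUS_47_MEDIUM = [
    ("Opencode", 64.6, 4.4),
    ("Cursor CLI", 61.1, 2.9),
    ("Claude Code", 59.8, 3.3),
]


def index_per_mtok(rows):
    """Rank harnesses by index points earned per million tokens consumed."""
    ranked = [(harness, index / mtok) for harness, index, mtok in rows]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


for harness, efficiency in index_per_mtok(OPUS_47_MEDIUM):
    print(f"{harness}: {efficiency:.1f} index points per million tokens")
```

On these rows the ordering by efficiency (Cursor CLI, Claude Code, Opencode) is the reverse of the ordering by tokens consumed, and Opencode, the highest-index harness, is the least token-efficient of the three.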

Execution Time per Task

Mean wall-clock minutes from task start to the agent declaring done. Lower is better.

Includes model latency, tool calls, builds, and test runs. The fastest configurations in the table finish in about 6 to 7 minutes per task; the most deliberate ones take 14 to 22 minutes, and the slowest (Kimi K2.6 in Claude Code) averages 41.5 minutes.

Coding Agent Index vs. Execution Time

Capability against mean wall-clock minutes per task.

Up and to the left wins: capable and quick. Slow is only worth it if the index follows.

Turns per Task

Mean assistant turns (tool-call rounds) per task.

More turns means more, smaller steps — not necessarily better results. Turn count tracks harness style more than capability.

Run Specifications

Every configuration runs the same way, so the numbers compare clean.

Environment: Fresh container per task, repo pinned to a fixed commit, network limited to package mirrors.
Attempts: One attempt per task (pass@1), no retries, no human nudges.
Configuration: Each harness runs at default settings with its recommended model configuration.
Budget: Hard cap of 60 minutes wall-clock per task; runs that exceed it score zero.
Cost accounting: List API prices at snapshot date; cache reads billed at 10% of the input rate.
Reporting: Cost, tokens, time, and turns are means across all completed tasks in the three suites.
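
The attempt and budget rules above can be sketched as a scoring function: one attempt per task, and any run over the 60-minute cap scores zero even if its tests pass. The `Run` type and function names are hypothetical, and the four example runs are invented.

```python
from dataclasses import dataclass

BUDGET_MINUTES = 60.0  # hard wall-clock cap per task


@dataclass
class Run:
    passed: bool    # did the hidden test suite pass on the single attempt?
    minutes: float  # wall-clock time from task start to the agent declaring done


def task_score(run: Run) -> int:
    """pass@1 with a hard cap: over-budget runs score zero regardless of outcome."""
    if run.minutes > BUDGET_MINUTES:
        return 0
    return 1 if run.passed else 0


def solve_rate(runs: list[Run]) -> float:
    """Percentage of tasks solved within budget."""
    return 100.0 * sum(task_score(r) for r in runs) / len(runs)


# Invented example: the third run passes its tests but exceeds the budget.
runs = [Run(True, 12.0), Run(False, 8.5), Run(True, 61.2), Run(True, 59.9)]
print(solve_rate(runs))  # 50.0
```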

Frequently Asked Questions

What is the Coding Agent Index?

The equal-weight mean of a configuration's scores on the three task suites. One number for how much real software work gets done — no extra weighting tricks, no style points.

What do the three suites actually test?

SWE-Bench-Pro-Hard-AA is 150 hard bug-fix and feature tasks in real repositories, graded by hidden tests. Terminal-Bench v2 is 84 multi-step jobs in a live shell — builds, migrations, debugging, ops. SWE-Atlas-QnA is 124 questions that require navigating a large codebase and answering precisely.

How are tasks scored?

Implementation and terminal tasks are pass@1: one attempt, and the test suite either passes or it doesn't. Codebase Q&A earns partial credit against a rubric. Nothing is cherry-picked or re-run.

What counts as execution time?

Wall-clock from handing the agent a task to the agent declaring done — model latency, tool calls, builds, and test runs included. It's the number you actually wait.

Why track tokens at all?

Because agents read far more than they write. Cache reads dominate token volume but bill at a tenth of the fresh-input rate, so two agents with the same index can differ several-fold in cost. The token mix explains the cost chart.

Why does the same model score differently across harnesses?

The harness decides what the model sees and which tools it gets — system prompts, context management, edit formats, test loops. Same engine, different car.

Methodology: every configuration runs the same three suites — SWE-Bench-Pro-Hard-AA (150 tasks), Terminal-Bench v2 (84 tasks), and SWE-Atlas-QnA (124 tasks) — and the Coding Agent Index is their equal-weight mean. Suite design and metric definitions follow the public coding-agents methodology of Artificial Analysis (artificialanalysis.ai/agents/coding-agents). Figures are Glsrm editorial estimates calibrated to our model benchmark table, not Artificial Analysis's published results. Cost per task is derived from each run's mean token mix at list API prices, with cache reads billed at 10% of the input rate. Prices change frequently. Logos identify the respective model creators.