Glsrm Agents · June 11, 2026

Coding agents, measured end-to-end.


Real software-engineering tasks, run through complete agent stacks — harness plus model — and scored on what actually ships: solve rate across three task suites (higher is better), dollars per task, tokens consumed, and minutes on the clock (all lower is better).

21 configurations · 5 harnesses · 17 models · Snapshot: June 11, 2026

Coding Agent Index

Higher is better

Equal-weight mean of the three suites · top configurations

  • 1. Claude Code · Opus 4.8 (max): 77.2
  • 2. Claude Code · Opus 4.8 (medium): 67.2
  • 3. Claude Code · Opus 4.7 (max): 66.6
  • 4. Codex · GPT-5.5 (xhigh): 65.3
  • 5. Opencode · Opus 4.7 (medium): 64.6
  • 6. Cursor CLI · Composer 2.5: 62.9
  • 7. Cursor CLI · Composer 2.5 Fast: 62.9
  • 8. Cursor CLI · Opus 4.7 (medium): 61.1
  • 9. Codex · GPT-5.5 (medium): 60.4

Cost per Task

Lower is better

Mean USD per completed task · cheapest configurations

  • 1. Cursor CLI · Composer 2.5: $0.07
  • 2. Cursor CLI · Composer 2: $0.07
  • 3. Claude Code · DeepSeek V4 Pro (high): $0.35
  • 4. Cursor CLI · Composer 2.5 Fast: $0.44
  • 5. Claude Code · Kimi K2.6: $0.76
  • 6. Claude Code · Sonnet 4.6 (medium): $1.02
  • 7. Claude Code · Opus 4.7 (medium): $1.24
  • 8. Claude Code · Opus 4.6 (medium): $1.27
  • 9. Cursor CLI · Opus 4.7 (medium): $1.47

Time per Task

Lower is better

Mean wall-clock minutes per task · fastest configurations

  • 1. Claude Code · Opus 4.7 (medium): 5.8m
  • 2. Cursor CLI · GPT-5.5 (medium): 6.2m
  • 3. Cursor CLI · Composer 2.5 Fast: 6.7m
  • 4. Codex · GPT-5.4 (medium): 6.9m
  • 5. Claude Code · Opus 4.6 (medium): 7m
  • 6. Codex · GPT-5.5 (medium): 7.1m
  • 7. Cursor CLI · GPT-5.4 (medium): 7.6m
  • 8. Gemini CLI · Gemini 3.1 Pro (high): 7.6m
  • 9. Cursor CLI · Opus 4.7 (medium): 7.8m

Coding agent comparison summary

Every harness × model configuration we track, ranked by Coding Agent Index.

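| # | Agent | Model | Lab | Index | SWE-Pro | Term-Bench | Atlas QnA | $/Task | Tokens/Task | Time | Turns |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Code (Anthropic) | Claude Opus 4.8 (max) | Anthropic | 77.2 | 69.8 | 79.4 | 82.5 | $4.62 | 9.4M | 17.7m | 134 |
| 2 | Claude Code (Anthropic) | Claude Opus 4.8 (medium) | Anthropic | 67.2 | 49.8 | 75.0 | 76.8 | $1.60 | 3.4M | 8.8m | 63 |
| 3 | Claude Code (Anthropic) | Claude Opus 4.7 (max) | Anthropic | 66.6 | 44.9 | 73.8 | 81.0 | $4.14 | 11.2M | 13.8m | 97 |
| 4 | Codex (OpenAI) | GPT-5.5 (xhigh) | OpenAI | 65.3 | 30.9 | 84.1 | 80.8 | $4.33 | 9.3M | 8.7m | 96 |
| 5 | Opencode (Opencode) | Claude Opus 4.7 (medium) | Anthropic | 64.6 | 40.2 | 74.6 | 79.0 | $1.82 | 4.4M | 9.7m | 43 |
| 6 | Cursor CLI (Cursor) | Composer 2.5 | Cursor | 62.9 | 49.2 | 66.9 | 72.5 | $0.07 | 2.8M | 9.3m | 101 |
| 7 | Cursor CLI (Cursor) | Composer 2.5 Fast | Cursor | 62.9 | 49.2 | 66.9 | 72.5 | $0.44 | 3.1M | 6.7m | 101 |
| 8 | Cursor CLI (Cursor) | Claude Opus 4.7 (medium) | Anthropic | 61.1 | 34.4 | 70.6 | 78.4 | $1.47 | 2.9M | 7.8m | 61 |
| 9 | Codex (OpenAI) | GPT-5.5 (medium) | OpenAI | 60.4 | 26.2 | 75.8 | 79.1 | $2.21 | 5.4M | 7.1m | 73 |
| 10 | Claude Code (Anthropic) | Claude Opus 4.7 (medium) | Anthropic | 59.8 | 36.4 | 71.4 | 71.7 | $1.24 | 3.3M | 5.8m | 35 |
| 11 | Cursor CLI (Cursor) | GPT-5.5 (medium) | OpenAI | 57.8 | 24.9 | 73.4 | 75.0 | $1.61 | 2.8M | 6.2m | 69 |
| 12 | Codex (OpenAI) | GPT-5.4 (medium) | OpenAI | 53.5 | 18.4 | 69.8 | 72.4 | $2.09 | 4.9M | 6.9m | 70 |
| 13 | Claude Code (Anthropic) | Qwen3.7 Plus (thinking) | Alibaba | 53.1 | 22.9 | 64.7 | 71.8 | $4.98 | 6.0M | 10.5m | 126 |
| 14 | Claude Code (Anthropic) | GLM-5.1 | Z AI | 52.7 | 19.8 | 65.1 | 73.2 | $2.26 | 8.9M | 21.6m | 99 |
| 15 | Cursor CLI (Cursor) | GPT-5.4 (medium) | OpenAI | 52.2 | 18.9 | 64.7 | 72.9 | $1.53 | 3.8M | 7.6m | 36 |
| 16 | Claude Code (Anthropic) | Claude Opus 4.6 (medium) | Anthropic | 51.3 | 11.8 | 70.2 | 71.9 | $1.27 | 4.2M | 7m | 38 |
| 17 | Claude Code (Anthropic) | Kimi K2.6 | Kimi | 50.5 | 27.3 | 64.3 | 59.8 | $0.76 | 7.3M | 41.5m | 111 |
| 18 | Claude Code (Anthropic) | DeepSeek V4 Pro (high) | DeepSeek | 50.2 | 18.0 | 64.7 | 67.8 | $0.35 | 6.2M | 18m | 101 |
| 19 | Claude Code (Anthropic) | Claude Sonnet 4.6 (medium) | Anthropic | 49.4 | 14.9 | 63.1 | 70.3 | $1.02 | 4.3M | 9.2m | 47 |
| 20 | Cursor CLI (Cursor) | Composer 2 | Cursor | 48.5 | 12.2 | 64.3 | 68.9 | $0.07 | 3.3M | 8.7m | 44 |
| 21 | Gemini CLI (Google) | Gemini 3.1 Pro (high) | Google | 43.0 | 15.1 | 68.3 | 45.6 | $1.60 | 3.2M | 7.6m | 44 |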

Coding Agent Index

Equal-weight mean of three real-world suites: hard repository tasks, agentic terminal work, and codebase Q&A. Higher is better.

[Bar chart: Coding Agent Index by configuration, axis 0 to 80. Values appear in the Index column of the comparison summary.]

Composed of SWE-Bench-Pro-Hard-AA (150 tasks), Terminal-Bench v2 (84 tasks), and SWE-Atlas-QnA (124 tasks). The agent is the unit of measurement — the same model lands differently in different harnesses.
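As a quick check on the definition, the sketch below recomputes the Index from the three suite scores in the comparison summary. The function name `coding_agent_index` is illustrative and not part of any published tooling, and rounding to one decimal is an assumption that happens to match the published figures.

```python
from statistics import mean

def coding_agent_index(swe_pro: float, term_bench: float, atlas_qna: float) -> float:
    """Equal-weight mean of the three suite scores, rounded to one decimal (assumed)."""
    return round(mean([swe_pro, term_bench, atlas_qna]), 1)

# Suite scores copied from the comparison summary.
print(coding_agent_index(69.8, 79.4, 82.5))  # Claude Code · Opus 4.8 (max)
print(coding_agent_index(30.9, 84.1, 80.8))  # Codex · GPT-5.5 (xhigh)
```

Because the weights are equal, a configuration with one weak suite (for example a low SWE-Pro score next to a strong Term-Bench score) is pulled down by exactly a third of the gap.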

SWE-Bench-Pro-Hard-AA

Solve rate on 150 hard code generation tasks in real repositories, %.

[Bar chart: SWE-Bench-Pro-Hard-AA solve rate by configuration, axis 0 to 70%. Unrounded values appear in the SWE-Pro column of the comparison summary.]

Terminal-Bench v2

Solve rate on 84 agentic terminal tasks in a live shell, %.

[Bar chart: Terminal-Bench v2 solve rate by configuration, axis 0 to 80%. Unrounded values appear in the Term-Bench column of the comparison summary.]

SWE-Atlas-QnA

Rubric score on 124 codebase Q&A tasks, %.

[Bar chart: SWE-Atlas-QnA rubric score by configuration, axis 0 to 80%. Unrounded values appear in the Atlas QnA column of the comparison summary.]

The Harness Effect — Claude Opus 4.7 (medium)

The same model, three harnesses. The scaffold alone moves the Coding Agent Index.

  • Opencode: 64.6
  • Cursor CLI: 61.1
  • Claude Code: 59.8

Identical model weights and settings; only the harness changes. Prompting, context management, and tooling are worth real points.

Harness Spread

Points of Index

Index points between a model's best and worst harness, for every model run in 2+ harnesses

  • 1. Claude Opus 4.7 (medium), 3 harnesses: 4.8
  • 2. GPT-5.5 (medium), 2 harnesses: 2.6
  • 3. GPT-5.4 (medium), 2 harnesses: 1.4
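The spread figures can be reproduced from the suite scores in the comparison summary. The sketch below is one plausible reconstruction, not the site's own code: it assumes the spread is taken between unrounded suite means and rounded only at the end, which is what yields 1.4 for GPT-5.4 (medium). Subtracting the already-rounded Index values would give 1.3. The names `SUITE_SCORES` and `harness_spread` are illustrative.

```python
from statistics import mean

# (SWE-Pro, Term-Bench, Atlas QnA) per harness, copied from the comparison summary.
SUITE_SCORES = {
    "Claude Opus 4.7 (medium)": {
        "Opencode": (40.2, 74.6, 79.0),
        "Cursor CLI": (34.4, 70.6, 78.4),
        "Claude Code": (36.4, 71.4, 71.7),
    },
    "GPT-5.5 (medium)": {
        "Codex": (26.2, 75.8, 79.1),
        "Cursor CLI": (24.9, 73.4, 75.0),
    },
    "GPT-5.4 (medium)": {
        "Codex": (18.4, 69.8, 72.4),
        "Cursor CLI": (18.9, 64.7, 72.9),
    },
}

def harness_spread(by_harness: dict) -> float:
    """Index points between a model's best and worst harness."""
    # Assumption: use unrounded equal-weight means, round only the final gap.
    indices = [mean(scores) for scores in by_harness.values()]
    return round(max(indices) - min(indices), 1)

for model, by_harness in SUITE_SCORES.items():
    print(f"{model}: {harness_spread(by_harness)}")
```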

Cost per Task

Mean USD to complete one task, across all three suites. Lower is better.

[Bar chart: cost per task by configuration, axis $0.00 to $5.00. Values appear in the $/Task column of the comparison summary.]

Measured mean spend per task at list API prices, cache discounts included. The spread is the story: the priciest configuration costs roughly 70× the cheapest.

Coding Agent Index vs. Cost per Task

Capability against mean USD per task (log scale).

[Scatter plot: Coding Agent Index (y-axis, 0 to 80) against cost per task in USD (x-axis, log scale), one point per configuration. The upper-left quadrant is marked as most attractive.]

Up and to the left wins: more solved tasks per dollar. Open-weight models power most of the value corner.
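One way to make "up and to the left" concrete is a Pareto filter: keep only configurations that no other configuration matches or beats on both Index and cost. This is one possible reading of the chart rather than the site's stated method. The `Config` type and `pareto_frontier` helper are illustrative names, and the figures are copied from the comparison summary.

```python
from typing import NamedTuple

class Config(NamedTuple):
    name: str
    index: float
    cost_usd: float

CONFIGS = [
    Config("Claude Code · Opus 4.8 (max)", 77.2, 4.62),
    Config("Claude Code · Opus 4.8 (medium)", 67.2, 1.60),
    Config("Claude Code · Opus 4.7 (max)", 66.6, 4.14),
    Config("Codex · GPT-5.5 (xhigh)", 65.3, 4.33),
    Config("Opencode · Opus 4.7 (medium)", 64.6, 1.82),
    Config("Cursor CLI · Composer 2.5", 62.9, 0.07),
    Config("Cursor CLI · Composer 2.5 Fast", 62.9, 0.44),
    Config("Cursor CLI · Opus 4.7 (medium)", 61.1, 1.47),
    Config("Codex · GPT-5.5 (medium)", 60.4, 2.21),
    Config("Claude Code · Opus 4.7 (medium)", 59.8, 1.24),
    Config("Cursor CLI · GPT-5.5 (medium)", 57.8, 1.61),
    Config("Codex · GPT-5.4 (medium)", 53.5, 2.09),
    Config("Claude Code · Qwen3.7 Plus (thinking)", 53.1, 4.98),
    Config("Claude Code · GLM-5.1", 52.7, 2.26),
    Config("Cursor CLI · GPT-5.4 (medium)", 52.2, 1.53),
    Config("Claude Code · Opus 4.6 (medium)", 51.3, 1.27),
    Config("Claude Code · Kimi K2.6", 50.5, 0.76),
    Config("Claude Code · DeepSeek V4 Pro (high)", 50.2, 0.35),
    Config("Claude Code · Sonnet 4.6 (medium)", 49.4, 1.02),
    Config("Cursor CLI · Composer 2", 48.5, 0.07),
    Config("Gemini CLI · Gemini 3.1 Pro (high)", 43.0, 1.60),
]

def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.index >= b.index and a.cost_usd <= b.cost_usd
    strictly_better = a.index > b.index or a.cost_usd < b.cost_usd
    return no_worse and strictly_better

def pareto_frontier(configs: list) -> list:
    """Configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]

for c in pareto_frontier(CONFIGS):
    print(f"{c.name}: index {c.index}, ${c.cost_usd:.2f}")
```

On these figures the filter keeps three configurations: Claude Code · Opus 4.8 (max), Claude Code · Opus 4.8 (medium), and Cursor CLI · Composer 2.5. Everything else is matched or beaten on both axes by at least one of them.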

Token Usage per Task

Mean tokens consumed per task, split into fresh input, cache reads, and output.

[Stacked bar chart: tokens per task by configuration, split into fresh input, cache reads, and output, axis 0M to 10M. Per-configuration totals appear in the Tokens/Task column of the comparison summary.]

Agents read far more than they write — cache reads are the overwhelming majority of tokens everywhere. Output is the thin dark sliver on top.

Cache Hit Rate

Share of context reads served from prompt cache, %. Higher is better for cost.

[Bar chart: cache hit rate by configuration, axis 0% to 50%. Opencode · Opus 4.7 (medium) leads at 50%; 13 configurations sit at 49%; Cursor CLI · Composer 2 and Cursor CLI · GPT-5.5 (medium) are at 48%; Gemini CLI · Gemini 3.1 Pro (high) and Cursor CLI · GPT-5.4 (medium) at 47%; Claude Code · GLM-5.1 at 46%; Claude Code · Qwen3.7 Plus (thinking) at 45%; Claude Code · DeepSeek V4 Pro (high) at 44%.]

Harnesses that keep context stable cache better. Every point of hit rate is money: cached reads bill at a tenth of the fresh-input price.
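To put a number on "every point of hit rate is money", the sketch below computes the blended input price as a fraction of the fresh-input price, using the 10% cache-read rate from the run specifications. The function name is illustrative, and the model assumes every context read is billed either as fresh input or as a cache read, with nothing else in play (such as cache-write surcharges).

```python
CACHE_READ_FRACTION = 0.10  # cached reads bill at 10% of the fresh-input rate

def effective_input_multiplier(hit_rate: float) -> float:
    """Blended input price as a fraction of the fresh-input price."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Misses pay full price; hits pay the discounted cache-read price.
    return (1.0 - hit_rate) + CACHE_READ_FRACTION * hit_rate

# The lowest and highest hit rates shown in the chart.
for rate in (0.44, 0.50):
    print(f"{rate:.0%} hit rate -> {effective_input_multiplier(rate):.3f}x fresh-input price")
```

Under these assumptions each additional percentage point of hit rate trims the blended input price by 0.9% of the fresh-input rate.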

Coding Agent Index vs. Total Tokens

Capability against mean total tokens per task (millions).

[Scatter plot: Coding Agent Index (y-axis, 0 to 80) against total tokens per task in millions (x-axis, 0M to 10M), one point per configuration. The upper-left direction is marked as more capability per token.]

Reading more of the repo correlates with solving more of it — but the best harnesses get more index per token read.

Execution Time per Task

Mean wall-clock minutes from task start to the agent declaring done. Lower is better.

[Bar chart: execution time per task by configuration, axis 0m to 40m. Values appear in the Time column of the comparison summary.]

Includes model latency, tool calls, builds, and test runs. The fastest configurations finish in under 8 minutes on average; the slowest, at 41.5 minutes, takes roughly seven times as long as the fastest.

Coding Agent Index vs. Execution Time

Capability against mean wall-clock minutes per task.

[Scatter plot: Coding Agent Index (y-axis, 0 to 80) against execution time per task in minutes (x-axis, 0m to 40m), one point per configuration. The upper-left direction is marked as capable and quick.]

Up and to the left wins: capable and quick. Slow is only worth it if the index follows.

Turns per Task

Mean assistant turns (tool-call rounds) per task.

[Bar chart: turns per task by configuration, axis 0 to 140. Values appear in the Turns column of the comparison summary.]

More turns means more, smaller steps — not necessarily better results. Turn count tracks harness style more than capability.

Run Specifications

Every configuration runs the same way, so the numbers compare clean.

Environment
Fresh container per task, repo pinned to a fixed commit, network limited to package mirrors.
Attempts
One attempt per task (pass@1), no retries, no human nudges.
Configuration
Each harness runs at default settings with its recommended model configuration.
Budget
Hard cap of 60 minutes wall-clock per task; runs that exceed it score zero.
Cost accounting
List API prices at snapshot date; cache reads billed at 10% of the input rate.
Reporting
Cost, tokens, time, and turns are means across all completed tasks in the three suites.
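The cost-accounting rule can be sketched as a small function. The token mix and the per-million-token prices below are hypothetical values chosen for round arithmetic, not figures from any run or price list; only the 10% cache-read rate comes from the specification above. `TokenMix` and `cost_per_task` are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class TokenMix:
    fresh_input: int   # input tokens billed at the full input rate
    cache_reads: int   # input tokens served from prompt cache
    output: int        # generated tokens

def cost_per_task(
    mix: TokenMix,
    input_price_per_m: float,
    output_price_per_m: float,
    cache_read_fraction: float = 0.10,  # cache reads billed at 10% of the input rate
) -> float:
    """USD cost of one task from its token mix at list prices."""
    per_input_token = input_price_per_m / 1_000_000
    per_output_token = output_price_per_m / 1_000_000
    return (
        mix.fresh_input * per_input_token
        + mix.cache_reads * per_input_token * cache_read_fraction
        + mix.output * per_output_token
    )

# Hypothetical mix and prices, for illustration only.
mix = TokenMix(fresh_input=200_000, cache_reads=3_000_000, output=40_000)
print(f"${cost_per_task(mix, input_price_per_m=3.00, output_price_per_m=15.00):.2f}")  # $2.10
```

In this made-up example the 3M cached tokens cost $0.90, against $9.00 if they had been billed as fresh input, which is why the token mix matters as much as the token total.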

Frequently Asked Questions

What is the Coding Agent Index?

The equal-weight mean of a configuration's scores on the three task suites. One number for how much real software work gets done — no extra weighting tricks, no style points.

What do the three suites actually test?

SWE-Bench-Pro-Hard-AA is 150 hard bug-fix and feature tasks in real repositories, graded by hidden tests. Terminal-Bench v2 is 84 multi-step jobs in a live shell — builds, migrations, debugging, ops. SWE-Atlas-QnA is 124 questions that require navigating a large codebase and answering precisely.

How are tasks scored?

Implementation and terminal tasks are pass@1: one attempt, and the test suite either passes or it doesn't. Codebase Q&A earns partial credit against a rubric. Nothing is cherry-picked or re-run.
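A minimal sketch of the pass@1 rule combined with the 60-minute cap from the run specifications follows. The run records are invented for illustration, and the function names are not from any published harness; rubric-scored Q&A tasks are left out because the rubric itself is not described here.

```python
from statistics import mean

TIME_CAP_MINUTES = 60  # hard wall-clock cap; runs that exceed it score zero

def task_score(tests_passed: bool, minutes: float) -> float:
    """pass@1 score for one task: 1.0 only if tests pass within the cap."""
    return 1.0 if tests_passed and minutes <= TIME_CAP_MINUTES else 0.0

def solve_rate(runs: list) -> float:
    """Solve rate in percent over (tests_passed, minutes) records, one per task."""
    return 100 * mean(task_score(passed, minutes) for passed, minutes in runs)

# Hypothetical runs: one pass, one fail, one pass that blew the cap, one more pass.
runs = [(True, 12.0), (False, 8.5), (True, 61.0), (True, 30.0)]
print(f"{solve_rate(runs):.1f}%")  # 50.0%
```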

What counts as execution time?

Wall-clock from handing the agent a task to the agent declaring done — model latency, tool calls, builds, and test runs included. It's the number you actually wait.

Why track tokens at all?

Because agents read far more than they write. Cache reads make up most of the token volume but bill at a tenth of the fresh-input rate, so two agents with the same index can differ several-fold in cost. The token mix is the why behind the cost chart.

Why does the same model score differently across harnesses?

The harness decides what the model sees and which tools it gets — system prompts, context management, edit formats, test loops. Same engine, different car.


Methodology: every configuration runs the same three suites — SWE-Bench-Pro-Hard-AA (150 tasks), Terminal-Bench v2 (84 tasks), and SWE-Atlas-QnA (124 tasks) — and the Coding Agent Index is their equal-weight mean. Suite design and metric definitions follow the public coding-agents methodology of Artificial Analysis (artificialanalysis.ai/agents/coding-agents). Figures are Glsrm editorial estimates calibrated to our model benchmark table, not Artificial Analysis's published results. Cost per task is derived from each run's mean token mix at list API prices, with cache reads billed at 10% of the input rate. Prices change frequently. Logos identify the respective model creators.