Anthropic’s Claude Opus 4.5 scored 80.9% on SWE-bench Verified, ahead of GPT-5.1 at 76.3%, according to Anthropic’s system card. That single chart launched a thousand “Claude is better for coding” takes.
Scale AI tells a different story. On its independent, standardized SEAL version of SWE-bench Pro, OpenAI’s GPT-5.4 leads at 59.1% while Claude Opus 4.6 sits in third at 51.9%. The honest answer to which tool codes better depends on which leaderboard you trust, what you pay per resolved task, and how you work. The figures below cover head-to-head benchmarks from Anthropic and OpenAI, pricing, context windows, and use-case verdicts.
Key Takeaways
- Claude Opus 4.5 reached 80.9% on SWE-bench Verified versus GPT-5.1’s 76.3% on Anthropic’s vendor-reported benchmark.
- According to Scale AI’s standardized SEAL board, GPT-5.4 leads SWE-bench Pro at 59.1%, ahead of Claude Opus 4.6 at 51.9% under a neutral harness.
- OpenAI’s GPT-5 costs $1.25 per million input tokens and $10 per million output tokens, while Anthropic’s Opus 4.5 is priced at $5/$25 per million tokens.
- GPT-5 models accept up to 272,000 input tokens for a total context of 400,000 tokens, double Opus 4.5’s 200,000-token context window.
- GPT-5 scored 88% on Aider polyglot, while Opus 4.5 leads multi-language editing at 89.4% on Anthropic-reported results.
- Claude Opus 4.5 posted 59.3% on Terminal-Bench versus GPT-5.1’s 47.6%, a wide command-line gap.
Editor’s Choice
- SWE-bench Verified, vendor-reported: Claude Opus 4.5 80.9%, GPT-5.1 76.3%, Gemini 3 Pro 76.2%.
- SWE-bench Pro, standardized: GPT-5.4 (xHigh) 59.1%, Claude Opus 4.6 (thinking) 51.9%, Claude Opus 4.5 45.9%.
- Input pricing: GPT-5 at $1.25 per million tokens is one-quarter of Opus 4.5’s $5.
- Token efficiency: Opus 4.5 at medium effort matches Sonnet 4.5’s best SWE-bench Verified score using 76% fewer output tokens.
- Cached input: GPT-5.1 cached input drops 90% from $1.25 to $0.125 per million tokens.
The Quick Verdict: Which Is Better for Coding?
Claude’s 80.9% SWE-bench Verified leads vendor-reported coding benchmarks while ChatGPT’s $1.25 input price wins on cost, so the better tool depends on the job. Anthropic calls Opus 4.5 “the best model in the world for coding, agents, and computer use”, and OpenAI describes GPT-5 as “our best model yet for coding and agentic tasks”. Both claims are defensible because they measure different things.
If your work demands precision on hard, real-world software-engineering tasks, Claude’s vendor numbers lead, according to Anthropic’s published benchmarks. If you run cost-sensitive, high-volume work or need large-repository context, GPT-5 wins on its $1.25/$10 pricing and 400,000-token context, per OpenAI’s developer documentation. Our AI benchmark coverage shows capability rankings shift roughly every six months, so treat any single-day leaderboard as a snapshot, not a verdict.
By the numbers: Anthropic reports Opus 4.5 at 80.9% on SWE-bench Verified against GPT-5.1’s 76.3%, yet Scale AI’s standardized SWE-bench Pro puts GPT-5.4 first at 59.1% with Opus 4.6 third at 51.9%. The gap between those two stories is mostly a difference in test harness, not raw coding skill.
Claude for Coding: What the Data Shows
Claude Opus 4.5 anchors Anthropic’s coding pitch, posting 80.9% on SWE-bench Verified on the company’s own scaffold and 89.4% on the Aider polyglot multi-language editing test, according to Anthropic. The Claude AI statistics data tracks how that coding lead translates into wider adoption. Anthropic released Opus 4.5 on November 24, 2025, with a new effort parameter that trades speed against capability.
The standout efficiency claim is real and measurable. Set to medium effort, Opus 4.5 matches Sonnet 4.5’s best SWE-bench Verified score while using 76% fewer output tokens, and at its highest effort level, it exceeds Sonnet 4.5 by 4.3 percentage points. For developers, that means tunable cost control on the same model.
- Highest vendor SWE-bench Verified score at 80.9% in the November 2025 wave.
- Effort parameter cuts output tokens by 76% at matched performance.
- Leads command-line work on Terminal-Bench at 59.3%.
- Claude Code added Plan Mode upgrades and desktop app availability for terminal-native workflows.
- Output pricing of $25 per million tokens is the highest of the pair.
- Context window of 200,000 tokens is half GPT-5’s total.
- On the standardized Scale AI board, Opus 4.5 scores just 45.9% on SWE-bench Pro, well below its vendor figure.
ChatGPT for Coding: What the Data Shows
OpenAI’s GPT-5 line trades a small benchmark deficit for a large price and context advantage, scoring 74.9% on SWE-bench Verified and 88% on Aider polyglot according to OpenAI. OpenAI launched GPT-5.1 on November 13, 2025, which lifted SWE-bench Verified to 76.3%.
The economics favor ChatGPT for high-volume coding. GPT-5 costs $1.25 per million input tokens and $10 per million output tokens, and GPT-5.1 cached input falls 90% to $0.125 per million tokens for repeated context. OpenAI also exposes a reasoning_effort parameter with a “minimal” setting, plus a verbosity control with low, medium, and high levels.
- Largest context of the pair at 400,000 total tokens.
- Input pricing of $1.25 per million tokens is one-quarter of Opus 4.5’s.
- Leads the standardized Scale AI SWE-bench Pro board, where GPT-5.4 scores 59.1%.
- Codex ships VS Code and Cursor extensions, a web app, Slack integration, and a GitHub app with inline review.
- Vendor SWE-bench Verified of 74.9% for GPT-5 trails Opus 4.5’s headline number.
- Terminal-Bench result of 47.6% for GPT-5.1 lags Claude by a wide margin.
- The reasoning_effort and verbosity parameters offer only partial control over the cost-versus-capability trade, with no single knob as cleanly framed as Anthropic’s effort parameter.
SWE-bench Verified: Claude vs ChatGPT Head-to-Head
Claude Opus 4.5 leads every model in the November 2025 wave on vendor-reported SWE-bench Verified, scoring 80.9% against GPT-5.1’s 76.3% and Gemini 3 Pro’s 76.2%, per Anthropic’s system card. The benchmark measures whether a model can resolve real GitHub issues, so a 4.6-percentage-point lead is meaningful on hard engineering tasks.
| Model | SWE-bench Verified | Source basis |
|---|---|---|
| Claude Opus 4.5 | 80.9% | Anthropic system card |
| GPT-5.1 | 76.3% | Anthropic system card comparison |
| Gemini 3 Pro | 76.2% | Anthropic system card comparison |
| GPT-5 | 74.9% | OpenAI announcement |
Source: Anthropic Claude Opus 4.5 system card; OpenAI GPT-5 announcement, 2025
These are all vendor-reported or vendor-comparison figures. Each vendor tunes its scaffold to its own models, which is exactly why the independent standardized view below tells a different story. The best AI coding tools roundup tracks how those harness choices ripple through real developer adoption.
SWE-bench Pro: The Independent Standardized View
GPT-5.4 leads the standardized Scale AI SEAL version of SWE-bench Pro at 59.1%, ahead of Claude Opus 4.6 at 51.9% and GPT-5 at 41.8%, a ranking that inverts the vendor-reported Verified results. SWE-bench Pro uses harder tasks and a single neutral harness, so the scores run lower across the board.
| Model | SWE-bench Pro (standardized) | Set |
|---|---|---|
| GPT-5.4 (xHigh) | 59.1% | Scale AI SEAL public |
| Claude Opus 4.6 (thinking) | 51.9% | Scale AI SEAL public |
| Claude Opus 4.5 | 45.9% | Scale AI SEAL public |
| GPT-5 (High) | 41.8% | Scale AI SEAL public |
Source: Scale AI SEAL SWE-bench Pro leaderboard, June 2026
Vendor-reported Pro numbers run higher: Claude Opus 4.8 reaches 69.2% and GPT-5.5 58.6% on each lab’s own scaffold. Comparing a vendor scaffold score to a standardized score is the single most common way coding-model debates go wrong.
Why it matters: The same two labs swap places depending on the harness. Anthropic wins its own SWE-bench Verified chart at 80.9%, while OpenAI wins Scale AI’s standardized SWE-bench Pro at 59.1%, and neither result is wrong. For procurement, the standardized board is the more conservative basis for a decision.
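The harness flip is easy to reproduce from the two tables above. The following is a minimal sketch that only includes the models and scores quoted in this article (the real leaderboards list more entries), and the helper names `rank` and `lead_in_points` are illustrative, not part of any benchmark tooling.

```python
# Scores quoted in this article, in percent. Other leaderboard entries are omitted.
SWE_BENCH_VERIFIED = {
    "Claude Opus 4.5": 80.9,
    "GPT-5.1": 76.3,
    "Gemini 3 Pro": 76.2,
    "GPT-5": 74.9,
}
SWE_BENCH_PRO_STANDARDIZED = {
    "GPT-5.4 (xHigh)": 59.1,
    "Claude Opus 4.6 (thinking)": 51.9,
    "Claude Opus 4.5": 45.9,
    "GPT-5 (High)": 41.8,
}


def rank(scores):
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def lead_in_points(scores):
    """Percentage-point gap between the top two entries of a board."""
    ranked = rank(scores)
    return round(ranked[0][1] - ranked[1][1], 1)


if __name__ == "__main__":
    boards = {
        "SWE-bench Verified (vendor-reported)": SWE_BENCH_VERIFIED,
        "SWE-bench Pro (standardized)": SWE_BENCH_PRO_STANDARDIZED,
    }
    for title, scores in boards.items():
        leader, score = rank(scores)[0]
        print(f"{title}: {leader} leads at {score}% "
              f"(+{lead_in_points(scores)} points over the next listed model)")
```

Run as a script, it reports an Anthropic model on top of one board and an OpenAI model on top of the other, which is the whole point: the leader depends on which dictionary you sort.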
Aider Polyglot and Terminal-Bench: Multi-Language and Command-Line
GPT-5 scored 88% on Aider polyglot while Opus 4.5 leads multi-language editing at 89.4%, a near-tie that diverges sharply at the command line. On Terminal-Bench, the gap widens: Opus 4.5 hit 59.3% versus GPT-5.1’s 47.6%.
That 11.7-point Terminal-Bench gap matters for agentic work where the model drives a shell directly. Aider polyglot rewards clean multi-file edits, where both models are strong; Terminal-Bench rewards autonomous command execution, where Claude’s terminal-native design shows. Command-line autonomy is the fastest-growing slice of developer AI usage.
Context Window: 200,000 vs 400,000 Tokens
GPT-5 doubles Claude Opus 4.5 on raw context, accepting up to 272,000 input tokens and 128,000 output tokens for a 400,000-token total against Opus 4.5’s 200,000-token context window. For large monorepos or long agent runs, that headroom reduces how often context must be chunked.
More context is not free, though: every token in the window is billed. The price gap still favors GPT-5 here. At list price, a full 272,000-token prompt on GPT-5 costs about $0.34 in input, while a 200,000-token prompt on Opus 4.5 costs $1.00.
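To make the context and billing trade-off concrete, here is a rough sketch that checks whether a prompt fits a model’s input limit and what it would cost at list price. The `ModelLimits` class and its field names are hypothetical, and treating Opus 4.5’s whole 200,000-token window as usable input is a simplifying assumption, since a real request must also leave room for output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelLimits:
    name: str
    max_input_tokens: int
    input_price_per_million: float  # USD, list price


# Figures quoted in this article.
GPT_5 = ModelLimits("GPT-5", 272_000, 1.25)
# Assumption: the full 200,000-token window is counted as input capacity.
OPUS_4_5 = ModelLimits("Claude Opus 4.5", 200_000, 5.00)


def fits(model, prompt_tokens):
    """True if the prompt is within the model's input limit."""
    return prompt_tokens <= model.max_input_tokens


def input_cost_usd(model, prompt_tokens):
    """List-price input cost of a single prompt, in USD."""
    return prompt_tokens / 1_000_000 * model.input_price_per_million


if __name__ == "__main__":
    prompt_tokens = 250_000  # hypothetical monorepo-sized prompt
    for model in (GPT_5, OPUS_4_5):
        if fits(model, prompt_tokens):
            print(f"{model.name}: fits, ${input_cost_usd(model, prompt_tokens):.4f} input")
        else:
            print(f"{model.name}: does not fit, chunking required")
```

A prompt that overflows one model has to be split into several requests, so the per-token price is only part of the comparison; the number of calls matters too.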
Agentic and Tool Use: Claude Code vs OpenAI Codex
Claude Opus 4.5’s 59.3% Terminal-Bench score reflects a local-versus-cloud philosophy that splits Claude Code and OpenAI Codex more than raw capability does. Claude Code runs as a terminal-based agent inside the developer’s actual environment, reading files, executing commands, running tests, and modifying code in the local context. Codex offers a full IDE extension, a web app, Slack integration, and a GitHub app with inline code review.
Anthropic shipped concrete agent upgrades with Opus 4.5: Claude Code gained Plan Mode improvements and desktop app availability. The practical split is that Claude Code keeps a human in the loop on a local machine, while Codex is built to be delegated to, including from a Slack thread.
Best for local, in-the-loop work: Claude Code suits developers who want an agent in their own terminal with tests and files in a real context, backed by Opus 4.5’s 59.3% Terminal-Bench result.
IDE Integrations and Developer Workflow
Codex integrates through VS Code and Cursor extensions, a web app, and a GitHub app, and it is open source on GitHub, spreading across more entry points than Claude Code, which concentrates on terminal depth. Claude Code adds a VS Code extension, a web IDE for remote control, parallel desktop sessions, and a Cowork feature.
For teams already living in Slack and GitHub pull requests, Codex’s surface area is a real workflow advantage; for solo developers who prefer the terminal, Claude Code’s depth wins.
Best for Slack and GitHub-native teams: OpenAI’s Codex fits orchestration-style workflows, with mention-to-trigger Slack support and inline GitHub review baked in.
Pricing: Cost per Million Tokens
GPT-5 costs $1.25 input and $10 output per million tokens, one-quarter and two-fifths of Opus 4.5’s $5/$25, a price gap that reframes the comparison around cost-per-resolved-task. Anthropic cut Opus pricing sharply at the 4.5 launch, narrowing but not closing that gap.
Caching widens the gap further: GPT-5.1 cached input drops 90% to $0.125 per million tokens. At list price, Opus 4.5 input costs 4 times GPT-5 input. If a developer’s workload resolves a task in roughly equal turns on both models, GPT-5’s price advantage compounds across every run.
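A back-of-the-envelope calculator shows how these list prices turn into cost per task. This is a sketch under stated assumptions: the workload of 100,000 input tokens and 10,000 output tokens is hypothetical, the 80% cache-hit share is invented for illustration, and GPT-5.1 is assumed to share GPT-5’s $10 output price. No cached rate is shown for Opus 4.5 because this article does not quote one.

```python
def task_cost_usd(input_tokens, output_tokens, input_price, output_price,
                  cached_fraction=0.0, cached_input_price=None):
    """Estimate the list-price cost of one task in USD.

    Prices are USD per million tokens. cached_fraction is the share of
    input tokens billed at cached_input_price instead of input_price.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if cached_fraction > 0 and cached_input_price is None:
        raise ValueError("cached_input_price is required when cached_fraction > 0")
    cached_tokens = input_tokens * cached_fraction
    fresh_tokens = input_tokens - cached_tokens
    cost = fresh_tokens * input_price + output_tokens * output_price
    if cached_tokens:
        cost += cached_tokens * cached_input_price
    return cost / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 100,000 input tokens, 10,000 output tokens per task.
    gpt5 = task_cost_usd(100_000, 10_000, 1.25, 10.00)
    opus = task_cost_usd(100_000, 10_000, 5.00, 25.00)
    # Assumption: 80% of the input is repeated context served from cache.
    gpt51_cached = task_cost_usd(100_000, 10_000, 1.25, 10.00,
                                 cached_fraction=0.8, cached_input_price=0.125)
    print(f"GPT-5, no cache:     ${gpt5:.3f}")
    print(f"Opus 4.5, no cache:  ${opus:.3f}")
    print(f"GPT-5.1, 80% cached: ${gpt51_cached:.3f}")
```

Under those assumptions the uncached task costs about $0.225 on GPT-5 and $0.75 on Opus 4.5, and caching brings the GPT-5.1 figure down to about $0.135. Swap in your own token counts before drawing conclusions, since the ratio shifts with the input-to-output mix.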
The takeaway: Anthropic’s effort parameter is the counterweight to its higher sticker price. Cutting output tokens by 76% at matched SWE-bench Verified performance narrows the real-world cost gap, even though per-token list pricing still favors GPT-5. A team that tunes effort down on routine tasks can close much of the distance to OpenAI’s cheaper per-token rate without changing models.
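The effect of the effort parameter on cost can be sketched the same way. Note the assumption: Anthropic’s 76% figure compares Opus 4.5 at medium effort with Sonnet 4.5’s best SWE-bench Verified run, so applying it to an arbitrary baseline output count, as done here, is an illustration rather than a measured result. The token counts and function names are hypothetical.

```python
OPUS_INPUT, OPUS_OUTPUT = 5.00, 25.00  # USD per million tokens (Opus 4.5)
GPT5_INPUT, GPT5_OUTPUT = 1.25, 10.00  # USD per million tokens (GPT-5)


def tokens_after_reduction(baseline_output_tokens, reduction=0.76):
    """Output tokens left after cutting the baseline by the given fraction."""
    return round(baseline_output_tokens * (1.0 - reduction))


def cost_usd(input_tokens, output_tokens, input_price, output_price):
    """List-price cost of one task in USD (prices are per million tokens)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    input_tokens = 100_000    # hypothetical
    baseline_output = 10_000  # hypothetical output count before tuning effort
    reduced_output = tokens_after_reduction(baseline_output)

    opus_full = cost_usd(input_tokens, baseline_output, OPUS_INPUT, OPUS_OUTPUT)
    opus_tuned = cost_usd(input_tokens, reduced_output, OPUS_INPUT, OPUS_OUTPUT)
    gpt5 = cost_usd(input_tokens, baseline_output, GPT5_INPUT, GPT5_OUTPUT)

    print(f"Opus 4.5, baseline output: ${opus_full:.3f}")
    print(f"Opus 4.5, 76% fewer output tokens: ${opus_tuned:.3f}")
    print(f"GPT-5, baseline output: ${gpt5:.3f}")
```

With these numbers the Opus 4.5 task drops from about $0.75 to about $0.56, against about $0.225 for GPT-5. The output side alone becomes cheaper on Opus 4.5 ($0.06 versus $0.10), but the 4x input price keeps the total higher, which matches the claim that effort tuning narrows the gap without closing it.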
Verdict by Use Case
Anthropic’s 80.9% SWE-bench Verified edge favors precision tasks, while OpenAI’s $1.25 input price and 400,000-token context favor volume and scale. The right model tracks the workload rather than a single leaderboard.
- Best for hardest engineering tasks: Claude Opus 4.5, on the strength of its 80.9% vendor SWE-bench Verified score and 59.3% Terminal-Bench result.
- Best for cost-sensitive, high-volume coding: GPT-5, where $1.25 input pricing and a 90% cached-input discount cut cost-per-task.
- Best for large-repository agentic runs: GPT-5, whose 400,000-token context reduces chunking on monorepos and long agent loops.
Benchmark-shopping risk: Vendor-reported SWE-bench Verified (Anthropic 80.9%) and standardized SWE-bench Pro (Scale AI, GPT-5.4 59.1%) are not interchangeable. Choosing a model on a vendor chart alone can mislead procurement; weigh the standardized harness for high-stakes decisions.
What Is the Best AI Model for Coding?
No model holds an undisputed crown as of mid-2026. On vendor-reported benchmarks, Anthropic’s Claude Opus family leads: Opus 4.5 tops SWE-bench Verified at 80.9%, and Opus 4.8 reaches 69.2% on Anthropic’s own SWE-bench Pro scaffold. On the independent standardized Scale AI board, GPT-5.4 leads at 59.1%. The best choice tracks your harness trust, budget, and workflow, not a single headline.
Is Claude Opus Better Than GPT-5 at SWE-bench?
It depends on which SWE-bench. On SWE-bench Verified, Claude Opus 4.5 scored 80.9% versus GPT-5.1’s 76.3%, while GPT-5 scored 74.9% on OpenAI’s reporting. On the harder, standardized SWE-bench Pro, GPT-5.4 leads at 59.1% while Opus 4.5 scores 45.9%. Claude wins the vendor Verified test; OpenAI wins the standardized Pro test.
How Much Cheaper Is ChatGPT Than Claude for Coding?
GPT-5 costs $1.25 per million input tokens against Opus 4.5’s $5, one-quarter of the price, and $10 per million output tokens against $25, two-fifths of the price. With GPT-5.1’s 90% cached-input discount to $0.125, repeated-context workloads widen the gap further.
Conclusion
The coding crown splits cleanly along the line between vendor charts and independent harnesses. Anthropic’s Claude Opus 4.5 leads vendor-reported SWE-bench Verified at 80.9% and command-line Terminal-Bench at 59.3%. OpenAI’s GPT-5 line wins the standardized Scale AI SWE-bench Pro board through GPT-5.4 at 59.1%, charges a quarter of the input price at $1.25, and offers double the context at 400,000 tokens.
Developers chasing peak accuracy on hard tasks lean on Claude; teams optimizing cost-per-resolved-task and large-repo context lean on ChatGPT. Our AI benchmark coverage shows these rankings reshuffle roughly every six months, so the durable advice is to match the model to the workflow and re-test against a standardized harness each cycle.