Inside Claude Opus 4.7: Anthropic’s Quietest Release Is Its Best Coding Model Yet




87.6% on SWE-bench Verified. 64.3% on SWE-bench Pro. 70% on CursorBench. Claude Opus 4.7 holds the top spot on three of the four benchmarks that actually matter for production coding agents, and Anthropic’s announcement got buried under the news cycle for Cursor 3 Glass, GPT-5.5, and the Cohere–Aleph Alpha merger.

Released April 16, 2026, Opus 4.7 is the boring frontier model. No new modality. No new product surface. No flashy demo. Just better at every benchmark Anthropic has been losing to OpenAI on for the past six months. The model is generally available across the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry on day one.

The benchmarks tell one story. The tool error rate tells a more interesting one. The pricing tells a third.

The Benchmark Tour

Here is what changed against the previous flagship and the closest competitors:

| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 87.6% | 80.8% | 80.6% | n/a |
| SWE-bench Pro | 64.3% | 53.4% | 57.7% | 54.2% |
| CursorBench | 70% | 58% | n/a | n/a |
| MCP-Atlas (tool use) | 77.3% | 75.8% | 68.1% | 73.9% |
| OSWorld-Verified (computer use) | 78.0% | 72.7% | 75.0% | n/a |

The SWE-bench Pro lead is the loudest number. It’s the harder benchmark, the one that uses real GitHub issues from real production codebases, and Opus 4.7 outscores GPT-5.4 by 6.6 points and Gemini 3.1 Pro by 10.1 points. That gap is large enough to actually feel in production. It’s not within the noise floor.

The Real Headline: Tool Error Rates

Anthropic buried the most important number on page two. Opus 4.7 produces about a third of the tool errors of Opus 4.6 on multi-step agentic workflows, while completing 14% more tasks. That ratio matters more than any benchmark.

Anyone who has tried to ship a Claude agent into production knows the problem. The model gets the strategy right and then calls the wrong tool, or calls the right tool with malformed arguments, or hallucinates a tool that doesn’t exist. Each error eats a retry, eats a token budget, and eats user trust. Cutting tool errors by two-thirds is the difference between an agent you demo and an agent you actually ship.
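For context, here is the kind of defensive wrapper those failure modes force onto production agents today. It is a minimal sketch: the TOOLS registry, execute function, and ToolError exception are hypothetical stand-ins for your own tool layer. Every branch exists because of one of the errors above, and a model that makes a third as many of them spends far less time in this code.

```python
# Minimal defensive dispatch covering the three failure modes above.
# TOOLS, execute(), and ToolError are hypothetical stand-ins for your tool layer.
import json

class ToolError(Exception):
    """Raised by the tool layer when a correctly formed call still fails."""

TOOLS = {"read_file", "write_file", "run_tests"}  # stand-in registry

def execute(tool_name: str, args: dict):
    raise ToolError("stub: wire up your real executor here")

def dispatch(tool_name: str, raw_args: str, max_retries: int = 2) -> dict:
    # Failure mode 1: the model hallucinated a tool that does not exist.
    if tool_name not in TOOLS:
        return {"error": f"unknown tool {tool_name!r}"}
    # Failure mode 2: the model produced malformed arguments.
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as exc:
        return {"error": f"malformed arguments: {exc}"}
    # Failure mode 3: the right call fails anyway, and every retry burns
    # tokens, latency, and user patience.
    for attempt in range(max_retries + 1):
        try:
            return {"result": execute(tool_name, args)}
        except ToolError as exc:
            if attempt == max_retries:
                return {"error": str(exc)}
    return {"error": "unreachable"}
```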

Opus 4.7 is also the first Claude model to pass what Anthropic calls “implicit-need tests” — tasks where the model has to infer what tools or actions are required rather than being told explicitly. That sounds academic. In practice it means an agent that can be given a goal instead of a runbook, and that’s the entire pitch for production agentic work in 2026.

The Pricing Trick Anthropic Didn’t Mention

The headline pricing is identical to Opus 4.6: $5 per million input tokens, $25 per million output tokens. That’s the line every announcement led with. It is also misleading.

Opus 4.7 ships with a new tokenizer that uses roughly 1.0x to 1.35x as many tokens to encode the same text. In practice, real-world workloads are seeing token counts about 10% to 30% higher than on Opus 4.6 for identical inputs and outputs. The price per token didn’t move. The number of tokens per task did.

That means a workflow that cost you $1.00 on Opus 4.6 will cost roughly $1.10 to $1.30 on Opus 4.7 for the same input, before counting the larger reasoning traces the better model tends to produce. Anthropic chose to hold the headline price flat rather than pass through a token-rate change. Whether that’s a price increase depends on how generous you want to be with the framing. Your CFO will not be generous.
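The math is easy to sanity-check. A quick sketch using the article's prices; the 150K-input, 10K-output task is an invented example chosen to land at exactly $1.00 on the old tokenizer.

```python
# Back-of-envelope cost under the new tokenizer. Prices are the article's;
# the token counts are invented to land at exactly $1.00 on Opus 4.6.
INPUT_PRICE = 5.00 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 25.00 / 1_000_000  # dollars per output token

def task_cost(input_tokens: int, output_tokens: int, inflation: float = 1.0) -> float:
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) * inflation

base = task_cost(150_000, 10_000)        # Opus 4.6: $1.00
low = task_cost(150_000, 10_000, 1.10)   # Opus 4.7, mild token inflation: $1.10
high = task_cost(150_000, 10_000, 1.30)  # Opus 4.7, heavy token inflation: $1.30
print(f"4.6: ${base:.2f}   4.7: ${low:.2f} to ${high:.2f}")
```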

Vision Just Tripled

Opus 4.7 processes images at resolutions up to 2,576 pixels on the long edge — roughly 3.75 megapixels, more than three times the prior Claude ceiling. The XBOW visual-acuity benchmark jumped from 54.5% on Opus 4.6 to 98.5% on 4.7.

What this unlocks is unglamorous but important: reading real screenshots without losing detail. Bug reports with full-resolution UI captures. Document analysis that doesn’t need OCR preprocessing. Diagram extraction that actually preserves the diagram. The Claude Computer Use story stops being “it kind of works at thumbnail resolution” and starts being “it works at the resolution your laptop displays.”
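Trying it takes nothing new: the Messages API has accepted base64 image blocks for a while. A minimal sketch, assuming the official Python SDK and a placeholder screenshot path:

```python
# Send a full-resolution screenshot to Opus 4.7 via the Messages API.
# Assumes the official anthropic SDK; screenshot.png is a placeholder path.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": image_data}},
            {"type": "text",
             "text": "What error does this screenshot show, and where in the UI?"},
        ],
    }],
)
print(message.content[0].text)
```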

The New xhigh Effort Tier

Opus 4.7 introduces a new reasoning effort setting called xhigh, slotting between high and max. The point is finer control over the latency-quality tradeoff: max burns serious compute and time, high is fast but sometimes shallow, and xhigh gives you most of the depth without the full latency tax.

For agent loops that run a model dozens or hundreds of times per task, the new tier is a bigger deal than it sounds. Dropping from max to xhigh on routine sub-steps and reserving max for the planning step alone can cut multi-step task latency by a noticeable fraction without measurable quality loss. The 1M-token context window and 128K max output stay the same as 4.6.
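Here is a sketch of that split. The effort field is an assumption based on how the tier is described, passed through the SDK's extra_body escape hatch rather than as a first-class argument, since the exact parameter surface may differ:

```python
# Reserve max effort for the one planning call; run the many routine
# sub-steps at xhigh. The effort field name is an assumption based on the
# release notes, so it is passed via extra_body instead of a typed argument.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def call(prompt: str, effort: str) -> str:
    message = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
        extra_body={"effort": effort},  # assumed field name for the new tier
    )
    return message.content[0].text

plan = call("Plan the refactor, one step per line.", effort="max")
for step in plan.splitlines():
    call(f"Execute this step and report the result: {step}", effort="xhigh")
```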

Should You Switch?

If you’re running production coding agents, switch. The benchmark gap is real, the tool error reduction is real, and the cost increase from the tokenizer is small relative to the failure-rate improvement. A retry costs more than a 30% token bump.

If you’re using Opus 4.6 for general writing or analysis, the upgrade is incremental. The vision improvement matters if you process images. Otherwise the case is weaker, and the tokenizer change makes the math less favorable.

If you’re choosing between Opus 4.7 and GPT-5.5, the answer depends on the task. Across roughly 10 shared benchmarks, Opus 4.7 wins on coding, tool reliability, and instruction following. GPT-5.5 wins on autonomous workflow execution, long-context retrieval, and computer use. They are not interchangeable. Our Opus 4.6 vs Sonnet 4.6 piece covers the in-family decision; the cross-vendor decision is task-by-task.

If you’re using Claude Sonnet 4.6 for cost reasons and considering an upgrade to Opus 4.7, look at your actual error logs first. If your Sonnet agent fails more than 10% of multi-step tasks, the tool error reduction in 4.7 will pay for itself. If your Sonnet agent runs cleanly at 95%+ success, the upgrade is harder to justify.

How to Actually Use It

In the Claude API, change your model parameter to claude-opus-4-7. In Bedrock, the new model ID surfaces in the playground and the IAM policy as anthropic.claude-opus-4-7. In Vertex AI, it’s available under publishers/anthropic. In Microsoft Foundry, it appears in the Anthropic catalog tile.
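In Python, with the official SDK, that is a one-parameter change. A minimal example; the prompt is illustrative:

```python
# Switching an existing Messages API call is a one-parameter change.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-opus-4-7",  # previously your Opus 4.6 model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": "Review this diff for bugs: ..."}],
)
print(message.content[0].text)
```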

For Claude Code and Claude Cowork users, the model picker now lists Opus 4.7 as the default Opus tier. Existing usage limits and rate limits carry over. The CLI flag is --model claude-opus-4-7.

Do this first: pick one production agent that has been failing 5% to 15% of tasks, switch its model to Opus 4.7 with effort set to xhigh, and rerun your last week of failed tasks. The success-rate delta will tell you whether to roll out further.
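A sketch of that experiment, with load_failed_tasks and run_agent as hypothetical stand-ins for your own harness, and the rollout threshold left to you:

```python
# Rerun last week's failures on Opus 4.7 and measure the recovery rate.
# load_failed_tasks() and run_agent() are hypothetical stand-ins for your harness.
from dataclasses import dataclass

@dataclass
class Outcome:
    succeeded: bool

def load_failed_tasks(days: int) -> list[str]:
    return []  # stub: pull the tasks your Opus 4.6 agent failed

def run_agent(task: str, model: str, effort: str) -> Outcome:
    return Outcome(succeeded=False)  # stub: run one agent task end to end

def recovery_rate() -> float:
    tasks = load_failed_tasks(days=7)
    wins = sum(run_agent(t, "claude-opus-4-7", "xhigh").succeeded for t in tasks)
    rate = wins / len(tasks) if tasks else 0.0
    print(f"Recovered {wins}/{len(tasks)} previously failed tasks ({rate:.0%})")
    return rate
```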

FAQ

Is Opus 4.7 a free upgrade for existing customers?
The model is available across the existing Claude API and partner platforms with no separate fee. The API price per token is unchanged from Opus 4.6. The new tokenizer, however, means real workloads typically consume 10% to 30% more tokens for the same work, so effective cost rises by roughly that much.

Does Opus 4.7 deprecate Opus 4.6?
Not yet. Opus 4.6 remains available, and Anthropic has not announced a deprecation date. For new projects there is no reason to start on 4.6.

What is the xhigh effort tier?
A new reasoning intensity setting that sits between high and max. It offers most of max's reasoning depth at lower latency. For agent loops with many sub-steps, using xhigh on routine steps and reserving max for planning is the recommended pattern.

Does the larger image resolution affect API pricing?
Image inputs are tokenized for billing. Higher-resolution images consume more tokens. For workflows that process detailed UI screenshots or diagrams, expect a meaningful token bump on image-heavy tasks.
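For a rough estimate: Anthropic's published approximation for earlier Claude models is about width times height divided by 750 tokens per image. Whether that exact ratio carries over to 4.7's new vision stack is an assumption, but it gives the order of magnitude:

```python
# Rough per-image token estimate. The /750 ratio is Anthropic's published
# approximation for earlier Claude models; applying it to Opus 4.7's larger
# resolution ceiling is an assumption.
def image_tokens(width: int, height: int) -> int:
    return round(width * height / 750)

old_ceiling = image_tokens(1092, 1092)  # approximate prior ceiling: ~1590 tokens
new_ceiling = image_tokens(2576, 1449)  # new 16:9 long-edge max: ~4977 tokens
print(old_ceiling, new_ceiling)
```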

Is Opus 4.7 better than GPT-5.5 for coding agents?
On the published benchmarks the head-to-head numbers are against GPT-5.4, where Opus 4.7 leads by 6.6 points on SWE-bench Pro and by 9.2 points on tool reliability (MCP-Atlas). Against GPT-5.5 the picture is task-dependent: for pure coding-agent reliability in production, Opus 4.7 currently has the strongest case, while GPT-5.5 retains an edge on long-running autonomous workflows and computer use.

Can I use Opus 4.7 inside Cursor 3?
Yes. Cursor 3 Glass exposes Opus 4.7 as a selectable model in the Agents Window. The /best-of-n command can include Opus 4.7 alongside GPT-5.5 and Composer 2 for direct comparison.

Ty Sutherland

Ty Sutherland is the Chief Editor of AI Rising Trends. Living in what he believes to be the most transformative era in history, Ty is deeply captivated by the boundless potential of emerging technologies like the metaverse and artificial intelligence. He envisions a future where these innovations seamlessly enhance every facet of human existence. With a fervent desire to champion the adoption of AI for humanity's collective betterment, Ty emphasizes the urgency of integrating AI into our professional and personal spheres, cautioning against the risk of obsolescence for those who lag behind. AI Rising Trends stands as a testament to his mission, dedicated to spotlighting the latest in AI advancements and offering guidance on harnessing these tools to elevate one's life.
