Table of Contents
- What GPT-5.5 Actually Is
- The Benchmarks That Matter
- GPT-5.5 Proved a Mathematical Theorem
- Pricing: Double the Cost, Half the Tokens
- The NVIDIA Infrastructure Behind It
- GPT-5.5 vs Claude Opus 4.7: Where Each Model Wins
- Safety Classification: High for Cyber and Bio
- The 47-Minute Leak That Preceded the Launch
- DeepSeek V4 Launched the Next Day
- What This Means for Enterprise Buyers
- FAQ
OpenAI released GPT-5.5 on April 23, 2026, and for the first time the company is not pitching a smarter chatbot. It is pitching an autonomous agent that can plan multi-step tasks, use tools, check its own work, and keep running until the job is done. The shift is not subtle. OpenAI’s own announcement says the company has “stopped selling a chat completion API and started selling an agent.” That single sentence rewrites the competitive landscape for every developer, enterprise buyer, and platform company building on large language models.
GPT-5.5 scores 82.7% on Terminal-Bench 2.0 (a command-line workflow benchmark testing planning, iteration, and tool coordination), 58.6% on SWE-Bench Pro for real GitHub issue resolution, and 78.7% on OSWorld-Verified for operating real computer environments. A customized version of the model discovered a new mathematical proof involving Ramsey numbers, later verified in Lean. This is not incremental. This is a company repositioning its entire product around agentic execution.
What GPT-5.5 Actually Is
GPT-5.5 ships in two tiers. The standard model rolls out to Plus, Pro, Business, Enterprise, Edu, and Go users in both ChatGPT and Codex with a 400K context window. GPT-5.5 Pro, the heavier variant, is available to Pro, Business, and Enterprise users with a full 1M context window.
The core difference from GPT-5.4 is architectural, not just parametric. OpenAI describes GPT-5.5 as a model built for “reasoning across context and taking action over time.” In practical terms, that means you can hand it a messy, multi-part task and trust it to decompose the problem, select and use tools, navigate ambiguity, iterate on failures, and continue working autonomously until the task is complete.
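What does that loop look like from the integrating side? OpenAI’s announcement does not publish the interface, so the following is a minimal sketch of the plan-act-observe cycle it describes; `call_model`, `run_tool`, and the message shapes are hypothetical stand-ins, not OpenAI’s API.

```python
# Minimal sketch of an agentic tool loop. call_model() and run_tool() are
# hypothetical stubs; the real GPT-5.5 agent interface is not specified in
# OpenAI's announcement.
import json

def call_model(messages):
    """Hypothetical stub: send conversation state to the model, get one step.

    Assumed to return either {"tool": name, "args": {...}} to request a tool
    call, or {"final": text} when the model judges the task complete.
    """
    raise NotImplementedError

def run_tool(name, args):
    """Hypothetical stub: dispatch to a local tool (shell, browser, editor)."""
    raise NotImplementedError

def run_agent(task, max_steps=25):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(messages)
        if "final" in step:  # the model decided the job is done
            return step["final"]
        result = run_tool(step["tool"], step["args"])
        # Feed the observation back so the model can iterate on failures
        # instead of stopping at the first error.
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("agent did not converge within max_steps")
```

The property that matters is the feedback edge: tool output re-enters the context, which is what turns a single completion into an agent that keeps working until the task is done.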
OpenAI’s chief research officer Mark Chen framed the capability shift around scientific workflows: “GPT-5.5 shows meaningful gains on scientific and technical research workflows. We believe it could really help expert scientists make progress.” That claim is backed by a specific result. Immunology professor Derya Unutmaz at the Jackson Laboratory for Genomic Medicine used GPT-5.5 Pro to analyze a gene-expression dataset spanning 62 samples and nearly 28,000 genes. The model produced a detailed research report that Unutmaz said would have taken his team months to assemble manually.
The Benchmarks That Matter
Raw benchmark numbers tell the story of where GPT-5.5 excels and where it does not.
Agentic coding and computer use:
– Terminal-Bench 2.0: 82.7% (vs. 69.4% for Claude Opus 4.7)
– OSWorld-Verified: 78.7% for operating real computer environments
– SWE-Bench Pro: 58.6% for real-world GitHub issue resolution (vs. 64.3% for Claude Opus 4.7, which leads here)
Mathematics and reasoning:
– FrontierMath Tier 4: 39.6% for GPT-5.5 Pro (vs. 22.9% for Claude Opus 4.7), nearly doubling the previous best
– Ramsey number proof discovery, verified in the Lean proof assistant
Long-context performance:
– MRCR v2 8-needle at 512K to 1M tokens: 74.0% (vs. 32.2% for Opus 4.7)
– MRCR v2 at 256K to 512K tokens: 87.5% (vs. 59.2% for Opus 4.7)
The long-context numbers are particularly significant. At the 512K to 1M token range, GPT-5.5 more than doubles Opus 4.7’s score. For enterprise teams working with massive codebases or document sets, that gap changes which workflows are even possible.
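One practical corollary: before assuming a whole-codebase workflow fits, estimate its token count. A minimal sketch, assuming a rule-of-thumb 4 bytes per token for source text (run a real tokenizer for exact numbers):

```python
# Rough fit check: does a repo fit in a 400K or 1M token context window?
# BYTES_PER_TOKEN is a rule-of-thumb assumption, not a tokenizer measurement.
import os

BYTES_PER_TOKEN = 4

def estimate_tokens(root, exts=(".py", ".ts", ".go", ".md")):
    """Estimate the token count of all matching files under `root`."""
    total_bytes = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(exts):
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return total_bytes // BYTES_PER_TOKEN

# tokens = estimate_tokens("path/to/repo")
# print(tokens, "fits in a 1M window:", tokens <= 1_000_000)
```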
GPT-5.5 Proved a Mathematical Theorem
The most striking demonstration of GPT-5.5’s capabilities is not a benchmark score. A customized version of the model discovered a new proof related to off-diagonal Ramsey numbers, a longstanding open problem in combinatorics. The proof was subsequently verified using the Lean proof assistant, making it a formally validated mathematical contribution.
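For readers outside combinatorics, here is the standard background; this is the definition and the classical bound, not the new result, and OpenAI’s announcement does not say which specific bound the proof improves:

```latex
% Standard background, not the new result: off-diagonal Ramsey numbers
% and the classical Erdos--Szekeres upper bound.
R(s,t) = \min\{\, n : \text{every red/blue edge-coloring of } K_n
\text{ contains a red } K_s \text{ or a blue } K_t \,\},
\qquad
R(s,t) \le \binom{s+t-2}{s-1}.
```

“Off-diagonal” means $s \neq t$. Progress on these bounds has historically been slow and hard-won, which is why a machine-found, Lean-verified improvement draws attention.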
This matters because it crosses a threshold. Prior models could assist with mathematical reasoning; GPT-5.5 produced a result that human mathematicians had not previously found. OpenAI is positioning this as evidence that frontier models are becoming genuine research tools, not just research assistants.
The question for the industry is whether this is a repeatable capability or a cherry-picked demonstration. OpenAI’s track record with mathematical discovery claims (the Ernest Ryu collaboration on a 40-year-old open problem using GPT-5) suggests a pattern, not an anomaly. But the gap between “model found one proof” and “model reliably accelerates mathematical research” is still substantial.
Pricing: Double the Cost, Half the Tokens
GPT-5.5’s API pricing reflects OpenAI’s confidence that the model delivers enough value to command a premium.
| Tier | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-5.5 Standard | $5.00 | $30.00 |
| GPT-5.5 Pro | $30.00 | $180.00 |
| Batch/Flex (Standard) | $2.50 | $15.00 |
| Priority (Standard) | $12.50 | $75.00 |
The sticker shock on Pro is real: $180 per million output tokens is 6x the standard tier. But OpenAI’s counter-argument centers on token efficiency. The company claims GPT-5.5 uses 72% fewer output tokens than Claude Opus 4.7 on equivalent coding tasks. If that holds across workloads, the effective cost per completed task could be lower despite the higher per-token price.
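The arithmetic is easy to check. A minimal sketch, assuming the 72% figure holds and counting output tokens only:

```python
# Back-of-envelope effective cost per task, assuming OpenAI's claim that
# GPT-5.5 uses 72% fewer output tokens than Claude Opus 4.7 on equivalent
# tasks. Input-token costs are ignored for simplicity.

GPT55_OUTPUT_PRICE = 30.00  # $ per 1M output tokens, standard tier
TOKEN_REDUCTION = 0.72      # OpenAI's claimed reduction vs. Opus 4.7

def gpt55_cost_per_task(opus_output_tokens):
    """Output cost of a task that would emit `opus_output_tokens` on Opus."""
    gpt55_tokens = opus_output_tokens * (1 - TOKEN_REDUCTION)
    return gpt55_tokens / 1_000_000 * GPT55_OUTPUT_PRICE

# A task that emits 1M output tokens on Opus costs about $8.40 on GPT-5.5:
print(round(gpt55_cost_per_task(1_000_000), 2))  # 8.4
```

In other words, GPT-5.5 standard undercuts any model priced above roughly $8.40 per million output tokens on such tasks, but only if the 72% efficiency claim holds for your workload; that is the number procurement teams should verify first.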
For Codex users, OpenAI says GPT-5.5 delivers better results with fewer tokens than GPT-5.4 “for most users.” The model also matches GPT-5.4’s per-token latency, meaning you get more intelligence at the same speed.
The pricing strategy is a bet that the market has matured past per-token comparisons. OpenAI is selling outcomes, not tokens. Whether enterprise procurement teams agree will determine if this pricing sticks.
The NVIDIA Infrastructure Behind It
GPT-5.5’s Codex integration runs on NVIDIA’s GB200 NVL72 rack-scale systems, and the performance numbers from that hardware are worth noting. NVIDIA reports 35x lower cost per million tokens and 50x higher token output per second per megawatt compared with prior-generation systems.
The OpenAI and NVIDIA partnership now spans a decade, dating back to 2016, when Jensen Huang hand-delivered the first DGX-1 to OpenAI’s San Francisco headquarters. OpenAI has committed to deploying more than 10 gigawatts of NVIDIA systems for its next-generation AI infrastructure.
That 10-gigawatt figure deserves context. A single nuclear power plant typically generates about 1 gigawatt. OpenAI is planning compute infrastructure equivalent to 10 nuclear plants. The scale of investment required to serve GPT-5.5 at production volume explains both the pricing and the urgency behind OpenAI’s $122 billion funding round earlier this year.
GPT-5.5 vs Claude Opus 4.7: Where Each Model Wins
The competitive picture is more nuanced than any single benchmark suggests. Across 10 benchmarks that both OpenAI and Anthropic report, Opus 4.7 leads on 6 and GPT-5.5 leads on 4. The split reveals a strategic divergence.
Where GPT-5.5 wins:
– Long-running tool use and autonomous workflows (Terminal-Bench 2.0)
– Long-context retrieval and reasoning (MRCR v2 at all token ranges)
– Computer use and environment operation (OSWorld-Verified)
– FrontierMath Tier 4 (Pro variant)
Where Claude Opus 4.7 wins:
– SWE-Bench Pro: 64.3% vs. 58.6% for precise code resolution
– Reasoning-heavy benchmarks and instruction-following tasks
– Review-grade analysis tasks
The pattern: GPT-5.5 excels when the task requires sustained autonomous execution across tools and environments. Opus 4.7 excels when the task requires precise, careful reasoning within a defined scope. Both models represent genuine leaps, but they are leaping in different directions.
For enterprise AI buyers evaluating their model stack, the implication is clear: the “best model” question is now the wrong question. The right question is which model matches your workflow pattern.
Safety Classification: High for Cyber and Bio
OpenAI classified GPT-5.5’s biological/chemical and cybersecurity capabilities as “High” under its Preparedness Framework. That is the second-highest tier, one step below “Critical,” which would trigger deployment restrictions.
The company evaluated the model across its full safety suite, brought in internal and external red teamers, ran targeted testing for cybersecurity and biology capabilities, and gathered feedback from nearly 200 trusted early-access partners. OpenAI’s vice president of research noted the company has been “iterating on cybersecurity safeguards for months” as models have grown more capable.
The regulatory context adds weight to these classifications. The EU AI Act and California’s SB 53 both impose requirements on frontier model safety. OpenAI releasing GPT-5.5 with expanded cybersecurity safeguards is as much a regulatory positioning move as a safety one. With AI regulation accelerating across 34 states, demonstrating proactive safety measures is now table stakes for any frontier lab.
The 47-Minute Leak That Preceded the Launch
On April 22, one day before the official announcement, a routing error exposed GPT-5.5 to public traffic for approximately 47 minutes. The incident gave early observers a preview of the model’s capabilities before OpenAI was ready to present them.
The leak is notable not for what it revealed, since the official launch followed within 24 hours, but for what it signals about OpenAI’s operational tempo. The company is iterating and deploying frontier models at a pace where even internal infrastructure is struggling to keep up. When your release cadence is measured in weeks rather than quarters, the window for controlled rollouts shrinks.
DeepSeek V4 Launched the Next Day
The timing is hard to ignore. One day after GPT-5.5’s announcement, DeepSeek released V4 Flash and V4 Pro on April 24. The Chinese startup unveiled a two-model family under the MIT License: Pro at 1.6 trillion total parameters (49 billion active) and Flash at 284 billion total (13 billion active), both supporting 1M context and dual Thinking/Non-Thinking modes.
DeepSeek V4-Pro scores 80.6% on SWE-bench Verified, within 0.2 points of Claude Opus 4.6. V4-Pro-Max achieves 93.5 on LiveCodeBench Pass@1, the highest score among all models evaluated. The pricing undercuts everyone: Flash at $0.14 per million input tokens and $0.28 output, Pro at $1.74 input and $3.48 output.
Compare that to GPT-5.5’s $5/$30 pricing. DeepSeek V4-Pro delivers near-frontier performance at roughly a third of GPT-5.5’s input price and close to a tenth of its output price, and it is open-source under the MIT License. The architecture innovations are significant too: a Hybrid Attention Architecture requiring only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek V3.2.
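To make the gap concrete, here is a hedged per-request comparison using the prices quoted above; the 10K-input/2K-output request shape is an illustrative assumption, not a measured workload:

```python
# Per-request cost at the prices quoted in this article.
# The request shape (10K input / 2K output tokens) is an assumption.

PRICES = {  # $ per 1M tokens: (input, output)
    "gpt-5.5":           (5.00, 30.00),
    "deepseek-v4-pro":   (1.74, 3.48),
    "deepseek-v4-flash": (0.14, 0.28),
}

def request_cost(model, input_tokens, output_tokens):
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

for model in PRICES:
    # roughly: gpt-5.5 $0.110, v4-pro $0.024, v4-flash $0.002 per request
    print(model, round(request_cost(model, 10_000, 2_000), 4))
```

At this request shape the blended gap is wider than the input-price ratio alone suggests, because DeepSeek’s output pricing undercuts GPT-5.5’s by an even larger factor.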
For the industry’s ongoing compute-economics story, the one-two punch of GPT-5.5 and DeepSeek V4 crystallizes the central tension: American labs are building ever-larger models on ever-more-expensive infrastructure, while Chinese competitors find architectural shortcuts that deliver comparable results at a fraction of the cost.
What This Means for Enterprise Buyers
Three immediate implications for teams making model decisions right now:
1. Agent-first architecture is no longer optional. GPT-5.5 is not designed for single-turn chat. It is designed for multi-step, multi-tool workflows that run autonomously. If your integration still treats LLMs as fancy autocomplete, you are leaving the most valuable capabilities on the table.
2. The pricing conversation has shifted from tokens to outcomes. GPT-5.5 Pro at $180 per million output tokens looks expensive until you factor in 72% fewer tokens per task. Enterprise procurement needs to benchmark on cost-per-completed-task, not cost-per-token.
3. Multi-model strategies are now mandatory, not optional. GPT-5.5 dominates autonomous execution. Opus 4.7 dominates precise reasoning. DeepSeek V4 dominates cost efficiency. No single model wins across all dimensions. The winning strategy is routing each task to the model that fits it (sketched below), which is exactly what platforms like Perplexity Computer already do with 20-model orchestration.
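A minimal routing sketch: the task categories and assignments mirror this article’s analysis, while the labels and dispatch mechanism are illustrative assumptions, not vendor guidance.

```python
# Illustrative task-to-model router based on the pattern described above.
# Category names and model identifiers are assumptions for the sketch.

ROUTES = {
    "autonomous_workflow":   "gpt-5.5",            # long-running tool use
    "long_context":          "gpt-5.5",            # 512K-1M token retrieval
    "precise_code_fix":      "claude-opus-4.7",    # SWE-Bench-Pro-style work
    "instruction_following": "claude-opus-4.7",
    "cost_sensitive_batch":  "deepseek-v4-flash",  # cheapest per token
}

def pick_model(task_type: str) -> str:
    """Return the model this analysis matches to a task category."""
    if task_type not in ROUTES:
        raise ValueError(f"no routing rule for task type: {task_type!r}")
    return ROUTES[task_type]

print(pick_model("long_context"))  # -> gpt-5.5
```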
The broader signal is structural. OpenAI, Anthropic, and DeepSeek are all building toward agentic AI, but taking radically different paths to get there. OpenAI is betting on scale and integration. Anthropic is betting on precision and safety. DeepSeek is betting on efficiency and openness. The next 12 months will reveal which bet pays off, but the safest enterprise play is to avoid going all-in on any single provider.
FAQ
What is GPT-5.5 and how is it different from GPT-5.4?
GPT-5.5 is OpenAI’s latest frontier model, released April 23, 2026. Unlike GPT-5.4, it is designed as an autonomous agent that can plan multi-step tasks, use tools, iterate on errors, and work across multiple applications until a job is complete. It also supports a 1M context window (Pro tier) and delivers significantly better long-context performance.
How much does GPT-5.5 cost?
The standard API tier is $5 per million input tokens and $30 per million output tokens. GPT-5.5 Pro costs $30 input and $180 output per million tokens. Batch and Flex processing is available at half the standard rate. ChatGPT Plus, Pro, Business, and Enterprise subscribers get access through ChatGPT and Codex.
Is GPT-5.5 better than Claude Opus 4.7?
It depends on the task. GPT-5.5 leads on autonomous workflow execution, long-context retrieval, and computer use benchmarks. Claude Opus 4.7 leads on precise code resolution (SWE-Bench Pro), reasoning-heavy tasks, and instruction-following. Across 10 shared benchmarks, Opus leads on 6 and GPT-5.5 leads on 4.
What mathematical theorem did GPT-5.5 prove?
A customized version of GPT-5.5 discovered a new proof related to off-diagonal Ramsey numbers, a longstanding open problem in combinatorics. The proof was formally verified using the Lean proof assistant, making it a validated mathematical contribution.
How does GPT-5.5 compare to DeepSeek V4 on price?
DeepSeek V4 Flash costs $0.14 per million input tokens, roughly 36x cheaper than GPT-5.5 standard. DeepSeek V4 Pro costs $1.74 per million input tokens, about 3x cheaper. Both DeepSeek models are open-source under MIT License and deliver near-frontier performance on coding benchmarks.
