GPT-5.4: What OpenAI’s Latest Model Actually Changes



GPT-5.4 dropped on March 5, 2026, and unlike a lot of OpenAI releases that get buried under marketing noise, this one actually changes some practical math. Not because it’s the most powerful model ever made — OpenAI would never say that — but because it’s meaningfully faster, cheaper per task, and it’s the first version to bring native computer use into Codex in a serious way. If you’re using GPT-5.x in production, or deciding whether to, this is the version where those decisions get interesting.

What GPT-5.4 Actually Is (And How It Fits the 5.x Line)

GPT-5.4 is OpenAI’s latest iteration of the GPT-5 family, sitting above the now-retired GPT-5.1 tier. When OpenAI retired GPT-5.1 Instant, Thinking, and Pro on March 11 — less than a week after the 5.4 launch — that was the real signal. They don’t sunset models unless they’re confident the replacement covers the use cases. That’s not a hedge; it’s product confidence.

The model comes in two flavors accessible through ChatGPT: GPT-5.4 Thinking and GPT-5.4 Pro. It’s also available via API for developers and enterprises building on top of it. The Thinking variant is positioned for complex reasoning tasks where you want the model to work through a problem step by step. Pro is the higher-capability tier, presumably with more compute behind it — though OpenAI hasn’t published detailed specs distinguishing the two publicly.

The headline architectural note is the 1 million token context window. That’s not new to the industry — Gemini 1.5 Pro had a 1M window for a while — but it’s a meaningful jump for OpenAI’s production models. To put it practically: 1 million tokens can hold roughly 750,000 words of text, about one and a half times the entire Lord of the Rings trilogy in a single prompt. For enterprise use cases involving large codebases, legal document analysis, or extended research synthesis, this matters a lot.
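For back-of-envelope planning, a common heuristic is roughly 0.75 English words per token (the ratio varies by tokenizer and content type; code and non-English text run lower). A quick sketch of the conversion:

```python
def tokens_to_words(tokens: int, words_per_token: float = 0.75) -> int:
    """Rough estimate: English prose averages ~0.75 words per token.
    Code and non-English text usually come out lower."""
    return int(tokens * words_per_token)

def words_to_tokens(words: int, words_per_token: float = 0.75) -> int:
    """Inverse estimate: how many tokens a given word count will consume."""
    return int(words / words_per_token)

print(tokens_to_words(1_000_000))  # ~750,000 words in a full 1M-token window
```

Treat these as planning numbers, not exact counts; the only authoritative count comes from the tokenizer itself.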

The Token Efficiency Story Is the Real News

Here’s the thing that’s getting less attention than it deserves: GPT-5.4 uses significantly fewer tokens than GPT-5.2 to accomplish the same tasks. That’s not a minor footnote. That’s a cost structure shift.

If you’re running thousands of API calls per day — coding assistants, document workflows, customer-facing agents — the per-task token count directly determines your monthly bill. A model that achieves the same output in 30-40% fewer tokens isn’t just faster. It’s materially cheaper to operate at scale. For a 50-person startup running GPT-5.x in a product, that might mean the difference between a sustainable unit economics model and a painful one.
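To make the cost shift concrete, here is a minimal sketch of the arithmetic, using hypothetical traffic and pricing numbers (the per-token price below is illustrative, not OpenAI's published pricing):

```python
def monthly_cost(calls_per_day: int, tokens_per_call: int,
                 price_per_1k_tokens: float, days: int = 30) -> float:
    """Total monthly spend for API traffic at a flat per-token price."""
    return calls_per_day * tokens_per_call * days * price_per_1k_tokens / 1000

# Hypothetical numbers for illustration only, not published pricing.
baseline = monthly_cost(calls_per_day=10_000, tokens_per_call=2_000,
                        price_per_1k_tokens=0.01)
efficient = monthly_cost(calls_per_day=10_000, tokens_per_call=1_300,  # ~35% fewer tokens
                         price_per_1k_tokens=0.01)
print(f"baseline: ${baseline:,.0f}/mo, efficient: ${efficient:,.0f}/mo")
```

At these illustrative numbers the same workload drops from $6,000 to $3,900 a month; the percentage saving tracks the token reduction directly because the price is flat per token.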

Speed is the other side of that coin. Fewer tokens in flight means faster time-to-first-token and faster full responses. For anything user-facing — a chatbot, a coding copilot, a document assistant — latency is a UX metric, not just a technical one. Users notice when a model feels snappy versus sluggish, even if they can’t tell you why they prefer one product over another.

This is the kind of incremental improvement that Andrej Karpathy has pointed to before when talking about the “mundane” progress in AI — the quiet work of making models more efficient that compounds over time. It’s less flashy than a new capability, but it’s what makes deployment economics work.

Native Computer Use in Codex: What It Means in Practice

The most technically interesting piece of GPT-5.4 is the native computer use capabilities inside Codex. OpenAI’s Codex platform is now the home for their agentic coding work, and GPT-5.4 enables the model to not just write code, but to actually operate a computer environment to do things with that code.

The clearest current application is OpenAI Codex Security: an autonomous code security review system that uses computer use to analyze codebases. Instead of just reading code and generating a report, the model can interact with the environment — running tests, navigating file structures, executing tools — the way a human security auditor would. That’s a qualitatively different kind of AI assistance than autocomplete.

This matters because security review is one of those tasks that looks simple on the surface but is deeply contextual. Understanding whether a SQL query is vulnerable requires knowing how the data layer is structured, what framework is in play, what sanitization happens upstream. A model that can explore the codebase rather than just read a snippet handed to it has a real advantage here.
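A minimal illustration of why surrounding context matters: the same attacker-controlled string is harmless or dangerous depending on how it reaches the query. This uses Python's built-in sqlite3 and is a generic injection demo, not tied to any specific framework:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "alice' OR '1'='1"  # classic injection payload

# Vulnerable: the payload becomes part of the SQL text itself.
vulnerable = conn.execute(
    f"SELECT role FROM users WHERE name = '{user_input}'").fetchall()

# Safe: the driver binds the value; the payload stays a plain string literal.
safe = conn.execute(
    "SELECT role FROM users WHERE name = ?", (user_input,)).fetchall()

print(vulnerable)  # returns the admin row: the OR clause was executed
print(safe)        # [] because no user is literally named that string
```

Whether the vulnerable pattern is actually reachable depends on what sanitization happens upstream, which is exactly the contextual question a codebase-exploring model can answer and a snippet-reading one cannot.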

Peter Steinberger, the creator of OpenClaw who joined OpenAI on February 14, 2026, is likely working in this neighborhood — though OpenAI hasn’t been specific about his role. Steinberger has a strong background in developer tooling and mobile software infrastructure, which aligns with the direction Codex is heading: practical, production-grade agentic tools for developers, not research demos. If you’re evaluating how this compares to what other labs are building for high-stakes developer tasks, that context is worth having.

The ChatGPT-for-Excel Add-In and the 800 Million User Context

Alongside the GPT-5.4 launch, OpenAI shipped a ChatGPT add-in for Excel. That’s a product decision that tells you a lot about where OpenAI thinks its growth is coming from.

OpenAI is at 800 million weekly users and $25 billion in annualized revenue. Those numbers are striking not just for their size but for their velocity — they suggest a user base that has moved well beyond the tech-forward early adopters. Hundreds of millions of people using Excel daily, many of whom have never touched an API, suddenly have AI inside their most-used work tool. The Excel add-in is a distribution play as much as a product play.

Practically speaking, a ChatGPT-in-Excel integration means things like: explain this formula, write a VLOOKUP for this dataset, build a pivot structure based on these columns, flag anomalies in this data range. These are genuinely useful things that don’t require understanding prompt engineering. They meet people where they already are.

Sam Altman has talked repeatedly about making AI accessible to people who aren’t power users, and the Excel add-in is that philosophy made concrete. Whether it’s particularly interesting to a developer is almost beside the point — it’s interesting because of the scale of the population it reaches.

Interactive Math and Science: Tutoring Gets Structural

One of the quieter but genuinely well-executed features in this release is the interactive math and science module system — 70+ topics with adjustable variables. This isn’t just “ask GPT a math question.” It’s structured, interactive learning environments where you can manipulate variables and see how outcomes change.

Think of it like a simulation layer on top of conceptual explanation. You’re learning about supply and demand? Adjust the elasticity variable and watch the curve change. You’re working through orbital mechanics? Change the mass of the central body and see how orbital period responds. That kind of interactive feedback loop is how people actually develop intuition for quantitative concepts — not by reading a definition, but by playing with the system.
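The underlying mechanic is easy to sketch. Here is a toy constant-elasticity demand curve, a hypothetical stand-in for the kind of model such a module might expose (not OpenAI's actual implementation):

```python
def demand(price: float, elasticity: float, scale: float = 100.0) -> float:
    """Constant-elasticity demand curve: quantity = scale * price**(-elasticity)."""
    return scale * price ** (-elasticity)

# "Adjust the elasticity variable and watch the curve change":
for e in (0.5, 1.0, 2.0):
    q_before = demand(price=10, elasticity=e)
    q_after = demand(price=12, elasticity=e)  # a 20% price increase
    drop = (q_before - q_after) / q_before * 100
    print(f"elasticity {e}: quantity falls {drop:.1f}% on a 20% price rise")
```

Rerunning with a different elasticity is the whole exercise: the same 20% price rise produces a visibly different quantity response, which is the intuition the interactive modules are built to deliver.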

GPT-5.4 vs. Claude Sonnet 4.6 vs. Gemini: Which Model Wins for What

Every model has a story it tells about itself. The useful question is whether that story holds up when you actually push it. Here’s an honest breakdown across four use cases that matter for people building or working with AI in 2026.

Coding
  • GPT-5.4: Best overall. Native computer use in Codex means it can run, test, and iterate on code without you shuttling output back and forth. Multi-file edits feel genuinely coherent.
  • Claude Sonnet 4.6: Strong on explanation and refactoring. Better than GPT-5.4 at telling you why it made a change. Weaker on autonomous execution loops.
  • Gemini 1.5 Pro: Capable but inconsistent on longer functions. Good for quick snippets. Loses coherence in complex, multi-file contexts.

Long-context research
  • GPT-5.4: The 1M token window works reliably in practice. Feed it an entire codebase or document archive and ask cross-cutting questions — it tracks references well across the full window.
  • Claude Sonnet 4.6: 200K context. Excellent recall within that window, arguably more precise than GPT-5.4 on dense legal or technical text. Hits a wall on anything bigger.
  • Gemini 1.5 Pro: 1M token window, similar to GPT-5.4. Tends to lose precision on highly specific retrieval tasks deep in a long document. Better for summarization than needle-in-a-haystack retrieval.

Agentic workflows
  • GPT-5.4: Best-in-class right now. The token efficiency gains matter enormously here — fewer tokens per step means longer chains before costs spiral. Computer use makes it a real actor, not just a planner.
  • Claude Sonnet 4.6: Good at planning. Less reliable at multi-step execution without scaffolding. Works well inside frameworks like LangChain where you control the loop externally.
  • Gemini 1.5 Pro: Improving, but tool use is still less reliable in production. Better suited to single-shot tasks than sustained agent loops.

Reasoning and analysis
  • GPT-5.4: Thinking mode is genuinely good at structured problem decomposition. Reliable on math, logic chains, and technical decisions.
  • Claude Sonnet 4.6: Arguably the best at nuanced reasoning on ambiguous problems — ethical tradeoffs, strategic decisions, anything where the answer isn’t cleanly verifiable. More calibrated uncertainty.
  • Gemini 1.5 Pro: Good at factual synthesis. Less impressive on genuinely hard reasoning where the answer requires working through competing considerations.

The honest summary: GPT-5.4 is the default choice for coding and agentic work, full stop. Claude Sonnet 4.6 is still the one to reach for when the task is subtle, the stakes are high, and you need a model that expresses appropriate doubt. Gemini makes sense if you’re already deep in the Google ecosystem or need multimodal inputs alongside long context.

How to Use GPT-5.4’s Computer Use in Codex: A Workflow You Can Run Today

Computer use in Codex is the feature most people have heard about but few have actually set up as a real working loop. Here’s a concrete workflow for something practical: using GPT-5.4 to audit a Python project, identify failing tests, fix them, and verify the fix — without you touching the terminal.

What You Need Before Starting

  • Access to Codex via the OpenAI API (model ID: gpt-5.4)
  • A Python project with at least one test suite (pytest works cleanly here)
  • The Codex computer use environment enabled — this is opt-in at the API level, confirm it’s active in your OpenAI dashboard under Codex settings

Step 1: Point GPT-5.4 at Your Repo

Start a Codex session and give it a system prompt that sets the operating constraints clearly. Vague instructions produce vague behavior. Use something like:

“You are working inside a Python repository. Your job is to run the existing test suite, identify any failing tests, diagnose the root cause by reading the relevant source files, implement a fix, and re-run the tests to confirm the fix. Do not modify test files. Only modify source files. If a fix requires changes to more than three files, stop and explain the tradeoff before proceeding.”

That last sentence matters. It prevents runaway edits on complex codebases where the right move is actually a conversation, not a unilateral change.
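In code, that system prompt might be assembled like this. The commented-out session call is a hypothetical sketch: the gpt-5.4 model ID comes from this article, and the exact Codex session API is not something this example verifies against OpenAI's documentation:

```python
from textwrap import dedent

def build_system_prompt(max_files: int = 3) -> str:
    """Operating constraints for the audit session, including the scope guard."""
    return dedent(f"""\
        You are working inside a Python repository. Run the existing test
        suite, identify any failing tests, diagnose the root cause by reading
        the relevant source files, implement a fix, and re-run the tests to
        confirm the fix. Do not modify test files. Only modify source files.
        If a fix requires changes to more than {max_files} files, stop and
        explain the tradeoff before proceeding.""")

prompt = build_system_prompt()

# Hypothetical session setup; model ID and session shape are assumptions:
# client = OpenAI()
# session = client.responses.create(model="gpt-5.4", instructions=prompt, ...)
print(prompt)
```

Parameterizing the file limit keeps the guardrail adjustable per repository without rewriting the prompt by hand.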

Step 2: Run the Initial Test Pass

Let Codex execute pytest --tb=short in your project root. GPT-5.4 will read the terminal output, identify which tests failed, and pull up the relevant source files on its own. You should see it navigating the file tree, opening files, and forming a diagnosis — visible in the Codex action log.
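Under the hood, the agent's first move is essentially parsing that terminal output. A minimal sketch of the kind of extraction involved (the helper name and the sample output are illustrative):

```python
import re

def failing_tests(pytest_output: str) -> list[str]:
    """Extract failing test node IDs from pytest's short-summary lines."""
    return re.findall(r"^FAILED (\S+)", pytest_output, flags=re.MULTILINE)

sample = """\
tests/test_auth.py::test_login PASSED
tests/test_auth.py::test_token_refresh FAILED
=========================== short test summary info ===========================
FAILED tests/test_auth.py::test_token_refresh - AssertionError: expired token
"""
print(failing_tests(sample))  # ['tests/test_auth.py::test_token_refresh']
```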

Step 3: Review the Diagnosis Before It Writes

Before GPT-5.4 makes edits, ask it to output its diagnosis as a numbered list: what’s broken, why, and what it plans to change. This takes 30 seconds and saves you from reviewing a diff you don’t understand. A good prompt here:

“Before making any changes, tell me: which files are you going to modify, what specific lines are changing, and what is the failure mode you’re fixing?”

Step 4: Let It Fix and Verify

Once you’ve confirmed the plan makes sense, let it run. GPT-5.4 will write the fixes, re-run pytest, and report back. In most cases on a reasonably scoped bug, this loop completes in under two minutes. If tests still fail, it will iterate — but you’ve capped the blast radius with the three-file rule from the system prompt.
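You can also enforce the blast-radius cap client-side before accepting a diff, rather than trusting the prompt alone. A hypothetical helper (the function and its defaults are illustrative, not part of any Codex API):

```python
def within_scope(changed_files: list[str], max_files: int = 3,
                 protected_prefix: str = "tests/") -> tuple[bool, str]:
    """Client-side check mirroring the system-prompt guardrails:
    no test files touched, no more than max_files source files changed."""
    touched_tests = [f for f in changed_files if f.startswith(protected_prefix)]
    if touched_tests:
        return False, f"modified protected files: {touched_tests}"
    if len(changed_files) > max_files:
        return False, f"{len(changed_files)} files changed (limit {max_files})"
    return True, "ok"

print(within_scope(["src/auth.py", "src/tokens.py"]))  # (True, 'ok')
print(within_scope(["src/a.py", "src/b.py", "src/c.py", "src/d.py"]))
```

Running a check like this on the diff before merging gives you a hard stop even if the model ignores the soft constraint in the prompt.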

Step 5: Diff Review in the Excel Add-in (Optional)

If your team tracks test coverage or task metrics in Excel, the ChatGPT-for-Excel add-in launched alongside GPT-5.4 lets you paste the pytest output directly into a sheet and ask GPT-5.4 to summarize failure patterns, categorize errors by type, or flag regressions against a prior run. It’s a small thing, but for teams that already live in Excel for reporting, it closes a real workflow gap.
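The failure-pattern summary the add-in produces can be sketched in plain Python as well; here is a hypothetical categorizer over pytest's short-summary lines:

```python
import re
from collections import Counter

def categorize_failures(pytest_output: str) -> Counter:
    """Tally failures by exception type from lines shaped like
    'FAILED path::test - ExceptionType: message'."""
    errors = re.findall(r"^FAILED \S+ - (\w+)", pytest_output,
                        flags=re.MULTILINE)
    return Counter(errors)

summary = """\
FAILED tests/test_api.py::test_get - AssertionError: status 500
FAILED tests/test_api.py::test_post - TimeoutError: no response
FAILED tests/test_db.py::test_conn - TimeoutError: pool exhausted
"""
print(categorize_failures(summary))
```

The same tally, pasted into a sheet, is what you would ask the Excel add-in to produce and track across runs.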

When This Workflow Breaks Down

  • Deeply entangled architecture: If fixing one test requires understanding three layers of abstraction, GPT-5.4 will sometimes make a locally correct fix that breaks something upstream. The three-file guardrail catches most of this, but not all.
  • Missing context: If your repo has undocumented dependencies or environment-specific configs, the model will hit walls it can’t reason past. Add a brief README or CONTEXT.md that explains your environment setup — it uses that file heavily.
  • Token costs on giant repos: Even with GPT-5.4’s efficiency gains over 5.2, indexing a very large codebase in a single session will run up tokens fast. Scope your sessions to a module or service, not the entire monorepo.



Ty Sutherland

Ty Sutherland is the Chief Editor of AI Rising Trends. Living in what he believes to be the most transformative era in history, Ty is deeply captivated by the boundless potential of emerging technologies like the metaverse and artificial intelligence. He envisions a future where these innovations seamlessly enhance every facet of human existence. With a fervent desire to champion the adoption of AI for humanity's collective betterment, Ty emphasizes the urgency of integrating AI into our professional and personal spheres, cautioning against the risk of obsolescence for those who lag behind. AI Rising Trends stands as a testament to his mission, dedicated to spotlighting the latest in AI advancements and offering guidance on harnessing these tools to elevate one's life.
