Table of Contents
- What Grok 4.3 Actually Is
- The Four Agents Explained
- How a Query Actually Flows
- Why This Beats a Single Model
- What’s New in 4.3 Versus 4.20
- When to Use Grok 4.3 (and When Not To)
- How to Call It From the API
- FAQ
If you’ve used Grok 4.3 in the last two weeks and felt like the answer was the result of an internal argument, you weren’t imagining it. The arguing is the architecture. Grok 4.3 isn’t a single model that produces a response. It is four specialized agents — a coordinator and three specialists named Harper, Benjamin, and Lucas — that debate every sufficiently complex query in parallel before the coordinator stitches the result together.
The architecture isn’t new to 4.3. xAI introduced it in Grok 4.20 in February, and Grok 4.3 inherits the system whole. What’s new in 4.3 is everything around the architecture: a 40% price cut on input tokens, a 1M-token context window, native video input, and a 207 tokens-per-second response speed that makes the multi-agent overhead feel close to single-model latency. The full API rollout completed April 30, 2026.
Here’s what each agent does, how the debate actually flows, and where the model lands compared to Opus 4.7 and GPT-5.5.
What Grok 4.3 Actually Is
Under the hood, Grok 4.3 is a roughly 3-trillion-parameter Mixture-of-Experts model. The interesting part is what xAI does with it at inference time: rather than running one forward pass and emitting a single answer, the model spawns four specialized replicas of itself, each role-conditioned with a distinct system prompt and tool allocation. The replicas run in parallel, exchange intermediate findings through a shared scratchpad, and converge on a final answer the coordinator delivers to the user.
This is not a framework you orchestrate from the outside, the way you would with AutoGen, Swarm, or other multi-agent systems. It is baked-in inference architecture. The user calls a single API endpoint. The model handles the debate internally. The output is a single coherent response with no visible scaffolding.
The Four Agents Explained
Grok (Captain). The coordinator. It decomposes the user query, assigns subtasks to the three specialists, watches the scratchpad, resolves conflicts when the specialists disagree, and synthesizes the final answer. Captain runs first to plan and last to deliver. The other three run in between.
Harper. The empiricist. Harper has privileged access to the X firehose — roughly 68 million English-language posts per day — plus general web search and the model’s training corpus. Harper’s job is real-time evidence gathering and primary fact-verification. When a query depends on something that happened in the last 24 hours, Harper is the agent that sees it first.
Benjamin. The auditor. Benjamin handles step-by-step reasoning, math, code execution, and logical stress-testing. If Captain’s plan involves a calculation, Benjamin runs it. If Harper returns a fact that doesn’t add up, Benjamin flags it. Benjamin is the agent most responsible for the model’s coding benchmark performance and the reason Grok 4.3 holds up on multi-step quantitative work.
Lucas. The divergent thinker. Lucas is responsible for novel angles, blind-spot detection, creative reframing, and writing-quality optimization. Where Harper asks “what is true” and Benjamin asks “is this consistent,” Lucas asks “what are we missing.” Lucas is also the agent that polishes the final language for human readability rather than internal completeness.
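The division of labor is easiest to see as configuration. The sketch below is hypothetical: the agent names and responsibilities come straight from the descriptions above, but the prompt text and tool lists are illustrative guesses, not xAI’s internal setup.

```python
# Hypothetical role conditioning. The agent names and responsibilities come
# from the descriptions above; the prompt text and tool lists are guesses.
AGENTS = {
    "captain": {
        "system": "Decompose the query, assign subtasks, resolve conflicts, synthesize.",
        "tools": ["scratchpad"],
    },
    "harper": {
        "system": "Gather real-time evidence and verify primary facts.",
        "tools": ["x_firehose", "web_search", "scratchpad"],
    },
    "benjamin": {
        "system": "Stress-test the reasoning, run the math and code, flag inconsistencies.",
        "tools": ["code_execution", "scratchpad"],
    },
    "lucas": {
        "system": "Find missed angles, reframe, polish the final wording.",
        "tools": ["scratchpad"],
    },
}
```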
How a Query Actually Flows
Walking through a real example clarifies why this matters. Suppose you ask Grok 4.3: “Did Anthropic’s Mythos benchmark match its real-world cyber capability, and what should I tell my CISO?”
Captain decomposes the query into three sub-questions: what did the AISI benchmark report say, what does the actual real-world cyber capability look like in deployments, and what’s the practical implication for an enterprise security buyer. It dispatches: Harper to gather published benchmark data and recent reports, Benjamin to cross-check the claims for internal consistency and identify any methodological gaps, Lucas to surface angles a security buyer would actually need but isn’t asking about explicitly.
The three specialists run in parallel for roughly two seconds on a typical question. Their findings hit a shared scratchpad. Captain reviews, identifies that Harper’s evidence and Benjamin’s audit agree on the benchmark numbers but disagree on what they imply, lets Lucas weigh in with a third framing, then synthesizes a single answer that incorporates the strongest version of each perspective.
What you see in your response is one coherent reply. What happened to produce it is a four-way debate.
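If you want the shape of that flow in code, here is a toy reconstruction. Nothing in it is xAI’s implementation: role_pass stands in for one role-conditioned forward pass through the shared base model, and the subtasks are hardcoded for illustration.

```python
import asyncio

# Toy reconstruction of the flow described above. Every name here is
# hypothetical; this mirrors the published description, not xAI's code.

async def role_pass(role: str, prompt: str) -> str:
    """Stand-in for one role-conditioned pass through the shared base model."""
    await asyncio.sleep(0.1)  # pretend inference latency
    return f"[{role}] {prompt[:48]}"

async def answer(query: str) -> str:
    scratchpad: list[tuple[str, str]] = []

    # 1. Captain runs first: decompose the query into specialist subtasks.
    plan = await role_pass("captain", f"decompose: {query}")
    subtasks = {
        "harper": "gather real-time evidence",
        "benjamin": "audit the reasoning and the numbers",
        "lucas": "surface the angles nobody asked about",
    }

    # 2. The three specialists run in parallel, writing to a shared scratchpad.
    async def run(name: str, task: str) -> None:
        scratchpad.append((name, await role_pass(name, f"{task}: {query}")))

    await asyncio.gather(*(run(n, t) for n, t in subtasks.items()))

    # 3. Captain runs last: resolve disagreements, synthesize one reply.
    return await role_pass("captain", f"synthesize {plan} + {scratchpad}")

print(asyncio.run(answer("Did the benchmark match real-world capability?")))
```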
Why This Beats a Single Model
The peer-review mechanism does measurable work. xAI reports hallucinations reduced from approximately 12% on Grok 4 to 4.2% on the four-agent system, a 65% reduction. The most common failure mode of large frontier models is producing internally confident wrong answers, and the four-agent setup catches a meaningful fraction of those because Benjamin’s stress-test runs adversarially against Captain’s draft.
On benchmarks, Grok 4.3 scores 53 on the Artificial Analysis Intelligence Index and outperforms Opus 4.7 by approximately 1.26x on Vending-Bench, the long-sequence simulation benchmark that tests sustained agentic execution. It is not the strongest at any one thing — Opus 4.7 still leads on coding, GPT-5.5 leads on autonomous workflow execution — but it is the most reliable at long horizons because the debate keeps drift in check.
What’s New in 4.3 Versus 4.20
The architecture is the same. Everything around it changed:
Pricing. $1.25 per million input tokens, $2.50 per million output. That’s a 40% input-price cut from 4.20 and roughly a twelfth of GPT-5.5’s input price. For agentic workloads that read a lot and emit moderately, the cost gap is significant; a worked cost example follows this list.
Context window. 1 million tokens, up from 256K in 4.20. Output length is effectively uncapped at practical sizes.
Video input. Native video understanding lands for the first time. The model accepts video files directly without preprocessing. Useful for screen recordings, product demos, and meeting analysis.
Always-on reasoning. Reasoning can no longer be toggled off, only tuned across low / medium / high effort. Even at low effort, every query goes through the four-agent debate; the effort dial controls how deeply each specialist explores its sub-task.
Speed. 207 tokens per second on most reasoning queries. Fast enough that the multi-agent overhead disappears into the wait time most users tolerate.
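To make the pricing concrete, here is the per-request arithmetic at the 4.3 rates above. The 200K-in / 2K-out shape is an assumed read-heavy agentic step, not a published figure.

```python
# Back-of-envelope cost per request at the 4.3 list prices above.
INPUT_USD_PER_M = 1.25
OUTPUT_USD_PER_M = 2.50

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000

# A read-heavy agentic step: 200K tokens in, 2K tokens out.
print(f"${request_cost(200_000, 2_000):.4f}")  # -> $0.2550
```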
When to Use Grok 4.3 (and When Not To)
Use Grok 4.3 when reliability matters more than peak performance on any single dimension. Long-running agentic loops are the obvious fit because the peer-review architecture catches the drift that ruins multi-step tasks. Real-time queries that depend on freshness — anything where Harper’s X firehose access is an advantage — are another fit.
Skip it when you need the absolute best at one thing. For pure coding, Opus 4.7 leads on SWE-bench Pro by enough margin that the architectural elegance of Grok’s debate doesn’t compensate. For autonomous computer use, GPT-5.5 still has the lead. For raw speed on simple queries, the four-agent overhead is unnecessary cost.
Also skip it if your application can’t tolerate the X-firehose dependency. Harper’s strength is also a liability: the model sometimes weights real-time X content heavily even when older, more authoritative sources exist. For regulated or evidence-conservative domains (medical, legal, financial advice), the freshness bias is a known failure mode.
How to Call It From the API
The API endpoint is standard. Set the model parameter to grok-4.3. Set reasoning_effort to low, medium, or high depending on the trade-off you want between latency and depth. Tool use and function calling work identically to other major frontier APIs. Image and video inputs go in the same multipart request as text, no preprocessing required.
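A minimal call sketch follows. It assumes xAI’s API stays OpenAI-compatible, as it has been, and uses the documented api.x.ai base URL; verify the parameter names, reasoning_effort in particular, against the current docs before shipping.

```python
import os

from openai import OpenAI  # assumes an OpenAI-compatible endpoint, as xAI's API has been

client = OpenAI(
    base_url="https://api.x.ai/v1",       # xAI's documented base URL
    api_key=os.environ["XAI_API_KEY"],
)

response = client.chat.completions.create(
    model="grok-4.3",
    reasoning_effort="high",  # "low" | "medium" | "high"
    messages=[
        {"role": "user", "content": "Summarize the attached screen recording."},
        # Video rides in the same request as text. The exact content shape
        # below is a guess, so check the multimodal docs:
        # {"role": "user", "content": [{"type": "video_url", "video_url": {"url": "..."}}]},
    ],
)
print(response.choices[0].message.content)
```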
Do this first: pick one production task currently running on a different frontier model that fails roughly 5% to 10% of the time on multi-step work. Switch the model to grok-4.3 at reasoning_effort: high, run your last week of failed examples, and compare success rate. The four-agent architecture’s main pitch is failure-rate reduction, so that’s where the improvement should be measurable. If it isn’t, the architecture isn’t doing work for your specific use case.
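Here is a sketch of that comparison, assuming you already have a harness that executes one logged example against a named model and scores it; run_example below is that assumed harness, not a real API.

```python
from typing import Callable

# `run_example(model, example)` is your existing harness: execute one logged
# multi-step example against the named model, return True on success.
# It is an assumption here, not a library call.
def failure_rate(run_example: Callable[[str, dict], bool],
                 model: str, examples: list[dict]) -> float:
    """Fraction of logged examples the given model fails."""
    return sum(not run_example(model, ex) for ex in examples) / len(examples)

# Usage against last week's logged failures:
# baseline  = failure_rate(run_example, "your-current-model", failed_examples)
# candidate = failure_rate(run_example, "grok-4.3", failed_examples)  # effort: high
# print(f"{baseline:.1%} -> {candidate:.1%}")
```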
FAQ
Are Harper, Benjamin, and Lucas separate models or one model running four ways?
They’re four specialized inference passes through the same underlying ~3T-parameter MoE base model, each conditioned with a different system prompt and tool allocation. The compute cost is roughly 4x a single pass; what the parallelism offsets is latency, not compute.
Can I see the four-agent debate or the scratchpad?
Not by default in the standard API. SuperGrok Heavy subscribers ($300/month tier) can inspect the scratchpad and see the inter-agent exchanges. For most production use that level of visibility is unnecessary, but it’s useful for debugging unexpected outputs.
Does the architecture make Grok 4.3 better at coding than Opus 4.7?
No. Opus 4.7 still leads on SWE-bench Verified (87.6%) and SWE-bench Pro (64.3%). Grok 4.3’s strength is long-horizon reliability in agentic loops, not single-task coding peaks. For pure coding work, Opus 4.7 is the stronger choice.
How does Grok 4.3 compare on price to Opus 4.7 and GPT-5.5?
Grok 4.3 input is $1.25 per million tokens vs Opus 4.7 at $5 and GPT-5.5 at $15. Output is $2.50 vs $25 and $30 respectively. For workloads heavy on input context, Grok 4.3 is roughly 4x cheaper than Opus 4.7 and 12x cheaper than GPT-5.5.
Is the multi-agent system available in Grok’s consumer chat product?
Yes. Every Grok query in the X chat interface and the standalone Grok app routes through the four-agent system on sufficiently complex prompts. Simple lookups bypass the full debate to keep latency low.
Can I use Grok 4.3 inside Cursor 3 or other coding agents?
Yes. Grok 4.3 is available as a selectable model on most major coding platforms including Cursor 3 Glass. The /best-of-n command can include Grok 4.3 alongside Opus 4.7 and GPT-5.5 for direct comparison on real tasks.
