Table of Contents
- What SIP Actually Changes
- The Models Everyone Else Is Writing About
- The Real Per-Hour Cost
- What to Build First
- FAQ
Imagine calling your bank’s customer service line tomorrow and the voice on the other end is a GPT-5-class AI. Not a chatbot reading text. A voice that hears your tone, holds a multi-turn conversation, calls tools mid-call, warm-transfers to a human when needed, and stays on the line for as long as your problem takes.
That’s what OpenAI shipped on May 7, 2026. The headline announcement was GPT-Realtime-2, a new speech-to-speech model with GPT-5-class reasoning, a 128K-token context window, and pricing of $32 per million audio input tokens and $64 per million audio output tokens. The companion launches were GPT-Realtime-Translate (real-time translation across 70+ input and 13 output languages) and GPT-Realtime-Whisper (streaming transcription). All three got the coverage.
The feature in the same release that actually changes everything got one bullet point: SIP phone calling.
What SIP Actually Changes
SIP — Session Initiation Protocol — is the standard that connects voice over IP to the regular phone network. Adding SIP support to the Realtime API means a voice agent built on GPT-Realtime-2 can now be wired directly to a phone number. Through Twilio SIP Trunking or another SIP carrier, your agent picks up the phone. It dials out. It transfers calls. It does everything a human voice agent does, on every phone number you can provision.
Before this, building a phone-answering AI agent required custom plumbing — capturing audio from Twilio’s media streams, piping it through your own STT, calling an LLM, piping the response through TTS, and pushing it back. Latency stacked at every hop. Audio quality degraded. Tool calls had to happen between pipeline stages, not during them. The “I’d like to speak to your AI” demo videos all hid that integration work.
Now the integration is a single API call. Point your SIP trunk at OpenAI’s endpoint. The model handles audio in, audio out, tool calling, interruption handling, and warm transfer to a human agent — all natively. OpenAI’s own SIP guide walks through provisioning a number, connecting a Twilio trunk, and answering a real phone call in under a hundred lines of code.
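To make "a single API call" concrete, here is a minimal sketch in Python of the accept step: OpenAI delivers a webhook when a call hits your SIP trunk, and you attach the model by accepting it. The endpoint path, event name, and JSON fields below are assumptions modeled loosely on OpenAI's published SIP guide, and the model name comes from this article; check the current docs before relying on any of them.

```python
# Sketch of answering an inbound SIP call via the Realtime API.
# Endpoint path and field names are assumptions, not verified API surface.

API_BASE = "https://api.openai.com/v1"

def accept_call_request(call_id: str, instructions: str) -> dict:
    """Build the HTTP request that accepts an incoming SIP call.

    A `realtime.call.incoming`-style webhook carries a call ID; POSTing an
    accept with a session config attaches the model to the live call.
    """
    return {
        "method": "POST",
        "url": f"{API_BASE}/realtime/calls/{call_id}/accept",
        "json": {
            "model": "gpt-realtime-2",   # model name taken from the article
            "instructions": instructions,
            "audio": {"output": {"voice": "alloy"}},
        },
    }

req = accept_call_request("rtc_123", "You are a phone receptionist. Be brief.")
print(req["url"])
```

Everything else — audio framing, interruption handling, codec negotiation — happens on OpenAI's side of the trunk, which is the point: the hundred lines in the guide are mostly Twilio provisioning, not audio plumbing.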
This is the moment voice agents become production-ready. Customer support lines. Outbound sales. Appointment scheduling. Dispatch. Receptionist work. Triage for medical, legal, technical support. All categories that previously needed humans for the voice channel are now technically — and economically — addressable by a single API. ElevenLabs and other voice platforms have been racing to this point for two years. OpenAI just shipped it as a checkbox on the model that already does the reasoning.
The Models Everyone Else Is Writing About
GPT-Realtime-2 itself is real progress. The 128K context window is a 4x jump from GPT-Realtime-1’s 32K, which matters for long calls — a 60-minute support call no longer truncates mid-conversation. The five reasoning intensity levels (low, medium, high, xhigh, max) let you trade latency for thoughtfulness call-by-call. Parallel tool calls mean the agent can check your account balance and your appointment calendar in the same turn rather than serially. Spoken preambles (“let me check that for you”) fill the dead air while a tool call runs. Recovery behavior gracefully handles tool failures instead of dropping the call.
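A session configuration pulling those knobs together might look like the sketch below. The field names (`reasoning_effort`, `parallel_tool_calls`) and the tool schema are assumptions in the style of the Realtime API, and the example tool is hypothetical; only the five effort levels come from the release described above.

```python
# Hypothetical session config illustrating the knobs described above.
# Field names and the tool are illustrative assumptions, not documented API.

REASONING_LEVELS = ("low", "medium", "high", "xhigh", "max")

def build_session(effort: str = "medium") -> dict:
    """Assemble a session config trading latency for thoughtfulness."""
    if effort not in REASONING_LEVELS:
        raise ValueError(f"effort must be one of {REASONING_LEVELS}")
    return {
        "model": "gpt-realtime-2",
        "reasoning_effort": effort,      # latency vs. depth, per call
        "parallel_tool_calls": True,     # balance + calendar in one turn
        "tools": [
            {
                "type": "function",
                "name": "get_account_balance",   # hypothetical tool
                "parameters": {
                    "type": "object",
                    "properties": {"account_id": {"type": "string"}},
                },
            },
        ],
    }
```

The practical consequence of `parallel_tool_calls` plus spoken preambles is that the caller hears "let me check that for you" while two lookups run concurrently, rather than two stretches of dead air.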
GPT-Realtime-Translate and GPT-Realtime-Whisper round out the stack. Translate handles real-time conversational interpretation across 70+ input languages. Whisper handles streaming transcription. Both are billed per minute rather than per token, which makes them predictable for telephony budgets.
These are the things every other piece is covering. They’re useful. They’re not the story.
The Real Per-Hour Cost
The token math: user audio bills at 1 token per 100 milliseconds; assistant audio bills at 1 token per 50 milliseconds. A full minute of user talking equals 600 input tokens. A full minute of assistant talking equals 1,200 output tokens.
For a typical conversational hour with roughly equal listening and speaking time, that’s about 18,000 user input tokens and 36,000 assistant output tokens. At GPT-Realtime-2’s $32/M input and $64/M output, an hour of conversation runs roughly $0.58 input + $2.30 output = $2.88 per hour. Prompt caching cuts repeated-context input to $0.40/M, which on stable system prompts brings the input portion close to zero.
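The arithmetic is simple enough to sanity-check in a few lines, using only the billing rates stated above (1 token per 100 ms of user audio, 1 token per 50 ms of assistant audio, $32/M in, $64/M out); caching discounts are left out.

```python
# Reproducing the per-hour cost arithmetic from the rates stated above.

USER_TOKENS_PER_MIN = 60_000 // 100      # 600 input tokens per spoken minute
ASSISTANT_TOKENS_PER_MIN = 60_000 // 50  # 1,200 output tokens per spoken minute
INPUT_RATE = 32 / 1_000_000              # dollars per input token
OUTPUT_RATE = 64 / 1_000_000             # dollars per output token

def conversation_cost(user_minutes: float, assistant_minutes: float) -> float:
    """Dollar cost of a call with the given minutes of each side talking."""
    input_cost = user_minutes * USER_TOKENS_PER_MIN * INPUT_RATE
    output_cost = assistant_minutes * ASSISTANT_TOKENS_PER_MIN * OUTPUT_RATE
    return input_cost + output_cost

# A 60-minute call split evenly: 30 min listening, 30 min speaking.
print(round(conversation_cost(30, 30), 2))  # → 2.88
```

Note the asymmetry: the assistant side costs roughly 4x the user side ($2.30 vs. $0.58), so a chatty agent is the expensive one. Prompting for concise answers is a cost lever, not just a UX choice.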
For context, a human customer service agent at fully-loaded cost runs $25–$45 per hour in the U.S., $5–$12 offshore. $2.88 per hour is the price point that makes pure economic substitution viable on volumes that were uneconomic to outsource. The same dynamic explains why Apple just paid Google $1 billion a year to put a 1.2T-parameter Gemini model behind Siri rather than building its own — the cost curve favors buying capability over building it.
What to Build First
Do this first: pick the single highest-volume, lowest-complexity inbound phone interaction your business handles today. Appointment confirmations. Order status. Hours and location lookups. Wire it to GPT-Realtime-2 over a Twilio SIP trunk using the OpenAI SIP guide. Set reasoning_effort to medium, configure two or three tool calls (your appointment system, your order DB, your knowledge base), and enable warm transfer to a human queue for anything the agent can’t resolve. Pilot it on 10% of inbound traffic for a week. Measure resolution rate and average handle time against your baseline.
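The routing and measurement half of that pilot is ordinary code, sketched below. The function names are illustrative, not from any SDK; the 10% cohort split and the resolution-rate/handle-time metrics are the ones described above.

```python
# Illustrative helpers for the pilot: route ~10% of inbound calls to the
# agent cohort, then compare resolution rate and average handle time.
# Names are hypothetical, not from any SDK.

import random

def route_to_agent(pilot_fraction: float = 0.10, rng=random.random) -> bool:
    """Decide per-call whether this call joins the agent pilot cohort."""
    return rng() < pilot_fraction

def pilot_metrics(calls: list[dict]) -> dict:
    """calls: [{"resolved": bool, "handle_seconds": float}, ...]"""
    if not calls:
        return {"resolution_rate": 0.0, "avg_handle_seconds": 0.0}
    resolved = sum(c["resolved"] for c in calls)
    total_seconds = sum(c["handle_seconds"] for c in calls)
    return {
        "resolution_rate": resolved / len(calls),
        "avg_handle_seconds": total_seconds / len(calls),
    }
```

Run the same metrics over the 90% human-handled control traffic for the same week, and the comparison against baseline falls out for free.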
If resolution rate stays above 70% on the pilot subset, the rest of that intent category is now viable to migrate. If it doesn’t, the model isn’t the bottleneck — your tool integrations and prompt design are. The cost-per-call is low enough that iteration is cheap. The integration is fast enough that the experiment runs in days, not quarters.
One honest caveat: regulated industries (medical advice, legal advice, financial advice subject to FINRA, anything involving HIPAA) need compliance review before letting a voice agent handle live calls. The capability is here. The compliance posture is industry-specific.
FAQ
How is SIP support different from connecting an LLM through Twilio media streams?
Twilio media streams have worked with the Realtime API for over a year, but required you to stitch together the audio pipeline yourself. Native SIP support means the model accepts a SIP call directly from your trunk — Twilio or otherwise — with no middleware. Latency drops significantly, tool calls happen during the call rather than between pipeline stages, and warm transfer to a human becomes a single API parameter.
Can GPT-Realtime-2 transfer a call to a human agent?
Yes. Warm transfer is a built-in capability. The agent can hold context, summarize the conversation for the human receiver, and transfer the live call through the SIP trunk.
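A warm transfer might be shaped like the sketch below: the agent produces a context summary, then refers the live SIP leg to a human destination. The `/refer` action, `target_uri`, and summary field are assumptions for illustration, not documented API surface.

```python
# Hypothetical shape of a warm-transfer step. The endpoint and fields are
# assumptions, not verified against any published API reference.

API_BASE = "https://api.openai.com/v1"

def warm_transfer_request(call_id: str, target_uri: str, summary: str) -> dict:
    """Build a transfer request carrying a context summary to the human."""
    return {
        "method": "POST",
        "url": f"{API_BASE}/realtime/calls/{call_id}/refer",
        "json": {
            "target_uri": target_uri,    # e.g. "tel:+15551234567" or a SIP URI
            "context_summary": summary,  # what the receiving human sees/hears
        },
    }
```

The summary is the part that makes the transfer "warm" rather than cold: the human picks up already knowing who is calling and why.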
What’s the actual latency on a SIP-connected voice agent?
End-to-end response latency on GPT-Realtime-2 at reasoning_effort: medium is reported in the 300–800 millisecond range, depending on the complexity of the response and whether tool calls are involved. That’s within the threshold humans perceive as natural conversation. Higher reasoning levels add latency in exchange for more thoughtful responses.
Does the Realtime API support phone numbers in countries outside the U.S.?
Yes, through your SIP trunk provider. Twilio offers numbers in 100+ countries. The Realtime API itself is country-agnostic; the limits are your SIP provider’s coverage and any local regulatory requirements (some jurisdictions require disclosure when callers interact with an AI).
How does pricing compare to ElevenLabs voice agents?
Different pricing model. ElevenLabs voice agent pricing is structured around characters or conversation minutes plus a separate LLM cost. GPT-Realtime-2 collapses model and voice into a single per-token cost. For typical conversational workloads, OpenAI’s pricing comes out lower per minute, but the trade-off is voice quality and customization — ElevenLabs still leads on voice cloning and emotional control.
Can I use my existing Twilio number with the OpenAI Realtime API?
Yes. Twilio Elastic SIP Trunking connects your existing numbers to the Realtime API endpoint. You don’t need to port numbers or change providers. OpenAI’s recent positioning has been about removing every friction point between developer and production deployment; SIP support is the largest single one removed for voice agents.
