Himanshu Rawat
Field notes · AI · 22 May 2026 · 6 min read

Voice agents stopped being a research problem. They're a latency budget now.

The components matured. Streaming STT under 300ms. TTS first audio under 100ms. LLM inference under 300ms. The work moved from "can we" to "how do we hit 700ms end to end, every call, in production." Here is how I think about it while building a phone agent for L1 support.

The conversation broke at 1.2 seconds

For two years the bottleneck on voice agents was the model. The conversation felt wrong because the AI sounded wrong: flat, halting, oddly cadenced. By late 2025 that ceased to be true. ElevenLabs, OpenAI Realtime, and the wave of open speech models cleared the audio bar. Production voice now sounds, more or less, like a person.

The next thing users noticed was the gap. Anything over 1.2 seconds of silence after a user stops talking reads as “the line is dead.” A 600ms gap reads as polite. A 300ms gap reads as fluent. So the engineering question shifted: not “how do we make this sound human,” but “how do we hit 700ms end to end on every single turn, including the bad ones.”

The latency budget, broken down

A round trip in a voice agent has six stages. Here is the rough budget for a snappy production agent:

  1. Network in: 60ms
  2. Streaming STT (incremental, finalised mid utterance): 280ms
  3. LLM first token (small model, no tool call): 180ms
  4. LLM completion stream (overlapped with TTS): 140ms
  5. TTS first audio: 80ms
  6. Network out: 60ms

  Total: ~700ms

Added up serially, the stage figures come to 800ms, so the ~700ms total only holds if every stage streams and every stage overlaps with the next. The moment you wait for the full transcript before calling the LLM, or wait for the full response before synthesising speech, you blow past one second and the conversation feels like a kiosk.
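As a rough way to keep that budget honest, here is a minimal sketch that holds the stage figures in one place and flags measured stages that run over. The stage keys and the helper names (`serial_total`, `required_overlap`, `over_budget`) are made up for illustration; only the millisecond figures come from the breakdown above.

```python
# Per-stage budget in milliseconds, copied from the breakdown above.
STAGE_BUDGET_MS = {
    "network_in": 60,
    "stt_final": 280,
    "llm_first_token": 180,
    "llm_stream": 140,
    "tts_first_audio": 80,
    "network_out": 60,
}
TARGET_MS = 700  # end-to-end target per turn


def serial_total(budget: dict[str, int]) -> int:
    """Latency if every stage waits for the previous one to finish."""
    return sum(budget.values())


def required_overlap(budget: dict[str, int], target_ms: int) -> int:
    """Milliseconds that streaming has to hide to reach the target."""
    return max(0, serial_total(budget) - target_ms)


def over_budget(measured: dict[str, float], budget: dict[str, int]) -> dict[str, float]:
    """Return each measured stage that exceeded its budget, with the overrun in ms."""
    return {
        stage: measured[stage] - limit
        for stage, limit in budget.items()
        if measured.get(stage, 0.0) > limit
    }
```

Feeding per-turn stage timings into `over_budget` shows which stage broke a slow turn, which is usually more useful than the end-to-end number alone.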

Where the milliseconds actually go

In practice three places eat the budget:

Tool calls

The moment a turn needs a function call (look up an order, check availability, fetch a record) the LLM stalls until it has a result. A 200ms internal API turns a 700ms turn into a 900ms turn, and the user notices.
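One way to hide that stall is to start the lookup while the user is still speaking, so the result is already in hand when the turn ends. The following is a minimal asyncio sketch under stated assumptions: `SpeculativeLookup`, the regex trigger, and the injected `lookup` callable are hypothetical stand-ins, not part of any real stack.

```python
import asyncio
import re
from typing import Awaitable, Callable, Optional

# Hypothetical trigger: the caller says "order" followed by digits.
ORDER_PATTERN = re.compile(r"\border\s+#?(\d+)", re.IGNORECASE)


class SpeculativeLookup:
    """Starts a lookup on partial transcripts, before the turn has ended."""

    def __init__(self, lookup: Callable[[str], Awaitable[dict]]) -> None:
        self._lookup = lookup
        self._task: Optional[asyncio.Task] = None

    def on_partial_transcript(self, text: str) -> None:
        """Call on every interim STT result; fires the lookup at most once per turn."""
        if self._task is not None:
            return
        match = ORDER_PATTERN.search(text)
        if match:
            # Must run inside an event loop; the lookup proceeds in the background.
            self._task = asyncio.ensure_future(self._lookup(match.group(1)))

    async def finish_turn(self, needed: bool) -> Optional[dict]:
        """At end of turn, use the speculative result or throw it away."""
        task, self._task = self._task, None
        if task is None:
            return None  # nothing was started; caller falls back to a normal lookup
        if not needed:
            task.cancel()  # the turn did not need it, so discard the work
            return None
        return await task
```

If the lookup finishes before the user stops talking, `finish_turn(needed=True)` returns without waiting. A real trigger would need to be smarter than a regex, and lookups with side effects should never be fired speculatively.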

Turn detection

Knowing when the user has actually stopped talking, versus paused mid-sentence, is the single trickiest signal. Endpointing too aggressively gives the agent a reputation for interrupting. Waiting too long makes it slow. Most production agents use a hybrid: voice activity detection as the floor, plus a small classifier listening for prosodic end-of-thought cues.
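That hybrid can be sketched as a small decision rule: silence measured by VAD sets a floor and a ceiling, and the classifier decides everything in between. The function name, the three thresholds, and the `end_of_thought_prob` input are assumptions for illustration, not tuned values.

```python
def should_end_turn(
    silence_ms: float,
    end_of_thought_prob: float,
    min_silence_ms: float = 200.0,   # assumed floor: never cut in sooner than this
    max_silence_ms: float = 1000.0,  # assumed ceiling: always respond by this point
    prob_threshold: float = 0.7,     # assumed classifier cut-off
) -> bool:
    """Decide whether the user has finished speaking.

    silence_ms: trailing silence reported by voice activity detection.
    end_of_thought_prob: classifier score in [0, 1] for "the thought is complete".
    """
    if silence_ms < min_silence_ms:
        return False  # too early, even if the classifier is confident
    if silence_ms >= max_silence_ms:
        return True   # waited long enough, respond regardless of the classifier
    return end_of_thought_prob >= prob_threshold
```

The floor protects against interrupting on short pauses, and the ceiling keeps the silence under the point where the line starts to feel dead.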

Cold paths

Anything that isn't on the hot path during a normal turn is going to be slow when it fires. The escalation path. The recovery from “we couldn't understand you.” These need their own latency rehearsal or they will tank the perceived quality on exactly the calls where the user is already frustrated.

Engineering moves that actually buy time

A few patterns that have held up under production traffic:

  1. Stream the LLM into the TTS, not after it.

    Most off the shelf stacks wait for the LLM to finish before synthesising. If you pipe LLM tokens into a streaming TTS that can start speaking from the first sentence boundary, you cut 200 to 400ms off every turn.

  2. Predict the tool call.

    When the conversation is heading toward an order lookup, fire the lookup speculatively on partial transcript signals. If the turn ends up not needing it, throw the result away. Storage and compute are cheap. The 250ms you save are not.

  3. Pin a small model for the boring turns.

    Most turns in a support call are routing decisions, acknowledgements, or clarifications. Reserve the big model for the genuinely hard turns and route everything else through a quantised 8B class model running on dedicated hardware. The user cannot hear the difference. The latency curve flattens.

  4. Cache the agent's voice.

    The agent says the same opening line, the same hold phrase, the same goodbye on every call. Pre synthesise those clips and play them as audio. The TTS budget for fixed copy drops to zero.
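The first pattern above hinges on cutting the token stream at sentence boundaries so the TTS can start before the LLM has finished. Here is a minimal sketch of that chunking step, assuming tokens arrive as plain strings. The `sentences_from_tokens` name and the boundary regex are illustrative choices, and handing each sentence to a TTS client is left out because that call depends on the vendor.

```python
import re
from typing import Iterable, Iterator

# Sentence end: ., ! or ? followed by whitespace. Deliberately naive:
# it will also split after abbreviations such as "Dr. ".
SENTENCE_END = re.compile(r"[.!?]\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if match is None:
                break
            cut = match.end()
            sentence = buffer[:cut].strip()
            buffer = buffer[cut:]
            if sentence:
                yield sentence  # ready for synthesis while the LLM keeps going
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever is left when the stream ends
```

Because this is a generator, the first sentence is available as soon as the token after its boundary arrives, rather than when the whole response is done.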

What I'm shipping

The phone agent in my pipeline is a working bench for all of this. It runs a Twilio inbound path into a thin Node mediator, with streaming STT on the way in, an OpenAI Realtime tier for the conversational LLM, and ElevenLabs streaming TTS on the way back out. Median turn latency, measured ear to ear with synthetic test calls, sits at 640ms. P95 sits at 880ms.

The interesting part isn't the median. It is the P95. Every voice agent demo looks great on the median.

The reason most production deployments still feel broken is that the long tail of turns (tool calls, recoveries, escalations) blows past two seconds, and that is where users churn.
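For anyone setting up the same measurement, here is a small sketch of computing median and P95 from a list of per-turn latencies. It uses the nearest-rank method, which is one of several percentile definitions, and the `percentile` helper is an illustrative name rather than part of the pipeline described here.

```python
import math


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

Calling `percentile(turn_latencies_ms, 50)` and `percentile(turn_latencies_ms, 95)` on the same samples makes the point concrete: a handful of slow turns barely moves the first number and dominates the second.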

If you are evaluating an engineer to ship a voice agent in 2026, the question that separates the people who have actually done it from the people who have read about it is straightforward: walk me through your P95 turn latency, and what you did when it crept past 900ms.

If you are shipping into this space and want to talk about latency, agent architecture, or what it actually costs to keep a voice product responsive under load, the booking link is below.
