below 400ms: the transcribe → reason → speak budget
Pick up your phone, call a friend, count the silence between when you stop talking and when they start. Six hundred milliseconds, give or take. Anything beyond that and your brain reaches for the 'is the call dropped?' check. Anything under three hundred starts to feel like an interruption. The window in between is the entire budget for an AI voice agent's turn — STT, reasoning, TTS, and the network on both sides.
Where the milliseconds go
On a typical cascaded pipeline (STT → LLM → TTS), the per-stage p50 latencies over the public preview break down as:
- End-of-utterance detection (Smart Turn v3 ONNX): 40–80ms after the user stops speaking
- STT final transcript (Deepgram Nova-3, 16kHz): 80–150ms from end of speech
- LLM TTFB (gpt-4o, ~800-token context): 200–350ms
- TTS first chunk (ElevenLabs Flash v2.5): 150–250ms from prompt end
- WebRTC and WebSocket overhead: 30–60ms accumulated
Sum the slow path: 80 + 150 + 350 + 250 + 60 = 890ms. That's a bad call. Sum the fast path: 40 + 80 + 200 + 150 + 30 = 500ms. That's an okay call. The fast path doesn't happen by accident.
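The budget arithmetic is simple enough to keep as a live check. A minimal sketch, assuming the stage names and per-stage bounds from the list above (the fast/slow split is just the low and high end of each range):

```python
# Per-stage latency bounds in ms, (fast, slow), mirroring the p50 list above.
STAGES = {
    "eou_detection":   (40, 80),
    "stt_final":       (80, 150),
    "llm_ttfb":        (200, 350),
    "tts_first_chunk": (150, 250),
    "transport":       (30, 60),
}

def turn_latency(path: str) -> int:
    """Sum per-stage latency for the fast (index 0) or slow (index 1) path."""
    idx = 0 if path == "fast" else 1
    return sum(bounds[idx] for bounds in STAGES.values())

print(turn_latency("fast"))  # 500
print(turn_latency("slow"))  # 890
```

A check like this is cheap to run against live metrics: if any stage's observed p50 drifts past its bound, the slow-path sum tells you immediately how far outside the budget a bad turn can land.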
Speculative warm-start
The biggest unlock: don't wait for end-of-turn before warming the LLM. As soon as the interim transcript has been stable for ~200ms (Deepgram confidence > 0.85), we kick off a speculative LLM call with the partial. The vast majority of utterances ('I'd like to book an appointment for...') extend predictably; on those, the LLM's TTFB is already paid down by the time the user finishes, and we discard the speculation when the user surprises us.
The discard rate is around 18%. Even paying for the wasted token spend, we come out ahead: a discarded speculative call costs a few hundred tokens on a per-token bill, while a slow-path turn costs a recipient who hangs up.
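The mechanism above can be sketched in a few lines. This is a simplified, hypothetical policy (class and callback names are illustrative, not the production code): speculate once the interim is stable and confident, keep the speculative response only when the final transcript matches the partial we speculated on, and otherwise discard and re-call.

```python
import asyncio

STABLE_MS = 200        # interim must be unchanged this long before we speculate
CONF_THRESHOLD = 0.85  # interim-confidence gate

class SpeculativeWarmStart:
    """Fire the LLM on a stable interim transcript; keep the result only
    if the final transcript matches the partial we speculated on."""

    def __init__(self, llm_call):
        self.llm_call = llm_call   # async fn: transcript -> response
        self.task = None
        self.speculated_on = None
        self.last_interim = ("", 0.0)

    def on_interim(self, text: str, confidence: float, now: float):
        prev_text, since = self.last_interim
        if text != prev_text:
            # Transcript still changing: restart the stability clock.
            self.last_interim = (text, now)
            return
        stable = (now - since) * 1000 >= STABLE_MS
        if stable and confidence > CONF_THRESHOLD and self.task is None:
            self.speculated_on = text
            self.task = asyncio.create_task(self.llm_call(text))

    async def on_final(self, final_text: str):
        if self.task and final_text.strip() == self.speculated_on.strip():
            return await self.task      # warm path: TTFB already spent
        if self.task:
            self.task.cancel()          # discard: the user surprised us
            self.task = None
        return await self.llm_call(final_text)
```

A real warm-start gets more value than this sketch shows — even a discarded speculation can leave the prompt prefix in the provider's cache, so the follow-up call with the full utterance prefills faster.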
Realtime models change the math
When you swap the cascaded pipeline for an S2S model — Gemini Live, OpenAI Realtime — the picture changes. There's no STT, no separate TTS. Audio goes in, audio comes out, the model handles VAD and turn-taking server-side. Median TTFB drops to about 250ms, but variance gets uglier: when the model decides to think before speaking (it's doing reasoning under the hood), you can see 1–2 second pauses.
We let operators pick the pipeline type per agent. Cascaded for predictable, latency-critical flows (appointment confirmation, payment reminder). Realtime for nuanced, conversational flows (qualification, objection handling). Same orchestrator, same compliance gate, same metrics — different runtime topology.
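The per-agent switch is just configuration. A minimal sketch, assuming hypothetical names (the enum, dataclass, and stage labels are illustrative — the point is that only the topology varies, not the orchestrator around it):

```python
from dataclasses import dataclass
from enum import Enum

class PipelineType(Enum):
    CASCADED = "cascaded"  # STT -> LLM -> TTS: predictable latency
    REALTIME = "realtime"  # speech-to-speech: lower median TTFB, uglier variance

@dataclass
class AgentConfig:
    name: str
    pipeline: PipelineType

def runtime_topology(cfg: AgentConfig) -> list:
    """Same orchestrator, compliance gate, and metrics wrap either result;
    only the runtime stages differ."""
    if cfg.pipeline is PipelineType.CASCADED:
        return ["stt", "llm", "tts"]
    return ["s2s"]
```

So an appointment-confirmation agent would carry `PipelineType.CASCADED` and a qualification agent `PipelineType.REALTIME`, with everything upstream of `runtime_topology` shared.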
The thing we don't try to optimize
We deliberately don't try to make the bot interrupt faster. A bot that jumps in within 100ms of the user pausing feels like a bot that isn't listening. Six hundred milliseconds of perceived latency is the goal, not the floor. The budget exists so we can spend it intentionally — not so we can race to zero.
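Spending the budget intentionally sometimes means adding delay back in. A minimal sketch of that inverse optimization, assuming an illustrative function name and a 600ms default target:

```python
def hold_before_speaking(elapsed_ms: float, target_ms: float = 600.0) -> float:
    """How many ms to hold a ready reply so the bot never jumps in
    faster than the perceived-latency target."""
    return max(0.0, target_ms - elapsed_ms)

print(hold_before_speaking(150))  # 450.0 — reply was too fast, pad it
print(hold_before_speaking(700))  # 0.0   — already past target, speak now
```

The pad runs between TTS-first-chunk-ready and playback start, so a fast path that beats the budget still lands inside the 300–600ms window rather than under it.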