below 400ms: the transcribe → reason → speak budget
Pick up your phone, call a friend, count the silence between when you stop talking and when they start. Six hundred milliseconds, give or take. Anything beyond that and your brain reaches for the 'is the call dropped?' check. Anything under three hundred starts to feel like an interruption. The window in between is the entire budget for an AI voice agent's turn — STT, reasoning, TTS, and the network on both sides.
Where the milliseconds go
On a typical cascaded pipeline (STT → LLM → TTS), the per-stage p50 latencies over the public preview break down as:
- End-of-utterance detection (Smart Turn v3 ONNX): 40–80ms after the user stops speaking
- STT final transcript (Deepgram Nova-3, 16kHz): 80–150ms from end of speech
- LLM TTFB (gpt-4o, ~800-token context): 200–350ms
- TTS first chunk (ElevenLabs Flash v2.5): 150–250ms from prompt end
- WebRTC and WebSocket overhead: 30–60ms accumulated
Sum the slow path: 80 + 150 + 350 + 250 + 60 = 890ms. That's a bad call. Sum the fast path: 40 + 80 + 200 + 150 + 30 = 500ms. That's an okay call. The fast path doesn't happen by accident.
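The budget arithmetic is simple enough to keep as a live check. A minimal sketch, assuming the stage names and per-stage bounds from the list above (the fast/slow split is just the low and high end of each range):

```python
# Per-stage latency bounds in ms, (fast, slow), mirroring the p50 list above.
STAGES = {
    "eou_detection":   (40, 80),
    "stt_final":       (80, 150),
    "llm_ttfb":        (200, 350),
    "tts_first_chunk": (150, 250),
    "transport":       (30, 60),
}

def turn_latency(path: str) -> int:
    """Sum per-stage latency for the fast (index 0) or slow (index 1) path."""
    idx = 0 if path == "fast" else 1
    return sum(bounds[idx] for bounds in STAGES.values())

print(turn_latency("fast"))  # 500
print(turn_latency("slow"))  # 890
```

A check like this is cheap to run against live metrics: if any stage's observed p50 drifts past its bound, the slow-path sum tells you immediately how far outside the budget a bad turn can land.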
Speculative warm-start
The biggest unlock: don't wait for end-of-turn before warming the LLM. As soon as the interim transcript has been stable for ~200ms (Deepgram confidence > 0.85), we kick off a speculative LLM call with the partial. The vast majority of utterances ('I'd like to book an appointment for...') extend predictably; on those, the LLM's TTFB is already paid down by the time the user finishes, and we discard the speculation when the user surprises us.
The discard rate is around 18%. Even paying for the wasted token spend, we come out ahead: a discarded speculative call costs a few hundred tokens on a per-token bill, while a slow-path turn costs a recipient who hangs up.
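The mechanism above can be sketched in a few lines. This is a simplified, hypothetical policy (class and callback names are illustrative, not the production code): speculate once the interim is stable and confident, keep the speculative response only when the final transcript matches the partial we speculated on, and otherwise discard and re-call.

```python
import asyncio

STABLE_MS = 200        # interim must be unchanged this long before we speculate
CONF_THRESHOLD = 0.85  # interim-confidence gate

class SpeculativeWarmStart:
    """Fire the LLM on a stable interim transcript; keep the result only
    if the final transcript matches the partial we speculated on."""

    def __init__(self, llm_call):
        self.llm_call = llm_call   # async fn: transcript -> response
        self.task = None
        self.speculated_on = None
        self.last_interim = ("", 0.0)

    def on_interim(self, text: str, confidence: float, now: float):
        prev_text, since = self.last_interim
        if text != prev_text:
            # Transcript still changing: restart the stability clock.
            self.last_interim = (text, now)
            return
        stable = (now - since) * 1000 >= STABLE_MS
        if stable and confidence > CONF_THRESHOLD and self.task is None:
            self.speculated_on = text
            self.task = asyncio.create_task(self.llm_call(text))

    async def on_final(self, final_text: str):
        if self.task and final_text.strip() == self.speculated_on.strip():
            return await self.task      # warm path: TTFB already spent
        if self.task:
            self.task.cancel()          # discard: the user surprised us
            self.task = None
        return await self.llm_call(final_text)
```

A real warm-start gets more value than this sketch shows — even a discarded speculation can leave the prompt prefix in the provider's cache, so the follow-up call with the full utterance prefills faster.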
Realtime models change the math
When you swap the cascaded pipeline for an S2S model — Gemini Live, OpenAI Realtime — the picture changes. There's no STT, no separate TTS. Audio goes in, audio comes out, the model handles VAD and turn-taking server-side. Median TTFB drops to about 250ms, but variance gets uglier: when the model decides to think before speaking (it's doing reasoning under the hood), you can see 1–2 second pauses.
We let operators pick the pipeline type per agent. Cascaded for predictable, latency-critical flows (appointment confirmation, payment reminder). Realtime for nuanced, conversational flows (qualification, objection handling). Same orchestrator, same compliance gate, same metrics — different runtime topology.
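The per-agent switch is just configuration. A minimal sketch, assuming hypothetical names (the enum, dataclass, and stage labels are illustrative — the point is that only the topology varies, not the orchestrator around it):

```python
from dataclasses import dataclass
from enum import Enum

class PipelineType(Enum):
    CASCADED = "cascaded"  # STT -> LLM -> TTS: predictable latency
    REALTIME = "realtime"  # speech-to-speech: lower median TTFB, uglier variance

@dataclass
class AgentConfig:
    name: str
    pipeline: PipelineType

def runtime_topology(cfg: AgentConfig) -> list:
    """Same orchestrator, compliance gate, and metrics wrap either result;
    only the runtime stages differ."""
    if cfg.pipeline is PipelineType.CASCADED:
        return ["stt", "llm", "tts"]
    return ["s2s"]
```

So an appointment-confirmation agent would carry `PipelineType.CASCADED` and a qualification agent `PipelineType.REALTIME`, with everything upstream of `runtime_topology` shared.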
The thing we don't try to optimize
We deliberately don't try to make the bot interrupt faster. A bot that jumps in within 100ms of the user pausing feels like a bot that isn't listening. Six hundred milliseconds of perceived latency is the goal, not the floor. The budget exists so we can spend it intentionally — not so we can race to zero.
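Spending the budget intentionally sometimes means adding delay back in. A minimal sketch of that inverse optimization, assuming an illustrative function name and a 600ms default target:

```python
def hold_before_speaking(elapsed_ms: float, target_ms: float = 600.0) -> float:
    """How many ms to hold a ready reply so the bot never jumps in
    faster than the perceived-latency target."""
    return max(0.0, target_ms - elapsed_ms)

print(hold_before_speaking(150))  # 450.0 — reply was too fast, pad it
print(hold_before_speaking(700))  # 0.0   — already past target, speak now
```

The pad runs between TTS-first-chunk-ready and playback start, so a fast path that beats the budget still lands inside the 300–600ms window rather than under it.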