#0131 · interruption · 5 min read

interruption detection that doesn't trigger on mm-hmm

by voice platform team · Apr 2, 2026

Backchannels are the verbal equivalent of nodding. "Mm-hmm." "Uh-huh." "Right." "Yeah." They aren't interruptions; they're encouragement. A human listener tracks them and keeps talking. A naive voice agent stops dead, treats the backchannel as a turn boundary, and the conversation collapses into a stutter loop.

Default voice-activity detectors (Silero, WebRTC VAD) only know one thing: there's energy in the audio frame. They don't know if that energy is a word, a backchannel, a cough, or someone closing a door. Treating every detection as a potential interruption is correct for safety, but it's wrong for warmth.

What we ship

A small classifier that runs upstream of the LLMUserAggregator, after the VAD has flagged 'something happened' but before the audio is forwarded as a UserStartedSpeakingFrame. The classifier scores each chunk for backchannel-likelihood using a combination of:

Duration: backchannels are almost always under 400ms.
Pitch contour: backchannels have a flat or falling F0; new utterances rise.
A 16-class ONNX model trained on ~50k labeled clips from our anonymized preview corpus.
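The first two signals are cheap enough to sketch without the model. What follows is a minimal illustration, not our production code: the function names, the 10ms hop between F0 frames, and the 20 Hz/s tolerance for "flat" are all assumptions made for the example, and it assumes an F0 track (with 0 for unvoiced frames) has already been extracted by some pitch tracker.

```python
from statistics import mean

# From the article: backchannels are almost always under 400ms.
MAX_BACKCHANNEL_MS = 400.0


def duration_ms(num_samples: int, sample_rate: int) -> float:
    """Length of the audio chunk in milliseconds."""
    return 1000.0 * num_samples / sample_rate


def f0_slope(f0_hz: list[float], hop_ms: float = 10.0) -> float:
    """Least-squares slope of F0 in Hz per second, over voiced frames only.

    Frames with f0 <= 0 are treated as unvoiced and skipped.
    The 10ms hop is an assumed frame spacing, not a documented value.
    """
    points = [(i * hop_ms / 1000.0, f) for i, f in enumerate(f0_hz) if f > 0]
    if len(points) < 2:
        return 0.0  # not enough voiced frames to estimate a trend
    t_mean = mean(t for t, _ in points)
    f_mean = mean(f for _, f in points)
    num = sum((t - t_mean) * (f - f_mean) for t, f in points)
    den = sum((t - t_mean) ** 2 for t, _ in points)
    return num / den


def looks_like_backchannel(
    num_samples: int,
    sample_rate: int,
    f0_hz: list[float],
    rise_tolerance_hz_per_s: float = 20.0,  # assumed cutoff for "flat"
) -> bool:
    """Heuristic only: short AND flat-or-falling pitch."""
    short = duration_ms(num_samples, sample_rate) < MAX_BACKCHANNEL_MS
    flat_or_falling = f0_slope(f0_hz) <= rise_tolerance_hz_per_s
    return short and flat_or_falling
```

A 300ms chunk with a falling contour passes; the same chunk with a rising contour, or a one-second chunk, does not. In practice these heuristics would be inputs alongside the model score rather than a replacement for it.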

When the classifier scores backchannel > 0.7, we suppress the interruption signal but emit a UserSpeakingFrame for downstream observability. The bot keeps talking. The backchannel is logged. Everyone wins.
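The gating rule itself is small. Here is a rough sketch of the decision, assuming the classifier has already produced a score in [0, 1]; `Action`, `GateDecision`, and `gate` are hypothetical names for this example and are not framework APIs. Wiring the result into actual frames is left out.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    INTERRUPT = "interrupt"  # forward as a normal interruption
    SUPPRESS = "suppress"    # bot keeps talking


@dataclass(frozen=True)
class GateDecision:
    action: Action
    score: float
    # True when the event should still be surfaced downstream
    # for observability even though the interruption is suppressed.
    emit_observability_event: bool


def gate(backchannel_score: float, threshold: float = 0.7) -> GateDecision:
    """Suppress the interruption only when the score is strictly above threshold."""
    if not 0.0 <= backchannel_score <= 1.0:
        raise ValueError("backchannel_score must be in [0, 1]")
    if backchannel_score > threshold:
        return GateDecision(Action.SUPPRESS, backchannel_score, True)
    return GateDecision(Action.INTERRUPT, backchannel_score, False)
```

Note that a score exactly at the threshold still interrupts. When in doubt, the sketch errs toward letting the user take the turn.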

The 4ms budget

End to end, the classifier adds about 4ms per chunk on a single CPU core. We could do better with batching, but the marginal latency is irrelevant compared to the alternative (a busted turn). We run it inline; the model file is 1.2MB, and we ship it with the worker image.
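If you want to check a number like this on your own hardware, a timing harness along these lines is enough. It is a sketch under assumptions: `classify` stands in for whatever callable wraps your model, `mean_latency_ms` is a hypothetical helper, and `BUDGET_MS` simply reuses the 4ms figure above.

```python
import time
from typing import Callable, Iterable

BUDGET_MS = 4.0  # the per-chunk figure quoted above


def mean_latency_ms(
    classify: Callable[[bytes], float],
    chunks: Iterable[bytes],
) -> float:
    """Average wall-clock time per chunk, in milliseconds."""
    total_s = 0.0
    count = 0
    for chunk in chunks:
        start = time.perf_counter()
        classify(chunk)
        total_s += time.perf_counter() - start
        count += 1
    if count == 0:
        raise ValueError("need at least one chunk to measure")
    return 1000.0 * total_s / count


def within_budget(latency_ms: float, budget_ms: float = BUDGET_MS) -> bool:
    return latency_ms <= budget_ms
```

Run it over a few thousand real chunks rather than synthetic silence, since inference time can vary with input length.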

Where it gets it wrong

Cross-cultural variance is real. "Hai" in Japanese is a backchannel, sometimes. In English it's an interruption, sometimes. We don't try to be clever about this — operators can set a per-agent backchannel-suppression threshold, and the dashboard shows a histogram of suppressed events so they can tune. The default we ship works for ~70% of cases; the rest is configuration.
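To make the tuning loop concrete, here is a small sketch of a per-agent threshold lookup and a histogram of suppressed events. The names (`threshold_for`, `suppressed_histogram`), the dict-based override store, and the choice to bin by classifier score are assumptions for illustration; the real configuration and dashboard may look different.

```python
DEFAULT_THRESHOLD = 0.7  # the shipped default described above


def threshold_for(agent_id: str, overrides: dict[str, float]) -> float:
    """Per-agent threshold, falling back to the default."""
    return overrides.get(agent_id, DEFAULT_THRESHOLD)


def suppressed_histogram(scores: list[float], num_bins: int = 10) -> list[int]:
    """Count suppressed events per equal-width score bin over [0, 1].

    Bin i covers [i / num_bins, (i + 1) / num_bins); a score of
    exactly 1.0 is clamped into the last bin.
    """
    counts = [0] * num_bins
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be in [0, 1]")
        index = min(int(score * num_bins), num_bins - 1)
        counts[index] += 1
    return counts
```

An operator who sees most suppressed events piled up just above the threshold has a hint that the threshold is doing a lot of borderline work for that agent and may be worth raising.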

perceived 'agent listened' rating: +34 NPS

We added the classifier in late February. The week after, our customer success channel got noticeably quieter on the topic of "the bot keeps cutting off the customer." Sometimes the most important infra work is the kind nobody mentions because it stopped breaking.
