keeping a browser WebRTC test call alive over symmetric NAT
Symmetric NAT is the worst kind of network you encounter often. Your laptop on a coffee-shop wifi: behind a full-cone NAT, STUN works, life is fine. Your phone on a carrier hotspot: behind a symmetric NAT, STUN gets a different external port for every destination, and your peer can't reach you. This is the network of the people most likely to evaluate your product on their phone while waiting for an Uber.
What we kept seeing
The connection would go through every state cleanly: 'connecting' → 'connected'. Audio would flow for 8–14 seconds. Then the data channel would silently freeze, then the audio track would emit a media-stream error, and finally aiortc would fire its `iceconnectionstatechange` event with the state already at 'closed'. We thought it was the bot's fault. It wasn't.
The actual diagnosis
Three things were happening at once, which is why it took a week to untangle:
- We were running with STUN servers only. STUN tells you your reflexive address; on symmetric NAT, that mapping is only valid for the destination it was discovered against, so the port STUN reports is the one the NAT assigned for the STUN server, not for your peer. The peer learns the address, but cannot reach it. The connection 'works' just long enough for the initial handshake, which travels over the signaling channel rather than the media path, before falling apart.
- Our heartbeat was every 60 seconds. The carrier NAT timeout was 30. By the time the heartbeat would have re-pinned the mapping, the mapping was already gone.
- When the connection died, our reconnect logic was triggering an SDP renegotiation instead of an ICE restart. ICE restart is what you want; renegotiation tears down everything and the peer has to consent. Phones in the background do not consent.
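The first bullet is the core of it, and it's easy to model. Here's a toy sketch of the two mapping policies (the class names, addresses, and ports are ours, purely for illustration; real NATs also filter inbound traffic and expire mappings):

```python
# Toy model of NAT port-mapping behavior. Illustrative only: real NATs
# also filter inbound packets and time out idle mappings.

class FullConeNAT:
    """One external port per internal socket, reused for every destination."""
    def __init__(self):
        self.mapping = None
        self.next_port = 40000

    def external_port(self, dest):
        if self.mapping is None:
            self.mapping = self.next_port
        return self.mapping

class SymmetricNAT:
    """A fresh external port per (internal socket, destination) pair."""
    def __init__(self):
        self.mappings = {}
        self.next_port = 40000

    def external_port(self, dest):
        if dest not in self.mappings:
            self.mappings[dest] = self.next_port
            self.next_port += 1
        return self.mappings[dest]

stun = ("stun.example.org", 3478)  # hypothetical STUN server
peer = ("203.0.113.7", 50000)      # hypothetical remote peer

full = FullConeNAT()
sym = SymmetricNAT()

# Full cone: the port STUN reported is the same port the peer can reach.
assert full.external_port(stun) == full.external_port(peer)

# Symmetric: the port STUN reported doesn't exist for the peer's traffic.
assert sym.external_port(stun) != sym.external_port(peer)
```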
What we changed
First fix: TURN servers. We added Cloudflare TURN in front of every public preview session. STUN still gets used when the network allows it (cheap), TURN takes over when it doesn't (~10x more expensive but it works). The cost is real but the conversion lift dwarfs it.
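For reference, wiring STUN plus TURN into an aiortc peer connection looks roughly like this. The URLs and credentials below are placeholders, not Cloudflare's actual endpoints; Cloudflare hands out short-lived TURN credentials per session, so in practice you'd mint these server-side right before creating the connection:

```python
from aiortc import RTCConfiguration, RTCIceServer, RTCPeerConnection

# Placeholder servers and credentials -- substitute whatever your
# TURN provider issues for the session.
ICE_SERVERS = [
    # STUN first: effectively free, and sufficient on cooperative NATs.
    RTCIceServer(urls=["stun:stun.example.com:3478"]),
    # TURN as the relay fallback; offer UDP and TCP transports so at
    # least one survives hostile networks.
    RTCIceServer(
        urls=[
            "turn:turn.example.com:3478?transport=udp",
            "turn:turn.example.com:3478?transport=tcp",
        ],
        username="ephemeral-user",
        credential="ephemeral-pass",
    ),
]

pc = RTCPeerConnection(
    configuration=RTCConfiguration(iceServers=ICE_SERVERS)
)
```

ICE still prefers the cheaper candidate pairs, so the relay only carries traffic when the direct and reflexive paths fail.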
Second fix: heartbeat at 20s with a 40s liveness timeout. Pipecat ships a HeartbeatResponder; we wired it to send a `pipecat.ping` data-channel message every 20s and treat 40s of silence as a dead connection. Carriers' NAT timeouts vary from 30s (T-Mobile US) to 600s (some enterprise carriers); 20s sits comfortably under all of them.
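The timing logic is the important part, and it's independent of Pipecat. Here's a sketch of the bookkeeping, with the clock injected so it's testable; this is our illustration of what a heartbeat responder has to track, not Pipecat's actual implementation:

```python
import time

PING_INTERVAL = 20.0      # seconds; well under the 30s worst-case NAT timeout
LIVENESS_TIMEOUT = 40.0   # two missed pings => treat the connection as dead

class Heartbeat:
    """Ping/liveness bookkeeping sketch (hypothetical class, not Pipecat's
    HeartbeatResponder)."""

    def __init__(self, send_ping, clock=time.monotonic):
        self.send_ping = send_ping   # e.g. sends 'pipecat.ping' on the data channel
        self.clock = clock
        now = clock()
        self.last_sent = now
        self.last_heard = now

    def on_pong(self):
        """Call whenever any reply arrives from the peer."""
        self.last_heard = self.clock()

    def tick(self):
        """Call periodically; returns False once the peer is considered dead."""
        now = self.clock()
        if now - self.last_sent >= PING_INTERVAL:
            self.send_ping()
            self.last_sent = now
        return (now - self.last_heard) < LIVENESS_TIMEOUT
```

The side effect of pinging every 20s is that the ping itself re-pins the NAT mapping, so the liveness check and the keepalive are the same packet.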
Third fix: ICE restart on connection-state degradation, not full renegotiation. An ICE restart (in the browser, `pc.restartIce()`, or `pc.createOffer({ iceRestart: true })` on older APIs) generates fresh ICE credentials and re-runs candidate gathering and connectivity checks. There is still a small offer/answer exchange to carry the new credentials, which our aiortc side answers like any other, but the media sections are untouched: the audio track stays bound, the data channel survives, the user notices nothing.
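Deciding when to restart is its own small policy. A hypothetical sketch of the escalation logic (the class, thresholds, and return values are made up for illustration; the actual restart call happens on the browser side):

```python
import time

RESTART_DEBOUNCE = 3.0   # 'disconnected' often self-heals; wait before acting
MAX_ICE_RESTARTS = 2     # after this many, fall back to a full reconnect

class ReconnectPolicy:
    """Decide what to do as connectionState degrades. Hypothetical policy
    sketch, not an aiortc or browser API."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.disconnected_at = None
        self.restarts = 0

    def on_state(self, state):
        """Feed connection-state changes; returns 'none', 'ice-restart',
        or 'full-reconnect'."""
        now = self.clock()
        if state == "connected":
            self.disconnected_at = None
            self.restarts = 0
            return "none"
        if state == "failed":
            return self._escalate()
        if state == "disconnected":
            if self.disconnected_at is None:
                self.disconnected_at = now
                return "none"
            if now - self.disconnected_at >= RESTART_DEBOUNCE:
                return self._escalate()
            return "none"
        return "none"

    def _escalate(self):
        if self.restarts < MAX_ICE_RESTARTS:
            self.restarts += 1
            return "ice-restart"   # caller triggers pc.restartIce() in the browser
        return "full-reconnect"
```

The debounce matters: 'disconnected' fires on transient packet loss, and restarting ICE on every blip churns the connection for no benefit.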
What we still don't have a great answer for
Captive-portal networks (hotel wifi, airline wifi, conference networks) where TURN is allowed but UDP is rate-limited. We see a clean connect, then degradation after about 90 seconds. The current workaround is TURN over TLS on port 443, which looks like ordinary HTTPS to the network but adds another ~80ms of latency. We'll write that one up when we're sure it works for everyone, not just the 80% case.
The lesson
Voice over the public internet is not 'WebRTC, with care.' It is six different networks behaving badly in six different ways, none of which you can test from your laptop on office wifi. Get a cheap Android, a cheap iPhone, and a prepaid SIM from each US carrier. Run your demo on each, sitting on the floor of an elevator. That is your QA matrix.