
Building voice agents on WebRTC: the production stack

WebRTC is the right transport for voice agents. Raw WebRTC is not a product. Here is what production demands beyond the protocol and how SipPulse AI delivers it.

SipPulse AI Engineering Team · October 20, 2025 · 6 min read

The first hard decision in any voice agent project is the realtime transport. Most teams default to WebSockets because that is what they know. They ship a demo. The demo works fine in the office. It falls apart the moment a real user joins from a flaky 4G connection, in an open-plan office, behind a corporate firewall. WebRTC is the protocol that solves all of those problems, and it is the standard behind every serious voice agent in 2026. But raw WebRTC is not a product. This post walks through why WebRTC wins, what production really demands on top of it, and how SipPulse AI delivers the parts you do not want to build yourself.

Why WebRTC is the right transport for voice agents

WebRTC was designed for low-latency, two-way audio and video over the public internet. It is the protocol behind every video call you have made in the last decade: Google Meet, Microsoft Teams, FaceTime over the web. The pieces it ships out of the box are exactly the ones a voice agent needs:

  • Sub-second audio transport with congestion control built in
  • Echo cancellation, noise suppression and automatic gain control on the client
  • NAT traversal that works across home routers, mobile carriers and corporate firewalls
  • Adaptive bitrate that survives a flaky 4G connection without dropping the call

A WebSocket can stream audio, but you have to bolt on most of that yourself. By the time you are done, you have built a worse version of WebRTC. There is a strong industry consensus on this point: anything realtime should use WebRTC.
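To make that concrete, here is one of the pieces you end up rebuilding over a raw WebSocket: a playout jitter buffer. This is a minimal sketch, not a production implementation; WebRTC ships reordering, loss concealment and congestion control out of the box.

```python
import heapq

class JitterBuffer:
    """Minimal playout buffer: reorders packets by sequence number and
    absorbs network jitter. WebRTC provides this out of the box; over a
    raw WebSocket you have to build it yourself."""

    def __init__(self, depth=3):
        self.depth = depth      # packets to hold before starting playout
        self.heap = []          # (sequence_number, payload)
        self.next_seq = 0
        self.primed = False

    def push(self, seq, payload):
        heapq.heappush(self.heap, (seq, payload))

    def pop(self):
        """Next in-order payload, b"" for a lost packet, None while the
        initial buffer is still filling or the buffer is empty."""
        if not self.primed:
            if len(self.heap) < self.depth:
                return None     # still priming
            self.primed = True
        if not self.heap:
            return None
        seq, payload = self.heap[0]
        if seq == self.next_seq:
            heapq.heappop(self.heap)
            self.next_seq += 1
            return payload
        # gap: packet lost or very late -- emit silence and move on
        self.next_seq += 1
        return b""
```

And this is only one item on the list: you would still need echo cancellation, adaptive bitrate and NAT traversal before matching what WebRTC gives you for free.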

What raw WebRTC does not give you

WebRTC is a protocol. It moves bytes. Everything that turns those bytes into a working voice agent has to be built. The list is longer than most teams realize:

  • A media server (an SFU or MCU) that scales beyond peer-to-peer and handles recording, observability and routing
  • Job dispatch that connects an inbound call to the right agent worker without race conditions
  • An audio pipeline that runs STT, LLM and TTS in parallel with sub-second latency
  • Turn detection that distinguishes a real end-of-turn from a backchannel like "uh-huh"
  • Barge-in handling that lets the user interrupt the agent without false triggers from background noise
  • A SIP bridge so the agent can answer a phone call, not just a browser session
  • Telemetry that captures per-call latency at every stage so you can debug why a particular conversation felt slow
  • Client SDKs for web, iOS, Android, React Native, Unity, the embedded device you forgot existed

Each of those is a months-long engineering project. Together they are a year of work before you start writing any actual agent logic. This is the gap between a WebRTC demo and a voice agent that ships.
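One of those bullets, the parallel audio pipeline, is worth sketching. The key idea is overlap: start synthesizing speech on the first LLM token instead of waiting for the full response. The stage stubs below are hypothetical placeholders (real stages stream audio and tokens over the network), but the concurrency structure is the point.

```python
import asyncio

# Hypothetical stage stubs -- real STT/LLM/TTS stream over the wire.
async def stt(audio):
    await asyncio.sleep(0.01)
    return "what is my balance"

async def llm_tokens(text):
    for tok in ["Your", "balance", "is", "$42."]:
        await asyncio.sleep(0.005)     # time-to-next-token
        yield tok

async def tts(token):
    await asyncio.sleep(0.005)
    return f"<audio:{token}>"

async def pipeline(audio):
    """Overlap the stages: kick off TTS for each LLM token as it
    arrives. This overlap, not faster models, is where most of the
    sub-second end-to-end latency comes from."""
    text = await stt(audio)
    chunks = []
    async for tok in llm_tokens(text):
        chunks.append(asyncio.create_task(tts(tok)))  # start immediately
    return await asyncio.gather(*chunks)
```

With serial stages, total latency is the sum of all three; with overlap, the user hears audio after STT plus time-to-first-token plus TTS first-byte, regardless of how long the full response takes.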

Telephony: where the WebRTC story meets the real world

Most voice agent users do not download SDKs. They call phone numbers. The WebRTC layer is fine for browser embeds and mobile apps, but production almost always needs a SIP trunk that bridges the WebRTC world to the public telephone network.

The bridge has to handle codec negotiation (PCMU and PCMA on the SIP side, Opus on WebRTC), DTMF for legacy IVR pass-throughs, STIR/SHAKEN for outbound caller-ID attestation, and number portability for businesses that already own a number. Build this yourself and you will spend a quarter just on the codec edge cases. Buy it from a generic VoIP provider and you will spend a different quarter integrating it with your agent runtime.
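The codec negotiation at the heart of the bridge reduces to a small decision, sketched below under simplified assumptions (real negotiation happens in SDP, with payload types and clock rates):

```python
def negotiate_codec(sip_offer, webrtc_caps):
    """Pick the first codec both sides support, honoring the SIP side's
    preference order. PCMU/PCMA (G.711) dominate SIP trunks while Opus
    dominates WebRTC, so the bridge usually ends up transcoding."""
    for codec in sip_offer:
        if codec in webrtc_caps:
            return codec, False          # direct passthrough
    return sip_offer[0], True            # transcode (e.g. PCMU <-> Opus)

# A trunk offering G.711 against a browser that also supports PCMU:
negotiate_codec(["PCMU", "PCMA"], ["opus", "PCMU"])  # ("PCMU", False)
```

The edge cases the paragraph above alludes to live outside this happy path: asymmetric offers, DTMF as RFC 2833 telephone-events versus in-band tones, and re-negotiation mid-call.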

A voice agent platform that ships SIP integration native to its WebRTC layer is not a feature. It is the difference between a product and a science project.

Adaptive turn detection beats raw VAD

Most voice agent prototypes start with Voice Activity Detection (VAD): when the user goes silent for 500ms, assume they are done, run the LLM, speak the response. It works in a quiet room. It breaks the moment the user has a coffee machine in the background, an open-plan office, a TV playing or just a habit of pausing mid-sentence.

Production turn detection uses prosodic signals (pitch drops, intonation), lexical cues (sentence boundaries, question completion) and confidence scores in addition to silence. Adaptive approaches detect true barge-ins faster than VAD in 64% of cases according to industry benchmarks. The difference shows up in product reviews: the same agent feels like a conversation with the right turn detection and like an answering machine without it.
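The contrast with raw VAD can be shown in a toy scoring function. The weights and threshold below are illustrative, not SipPulse's actual model; the point is that combining signals lets a short pause after a complete question end the turn while a long mid-sentence pause does not.

```python
def end_of_turn(silence_ms, text, pitch_falling):
    """Toy adaptive end-of-turn score: silence is one signal among
    several instead of the whole decision. Weights are illustrative."""
    score = min(silence_ms / 700, 1.0) * 0.4          # silence, capped
    if text.rstrip().endswith(("?", ".", "!")):        # lexical boundary
        score += 0.35
    if pitch_falling:                                  # terminal pitch drop
        score += 0.25
    return score >= 0.6

# A long mid-sentence pause no longer triggers the agent...
end_of_turn(600, "so I was thinking", pitch_falling=False)   # False
# ...while a completed question does, even after a short pause.
end_of_turn(300, "what's my balance?", pitch_falling=True)   # True
```

A pure 500ms-silence VAD would get both of these cases wrong: it would cut off the thinking user and make the asking user wait.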

Telemetry: prove your latency or hide it

Every voice agent vendor claims low latency. Almost none publish numbers. The reason is that real-world latency on production traffic is usually 5x worse than the marketing slides. Industry medians sit at 1.4 to 1.7 seconds, while the human conversational expectation is closer to 200ms.

The honest way to evaluate a voice agent is to measure it on a live workload, broken out per stage: STT, LLM time-to-first-token, TTS first-byte, end-of-utterance detection, full conversation latency. Anything else is theater.
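In practice that per-stage breakdown is a subtraction over event timestamps. The event names below are illustrative, not a fixed schema, but any honest measurement reduces to something like this:

```python
def stage_breakdown(ts):
    """Per-stage latency (ms) for one conversational turn, computed from
    event timestamps. Event names here are illustrative."""
    return {
        "eou_detect": ts["eou_detected"] - ts["user_stopped_speaking"],
        "stt":        ts["transcript_final"] - ts["eou_detected"],
        "llm_ttft":   ts["first_token"] - ts["transcript_final"],
        "tts_fb":     ts["first_audio_byte"] - ts["first_token"],
        "total":      ts["first_audio_byte"] - ts["user_stopped_speaking"],
    }
```

The "total" line is the only number the user feels; the other four tell you which stage to fix when it is too high.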

How SipPulse AI delivers the production stack

SipPulse AI is the production stack on top of WebRTC. We give you:

  • A managed realtime layer with sub-second WebRTC transport, scalable media routing and global presence
  • Native SIP trunk integration so the same agent answers browser calls and phone calls
  • Pulse Precision Pro for STT, tuned for noisy contact center audio and Brazilian Portuguese
  • Pulse TTS for synthesis with multiple voice models and sub-150ms first-byte latency
  • NIVA for combining IVR flows and multiple specialized agents in a block-based builder, so a non-engineer can wire up "greet the caller, classify intent, route to the right agent, fall back to a human" in an afternoon
  • Adaptive turn detection out of the box
  • Per-call telemetry delivered to your systems via webhooks; an example consumer page is open so you can see the events live

We sell the entire production stack with the engineering work already done, not a thin wrapper around hosted models. You can talk to it now at sippulse.ai/demos and watch the latency on the dashboard.
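Consuming the per-call telemetry webhooks is a few lines on your side. The payload below is a hypothetical example for illustration, not the actual SipPulse webhook schema; see the live consumer page for real events.

```python
import json

# Hypothetical per-call telemetry event -- field names are illustrative.
event = json.loads("""{
  "type": "call.completed",
  "call_id": "c-123",
  "latency_ms": {"stt": 180, "llm_ttft": 420, "tts_first_byte": 140}
}""")

def total_latency(evt):
    """Sum the staged latencies to get the figure the caller felt."""
    return sum(evt["latency_ms"].values())

total_latency(event)  # 740
```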


Conclusion

WebRTC is the right transport for voice agents, but the protocol is the easy part. The hard part is everything you have to build on top: media routing, SIP bridge, turn detection, telephony codecs, telemetry, agent dispatch, multilingual STT and TTS, plus a flow builder for non-engineers. SipPulse AI ships all of that as one stack. Try the demo or contact our team to see how we put it together.

#voice-agent #WebRTC #SIP #realtime #telephony #production
