Blog

Voice AI agent architecture: STT, LLM, TTS and the latency budget

A production voice AI agent runs STT, LLM and TTS in a tight 800ms budget. Here is how the pipeline works and why latency design wins or loses every conversation.

SipPulse AI - Engineering Team · September 15, 2025 · 6 min read

If you have ever waited 3 seconds for a voice AI agent to reply, you already know the product killer. It is not the wrong answer. It is the long pause before the answer. Conversation feels broken when latency exceeds about 500ms, and it feels human when latency stays around 200ms. The whole engineering challenge of a voice AI agent collapses to this question: how do you keep the round trip under one second while running speech-to-text, an LLM and text-to-speech in series? This post walks through the pipeline, the 800ms latency budget that production teams target in 2026, and the architectural trade-offs that decide whether your agent ships or stays a demo.

The voice AI agent pipeline

Every voice AI agent worth shipping has the same skeleton: voice in, voice out, with three stages in between. Speech-to-text (STT, also called ASR) converts audio to text. A large language model (LLM) reads the text and produces a response. Text-to-speech (TTS) converts the response back to audio. Wrap that in a transport layer (WebRTC for the browser, SIP for telephony) and you have an agent.

The deceptive thing about that diagram is that it looks simple. The reality is that each stage has its own latency, error rate and quality ceiling, and the failures compound. STT errors confuse the LLM. LLM hesitation stalls TTS. TTS quality determines whether users believe the voice or hang up.

The 800ms latency budget

Production voice agent teams in 2026 target a total round trip of around 800ms from user speech ending to agent speech starting. A typical allocation looks like this:

  • Voice activity detection and audio capture: 50ms
  • STT transcription (streaming): 150ms
  • LLM time-to-first-token (TTFT): 400ms
  • TTS first audio chunk: 150ms
  • Network overhead: 50ms
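A budget like this is only useful if you check measured latencies against it per call. The sketch below encodes the allocation above; the stage names and the over-budget check are illustrative, not a SipPulse API.

```python
# Hypothetical latency budget check. Stage names and slice sizes mirror
# the allocation above; the helper itself is an illustrative sketch.
BUDGET_MS = {
    "vad_and_capture": 50,
    "stt_streaming": 150,
    "llm_ttft": 400,
    "tts_first_chunk": 150,
    "network": 50,
}

def over_budget(measured_ms: dict, budget=BUDGET_MS, total_target=800):
    """Return per-stage overruns and whether the total blows the budget."""
    overruns = {k: v - budget[k] for k, v in measured_ms.items()
                if v > budget.get(k, 0)}
    total = sum(measured_ms.values())
    return overruns, total > total_target

measured = {"vad_and_capture": 40, "stt_streaming": 180, "llm_ttft": 450,
            "tts_first_chunk": 140, "network": 40}
print(over_budget(measured))  # ({'stt_streaming': 30, 'llm_ttft': 50}, True)
```

Note that a call can stay within every per-stage slice and still miss the total, or overrun one stage and recover in another, which is why both checks matter.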

The LLM gets the largest slice because it is the hardest to compress without sacrificing answer quality. STT and TTS can be optimized aggressively because they are mostly engineering problems. The LLM is bounded by model size and prompt length.

Industry research from AssemblyAI and Hamming puts the median real-world voice agent latency at 1.4 to 1.7 seconds, with 10% of calls exceeding 3 seconds. That is roughly 7 to 8x slower than the human conversation expectation of 200ms. The teams that ship products that feel natural are running at half the industry median.

Why TTFT matters more than total LLM time

There are two latency numbers worth tracking on the LLM. Total generation time is how long the model spent producing the full response. Time-to-first-token (TTFT) is how long until the first token comes back. For voice, TTFT is the one that matters. The reason: TTS can begin streaming audio as soon as the first sentence is complete, so the user starts hearing the agent before the LLM has finished thinking. If TTFT is 800ms, the user feels 800ms of dead air no matter how fast TTS runs.
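In code, "begin streaming audio as soon as the first sentence is complete" amounts to buffering LLM tokens and flushing to TTS at each sentence boundary. A minimal sketch, where `token_stream` and `synthesize` are assumed stand-ins for a real LLM token iterator and TTS client:

```python
import re

# Forward LLM tokens to TTS one sentence at a time, so synthesis starts
# at the first sentence boundary instead of after the full response.
SENTENCE_END = re.compile(r"[.!?]\s*$")

def stream_to_tts(token_stream, synthesize):
    buffer = ""
    for token in token_stream:
        buffer += token
        if SENTENCE_END.search(buffer):
            synthesize(buffer.strip())  # TTS can start on this sentence now
            buffer = ""
    if buffer.strip():
        synthesize(buffer.strip())  # flush any trailing partial sentence

chunks = []
stream_to_tts(iter(["Hi", ",", " I can", " help.", " What", " do you need?"]),
              chunks.append)
print(chunks)  # ['Hi, I can help.', 'What do you need?']
```

With this shape, the user's perceived delay is TTFT plus the time to the first sentence boundary, not total generation time.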

The same logic applies to STT. A streaming STT model emits partial transcripts as the user speaks, so the LLM can begin reasoning before the user finishes. A turn-based STT that waits for end-of-speech adds full silence to the budget.

Streaming versus turn-based pipelines

The architectural choice that most affects voice agent latency is streaming versus turn-based. In a turn-based pipeline, each component waits for the previous one to finish before processing. In a streaming pipeline, components produce partial output continuously and downstream components consume it as it arrives.

Streaming is the only way to hit the 800ms budget. The STT emits chunks while the user is speaking, the LLM starts generating as soon as it has a clause to work with, and TTS begins synthesizing audio before the LLM is done. The user perceives the agent thinking and speaking in parallel rather than in sequence.
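One common way to wire this up is a chain of queues, with each stage running as its own task and consuming partial output as it arrives. The sketch below uses stub transformations in place of real STT, LLM and TTS clients; the queue-per-stage shape is the point, not the stubs.

```python
import asyncio

# Minimal streaming-cascade sketch: three stages connected by queues.
# A None item signals end-of-stream and is propagated downstream.
async def stage(inbox, outbox, transform):
    while (item := await inbox.get()) is not None:
        await outbox.put(transform(item))
    await outbox.put(None)

async def main():
    audio_q, text_q, reply_q, speech_q = (asyncio.Queue() for _ in range(4))
    tasks = [
        asyncio.create_task(stage(audio_q, text_q, lambda a: f"text({a})")),
        asyncio.create_task(stage(text_q, reply_q, lambda t: f"reply({t})")),
        asyncio.create_task(stage(reply_q, speech_q, lambda r: f"audio({r})")),
    ]
    for chunk in ["chunk1", "chunk2", None]:  # None = end of user speech
        await audio_q.put(chunk)
    out = []
    while (item := await speech_q.get()) is not None:
        out.append(item)
    await asyncio.gather(*tasks)
    return out

print(asyncio.run(main()))  # ['audio(reply(text(chunk1)))', 'audio(reply(text(chunk2)))']
```

In a real agent each stage holds a network connection to a streaming API, but the property is the same: the first audio chunk leaves the TTS stage before the last audio chunk has entered the STT stage.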

Turn-based pipelines are simpler to build and debug, which is why most prototypes start there. They are also why most prototypes never become products.

Speech-to-speech versus cascading: when each wins

A new option since late 2024 is speech-to-speech models like the GPT-4o realtime API or NVIDIA Nemotron. These collapse STT, LLM and TTS into a single model that takes audio in and produces audio out. The win is latency, often well under the 800ms budget. The cost is loss of flexibility: function calling, RAG lookups and choice of models all become harder.

Many production deployments in 2026 are hybrid: speech-to-speech for small talk and emotion-rich openings, with cascade fallback (STT, LLM, TTS as separate components) when a tool call or RAG lookup is needed. Cascade gives the most flexibility and the cleanest tool use. Speech-to-speech gives the snappiest opening line.

The right answer depends on the use case. A booking agent that always needs to call a calendar API stays cascade. A friendly receptionist that handles a lot of small talk benefits from speech-to-speech.
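A hybrid deployment needs a routing decision per turn. The sketch below stands in for that logic; the keyword triggers are illustrative placeholders for a real intent classifier, and the path names are assumptions, not a documented API.

```python
# Hypothetical router for a hybrid deployment: stay on the
# speech-to-speech path unless the turn needs a tool call or RAG lookup,
# then fall back to the cascade for clean function calling.
TOOL_TRIGGERS = ("book", "schedule", "cancel", "look up", "order status")

def pick_path(transcript: str) -> str:
    needs_tools = any(t in transcript.lower() for t in TOOL_TRIGGERS)
    return "cascade" if needs_tools else "speech_to_speech"

print(pick_path("Hi there, how are you?"))        # speech_to_speech
print(pick_path("Can you book me for Tuesday?"))  # cascade
```

The trade-off is that every misroute either wastes latency (cascade on small talk) or breaks a tool call (speech-to-speech on a booking request), so the classifier's precision matters more than its sophistication.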

Where SipPulse AI fits

SipPulse AI is our voice agent platform. WebRTC transport and SIP telephony are first-class out of the box, and the default cascade uses Pulse Precision Pro for STT and Pulse TTS for synthesis. On top sits NIVA, a block-based builder for combining IVR flows and multiple agents. NIVA is the layer that makes a real production agent easy to assemble, easy to prototype against your data and easy to evolve once it is live.

We sell the full stack: agent platform, IVR builder, STT, TTS, and per-call telemetry delivered to your systems via webhooks. The proof is public: open our demo, have a real conversation with our agent, then check our example telemetry viewer, a small open page that consumes the same webhook events your deployment would receive. Real demo calls hit around 800ms total round trip, broken out by llm_ttft_ms, tts_latency_ms, eou_latency_ms and conv_latency_ms. Most teams in this space hide their real latency. We let you measure ours on the same workload you would actually use.
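A consumer for those per-call telemetry events can be very small. The metric field names below come from the article; the payload shape, `call_id` field and 800ms threshold are assumptions for illustration, not the documented webhook schema.

```python
import json

# Sketch of a per-call telemetry webhook consumer. Field names
# llm_ttft_ms, tts_latency_ms, eou_latency_ms and conv_latency_ms are
# from the article; the rest of the payload is a hypothetical example.
def summarize_call(event_json: str, budget_ms: int = 800) -> str:
    e = json.loads(event_json)
    total = e["conv_latency_ms"]
    status = "within budget" if total <= budget_ms else "over budget"
    return (f"call {e['call_id']}: ttft={e['llm_ttft_ms']}ms "
            f"tts={e['tts_latency_ms']}ms eou={e['eou_latency_ms']}ms "
            f"total={total}ms ({status})")

sample = json.dumps({"call_id": "c-123", "llm_ttft_ms": 390,
                     "tts_latency_ms": 140, "eou_latency_ms": 210,
                     "conv_latency_ms": 780})
print(summarize_call(sample))
```

Feeding these summaries into alerting against the per-stage budget closes the loop: you find out which stage is eating the 800ms before users tell you the agent feels slow.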

Conclusion

A voice AI agent lives or dies by its latency budget. The 800ms target is achievable with a streaming cascade pipeline and aggressive optimization at every layer. Speech-to-speech models open new options but do not replace the need for tool use and RAG. If you want to build something users actually enjoy talking to, start with the latency budget and work backwards from there. Try our live demo to feel the difference, or talk to our team about deploying SipPulse AI on your own workload.

#voice AI agent #voice agent latency #STT #LLM #TTS #architecture
