Evaluating voice AI agents in production: WER, MOS, latency
Voice agent evaluation is more than picking a model. WER under 5%, MOS 4.3 or higher, latency under 800ms, FCR above 85%. Here are the metrics that matter.

Most voice agent projects ship without ever being properly evaluated. The team picks a vendor based on a benchmark page, builds a demo, hears it work in the office, and turns it loose on production traffic. Then the support tickets start: "the agent never understood me", "it cut me off", "the voice sounded weird". Voice agent evaluation is the discipline that catches all of that before customers do, and the metrics that matter are not subjective impressions. They are WER, MOS, end-to-end latency, task success rate and first call resolution, measured on real workload. This post walks through what each one means, what the production targets look like in 2026, and how to set up the evaluation loop so problems surface in the dashboard, not in the inbox.
Why offline evals miss what production exposes
Offline evaluation runs the model against a fixed dataset and produces a single score. It is necessary but not sufficient. Real conversations have variables that no test set captures: codec compression on phone audio, background noise from real environments, accents, multi-turn context drift, customers who change topics mid-call, hallucinated answers under pressure.
The result is that a voice agent scoring 95% on offline benchmarks can drop to 70% on production calls. That gap is what evaluation is for. Production evaluation runs continuously, on every live call, capturing the metrics that predict customer experience.
WER for STT: the 5% line
Word Error Rate is the canonical STT metric: WER = (substitutions + deletions + insertions) / total reference words × 100. The production target for enterprise voice agents is WER under 5%. Above 8%, the downstream LLM starts producing wrong answers from misheard inputs, which is the failure mode customers complain about most.
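The formula above can be sketched as a word-level Levenshtein distance. This is a minimal illustration, not a production scorer (real pipelines also normalize casing, punctuation and numerals before comparing):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: minimum edits (sub + del + ins) over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = min edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)

# One misheard word out of five: WER = 20.0
print(wer("please update my billing address", "please update my building address"))
```

A single substitution in a five-word utterance already costs 20% WER, which is why short confirmation turns ("yes", account numbers) dominate the error budget on real calls.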
WER on the model's marketing page is rarely the WER you will get in production. The reasons:
- The benchmark dataset is usually clean studio audio, not phone codec audio
- The benchmark language register is news-reading or audiobooks, not contact center dialect
- Real audio has accents, code-switching and background noise the test set lacks
The honest WER number is the one you measure on a sample of your own audio. Anything else is theater.
MOS for TTS: 4.3 or higher
Mean Opinion Score is the gold standard for TTS quality evaluation. Human listeners rate synthesized speech on a 1-5 scale where 5 is excellent. Scores of 4.3 to 4.5 indicate quality rivaling natural human speech. Below 4.0 the voice sounds noticeably synthetic, and customers will mention it.
MOS is expensive to run because it requires human raters. The proxies most teams use are:
- MUSHRA tests with smaller rater pools for relative comparisons
- Automated MOS predictors that estimate MOS from audio features
- A/B testing on real users with a satisfaction signal
For voice agents specifically, the relevant question is "does the voice damage the customer experience?" A 4.3 MOS voice on a contact center call is invisible. A 3.8 voice gets noted in survey comments.
Latency: the 800ms target vs the 1.4-1.7s reality
Latency is where the gap between marketing and production is widest. Industry research puts the median real-world voice agent latency at 1.4 to 1.7 seconds, with 10 percent of calls exceeding 3 seconds. The human conversation expectation is closer to 200ms, with anything above 500ms feeling noticeably delayed.
The 800ms total budget that production teams target breaks down as:
- VAD and audio capture: 50ms
- Streaming STT: 150ms
- LLM time-to-first-token: 400ms
- TTS first audio chunk: 150ms
- Network overhead: 50ms
Track end-to-end conversation latency, but also the per-stage breakdown. A 1.5-second call where the LLM TTFT is 1.2 seconds is a different problem than a 1.5-second call where the network adds 800ms. The fix depends on knowing which stage is slow.
Task success rate and FCR
Latency and accuracy mean nothing if the agent does not actually solve the customer's problem. Two business-level metrics tie evaluation back to outcomes:
- Task success rate (TSR): did the agent complete the intended task in the call (book the appointment, update the address, charge the card)
- First call resolution (FCR): did the customer's issue get resolved on this call without a follow-up
The production target for FCR on voice AI flows is 85% or higher. Below that, the cost savings start eroding because escalated calls hit the human agent queue. Above it, the math justifies broader rollout.
TSR and FCR are typically scored by an LLM judge running over call transcripts in your Auto QA pipeline, applying a rubric specific to each flow. Self-reporting (asking the customer "did this resolve your issue?") works for opt-in surveys but suffers from low response rates.
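Once each call carries those two verdicts, the aggregate rates are a straight ratio. A minimal sketch, assuming the per-call booleans come out of your Auto QA judge (the `CallScore` shape is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class CallScore:
    """Hypothetical per-call verdicts from an Auto QA rubric."""
    task_completed: bool        # TSR: intended task done on this call
    resolved_no_followup: bool  # FCR: resolved without a follow-up contact

def rates(calls: list) -> tuple:
    """Aggregate TSR and FCR as percentages over a batch of scored calls."""
    n = len(calls)
    tsr = 100.0 * sum(c.task_completed for c in calls) / n
    fcr = 100.0 * sum(c.resolved_no_followup for c in calls) / n
    return round(tsr, 1), round(fcr, 1)

# Illustrative batch: 20 calls, 19 tasks completed, 17 resolved first-contact.
sample = ([CallScore(True, True)] * 17
          + [CallScore(True, False)] * 2
          + [CallScore(False, False)])
print(rates(sample))  # (95.0, 85.0) — FCR sits exactly on the 85% threshold
```

Note that FCR can never exceed TSR in this framing: a call whose task failed was by definition not resolved, which is why the two numbers diverge as escalations grow.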
The 50+ metrics that matter in production
WER, MOS, latency, TSR and FCR are the headline numbers. Real production observability goes deeper. There are 50+ metrics across multiple layers worth tracking:
- Audio signal integrity: jitter, packet loss, codec mismatch
- Streaming latency per stage: STT TTFT, STT final, LLM TTFT, LLM total, TTS TTFB, TTS total
- Turn-taking quality: barge-in success rate, false barge-in rate, end-of-utterance detection latency
- Multi-turn context tracking: did the agent remember the customer's name across 6 turns
- Hallucination resilience: does the agent invent policies, prices or facts when uncertain
- Safety adherence: did the agent stay within disclosed scope, did it refuse out-of-policy requests cleanly
Many of these only move under real traffic, which is why you need both: offline evals on a fixed test set to gate releases, and online metrics on every live call to catch what the test set misses.
How SipPulse AI exposes evaluation data
Every call on SipPulse AI emits a structured webhook with the metrics that matter for voice agent evaluation: per-call duration, llm_ttft_ms, llm_latency_ms, tts_latency_ms, eou_latency_ms, conv_latency_ms, plus request counters per provider. You wire those events into your own observability stack to drive dashboards, alerts and Auto QA.
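A consumer for those events can be a small function in your observability stack. This is a sketch, not the official schema: the field names follow the list above, but the alert threshold and the exact payload shape are assumptions you should check against your own webhook traffic:

```python
# Hypothetical consumer for a per-call metrics webhook payload.
# Field names follow the metrics listed above; the 1500ms alert
# threshold is an assumption for illustration.
CONV_LATENCY_ALERT_MS = 1500

def handle_call_event(payload: dict) -> list:
    """Return alert strings for one per-call metrics payload."""
    alerts = []
    if payload.get("conv_latency_ms", 0) > CONV_LATENCY_ALERT_MS:
        # Attribute slowness to the worst stage so the fix is targeted.
        stages = {k: payload.get(k, 0)
                  for k in ("llm_ttft_ms", "tts_latency_ms", "eou_latency_ms")}
        worst = max(stages, key=stages.get)
        alerts.append(f"slow call: conv_latency_ms={payload['conv_latency_ms']}, "
                      f"worst stage {worst}={stages[worst]}")
    return alerts

# Hypothetical event: slow overall, clearly LLM-bound.
event = {"conv_latency_ms": 1800, "llm_ttft_ms": 1200,
         "tts_latency_ms": 180, "eou_latency_ms": 220}
print(handle_call_event(event))
```

In practice this logic would sit behind your webhook endpoint with signature verification in front of it; the function only shows the metric-to-alert mapping.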
Our open example viewer shows the same payload in a small public page so you can see the events and the schema. Real demo calls hit around 800ms total round trip, broken out per stage. Talk to the demo at sippulse.ai/demos, then inspect how it actually performed on your call.
Read also
- Voice AI agent architecture: STT, LLM, TTS and the latency budget
- SipPulse AI telemetry: every parameter explained
- Audio intelligence in 2026: transcription, diarization and benchmarks
Conclusion
Voice agent evaluation is what separates a vendor demo from a production-ready product. WER, MOS, latency, TSR and FCR are the headline numbers; real observability tracks 50 more. Set up the evaluation loop on day one and you ship better agents on day 90. Try our demo and inspect the real numbers, or contact our team to wire evaluation into your stack.