Evaluating voice AI agents in production: WER, MOS, latency
Voice agent evaluation is more than picking a model. WER under 5%, MOS 4.3 or higher, latency under 800ms, FCR above 85%. Here are the metrics that matter.

Most voice agent projects ship without ever being properly evaluated. The team picks a vendor based on a benchmark page, builds a demo, hears it work in the office, and turns it loose on production traffic. Then the support tickets start: "the agent never understood me", "it cut me off", "the voice sounded weird". Voice agent evaluation is the discipline that catches all of that before customers do, and the metrics that matter are not subjective impressions. They are WER, MOS, end-to-end latency, task success rate and first call resolution, measured on real workload. This post walks through what each one means, what the production targets look like in 2026, and how to set up the evaluation loop so problems surface in the dashboard, not in the inbox.
Why offline evals miss what production exposes
Offline evaluation runs the model against a fixed dataset and produces a single score. It is necessary but not sufficient. Real conversations have variables that no test set captures: codec compression on phone audio, background noise from real environments, accents, multi-turn context drift, customers who change topics mid-call, hallucinated answers under pressure.
The result is that a voice agent scoring 95% on offline benchmarks can drop to 70% on production calls. That gap is what evaluation is for. Production evaluation runs continuously, on every live call, capturing the metrics that predict customer experience.
WER for STT: the 5% line
Word Error Rate is the canonical STT metric: WER = (substitutions + deletions + insertions) / total reference words × 100. The production target for enterprise voice agents is WER under 5%. Above 8%, the downstream LLM starts producing wrong answers from misheard inputs, which is the failure mode customers complain about most.
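The formula above can be sketched as a word-level Levenshtein distance. This is a minimal illustration, not a production scorer (real pipelines also normalize casing, punctuation and numerals before comparing):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: minimum edits (sub + del + ins) over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = min edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)

# One misheard word out of five: WER = 20.0
print(wer("please update my billing address", "please update my building address"))
```

A single substitution in a five-word utterance already costs 20% WER, which is why short confirmation turns ("yes", account numbers) dominate the error budget on real calls.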
WER on the model's marketing page is rarely the WER you will get in production. The reasons:
- The benchmark dataset is usually clean studio audio, not phone codec audio
- The benchmark language register is news-reading or audiobooks, not contact center dialect
- Real audio has accents, code-switching and background noise the test set lacks
The honest WER number is the one you measure on a sample of your own audio. Anything else is theater.
MOS for TTS: 4.3 or higher
Mean Opinion Score is the gold standard for TTS quality evaluation. Human listeners rate synthesized speech on a 1-5 scale where 5 is excellent. Scores of 4.3 to 4.5 indicate quality rivaling natural human speech. Below 4.0 the voice sounds noticeably synthetic, and customers will mention it.
MOS is expensive to run because it requires human raters. The proxies most teams use are:
- MUSHRA tests with smaller rater pools for relative comparisons
- Automated MOS predictors that estimate MOS from audio features
- A/B testing on real users with a satisfaction signal
For voice agents specifically, the relevant question is "does the voice damage the customer experience?" A 4.3 MOS voice on a contact center call is invisible. A 3.8 voice gets noted in survey comments.
Latency: the 800ms target vs the 1.4-1.7s reality
Latency is where the gap between marketing and production is widest. Industry research puts the median real-world voice agent latency at 1.4 to 1.7 seconds, with 10 percent of calls exceeding 3 seconds. The human conversation expectation is closer to 200ms, with anything above 500ms feeling noticeably delayed.
The 800ms total budget that production teams target breaks down as:
- VAD and audio capture: 50ms
- Streaming STT: 150ms
- LLM time-to-first-token: 400ms
- TTS first audio chunk: 150ms
- Network overhead: 50ms
Track end-to-end conversation latency, but also the per-stage breakdown. A 1.5-second call where the LLM TTFT is 1.2 seconds is a different problem than a 1.5-second call where the network adds 800ms. The fix depends on knowing which stage is slow.
Task success rate and FCR
Latency and accuracy mean nothing if the agent does not actually solve the customer's problem. Two business-level metrics tie evaluation back to outcomes:
- Task success rate (TSR): did the agent complete the intended task in the call (book the appointment, update the address, charge the card)
- First call resolution (FCR): did the customer's issue get resolved on this call without a follow-up
The production target for FCR on voice AI flows is 85% or higher. Below that, the cost savings start eroding because escalated calls hit the human agent queue. Above it, the math justifies broader rollout.
TSR and FCR are typically scored by an LLM judge running over call transcripts in your Auto QA pipeline, applying a rubric specific to each flow. Self-reporting (asking the customer "did this resolve your issue?") works for opt-in surveys but suffers from low response rates.
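Once each call carries those two verdicts, the aggregate rates are a straight ratio. A minimal sketch, assuming the per-call booleans come out of your Auto QA judge (the `CallScore` shape is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class CallScore:
    """Hypothetical per-call verdicts from an Auto QA rubric."""
    task_completed: bool        # TSR: intended task done on this call
    resolved_no_followup: bool  # FCR: resolved without a follow-up contact

def rates(calls: list) -> tuple:
    """Aggregate TSR and FCR as percentages over a batch of scored calls."""
    n = len(calls)
    tsr = 100.0 * sum(c.task_completed for c in calls) / n
    fcr = 100.0 * sum(c.resolved_no_followup for c in calls) / n
    return round(tsr, 1), round(fcr, 1)

# Illustrative batch: 20 calls, 19 tasks completed, 17 resolved first-contact.
sample = ([CallScore(True, True)] * 17
          + [CallScore(True, False)] * 2
          + [CallScore(False, False)])
print(rates(sample))  # (95.0, 85.0) — FCR sits exactly on the 85% threshold
```

Note that FCR can never exceed TSR in this framing: a call whose task failed was by definition not resolved, which is why the two numbers diverge as escalations grow.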
The 50+ metrics that matter in production
WER, MOS, latency, TSR and FCR are the headline numbers. Real production observability goes deeper. There are 50+ metrics across multiple layers worth tracking:
- Audio signal integrity: jitter, packet loss, codec mismatch
- Streaming latency per stage: STT TTFT, STT final, LLM TTFT, LLM total, TTS TTFB, TTS total
- Turn-taking quality: barge-in success rate, false barge-in rate, end-of-utterance detection latency
- Multi-turn context tracking: did the agent remember the customer's name across 6 turns
- Hallucination resilience: does the agent invent policies, prices or facts when uncertain
- Safety adherence: did the agent stay within disclosed scope, did it refuse out-of-policy requests cleanly
Many of these only move under real traffic, which is why you need both: offline evals on a fixed test set to gate releases, and online metrics on every live call to catch what the test set misses.
How SipPulse AI exposes evaluation data
Every call on SipPulse AI emits a structured webhook with the metrics that matter for voice agent evaluation: per-call duration, llm_ttft_ms, llm_latency_ms, tts_latency_ms, eou_latency_ms, conv_latency_ms, plus request counters per provider. You wire those events into your own observability stack to drive dashboards, alerts and Auto QA.
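A consumer for those events can be a small function in your observability stack. This is a sketch, not the official schema: the field names follow the list above, but the alert threshold and the exact payload shape are assumptions you should check against your own webhook traffic:

```python
# Hypothetical consumer for a per-call metrics webhook payload.
# Field names follow the metrics listed above; the 1500ms alert
# threshold is an assumption for illustration.
CONV_LATENCY_ALERT_MS = 1500

def handle_call_event(payload: dict) -> list:
    """Return alert strings for one per-call metrics payload."""
    alerts = []
    if payload.get("conv_latency_ms", 0) > CONV_LATENCY_ALERT_MS:
        # Attribute slowness to the worst stage so the fix is targeted.
        stages = {k: payload.get(k, 0)
                  for k in ("llm_ttft_ms", "tts_latency_ms", "eou_latency_ms")}
        worst = max(stages, key=stages.get)
        alerts.append(f"slow call: conv_latency_ms={payload['conv_latency_ms']}, "
                      f"worst stage {worst}={stages[worst]}")
    return alerts

# Hypothetical event: slow overall, clearly LLM-bound.
event = {"conv_latency_ms": 1800, "llm_ttft_ms": 1200,
         "tts_latency_ms": 180, "eou_latency_ms": 220}
print(handle_call_event(event))
```

In practice this logic would sit behind your webhook endpoint with signature verification in front of it; the function only shows the metric-to-alert mapping.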
Our open example viewer shows the same payload in a small public page so you can see the events and the schema. Real demo calls hit around 800ms total round trip, broken out per stage. Talk to the demo at sippulse.ai/demos, then inspect how it actually performed on your call.
Read also
- Voice AI agent architecture: STT, LLM, TTS and the latency budget
- SipPulse AI telemetry: every parameter explained
- Audio intelligence in 2026: transcription, diarization and benchmarks
Conclusion
Voice agent evaluation is what separates a vendor demo from a production-ready product. WER, MOS, latency, TSR and FCR are the headline numbers; real observability tracks 50 more. Set up the evaluation loop on day one and you ship better agents on day 90. Try our demo and inspect the real numbers, or contact our team to wire evaluation into your stack.