
Why Most ASR Benchmarks Miss What Matters
Most automatic speech recognition benchmarks have a problem. They test models on clean, read speech from academic datasets like LibriSpeech, then declare a winner. But production audio is not clean or read. It is noisy, accented, and full of people switching between languages mid-sentence.
The gap between benchmark scores and real-world performance is significant. A model that scores well on TED-LIUM or LibriSpeech may fall apart in a contact center with background chatter, or when transcribing a conversation in Hinglish (mixed Hindi and English). This is why we built our evaluation framework around what actually happens in production environments.
At Shunya Labs, we measure performance across accented speech, code-switching scenarios, background noise, and enterprise security requirements. If you are evaluating speech AI for production use, see our guide on what to look for in an enterprise speech AI platform in 2026.

The Metrics That Actually Matter In Production
Word Error Rate (WER) is the standard metric for ASR accuracy. Lower is better. But WER on clean audiobooks is different from WER on a noisy support call. Here is what production environments actually require:
| Benchmark Focus | Typical Benchmarks | Production Reality |
|---|---|---|
| Clean speech | Most leaderboards | Rare in real deployments |
| Accented speech | Limited coverage | Standard in global applications |
| Background noise | Often ignored | Contact centers, public spaces |
| Code-switching | Usually not tested | Common in multilingual regions |
| Streaming latency | Not always measured | Critical for real-time agents |
| Security certifications | Not included | SOC 2, HIPAA required |
| Deployment options | Cloud-only | Cloud, edge, on-prem needed |
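For readers who want to run their own comparisons, WER is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the reference length. A minimal sketch in Python (the `wer` helper is illustrative, not our internal tooling):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion against a 6-word reference -> WER of 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

The same formula is what leaderboard WER figures report; the differences come entirely from which audio the reference transcripts describe.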
Real-time applications need sub-100ms latency for natural conversation flow. Our Zero STT models deliver sub-100ms streaming latency and roughly 200ms round-trip latency in production, enabling live agent assistance and conversational voice agents.
For guidance on evaluating platforms, read how to choose a speech AI platform.
Zero STT Suite Benchmark Methodology
Our evaluation goes beyond standard datasets. We test on:
- Real audio conditions: Contact center calls with background noise, overlapping speakers, and phone-quality audio
- Multilingual scenarios: 200+ languages including 32+ Indic languages, plus code-switching in Hinglish and other mixed-language speech
- Domain-specific content: Medical terminology, financial jargon, and technical vocabulary
- Streaming performance: Latency measurement under production load, not just theoretical minimums
This approach better reflects production performance because it tests the conditions where ASR models actually fail. Clean speech benchmarks are useful for research comparisons, but they do not predict how a model handles a support call with a crying baby in the background.
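Measuring "latency under production load" means reporting tail percentiles, not just averages: a mean of 90ms can hide the p99 spikes that break a live conversation. A minimal sketch of how such a report can be assembled (the sample values below are illustrative, not our benchmark data):

```python
import statistics

def latency_report(samples_ms):
    """Summarize latency samples with mean and tail percentiles (p50/p95/p99)."""
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": qs[49],   # quantiles() returns the 99 cut points in order
        "p95": qs[94],
        "p99": qs[98],
    }

# Illustrative samples: mostly fast responses, with occasional slow outliers under load.
samples = [80 + (i % 10) for i in range(95)] + [300, 320, 350, 400, 450]
report = latency_report(samples)
print({k: round(v, 1) for k, v in report.items()})
```

Comparing p99 rather than the mean is what separates a model that feels responsive from one that merely benchmarks well.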
You can see our detailed benchmark results on the Shunya Labs benchmarks page.
Performance Results Across Accuracy, Speed, And Languages
Accuracy benchmarks
Here is how our Zero STT models compare to leading alternatives on standard benchmarks:
| Model | WER (lower is better) | TED-LIUM Accuracy | LibriSpeech Clean Accuracy |
|---|---|---|---|
| Zero STT (English) | 3.10% | 98.57% | 99.29% |
| NVIDIA Canary Qwen 2.5B | 5.63% | 97.29% | 98.39% |
| IBM Granite Speech 3.3 8B | 5.74% | 96.60% | 98.57% |
| Microsoft Phi-4 | 6.02% | 97.06% | 98.31% |
Our 3.10% WER represents roughly 45% fewer errors than the next best model (5.63%). This difference matters at scale. For every 100 words transcribed, Zero STT produces about 3.1 errors versus 5.6 or more from competing models.
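The arithmetic behind that comparison, using the WER figures from the table above:

```python
def expected_errors(words: int, wer: float) -> float:
    """Expected transcription errors for a given word count at a given WER."""
    return words * wer

zero_stt, next_best = 0.0310, 0.0563

# Relative error reduction: (5.63 - 3.10) / 5.63 ≈ 45%
reduction = (next_best - zero_stt) / next_best
print(f"relative reduction: {reduction:.1%}")

# At scale: expected errors in one million transcribed words
print(f"Zero STT:  {expected_errors(1_000_000, zero_stt):,.0f} errors")
print(f"Next best: {expected_errors(1_000_000, next_best):,.0f} errors")
```

A 2.5-point WER gap looks small on a leaderboard; across a million words it is tens of thousands of corrections someone no longer has to make.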
For specialized Indic language support, Zero STT Indic delivers native-level accuracy on Hindi, Tamil, Telugu, Bengali, and other Indian languages.
Speed and latency benchmarks
| Metric | Zero STT Performance | Industry Typical |
|---|---|---|
| Round-trip latency | 200ms | 200-500ms |
| Streaming latency | Sub-100ms | 150-300ms |
| Batch processing RTFx | Real-time to 10x | Variable |
Sub-100ms streaming latency is essential for contact center applications where agents need live transcription. Our benchmarks show consistent performance under production load, not just optimal conditions.
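The RTFx row in the table above is the real-time factor: how many seconds of audio are processed per wall-clock second, so 10x means an hour of audio transcribes in six minutes. A small sketch of the conversion (helper names are illustrative):

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor: seconds of audio transcribed per wall-clock second."""
    return audio_seconds / processing_seconds

# One hour of audio processed in six minutes -> 10x real time
print(rtfx(3600, 360))  # 10.0

def batch_eta_minutes(total_audio_hours: float, factor: float) -> float:
    """Wall-clock minutes needed to transcribe a batch at a given RTFx."""
    return total_audio_hours * 60 / factor

# 100 hours of audio at 10x -> 600 minutes of processing
print(batch_eta_minutes(100, 10))  # 600.0
```

This is why batch RTFx matters for media workloads even when streaming latency does not.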
Read more about why latency matters in our article on sub-100ms voice AI latency.
Multilingual and code-switching performance
| Capability | Zero STT | Typical ASR Models |
|---|---|---|
| Total languages | 200+ | 50-100 |
| Indic languages | 32+ | 5-10 |
| Code-switching (Hinglish) | Native support | Often fails |
| Global population coverage | 96.8% | 60-80% |
Standard models trained primarily on English and European languages struggle with code-switching. They either fail to recognize the language change or produce garbled output. Our Zero STT Codeswitch model handles mixed-language conversations natively.
For a deeper technical explanation, see our article on code-switching ASR and why Hinglish breaks standard models.
Enterprise Features Beyond The Benchmark Scores
Benchmark scores are only the starting point. Production deployments require security, flexibility, and additional capabilities:
Security And Compliance
- SOC 2 Type II certified
- ISO/IEC 27001:2022 certified
- HIPAA compliant for healthcare use cases
- TLS 1.3 for data in transit, AES-256 for data at rest
- Audio files encrypted during processing, deleted after transcription
- No audio retention post-transcription
Deployment Flexibility
| Deployment | Capabilities | Best For |
|---|---|---|
| Cloud | Zero infrastructure, instant auto-scaling | Startups, rapid deployment |
| Edge | Regional data residency, offline capability | IoT, telecom, multi-region |
| On-premises | Full data sovereignty, air-gapped option | Highly regulated industries |
Unlike many competitors that offer cloud-only deployment, we provide all three options. This matters for organizations with strict data residency requirements or those operating in air-gapped environments.
Explore our deployment options for detailed configuration guidance.
Speech Intelligence Layer
Beyond transcription, our platform includes:
- Speaker diarization and identification
- Intent detection and entity extraction
- Sentiment analysis and emotion tracking
- Automated summarization
- Keyword normalization
- Medical keyterm correction (for Zero STT Med)
These features transform raw transcription into actionable data. See our Speech Intelligence page for feature details and pricing.
Choosing The Right ASR For Your Use Case
Benchmarks tell part of the story. Here is how to match capabilities to requirements:
Contact centers: Prioritize low latency, code-switching support, and speaker diarization. Real-time agent assistance requires streaming ASR that keeps up with natural conversation flow.
Healthcare: HIPAA compliance and medical terminology accuracy are non-negotiable. Zero STT Med is trained on clinical vocabulary and supports structured EHR integration.
Media and entertainment: Batch processing efficiency and accurate speaker separation matter more than streaming latency. Word-level timestamps enable precise video synchronization.
Edge and mobile: On-device models reduce bandwidth costs and enable offline operation. Our ONNX-compatible models run on standard mobile hardware.
The right choice depends on your specific combination of accuracy requirements, latency constraints, language coverage, and deployment environment. See our use cases for implementation examples across industries.
Start Building With Production-Ready ASR Today
Our benchmark results show what is possible when ASR is built for production conditions: 3.10% WER in English, sub-250ms latency, and native handling of 200+ languages including code-switching scenarios.
But benchmarks are just numbers. The complete Zero STT Suite gives you a foundation for building voice agents, contact center automation, medical documentation workflows, and multilingual applications that actually work in the real world.
We provide the full stack: foundation models, an intelligence layer for intent and sentiment, and an orchestration framework for conversation flows, all with enterprise security and flexible deployment. Ready to test it yourself? Start with our documentation, try the playground, or contact sales for enterprise requirements.
