    What Is WER and Why It’s Not the Best Way to Measure Speech Recognition Accuracy

    You are trying to choose a speech recognition system for a product that will handle calls in Hindi, Telugu, or Marathi. You look at the benchmarks. One provider reports 8% WER. Another reports 14%. You pick the first one.

    Three weeks into production, users are complaining. Transcripts are wrong in ways that matter. The agent cannot understand customer intent. You go back to the benchmarks and they still say 8%. The number has not lied to you, exactly. But it has not told you the truth either.

    Word Error Rate was designed for a world that Indian languages do not live in. Understanding why, and what to measure alongside it, is one of the more practical things a team building voice products for India can do before committing to an ASR provider.

    What WER Actually Measures

    Word Error Rate counts how many words in a transcript differ from a reference transcript, then divides that count by the total number of words in the reference. The formula is simple: substitutions plus deletions plus insertions, divided by total reference words.

    A WER of 8% means that roughly 8 words in every hundred were wrong in some way. That sounds useful. And on clean, formal, single-language audio recorded in a quiet room, it is reasonably useful.
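
The formula above can be sketched as a standard word-level edit distance. A minimal implementation (the function name and example strings are illustrative, and it assumes a non-empty reference):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(round(wer("the cat sat", "the cat sat down"), 2))  # one insertion over three words: 0.33
```

Note that every mismatch counts the same, one point, whether it is a harmless spelling variant or a negation that reverses the meaning.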

    The problem is that Indian language speech is almost never clean, formal, single-language, or recorded in a quiet room.

    The Six Ways WER Breaks Down on Indian Languages

    1. Colloquial speech gets penalised as error

    Every Indian language has a formal written register and a spoken everyday register. A person speaking Tamil in a natural conversation will use forms like “avunga” instead of the formal “avargal” for “they.” Both are perfectly correct Tamil. A native speaker hearing either would understand immediately.

    WER treats this as an error. The model produced a word that does not match the reference, so it counts against the score. The transcript is right. The score says it is wrong.

    This is not a Tamil-specific issue. Hindi has the same gap between formal and colloquial forms. So do Marathi, Bengali, Kannada, and Malayalam. If your evaluation dataset uses formal reference transcripts and your model transcribes natural speech, you are measuring the wrong thing.
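
One mitigation teams sometimes apply is folding known formal/colloquial pairs to a single form before scoring. A minimal sketch (the mapping table is illustrative; a real one would be per-language and much larger):

```python
# Formal → colloquial Tamil "they"; illustrative single-entry table.
COLLOQUIAL = {"avargal": "avunga"}

def normalize(text: str) -> str:
    """Map each word to its colloquial form before computing WER."""
    return " ".join(COLLOQUIAL.get(word, word) for word in text.split())

# Both registers now score identically against a colloquial reference.
print(normalize("avargal") == normalize("avunga"))  # True
```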

    2. Code-switching creates false failures

    Hindi-English mixing is not a mistake speakers make. It is a natural and fluent register that hundreds of millions of people use every day. The word “doctor” appears in Hindi conversation in two equally valid forms: doctor (Roman script, as borrowed from English) and डॉक्टर (the same word transliterated into Devanagari).

    If a reference transcript uses one form and the model produces the other, WER calls it a substitution error. No meaning has been lost. No pronunciation has changed. The transcript is functionally correct, and the benchmark is recording a failure.

    In a product that handles customer service calls, every common loanword (“account,” “balance,” “transfer,” “nominee,” “mobile,” “policy”) is a potential source of these false errors. Your actual model may perform 5 to 15 percentage points better on real call audio than its WER suggests.

    Shunya Labs’ Zero STT Codeswitch model was built specifically for this kind of mixed-language audio, generating native mixed-script output rather than forcing a choice between Devanagari and Roman transliterations.
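
Before scoring, the two scripts can be folded to one form so that a valid transliteration stops counting as a substitution. A hedged sketch (the table is illustrative; real systems use a transliteration library rather than a hand-written map):

```python
# Fold common Devanagari transliterations of English loanwords to Roman form.
LOANWORDS = {"डॉक्टर": "doctor", "अकाउंट": "account", "बैलेंस": "balance"}

def fold_loanwords(text: str) -> str:
    """Rewrite each known loanword to its Roman form before computing WER."""
    return " ".join(LOANWORDS.get(word, word) for word in text.split())

print(fold_loanwords("मेरा अकाउंट बैलेंस"))  # मेरा account balance
```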

    3. Short words produce catastrophic-looking numbers

    Hindi and other North Indian languages rely heavily on short particles and helper words: “है” (is), “नहीं” (no), “को” (to), “का” (of). These words are often two or three characters long.

    When a model doubles a word, mishears a diacritic, or inserts a particle that should not be there, WER applies its formula to a very small denominator. A single extra “नहीं” in a two-word utterance is already a 50% WER, and because insertions count against the reference length, a one-word utterance with one inserted word scores 100%, or more if there are two. The metric makes it look like the model badly failed on a sentence where it got the meaning right.
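
Counting it out shows how a tiny denominator dominates the score, and how WER can even exceed 100% (the utterances are illustrative):

```python
# WER = (S + D + I) / N, scored against the reference length N.
print((0 + 0 + 1) / 2)  # one extra particle, two-word reference: 0.5
print((0 + 0 + 1) / 1)  # the same single insertion, one-word reference: 1.0
print((0 + 0 + 2) / 1)  # two insertions, one-word reference: 2.0, i.e. WER above 100%
```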

    Agglutinative languages like Malayalam, Telugu, Kannada, and Tamil face this in a different form. Single word tokens in these languages can be very long, because suffixes are chained together. A minor suffix variation that a native speaker would not register as wrong turns the entire long token into a full substitution error.

    4. Numbers have too many valid forms

    The number 500 can appear in an Indian language transcript as “पांच सौ” (spoken Hindi), as “500” (Arabic numerals), or as “५००” (Devanagari numerals). All three forms are correct. All three might appear in different annotators’ reference transcripts for the same audio.

    WER treats these three forms as completely unrelated strings. If the reference says “500” and the model outputs “पांच सौ,” WER counts a substitution. The downstream product sees the right number. The benchmark records an error.

    Dates follow the same pattern. “२५ जनवरी” and “25 January” and “25-01” can all represent the same date, spoken the same way, and WER will penalise any mismatch between them.
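
Normalizing numerals to one canonical form before scoring removes this class of false error. A minimal sketch (the spoken-form table is illustrative; a real system would use a number parser):

```python
# Map Devanagari digits to Arabic digits, and known spoken forms to digits.
DEVANAGARI_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")
SPOKEN = {"पांच सौ": "500"}  # illustrative; a real table covers all spoken numbers

def normalize_numbers(text: str) -> str:
    text = text.translate(DEVANAGARI_DIGITS)
    for spoken, digits in SPOKEN.items():
        text = text.replace(spoken, digits)
    return text

# All three surface forms of 500 now compare equal.
print(normalize_numbers("५००") == normalize_numbers("पांच सौ") == "500")  # True
```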

    5. Meaning reversals look like minor errors

    This is the most dangerous failure mode, and it goes in the opposite direction from the ones above.

    If a model transcribes “मैं कल स्कूल जाना चाहता हूं” (I want to go to school tomorrow) as “मैं कल स्कूल नहीं जाना चाहता हूं” (I do not want to go to school tomorrow), WER sees one extra word. That is one insertion against a six-word reference, a WER of roughly 17%. The benchmark looks fine.

    The meaning has been completely reversed. For a voice agent taking action on the user’s request, this is not a 17% error. It is a 100% failure. The agent will do the wrong thing.

    WER measures surface distance between word sequences. It has no idea what the sentence means.
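
Counted out against the six-word reference, the negation example looks like this (a toy illustration, not a real semantic check):

```python
# One inserted word over a six-word reference looks minor to WER,
# yet the single word flips the polarity of the whole request.
ref = "मैं कल स्कूल जाना चाहता हूं".split()
hyp = "मैं कल स्कूल नहीं जाना चाहता हूं".split()

word_level_error = (len(hyp) - len(ref)) / len(ref)  # 1 insertion / 6 words
print(f"{word_level_error:.0%}")                     # 17%, looks minor
print(("नहीं" in hyp) != ("नहीं" in ref))             # True: polarity reversed
```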

    6. The evaluation dataset may not match your users

    Published benchmarks are run on specific datasets. Those datasets were recorded in specific conditions with specific speakers, often in studio settings with clean audio. Your users are calling from moving vehicles, crowded markets, hospital corridors, and rural areas with budget smartphones.

    A model with 8% WER on a studio-quality benchmark dataset can perform far worse on your actual call audio. The benchmark number is not wrong. It just does not apply to your use case.

    What to Measure Instead, or Alongside

    This does not mean abandoning WER. It is still a useful baseline, and for verbatim transcription tasks where you need the exact words in the exact form the speaker used, it is the right primary metric. The issue is treating it as the only metric when the product is doing something more complex.

    Here are the additional signals worth looking at.

    Test on your own audio. Before committing to a provider, record a sample of real calls or voice inputs from your actual users in your actual environments. Run that sample through the models you are evaluating. The performance gap between benchmark audio and production audio is often larger than teams expect. Shunya Labs offers a playground where you can test with your own files before integrating.

    Check intent preservation, not just word accuracy. For conversational products, the question that matters is whether the model captured what the user was trying to communicate, not whether every word matched a reference exactly. A call center bot that misunderstands customer intent on 20% of calls has a serious product problem, even if its WER looks reasonable.

    Check entity accuracy separately. Names, account numbers, amounts, dates, and place names are the pieces of information that downstream systems act on. A transcript that gets every content word right but mishears an account number has failed in the way that matters most. Test entity accuracy on your domain specifically: medical terms if you are building for healthcare, financial terminology if you are building for banking.
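
In practice this means extracting the entities the downstream system acts on and comparing those directly, separately from the word-level score. A minimal sketch (the regex and transcripts are illustrative):

```python
import re

def account_numbers(text: str) -> set:
    """Extract candidate account numbers (10-16 digit runs) from a transcript."""
    return set(re.findall(r"\b\d{10,16}\b", text))

ref = "transfer five thousand to account 123456789012"
hyp = "transfer five thousand to account 123456789013"  # one digit misheard

# Word-level WER here is 1/6, about 17%, but the one entity the product
# depends on is 100% wrong.
print(account_numbers(ref) == account_numbers(hyp))  # False
```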

    Look at performance by language, not just across languages. An aggregate multilingual WER of 10% can hide a model that performs at 5% on Hindi (a high-resource language with lots of training data) and 30% on Bhojpuri or Maithili. If your users speak the latter, the aggregate number is misleading.
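
The arithmetic behind this is simple volume weighting; all figures below are illustrative:

```python
# Volume-weighted aggregation can hide a weak tail language.
per_language = {"hindi": (0.05, 9000), "bhojpuri": (0.30, 1000)}  # (WER, word count)

errors = sum(w * n for w, n in per_language.values())
words = sum(n for _, n in per_language.values())

print(f"aggregate WER: {errors / words:.1%}")              # 7.5%, looks fine
print(f"bhojpuri WER: {per_language['bhojpuri'][0]:.0%}")  # 30%, not fine
```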

    Shunya Labs supports over 200 languages, including a large range of Indic languages, and publishes per-language accuracy numbers on its benchmarks page.

    Test on code-switched audio specifically. If your users mix languages, which most urban Indian users do, test with mixed-language audio. Do not assume that a model with strong Hindi performance and strong English performance will handle Hinglish well. Mixed-language models need to be trained on mixed-language data. Performance on each language separately tells you nothing reliable about performance on code-switched speech.

    A Practical Evaluation Checklist

    Before picking an ASR provider for an Indian language product, work through these questions.

    What audio conditions will your actual users produce? Test in those conditions, not in a studio.

    Do your reference transcripts use formal or colloquial forms? If formal, expect WER to understate model quality on real conversational data.

    Does your product handle code-switched speech? If yes, test explicitly on code-switched samples and check whether the provider has a model designed for it.

    Are there domain-specific terms (drug names, financial products, place names, brand names) that your downstream system depends on getting right? Test those specifically.

    Do you need verbatim accuracy (every word exactly as spoken) or semantic accuracy (the meaning correctly captured)? The answer changes which metrics you should weight.

    What languages specifically will your users speak? Check whether the provider has per-language accuracy data for those languages, not just for Hindi or English as a proxy.

    The Benchmark Number Is a Starting Point

    WER has not misled you when you read 8% on a Hindi benchmark. It has accurately described model performance under the conditions the benchmark used. The question is whether those conditions match yours.

    For most Indian language voice products in production, they do not match perfectly. The benchmark audio is cleaner, more formal, and more monolingual than real user audio. The reference transcripts were written by annotators who may have made different choices than your users’ speech naturally produces.

    The teams that avoid expensive surprises are the ones who treat the benchmark number as a starting point for evaluation, not as a decision. They test on their own audio, in their own domain, with their own users’ speech patterns. They check whether intent is preserved, not just whether word sequences match. They look at entity accuracy for the specific entities their product depends on.

    Shunya Labs’ speech intelligence features, including sentiment analysis, intent detection, and entity-aware transcription, exist partly because accurate word-level output is only part of what a voice product in production actually needs. The transcript has to be right at the word level. And it has to be usable at the meaning level. Those are two different things, and a serious evaluation process tests for both.

    If you want to run a proper evaluation against your own audio before integrating, the documentation has everything you need to get started, and the playground lets you test without writing code first. Contact us to know more.