Code-Switching ASR Explained: Why Hinglish Breaks Every Standard Model

TL;DR: Key Takeaways

Most voice technology was built for clean, single-language speech and struggles the moment someone mixes Hindi and English, or any other pair of languages.
This is not a user error. Code-switching is how hundreds of millions of Indians naturally communicate.
Standard ASR models fail at Hinglish because of gaps in acoustic modeling, vocabulary, language modeling, and training data.
Fixing this requires a ground-up approach, not a patch on an existing English or Hindi model.

Most people have been there. You are talking to a voice assistant, a customer support bot, or a speech-to-text app, and mid-sentence it completely loses you. Not because you mumbled. Not because the connection was bad. Simply because you said something like:

“Yaar, can you just reschedule the meeting to 4 baje?”

The app either returns garbled text, skips the Hindi entirely, or stares back at you with a blinking cursor that quietly implies you said something wrong. You did not. You spoke the way most people in India speak every single day, and the technology just was not built for it.

This is the code-switching problem. It sits at the heart of why so much voice technology feels broken the moment a real Indian user picks it up.

What Code-Switching Actually Is

Linguists have studied code-switching for decades. At its core, it is the practice of moving between two or more languages within a single conversation, sometimes within a single sentence. Bilingual and multilingual speakers do this naturally, fluidly, and often without noticing.

In India, the most prominent example is Hinglish, the blend of Hindi and English that dominates urban conversation. But code-switching in India goes far beyond Hinglish. Tamil speakers in Chennai routinely mix Tamil with English. Bengali professionals in Kolkata do the same. In the South, you get Tanglish, Kanglish, Manglish. In Maharashtra, Marathi and English weave together constantly.

The critical thing to understand is that code-switching is purposeful. Speakers switch to convey nuance, signal social identity, fill lexical gaps, or simply because one language has a better word for the thing they are trying to say. “Jugaad” does not have an English equivalent. “Overwhelming” does not have a Hindi one that carries exactly the same feeling. So speakers use both.

When you build speech technology that cannot handle this, you are not building speech technology for India. You are building something that works for a narrow slice of formal, scripted, monolingual speech that most real users will never produce.

Why Standard ASR Models Break Down

To understand why Hinglish is so difficult for most ASR systems, you need to understand how those systems are built.

A standard automatic speech recognition model is trained on audio data paired with text transcriptions. The model learns to map acoustic patterns to linguistic units, usually phonemes or subword tokens, and then to string those units into words and sentences. The quality of the output depends enormously on how well the training data matches the input it will later see in production.

Most of the large ASR models in circulation today were trained overwhelmingly on English data, with some multilingual variants trained on parallel datasets in many languages, each treated as a separate, clean category. The model learns English. Or it learns Hindi. It does not learn the space between them.

When a code-switched utterance arrives, several things go wrong at once.

The acoustic model is the first point of failure. Hindi phonemes and English phonemes are genuinely different. The retroflex consonants, aspirated stops, and nasal vowels of Hindi do not exist in English in the same form. When a speaker slides from English into Hindi mid-sentence, the acoustic character of the audio shifts in ways a model trained only on one language is not equipped to follow.

The language model compounds the problem. Modern ASR systems use language models to help decide which word sequence is most probable given the acoustic evidence. A language model trained on English assigns near-zero probability to Hindi words appearing in an English sentence.

So even if the acoustic model correctly identifies the sounds, the language model corrects them away, replacing them with the nearest English approximation. The Hindi word “karo” becomes “cargo.” “Bata” becomes “butter.” The output is fluent-sounding nonsense.
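The effect is easy to reproduce with a toy rescoring step. The sketch below is illustrative only: the candidate words, acoustic scores, and unigram probabilities are invented numbers, and `rescore` is a hypothetical helper rather than any real decoder's API. It adds an acoustic log-score to a weighted language-model log-probability, which is the general shape of how many decoders choose between candidates.

```python
import math

# Hypothetical acoustic log-scores for one audio segment: the acoustic
# model slightly prefers the Hindi word "karo" (numbers are invented).
acoustic_scores = {"karo": math.log(0.6), "cargo": math.log(0.4)}

# Toy unigram language models (invented probabilities).
english_only_lm = {"cargo": 1e-4}                 # "karo" is unseen
code_switched_lm = {"cargo": 1e-4, "karo": 1e-3}  # has seen Hinglish

FLOOR = 1e-9  # near-zero probability assigned to unseen words


def rescore(acoustic, lm, lm_weight=1.0):
    """Return the candidate with the best combined log-score."""
    def total(word):
        return acoustic[word] + lm_weight * math.log(lm.get(word, FLOOR))
    return max(acoustic, key=total)


print(rescore(acoustic_scores, english_only_lm))   # cargo
print(rescore(acoustic_scores, code_switched_lm))  # karo
```

Even though the acoustic evidence favors “karo” in both runs, the English-only language model's penalty for an unseen word is large enough to flip the decision.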

Then there is the vocabulary problem. Code-switched speech pulls from two lexicons simultaneously. A model trained on a single language simply does not have the vocabulary to recognize words from the other. This is not a tuning issue. It is a fundamental architectural gap.
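To make the gap concrete, here is a minimal sketch that measures how much of an utterance falls outside a single-language word list. The tiny vocabularies are invented for illustration, and `oov_rate` is a hypothetical helper; a production system would typically work with a subword vocabulary rather than whole words.

```python
def oov_rate(utterance, vocabulary):
    """Fraction of words in the utterance that the vocabulary lacks."""
    words = utterance.lower().replace("?", "").replace(",", "").split()
    missing = [w for w in words if w not in vocabulary]
    return len(missing) / len(words), missing


# Invented toy vocabularies, for illustration only.
english_vocab = {"can", "you", "just", "reschedule", "the", "meeting", "to", "4"}
hinglish_vocab = english_vocab | {"yaar", "baje"}

sentence = "Yaar, can you just reschedule the meeting to 4 baje?"

rate, missing = oov_rate(sentence, english_vocab)
print(rate, missing)  # 0.2 ['yaar', 'baje']
```

With the English-only word list, two of the ten words cannot be produced at all, no matter how clearly they were spoken.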

Finally, there is the prosody and rhythm problem. Hindi and English have different stress patterns, different intonation curves, and different timing structures. When speakers mix languages, the prosodic cues that ASR models use to segment words and detect sentence boundaries become unreliable. The model loses its footing even at the most basic level of figuring out where one word ends and the next begins.

The Data Problem Nobody Wants to Talk About

Building a model that handles code-switching well requires training data that reflects code-switching, and this is where most efforts quietly fail.

Collecting naturalistic code-switched speech is hard. You cannot simply crawl the web for audio in the way you can for text. You need real conversations, real phone calls, real customer interactions where people are speaking the way they actually speak rather than performing a scripted version of their language for a microphone. That data is expensive to collect, ethically sensitive to handle, and time-consuming to transcribe accurately.

Transcribing code-switched speech is its own challenge. A transcriber fluent in Hindi may not accurately capture English portions and vice versa. Annotation guidelines for mixed-language text are not standardized. The same utterance might be written differently by ten different annotators, with inconsistent choices about spelling, script (Devanagari vs. Roman), and word boundaries.
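One way to surface the script inconsistency is to label every token in a transcript by the script it is written in. The sketch below is an assumption about how such a check might look, not a description of any particular annotation pipeline. It relies only on the Devanagari Unicode block (U+0900 to U+097F), and the two annotator transcripts are hypothetical.

```python
def token_script(token):
    """Label a token as 'devanagari', 'latin', 'mixed', or 'other'."""
    has_deva = any("\u0900" <= ch <= "\u097f" for ch in token)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_deva and has_latin:
        return "mixed"
    if has_deva:
        return "devanagari"
    if has_latin:
        return "latin"
    return "other"


# Two hypothetical annotators transcribing the same utterance.
annotator_a = "meeting 4 baje hai"
annotator_b = "meeting 4 बजे है"

for transcript in (annotator_a, annotator_b):
    print([(tok, token_script(tok)) for tok in transcript.split()])
```

Both transcripts are defensible, yet a model trained on a mix of the two sees the same spoken word as two unrelated targets unless the guidelines settle on one convention.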

This is one of the main reasons large general-purpose models perform so poorly on mixed-language speech despite performing reasonably well on each language separately. The training data simply does not contain enough naturalistic code-switched examples to teach the model what to do when languages collide.

What It Actually Takes to Solve This

Solving this takes three things.

The first is building language-agnostic acoustic representations.

Rather than training separate acoustic models for each language and hoping they transfer, you train a single model on multilingual data with enough phonemic overlap to build shared representations. The model learns to represent sounds at a level of abstraction that generalizes across language boundaries.

The second is expanding the vocabulary and tokenization strategy.

Code-switched models need subword vocabularies that include units from both languages, and they need language identification signals that tell the language model which lexical distribution to draw from at any given moment. Some architectures do this with explicit language ID tags; others learn to do it implicitly from patterns in the training data.
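The explicit-tag approach can be sketched in a few lines. Everything here is an assumption made for illustration: the `<hi>` and `<en>` tag names, the toy word list, and the `tag_languages` helper are hypothetical, and a real system would more plausibly predict the language from the audio or from a learned classifier than from a lookup table.

```python
# Toy lexicon of romanized Hindi words (invented, for illustration).
HINDI_WORDS = {"kya", "aap", "mujhe", "ka", "kar", "sakte", "ho"}


def tag_languages(transcript):
    """Insert a language tag each time the language changes."""
    tagged, current = [], None
    for word in transcript.lower().rstrip("?").split():
        lang = "hi" if word in HINDI_WORDS else "en"
        if lang != current:
            tagged.append(f"<{lang}>")
            current = lang
        tagged.append(word)
    return tagged


print(" ".join(tag_languages(
    "Kya aap mujhe tomorrow ka schedule send kar sakte ho?"
)))
# <hi> kya aap mujhe <en> tomorrow <hi> ka <en> schedule send <hi> kar sakte ho
```

Training targets tagged this way give the model an explicit signal at every switch point; the implicit alternative drops the tags and leaves the model to infer the same boundaries from the data.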

The third, and in some ways the most important, is training on real code-switched data at scale.

There is no shortcut here. A model that has never been trained on Hinglish will not suddenly learn to handle Hinglish because it has seen a lot of Hindi and a lot of English. The mixing patterns, the syntactic borrowings, the phonological adaptations that happen when languages blend, these are things the model has to learn from examples.

Where Shunya Labs Fits Into This

At Shunya Labs, this is not a theoretical problem. It is the core of what the team has been building toward.

Shunya Labs was designed from the ground up for the way people actually communicate. That means training on data that includes code-switched speech rather than treating it as noise to be filtered out. It means building a vocabulary and acoustic model that can handle the phonemic landscape of Indian languages without forcing every utterance through an English or formal Hindi lens. And it means testing against real-world speech that reflects the diversity of accents, dialects, and mixing patterns that show up when a product reaches users across the country.

The result is an ASR system that can handle a sentence like “Kya aap mujhe tomorrow ka schedule send kar sakte ho?” without losing the thread, because the model was trained to understand the structure and patterns of code-switched speech rather than each language in isolation.

At Shunya Labs, the goal is speech technology that works for the full range of Indian communication, not a filtered version of it. If you are building a voice product for India and your ASR only works when users speak like they are dictating a formal document, you are building on a foundation that will crack the moment real users show up.

Why This Matters for Products Built on Voice

The business case for getting this right is more straightforward than it might seem.

Voice interfaces in India are not a nice-to-have. For a significant portion of the population, they are the most natural and accessible way to interact with technology. Voice search, voice-driven customer support, voice-based financial services, these are not futuristic applications. They are live, growing markets where the quality of the underlying speech recognition directly determines whether the product works or fails.

Every percentage point of word error rate on code-switched speech is not an abstract benchmark number. It is a user who could not complete their task. It is a customer service interaction that went sideways because the system misheard a key instruction. It is a farmer who could not access agricultural information because the voice interface could not parse the way he naturally speaks.
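For readers who have not computed it before, word error rate is the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The sketch below is a plain implementation of that standard definition; the example sentences are invented, and it assumes a non-empty reference and simple whitespace tokenization.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "payment abhi karo"
hypothesis = "payment abhi cargo"  # invented misrecognition
print(word_error_rate(reference, hypothesis))
```

Here a single substituted word out of three gives a WER of one third, and that one word happens to be the instruction the user was trying to give.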

Building Speech That Reflects Reality

Standard ASR models were built for a world where speakers are monolingual, accents are predictable, and language boundaries are clean. That world never really existed, and it certainly does not describe India.

The path forward is to build models complex enough to meet users where they are.
