Tag: voice ai

  • What Is ASR? The Technology Behind Every Voice AI Product

    TL;DR / Key Takeaways:

    • ASR stands for Automatic Speech Recognition. It is the technology that converts spoken audio into text. Every voice AI product, from phone bots to meeting transcription tools, depends on it.
    • Modern ASR or STT (Speech to Text) uses deep learning, specifically Conformer and Transformer architectures, to turn audio waveforms into accurate text in milliseconds. The statistical HMM systems of the 1990s are gone.
    • Accuracy varies enormously by language, audio quality, and what the model was trained on. A model scoring 5% WER on US English can exceed 25% WER on Indian regional languages over phone audio.
    • For India, the speech AI market is growing at 23.7% CAGR. But most global ASR platforms were not built for Indian languages, dialects, or the audio conditions of Indian deployments.
    • Shunya Labs covers 200 languages including 55 Indic languages, trained on real audio.

    When you speak to a bank’s customer care bot in Hindi and it understands you, something specific is happening before any AI logic kicks in. Your voice is being converted to text. That conversion, fast and accurate enough to feel seamless, is ASR.

    ASR stands for Automatic Speech Recognition. It is also called speech to text, or STT. It is the foundational layer inside every voice AI product: voice agents, meeting transcription tools, call analytics platforms, speech-enabled mobile apps, and IVR systems. Without it, voice AI does not exist.

    Despite being everywhere, ASR is poorly understood outside the people who build voice systems. This post explains what it is, how it works, and what determines whether it is good or bad. It also covers what the speech AI landscape in India looks like in 2026.

    • Global speech AI market (2025): projected to reach $23.11B by 2030 at 19.1% CAGR
    • India ASR market (2024): projected to reach $8.19B by 2033 at 23.7% CAGR
    • India internet users (2025): 98% access content in Indic languages

    What ASR Actually Does

    At its core, ASR takes audio as input and produces text as output. That sentence sounds simple. The engineering behind it is not.

    When you speak, you produce sound waves. Those waves travel through air and hit a microphone, which converts them into a digital signal. The digital signal is a sequence of numbers representing sound pressure over time. ASR takes that sequence of numbers and figures out which words you said.

    The reason this is hard: spoken language is continuous. There are no clean gaps between words, the way spaces appear between words in text. Speakers vary in accent, speed, and pronunciation. Background noise blends with the speech signal. Two people saying the same word in different accents produce very different waveforms. And the same waveform can map to different words depending on context. The word ‘bat’ and the word ‘bad’ sound nearly identical in certain accents.

    ASR solves all of these problems simultaneously, in real time, on audio that nobody cleaned up for it. That is the engineering challenge that took decades to make usable.

    A Brief History: From Rules to Neural Networks

    The first ASR systems appeared in the 1950s. Bell Labs built a system called Audrey in 1952 that could recognise spoken digits from a single speaker. It worked by matching incoming audio against pre-recorded templates. Slow, rigid, and useless for anything except that one speaker’s digits.

    From the 1970s through the 2000s, ASR ran on a framework called Hidden Markov Models, or HMMs. These were statistical models that learned which sequences of acoustic units, called phonemes, corresponded to which words. HMMs got good enough to power phone-based IVR systems in the 1990s and early 2000s. Press 1 for billing. Press 2 for support. Say your account number now. You know the experience. It worked, barely, for constrained vocabularies in quiet conditions.

    The shift happened between 2012 and 2016. Deep learning arrived in ASR. Researchers showed that neural networks could learn directly from audio-text pairs without needing hand-crafted phoneme definitions. In 2015, Baidu’s Deep Speech achieved error rates that rivalled humans on clean audio benchmarks. The old architecture was replaced almost overnight.

    Today’s ASR systems use architectures called Conformers and Transformers. Conformers combine convolutional neural networks for local acoustic pattern detection with Transformer attention for long-range context. They power the most accurate production ASR systems available.

    Mobile typing speed in Indian languages is 18 to 23 words per minute. Natural speech is 130 to 150 words per minute. Typing is a trained skill; what people can say clearly is often hard for them to type. Voice removes this friction. (CXO Today, December 2025)

    How Modern ASR Works: The Three Stages

    Every modern ASR system processes audio in three conceptual stages, even if the boundaries between them are blurry in end-to-end neural systems.

    Stage 1: Acoustic processing

    Raw audio is converted into a compact representation that captures the information relevant to speech. The most common representation is a log-Mel spectrogram. It is a matrix showing how much energy exists at each frequency band over short time windows. A 1-second clip of audio becomes a 2D matrix of roughly 100 time frames by 80 frequency bins.

    This representation strips out information irrelevant to speech, like absolute recording volume. It preserves the patterns that distinguish phonemes from each other. It is the input to the neural network.
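
    To make this concrete, here is a minimal NumPy-only sketch of the log-Mel computation: frame the waveform, take the power spectrum, apply a triangular Mel filterbank, take the log. Production systems use optimised library implementations; the parameters below (400-sample windows, 160-sample hop, 80 Mel bins) are common defaults for 16kHz speech, not any particular vendor's settings.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595 * np.log10(1 + hz / 700)

def mel_to_hz(mel):
    return 700 * (10 ** (mel / 2595) - 1)

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Waveform -> log-Mel matrix of shape (time_frames, n_mels)."""
    # Frame the signal into overlapping Hann-windowed chunks.
    n_frames = 1 + (len(audio) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular Mel filterbank: Mel-spaced bands over linear FFT bins.
    hz_points = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Apply the filterbank, floor to avoid log(0), and take the log.
    return np.log(power @ fbank.T + 1e-10)

# One second of 16kHz audio -> roughly 100 time frames x 80 Mel bins.
S = log_mel_spectrogram(np.random.randn(16000))
print(S.shape)  # (98, 80)
```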

    Stage 2: The neural model

    The acoustic representation passes through a neural network that produces a probability distribution over possible text outputs. In Conformer-CTC models, the network outputs a probability for each character or subword unit at each time step. The CTC (Connectionist Temporal Classification) algorithm then finds the most probable sequence of text across all time steps.

    This stage is where most of the intelligence lives. The network learns, from millions of audio-text pairs, which acoustic patterns correspond to which linguistic units. It learns this separately for each language. That is why the training data language and the deployment language need to match for the system to work well.

    Stage 3: Language model rescoring

    The raw output of the acoustic model is often imperfect. It might confuse acoustically similar words. A language model trained on text in the target language rescores candidate transcriptions. It boosts sequences of words that are plausible given the context. In a banking context, the phrase about an EMI becomes the right transcription. A phrase about an Emmy does not.

    Modern end-to-end systems sometimes skip this step by baking contextual knowledge directly into a larger model. But for domain-specific deployments like BFSI or healthcare, a domain-tuned language model still adds measurable accuracy improvements.
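
    A minimal sketch of how rescoring tips the balance between acoustically similar candidates. The candidate strings, scores, and bigram table below are invented for illustration; a real deployment would use an n-gram or neural LM trained on domain text.

```python
# Hypothetical candidate transcriptions with acoustic log-probabilities
# from the ASR model. All scores here are illustrative.
candidates = [
    ("your emi is due on friday", -4.2),
    ("your emmy is due on friday", -4.1),  # acoustically slightly better
]

# Toy domain language model: log-probabilities for in-domain bigrams,
# with a flat penalty for unseen bigrams.
bigram_logprob = {("your", "emi"): -1.0, ("your", "emmy"): -8.0}

def rescore(text, acoustic_logprob, lm_weight=0.5):
    words = text.split()
    lm = sum(bigram_logprob.get((a, b), -4.0)
             for a, b in zip(words, words[1:]))
    return acoustic_logprob + lm_weight * lm

# The banking LM overrules the small acoustic edge of "emmy".
best = max(candidates, key=lambda c: rescore(*c))
print(best[0])  # your emi is due on friday
```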

    What Makes One ASR System Better Than Another

    Two ASR systems can claim to support the same language and produce completely different results on the same audio. The differences come down to four variables.

    Word Error Rate on your audio, not a benchmark

    WER (Word Error Rate) is the standard accuracy metric. It counts the substitutions, deletions, and insertions needed to turn the system’s transcript into a reference transcript, divided by the number of reference words. A WER of 5% means roughly 5 words out of 100 were wrong. A WER of 25% means roughly one word in four was wrong.

    The critical word in that definition is ‘reference transcript.’ Published WER numbers are measured on specific test sets, usually clean studio audio in standard language varieties. A model achieving 5% WER on a US English benchmark can easily produce 20 to 25% WER on Indian regional language audio over a phone. The benchmark number tells you how good the model is on the benchmark. It does not tell you how good it will be on your data.

    The only WER that matters for your deployment is the one you measure on your own audio. Any ASR vendor worth considering will give you a trial on your own recordings before you commit.
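
    WER is simple to compute yourself, which is what a trial on your own recordings amounts to. A minimal implementation via word-level edit distance:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions)
    divided by the number of reference words, via edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[-1][-1] / len(ref)

ref = "please confirm the emi due on friday"
hyp = "please confirm the emmy due friday"
print(f"{wer(ref, hyp):.2%}")  # one substitution + one deletion over 7 words
```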

    Streaming vs batch architecture

    Batch ASR waits for a complete audio clip before processing it. Streaming ASR processes audio as it arrives and returns text in real time, often within 100 milliseconds of a word being spoken.

    For analytics and transcription of recorded calls, batch works fine. For any live interaction, a voice bot, a real-time captioning system, a voice-enabled mobile app, streaming is not optional. The architecture choice determines the minimum latency your product can achieve. Shunya Labs Zero STT supports streaming from the first audio chunk, returning a final transcript quickly for most utterances.
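
    The architectural difference can be sketched in a few lines. `fake_stt` below is a hypothetical stand-in for a real recogniser, not any vendor's API; the point is that streaming yields partial transcripts per chunk, while batch returns nothing until the clip ends.

```python
def fake_stt(chunk):
    """Stand-in for a real speech recogniser (hypothetical):
    here it just decodes bytes so the flow is runnable."""
    return chunk.decode()

def batch_transcribe(audio_chunks):
    """Batch: nothing is returned until the full clip has arrived."""
    return fake_stt(b"".join(audio_chunks))

def streaming_transcribe(audio_chunks):
    """Streaming: a growing partial transcript is available per chunk."""
    partial = ""
    for chunk in audio_chunks:
        partial += fake_stt(chunk)
        yield partial

chunks = [b"hello ", b"how can ", b"I help"]
for partial in streaming_transcribe(chunks):
    print(partial)               # partials arrive as the caller speaks
print(batch_transcribe(chunks))  # same text, but only after the clip ends
```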

    Language depth, not language count

    A platform claiming to support 100 languages does not necessarily support all 100 at the same accuracy level. Many platforms support a small number of languages well and extend nominal support to others with limited training data and no real accuracy testing.

    For India, the distinction matters enormously. Standard Hindi over clean audio is supported reasonably well by most global platforms. Bhojpuri, Maithili, Chhattisgarhi, and Odia over 8kHz telephony audio can be poorly supported by any platform that did not train on those languages in those conditions. The Shunya Labs language list shows 55 Indic languages with production-grade accuracy data, not just nominal support.

    On-premise vs cloud only

    Most global ASR APIs are cloud-only. Audio is sent to a remote server, processed, and a transcript is returned. For consumer applications, this is usually fine. For regulated deployments in India, particularly BFSI and healthcare, sending customer audio to servers outside India may conflict with DPDPA requirements and RBI guidelines.

    On-premise ASR, where the model runs on infrastructure the enterprise controls, addresses this directly. Shunya Labs’ on-device model runs fully on-premise on CPU hardware, no GPU required, with the same model as the cloud version. Deployment details are at shunyalabs.ai/deployment.

    Where Speech AI Is Being Used in India Right Now

    The India Voice AI market was valued at USD 153 million in 2024. It is projected to reach USD 957 million by 2030, a CAGR of 35.7%. That growth is spread across several sectors where voice is already being used at scale.

    CONTACT CENTRES AND CUSTOMER SERVICE

    For example, Airtel runs automated speech recognition on 84% of inbound calls. Meesho’s voice bot handles around 60,000 calls daily, transcribing queries in multiple Indian languages. These are not experimental deployments. They are production infrastructure running at scale. The ASR layer is what makes them work.

    BFSI

    Banks and NBFCs can use ASR for outbound EMI collections, inbound balance queries, fraud detection through voice biometrics, and call quality monitoring. The Indian banking system received over 10 million formal complaints in FY23-24. Voice AI with accurate ASR can be one of the primary tools for managing this volume efficiently.

    HEALTHCARE

    Doctors dictate clinical notes. Hospitals run multilingual patient intake over the phone. Lab results and prescription reminders go out as voice calls. Each of these can use an ASR layer to convert spoken input or to process spoken responses from patients. The growth rate for healthcare voice AI is 37.79% CAGR globally, the fastest of any sector.

    FIELD OPERATIONS

    Insurance agents, FMCG reps, and microfinance field workers update CRMs, log activities, and record collections by speaking rather than typing. In Indic languages, typing speed is 18 to 23 words per minute. Speech is 130 to 150 words per minute. The productivity difference is substantial. It only works if the ASR handles the regional language the field worker actually speaks.

    ASR, Speech AI, and Voice AI: What the Terms Actually Mean

    These three terms appear constantly in vendor materials and often get used interchangeably. They are not the same thing.

    ASR is the specific technology: the model that converts audio to text. It is a component.

    Speech AI is a broader category. It includes ASR, but also TTS (text to speech), speaker diarization (who said what), speech analytics, emotion detection from audio, and other audio intelligence capabilities. When someone says they are building on a speech AI platform, they usually mean access to several of these capabilities through a single API.

    Voice AI describes complete voice-enabled products or agents: voice bots, voice assistants, voice-first applications. These are built on top of speech AI. A voice AI agent uses ASR to hear the user, an LLM to reason and respond, and TTS to speak the answer. The voice AI platform is the infrastructure layer underneath all of this.

    Shunya Labs is a speech AI and voice AI platform. Zero STT is the ASR product. Zero TTS is the text-to-speech product. Together they form the input and output layers for any voice AI application. The full platform overview is at shunyalabs.ai/overview.

    What to Look for in a Speech AI Platform for India

    If you are building something with voice, here is what to check before picking an ASR or speech AI platform.

    • Test on your audio. Not the demo. Your language, your recording conditions, your callers. Ask for a free trial on real data before committing.
    • Check streaming support. If you are building anything interactive, batch ASR adds 400 to 800ms of latency you cannot recover from.
    • Ask for WER on the specific languages you need. Hindi is not the same as Marathi. Indian English is not the same as US English. Get benchmark data for your actual use case.
    • Verify deployment options. If you are in BFSI or healthcare, understand where audio is processed and whether it meets your compliance requirements.
    • Check whether TTS is available from the same platform. Mixing an accurate ASR from one provider with a generic TTS from another produces voice agents that understand well but sound foreign. Native Indic TTS matters for user trust.

    Shunya Labs is built for India-first deployments. 

    References:

    • Fortune Business Insights (2022). With 23.7% CAGR, Speech and Voice Recognition Market Size to Reach USD 49.79 Billion [2022-2029]. Yahoo Finance. Available at: https://finance.yahoo.com/news/23-7-cagr-speech-voice-080500463.html [Accessed 24 Mar. 2026].
    • IBEF (2025). India’s internet users to exceed 900 million in 2025, driven by Indic languages. India Brand Equity Foundation. Available at: https://www.ibef.org/news/india-s-internet-users-to-exceed-900-million-in-2025-driven-by-indic-languages.
    • Reverie (2026). Speech Recognition System: A Complete 2026 Guide. Reverie. Available at: https://reverieinc.com/blog/speech-recognition-system/ [Accessed 25 Mar. 2026].
    • Tsymbal, T. (2024). State of Conversational AI: Trends and Future [2024]. Master of Code Global. Available at: https://masterofcode.com/blog/conversational-ai-trends.
    • MarketsandMarkets (n.d.). Speech and Voice Recognition Market Size, Share and Trends Forecast to 2026. Available at: https://www.marketsandmarkets.com/Market-Reports/speech-voice-recognition-market-202401714.html.

  • Voice AI for BFSI: How Indian Banks Can Automate Millions of Calls

    TL;DR / Key Takeaways:

    • In FY23-24, 95 Indian banks received over 10 million complaints. RBI is actively pushing for AI-led resolution.
    • Six call types dominate BFSI volumes and are all automatable today: EMI reminders, balance queries, loan status, KYC follow-ups, policy renewals, and collections.
    • DPDP Rules 2025 and RBI data localization mean borrower audio cannot leave India. On-premise or India-hosted voice AI is the only compliant architecture.
    • Indic language voice AI accuracy is the deciding variable. A model producing 25% WER on your callers creates more problems than it solves.
    • Shunya Labs Zero STT and Zero TTS cover 55 Indic languages, trained on real audio, on-premise CPU deployment.

    A mid-sized private sector bank in India receives between 50,000 and 200,000 customer calls every month. Most of those calls ask about the same things. EMI due dates. Account balances. Loan application status. Policy renewal windows. KYC document submissions.

    They follow predictable patterns. The answer is almost always in a database the bank already has.

    And yet, thousands of agents spend their shifts answering them. The same questions, hundreds of times a day. In languages that shift depending on which state the caller is in.

    Voice AI can change that equation. Not by eliminating human agents, but by handling the calls that genuinely do not need one.

    This post covers what those calls are and what it takes to automate them well in Indian BFSI. It also explains why the language and compliance requirements make this harder than most global solutions account for.

    • Complaints to Indian banks: over 10 million in FY23-24, across 95 banks (RBI)
    • India call centre AI market: projected growth through 2030, up from $103.8M in 2024
    • Agent hours saved globally: projected by 2026 from voice AI adoption

    Why BFSI Has the Highest Call Volume of Any Sector

    Banking and insurance generate more customer calls than almost any other industry. The reasons are structural.

    Financial products are complex by nature. A home loan, a health insurance policy, a fixed deposit: each carries terms, due dates, and status updates that customers track over months and years.

    Unlike a one-time purchase, the relationship is ongoing. Every EMI cycle and every renewal period generates a new wave of inbound calls.

    Regulatory obligations make this worse. RBI guidelines require specific disclosures. IRDAI mandates communication touchpoints in insurance workflows. These compliance requirements generate outbound call obligations that banks cannot reduce without regulatory risk. The call volume is, in part, built into the rules.

    The staffing situation compounds the pressure. Indian contact centres in BFSI report 30 to 45% annual agent turnover. Every departing agent takes product knowledge and language capability with them. The cost of replacing and retraining that capacity, multiplied across thousands of agents, is significant and recurring.

    In FY23-24, 95 Indian banks together received more than 10 million customer complaints. The RBI is now encouraging banks to use AI to sort, tag, and resolve them faster. That is not a suggestion. It is a regulatory signal.

    The Six Call Types That Voice AI Can Handle Today

    Not all BFSI calls are equal. The ones that work best for automation share two properties. They follow a consistent conversation structure. The correct response already exists in a system the bank already runs.

    EMI reminders and payment follow-ups

    Outbound reminder calls before an EMI due date can reduce defaults and free the collections team from managing problems that a timely reminder would have prevented. These calls are short and predictable. The agent confirms the date, the amount, and the payment method. A voice agent handles this at scale in any Indian language without adding headcount. 

    Balance and transaction queries

    A caller asking for their current balance or last five transactions needs authentication, a database lookup, and a clear spoken response. This is one of the highest-volume query types in Indian retail banking and one of the cleanest automation candidates. The conversation rarely deviates from a predictable path and the data is always available instantly.
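
    The authenticate-lookup-respond shape of that call can be sketched as follows. Every name here (`authenticate`, `fetch_balance`, `ACCOUNTS`) is a hypothetical stand-in for illustration, not a real core-banking integration:

```python
# Toy account store standing in for the core banking system.
ACCOUNTS = {"caller-42": {"pin": "1234", "balance": 15200}}

def authenticate(caller_id, pin):
    return ACCOUNTS.get(caller_id, {}).get("pin") == pin

def fetch_balance(caller_id):
    return ACCOUNTS[caller_id]["balance"]

def handle_balance_query(caller_id, spoken_pin):
    # 1. Authenticate before revealing any account data.
    if not authenticate(caller_id, spoken_pin):
        return "Sorry, I could not verify your identity."
    # 2. Look up the answer in a system the bank already runs.
    balance = fetch_balance(caller_id)
    # 3. Compose a clear spoken response for the TTS layer.
    return f"Your current balance is {balance} rupees."

print(handle_balance_query("caller-42", "1234"))
```

    The conversation rarely leaves this path, which is why it automates so cleanly; the hard part in production is the STT layer transcribing the spoken PIN and account details accurately in the caller's language.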

    Loan application status

    Borrowers call to check where their application stands. Approved or pending? More documents needed? When does disbursement happen? These calls are high in volume and low in complexity. The answer sits in the loan origination system. A voice agent retrieves it and can deliver it in the caller’s language.

    KYC follow-ups and document collection

    Incomplete KYC is one of the most common reasons customer onboarding stalls in Indian banking. Following up on missing documents, confirming what was received, and guiding resubmission all follow a defined process. Teams at several Indian private banks and NBFCs have deployed voice agents for exactly this workflow.

    Policy renewals and insurance servicing

    Insurance customers need reminders before their policy lapses and answers about coverage during the renewal window. This is high-value outbound communication that insurers currently run through agent-heavy call centres. Voice AI handles it at a fraction of the cost per contact with consistent accuracy on the disclosure language that compliance teams require.

    Collections and soft recovery

    Early-stage collections is one of the most widely deployed voice AI use cases in Indian BFSI. The goal here is a payment reminder and a commitment. The call structure is defined, the outcome is measurable, and the economics are clear.

    Lead qualification costs can drop from Rs 800 to Rs 120 per lead after voice AI deployment. Overall operational cost per account can fall by 20 to 30%.

    The Compliance Layer That Most Solutions Miss

    Indian BFSI has strict regulatory requirements. They change the architecture of any voice AI deployment.

    Teams that treat compliance as an afterthought end up rebuilding their infrastructure. The time to address it is before deployment, not after.

    DPDP Act and data localization

    India’s Digital Personal Data Protection Rules were notified in November 2025. Under these rules and RBI data localization guidelines, audio containing personal financial data from Indian customers generally cannot be routed to servers outside India. Full substantive compliance is required by May 2027, with the Data Protection Board now operational.

    For most global cloud STT providers, this creates a fundamental problem. Their inference infrastructure sits in the US or EU. The audio round-trip adds both latency and compliance exposure. Banks likely classified as Significant Data Fiduciaries face added obligations: Data Protection Impact Assessments, algorithmic transparency, and audit trails. Penalties run up to Rs 250 crore per violation.

    TRAI 1600 series directive

    From January 2026, TRAI made the 1600 series number mandatory for outbound commercial calls in India. Any voice platform making outbound collections or reminder calls for a bank or NBFC must support DLT-registered 1600 calling. This is a hard requirement. Platforms that do not support it cannot make compliant outbound calls, regardless of everything else they offer.

    RBI fair practices code

    The RBI fair practices code for lenders sets requirements around how borrower communications are conducted. Calling hour restrictions, mandatory disclosures, accessible escalation paths. A voice agent that cannot reliably follow these rules on every call, in every language, creates regulatory risk that outweighs the operational savings.

    BFSI voice AI compliance requirements in India

    Data residency: borrower audio must stay within India. Requires on-premise or India-hosted STT inference.
    DPDP Rules (notified Nov 2025): consent management, 72-hour breach notification, data minimisation. Full enforcement from May 2027.
    TRAI 1600 series (effective Jan-Feb 2026): mandatory for all outbound commercial AI calls. Non-compliance blocks deployment entirely.
    RBI fair practices code: disclosure requirements, calling hour restrictions, grievance access on every call.
    Significant Data Fiduciary obligations: DPIA, algorithmic transparency, regular audits for banks handling large personal data volumes.

    Why the STT Layer Determines Whether the Agent Works

    The biggest reason BFSI voice AI deployments underperform in India is not the LLM. It is not the workflow logic. It is speech recognition. If the agent cannot accurately understand what the caller said, nothing downstream works correctly.

    Indian BFSI callers do not sound like the training data that most global models were built on. They call from mobile phones with variable audio quality. They speak regional languages with real dialectal variation. They switch between Hindi and English in the same sentence. They use financial vocabulary that differs across states and communities.

    A global ASR model scoring 5% WER on US English can exceed 25% WER on Marathi, Bhojpuri, or Gujarati telephone audio. At that error rate, one word in four can be wrong.

    An agent trying to confirm an EMI amount from transcription that unreliable is not automating the call. It is generating a worse outcome than no call at all.

    The only models that work reliably on Indian BFSI audio are built specifically for it. Not adapted from English. Not fine-tuned on a small Indic dataset.

    Built from the ground up on real Indian conditions. Telephony compression, regional accents, code-switched sentences, financial vocabulary, and background noise from where real callers actually are. 

    A voice agent that misunderstands one word in four is not automating your call centre. It is generating more complaints. The STT layer is not a commodity decision in Indian BFSI. It is the most consequential architectural choice you make.

    What Good BFSI Voice AI Infrastructure Looks Like

    Four requirements define a deployment that holds up in production. These are not aspirational benchmarks. They are the baseline.

    Indic language STT trained on real audio

    The model needs to have been trained on real Indian phone call data across your specific languages. Word error rate must be measured on production-representative audio, not a global benchmark.

    Shunya Labs Zero STT covers around 200 languages, each trained on real audio with the dialectal variation, code-switching patterns, and financial domain vocabulary of actual Indian BFSI calls. Independent benchmark data shows 3.1% WER.

    On-device deployment without GPU hardware

    For teams under DPDP and RBI data localization requirements, audio cannot leave your infrastructure. The model needs to run on-premise, on standard CPU servers, without requiring GPU hardware. Shunya Labs’ on-device model meets this baseline: CPU-only, no GPU required. Full deployment guide at shunyalabs.ai/deployment.

    Indic model for the voice response

    The voice your agent speaks matters as much as what it hears. A caller in rural Maharashtra will disengage from a voice that sounds robotic in their language. Generic models adapted from English produce output that native Indic language speakers can immediately register as unnatural.

    Shunya Labs’ TTS model was built natively for Indic languages. Prosody and rhythm are trained on native speakers across all 55 supported languages.

    Real-time latency for live conversations

    An outbound collection call is a live conversation. If there is an 800ms pause before every agent response, callers start talking over it, repeat themselves, and eventually hang up. Shunya Labs’ streaming latency is under 100ms time-to-first-transcript on production audio. Combined with a right-sized LLM and Zero TTS, total turn latency stays below 650ms. That is within the range where calls feel natural rather than mechanical.
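
    As a sanity check, a turn-latency budget can be summed component by component. Apart from the sub-100ms streaming STT figure quoted above, the numbers below are illustrative assumptions, not measured figures:

```python
# Illustrative per-component turn-latency budget for a live voice agent.
budget_ms = {
    "stt_first_transcript": 100,  # streaming STT, time to first transcript
    "llm_first_token": 350,       # right-sized LLM (assumed)
    "tts_first_audio": 150,       # streaming TTS, time to first audio (assumed)
    "network_overhead": 40,       # on-premise round trips (assumed)
}

total = sum(budget_ms.values())
print(f"total turn latency: {total}ms")  # 640ms, under the 650ms target
```

    The budget makes the architecture point concrete: a batch STT layer that adds 400 to 800ms by itself leaves no room for the LLM and TTS stages.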

    A Practical Rollout Sequence

    Most BFSI voice AI deployments can follow a three-phase approach. It helps reduce risk and builds confidence before moving to higher-stakes use cases.

    Phase one is outbound reminder calls for EMIs or policy renewals. Volume is high, the conversation is short, and savings are visible within weeks.

    The cost difference is stark. Human agent calls in India can run Rs 25 to 40 per call. Automated voice agent calls can run Rs 2 to 3. A bank sending 50,000 reminder calls a month feels that gap within the first week.

    Phase two adds inbound balance and status queries. This requires connecting the STT layer to the core banking system through an API. Response accuracy depends on the STT model handling banking terminology correctly in the caller’s language. Amounts, dates, account numbers, all must transcribe accurately for the downstream logic to work.

    Phase three, for teams that have validated the first two, is collections automation. This is the highest-value use case and the most scrutinised. Every call must follow the RBI fair practices code. Escalation paths must work. Grievance access must be real and functional. The compliance architecture needs to be in place before collections goes live.

    Contact Shunya Labs now to learn more.
