Category: Product

  • Introducing Zero STT Med: Shunya Labs’ Purpose-Built Medical Speech-to-Text Transcription for Healthcare

    Real hospitals are inundated with alarms, cross-talk, conversations muffled by surgical masks, and contextual shorthand that only highly specialized participants can reliably understand.

    Urgent directions echoing through the floors demand fast execution by specific stakeholders: the intended recipients must recognize that they are being addressed, understand what is being said, and respond accordingly.

    Generic ASR systems are not trained to catch the subtle distinctions between near-homophonous medical conditions or prescription drugs with Latinate names.

    This is why we built Zero STT Med, our domain-specific model for healthcare. It delivers exceptional accuracy and real-time transcription speed across medical environments, along with enterprise-grade privacy and compliance for healthcare settings.

    Why domain specialisation really matters in medical speech transcription

    Generic ASR systems are generally effective at decoding casual speech. But clinical speech is another matter: near-homophones abound, drug names and specialty jargon are plentiful, and abbreviations vary by department.

    Domain-specific medical speech-to-text models are trained on medical data, terminology, and concepts so they can stay reliable inside this reality—not just on clean, conversational demos.

    To make this concrete, here are a few examples where a small transcription error can have a very large impact.

    Near-homophone drug names with very different uses

    Example pair, what each is used for, and why confusion is dangerous:

    • Celebrex (celecoxib) vs Celexa (citalopram): Celebrex is an anti-inflammatory for pain and arthritis; Celexa is an SSRI antidepressant. The wrong drug can mean uncontrolled pain or undertreated depression, plus withdrawal risk if antidepressant doses are missed.
    • Hydralazine vs Hydroxyzine: Hydralazine is a vasodilator for hypertension and heart failure; Hydroxyzine is an antihistamine used for itching, allergy, or anxiety. Mixing these up can leave blood pressure uncontrolled or cause unnecessary sedation instead of cardiovascular treatment.
    • Zantac (ranitidine) vs Xanax (alprazolam): Zantac is an acid-suppressing H₂ blocker (no longer widely marketed in many regions); Xanax is a benzodiazepine for anxiety. Confusion can lead to missed anxiety management, unexpected sedation, or inappropriate long-term benzodiazepine exposure.

    These are exactly the kinds of look-alike / sound-alike (“LASA”) pairs flagged in medication safety literature and ISMP/FDA tall-man lettering lists.
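    To see how close these names sit even on paper, a quick string-similarity check makes the point. This is an illustrative sketch only (surface similarity as a rough stand-in for acoustic similarity), not part of Zero STT Med:

```python
from difflib import SequenceMatcher

# Look-alike/sound-alike (LASA) drug-name pairs from the table above.
lasa_pairs = [
    ("celebrex", "celexa"),
    ("hydralazine", "hydroxyzine"),
    ("zantac", "xanax"),
]

for a, b in lasa_pairs:
    # ratio() returns a 0..1 similarity score; these pairs score high,
    # which is why generic recognisers so easily swap them.
    ratio = SequenceMatcher(None, a, b).ratio()
    print(f"{a} vs {b}: similarity {ratio:.2f}")
```

    High surface similarity is exactly why tall-man lettering exists on paper, and why an ASR model needs medical context, not just acoustics, to keep these pairs apart.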

    Abbreviations that shift meaning with speciality and context

    Abbreviation, possible meanings by context, and why this is risky:

    • MI: most commonly myocardial infarction (“heart attack”); historically also mitral insufficiency / mitral incompetence in some contexts. If a system (or reader) assumes the wrong expansion, care teams can misinterpret whether the issue is coronary ischemia or valve disease.
    • RA: rheumatoid arthritis, right atrium, or room air, among others. “RA” in a cardiology note vs a rheumatology note vs a respiratory observation can mean very different things; misreading it flips the clinical picture.
    • MS: multiple sclerosis, mitral stenosis, or morphine sulfate (the latter now discouraged as an abbreviation). Confusing a chronic neurologic disease, a valve lesion, and a high-risk opioid dose can radically change diagnosis, treatment, and safety decisions.
    • CP: chest pain in many ED/ICU notes vs cerebral palsy in neurology or rehab contexts. In triage notes, “CP” usually points to possible cardiac ischemia; in pediatrics it often refers to a lifelong neurodevelopmental condition. Context is everything.

    This matters because the cost of mishearing is so high in healthcare. In other domains, a mistaken word may be annoying; in medicine, a confidently wrong word can be a matter of life and death.

    If you mishear a drug name, it can change the entire treatment plan for the patient. A missed negation (“no chest pain”) reverses the interpretation of a symptom. Attributing a statement to the wrong speaker changes who is responsible for a decision in the chain of care. Domain-specialised medical ASR exists to reduce exactly these kinds of errors.

    Shunya Labs’ research that powers ASR designed for real clinical complexity

    Rather than blindly increasing dataset sizes under the banner of AI at scale, we prioritized curated, information-rich clinical audio, enabling the model to develop robust performance capabilities in uncertain scenarios.

    Zero STT Med is trained with a deliberate emphasis on challenging, high-entropy conditions:

    • Acoustic environment: alarms, ventilators, reverberation, masked speech, poor microphone quality, laptop microphones, simultaneous speakers.
    • Audio variety: local pronunciations, dialect changes, infrequent phoneme sequences, in-sentence code-mixing.
    • Language diversity: specialty terminology, similar drug names, abbreviations, and unconventional expressions within various departments.
    • Situational ambiguity: multi-morbid histories, complaints that change within the same visit, and acronyms whose meaning only becomes clear in relation to symptoms, medications, vitals, and specialty context.

    Clinical audio is not simple: emergency consults over alarms; OR chatter through masks; ICU handoffs with ventilator audio; telehealth visits on everyday devices with family members stepping in mid-call. A good system must distinguish speakers, track turns, and be consistent in this environment, not just in a quiet laboratory setting.

    Conventional methods that rely on fixed custom vocabularies, specialty packs, and frequent retraining are ultimately fragile and costly. We instead focus on getting the base model right: training directly on messy, multilingual, multi-speaker clinical audio so it naturally learns to handle the ambiguity and shifting medical language it will encounter in context, rather than a long list of manual exceptions.

    That is why we built Zero STT Med to stay accurate over time, even as new drug names, workflows, and clinical realities emerge.

    Medical transcription that understands clinical terminology

    Zero STT Med is not only designed to “hear” speech clearly; it is also designed to recognise when something is clinically important, identifying clinical terms and getting them right in the transcript.

    Our model can reliably transcribe:

    • Medications and drugs – brand and generic names, including look-alike/sound-alike pairs.
    • Diagnoses – primary problems, differentials, and comorbidities, even when they appear in long, conversational dictations.
    • Anatomical terms – body parts, regions, and structures as they are actually described in imaging, consults, and operative reports.
    • Procedures and interventions – surgeries, imaging studies, bedside procedures, and therapies mentioned in passing or as part of a longer plan.
    • Labs, measurements, and units – numbers, ranges, and units captured together so values remain clinically meaningful.
    • Clinical shorthand and acronyms – abbreviations whose meaning depends on specialty and context, resolved using the surrounding note rather than a fixed glossary.

    This produces transcripts that clinicians can rely on, and that are in turn more reliable for downstream systems like the EHR, coding workflows, and decision-support tools.

    Accurate where it matters the most—getting medical terms right

    When we discuss accuracy for Zero STT Med, our primary concern is whether transcriptions stay accurate on real medical data.

    On medical speech benchmarks with noisy, multi-speaker clinical audio, Zero STT Med reaches:

    • 11.1% Word Error Rate (WER)
    • 5.1% Character Error Rate (CER)

    outperforming ASR systems like OpenAI Whisper, ElevenLabs Scribe, and AWS Transcribe in such assessments.

    The outcome is a transcript that requires less time correcting drug names, conditions, and negations, so clinicians can focus on the quality of patient care.

    See how our model performs on your own cases in our Zero STT Med medical speech-to-text demo widget.

    Low latency real-time transcription for clinical conversations with multiple speakers

    In clinical settings, latency is more than a technical parameter—it directly shapes how people experience and adopt the tool.

    • Emergency consults are fast-paced and noisy.
    • OR and ICU communication happens through masks and around equipment.
    • Telehealth visits run on everyday hardware, with interruptions and multiple speakers.

    When the transcript lags behind the discussion, people tend to repeat themselves, slow their speech unnaturally, or stop using the system altogether. Slow transcription also mutes the benefit of live captioning or translation for understanding across languages and accents.

    Zero STT Med is engineered for streaming use cases so that transcription aligns with the flow of clinical conversation, even amidst environmental noise or interruptions.

    Importantly, this includes live speaker diarization: the system tracks who is speaking in real time (for example, doctor vs patient vs nurse) so the transcript remains structured and intelligible during the conversation.
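    As a sketch of what a speaker-attributed stream can look like, here is a minimal, hypothetical segment structure. The field names and labels are illustrative assumptions, not the actual Zero STT Med API or output format:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # diarization label, e.g. "doctor", "patient", "nurse"
    start: float   # seconds from stream start
    end: float     # seconds from stream start
    text: str      # transcribed words for this turn

def render(segments):
    """Format diarized segments as a readable, turn-by-turn transcript."""
    return "\n".join(f"[{s.start:6.1f}s] {s.speaker}: {s.text}" for s in segments)

segments = [
    Segment("doctor", 0.0, 2.4, "Any chest pain today?"),
    Segment("patient", 2.6, 4.1, "No chest pain, just short of breath."),
]
print(render(segments))
```

    Keeping the speaker label attached to each turn is what preserves negations like “no chest pain” with the right attribution, rather than leaving a flat wall of words.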

    Combined, low latency and live speaker diarization provide a truly ambient experience: notes are created during the visit itself, rather than reconstructed post hoc. Doctors can review, revise, and complete documentation with far less effort, keeping their attention on the patient in front of them.

    Privacy & security: enterprise-grade compliance, on your terms

    Clinical transcription must be held to the same standard as the rest of your clinical stack, particularly when it handles protected health information and imagery. Zero STT Med treats privacy, security, and compliance as core functionality rather than optional enhancements.

    • On-prem and private cloud options: run entirely inside your hospital network, private cloud, or VPC so that patient photos, audio, and transcripts never leave your environment to be transcribed.
    • Enterprise-grade compliance: designed to meet the privacy and security standards employed by hospitals and health systems globally, ensuring legal, security, and compliance teams have a straightforward process for review and approval.
    • Comprehensive security measures: data encryption during transmission and storage, robust access controls, and traceable actions ensure that sensitive clinical information is securely managed at all stages.

    The result is a medical speech-to-text solution that can live where care actually happens, within your own infrastructure, and that meets the needs of clinical, IT, and compliance audiences with an enterprise-grade, privacy-first design.

    Ready to Deploy: Medical Transcription API Integration

    We built Zero STT Med to integrate seamlessly with how hospitals and clinics operate today. The system is operational and designed for practical use during clinical sessions.

    To explore deployment and pricing, contact our team about Zero STT Med API integration.

  • Why Multilingual Voice AI Fails on Real-World Audio — and How We Fixed It

    Picture this: Your contact center handles calls in Hindi, Tamil, and English—sometimes all three in the same conversation. Your current speech-to-text system transcribes the English perfectly, mangles the Hindi, and completely gives up when customers code-switch mid-sentence. Sound familiar?

    You’re not alone. Most multilingual ASR (Automatic Speech Recognition) systems face a tradeoff: cover more languages and watch accuracy collapse, or stay accurate in a handful of languages and leave most of your users behind.

    At Shunya Labs, we built Zero STT to break that tradeoff—delivering production-grade accuracy across 200+ languages without the lag, cost, or complexity that usually comes with multilingual voice AI. Here’s how we did it, and why it matters for teams shipping voice features in contact centers, media, healthcare, and beyond.

    The Problem: Why Most Multilingual ASR Systems Struggle

    Traditional multilingual speech recognition systems force you to choose your pain:

    Option A: Broad coverage, poor accuracy. Systems that claim to support 100+ languages often deliver mediocre results across all of them—especially on the “long-tail” languages that matter most to your users.

    Option B: High accuracy, narrow coverage. Language-specific models work great for English or Mandarin, but leave you scrambling to patch together solutions for regional languages, accents, and code-mixing.

    Option C: Good accuracy and coverage, but painfully slow. Some systems achieve both breadth and precision by using massive models that take seconds to transcribe short utterances—useless for real-time applications like live captioning or voice assistants.

    The core issue? Most multilingual models are trained on massive, undifferentiated datasets where Hindi street noise gets the same weight as studio-quality English recordings. The model learns everything equally—which means it masters nothing that matters.

    Understanding the Tradeoffs: What You’re Actually Measuring

    Before we explain how Zero STT solves this, let’s break down the two fundamental tensions in multilingual ASR—and the metrics that reveal them.

    Tension #1: Accuracy ↔ Versatility

    The problem: When you ask a fixed-size model to cover many languages, its “parameter budget” per language shrinks. This phenomenon—called the “curse of multilinguality”—means that per-language accuracy often drops as coverage increases.

    Think of it like hiring one person to speak 50 languages versus hiring 50 native speakers. The generalist will miss nuances.

    Concrete example: OpenAI’s Whisper offers both English-only and multilingual checkpoints. The English-only version consistently outperforms the multilingual version on English audio, while the multilingual version wins on breadth. That’s the tradeoff in action.

    How accuracy is measured:

    • Word Error Rate (WER): The industry-standard metric. It counts substitutions, deletions, and insertions against the reference transcript. A WER of 5% means the system gets 95 out of 100 words correct. Lower is better.
    • Character Error Rate (CER): Useful for languages where “word” boundaries are fuzzy (like many Asian scripts). It measures edit distance at the character level. Also lower is better.
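    As a minimal sketch, both metrics reduce to an edit-distance computation over words (WER) or characters (CER). Production evaluation toolkits also apply text normalisation (casing, punctuation), which this toy version omits:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, deletions, and
    insertions to turn `hyp` into `ref`, computed with a single DP row."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(ref, hyp):
    """Word Error Rate: word-level edit distance over reference length."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character Error Rate: character-level edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("the cat sat", "the cat sit"))  # one substitution in three words
```

    Note that WER can exceed 100% when the hypothesis contains many insertions, which is one more reason to read error rates alongside the test conditions.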

    What to watch for: Don’t just look at headline WER numbers. Ask about performance on your specific languages, accents, and domains. A model with 3% WER on clean English might hit 20% WER on accented Hindi or code-mixed Hinglish.

    Tension #2: Versatility ↔ Latency

    The problem: Streaming ASR (the kind that transcribes speech as you speak) must emit words quickly with limited look-ahead. Less future context keeps latency low but hurts accuracy. More look-ahead improves accuracy but adds delay—making the system feel sluggish.

    For multilingual systems, this tension intensifies. Juggling multiple scripts and phonetic patterns often requires either larger context windows (raising latency) or careful architectural tricks to keep latency steady without losing accuracy.

    How latency is measured:

    • Real-Time Factor (RTF): Processing time divided by audio duration. RTF < 1 means faster than real-time (good). RTF = 1 is exactly real-time. RTF > 1 means the system can’t keep up.
    • Time to First Token (TTFT): The delay from when someone starts speaking to when the first word appears. This drives perceived “snappiness”—crucial for conversational AI.
    • Endpoint latency: The delay from when someone stops speaking to when the final transcript appears. Usually reported as P50/P90/P95 percentiles.
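    These measurements are straightforward to compute once you log per-request timings. A minimal sketch, using made-up latency samples and a simple nearest-rank percentile:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; < 1 means faster than real time."""
    return processing_seconds / audio_seconds

def percentile(samples, p):
    """Nearest-rank percentile, the simple scheme often used for P50/P95 reports."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Hypothetical endpoint-latency samples in milliseconds; note how a single
# slow outlier dominates the tail percentile but not the median.
latencies_ms = [120, 140, 150, 155, 160, 180, 210, 240, 300, 900]
print("RTF:", real_time_factor(2.5, 10.0))  # 0.25 -> 4x faster than real time
print("P50:", percentile(latencies_ms, 50), "ms")
print("P95:", percentile(latencies_ms, 95), "ms")
```

    The gap between P50 and P95 here is exactly why headline averages hide the sluggish experiences your users will actually notice.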

    What to watch for: Vendors love to report best-case RTF on high-end GPUs. Ask about P95 latency on your target hardware (often commodity CPUs) and real-world network conditions. Small differences here destroy user experience.

    Our Solution: Training on High Entropy Indic Data

    Here’s where Zero STT diverges from conventional multilingual ASR.

    Instead of training on every available hour of audio, we curate our training data based on information density—what we call “high-entropy” samples. Each audio clip gets scored on four dimensions:

    • Acoustic entropy: Is the audio noisy, reverberant, or captured on low-quality devices? These “hard” conditions force the model to generalize better.
    • Phonetic entropy: Does it contain rare sounds or unusual sound combinations? This helps with accents and dialectal variation.
    • Linguistic entropy: Does it use uncommon vocabulary, syntax, or jargon? This improves performance on domain-specific language (medical terms, legal jargon, brand names).
    • Contextual entropy: Does the audio-text pair contain strong predictive signals—like code-mixing (Hinglish, Tanglish) or proper nouns?

    We keep high-surprise samples and remove redundant samples using a threshold that increases exponentially across training rounds. Think of it as teaching a student with increasingly challenging problems, not endless repetition of easy ones.
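    The selection loop can be sketched as follows. The surprise scores and the threshold schedule here are illustrative assumptions, not Shunya Labs’ actual scoring pipeline; the per-dimension entropy scoring is assumed to happen upstream:

```python
def select_high_entropy(samples, rounds=3, base_threshold=0.2, growth=2.0):
    """Keep samples whose surprise score clears a threshold that grows
    exponentially each round, discarding redundant (low-surprise) clips.
    `samples` is a list of (clip_id, surprise_score) pairs."""
    kept = list(samples)
    for r in range(rounds):
        threshold = base_threshold * (growth ** r)  # exponential schedule
        kept = [(cid, score) for cid, score in kept if score >= threshold]
    return kept

# Toy corpus: clean, repetitive audio scores low; noisy, code-mixed audio scores high.
corpus = [
    ("clean_studio_en", 0.10),
    ("code_mixed_hinglish", 0.90),
    ("icu_alarms", 0.75),
    ("repeat_prompt", 0.15),
]
print(select_high_entropy(corpus))
```

    Each round raises the bar, so easy, redundant clips drop out early while the genuinely hard examples keep shaping the model.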

    Why this works in practice

    Hard audio becomes easy in production. By training on noisy and device-diverse clips, the model doesn’t need extra look-ahead to stay accurate in real-world conditions. The result is streaming-grade latency without giving up accuracy.

    High linguistic entropy means fewer breakdowns on real speech. Indic languages are inherently higher entropy—rich morphology and agreement, multiple grammatical genders, and flexible word order (often SOV with variations). Training on this structural diversity exposes the model to many “difficult” cases (surprises), so it learns more efficiently, stays lighter, and performs better under uncertainty.

    Compute efficiency with state-of-the-art accuracy. Our entropy-guided pruning focuses training on information-dense hours instead of brute-force scale, reaching 3.10% WER on our universal model. For full results, see our benchmarks.

    Real-time serving at scale. The models are engineered for streaming-grade latency and faster-than-real-time throughput on standard GPU tiers, so you can ship responsive captions and agents without exotic hardware.

    Breadth that holds up. Where many stacks look great on one or two head languages and then slip, our multilingual models stay reliable across diverse languages—including Indic—because the training data preserves the right diversity, not just more of the same.

    What This Means for You

    For Contact Centers

    • Handle code-mixed conversations (English ↔ Hindi, Tamil ↔ English)
    • Transcribe noisy call-center audio accurately without expensive noise-cancellation preprocessing
    • Run on-premises for compliance without sacrificing speed or accuracy

    For Media & News

    • Live-caption multilingual broadcasts with sub-second latency
    • Transcribe field recordings with background noise and cross-talk
    • Support regional languages without maintaining separate pipelines

    For Healthcare

    • Accurately capture medical terminology across languages
    • Run offline for patient privacy (HIPAA/GDPR compliance)
    • Transcribe doctor-patient conversations with code-mixing and accents

    For Developers

    • Deploy on commodity CPUs—no GPU vendor lock-in
    • Privacy-first architecture: on-prem, offline, or cloud

    Getting Started with Zero STT

    One question we get often: “What is code-mixing, and why should I care?” Code-mixing is when speakers alternate between languages mid-conversation—like “Today ka meeting postpone ho gaya hai” (mixing English and Hindi). It’s extremely common in multilingual regions, from Mumbai call centers to Singapore offices, but it breaks most ASR systems. They’re trained on clean, monolingual speech and simply don’t know what to do when someone switches languages mid-sentence.

    Zero STT handles code-mixing natively because our high-entropy training specifically includes these mixed-language scenarios. We don’t treat them as edge cases—they’re the norm for millions of users.

    How does this compare to the big cloud providers? While services like Google Cloud Speech-to-Text and AWS Transcribe offer broad language coverage, they’re cloud-only and can struggle with code-mixing and long-tail languages. Zero STT matches or exceeds their accuracy on Indic languages while giving you the flexibility of on-prem deployment, offline operation for data privacy (GDPR, HIPAA compliant), and lower latency on commodity hardware—no expensive GPU infrastructure required.

    Ready to see it in action?

    Test Zero STT in your browser right now. Switch between languages, upload your own audio clips (noisy call recordings, accented speech, code-mixed conversations), and see how the model performs under real conditions. Launch Demo for Zero STT →

    Browse our full list of 200+ supported languages, integration guides, and API reference in our documentation. View Zero STT Documentation →

    The Bottom Line

    Multilingual ASR doesn’t have to mean choosing between accuracy, speed, and coverage. By training on high-entropy data—especially the messy, real-world audio that reflects actual user conditions—Zero STT delivers all three.

    Whether you’re building voice features for a contact center in Mumbai, a newsroom in Jakarta, or a telemedicine platform in Manila, you need ASR that works on the audio your users actually produce: noisy, accented, code-mixed, and real.

    That’s what we built.

    Evaluating Zero STT for your organization? Reach out to us and talk to an expert for your use case. Book a meeting →