Category: Build & Learn

  • Code-Switching ASR Explained: Why Hinglish Breaks Every Standard Model

    Code-Switching ASR Explained: Why Hinglish Breaks Every Standard Model

    TL;DR / Key Takeaways:

    • Most voice technology was built for clean, single-language speech and struggles the moment someone mixes Hindi and English or any other language
    • This is not a user error. Code-switching is how hundreds of millions of Indians naturally communicate
    • Standard ASR models fail at Hinglish because of gaps in acoustic modeling, vocabulary, language modeling, and training data
    • Fixing this requires a ground-up approach, not a patch on an existing English or Hindi model

    Most people have been there. You are talking to a voice assistant, a customer support bot, or a speech-to-text app, and mid-sentence it completely loses you. Not because you mumbled. Not because the connection was bad. Simply because you said something like:

    “Yaar, can you just reschedule the meeting to 4 baje?”

    The app either returns garbled text, skips the Hindi entirely, or stares back at you with a blinking cursor that quietly implies you said something wrong. You did not. You spoke the way most people in India speak every single day, and the technology just was not built for it.

    This is the code-switching problem. It sits at the heart of why so much voice technology feels broken the moment a real Indian user picks it up.

    What Code-Switching Actually Is

    Linguists have studied code-switching for decades. At its core, it is the practice of moving between two or more languages within a single conversation, sometimes within a single sentence. Bilingual and multilingual speakers do this naturally, fluidly, and often without noticing.

    In India, the most prominent example is Hinglish, the blend of Hindi and English that dominates urban conversation. But code-switching in India goes far beyond Hinglish. Tamil speakers in Chennai routinely mix Tamil with English. Bengali professionals in Kolkata do the same. In the South, you get Tanglish, Kanglish, Manglish. In Maharashtra, Marathi and English weave together constantly.

    The critical thing to understand is that speakers switch to convey nuance, signal social identity, fill lexical gaps, or simply because one language has a better word for the thing they are trying to say. “Jugaad” does not have an English equivalent. “Overwhelming” does not have a Hindi one that carries exactly the same feeling. So speakers use both.

    When you build speech technology that cannot handle this, you are not building speech technology for India. You are building something that works for a narrow slice of formal, scripted, monolingual speech that most real users will never produce.

    Why Standard ASR Models Break Down

    To understand why Hinglish is so difficult for most ASR systems, you need to understand how those systems are built.

    A standard automatic speech recognition model is trained on audio data paired with text transcriptions. The model learns to map acoustic patterns to linguistic units, usually phonemes or subword tokens, and then to string those units into words and sentences. The quality of the output depends enormously on how well the training data matches the input it will later see in production. 

    Most of the large ASR models in circulation today were trained overwhelmingly on English data, with some multilingual variants trained on parallel datasets in many languages, each treated as a separate, clean category. The model learns English. Or it learns Hindi. It does not learn the space between them.

    When a code-switched utterance arrives, several things go wrong at once.

    The acoustic model is the first point of failure. Hindi phonemes and English phonemes are genuinely different. The retroflex consonants in Hindi, the aspirated stops, the nasal vowels, these sounds do not exist in English in the same form. When a speaker slides from English into Hindi mid-sentence, the acoustic character of the audio shifts in ways a model trained only on one language is not equipped to follow.

    The language model compounds the problem. Modern ASR systems use language models to help decide which word sequence is most probable given the acoustic evidence. A language model trained on English assigns near-zero probability to Hindi words appearing in an English sentence. 

    So even if the acoustic model correctly identifies the sounds, the language model corrects them away, replacing them with the nearest English approximation. The Hindi word “karo” becomes “cargo.” “Bata” becomes “butter.” The output is fluent-sounding nonsense.
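    To make this concrete, here is a toy rescoring step in Python. Everything in it is invented for illustration (the vocabulary, the probabilities, the scoring formula): the point is only that an English-only language model assigns out-of-vocabulary Hindi words zero probability, so no amount of acoustic confidence can save them.

```python
import math

# Toy language-model rescoring with an English-only vocabulary.
# All words and probabilities here are invented for illustration.
ENGLISH_LM = {"cargo": 0.02, "butter": 0.01, "meeting": 0.03}

def rescore(candidates, lm, lm_weight=1.0):
    """Pick the word maximizing acoustic log-prob + weighted LM log-prob.

    Out-of-vocabulary words get LM probability 0, i.e. log-prob -inf,
    so they can never win no matter how good the acoustics are."""
    def score(candidate):
        word, acoustic_prob = candidate
        lm_prob = lm.get(word, 0.0)
        lm_logp = math.log(lm_prob) if lm_prob > 0 else float("-inf")
        return math.log(acoustic_prob) + lm_weight * lm_logp
    return max(candidates, key=score)[0]

# The acoustic model is fairly sure it heard the Hindi word "karo"...
print(rescore([("karo", 0.90), ("cargo", 0.40)], ENGLISH_LM))   # -> cargo
# ...and "bata" meets the same fate.
print(rescore([("bata", 0.85), ("butter", 0.30)], ENGLISH_LM))  # -> butter
```

    Real decoders work over full word lattices rather than isolated candidates, but the failure mode is the same: the LM vetoes anything outside its lexicon.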

    Then there is the vocabulary problem. Code-switched speech pulls from two lexicons simultaneously. A model trained on a single language simply does not have the vocabulary to recognize words from the other. This is not a tuning issue. It is a fundamental architectural gap.

    Finally, there is the prosody and rhythm problem. Hindi and English have different stress patterns, different intonation curves, and different timing structures. When speakers mix languages, the prosodic cues that ASR models use to segment words and detect sentence boundaries become unreliable. The model loses its footing even at the most basic level of figuring out where one word ends and the next begins.

    The Data Problem Nobody Wants to Talk About

    Building a model that handles code-switching well requires training data that reflects code-switching, and this is where most efforts quietly fail.

    Collecting naturalistic code-switched speech is hard. You cannot simply crawl the web for audio in the way you can for text. You need real conversations, real phone calls, real customer interactions where people are speaking the way they actually speak rather than performing a scripted version of their language for a microphone. That data is expensive to collect, ethically sensitive to handle, and time-consuming to transcribe accurately.

    Transcribing code-switched speech is its own challenge. A transcriber fluent in Hindi may not accurately capture English portions and vice versa. Annotation guidelines for mixed-language text are not standardized. The same utterance might be written differently by ten different annotators, with inconsistent choices about spelling, script (Devanagari vs. Roman), and word boundaries.

    This is one of the main reasons large general-purpose models perform so poorly on mixed languages despite performing reasonably well on them separately. The training data simply does not contain enough naturalistic code-switched examples to teach the model what to do when languages collide.

    What It Actually Takes to Solve This

    Solving this takes three things working together. The first is building language-agnostic acoustic representations.

    Rather than training separate acoustic models for each language and hoping they transfer, you train a single model on multilingual data with enough phonemic overlap to build shared representations. The model learns to represent sounds at a level of abstraction that generalizes across language boundaries.

    The second is expanding the vocabulary and tokenization strategy. 

    Code-switched models need subword vocabularies that include units from both languages, and they need language identification signals that tell the language model which lexical distribution to draw from at any given moment. Some architectures do this with explicit language ID tags; others learn to do it implicitly from patterns in the training data.
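    A minimal sketch of the explicit-tag idea, with the tag names and the toy word-to-language lookup invented for illustration (real systems infer language identity from the audio or the token stream rather than from a dictionary):

```python
# Sketch of explicit language-ID tagging for a code-switched transcript.
# The <en>/<hi> tags and the toy lookup table are illustrative assumptions.
HINDI_WORDS = {"yaar", "baje", "karo", "kya", "bata"}

def tag_languages(words):
    """Insert an <en> or <hi> tag whenever the language changes."""
    tagged, current = [], None
    for word in words:
        lang = "hi" if word.lower() in HINDI_WORDS else "en"
        if lang != current:
            tagged.append(f"<{lang}>")
            current = lang
        tagged.append(word)
    return tagged

print(" ".join(tag_languages("yaar can you reschedule to 4 baje".split())))
# -> <hi> yaar <en> can you reschedule to 4 <hi> baje
```

    Tags like these give the language model an explicit signal about which lexical distribution to draw from at each point in the sentence.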

    The third, and in some ways the most important, is training on real code-switched data at scale. 

    There is no shortcut here. A model that has never been trained on Hinglish will not suddenly learn to handle Hinglish because it has seen a lot of Hindi and a lot of English. The mixing patterns, the syntactic borrowings, the phonological adaptations that happen when languages blend, these are things the model has to learn from examples.

    Where Shunya Labs Fits Into This

    At Shunya Labs, this is not a theoretical problem. It is the core of what the team has been building toward.

    Shunya Labs was designed from the ground up for the way people actually communicate. That means training on data that includes code-switched speech rather than treating it as noise to be filtered out. It means building a vocabulary and acoustic model that can handle the phonemic landscape of Indian languages without forcing every utterance through an English or formal Hindi lens. And it means testing against real-world speech that reflects the diversity of accents, dialects, and mixing patterns that show up when a product reaches users across the country.

    The result is an ASR system that can handle a sentence like “Kya aap mujhe tomorrow ka schedule send kar sakte ho?” without losing the thread, because the model was trained to understand the structure and patterns of code-switched speech at a deeper level.

    At Shunya Labs, the speech technology works for the full range of Indian communication, not a filtered version of it. If you are building a voice product for India and your ASR only works when users speak like they are dictating a formal document, you are building on a foundation that will crack the moment real users show up.

    Why This Matters for Products Built on Voice

    The business case for getting this right is more straightforward than it might seem.

    Voice interfaces in India are not a nice-to-have. For a significant portion of the population, they are the most natural and accessible way to interact with technology. Voice search, voice-driven customer support, voice-based financial services, these are not futuristic applications. They are live, growing markets where the quality of the underlying speech recognition directly determines whether the product works or fails.

    Every percentage point of word error rate on code-switched speech is not an abstract benchmark number. It is a user who could not complete their task. It is a customer service interaction that went sideways because the system misheard a key instruction. It is a farmer who could not access agricultural information because the voice interface could not parse the way he naturally speaks.
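    Word error rate, for reference, is just the word-level edit distance between the reference transcript and the hypothesis, divided by the reference length. A minimal implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard word-level edit-distance DP table."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)

# One Hindi word misrecognized as its English near-homophone: 1 error in 5 words.
print(word_error_rate("please karo the needful update",
                      "please cargo the needful update"))  # -> 0.2
```

    On code-switched speech, each misrecognized Hindi word in an otherwise-correct sentence adds a full substitution error, which is why monolingual models post such poor numbers on Hinglish benchmarks.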

    Building Speech That Reflects Reality

    Standard ASR models were built for a world where speakers are monolingual, accents are predictable, and language boundaries are clean. That world never really existed, and it certainly does not describe India.

    The path forward is to build models capable enough to meet users where they are.

  • Getting Started with ASR APIs: Python Quickstart

    Getting Started with ASR APIs: Python Quickstart

    Ever wonder how your phone transcribes your voice messages or how virtual assistants understand your commands? The magic behind it is Automatic Speech Recognition (ASR). ASR APIs allow developers to integrate this powerful technology into their own applications.

    What is an ASR API?

    An ASR API is a service that converts spoken language (audio) into written text. You send an audio file to the API, and it returns a transcription. This is incredibly useful for a wide range of applications, from creating subtitles for videos to enabling voice-controlled interfaces and analyzing customer service calls.

    This simple process enables complex features like:

    • 🎬 Auto-generated subtitles
    • 🗣️ Voice-controlled applications
    • 📞 Speech analytics for customer calls

    Before we dive into the code, you’ll need three things for most ASR providers:

    1. An API Key: Sign up with an ASR provider (like Google Cloud Speech-to-Text, AssemblyAI, Deepgram, or AWS Transcribe) to get your unique API key. This key authenticates your requests.
    2. An Audio File: Have a sample audio file (e.g., in .wav, .mp3, or .m4a format) ready to test. For this guide, we’ll assume you have a file named my-audio.wav.
    3. API Endpoint: The URL for the service, which we’ll assume is https://api.shunya.org/v1/transcribe.

    Integrating ASR APIs with Python

    Automatic Speech Recognition (ASR) APIs allow your applications to convert spoken language into text, unlocking powerful new user experiences. Let’s go step by step so you can confidently integrate ASR APIs—using Python.

    We’ll use the requests library to handle all our communication with the API.

    Step 1: Set Up Your Environment

    First, create a virtual environment and install requests.

    # Create and activate a virtual environment
    python -m venv venv
    source venv/bin/activate  # On Windows, use 'venv\Scripts\activate'
    
    # Install the necessary library
    pip install requests

    Step 2: Building the Python Script

    Create a file named transcribe_shunya.py and let’s build it section by section.

    Part A: Configuration

    First, we’ll import the necessary libraries and set up our configuration variables at the top of the file. This makes them easy to change later.

    # transcribe_shunya.py
    import requests
    
    # --- Configuration --- 
    API_KEY = "YOUR_SHUNYA_LABS_API_KEY" 
    API_URL = "https://api.shunya.org/v1/transcribe" 
    AUDIO_FILE_PATH = "my_punjabi_audio.wav" 
    # --------------------

    Here’s what each variable does:

    • API_KEY: Your personal authentication token.
    • API_URL: The endpoint where transcription jobs are submitted.
    • AUDIO_FILE_PATH: Path to your local audio file.

    Part B: Submitting the Transcription Job

    This function handles the POST request. It opens your audio file, specifies the language and model (pingala-v1), sends them to the API, and returns the parsed JSON result.

    def submit_transcription_job(api_url, api_key, file_path):
        """Submits the audio file to the ASR API and returns the parsed JSON result."""
        print("1. Submitting transcription job...")
        headers = {"Authorization": f"Token {api_key}"}
    
        # Specify language and model; adjust based on the API docs
        payload = {
            "language": "pa",  # ISO 639-1 code for Punjabi
            "model": "pingala-v1"
        }
        
        try:
            # Open the file in binary read mode ('rb')
            with open(file_path, 'rb') as audio_file:
                # The 'files' dictionary is how 'requests' sends multipart/form-data
                files = {'audio_file': (file_path, audio_file, 'audio/wav')}
                response = requests.post(api_url, headers=headers, data=payload, files=files)
                response.raise_for_status()  # Raises an error for bad responses (4xx or 5xx)
                
                result = response.json()
                print(f"   -> Job submitted successfully with ID: {result.get('job_id', 'N/A')}")
                return result
        except requests.exceptions.RequestException as e:
            print(f"   -> Error submitting job: {e}")
            return None

    Part C: Displaying the Transcription Result

    Once the API finishes processing, it returns a JSON response containing your transcription and metadata.

    def print_transcription_result(result):
        """Display transcription text and segments."""
        if not result or not result.get("success"):
            print("❌ Transcription failed.")
            return
        
        print("\n✅ Transcription Complete!")
        print("=" * 50)
        print("Final Transcript:\n")
        print(result.get("text", "No transcript found"))
        print("=" * 50)
        
        # Optional: print speaker segments
        if result.get("segments"):
            print("\nSpeaker Segments:")
            for seg in result["segments"]:
                print(f"[{seg['start']}s → {seg['end']}s] {seg['speaker']}: {seg['text']}")

    Part D: Putting It All Together

    Finally, the main function orchestrates the entire process by calling our functions in the correct order. The if __name__ == "__main__": block ensures this code only runs when the script is executed directly.

    def main():
        """Main function to run the transcription process."""
        result = submit_transcription_job(API_URL, API_KEY, AUDIO_FILE_PATH)
        
        if result:
            print_transcription_result(result)
    
    if __name__ == "__main__":
        main()

    Step 3: Run the Python Script

    With your audio file in the same folder, run:

    python transcribe_shunya.py

    If everything’s set up correctly, you’ll see:

    1. Submitting transcription job…
       -> Job submitted successfully with ID: abc123
    
    ✅ Transcription Complete!
    ==================================================
    Final Transcript:
    
    ਸਤ ਸ੍ਰੀ ਅਕਾਲ! ਤੁਸੀਂ ਕਿਵੇਂ ਹੋ?
    ==================================================

    How It Works Behind the Scenes

    Here’s what your script actually does step by step:

    1. Upload: The script sends your audio and metadata to ShunyaLabs’ ASR REST API.
    2. Processing: The backend model (Pingala V1) performs multilingual ASR, handling Indian languages, accents, and speech clarity.
    3. Response: The API returns a JSON response with:
      • Full text transcript
      • Timestamps for each segment
      • Speaker diarization info (if enabled)

    This same submit → process → retrieve pattern is used by nearly every ASR provider, from Google Cloud to AssemblyAI to Pingala; fully asynchronous providers insert a polling step between submission and retrieval.
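    The script above works against an endpoint that returns the transcript directly, but many providers are asynchronous: the submit call returns a job ID and you poll a status endpoint until the job finishes. A minimal sketch of that loop, with the status-fetching function injected as a callable so the polling logic stands alone (the exact response fields are hypothetical; check your provider's docs):

```python
import time

def poll_for_result(get_status, job_id, interval=2.0, timeout=120.0):
    """Poll get_status(job_id) until the job completes, fails, or times out.

    get_status is any callable returning a dict like {"status": ..., "text": ...};
    the field names here are assumptions, not a documented schema."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = get_status(job_id)
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise RuntimeError(f"Job {job_id} failed: {job.get('error')}")
        time.sleep(interval)  # back off between status checks
    raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")

# Demo with a fake status function that completes on the third call.
calls = {"n": 0}
def fake_status(job_id):
    calls["n"] += 1
    if calls["n"] < 3:
        return {"status": "processing"}
    return {"status": "completed", "text": "hello world"}

print(poll_for_result(fake_status, "abc123", interval=0.01)["text"])  # -> hello world
```

    In production you would also cap retries on transient HTTP errors and use exponential backoff rather than a fixed interval.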

    You can also use WebSocket streaming for near real-time transcription at:

    wss://tb.shunyalabs.ai/ws

    Best Practices

    1. Keep files under 10 MB for WebSocket requests (REST supports larger).
    2. Store API keys securely, e.g. export SHUNYA_API_KEY="your_key_here"
    3. Use clean mono audio (16kHz) for best accuracy.
    4. Experiment with parameters like:
      • --language-code hi for Hindi
      • --output-script Devanagari for Hindi text output
    5. Enable diarization to detect who’s speaking in multi-speaker audio.
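    Best practice 2 in code: read the key from the environment instead of hard-coding it in the script. The helper and environment-variable name below are just a suggested pattern, not part of the Shunya Labs API.

```python
import os

def get_api_key(env_var="SHUNYA_API_KEY"):
    """Fetch the API key from the environment; fail fast with a clear message."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Set {env_var} before running, e.g. export {env_var}=your_key_here"
        )
    return key
```

    Replacing the hard-coded API_KEY constant with get_api_key() keeps credentials out of version control.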

    Using the REST API Directly (Optional)

    If you prefer using curl, try this:

    curl -X POST "https://tb.shunyalabs.ai/transcribe" \
      -H "X-API-Key: YOUR_SHUNYALABS_API_KEY" \
      -F "file=@sample.wav" \
      -F "language_code=auto" \
      -F "output_script=auto"

    The API responds with JSON:

    {
      "success": true,
      "text": "Good morning everyone, this is a sample transcription using ShunyaLabs ASR.",
      "detected_language": "English",
      "segments": [
        {
          "start": 0.0,
          "end": 3.5,
          "speaker": "SPEAKER_00",
          "text": "Good morning everyone"
        }
      ]
    }

    Final Thoughts

    You’ve just built a working speech-to-text integration using Python and the ShunyaLabs Pingala ASR API – the same foundation that powers real-time captioning, transcription tools, and voice analytics platforms.

    With its multilingual support, low-latency WebSocket streaming, and simple REST API, Pingala makes it easy for developers to integrate accurate ASR into any workflow – whether you’re building for India or the world.

    Automatic Speech Recognition bridges the gap between humans and machines, making technology more natural and inclusive.

    As models like Pingala V1 continue advancing in language accuracy and CPU efficiency, ASR is becoming not just smarter, but also more accessible — ready to transform every app that can listen.

  • Getting Started with ASR APIs: Node.js Quickstart

    Getting Started with ASR APIs: Node.js Quickstart

    Ever wonder how your phone transcribes your voice messages or how virtual assistants understand your commands? The magic behind it is Automatic Speech Recognition (ASR). ASR APIs allow developers to integrate this powerful technology into their own applications.

    What is an ASR API?

    An ASR API is a service that converts spoken language (audio) into written text. You send an audio file to the API, and it returns a transcription. This is incredibly useful for a wide range of applications, from creating subtitles for videos to enabling voice-controlled interfaces and analyzing customer service calls.

    This simple process enables complex features like:

    • 🎬 Auto-generated subtitles
    • 🗣️ Voice-controlled applications
    • 📞 Speech analytics for customer calls

    Before we dive into the code, you’ll need three things for most ASR providers:

    1. An API Key: Sign up with an ASR provider (like Google Cloud Speech-to-Text, AssemblyAI, Deepgram, or AWS Transcribe) to get your unique API key. This key authenticates your requests.
    2. An Audio File: Have a sample audio file (e.g., in .wav, .mp3, or .m4a format) ready to test. For this guide, we’ll assume you have a file named my-audio.wav.
    3. API Endpoint: The URL for the service, which we’ll assume is https://api.shunya.org/v1/transcribe.

    Integrating ASR APIs with Node.js

    Let’s go step by step and build a working Node.js script that sends an audio file to ShunyaLabs Pingala ASR API, retrieves the transcription, and displays it neatly on your terminal.

    We’ll use the following dependencies:

    • axios — for HTTP communication
    • form-data — to handle multipart file uploads

    Step 1: Set Up Your Environment

    Make sure you have Node.js v14+ installed, then set up your project:

    # Create a project folder
    mkdir asr-node-demo && cd asr-node-demo
    
    # Initialize npm ("type": "module" enables the ES import syntax used below)
    npm init -y
    npm pkg set type=module
    
    # Install dependencies
    npm install axios form-data

    Step 2: Building the Node.js Script

    Create a file named transcribe_shunya.js and let’s build it section by section.

    Part A: Configuration

    First, we’ll import the necessary libraries and set up our configuration variables at the top of the file. This makes them easy to change later.

    // transcribe_shunya.js
    import fs from "fs";
    import axios from "axios";
    import FormData from "form-data";
    
    // --- Configuration ---
    const API_KEY = "YOUR_SHUNYA_LABS_API_KEY";
    const API_URL = "https://tb.shunyalabs.ai/transcribe";
    const AUDIO_FILE_PATH = "sample.wav";
    // --------------------

    Here’s what each variable does:

    • API_KEY: Your personal authentication token.
    • API_URL: The endpoint where transcription jobs are submitted.
    • AUDIO_FILE_PATH: Path to your local audio file.

    Part B: Submitting the Transcription Job

    This function handles the POST request. It opens your audio file, sets the language and output-script options, and sends everything to the API to start the process.

    async function submitTranscriptionJob(apiUrl, apiKey, filePath) {
      console.log("1. Submitting transcription job...");
      
      const form = new FormData();
      form.append("file", fs.createReadStream(filePath));
      form.append("language_code", "auto");
      form.append("output_script", "auto");
      
      try {
        const response = await axios.post(apiUrl, form, {
          headers: {
            "X-API-Key": apiKey,
            ...form.getHeaders(),
          },
        });
        
        console.log("   -> Job submitted successfully!");
        return response.data;
      } catch (error) {
        console.error("   -> Error submitting job:", error.response?.data || error.message);
        return null;
      }
    }

    Part C: Displaying the Transcription Result

    Once the API finishes processing, it returns a JSON response containing your transcription and metadata.

    function printTranscriptionResult(result) {
      if (!result || !result.success) {
        console.log("❌ Transcription failed.");
        return;
      }
    
      console.log("\n✅ Transcription Complete!");
      console.log("=".repeat(50));
      console.log("Final Transcript:\n");
      console.log(result.text || "No transcript found");
      console.log("=".repeat(50));
    
      if (result.segments && result.segments.length) {
        console.log("\nSpeaker Segments:");
        result.segments.forEach((seg) => {
          console.log(`[${seg.start}s → ${seg.end}s] ${seg.speaker}: ${seg.text}`);
        });
      }
    }

    Part D: Putting It All Together

    Finally, the main function orchestrates the entire process by calling our functions in the correct order.

    async function main() {
      const result = await submitTranscriptionJob(API_URL, API_KEY, AUDIO_FILE_PATH);
      
      if (result) {
        printTranscriptionResult(result);
      }
    }
    
    main();

    Step 3: Run the Node.js Script

    With your audio file in the same folder, run:

    node transcribe_shunya.js

    If everything’s set up correctly, you’ll see:

    1. Submitting transcription job…
       -> Job submitted successfully!
    
    ✅ Transcription Complete!
    ==================================================
    Final Transcript:
    
    ਸਤ ਸ੍ਰੀ ਅਕਾਲ! ਤੁਸੀਂ ਕਿਵੇਂ ਹੋ?
    ==================================================

    How It Works Behind the Scenes

    Here’s what your script actually does step by step:

    1. Upload: The script sends your audio and metadata to ShunyaLabs’ ASR REST API.
    2. Processing: The backend model (Pingala V1) performs multilingual ASR, handling Indian languages, accents, and speech clarity.
    3. Response: The API returns a JSON response with:
      • Full text transcript
      • Timestamps for each segment
      • Speaker diarization info (if enabled)

    This same submit → process → retrieve pattern is used by nearly every ASR provider, from Google Cloud to AssemblyAI to Pingala; fully asynchronous providers insert a polling step between submission and retrieval.

    Best Practices

    1. Keep files under 10 MB for WebSocket requests (REST supports larger).
    2. Store API keys securely, e.g. export SHUNYA_API_KEY="your_key_here"
    3. Use clean mono audio (16kHz) for best accuracy.
    4. Experiment with parameters like:
      • --language-code hi for Hindi
      • --output-script Devanagari for Hindi text output

    Final Thoughts

    You’ve just built a working speech-to-text integration in Node.js using ShunyaLabs Pingala ASR API – the same technology that powers real-time captioning, transcription tools, and voice analytics systems.

    With its multilingual support, low-latency streaming, and simple REST/WebSocket APIs, Pingala makes it easy for developers to bring accurate, fast, and inclusive ASR into any workflow – whether for India or the world.

    Automatic Speech Recognition bridges the gap between humans and machines, making technology more natural and inclusive.

    As models like Pingala V1 continue to improve in accuracy and efficiency, ASR is becoming not only smarter – but accessible to every app that can listen.