<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Uhem's blog]]></title><description><![CDATA[Building, playing, and shooting. This is a collection of notes on software development, music production, and photography—focusing on the technical side of the ]]></description><link>https://blog.uempxl.com</link><image><url>https://cdn.hashnode.com/uploads/logos/69e8d3c95d1c107105561608/f59cba45-c912-4923-bdf9-a3ba8f4247d5.svg</url><title>Uhem&apos;s blog</title><link>https://blog.uempxl.com</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 22 Apr 2026 17:47:28 GMT</lastBuildDate><atom:link href="https://blog.uempxl.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Using Google TTS to make Nepali Speech]]></title><description><![CDATA[Project Overview

Project Name: Nepali TTS Studio

Tech Stack: HTML5, Vanilla JavaScript, Tailwind CSS, Google Gemini API (gemini-2.5-flash-preview-tts), Web Audio API.

Platform: Web (Client-Side), h]]></description><link>https://blog.uempxl.com/using-google-ttl-to-make-nepali-speech</link><guid isPermaLink="true">https://blog.uempxl.com/using-google-ttl-to-make-nepali-speech</guid><dc:creator><![CDATA[Uhenraj Jarga Magar]]></dc:creator><pubDate>Wed, 22 Apr 2026 14:27:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69e8d3c95d1c107105561608/70c8b9f8-3aa7-4464-84aa-16bfcab5bb21.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3><strong>Project Overview</strong></h3>
<ul>
<li><p><strong>Project Name:</strong> Nepali TTS Studio</p>
</li>
<li><p><strong>Tech Stack:</strong> HTML5, Vanilla JavaScript, Tailwind CSS, Google Gemini API (<code>gemini-2.5-flash-preview-tts</code>), Web Audio API.</p>
</li>
<li><p><strong>Platform:</strong> Web (Client-Side), hosted on GitHub Pages.</p>
</li>
<li><p><strong>Live Demo:</strong> <a href="https://ficcc.github.io/nepali-tts/">https://ficcc.github.io/nepali-tts/</a></p>
</li>
</ul>
<h3><strong>The Problem: Accessibility and Language Support</strong></h3>
<p>Text-to-Speech (TTS) technology has advanced rapidly, but high-quality, natural-sounding voice generation often remains gated behind complex backend setups or paid subscriptions, or is focused primarily on English. Finding lightweight, accessible tools that natively support languages like Nepali is challenging.</p>
<p>The goal was to build an application that could take Nepali script and convert it into high-fidelity audio without requiring the user to install heavy dependencies or rely on a developer managing an expensive backend server.</p>
<h3><strong>The Solution: A Client-Side Architecture</strong></h3>
<p>I developed <strong>Nepali TTS Studio</strong>, a strictly client-side web application. By leveraging Google’s Gemini API directly from the browser, the application offloads the heavy lifting of audio generation to Google's infrastructure while keeping the deployment architecture as simple as possible—just static HTML, JavaScript, and CSS.</p>
<h4><strong>Key Features:</strong></h4>
<ul>
<li><p><strong>Bring-Your-Own-Key (BYOK):</strong> To keep the app free and serverless without exposing private API keys, the UI securely prompts the user for their own Gemini API key.</p>
</li>
<li><p><strong>Granular Voice Control:</strong> Users can select between multiple Gemini voice profiles (Kore, Aoede, Zephyr, etc.) and adjust the speaking speed.</p>
</li>
<li><p><strong>Local Export:</strong> The app doesn't just play the audio; it allows users to download the generated speech as a standard <code>.wav</code> file for external use.</p>
</li>
</ul>
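<p>The BYOK flow can be sketched roughly as follows. Note that the element names and the <code>sessionStorage</code> choice here are my illustration, not necessarily how the project stores the key; the point is that the key stays in the browser and is only ever sent in the user's own requests to Google:</p>
<pre><code class="language-javascript">// Illustrative BYOK handling: ask for the key once, cache it for the session.
function getApiKey() {
  const cached = sessionStorage.getItem("gemini_api_key");
  if (cached) return cached;
  const entered = window.prompt("Paste your Gemini API key (from Google AI Studio):");
  if (!entered) return "";
  const key = entered.trim();
  sessionStorage.setItem("gemini_api_key", key);
  return key;
}
</code></pre>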
<h3><strong>Technical Implementation</strong></h3>
<h4><strong>1. Integrating the Gemini 2.5 Flash TTS API</strong></h4>
<p>The core of the application relies on sending a highly specific prompt to the <code>gemini-2.5-flash-preview-tts</code> model. To ensure the model doesn't respond with conversational English or introductory text, the payload enforces strict parameters:</p>
<pre><code class="language-javascript">const payload = {
  contents: [{
    parts: [{
      text: `Read the following text strictly in Nepali language. Speak ${selectedSpeed}. Do not add any conversational remarks... Text to read: ${textToSpeak}`
    }]
  }],
  generationConfig: {
    responseModalities: ["AUDIO"],
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName: selectedVoice } }
    }
  }
};
</code></pre>
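<p>For context, this payload is POSTed to the Generative Language REST endpoint with the user-supplied key, and the Base64 audio comes back nested under <code>inlineData</code>. A minimal sketch (the endpoint path and response shape follow Google's public REST documentation for <code>generateContent</code>, but verify against the current docs before relying on them):</p>
<pre><code class="language-javascript">async function generateSpeech(payload, apiKey) {
  const model = "gemini-2.5-flash-preview-tts";
  const url = `https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent?key=${apiKey}`;
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload)
  });
  if (!res.ok) throw new Error(`Gemini API error: ${res.status}`);
  const data = await res.json();
  // the Base64-encoded PCM sits on the first candidate's first part
  return data.candidates[0].content.parts[0].inlineData.data;
}
</code></pre>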
<h4><strong>2. Handling the Audio Output (Base64 to WAV)</strong></h4>
<p>One of the primary technical challenges was handling the data returned by the Gemini API. The API returns the audio as a Base64 encoded string, which isn't immediately playable in a standard HTML <code>&lt;audio&gt;</code> tag or easily downloadable as a standard file format.</p>
<p>To solve this, I implemented a custom JavaScript function to decode the Base64 string into a <code>Uint8Array</code> of raw PCM data, and then manually constructed the WAV file headers (RIFF chunk, fmt sub-chunk, and data sub-chunk). This allowed the app to generate a valid <code>Blob</code> of type <code>audio/wav</code>, which is then mapped to a <code>Blob URL</code> for playback and downloading.</p>
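<p>The decode-and-wrap step can be sketched like this. A 24&nbsp;kHz, mono, 16-bit PCM format is assumed here; adjust the parameters if the API reports different audio settings:</p>
<pre><code class="language-javascript">// Decode a Base64 string into raw PCM bytes.
function base64ToPcmBytes(base64) {
  const binary = atob(base64); // Base64 to binary string
  return Uint8Array.from(binary, ch => ch.charCodeAt(0));
}

// Prepend a 44-byte RIFF/WAVE header to raw PCM data.
function pcmToWav(pcmBytes, sampleRate = 24000, numChannels = 1, bitsPerSample = 16) {
  const blockAlign = numChannels * bitsPerSample / 8;
  const byteRate = sampleRate * blockAlign;
  const buffer = new ArrayBuffer(44 + pcmBytes.length);
  const view = new DataView(buffer);
  const writeStr = (offset, str) =>
    [...str].forEach((ch, i) => view.setUint8(offset + i, ch.charCodeAt(0)));

  writeStr(0, "RIFF");                            // RIFF chunk descriptor
  view.setUint32(4, 36 + pcmBytes.length, true);  // remaining file size
  writeStr(8, "WAVE");
  writeStr(12, "fmt ");                           // fmt sub-chunk
  view.setUint32(16, 16, true);                   // fmt chunk size for PCM
  view.setUint16(20, 1, true);                    // audio format: 1 = linear PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeStr(36, "data");                           // data sub-chunk
  view.setUint32(40, pcmBytes.length, true);
  new Uint8Array(buffer, 44).set(pcmBytes);       // copy PCM samples after the header
  return buffer;
}
</code></pre>
<p>In the browser, the returned buffer is wrapped in <code>new Blob([buffer], { type: "audio/wav" })</code> and handed to <code>URL.createObjectURL</code> for playback and download.</p>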
<h4><strong>3. UI and State Management</strong></h4>
<p>Using Tailwind CSS via CDN allowed for rapid UI prototyping without a build step. The interface handles various asynchronous states—managing loading spinners during the API fetch, disabling buttons to prevent duplicate requests, and gracefully catching network errors using an exponential backoff strategy for the fetch requests.</p>
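<p>The retry logic can be sketched as a small wrapper around <code>fetch</code> (the names and the exact retry policy below are illustrative, not lifted from the project):</p>
<pre><code class="language-javascript">// Retry transient failures with exponentially growing delays.
async function fetchWithBackoff(url, options, maxRetries = 4, baseDelayMs = 500) {
  let lastError;
  for (let attempt = 0; attempt !== maxRetries; attempt += 1) {
    try {
      const res = await fetch(url, options);
      if (res.ok) return res;
      const retryable = res.status === 429 || res.status >= 500;
      if (!retryable) return res; // 4xx: let the caller surface the error
      lastError = new Error(`Gemini API returned ${res.status}`);
    } catch (err) {
      lastError = err; // network failure: retry
    }
    if (attempt + 1 !== maxRetries) {
      // wait 500 ms, 1 s, 2 s, ... before the next attempt
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
</code></pre>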
<h3><strong>Challenges &amp; Takeaways</strong></h3>
<ul>
<li><p><strong>Prompt Engineering for Audio:</strong> Generative AI models naturally want to be conversational. Forcing the model to act strictly as a raw TTS engine required precise prompt instructions to strip away unwanted conversational filler.</p>
</li>
<li><p><strong>Stateless Security:</strong> Keeping the application serverless meant finding a safe way to handle API keys. The "Bring Your Own Key" approach solved the security issue but required adding clear UX instructions so non-technical users could easily navigate Google AI Studio to get their own credentials.</p>
</li>
</ul>
<h3><strong>Future Scope</strong></h3>
<p>While this web version is a lightweight, highly accessible tool that runs on any device, the architecture also provides a strong foundation for native development. The next phase involves wrapping this core logic into a desktop application using a framework like Tauri, creating a more integrated native experience while maintaining the same high-quality audio output.</p>
]]></content:encoded></item></channel></rss>