emojiiii

Running Kokoro-82M ONNX TTS Model in the Browser

Advances in AI and machine learning have expanded what can be done in the browser. Running a text-to-speech (TTS) model directly in the browser improves privacy, cuts latency, and removes the need for a server. In this blog post, we'll explore how to run the Kokoro-82M ONNX TTS model in the browser using a JavaScript implementation. If you're curious, you can test this in action at my demo: Kitt AI Text-to-Speech.

Why Run TTS Models in the Browser?

Traditionally, TTS models run on servers: the client sends text over the network and receives synthesized speech back. With WebAssembly and WebGPU available in modern browsers, and libraries such as Transformers.js (built on ONNX Runtime Web) to drive them, you can now run models like Kokoro-82M ONNX directly in the browser. This brings several advantages:

  • Privacy: Your text data never leaves your device.
  • Low Latency: No network round trip to a server for each request.
  • Offline Access: Once the model files have been downloaded and cached, synthesis works without an active internet connection.

Overview of Kokoro-82M ONNX

Kokoro-82M is a TTS model with roughly 82 million parameters, as its name indicates, and the ONNX export is suited to on-device inference. It provides high-quality speech synthesis while staying small enough to download and run in a browser.
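
To make "small footprint" more concrete, here is a rough back-of-the-envelope sketch. It only multiplies the parameter count from the model name by an assumed number of bytes per weight. The function name and the precisions shown are illustrative assumptions, and real model files will differ somewhat because they also carry metadata.

```typescript
// Hypothetical helper: estimate the size of the weights alone, in megabytes (1 MB = 1,000,000 bytes).
function estimateWeightsMB(paramCount: number, bytesPerParam: number): number {
  return (paramCount * bytesPerParam) / 1_000_000;
}

// Parameter count taken from the "82M" in the model name.
const PARAM_COUNT = 82_000_000;

const fp32MB = estimateWeightsMB(PARAM_COUNT, 4); // 328 MB with 32-bit floats
const fp16MB = estimateWeightsMB(PARAM_COUNT, 2); // 164 MB with 16-bit floats
const int8MB = estimateWeightsMB(PARAM_COUNT, 1); // 82 MB with 8-bit weights
```

The takeaway is that the download size depends heavily on the weight precision you pick, which matters more in a browser than on a server.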

Setting Up the Project

Prerequisites

To run Kokoro-82M ONNX in the browser, you’ll need:

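  1. A modern browser with WebAssembly support (the loading code below uses the wasm device). WebGPU is optional.
  2. The Transformers.js library (@huggingface/transformers), which runs ONNX models in JavaScript on top of ONNX Runtime Web.
  3. The Kokoro.js script, which simplifies loading and processing the Kokoro-82M model.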
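
If you want to use WebGPU when it is available and fall back to WebAssembly otherwise, a minimal sketch could look like the following. The helper name pickDevice and the NavigatorLike type are hypothetical. The check only tests whether navigator.gpu exists, which does not guarantee that a usable GPU adapter can actually be obtained.

```typescript
// Device strings as used in the loading options ("wasm" appears in the loading code in this post).
type DeviceName = "webgpu" | "wasm";

// Hypothetical minimal view of the browser's navigator object.
interface NavigatorLike {
  gpu?: unknown; // defined only in browsers that expose the WebGPU API
}

// Hypothetical helper: prefer WebGPU when the API is exposed, otherwise use WebAssembly.
function pickDevice(nav: NavigatorLike | undefined): DeviceName {
  const hasWebGpu = nav !== undefined && nav.gpu !== undefined && nav.gpu !== null;
  return hasWebGpu ? "webgpu" : "wasm";
}

// In a browser you might call: pickDevice(navigator as NavigatorLike)
```

Taking the navigator object as a parameter keeps the helper easy to test outside a browser.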

Installation

You can set up the project by including the necessary dependencies in your package.json:

{
  "dependencies": {
    "@huggingface/transformers": "^3.3.1"
  }
}

Next, ensure you have the Kokoro.js script, which can be obtained from this repository.

Loading the Model

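To load the Kokoro-82M ONNX model and its tokenizer in your browser, await both from_pretrained calls:

// At the top of the module:
import { AutoTokenizer, StyleTextToSpeech2Model } from "@huggingface/transformers";

// Inside an async method of your class.
// from_pretrained is asynchronous and returns a Promise, so await the result before using it.
this.model_instance = await StyleTextToSpeech2Model.from_pretrained(
    this.modelId,
    {
        device: "wasm", // run on the WebAssembly backend
        progress_callback, // receives loading progress updates
    }
);
this.tokenizer = await AutoTokenizer.from_pretrained(this.modelId, {
    progress_callback,
});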
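
The loading code passes a progress_callback but does not show one. Below is a minimal sketch of what it could look like. The LoadProgress shape (a status string, an optional file name, and an optional percentage) is an assumption about the objects the library passes to the callback, and describeProgress is a hypothetical helper, so check the Transformers.js documentation before relying on these fields.

```typescript
// Assumed shape of a progress update; only the fields used below are listed.
interface LoadProgress {
  status: string; // assumed values include "progress" and "done"
  file?: string; // assumed: name of the file being fetched
  progress?: number; // assumed: percentage from 0 to 100
}

// Hypothetical helper: turn a progress update into a short status message.
function describeProgress(event: LoadProgress): string {
  const file = event.file ?? "model files";
  if (event.status === "progress" && typeof event.progress === "number") {
    return `Downloading ${file}: ${event.progress.toFixed(0)}%`;
  }
  if (event.status === "done") {
    return `Finished ${file}`;
  }
  return `${event.status}: ${file}`;
}

// Callback with the name used in the loading code; here it only logs the message.
const progress_callback = (event: LoadProgress): void => {
  console.log(describeProgress(event));
};
```

In a real page you would likely update a progress bar instead of logging, since the first model download can take a while.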

Running Inference

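Once the model and tokenizer are loaded, convert the text to phonemes, tokenize them, pick the style vector for the chosen voice, and run the model to generate speech:

// phonemize, getVoiceData, VOICES, STYLE_DIM, and SAMPLE_RATE are assumed to come from the Kokoro.js script.
// Tensor and RawAudio are exported by @huggingface/transformers.
const language = speakerId.at(0); // "a" or "b"
const phonemes = await phonemize(text, language);
const { input_ids } = await tokenizer(phonemes, { truncation: true });
const num_tokens = Math.max(
   input_ids.dims.at(-1) - 2, // Without padding
   0
);
// Select the STYLE_DIM-long slice of the voice data that corresponds to this token count.
const offset = num_tokens * STYLE_DIM;
const data = await getVoiceData(speakerId as keyof typeof VOICES);
const voiceData = data.slice(offset, offset + STYLE_DIM);
const inputs = {
   input_ids,
   style: new Tensor("float32", voiceData, [1, STYLE_DIM]),
   speed: new Tensor("float32", [speed], [1]),
};

// Run the model, then wrap the raw samples and convert them to a Blob.
const { waveform } = await model(inputs);
const audio = new RawAudio(waveform.data, SAMPLE_RATE).toBlob();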
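
The offset arithmetic in the inference code is easier to follow with toy numbers. The sketch below assumes, based only on how the code slices the data, that the voice data is a flat array holding one style vector per token count. The selectStyle helper and its bounds check are hypothetical additions and are not part of the original code.

```typescript
// Hypothetical helper mirroring the slicing above: return the style vector stored at row numTokens.
function selectStyle(voiceData: Float32Array, numTokens: number, styleDim: number): Float32Array {
  const offset = numTokens * styleDim;
  // Added guard (an assumption): fail loudly instead of returning a short slice.
  if (offset + styleDim > voiceData.length) {
    throw new RangeError(`No style vector for ${numTokens} tokens`);
  }
  return voiceData.slice(offset, offset + styleDim);
}

// Toy voice table: 3 rows of 4 values each, filled with 0..11.
const toy = new Float32Array(12).map((_, i) => i);

// Row 2 starts at offset 2 * 4 = 8, so this yields the values 8, 9, 10, 11.
const row = selectStyle(toy, 2, 4);
```

With the real constants, the same indexing picks one STYLE_DIM-long vector, which is then wrapped in a tensor of shape [1, STYLE_DIM].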

Demo

You can see this in action on my live demo: Kitt AI Text-to-Speech. The demo showcases real-time text-to-speech synthesis powered by Kokoro-82M ONNX.

Conclusion

Running a TTS model like Kokoro-82M ONNX in the browser makes privacy-preserving, low-latency speech synthesis practical. With just a few lines of JavaScript and Transformers.js, you can build high-quality, responsive TTS applications. Whether you’re building accessible tools, voice assistants, or interactive applications, in-browser TTS is worth considering.

Try the Kitt AI Text-to-Speech demo today and experience it for yourself!
