DEV Community

Davkharbayar


Generate Subtitles for Audio and Video Easily with AssemblyAI Speech-to-Text

This is a submission for the AssemblyAI Challenge: Sophisticated Speech-to-Text.

What I Built

Subtitles improve accessibility, engagement, and global reach for video and audio content. As a content creator and developer, I found generating subtitles by hand slow and tedious, so I built an automated tool that transcribes speech with AssemblyAI and produces subtitle files and subtitled videos with FFmpeg.

Demo

Live Demo

Code

Screenshot: Home screen

Screenshot: Creating subtitles from an MP3 file

Screenshot: Creating the SRT file

Journey: Incorporating Universal-2, AssemblyAI's Speech-to-Text Model into My Application

Starting Point: The Problem

Manually creating subtitles for audio and video files was tedious and time-consuming. It required listening to recordings, transcribing speech into text, and carefully syncing each subtitle with the audio. For long or complex recordings, the process was both error-prone and impractical.

I set out to build an automated solution that handles this entire workflow end to end. The goals were ambitious yet practical:

Key Objectives

1. Accurately Transcribe Speech into Text
Leverage AI to precisely convert spoken words into text, even in noisy or multi-speaker environments.

2. Generate Subtitles in Popular Formats like SRT
Ensure compatibility with platforms like YouTube, social media, and video editing software.

3. Create Subtitled Videos Using FFmpeg
Burn the subtitles directly into video files, saving users the hassle of configuring a separate subtitle track.

4. Add Subtitles with a Background for Audio Files
For users with audio-only content, generate a video with subtitles displayed on a beautiful, customizable background.

5. Enhance Content with Thumbnail Images and Animated WebP Files
Use FFmpeg to create thumbnail images and lightweight, animated WebP files for promotional use.
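Objectives 1 and 2 meet at the point where timed words become an SRT file. The following is a minimal sketch of that step, not the code used in this project. It assumes each transcribed word carries its text plus start and end times in milliseconds, which is how AssemblyAI reports word timings; the `Word` class, `words_to_srt`, and the 42-character cue limit are hypothetical names and values chosen for illustration. The AssemblyAI Python SDK can also export SRT directly through `export_subtitles_srt()`, so a hand-rolled formatter like this is mainly useful when you want control over how cues are split.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One transcribed word (hypothetical container for illustration)."""
    text: str
    start: int  # start time in milliseconds
    end: int    # end time in milliseconds


def format_timestamp(ms: int) -> str:
    """Convert milliseconds to the SRT timestamp form HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def words_to_srt(words: List[Word], max_chars: int = 42) -> str:
    """Group words into cues of at most max_chars and render them as SRT."""
    cues: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        candidate = " ".join(w.text for w in current + [word])
        # Start a new cue once adding this word would exceed the limit.
        if current and len(candidate) > max_chars:
            cues.append(current)
            current = [word]
        else:
            current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w.text for w in cue)
        start = format_timestamp(cue[0].start)
        end = format_timestamp(cue[-1].end)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)
```

Each cue starts at its first word's start time and ends at its last word's end time, so the subtitles stay in sync with the audio without any manual timing.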

The Solution
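To achieve these goals, I combined AssemblyAI's Universal-2 Speech-to-Text model with the media-processing capabilities of FFmpeg. Universal-2 handles transcription and timing, while FFmpeg handles rendering: burning subtitles into video, pairing audio with a background, and producing thumbnails and animated WebP files. The resulting workflow is aimed at content creators, educators, and businesses alike.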
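The FFmpeg side of this workflow can be sketched as a handful of command builders, one per media objective. This is an illustrative sketch rather than the project's actual code: the function names, file names, and default values are assumptions, and it presumes an `ffmpeg` binary on the PATH built with libass (for the `subtitles` filter), libx264, and libwebp.

```python
import subprocess
from typing import List


def burn_subtitles_cmd(video: str, srt: str, output: str) -> List[str]:
    """Render an SRT file into the video frames; audio is copied unchanged."""
    # Note: paths containing ':' or quotes need escaping inside the filter.
    return ["ffmpeg", "-y", "-i", video,
            "-vf", f"subtitles={srt}", "-c:a", "copy", output]


def audio_with_background_cmd(audio: str, image: str, srt: str,
                              output: str) -> List[str]:
    """Turn audio plus a still image into a subtitled video."""
    # The scale step rounds dimensions down to even numbers, which
    # libx264 with yuv420p requires.
    filters = f"scale=trunc(iw/2)*2:trunc(ih/2)*2,subtitles={srt}"
    return ["ffmpeg", "-y", "-loop", "1", "-i", image, "-i", audio,
            "-vf", filters, "-c:v", "libx264", "-tune", "stillimage",
            "-pix_fmt", "yuv420p", "-c:a", "aac", "-shortest", output]


def thumbnail_cmd(video: str, output: str, at_seconds: float = 1.0) -> List[str]:
    """Grab a single frame at the given offset as a thumbnail image."""
    return ["ffmpeg", "-y", "-ss", str(at_seconds), "-i", video,
            "-frames:v", "1", output]


def animated_webp_cmd(video: str, output: str, start: float = 0.0,
                      duration: float = 3.0, fps: int = 10,
                      width: int = 320) -> List[str]:
    """Export a short, silent, endlessly looping animated WebP preview."""
    return ["ffmpeg", "-y", "-ss", str(start), "-t", str(duration),
            "-i", video, "-vf", f"fps={fps},scale={width}:-1",
            "-loop", "0", "-an", output]


def run(cmd: List[str]) -> None:
    """Execute a command, raising CalledProcessError if FFmpeg fails."""
    subprocess.run(cmd, check=True)
```

For example, `run(burn_subtitles_cmd("talk.mp4", "talk.srt", "talk_subtitled.mp4"))` would produce a subtitled copy of a hypothetical `talk.mp4`. Building each command as a list of arguments, instead of a single shell string, avoids shell-quoting problems with file names that contain spaces.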

