Davkharbayar

Posted on Nov 22, 2024

Generate Subtitles for Audio and Video Easily with AssemblyAI Speech-to-Text

#devchallenge #assemblyaichallenge #ai #api

This is a submission for the AssemblyAI Challenge: Sophisticated Speech-to-Text.

What I Built

Subtitles improve accessibility, engagement, and global reach for video and audio. As a content creator and developer, I found writing them by hand slow and frustrating, so I built a web app that automates the process: it transcribes uploaded audio or video with AssemblyAI, generates an SRT file, and uses FFMPEG to produce a subtitled video.

Demo

Live Demo

Code

Home screen

Creating subtitles from an MP3 file

Creating an SRT file

Journey: Incorporating Universal-2, AssemblyAI's Speech-to-Text Model into My Application

Starting Point: The Problem

Manually creating subtitles for audio and video files was a tedious and time-consuming task. It required listening to recordings, transcribing speech into text, and carefully syncing subtitles with audio. For long or complex recordings, this process was not only error-prone but also impractical.

I envisioned building an automated solution that could handle this entire workflow seamlessly. The goals were ambitious yet practical:

Key Objectives

1. Accurately Transcribe Speech into Text
Leverage AI to precisely convert spoken words into text, even in noisy or multi-speaker environments.

2. Generate Subtitles in Popular Formats like SRT
Ensure compatibility with platforms like YouTube, social media, and video editing software.

3. Create Subtitled Videos Using FFMPEG
Integrate the subtitles directly into video files, saving users the hassle of separate configurations.

4. Add Subtitles with a Background for Audio Files
For users with audio-only content, generate a video with subtitles displayed on a beautiful, customizable background.

5. Enhance Content with Thumbnail Images and Animated WebP Files
Utilize FFMPEG to create visually engaging thumbnail images and lightweight, animated WebP files for promotional use.

The Solution
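To achieve these goals, I combined AssemblyAI's Universal-2 Speech-to-Text model with the media-processing capabilities of FFMPEG. AssemblyAI transcribes the uploaded file and returns timestamped sentences, which are written out as an SRT file; FFMPEG then burns those subtitles into the video or, for audio-only uploads, renders them over a background. The result is a fast, flexible workflow suited to content creators, educators, and businesses.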
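The transcription half of this workflow can be sketched against AssemblyAI's REST API as shown here. This is a minimal sketch, not the project's actual code: it assumes the audio is already reachable at a public URL, that the caller supplies the API key, and that Node.js 18+ is used so `fetch` is available globally. The names `transcribeToSentences`, `readJson`, `fetchImpl`, and `pollMs` are illustrative only.

```javascript
// Minimal sketch: submit a transcription job, poll until it finishes,
// then fetch sentence-level segments.
// Assumptions (not from the original post): the audio lives at a public URL,
// and `fetchImpl` defaults to the global fetch shipped with Node.js 18+.
const API_BASE = "https://api.assemblyai.com/v2";

// Parse a JSON response, failing loudly on non-2xx status codes.
async function readJson(res) {
  if (!res.ok) throw new Error(`AssemblyAI request failed with status ${res.status}`);
  return res.json();
}

async function transcribeToSentences(audioUrl, apiKey, options = {}) {
  const { fetchImpl = fetch, pollMs = 3000 } = options;
  const headers = { authorization: apiKey, "content-type": "application/json" };

  // 1. Submit the job; speaker_labels enables speaker diarization.
  const created = await readJson(
    await fetchImpl(`${API_BASE}/transcript`, {
      method: "POST",
      headers,
      body: JSON.stringify({ audio_url: audioUrl, speaker_labels: true }),
    })
  );

  // 2. Poll the transcript until it is completed or has failed.
  for (;;) {
    const transcript = await readJson(
      await fetchImpl(`${API_BASE}/transcript/${created.id}`, { headers })
    );
    if (transcript.status === "completed") break;
    if (transcript.status === "error") throw new Error(transcript.error);
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }

  // 3. Fetch sentences; each has `text` plus `start` and `end` in milliseconds.
  const result = await readJson(
    await fetchImpl(`${API_BASE}/transcript/${created.id}/sentences`, { headers })
  );
  return result.sentences;
}
```

Passing `fetchImpl` explicitly is a design choice made for this sketch so the function can be exercised with a stub instead of the live API. Local files would first need to be uploaded to obtain a URL.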

Tools Used
Here’s an overview of the tools and technologies that powered this project:

1. AssemblyAI
Role: Core transcription engine.
Features Used:

Transcription API: Converts audio and video into text with high accuracy, providing timestamps, speaker diarization, and punctuation.
Sentence API: Extracts transcription data at the sentence level, making it easier to format and sync subtitles.
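To illustrate why sentence-level data makes subtitle formatting straightforward, here is a minimal sketch that turns sentence objects into SRT text. It assumes each sentence has the shape `{ text, start, end }` with times in milliseconds, which is how AssemblyAI reports sentence timestamps. The helper names `toSrtTime` and `sentencesToSrt` are hypothetical and not taken from the project.

```javascript
// Format a millisecond offset as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(ms) {
  const pad = (n, width = 2) => String(n).padStart(width, "0");
  const total = Math.max(0, Math.round(ms));
  const hours = Math.floor(total / 3600000);
  const minutes = Math.floor((total % 3600000) / 60000);
  const seconds = Math.floor((total % 60000) / 1000);
  const millis = total % 1000;
  return `${pad(hours)}:${pad(minutes)}:${pad(seconds)},${pad(millis, 3)}`;
}

// Turn sentence objects ({ text, start, end }, times in ms) into SRT text.
// Each cue is: index, time range, text, with a blank line between cues.
function sentencesToSrt(sentences) {
  return sentences
    .map((s, i) => `${i + 1}\n${toSrtTime(s.start)} --> ${toSrtTime(s.end)}\n${s.text.trim()}\n`)
    .join("\n");
}

// Example usage with made-up sentences:
const exampleSrt = sentencesToSrt([
  { text: "Hello there.", start: 0, end: 1500 },
  { text: "Welcome to the demo.", start: 1500, end: 4250 },
]);
console.log(exampleSrt);
```

One cue per sentence keeps the sketch simple. Long sentences may need to be split into shorter cues to stay readable on screen.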

2. Node.js (Express and EJS Engine)
Role: Backend server and template engine.
Features Used:

Express.js: Built the API endpoints for handling user requests, file uploads, and processing workflows.
EJS Template Engine: Rendered dynamic web pages for the user interface, allowing seamless file uploads and result display.

3. FFMPEG
Role: Media processing and editing.
Features Used:

Subtitled Videos: Burned SRT subtitles directly into video files.
Audio-to-Video Conversion: Added subtitles to audio files by generating a video with a custom background.
Thumbnail Generation: Captured still images from videos for thumbnails.
Animated WebP Files: Created lightweight animations for social media or marketing.
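The four FFMPEG features above could be driven from Node.js roughly as follows. This is a sketch under stated assumptions, not the project's actual commands: the background is assumed to be a solid color, the frame size, frame rate, and clip length are placeholder values, and the helper names are hypothetical. The `subtitles` filter also requires an FFMPEG build with libass.

```javascript
const { spawn } = require("node:child_process");

// Each helper returns an argument array for spawn("ffmpeg", args).
// Note: paths with special characters (colons, quotes, backslashes)
// need extra escaping inside the subtitles filter; not handled here.

// Burn an SRT file into a video (re-encodes video, copies audio).
function burnSubtitlesArgs(video, srt, output) {
  return ["-y", "-i", video, "-vf", `subtitles=${srt}`, "-c:a", "copy", output];
}

// Render audio over a solid-color background with subtitles burned in.
// The color, size, and 25 fps frame rate are assumptions for this sketch.
function audioToVideoArgs(audio, srt, output, color = "black", size = "1280x720") {
  return [
    "-y",
    "-f", "lavfi", "-i", `color=c=${color}:s=${size}:r=25`,
    "-i", audio,
    "-vf", `subtitles=${srt}`,
    "-c:v", "libx264", "-pix_fmt", "yuv420p", "-c:a", "aac",
    "-shortest", // stop when the audio ends, since the color source is endless
    output,
  ];
}

// Grab a single frame as a thumbnail image.
function thumbnailArgs(video, output, atSeconds = 1) {
  return ["-y", "-ss", String(atSeconds), "-i", video, "-frames:v", "1", output];
}

// Make a short, silent, endlessly looping animated WebP preview.
function animatedWebpArgs(video, output, seconds = 3) {
  return [
    "-y", "-t", String(seconds), "-i", video,
    "-vf", "fps=10,scale=320:-1", "-loop", "0", "-an", output,
  ];
}

// Run ffmpeg and resolve when it exits successfully.
function runFfmpeg(args) {
  return new Promise((resolve, reject) => {
    const proc = spawn("ffmpeg", args, { stdio: "inherit" });
    proc.on("error", reject);
    proc.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`ffmpeg exited with code ${code}`))
    );
  });
}

// Example (file names are placeholders):
// runFfmpeg(burnSubtitlesArgs("talk.mp4", "talk.srt", "talk-subtitled.mp4"));
```

Building argument arrays instead of shell strings avoids shell quoting problems with user-supplied file names.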

Final Thoughts
Building an automated subtitle generator using AssemblyAI and FFMPEG was both an exciting challenge and a rewarding journey. By integrating state-of-the-art speech-to-text technology with powerful media processing tools, I was able to create a solution that simplifies subtitle creation, enhances accessibility, and delivers professional results effortlessly.

Key Takeaways
The Power of AI: AssemblyAI’s Universal-2 model proved to be a game-changer, offering high accuracy and advanced features like speaker diarization and timestamping.
Automation Matters: Automating tedious tasks like transcription and subtitle generation saves time and eliminates errors, making life easier for content creators and professionals.
FFMPEG’s Versatility: Whether it’s burning subtitles into videos, adding visuals to audio, or creating animated media, FFMPEG brought flexibility and polish to the project.
