Siu Pang Tommy Choi

YouTube Karaoke - SvelteKit app powered by AssemblyAI

This is a submission for the AssemblyAI Challenge: Sophisticated Speech-to-Text.

What I Built

A web app that analyses a song on YouTube and then displays the lyrics in a Karaoke style.

Demo

Live demo: https://assemblyai-challenge-202411.manychois.site/
(Unfortunately, YouTube has blocked my server, so real-time transcription does not work on the live demo. You can still pick one of the pre-built examples to see how it works, or run the app on your local machine.)

Source code: https://github.com/manychois/assemblyai-challenge-202411

The Idea

When I read the AssemblyAI API documentation, one of its features caught my eye: word-level timestamps. In what situation would I need a timestamp for each spoken word? Subtitles. Simply displaying the subtitle text would be a bit dull, so I twisted the idea into transcribing a song and displaying the lyrics in a Karaoke way. Let's get our hands dirty!

Implementation Journey

The app needs to do 3 things:

  1. Download the YouTube video.
  2. Use AssemblyAI to convert the audio into a transcript.
  3. Roll the lyrics along with the YouTube video.

The transcription part

The first two points are quite easy to implement, thanks to the cookbook provided by AssemblyAI. I had not come across yt-dlp before, and it turns out to be a great command-line tool for downloading YouTube videos and converting them into various formats.

A nice tip from the cookbook:

"m4a" is the format with the best audio version.

Since I am not an active Python developer, I picked the TypeScript library instead and wrote a simple function to invoke yt-dlp:

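import { execFile } from 'node:child_process';

// Downloads the audio track of a YouTube video as an m4a file via yt-dlp.
function downloadYouTube(videoId: string, url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const tempFilePath = `/tmp/youtube-${videoId}.m4a`;
    // execFile runs yt-dlp without a shell, so the URL cannot inject commands.
    const args = ['-o', tempFilePath, '-x', '--audio-format', 'm4a', '--audio-quality', '8', url];
    execFile('yt-dlp', args, (error) => {
      if (error) {
        console.error(`exec error: ${error}`);
        reject(error);
        return; // do not fall through to resolve() after a failure
      }
      resolve(tempFilePath);
    });
  });
}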
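
Because videoId ends up in a file path, it may be worth validating it before calling downloadYouTube. The sketch below is not taken from the app: extractVideoId is a hypothetical helper, the list of host names is illustrative, and the 11-character pattern is an assumption based on the IDs YouTube uses in practice rather than a documented guarantee.

```typescript
// Hypothetical helper: derive the video ID from common YouTube URL shapes.
// Assumption: IDs are 11 characters from [A-Za-z0-9_-].
function extractVideoId(input: string): string | null {
  let parsed: URL;
  try {
    parsed = new URL(input);
  } catch {
    return null; // not a valid absolute URL
  }
  const host = parsed.hostname.replace(/^www\./, '');
  let candidate: string | null = null;
  if (host === 'youtu.be') {
    // Short links carry the ID as the path: https://youtu.be/<id>
    candidate = parsed.pathname.slice(1);
  } else if (host === 'youtube.com' || host === 'm.youtube.com' || host === 'music.youtube.com') {
    // Watch pages carry the ID in the "v" query parameter.
    candidate = parsed.searchParams.get('v');
  }
  return candidate !== null && /^[A-Za-z0-9_-]{11}$/.test(candidate) ? candidate : null;
}
```

Anything that does not match returns null, so the caller can reject the request before a temp file name is built from it.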

Calling the AssemblyAI API is extremely easy. I don't even need to worry about uploading the local file; the library does it all! Here is the function that wraps the service:

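import { AssemblyAI, type Transcript, type TranscribeParams } from 'assemblyai';

// ASSEMBLYAI_API_KEY is assumed to be loaded from the server-side environment.
async function transcribe(language: string, file: string): Promise<Transcript> {
  const client = new AssemblyAI({ apiKey: ASSEMBLYAI_API_KEY });
  // A local file path is accepted here; the library uploads it before transcribing.
  const apiParams: TranscribeParams = { audio: file, language_code: language };
  const transcript = await client.transcripts.transcribe(apiParams);
  return transcript;
}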
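
The transcript contains a list of words, each with a text, a start and an end timestamp in milliseconds. To display them as lyrics they have to be split into lines somehow. The sketch below shows one possible rule and is not the app's actual logic: groupIntoLines, maxGapMs and maxWords are hypothetical names and values chosen for illustration.

```typescript
// Minimal shape of a word entry; start and end are in milliseconds.
interface TimedWord {
  text: string;
  start: number;
  end: number;
}

// Hypothetical grouping rule: start a new line when the pause before a word
// exceeds maxGapMs, or when the current line already holds maxWords words.
function groupIntoLines(words: TimedWord[], maxGapMs = 700, maxWords = 8): TimedWord[][] {
  const lines: TimedWord[][] = [];
  let current: TimedWord[] = [];
  for (const word of words) {
    const previous = current[current.length - 1];
    const longPause = previous !== undefined && word.start - previous.end > maxGapMs;
    if (current.length > 0 && (longPause || current.length >= maxWords)) {
      lines.push(current);
      current = [];
    }
    current.push(word);
  }
  if (current.length > 0) lines.push(current);
  return lines;
}
```

Each resulting line is an array of words, which is the shape an {#each line as { text, start, end }} loop can iterate over.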

OK, now I need a web app framework to piece things together. Svelte 5 was released recently, so this is a good first exercise for trying out the latest SvelteKit.

The UI part

After some research, I was glad to find that the official YouTube IFrame Player API lets you control the player and tells you where the video is up to. Below is simplified code showing how I link the current playback time to the reactive state currentTime:

let currentTime = $state(0); // in milliseconds
let player = new window.YT.Player('player', { ... });
player.playVideo();
let wordHighlighter = setInterval(() => {
  currentTime = player.getCurrentTime() * 1000;
}, 100);

Then I let Svelte do its magic. When it is time to highlight a word, the style class start is applied.

{#each line as { text, start, end }}
  {@const duration = Math.round((end - start) / 100) * 100}
  <span class="word" class:start={start <= currentTime}
    data-text={text}
    data-duration={duration}>{text}</span>
{/each}

The Karaoke style trick is like this (in SCSS syntax):

.word {
  display: inline-block;
  position: relative;
  font-size: 1.5rem;
  color: #777;
  white-space: nowrap;
  margin-right: 0.5em;

  &.start {
    &::after {
      content: attr(data-text);
      position: absolute;
      left: 0;
      top: 0;
      color: #00f;
      overflow: hidden;
      animation: run-text 2s 1 linear;
      width: 100%;
    }

    @for $i from 1 through 20 {
      &[data-duration='#{$i * 100}']::after {
        animation: run-text #{$i * 100}ms 1 linear;
      }
    }
  }
}

@keyframes run-text {
  from { width: 0; }
  to { width: 100%; }
}

As you can see, the data attribute data-text is used to create the highlighted overlay text. I also tried animation: run-text attr(data-duration ms) 1 linear; to assign the duration dynamically, but the browser does not support attr() there. So I have to round the duration to the nearest 100 ms and generate a rule for each value from 100 ms to 2000 ms.
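
One detail to watch: the rounding in the template can produce 0 for very short words, or more than 2000 for long held notes, and neither value matches a generated rule, so those words fall back to the default 2s animation. A possible refinement is to clamp the value to the range the SCSS loop covers. This is a sketch, and bucketDuration, stepMs and maxSteps are hypothetical names that mirror the 100 ms step and the 20 iterations above.

```typescript
// Rounds a word's duration to the nearest step and clamps it to the range
// covered by the generated rules (1 to maxSteps steps, i.e. 100 ms to 2000 ms).
function bucketDuration(start: number, end: number, stepMs = 100, maxSteps = 20): number {
  const steps = Math.round((end - start) / stepMs);
  const clamped = Math.min(Math.max(steps, 1), maxSteps);
  return clamped * stepMs;
}
```

With this helper, a 40 ms word maps to 100 and a 5-second note maps to 2000, so every word matches one of the data-duration selectors.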

Finally, the scroll-along effect:

let lyricsObserver = new IntersectionObserver(
  (entries) => {
    entries.forEach((entry) => {
      if (!entry.isIntersecting) {
        const word = entry.target as HTMLElement;
        word.scrollIntoView({ behavior: 'smooth', block: 'center' });
      }
      lyricsObserver!.unobserve(entry.target);
    });
  },
  {
    root: document.querySelector('.lyrics'),
    threshold: 1,
    rootMargin: '0px 0px -30% 0px'
  }
);
let lastHighlighted: null | Element = null;
let scrollChecker = setInterval(() => {
  const highlighted = document.querySelectorAll('.word.start');
  if (highlighted.length > 0) {
    const last = highlighted[highlighted.length - 1];
    if (lastHighlighted !== last) {
      lastHighlighted = last;
      lyricsObserver!.observe(last);
    }
  }
}, 100);

That is quite a lot of code, but the general idea is:

  1. Every 100 ms, find the last .word.start element. That is where the lyrics are highlighted up to.
  2. Hand that element to our IntersectionObserver for inspection.
  3. If it is not within the top 70% of the visible region, we scroll it up to the middle.
  4. Unobserve the element afterwards to keep the performance cost low.
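
The decision in step 3 can also be written as plain geometry, which may make the rootMargin of -30% easier to picture. This is a simplified model of the rule as described, not a reimplementation of IntersectionObserver, and Box, needsScroll and bottomInsetRatio are hypothetical names.

```typescript
// Plain rectangle with the two fields used below (same names as in DOMRect).
interface Box {
  top: number;
  bottom: number;
}

// Returns true when the word is not fully inside the top
// (1 - bottomInsetRatio) of the container, i.e. when it should be scrolled.
function needsScroll(word: Box, container: Box, bottomInsetRatio = 0.3): boolean {
  const height = container.bottom - container.top;
  const visibleBottom = container.bottom - height * bottomInsetRatio;
  const fullyVisible = word.top >= container.top && word.bottom <= visibleBottom;
  return !fullyVisible;
}
```

For a container 400 px tall, the cut-off sits at 280 px, so a word whose bottom edge is at 300 px would trigger a scroll even though it is still on screen.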

The Result

Screenshot - Lyrics scrolling along with the video

I am happy with the result. Backed by Svelte, the lyrics move accordingly if you pause the music or jump to any point in the song.
