Yuki Shindo

Front-End Only: Real-Time AI Stream Commentary with React, OBS Virtual Camera, and GPT-4o-mini

Implementing an “AI that sees the streaming screen and auto-comments” using React, OBS Virtual Camera, and ChatGPT (gpt-4o-mini)

AITuber OnAir

I recently released AITuber OnAir, a web application that lets you set up an AITuber streaming environment using only your browser.

It’s currently in a public beta stage, but it already provides all the essential functions for streaming:

AITuber OnAir

This article will explain a feature within this AITuber OnAir web application that I developed:

Using the “OBS Virtual Camera” to grab a screenshot of the stream, sending it to ChatGPT’s vision-capable model (gpt-4o-mini), and having the AI avatar generate a comment.

And we’ll do all this without a server-side component—just the front end.

What We’ll Cover

  • How to make everything work on the front end only, with no server needed
  • How to grab regular screenshots from OBS Virtual Camera in a React app
  • How to send images + text to OpenAI (Vision-capable model) and receive AI commentary via streaming

From there, you can implement your own custom logic for how the AI responds. I’ve prepared sample code specifically for this article, which is very close to the production code in AITuber OnAir. For example, it includes functionality to split up AI responses for TTS processing.

Implementing this lets you create a fun experience where your AI avatar on YouTube Live can automatically comment on what’s happening on-screen. And because it’s front end–only, it’s surprisingly easy to set up.

In this article, we’ll focus on the React-specific technical points while sharing sample code.


What Is AITuber OnAir?

AITuber OnAir Demo

First, a quick introduction to AITuber OnAir, the web app I’ve been developing.

AITuber OnAir is a web application designed so you can set up an AITuber streaming environment entirely from your browser. It has the following features:

  • Integrates with OBS to stream a VRM avatar (a VTuber model) that runs in your web browser
  • Automatically retrieves and replies to YouTube Live comments
  • Provides AI chat features via OpenAI API, along with integration with multiple TTS (text-to-speech) engines such as VOICEVOX / VOICEPEAK / AivisSpeech / OpenAI TTS / Niji Voice
  • Lets you handle all your streaming setup via just Chrome, with no need to build a dedicated server (although for voice engines other than OpenAI TTS, you do need to run them locally)

Because it’s designed to be front end–only, AITuber OnAir uses clever approaches like storing user-uploaded VRM files in OPFS (origin private file system) for persistent storage in the browser.
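As an aside, persisting files in OPFS is fairly straightforward. Here is a minimal sketch, assuming a fixed file name (avatar.vrm is illustrative, not the name AITuber OnAir actually uses):

// Sketch: persisting an uploaded VRM file to OPFS.
// The file name 'avatar.vrm' is illustrative only.
async function saveVrmToOpfs(file: File): Promise<void> {
  // Root directory of the origin private file system
  const root = await navigator.storage.getDirectory();
  const handle = await root.getFileHandle('avatar.vrm', { create: true });
  const writable = await handle.createWritable();
  await writable.write(file);
  await writable.close();
}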

For details on how to use AITuber OnAir, see the article below (in Japanese):

https://note.com/aituberonair/n/n39b30eb3eb5b

The specific feature we’ll be covering—“Have the AI comment by analyzing the streaming screen”—is extracted from the core functionality in AITuber OnAir and presented here as a sample. If you find it interesting, definitely check out AITuber OnAir.

Note: The source code for AITuber OnAir is not open source. This is partly due to potential business considerations in the future. However, development is based on Pixiv’s ChatVRM. If you’d like to understand the bigger picture, reading the ChatVRM code is recommended. And I owe a lot to the early pioneers in the AITuber scene who have shared their knowledge. In the same spirit, I hope that by sharing these insights, it will help those building similar AI/VTuber-related projects.


Overall Flow of the Feature

Let’s get into the details. Here’s the overall flow:

  1. Turn on the OBS Virtual Camera

    • In OBS, enable the virtual camera.
  2. Access the virtual camera from the browser

    • When you call navigator.mediaDevices.getUserMedia({ video: true }), you’d expect Chrome to display a dialog allowing you to select either your built-in camera or the OBS Virtual Camera. In practice, it may not bring up a selection prompt at all; it often just uses whichever camera is set in Chrome’s permissions.
    • Make sure OBS Virtual Camera is selected so you can get the live feed from OBS in a <video> element (see the device-selection sketch after this list).
  3. Run the <video> element in the background without displaying it

    • For example, set <video style={{ opacity: 0 }}> so users don’t actually see it.
    • If you set display: none;, screenshot capture might fail, so opacity: 0 is a better option.
  4. Use a Canvas to capture screenshots at intervals

    • Draw the <video> element into a Canvas using drawImage(), then call toDataURL() to get a Base64-encoded image.
  5. Send the image + text to ChatGPT (gpt-4o-mini)

    • In the request to OpenAI’s Vision-capable model, include both text and the image.
  6. Receive and display/voice (TTS) the AI-generated comment in real time via streaming

    • In the sample code, we split the AI response by sentence to handle text display and text-to-speech.

By putting all these steps together, you can have your AI avatar automatically comment on what’s happening in your stream—very cool! And because you can do it all on the front end, it’s quite convenient.


How to Enable the OBS Virtual Camera

Turning on the OBS Virtual Camera

First, simply enable the virtual camera in OBS. Depending on your system, you may have to install a driver in the process.

Once this is running, when you access your web page, Chrome will ask for permission to use your camera. You need to allow it.

Camera Permission 1

Camera Permission 2

Camera Permission 3

Also, in Chrome’s settings, you must specify the OBS Virtual Camera. Once you’ve done that, Chrome will share your OBS output—i.e., your streaming screen—through the camera interface.

If you overlook this setting and the physical camera remains selected, you might accidentally send your real face to the OpenAI API instead of the stream, so be careful!

Selecting the OBS Virtual Camera in Chrome’s camera settings

The screenshots above are from AITuber OnAir’s interface.

The rest of the code examples assume that all of the above is set up correctly.


Sample Implementation Code

1. Retrieving the OBS Feed in a <video> Element

// ObsVideo.tsx
import React, { useEffect, useRef, FC } from 'react';

type ObsVideoProps = {
  onVideoReady?: (videoEl: HTMLVideoElement | null) => void;
};

const ObsVideo: FC<ObsVideoProps> = ({ onVideoReady }) => {
  const videoRef = useRef<HTMLVideoElement>(null);

  useEffect(() => {
    // 1. Request access to the OBS Virtual Camera
    navigator.mediaDevices
      .getUserMedia({ video: true })
      .then((stream) => {
        if (videoRef.current) {
          videoRef.current.srcObject = stream;

          // Auto playback is handled by the autoPlay prop
          if (onVideoReady) {
            onVideoReady(videoRef.current);
          }
        }
      })
      .catch((error) => {
        console.error('Error accessing OBS virtual camera:', error);
        if (onVideoReady) {
          onVideoReady(null);
        }
      });
  }, [onVideoReady]);

  return (
    // Not displayed, just running in the background
    <video
      ref={videoRef}
      autoPlay
      muted
      playsInline
      style={{
        position: 'absolute',
        top: 500,
        width: '100px',
        height: '80px',
        opacity: 0,
        zIndex: 9999,
      }}
      id="obsVideo"
    />
  );
};

export default ObsVideo;

What This Component Does:

  • Fetches OBS’s feed into React: This component grabs your OBS feed via the virtual camera and assigns it to a <video> element in your React app. This allows you to capture screenshots for AI, or preview it in the browser if you wish.
  • Encapsulates the getUserMedia call: By wrapping navigator.mediaDevices.getUserMedia into a single component, you keep your main logic cleaner.
  • Hidden playback: The <video style={{ opacity: 0 }}> is used so you can capture frames without visually displaying them. Setting display: none often breaks screenshot functionality, so opacity: 0 is a common solution.
  • onVideoReady callback: The parent component can receive the <video> element reference, making it possible to call captureScreenshot() or other methods from outside.

Here’s an example of how to handle onVideoReady in the parent:

  // Ref in the parent that will hold the <video> element once it's ready
  const videoEl = useRef<HTMLVideoElement | null>(null);

  /**
   * Callback invoked by ObsVideo once the <video> element is ready
   */
  const handleVideoReady = useCallback((videoElem: HTMLVideoElement | null) => {
    // Store the <video> element in a ref
    videoEl.current = videoElem;
  }, []);

2. The captureScreenshot() Function

Next is the function to capture a screenshot from the videoEl.

function captureScreenshot(videoEl: HTMLVideoElement) {
  const canvas = document.createElement('canvas');
  canvas.width = videoEl.videoWidth;
  canvas.height = videoEl.videoHeight;

  const ctx = canvas.getContext('2d');
  if (!ctx) return null;

  ctx.drawImage(videoEl, 0, 0, canvas.width, canvas.height);
  return canvas.toDataURL('image/jpeg'); // Base64 format
}

Key Points:

  • Using Canvas to get the current frame
    • We draw the <video> element onto a Canvas with drawImage(), then call toDataURL() to get a Base64-encoded still image.
    • This allows you to capture the actual rendered frame at any moment from the OBS Virtual Camera feed.
  • Adjusting size
    • We set canvas.width and canvas.height to videoEl.videoWidth / videoEl.videoHeight so the captured image matches the video’s native resolution. You can downscale if you want to reduce file size or network usage (see the sketch after this list).
  • Base64 for API calls
    • The returned Base64 string can be sent directly to an OpenAI vision-capable model.
    • The default format may be image/png, but you can specify image/jpeg or other formats as needed.
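If you do want to downscale before sending, here is a minimal variant of captureScreenshot(). The 640px cap and the JPEG quality of 0.7 are arbitrary starting points, not values taken from AITuber OnAir:

// Sketch: downscaled screenshot to reduce token and bandwidth cost.
// maxWidth and the JPEG quality are illustrative values.
function captureScreenshotScaled(
  videoEl: HTMLVideoElement,
  maxWidth = 640,
): string | null {
  const scale = Math.min(1, maxWidth / videoEl.videoWidth);
  const canvas = document.createElement('canvas');
  canvas.width = Math.round(videoEl.videoWidth * scale);
  canvas.height = Math.round(videoEl.videoHeight * scale);

  const ctx = canvas.getContext('2d');
  if (!ctx) return null;

  // drawImage scales the current frame down to the smaller canvas
  ctx.drawImage(videoEl, 0, 0, canvas.width, canvas.height);
  // The second argument of toDataURL is the JPEG quality (0 to 1)
  return canvas.toDataURL('image/jpeg', 0.7);
}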

3. Sending Images to the OpenAI Vision Model

Below is an example of how we send both text and images to OpenAI’s vision model:

/**
 * Sends "image (Base64) + text" to ChatGPT's vision-capable model (gpt-4o-mini)
 * and receives responses via streaming.
 */
export async function getChatResponseStreamWithVision(
  messages: VisionMessage[],
  apiKey: string,
) {
  if (!apiKey) throw new Error('Invalid API Key');

  const headers: Record<string, string> = {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${apiKey}`,
  };

  const requestBody = {
    model: 'gpt-4o-mini',
    messages,
    stream: true,
    max_tokens: 300,
  };

  const res = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers,
    body: JSON.stringify(requestBody),
  });

  if (res.status !== 200 || !res.body) {
    throw new Error('Something went wrong');
  }

  const reader = res.body.getReader();

  // Convert the streaming response into a ReadableStream
  const stream = new ReadableStream({
    async start(controller) {
      const decoder = new TextDecoder('utf-8');

      try {
        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          if (!value) continue;

          // Decode with { stream: true } so multi-byte characters split across
          // chunks are not garbled
          const data = decoder.decode(value, { stream: true });
          // Each SSE line starts with "data: ..."
          const lines = data.split('\n');
          for (const line of lines) {
            const trimmed = line.trim();
            if (!trimmed || trimmed === 'data:' || trimmed === 'data: [DONE]') {
              continue;
            }
            if (trimmed.startsWith('data: ')) {
              const jsonStr = trimmed.substring(6);
              try {
                const json = JSON.parse(jsonStr);
                if (json.choices?.[0]?.delta?.content) {
                  controller.enqueue(json.choices[0].delta.content);
                }
              } catch (err) {
                console.error('Parse error chunk:', trimmed);
              }
            }
          }
        }
        // Close only on normal completion (calling close() after error() would throw)
        controller.close();
      } catch (error) {
        controller.error(error);
      } finally {
        reader.releaseLock();
      }
    },
  });

  return stream;
}

Key Points:

  • Using Chat Completion API with stream: true
    • We call POST https://api.openai.com/v1/chat/completions and enable streaming to receive responses in chunks.
    • This is useful when you want to process long responses as they come in.
  • Vision message structure
    • In the messages array, we include image_url in the final user content, so the model can analyze the image alongside the text (a minimal type sketch follows these points).
  • Streaming response handling
    • The API returns multiple lines like data: {JSON}.... We split them and parse each JSON chunk to extract json.choices[0].delta.content.
  • Rebuilding a ReadableStream
    • We then re-package the text chunks into a standard ReadableStream so that we can process them easily in the frontend (e.g., in React).
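The VisionMessage type isn’t shown above. A minimal shape that matches the request format used in this article looks like this (an assumption for the sample; the exact types in AITuber OnAir may differ):

// Minimal message types matching the vision-capable request format.
// These are an assumption for this article, not AITuber OnAir's exact types.
type VisionContentPart =
  | { type: 'text'; text: string }
  | {
      type: 'image_url';
      image_url: { url: string; detail?: 'low' | 'high' | 'auto' };
    };

type VisionMessage = {
  role: 'system' | 'user' | 'assistant';
  content: string | VisionContentPart[];
};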

4. Main Logic for Generating AI Comments

Below is an example of how you might incorporate everything into a single function. For instance, you might call this via setInterval every minute.

/**
 * Assume this is called periodically (e.g., every minute via setInterval)
 */
const handleAutoVisionComment = useCallback(async () => {
  if (!openAiKey) {
    console.warn('No API key provided');
    return;
  }

  // Capture screenshot from the <video> element stored in the ref
  if (!videoEl.current) {
    console.warn('No video element available');
    return;
  }
  const dataUrl = captureScreenshot(videoEl.current);
  if (!dataUrl) {
    console.warn('Failed to capture screenshot');
    return;
  }

  const base64Image = dataUrl.split(',')[1] || '';

  // Existing chat history (chatLog) plus a system prompt for Vision
  const messages: Message[] = [
    { role: 'system', content: visionSystemPrompt },
    ...chatLog,
  ];

  // Vision-capable message format
  const messagesForVision: VisionMessage[] = [
    ...messages,
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'Take a look at the streaming screen and comment on it.',
        },
        {
          type: 'image_url',
          image_url: {
            url: `data:image/jpeg;base64,${base64Image}`,
            detail: 'low',
          },
        },
      ],
    },
  ];

  // Query OpenAI
  const stream = await getChatResponseStreamWithVision(messagesForVision, openAiKey).catch(
    (err) => {
      console.error(err);
      return null;
    }
  );
  if (!stream) {
    return;
  }

  // Read the response in streaming fashion
  const reader = stream.getReader();
  let receivedMessage = '';
  const sentences: string[] = [];

  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      // Accumulate content
      receivedMessage += value;

      // For example, split by punctuation (like Japanese sentences)
      const sentenceMatch = receivedMessage.match(/^(.+[。.!?\n]|.{10,}[、,])/);
      if (sentenceMatch && sentenceMatch[0]) {
        const sentence = sentenceMatch[0];
        sentences.push(sentence);
        receivedMessage = receivedMessage.slice(sentence.length).trimStart();

        // Here, you could update the UI or feed it to TTS
      }
    }
  } catch (e) {
    console.error(e);
  } finally {
    reader.releaseLock();
  }

  // In this example, we are not appending the "vision check" logs back into chat
}, [openAiKey, chatLog, visionSystemPrompt, videoEl]);

Key Points:

  • Scheduled screenshots + API calls
    • By running this function periodically (e.g., every 30 seconds to a minute), you can have your AI regularly check what’s on the screen and generate comments (a scheduling sketch appears below).
  • Extracting the Base64 portion
    • captureScreenshot(videoEl) → returns a dataUrl, and dataUrl.split(',')[1] gives you just the Base64 data.
  • Building the vision-enabled message
    • Putting { type: 'image_url', image_url: { … } } in the content tells the Vision model to analyze that image.
  • Incremental text processing
    • We accumulate incoming chunks (receivedMessage += value;) and then match them against punctuation or line breaks to process them sentence by sentence. This is great for displaying partial results or hooking into TTS.

Putting it all together, you get a front end–only solution for having the AI watch your OBS stream and respond to it.
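For completeness, here is a minimal sketch of the periodic trigger mentioned above, placed in the same component as handleAutoVisionComment. The 60-second interval and the autoVisionEnabled flag are illustrative choices, not AITuber OnAir’s actual implementation:

// Sketch: call handleAutoVisionComment on a fixed interval.
// The 60s period and the autoVisionEnabled flag are illustrative.
useEffect(() => {
  if (!autoVisionEnabled) return;

  const timerId = setInterval(() => {
    handleAutoVisionComment();
  }, 60_000);

  // Stop the timer when the flag flips or the component unmounts
  return () => clearInterval(timerId);
}, [autoVisionEnabled, handleAutoVisionComment]);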


Implementation Notes

  1. Security / Rate Limits

    • Embedding your OpenAI API key in the browser is risky if the page is publicly accessible; the key could be misused by others.
    • AITuber OnAir requires each user to bring their own API key, which is stored locally (e.g., in localStorage; see the sketch after this list).
  2. OBS Virtual Camera Configuration

    • Ensure you specifically choose “OBS Virtual Camera” in Chrome’s camera settings.
    • If you accidentally choose your physical camera, you’ll be sending your actual webcam feed to OpenAI instead of your streamed screen.
  3. Image Size / Network Usage

    • Sending large screenshots (like full HD) at high frequency can be costly in both tokens and bandwidth.
    • Consider downscaling or compressing the screenshot to reduce load.
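Regarding note 1, a minimal sketch of keeping a user-supplied key on the client might look like this (the storage key name is illustrative):

// Sketch: store the user-supplied OpenAI API key locally in the browser.
// The storage key name is illustrative.
const OPENAI_KEY_STORAGE = 'openai_api_key';

function saveApiKey(key: string): void {
  localStorage.setItem(OPENAI_KEY_STORAGE, key);
}

function loadApiKey(): string {
  return localStorage.getItem(OPENAI_KEY_STORAGE) ?? '';
}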

Conclusion

  • By combining OBS Virtual Camera + React + OpenAI Vision, you can build an “AI that analyzes your streaming screen and auto-comments.”
  • A serverless setup (frontend only) is enough to make it work, making development and operation straightforward.
  • This opens up creative possibilities for streaming, such as the AI avatar giving live commentary on on-screen changes, not just reacting to chat.

You Can Do Quite a Lot in the Browser Alone

As mentioned, I use this technique in AITuber OnAir, where:

  • YouTube Live comments can be retrieved and read aloud
  • The AI can do automated talk segments
  • VRM avatars reflect expressions
  • The application can even analyze the streaming screen

All without a dedicated server—just the browser.

If you’d like to try AITuber streaming, see real-world usage of ChatGPT and gpt-4o-mini, or explore serverless solutions, I hope this article helps.

That’s it—happy hacking!
