TL;DR: if you want to test this tutorial before we start, try it out here.
Hugging Face has created a major shift in the AI community. It fuels cutting-edge open source machine learning/AI models and datasets. The Hugging Face community is thriving with great ideas and innovations to the point where the possibilities seem endless.
Hugging Face is revolutionizing Natural Language Processing (NLP) with state-of-the-art solutions for tasks like translation, summarization, sentiment analysis, and contextual understanding. Its arsenal of pre-trained models makes it a robust platform for diverse NLP tasks, streamlining the integration of machine learning functionalities. Hugging Face simplifies the training, evaluation, and deployment of models with a user-friendly interface. The more I used Hugging Face in my own personal projects, the more I felt inspired to combine it with Babit Multimedia Framework (BMF).
If you're reading this and are not familiar with BMF, it's a cross-platform multimedia processing framework by ByteDance Open Source. Currently, BMF is used to process over 2 billion videos a day across multiple social media apps. Can this get complex? Yes, it sure can. However, in this article, I'll break it all down, so you know how to create unique experiences across any type of media platform.
Why BMF?
BMF stands out with its multilingual support, putting it ahead in the video processing game. BMF excels in various scenarios like video transcoding, editing, videography, and analysis. The integration of advanced technologies like Hugging Face with BMF is a game-changer for complex multimedia processing challenges.
Before we get started with the tutorial, let me share with you some ideas I envision coming to life with BMF + Hugging Face:
- Multimedia Content Analysis: Leveraging Hugging Face's NLP models, BMF can delve deep into textual data associated with multimedia content, like subtitles or comments, for richer insights.
- Accessibility: NLP models can automatically generate video captions, enhancing accessibility for the hard-of-hearing or deaf community.
- Content Categorization and Recommendation: These models can sort multimedia content based on textual descriptions, paving the way for sophisticated recommendation systems.
- Enhanced User Interaction: Sentiment analysis on user comments can offer valuable insights into user engagement and feedback for content improvement.
What now?
Open Source AI is creating the building blocks of the future. Generative AI impacts all industries, and this leads me to think about how generative AI can impact the future of broadcasting and video processing. I experimented with BMF and Hugging Face to create the building blocks for a broadcasting service that uses AI to create unique experiences for viewers. So, enough about the background, let's get it going!
What we'll build
Follow along as we build a video processing pipeline with BMF that uses the runwayml/stable-diffusion-v1-5 model to generate an image and overlay it on top of an encoded video. If that didn't make sense, don't worry, here's a picture for reference:
So why is this significant? The image of the panda is AI-generated, and combined with BMF, we can send it down a processing pipeline to place it on top of our video. Think about it! There could be a scenario where you're building a video broadcasting service and, during live streams, you'd like to quickly generate images from a simple prompt and display them to your audience. There could also be a scenario where you're using BMF to edit your videos and you'd like to add some AI-generated art. This tutorial is just one example. BMF combined with models created by the Hugging Face community opens up a whole new world of possibilities.
Let's Get Started
Prerequisites:
- A GPU (I'm using a Google Colab A100 GPU. You can also use a V100 or T4 GPU, but they will just run a bit slower)
- Install BabitMF-GPU
- Python 3.9-3.10 (strictly required to work with BMF)
- FFmpeg
You can find all the BMF installation docs here. The docs will highlight more system requirements if you decide to run things locally.
Getting Started
Begin by ensuring that essential toolkits like the Hugging Face libraries (transformers and diffusers) and BMF are installed in your Python environment. We'll install everything with pip in the Initial Setup below.
Initial Setup
- First, we'll clone the following repository to get the video we want to process. (If you're coding along and want to use your own video, create your own repo and add a video file, preferably a short one, so you can clone it just like I did. You can also just save the video to the directory you're coding in.)
git clone https://github.com/Joshalphonse/Bmf-Huggingface.git
- Install BabitMF-GPU to accelerate your video processing pipeline with BMF's GPU capabilities
pip install BabitMF-GPU
- Install the following dependencies
pip install requests diffusers transformers torch accelerate scipy safetensors moviepy Pillow tqdm numpy modelscope==1.4.2 open_clip_torch pytorch-lightning
- Install FFmpeg. The BMF framework uses FFmpeg's video decoders and encoders as built-in modules for video decoding and encoding, so you'll need a supported FFmpeg version installed before using BMF.
sudo apt install ffmpeg
dpkg -l | grep -i ffmpeg
ffmpeg -version
The package below is installed to show the BMF C++ logs in the Colab console; otherwise, only Python logs are printed. This step isn't necessary if you're not in a Colab or IPython notebook environment.
pip install wurlitzer
%load_ext wurlitzer
- Add the directory of the GitHub repository we cloned to Python's module search path. We'll need this path later on.
import sys
sys.path.insert(0, '/content/Bmf-Huggingface')
print(sys.path)
Creating the Module
Now it's time for the fun part. We'll create a module to process the video. Here's the module I created, and I'll break it down for you below.
import bmf
from bmf import bmf_sync, Packet
from bmf import SubGraph
from diffusers import StableDiffusionPipeline
import torch

model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a photo of a panda eating waffles"
image = pipe(prompt).images[0]
image.save("panda_photo.png")


class video_overlay(SubGraph):
    def create_graph(self, option=None):
        # create source stream
        self.inputs.append('source')
        source_stream = self.graph.input_stream('source')

        # create overlay stream
        overlay_streams = []
        for (i, _) in enumerate(option['overlays']):
            self.inputs.append('overlay_' + str(i))
            overlay_streams.append(self.graph.input_stream('overlay_' + str(i)))

        # pre-processing for source layer
        info = option['source']
        output_stream = (
            source_stream.scale(info['width'], info['height'])
            .trim(start=info['start'], duration=info['duration'])
            .setpts('PTS-STARTPTS')
        )

        # overlay processing
        for (i, overlay_stream) in enumerate(overlay_streams):
            overlay_info = option['overlays'][i]

            # overlay layer pre-processing
            p_overlay_stream = (
                overlay_stream.scale(overlay_info['width'], overlay_info['height'])
                .loop(loop=overlay_info['loop'], size=10000)
                .setpts('PTS+%f/TB' % (overlay_info['start']))
            )

            # calculate overlay parameter
            x = 'if(between(t,%f,%f),%s,NAN)' % (overlay_info['start'],
                                                 overlay_info['start'] + overlay_info['duration'],
                                                 str(overlay_info['pox_x']))
            y = 'if(between(t,%f,%f),%s,NAN)' % (overlay_info['start'],
                                                 overlay_info['start'] + overlay_info['duration'],
                                                 str(overlay_info['pox_y']))
            if overlay_info['loop'] == -1:
                repeat_last = 0
                shortest = 1
            else:
                repeat_last = overlay_info['repeat_last']
                shortest = 1

            # do overlay
            output_stream = (
                output_stream.overlay(p_overlay_stream, x=x, y=y,
                                      repeatlast=repeat_last)
            )

        # finish creating graph
        self.output_streams = self.finish_create_graph([output_stream])
Code Breakdown:
Importing Required Modules:
import bmf
from bmf import bmf_sync, Packet
from bmf import SubGraph
from diffusers import StableDiffusionPipeline
import torch
- `bmf` and its components are imported to harness the functionalities of the Babit Multimedia Framework for video processing tasks.
- `SubGraph` is a class in BMF used to create a customizable processing node.
- `StableDiffusionPipeline` is imported from the `diffusers` library and allows the generation of images from text prompts.
- `torch` is the PyTorch library used for machine learning applications, which Stable Diffusion relies on.
Configuring the Stable Diffusion Model:
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
- The Stable Diffusion model is loaded with the specified `model_id`.
- The `torch_dtype` parameter ensures the model uses lower (half) precision to reduce memory usage.
- `.to("cuda")` moves the model to the GPU for faster computation if CUDA is available. (A sketch of a CPU fallback follows below.)
Generating an Image Using Stable Diffusion:
prompt = "a photo of a panda eating waffles"
image = pipe(prompt).images[0]
image.save("panda_photo.png")
- We then set a text prompt to generate an image of "a photo of a panda eating waffles".
- The image is created and saved to "panda_photo.png". (An optional tweak of the generation settings is sketched below.)
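If you want more control over the generated image, the pipeline call accepts a few optional knobs. `num_inference_steps`, `guidance_scale`, and `generator` are standard `StableDiffusionPipeline` arguments; the values below are just illustrative:

```python
# Reproducible generation with a fixed seed and explicit sampling settings
generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    "a photo of a panda eating waffles",
    num_inference_steps=30,  # fewer steps -> faster, potentially lower quality
    guidance_scale=7.5,      # how strongly the image should follow the prompt
    generator=generator,
).images[0]
image.save("panda_photo.png")
```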
Defining a Custom BMF SubGraph for Video Overlay:
class video_overlay(SubGraph):
- The `video_overlay` class is derived from `SubGraph`. This class will define a custom graph for video overlay operations.
Creating the Graph:
def create_graph(self, option=None):
- The `create_graph` method is where the actual graph (workflow) for the video and overlays is constructed.
Processing Source and Overlay Streams:
self.inputs.append('source')
source_stream = self.graph.input_stream('source')
overlay_streams = []
- Registers input streams for the source and prepares a list of overlay input streams.
Scaling and Trimming Source Video:
info = option['source']
output_stream = (
source_stream.scale(info['width'], info['height']).trim(start=info['start'], duration=info['duration']).setpts('PTS-STARTPTS'))
- The source video is scaled and trimmed according to the specified `option`, and its timestamps are adjusted for timeline placement.
Scaling and Looping Overlay Streams:
p_overlay_stream = (
overlay_stream.scale(overlay_info['width'], overlay_info['height']).loop(loop=overlay_info['loop'], size=10000).setpts('PTS+%f/TB' % (overlay_info['start'])))
- Each overlay is scaled and looped as needed, providing a dynamic and flexible overlay process.
Overlaying on the Source Stream:
output_stream = (
output_stream.overlay(p_overlay_stream, x=x, y=y,
repeatlast=repeat_last))
- Overlays are added to the source stream at the calculated position and with the proper configuration. This allows multiple overlays to exist within the same timeframe without conflicts. (A concrete example of the position expression follows below.)
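To make the "calculated position" concrete, here's a tiny standalone sketch of what the `x` expression built in `create_graph` evaluates to for an overlay running from 0s to 10s at `pox_x = 0` (the values are hypothetical):

```python
# Same string template used in video_overlay.create_graph
start, duration, pos_x = 0.0, 10.0, 0
x = 'if(between(t,%f,%f),%s,NAN)' % (start, start + duration, str(pos_x))
print(x)  # if(between(t,0.000000,10.000000),0,NAN)
```

The expression tells the underlying FFmpeg overlay filter to place the image at that x coordinate only while `t` is inside the time window, and to return NAN (hiding the overlay) otherwise.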
Finalizing the Graph:
self.output_streams = self.finish_create_graph([output_stream])
- Final output streams are set, which concludes the creation of the graph. Now, after this, it's time for us to encode the video and display it how we want to.
Applying Hugging Face Model
Let's add our image as an overlay to the video file. Here's the full pipeline, and below it we'll break down each section of the code to explain how it works.
input_video_path = "/content/Bmf-Huggingface/black_and_white.mp4"
logo_path = "/content/panda_photo.png"
output_path = "./complex_edit.mp4"
dump_graph = 0
duration = 10
overlay_option = {
    "dump_graph": dump_graph,
    "source": {
        "start": 0,
        "duration": duration,
        "width": 1280,
        "height": 720
    },
    "overlays": [
        {
            "start": 0,
            "duration": duration,
            "width": 300,
            "height": 200,
            "pox_x": 0,
            "pox_y": 0,
            "loop": 0,
            "repeat_last": 1
        }
    ]
}

my_graph = bmf.graph({
    "dump_graph": dump_graph
})

logo_1 = my_graph.decode({'input_path': logo_path})['video']
video1 = my_graph.decode({'input_path': input_video_path})

overlay_streams = list()
overlay_streams.append(bmf.module([video1['video'], logo_1], 'video_overlay', overlay_option, entry='__main__.video_overlay')[0])

bmf.encode(
    overlay_streams[0],
    video1['audio'],
    {"output_path": output_path}
).run()
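Before we walk through this code, here's an optional sanity check you can run in Colab. It assumes moviepy 1.x (which exposes `moviepy.editor`), one of the dependencies we installed earlier, and simply confirms the output file was written and reports its duration and resolution:

```python
from moviepy.editor import VideoFileClip

clip = VideoFileClip("./complex_edit.mp4")
print(f"duration: {clip.duration:.1f}s, size: {clip.size}")  # length in seconds and [width, height]
clip.close()
```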
Let's break this down too
Defining Paths and Options:
input_video_path = "/content/Bmf-Huggingface/black_and_white.mp4"
logo_path = "/content/panda_photo.png"
output_path = "./complex_edit.mp4"
dump_graph = 0
duration = 10
- `input_video_path`: Specifies the file path to the input video.
- `logo_path`: File path to the image (logo) you want to overlay on the video.
- `output_path`: The file path where the edited video will be saved.
- `dump_graph`: A debugging option in BMF that can be set to `1` to visualize the graph; it is set to `0` here, meaning no graph will be dumped.
- `duration`: The duration in seconds for the overlay to be visible in the video.
Overlay Configuration:
overlay_option = {
    "dump_graph": dump_graph,
    "source": {
        "start": 0,
        "duration": duration,
        "width": 1280,
        "height": 720
    },
    "overlays": [
        {
            "start": 0,
            "duration": duration,
            "width": 300,
            "height": 200,
            "pox_x": 0,
            "pox_y": 0,
            "loop": 0,
            "repeat_last": 1
        }
    ]
}
- `overlay_option`: A dictionary that defines the settings for the source video and the overlay.
- For the source, you specify the width and height to scale the video to, and when the overlay should start and end.
- For the overlays, detailed options such as position, size, and behavior (like `loop` and `repeat_last`) are defined. (A sketch of adding a second overlay entry follows below.)
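Because the `video_overlay` subgraph loops over `option['overlays']` and registers one input stream per entry, you can layer multiple overlays by adding entries here, as long as you also pass a matching image stream to `bmf.module` later. Here's a sketch with hypothetical values for a second overlay, shown for five seconds starting at the 5-second mark, in the top-right corner of the 1280-pixel-wide frame:

```python
# Hypothetical second overlay entry; remember to feed bmf.module a second image stream for it.
overlay_option["overlays"].append({
    "start": 5,            # appears 5 seconds in
    "duration": 5,         # stays visible for 5 seconds
    "width": 300,
    "height": 200,
    "pox_x": 1280 - 300,   # x offset: flush with the right edge of the 1280px frame
    "pox_y": 0,            # y offset: top of the frame
    "loop": 0,
    "repeat_last": 1
})
```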
Creating a BMF Graph:
my_graph = bmf.graph({
    "dump_graph": dump_graph
})
- `my_graph` is an instance of a BMF graph, which sets up the processing graph (pipeline), with `dump_graph` passed as an option.
Decoding the Logo and Video Streams:
logo_1 = my_graph.decode({'input_path': logo_path})['video']
video1 = my_graph.decode({'input_path': input_video_path})
- The video and logo are loaded and decoded to be processed. This decoding extracts the video streams to be used in subsequent steps.
Creating Overlay Streams:
overlay_streams = list()
overlay_streams.append(bmf.module([video1['video'], logo_1], 'video_overlay', overlay_option, entry='__main__.video_overlay')[0])
- An empty list `overlay_streams` is created to hold the video layers.
- The `bmf.module` function is used to create an overlay module, where the source video and logo are processed using the `video_overlay` class defined previously, with the corresponding options. (The sketch below shows how the `entry` argument would change if the class lived in its own file.)
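A side note on `entry='__main__.video_overlay'`: it points BMF at the Python module and class implementing the custom module, and `__main__` is used here because `video_overlay` is defined directly in the notebook. As a hedged sketch (the file name `my_overlays.py` is hypothetical, and entry resolution details can vary by BMF version), if the class lived in its own file importable from `sys.path`, the call might look like this instead:

```python
# Hypothetical: video_overlay defined in my_overlays.py, which is importable from sys.path
overlay_streams.append(
    bmf.module([video1['video'], logo_1], 'video_overlay',
               overlay_option, entry='my_overlays.video_overlay')[0]
)
```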
Encoding the Final Output:
bmf.encode(
    overlay_streams[0],
    video1['audio'],
    {"output_path": output_path}
).run()
- The final video stream, with the overlay applied, and the original audio from the input video are encoded together into a new output file specified by `output_path`. (A sketch of passing extra encoder options follows below.)
- The `.run()` method is called to execute the encoding process.
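If you want more control over the output encoding, `bmf.encode` also accepts encoder settings. The `video_params` keys below follow the examples in the BMF documentation, but treat this as a sketch and double-check the exact option names against the docs for your BMF version:

```python
# Same pipeline, with explicit H.264 encoder settings (sketch; verify keys against BMF docs)
bmf.encode(
    overlay_streams[0],
    video1['audio'],
    {
        "output_path": output_path,
        "video_params": {
            "codec": "h264",
            "crf": 23,            # quality/size trade-off
            "preset": "veryfast"  # encoding speed preset
        }
    }
).run()
```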
Our final output should look something like this:
That's it! We've explored a practical example of using the Babit Multimedia Framework (BMF) for a video editing task that uses AI to create an image we can overlay on a video. Now you know how to set up a BMF graph, decode the input streams, create overlay modules, and finally encode the edited video with the overlay in place. In the future, I'll consider adding more AI models, like one to improve the resolution, or even a model that creates a video from text. Through the power of BMF and Hugging Face open source models, you can create complex video editing workflows with overlays that dynamically change over time, offering vast creative possibilities.
Try it out on CoLab and tell us what you think:
https://colab.research.google.com/drive/1eQxiZc2vZeyOggMoFle_b0xnblupbiXd?usp=sharing