TL;DR: if you want to test this tutorial before we start, try it out here.
Hugging Face has created a major shift in the AI community. It fuels cutting-edge open source machine learning/AI models and datasets. The Hugging Face community is thriving with great ideas and innovations to the point where the possibilities seem endless.
Hugging Face is revolutionizing Natural Language Processing (NLP) with state-of-the-art solutions for tasks like translation, summarization, sentiment analysis, and contextual understanding. Its arsenal of pre-trained models makes it a robust platform for diverse NLP tasks, streamlining the integration of machine learning functionalities. Hugging Face simplifies the training, evaluation, and deployment of models with a user-friendly interface. The more I used Hugging Face in my own personal projects, the more I felt inspired to combine it with Babit Multimedia Framework (BMF).
If you're reading this and are not familiar with BMF, it's a cross-platform multimedia processing framework by ByteDance Open Source. Currently, BMF is used to process over 2 billion videos a day across multiple social media apps. Can this get complex? Yes, it sure can. However, in this article, I'll break it all down, so you know how to create unique experiences across any type of media platform.
Why BMF?
BMF stands out with its multilingual support, putting it ahead in the video processing game. BMF excels in various scenarios like video transcoding, editing, videography, and analysis. The integration of advanced technologies like Hugging Face with BMF is a game-changer for complex multimedia processing challenges.
Before we get started with the tutorial, let me share with you some ideas I envision coming to life with BMF + Hugging Face:
- Multimedia Content Analysis: Leveraging Hugging Face's NLP models, BMF can delve deep into textual data associated with multimedia content, like subtitles or comments, for richer insights.
- Accessibility: NLP models can automatically generate video captions, enhancing accessibility for the hard-of-hearing or deaf community.
- Content Categorization and Recommendation: These models can sort multimedia content based on textual descriptions, paving the way for sophisticated recommendation systems.
- Enhanced User Interaction: Sentiment analysis on user comments can offer valuable insights into user engagement and feedback for content improvement.
What now?
Open Source AI is creating the building blocks of the future. Generative AI impacts all industries, and this leads me to think about how generative AI can impact the future of broadcasting and video processing. I experimented with BMF and Hugging Face to create the building blocks for a broadcasting service that uses AI to create unique experiences for viewers. So, enough about the background, let's get it going!
What we'll build
Follow along as we build a video processing pipeline with BMF that uses the runwayml/stable-diffusion-v1-5 model to generate an image and overlay it on top of an encoded video. If that didn't make sense, don't worry, here's a picture for reference:
So why is this significant? The image of the panda is AI-generated, and combined with BMF, we can send it down a processing pipeline to place it on top of our video. Think about it! There could be a scenario where you're building a video broadcasting service and, during live streams, you'd like to quickly generate images from a simple prompt and display them to your audience. There could also be a scenario where you're using BMF to edit your videos and you'd like to add some AI-generated art. This tutorial is just one example. BMF combined with models created by the Hugging Face community opens up a whole new world of possibilities.
Let's Get Started
Prerequisites:
- A GPU (I'm using a Google Colab A100 GPU. You can also use a V100 or T4 GPU, but they will just run a bit slower)
- Install BabitMF-GPU
- Python 3.9-3.10 (strictly required to work with BMF)
- FFmpeg
You can find all the BMF installation docs here. The docs will highlight more system requirements if you decide to run things locally.
Getting Started
Begin by ensuring that essential toolkits like the Hugging Face libraries (transformers and diffusers) and BMF are installed in your Python environment. We'll install everything with pip in the Initial Setup below.
Initial Setup
- First, we'll clone the following repository to get the video we want to process. (If you're coding along and want to use your own video, create your own repo and add a video file, preferably a short one, so you can clone it just like I did. You can also just save the video to the directory you're coding in.)
git clone https://github.com/Joshalphonse/Bmf-Huggingface.git
- Install BabitMF-GPU to accelerate your video processing pipeline with BMF's GPU capabilities
pip install BabitMF-GPU
- Install the following dependencies
pip install requests diffusers transformers torch accelerate scipy safetensors moviepy Pillow tqdm numpy modelscope==1.4.2 open_clip_torch pytorch-lightning
- Install FFmpeg. The BMF framework uses FFmpeg's video decoders and encoders as built-in modules for video decoding and encoding, so you'll need a supported FFmpeg version installed before using BMF.
sudo apt install ffmpeg
dpkg -l | grep -i ffmpeg
ffmpeg -version
The package below is installed to show the BMF C++ logs in the Colab console; otherwise, only Python logs are printed. This step isn't necessary if you're not in a Colab or IPython notebook environment.
pip install wurlitzer
%load_ext wurlitzer
- Add the directory of the GitHub repository we cloned to Python's module search path. We'll need this path later on.
import sys
sys.path.insert(0, '/content/Bmf-Huggingface')
print(sys.path)
Creating the Module
Now it's time for the fun part. We'll create a module to process the video. Here's the module I created, and I'll break it down for you below.
import bmf
from bmf import bmf_sync, Packet
from bmf import SubGraph
from diffusers import StableDiffusionPipeline
import torch

model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a photo of a panda eating waffles"
image = pipe(prompt).images[0]
image.save("panda_photo.png")


class video_overlay(SubGraph):
    def create_graph(self, option=None):
        # create source stream
        self.inputs.append('source')
        source_stream = self.graph.input_stream('source')

        # create overlay stream
        overlay_streams = []
        for (i, _) in enumerate(option['overlays']):
            self.inputs.append('overlay_' + str(i))
            overlay_streams.append(self.graph.input_stream('overlay_' + str(i)))

        # pre-processing for source layer
        info = option['source']
        output_stream = (
            source_stream.scale(info['width'], info['height'])
            .trim(start=info['start'], duration=info['duration'])
            .setpts('PTS-STARTPTS')
        )

        # overlay processing
        for (i, overlay_stream) in enumerate(overlay_streams):
            overlay_info = option['overlays'][i]

            # overlay layer pre-processing
            p_overlay_stream = (
                overlay_stream.scale(overlay_info['width'], overlay_info['height'])
                .loop(loop=overlay_info['loop'], size=10000)
                .setpts('PTS+%f/TB' % (overlay_info['start']))
            )

            # calculate overlay parameter
            x = 'if(between(t,%f,%f),%s,NAN)' % (overlay_info['start'],
                                                 overlay_info['start'] + overlay_info['duration'],
                                                 str(overlay_info['pox_x']))
            y = 'if(between(t,%f,%f),%s,NAN)' % (overlay_info['start'],
                                                 overlay_info['start'] + overlay_info['duration'],
                                                 str(overlay_info['pox_y']))
            if overlay_info['loop'] == -1:
                repeat_last = 0
                shortest = 1
            else:
                repeat_last = overlay_info['repeat_last']
                shortest = 1

            # do overlay
            output_stream = (
                output_stream.overlay(p_overlay_stream, x=x, y=y,
                                      repeatlast=repeat_last)
            )

        # finish creating graph
        self.output_streams = self.finish_create_graph([output_stream])
Code Breakdown:
Importing Required Modules:
import bmf
from bmf import bmf_sync, Packet
from bmf import SubGraph
from diffusers import StableDiffusionPipeline
import torch
- `bmf` and its components are imported to harness the functionalities of the Babit Multimedia Framework for video processing tasks.
- `SubGraph` is a class in BMF used to create a customizable processing node.
- `StableDiffusionPipeline` is imported from the `diffusers` library and allows the generation of images from text prompts.
- `torch` is the PyTorch library used for machine learning applications, which Stable Diffusion relies on.
Configuring the Stable Diffusion Model:
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
- The Stable Diffusion model is loaded with the specified `model_id`.
- The `torch_dtype` parameter ensures the model uses lower (half) precision to reduce memory usage.
- `.to("cuda")` moves the model to the GPU for faster computation if CUDA is available. (A sketch of a CPU fallback follows below.)
Generating an Image Using Stable Diffusion:
prompt = "a photo of a panda eating waffles"
image = pipe(prompt).images[0]
image.save("panda_photo.png")
- We then set a text prompt to generate an image of "a photo of a panda eating waffles".
- The image is created and saved to "panda_photo.png". (An optional tweak of the generation settings is sketched below.)
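If you want more control over the generated image, the pipeline call accepts a few optional knobs. `num_inference_steps`, `guidance_scale`, and `generator` are standard `StableDiffusionPipeline` arguments; the values below are just illustrative:

```python
# Reproducible generation with a fixed seed and explicit sampling settings
generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    "a photo of a panda eating waffles",
    num_inference_steps=30,  # fewer steps -> faster, potentially lower quality
    guidance_scale=7.5,      # how strongly the image should follow the prompt
    generator=generator,
).images[0]
image.save("panda_photo.png")
```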
Defining a Custom BMF SubGraph for Video Overlay:
class video_overlay(SubGraph):
- The `video_overlay` class is derived from `SubGraph`. This class will define a custom graph for video overlay operations.
Creating the Graph:
def create_graph(self, option=None):
- The `create_graph` method is where the actual graph (workflow) for the video and overlays is constructed.
Processing Source and Overlay Streams:
self.inputs.append('source')
source_stream = self.graph.input_stream('source')
overlay_streams = []
- Registers input streams for the source and prepares a list of overlay input streams.
Scaling and Trimming Source Video:
info = option['source']
output_stream = (
source_stream.scale(info['width'], info['height']).trim(start=info['start'], duration=info['duration']).setpts('PTS-STARTPTS'))
- The source video is scaled and trimmed according to the specified `option`, and its timestamps are adjusted for timeline placement.
Scaling and Looping Overlay Streams:
p_overlay_stream = (
overlay_stream.scale(overlay_info['width'], overlay_info['height']).loop(loop=overlay_info['loop'], size=10000).setpts('PTS+%f/TB' % (overlay_info['start'])))
- Each overlay is scaled and looped as needed, providing a dynamic and flexible overlay process.
Overlaying on the Source Stream:
output_stream = (
output_stream.overlay(p_overlay_stream, x=x, y=y,
repeatlast=repeat_last))
- Overlays are added to the source stream at the calculated position and with the proper configuration. This allows multiple overlays to exist within the same timeframe without conflicts. (A concrete example of the position expression follows below.)
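To make the "calculated position" concrete, here's a tiny standalone sketch of what the `x` expression built in `create_graph` evaluates to for an overlay running from 0s to 10s at `pox_x = 0` (the values are hypothetical):

```python
# Same string template used in video_overlay.create_graph
start, duration, pos_x = 0.0, 10.0, 0
x = 'if(between(t,%f,%f),%s,NAN)' % (start, start + duration, str(pos_x))
print(x)  # if(between(t,0.000000,10.000000),0,NAN)
```

The expression tells the underlying FFmpeg overlay filter to place the image at that x coordinate only while `t` is inside the time window, and to return NAN (hiding the overlay) otherwise.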
Finalizing the Graph:
self.output_streams = self.finish_create_graph([output_stream])
- Final output streams are set, which concludes the creation of the graph. Now, after this, it's time for us to encode the video and display it how we want to.
Applying Hugging Face Model
Let's add our image as an overlay to the video file. Here's the full pipeline, and below it we'll break down each section of the code to explain how it works.
input_video_path = "/content/Bmf-Huggingface/black_and_white.mp4"
logo_path = "/content/panda_photo.png"
output_path = "./complex_edit.mp4"
dump_graph = 0
duration = 10
overlay_option = {
    "dump_graph": dump_graph,
    "source": {
        "start": 0,
        "duration": duration,
        "width": 1280,
        "height": 720
    },
    "overlays": [
        {
            "start": 0,
            "duration": duration,
            "width": 300,
            "height": 200,
            "pox_x": 0,
            "pox_y": 0,
            "loop": 0,
            "repeat_last": 1
        }
    ]
}

my_graph = bmf.graph({
    "dump_graph": dump_graph
})

logo_1 = my_graph.decode({'input_path': logo_path})['video']
video1 = my_graph.decode({'input_path': input_video_path})

overlay_streams = list()
overlay_streams.append(bmf.module([video1['video'], logo_1], 'video_overlay', overlay_option, entry='__main__.video_overlay')[0])

bmf.encode(
    overlay_streams[0],
    video1['audio'],
    {"output_path": output_path}
).run()
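Before we walk through this code, here's an optional sanity check you can run in Colab. It assumes moviepy 1.x (which exposes `moviepy.editor`), one of the dependencies we installed earlier, and simply confirms the output file was written and reports its duration and resolution:

```python
from moviepy.editor import VideoFileClip

clip = VideoFileClip("./complex_edit.mp4")
print(f"duration: {clip.duration:.1f}s, size: {clip.size}")  # length in seconds and [width, height]
clip.close()
```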
Let's break this down too
Defining Paths and Options:
input_video_path = "/content/Bmf-Huggingface/black_and_white.mp4"
logo_path = "/content/panda_photo.png"
output_path = "./complex_edit.mp4"
dump_graph = 0
duration = 10
- `input_video_path`: Specifies the file path to the input video.
- `logo_path`: File path to the image (logo) you want to overlay on the video.
- `output_path`: The file path where the edited video will be saved.
- `dump_graph`: A debugging option in BMF that can be set to `1` to visualize the graph; it is set to `0` here, meaning no graph will be dumped.
- `duration`: The duration in seconds for the overlay to be visible in the video.
Overlay Configuration:
overlay_option = {
    "dump_graph": dump_graph,
    "source": {
        "start": 0,
        "duration": duration,
        "width": 1280,
        "height": 720
    },
    "overlays": [
        {
            "start": 0,
            "duration": duration,
            "width": 300,
            "height": 200,
            "pox_x": 0,
            "pox_y": 0,
            "loop": 0,
            "repeat_last": 1
        }
    ]
}
- `overlay_option`: A dictionary that defines the settings for the source video and the overlay.
- For the source, you specify the width and height to scale the video to, and when the overlay should start and end.
- For the overlays, detailed options such as position, size, and behavior (like `loop` and `repeat_last`) are defined. (A sketch of adding a second overlay entry follows below.)
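Because the `video_overlay` subgraph loops over `option['overlays']` and registers one input stream per entry, you can layer multiple overlays by adding entries here, as long as you also pass a matching image stream to `bmf.module` later. Here's a sketch with hypothetical values for a second overlay, shown for five seconds starting at the 5-second mark, in the top-right corner of the 1280-pixel-wide frame:

```python
# Hypothetical second overlay entry; remember to feed bmf.module a second image stream for it.
overlay_option["overlays"].append({
    "start": 5,            # appears 5 seconds in
    "duration": 5,         # stays visible for 5 seconds
    "width": 300,
    "height": 200,
    "pox_x": 1280 - 300,   # x offset: flush with the right edge of the 1280px frame
    "pox_y": 0,            # y offset: top of the frame
    "loop": 0,
    "repeat_last": 1
})
```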
Creating a BMF Graph:
my_graph = bmf.graph({
    "dump_graph": dump_graph
})
- `my_graph` is an instance of a BMF graph, which sets up the processing graph (pipeline), with `dump_graph` passed as an option.
Decoding the Logo and Video Streams:
logo_1 = my_graph.decode({'input_path': logo_path})['video']
video1 = my_graph.decode({'input_path': input_video_path})
- The video and logo are loaded and decoded to be processed. This decoding extracts the video streams to be used in subsequent steps.
Creating Overlay Streams:
overlay_streams = list()
overlay_streams.append(bmf.module([video1['video'], logo_1], 'video_overlay', overlay_option, entry='__main__.video_overlay')[0])
- An empty list `overlay_streams` is created to hold the video layers.
- The `bmf.module` function is used to create an overlay module, where the source video and logo are processed using the `video_overlay` class defined previously, with the corresponding options. (The sketch below shows how the `entry` argument would change if the class lived in its own file.)
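A side note on `entry='__main__.video_overlay'`: it points BMF at the Python module and class implementing the custom module, and `__main__` is used here because `video_overlay` is defined directly in the notebook. As a hedged sketch (the file name `my_overlays.py` is hypothetical, and entry resolution details can vary by BMF version), if the class lived in its own file importable from `sys.path`, the call might look like this instead:

```python
# Hypothetical: video_overlay defined in my_overlays.py, which is importable from sys.path
overlay_streams.append(
    bmf.module([video1['video'], logo_1], 'video_overlay',
               overlay_option, entry='my_overlays.video_overlay')[0]
)
```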
Encoding the Final Output:
bmf.encode(
    overlay_streams[0],
    video1['audio'],
    {"output_path": output_path}
).run()
- The final video stream, with the overlay applied, and the original audio from the input video are encoded together into a new output file specified by `output_path`. (A sketch of passing extra encoder options follows below.)
- The `.run()` method is called to execute the encoding process.
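If you want more control over the output encoding, `bmf.encode` also accepts encoder settings. The `video_params` keys below follow the examples in the BMF documentation, but treat this as a sketch and double-check the exact option names against the docs for your BMF version:

```python
# Same pipeline, with explicit H.264 encoder settings (sketch; verify keys against BMF docs)
bmf.encode(
    overlay_streams[0],
    video1['audio'],
    {
        "output_path": output_path,
        "video_params": {
            "codec": "h264",
            "crf": 23,            # quality/size trade-off
            "preset": "veryfast"  # encoding speed preset
        }
    }
).run()
```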
Our final output should look something like this:
That's it! We've explored a practical example of using the Babit Multimedia Framework (BMF) for a video editing task that uses AI to create an image we can overlay on a video. Now you know how to set up a BMF graph, decode the input streams, create overlay modules, and finally encode the edited video with the overlay in place. In the future, I'll consider adding more AI models, like one to improve the resolution, or even a model that creates a video from text. Through the power of BMF and Hugging Face open source models, you can create complex video editing workflows with overlays that dynamically change over time, offering vast creative possibilities.
Try it out on CoLab and tell us what you think:
https://colab.research.google.com/drive/1eQxiZc2vZeyOggMoFle_b0xnblupbiXd?usp=sharing