A couple of weeks ago, Hugging Face released SmolVLM-2 with an exciting new feature: video understanding. The vision-language model, which previously handled only images, has been upgraded to accept video inputs. It is available in several variants: 2.2B, 500M, and 256M parameter models.
The model is far from perfect, but it gets the task done with a decent level of accuracy. Here is a short and simple tutorial on how to play with it (the 500M variant in this case).
Environment: Google Colab (with a T4 GPU)
NOTE: I’ve disabled flash-attention, as it is only supported on newer GPUs.
Code Example
A sample video of around 30 seconds is used as input.
!pip install wheel decord pyav num2words
!pip install flash-attn --no-build-isolation
!pip install git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2
These are the prerequisite libraries the script needs to work with video inputs. flash-attn is not strictly required here, since the T4 GPU does not support flash-attention. SmolVLM-2 is not yet available in a stable release of the transformers package, so the tagged release is installed directly from GitHub.
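Before moving on, it is worth quickly confirming the environment. The following optional check (just a sanity-check sketch, not part of the original script) verifies the installed transformers build and whether a CUDA GPU is visible:
import torch
import transformers

print("transformers version:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))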
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
model_path = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    # _attn_implementation="flash_attention_2"
).to("cuda")
After installation, the required libraries are imported. The model is then downloaded via transformers and loaded onto the GPU (‘cuda’).
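If you want to double-check that the download worked, a quick optional snippet like this (my addition, not part of the original walkthrough) prints the parameter count and confirms the weights landed on the GPU:
# Optional: report parameter count and the device the weights are on
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e6:.0f}M")
print("Device:", next(model.parameters()).device)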
# Ensure all parameters are in bfloat16
for name, param in model.named_parameters():
    if param.dtype == torch.float32:
        param.data = param.data.to(torch.bfloat16)
We need to make sure that all model parameters are in the bfloat16 dtype.
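After the cast, you can optionally verify that no float32 parameters remain (a small check I added for illustration):
# Should print 0 if the cast above covered everything
leftover_fp32 = [n for n, p in model.named_parameters() if p.dtype == torch.float32]
print("float32 parameters left:", len(leftover_fp32))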
from google.colab import drive
drive.mount('/content/drive')
vdo_pth = '/content/drive/MyDrive/videoplayback.mp4'
Now, Google Drive is mounted and the path to the sample video file is set.
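Since decord was installed earlier, you can also inspect the clip before running the model; this optional sketch (my own addition, reusing the path defined above) checks that the file exists and prints its length and frame rate:
import os
from decord import VideoReader

assert os.path.exists(vdo_pth), f"Video not found at {vdo_pth}"
vr = VideoReader(vdo_pth)
fps = vr.get_avg_fps()
print(f"Frames: {len(vr)} | FPS: {fps:.2f} | Duration: {len(vr) / fps:.1f}s")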
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": vdo_pth},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]
The messages list is constructed in the expected chat-template format, with the video file path as input along with a text prompt asking the model to describe the video.
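As a side note, the same template structure should also work for still images by swapping the content type; I have not tested this here, and the image path below is only a placeholder:
# Hypothetical example: same chat template with an "image" content type
image_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "/content/drive/MyDrive/sample.jpg"},  # placeholder path
            {"type": "text", "text": "Describe this image in detail"}
        ]
    },
]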
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")
Now, the chat template is applied, the input is tokenized, and the resulting tensors are moved to the GPU (‘cuda’).
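If you are curious what the processor actually produced, you can peek at the tensor shapes (an optional sketch; the exact key names depend on the processor version):
for key, value in inputs.items():
    if torch.is_tensor(value):
        print(key, tuple(value.shape), value.dtype)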
for key, value in inputs.items():
    if torch.is_floating_point(value):  # Convert only float tensors
        inputs[key] = value.to(torch.bfloat16)
Just like the model parameters, the floating-point input tensors are converted to the bfloat16 dtype.
generated_ids = model.generate(**inputs, max_new_tokens=254)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print('OUT: ', generated_texts[0])
Now the model is run for inference, and the generated token IDs are decoded back into text.
OUT: User: You are provided the following series of twenty-four frames from a 0:00:24 [H:MM:SS] video.
Frame from 00:00:
Frame from 00:01:
Frame from 00:02:
Frame from 00:03:
Frame from 00:04:
Frame from 00:05:
Frame from 00:06:
Frame from 00:07:
Frame from 00:08:
Frame from 00:09:
Frame from 00:10:
Frame from 00:11:
Frame from 00:12:
Frame from 00:13:
Frame from 00:14:
Frame from 00:15:
Frame from 00:16:
Frame from 00:17:
Frame from 00:19:
Frame from 00:20:
Frame from 00:21:
Frame from 00:22:
Frame from 00:23:
Frame from 00:24:
Describe this video in detail
Assistant: The video begins with a view of a large commercial airplane in flight, showcasing its sleek design and the expansive sky. The text "explained" appears at the top of the frame, indicating the topic of the video. The airplane is shown in various angles, emphasizing its size and the vastness of the sky. The scene transitions to a close-up of the airplane's engines, highlighting their size and the impressive performance of the jet. The text "explained" reappears, and the airplane is shown in a different angle, emphasizing its size and the expansive sky.
The video then shifts to a view of a small white airplane on a runway, with the text "explained" at the top. The airplane is shown in various angles, emphasizing its size and the expansive sky. The scene transitions to a view of a large airplane in flight, showcasing its sleek design and the expansive sky. The text "explained" reappears, and the airplane is shown in a different angle, emphasizing its size and the expansive sky.
The video continues with a view of a large commercial airplane in flight, showcasing its sleek design and the expansive sky. The text "explained" appears at the top of the frame, indicating the topic of the video
The output above is the description generated for the input video. It is not perfect, but for such a small model, the result is quite reasonable.
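Note that batch_decode returns the whole conversation, including the echoed prompt and the frame placeholders you see above. If you only want the assistant’s answer, one common approach (a generic sketch, not specific to SmolVLM-2) is to slice off the prompt tokens before decoding:
# Decode only the newly generated tokens, dropping the echoed prompt
prompt_len = inputs["input_ids"].shape[1]
answer_ids = generated_ids[:, prompt_len:]
answer = processor.batch_decode(answer_ids, skip_special_tokens=True)[0]
print('ANSWER: ', answer)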
Input Video Link : GDrive Link of Video
You can try inference on different samples and with the other variants, such as the 256M and 2.2B models. These models are even small enough to be integrated into and deployed on mobile devices.
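To make experimenting easier, you could wrap the steps above into a small helper; this is just a sketch that reuses the processor and model already loaded, and the function name is my own:
def describe_video(video_path, prompt="Describe this video in detail", max_new_tokens=254):
    """Run one video + text prompt through the already-loaded SmolVLM-2 model."""
    msgs = [{
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": prompt}
        ]
    }]
    batch = processor.apply_chat_template(
        msgs,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to("cuda")
    for k, v in batch.items():
        if torch.is_floating_point(v):
            batch[k] = v.to(torch.bfloat16)
    out_ids = model.generate(**batch, max_new_tokens=max_new_tokens)
    return processor.batch_decode(out_ids, skip_special_tokens=True)[0]

# Example: ask a different question about the same clip
print(describe_video(vdo_pth, "What kind of vehicle appears in this video?"))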
Happy Learning !!