
Small Model from Hugging Face with Video Understanding

A couple of weeks ago, Hugging Face released SmolVLM-2 with an amazing feature: video understanding. The vision language model, which initially worked only with images, has now been upgraded to understand video inputs. It has been released in different variants: 2.2B, 500M, and 256M parameter models.

Image generated with AI - Flux.1-dev

Though the model is not perfect, it can get the task done with a decent level of accuracy. Here’s a small and simple tutorial on how to play with this model (the 500M variant in this case).

Environment: Google Colab (with a T4 GPU)
NOTE: I’ve disabled flash-attention, as it is supported only by newer GPUs
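
If you are unsure whether your GPU supports flash-attention, a quick capability check helps (a minimal sketch; flash-attention 2 generally needs compute capability 8.0 or higher, while the Colab T4 is 7.5):

import torch

# Flash-attention 2 generally requires an Ampere-or-newer GPU
# (compute capability >= 8.0); the Colab T4 is Turing (7.5).
major, minor = torch.cuda.get_device_capability(0)
attn_impl = "flash_attention_2" if major >= 8 else "eager"
print(f"Compute capability {major}.{minor} -> using {attn_impl}")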

Code Example

A sample video (around 30 seconds long) is used as input.

!pip install wheel decord pyav num2words
!pip install flash-attn --no-build-isolation
!pip install git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2

These are the prerequisite libraries needed for the script to work with video inputs. flash-attn may not be required here, since the T4 GPU is not compatible with flash-attention. SmolVLM-2 is not yet available in the stable release of the transformers package, so a tagged release branch is installed via pip.
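
To confirm that the right build was picked up, you can print the installed version (a minimal check; the exact version string depends on the tag installed above):

import transformers
# The SmolVLM-2 branch installed above should report a 4.49.0-based version string
print(transformers.__version__)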

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_path = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    # _attn_implementation="flash_attention_2"
).to("cuda")

The required libraries are imported after installation. The model checkpoint is then downloaded via transformers and loaded onto the GPU (‘cuda’) in bfloat16.
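
As a quick sanity check (a minimal sketch, not part of the original walkthrough), you can count the parameters and estimate the memory footprint of the loaded model:

# Rough check of the loaded checkpoint's size
num_params = sum(p.numel() for p in model.parameters())
mem_gb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e9
print(f"{num_params / 1e6:.0f}M parameters, ~{mem_gb:.2f} GB in bfloat16")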

# Ensure all parameters are in bfloat16
for name, param in model.named_parameters():
    if param.dtype == torch.float32:
        param.data = param.data.to(torch.bfloat16)

We need to make sure that all model parameters are in the bfloat16 dtype; any that were left in float32 are cast explicitly.
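
To verify that the cast worked (a minimal sketch), you can tally the parameter dtypes:

from collections import Counter

# Tally parameter dtypes; after the cast only torch.bfloat16 should remain
print(Counter(p.dtype for p in model.parameters()))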

from google.colab import drive
drive.mount('/content/drive')

vdo_pth = '/content/drive/MyDrive/videoplayback.mp4'

Now, the sample video file is loaded from GDrive.
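
Before running the model, it can help to confirm that the file is readable and check its length. A minimal sketch using decord (installed earlier), where the duration is just frame count divided by FPS:

from decord import VideoReader

# Quick check that the file is readable, plus its frame count, FPS and duration
vr = VideoReader(vdo_pth)
fps = vr.get_avg_fps()
print(f"{len(vr)} frames at {fps:.1f} fps (~{len(vr) / fps:.1f} s)")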

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": vdo_pth},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

The messages list is constructed following the chat template format, with the video file path as input along with a text prompt asking the model to describe the video.
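
The same structure works for more targeted prompts as well; for example (an illustrative variation, not used in the rest of this walkthrough), you could ask a specific question about the clip:

alt_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": vdo_pth},
            {"type": "text", "text": "What kind of aircraft appears in this video?"}
        ]
    },
]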

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")

Now, the chat template is applied, the inputs are tokenized, and the resulting tensors are moved to the GPU (‘cuda’).
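
If you are curious what the processor actually produced (a minimal sketch; the exact key names such as pixel values depend on the processor implementation), you can inspect the tensors:

# Inspect the processed inputs: typically input_ids, attention_mask and pixel-value tensors
for key, value in inputs.items():
    if torch.is_tensor(value):
        print(key, tuple(value.shape), value.dtype)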

for key, value in inputs.items():
    if torch.is_floating_point(value):  # Convert only float tensors
        inputs[key] = value.to(torch.bfloat16)

As with the model, the floating-point input tensors are converted to the bfloat16 dtype.

generated_ids = model.generate(**inputs, max_new_tokens=254)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print('OUT: ', generated_texts[0])

Now, the model is called for inference, and the generated token IDs are decoded back into text.
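
Note that decoding the full sequence also returns the echoed prompt (as you can see in the output below). To print only the model’s answer, you can slice off the prompt tokens first, along these lines (a minimal sketch):

# Slice off the prompt tokens so only the model's answer is decoded
prompt_len = inputs["input_ids"].shape[1]
answer = processor.batch_decode(
    generated_ids[:, prompt_len:],
    skip_special_tokens=True,
)[0]
print('ANSWER: ', answer)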

OUT:  User: You are provided the following series of twenty-four frames from a 0:00:24 [H:MM:SS] video.

Frame from 00:00:
Frame from 00:01:
Frame from 00:02:
Frame from 00:03:
Frame from 00:04:
Frame from 00:05:
Frame from 00:06:
Frame from 00:07:
Frame from 00:08:
Frame from 00:09:
Frame from 00:10:
Frame from 00:11:
Frame from 00:12:
Frame from 00:13:
Frame from 00:14:
Frame from 00:15:
Frame from 00:16:
Frame from 00:17:
Frame from 00:19:
Frame from 00:20:
Frame from 00:21:
Frame from 00:22:
Frame from 00:23:
Frame from 00:24:

Describe this video in detail
Assistant: The video begins with a view of a large commercial airplane in flight, showcasing its sleek design and the expansive sky. The text "explained" appears at the top of the frame, indicating the topic of the video. The airplane is shown in various angles, emphasizing its size and the vastness of the sky. The scene transitions to a close-up of the airplane's engines, highlighting their size and the impressive performance of the jet. The text "explained" reappears, and the airplane is shown in a different angle, emphasizing its size and the expansive sky.

The video then shifts to a view of a small white airplane on a runway, with the text "explained" at the top. The airplane is shown in various angles, emphasizing its size and the expansive sky. The scene transitions to a view of a large airplane in flight, showcasing its sleek design and the expansive sky. The text "explained" reappears, and the airplane is shown in a different angle, emphasizing its size and the expansive sky.

The video continues with a view of a large commercial airplane in flight, showcasing its sleek design and the expansive sky. The text "explained" appears at the top of the frame, indicating the topic of the video

The output is as above: a description generated from the video input. It is not perfectly accurate and gets repetitive, but for such a small model, it is a reasonable result.
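
If the output is too repetitive, standard generation parameters such as sampling or a repetition penalty can help (a minimal sketch using generate’s built-in arguments; the values here are illustrative, not tuned):

generated_ids = model.generate(
    **inputs,
    max_new_tokens=254,
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.7,         # illustrative value, not tuned
    repetition_penalty=1.2,  # discourage repeated phrases
)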

Input Video Link: GDrive Link of Video

You can try running inference on different samples and with the other variants, such as the 256M and 2.2B models. Models this small can even be integrated into and deployed on mobile devices.
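
Switching variants only requires changing the checkpoint name (the repo IDs below are assumed from the naming pattern of the 500M checkpoint used here; double-check the exact names on the Hugging Face Hub):

# Repo IDs assumed from the 500M checkpoint's naming pattern; verify them on the Hub
model_path = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"  # smallest variant
# model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"      # largest variant
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
).to("cuda")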

Happy Learning !!
