Alexander Uspenskiy

Unlock the Magic of Images: A Quick and Easy Guide to Using the Cutting-Edge SmolVLM-500M Model

SmolVLM-500M-Instruct is a state-of-the-art, compact vision-language model with 500 million parameters. Despite its relatively small size, its capabilities are remarkably impressive.

Let's jump to the code:

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import warnings

warnings.filterwarnings("ignore", message="Some kwargs in processor config are unused")

def upload_and_describe_image(image_path):
    # Load the processor and the 500M-parameter instruct model from the Hugging Face Hub
    processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
    model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")

    # Open the image and convert it to RGB so grayscale and RGBA files are handled too
    image = Image.open(image_path).convert("RGB")

    # The <image> placeholder tells the processor where to inject the image tokens
    prompt = "Describe the content of this <image> in detail, give only answers in a form of text"
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")

    # Generate a description; sampling with a moderate temperature keeps the text varied
    with torch.no_grad():
        outputs = model.generate(
            pixel_values=inputs["pixel_values"],
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=150,
            do_sample=True,
            temperature=0.7
        )

    # Decode the generated token IDs back into plain text
    description = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return description.strip()

if __name__ == "__main__":
    image_path = "images/bender.jpg"

    try:
        description = upload_and_describe_image(image_path)
        print("Image Description:", description)
    except Exception as e:
        print(f"An error occurred: {e}")

This Python script uses the Hugging Face Transformers library to generate a textual description of an image. It loads the pre-trained vision-to-sequence model and its processor, prepares the input image, and generates descriptive text based on the image content. The script handles exceptions and prints the result.
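If you want more control over the prompt, you can also build it through the processor's chat template instead of a raw "<image>" string. The snippet below is a minimal sketch of that approach; it assumes a recent transformers release where the processor exposes apply_chat_template, and it reuses the processor, model, and image objects from the function above.

# A minimal sketch of prompting via the chat template (assumption: recent transformers
# with processor.apply_chat_template; processor, model, and image come from the function above).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the content of this image in detail."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=150)

print(processor.batch_decode(outputs, skip_special_tokens=True)[0])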

You can download it here: https://github.com/alexander-uspenskiy/vlm

Based on this original, non-stock image (place it in the images directory of the project):

[Image: the original photo, saved as images/bender.jpg in the project]

Take a look at the description generated by the model (you can play with the prompt and the generation parameters in the code to better format the output for any purpose):

The robot is sitting on a couch. It has eyes and mouth. He is reading something. He is holding a book with his hands. He is looking at the book. In the background, there are books in a shelf. Behind the books, there is a wall and a door. At the bottom of the image, there is a chair. The chair is white. The chair has a cushion on it. In the background, the wall is brown. The floor is grey. in the image, the robot is silver and cream color. The book is brown. The book is open. The robot is holding the book with both hands. The robot is looking at the book. The robot is sitting on the couch.
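For example, if you want shorter, more repeatable output, you can switch off sampling and reduce the token budget. This is just a sketch of the tweaked generate call, assuming the same processor, model, and inputs as inside upload_and_describe_image above.

# A sketch of shorter, deterministic output (assumes the same `inputs`, `model`,
# and `processor` as built inside upload_and_describe_image above).
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=80,   # shorter description
        do_sample=False      # greedy decoding instead of temperature sampling
    )
description = processor.batch_decode(outputs, skip_special_tokens=True)[0]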

The result looks excellent, and the model is both fast and resource-efficient compared to much larger vision-language models.
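If you have a GPU available, you can squeeze out even more speed by loading the model in half precision. The sketch below assumes a CUDA device; the rest of the script stays the same.

# A sketch of loading the model on GPU in half precision (assumption: a CUDA device is available).
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-500M-Instruct",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)
# When running in half precision, move the processed inputs to the same device and dtype
# before calling generate, e.g. inputs["pixel_values"].to(device, torch.float16).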

Happy coding!
