DEV Community

ShatilKhan

Engineering of Small Things #3 : Small Language Models

Disclaimer! I am NOT an AI Influencer! This is literally just me learning how to implement a Hugging Face model for the first time.
And Boy Was it Hard! :))

What is this even about?

I built a document question-answering bot for this demonstration. It takes an image & lets us ask questions about that image.
I had no idea how to implement language models going into this, so it was a really fun experience. This is part of a bigger project; today I'm just sharing one part of it.

Going Crazy

This was for a hackathon project where I was trying to use Streamlit & Hugging Face. I didn't even have a Hugging Face account & had only basic tutorial-level experience with Streamlit. But I really wanted to learn & implement something on my own. I was tired of following tutorials, & it didn't matter if it was a standard solution or not!

But like the heading says, I did go crazy a couple times :)

The beginning of Insanity

Now I'm a web developer who had only heard about Hugging Face & didn't really care about the hype. But then I decided to experiment a little with this tech. I did not have the necessary setup on my local device, so first I had to install PyTorch & Tesseract OCR on my PC. I will not be sharing this trauma :)
Those who know...know 😫

But I will take you through how I implemented it!

First we need the basic ingredients! A transformer!

How does it work? - I have no f**king Idea!
What does it do? - Makes language model go "brrrr"

And a library to read Image files.

Basic Imports:
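The screenshot of the imports did not survive, so this is only a rough sketch. Going by the text, the app needs Streamlit, the `pipeline` function from `transformers`, and `Image` from PIL (Pillow). The helper below, `missing_packages`, is a made-up name, and the package list (including `torch` and `pytesseract`) is an assumption based on the PyTorch & Tesseract setup mentioned earlier. It only checks that the packages can be found before the app starts.

```python
from importlib.util import find_spec

# The app itself would likely start with imports along these lines (assumed):
#   import streamlit as st
#   from transformers import pipeline
#   from PIL import Image

# Importable module name -> name to use with `pip install` (assumed list).
REQUIRED_MODULES = {
    "streamlit": "streamlit",
    "transformers": "transformers",
    "torch": "torch",
    "PIL": "Pillow",
    "pytesseract": "pytesseract",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of required modules that cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if find_spec(module) is None  # None means the module is not installed
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing:", ", ".join(missing) if missing else "nothing, good to go")
```

Note that this only covers Python packages. The Tesseract OCR program itself is a separate system install.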

Now to initialize our pipeline.
What does that even mean? Basically we are picking a model from the Hugging Face model catalogue & telling the pipeline what kind of task that model performs. Here comes the main point of this blog: Small Language Models. At first I tried to use a popular Large Language Model (Mistral). But here's the thing: once a pipeline is initialized, the first run of the program has to download the model onto the local device. I have shitty internet, & the Mistral-8B was like 2GB+. Every time I started the project, the download would get halfway & give up.
So I opted for a Small Language Model called impira/layoutlm-document-qa.

This is a 500 MB model that did a good enough job of answering questions from an uploaded document. But there are some constraints: the uploaded document needs to be an image, hence the need for a separate library to read image files (PIL).

Pipeline Initialization:
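The original screenshot is missing, so here is a minimal sketch of what initializing the pipeline could look like, assuming the `document-question-answering` task from `transformers`. The function name `load_qa_pipeline` and the two constants are made up for illustration.

```python
MODEL_ID = "impira/layoutlm-document-qa"
TASK = "document-question-answering"


def load_qa_pipeline(model_id=MODEL_ID):
    """Build the document question-answering pipeline.

    The first call downloads the model files into the local Hugging Face
    cache; later calls reuse the files that are already on disk.
    """
    # Imported inside the function because importing transformers
    # (and PyTorch underneath it) is slow.
    from transformers import pipeline

    return pipeline(TASK, model=model_id)
```

Calling `load_qa_pipeline()` for the first time is where the big download happens, which is why the size of the model matters so much on a slow connection.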

But here's the hard lesson I learned. I used Streamlit Cloud to deploy the project, so it was running on Streamlit's servers instead of my own machine. But after a few minutes of usage it would become WAY TOO resource intensive & the project would shut down due to Streamlit Cloud's resource limitations. I couldn't figure this out for SOO long. The catch is that Streamlit reruns the whole script on every interaction, so without caching the model gets loaded again & again. But then Streamlit came to the rescue again: its caching decorators let us keep expensive results around between reruns. I was using st.cache_data at first, but that API is meant for serializable data (dataframes, query results & so on) & hands back a fresh copy on every call, NOT for AI MODELS. A loaded AI model counts as a resource, something that should be created once & shared. So later I had to switch to st.cache_resource, which finally solved the problem!

Caching Pipeline Initialization:
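To see why the switch mattered, here is a toy illustration of the two caching behaviours in plain Python. These decorators are simplified stand-ins, not Streamlit's implementation: `cache_data_like` keeps a pickled copy & returns a new object on every call, while `cache_resource_like` keeps the object itself & always returns the same one. `build_fake_model` is a hypothetical placeholder for loading the real pipeline.

```python
import functools
import pickle


def cache_data_like(func):
    """Toy version of st.cache_data: store a pickled copy, return a new object each call."""
    store = {}

    @functools.wraps(func)
    def wrapper():
        if "value" not in store:
            store["value"] = pickle.dumps(func())  # result must be picklable
        return pickle.loads(store["value"])  # fresh copy every time

    return wrapper


def cache_resource_like(func):
    """Toy version of st.cache_resource: store the object itself, return it each call."""
    store = {}

    @functools.wraps(func)
    def wrapper():
        if "value" not in store:
            store["value"] = func()  # built only once
        return store["value"]  # the very same object every time

    return wrapper


load_count = {"n": 0}


def build_fake_model():
    """Hypothetical stand-in for an expensive model load."""
    load_count["n"] += 1
    return {"name": "fake-model"}


cached_as_data = cache_data_like(build_fake_model)
cached_as_resource = cache_resource_like(build_fake_model)
```

In the actual app this corresponds to putting `@st.cache_resource` on top of the function that builds the pipeline, so the model is loaded once & reused across reruns.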

The next steps were pretty easy once I solved the main problem.
In the next stage I used an if-statement to check whether an image file has been uploaded, & loaded the pipeline so the SLM could read the file.

Verify Image Upload:
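A minimal sketch of that check, assuming the upload widget is `st.file_uploader`, which returns `None` until a file is chosen. The function name `get_uploaded_image`, the labels, & the allowed file types are made up for illustration. The Streamlit module is passed in as `st` so the function can be tried without a running app.

```python
ALLOWED_TYPES = ["png", "jpg", "jpeg"]  # assumed list of accepted image formats


def get_uploaded_image(st):
    """Show the uploader; return the uploaded file, or None if nothing is uploaded yet."""
    uploaded_file = st.file_uploader("Upload a document image", type=ALLOWED_TYPES)
    if uploaded_file is None:
        # Nothing uploaded yet, so there is nothing for the model to read.
        st.info("Upload an image to start asking questions.")
        return None
    # Show the image back to the user so they can confirm it is the right one.
    st.image(uploaded_file, caption="Uploaded document")
    return uploaded_file
```

In the app this would be called as `get_uploaded_image(st)` after `import streamlit as st`.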

After that I initialized a form using st.form to submit a question.

Initialize form:
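A rough sketch of such a form, assuming a single text field & a submit button inside `st.form`. The function name `read_question`, the form key, & the labels are hypothetical. As before, the Streamlit module is passed in as `st`.

```python
def read_question(st):
    """Render the question form; return the question once submitted, else None."""
    with st.form("question_form"):
        question = st.text_input("Ask a question about the document")
        submitted = st.form_submit_button("Ask")
    # Only pass the question on when the button was pressed and the
    # text is not empty or just spaces.
    if submitted and question.strip():
        return question.strip()
    return None
```

Wrapping the input in a form means the script only acts on the question when the button is pressed, instead of on every keystroke-triggered rerun.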

If a question is submitted, the image is opened & the AI model is queried about it.

Open Query:
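A minimal sketch of this step, assuming the pipeline is called with `image` & `question` keyword arguments & that `top_k` asks for several candidates. The helper names `open_image` & `query_document` & the default of 3 are made up for illustration.

```python
def open_image(uploaded_file):
    """Open the uploaded file as an RGB Pillow image."""
    # Pillow is imported here so the rest of the sketch works without it.
    from PIL import Image

    return Image.open(uploaded_file).convert("RGB")


def query_document(qa_pipeline, image, question, top_k=3):
    """Ask the pipeline a question about the image; return a list of candidates."""
    results = qa_pipeline(image=image, question=question, top_k=top_k)
    # Depending on the setup the pipeline may hand back a single dict or a
    # list of dicts, so normalise to a list either way.
    if isinstance(results, dict):
        results = [results]
    return list(results)
```

With the earlier pieces in place, the call would look like `query_document(qa, open_image(uploaded_file), question)`, where `qa` is the loaded pipeline.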

Now the query doesn't just produce one answer. Remember, this model doesn't understand human language the way we do. It reads the text that OCR found in the image & scores the pieces of it that might answer the question, based on what it learned in training. So it can come back with multiple candidate answers, each with a confidence score.
Finally the most probable answer is picked and shown to the user!

Get the best answer:
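Picking the winner can be sketched in a few lines, assuming each candidate is a dict with an `answer` & a `score` key, which is the shape the document question-answering pipeline returns. The function name `best_answer` is made up.

```python
def best_answer(results):
    """Return the candidate with the highest score, or None if there are none."""
    if not results:
        return None
    return max(results, key=lambda candidate: candidate["score"])
```

The text to show the user is then `best_answer(results)["answer"]`, after checking that the result is not `None`.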

Here's the main file: https://github.com/ShatilKhan/Hemo/blob/main/hemo.py

There are a lot of other features, as it is part of a larger project. I just explained the part where I used a language model.
Hopefully I'll write more about other features of this project soon!
Happy Coding!
