DEV Community

Invoice Reader App with GenAI in 10 minutes — Tutorial

Let's build a simple web application with the help of generative AI tools. We will use ChatGPT to generate the basic project setup, and then update the code further as required.

Image generated by StabilityAI - StableDiffusion

Pre-Requisites

Below are the prerequisites (the environment and tools used) for developing the application:

  • Python 3.10
  • Linux (in this case)
  • IDE or Editor
  • ≥ 2GB memory

Once everything is set up, we can use a prompt like the one below to get the entire code setup for our application from ChatGPT. (Please ignore “huggingface smolagents” in the prompt; it was intended for a different activity 😄)

you are an expert in python programming with high skills of huggingface smolagents library and other required libraries. Generate a complete project with any number of required python program files and functions to implement a web application (using flask and basic html css and js) that requires no user auth, with a page having a mandatory image file upload option part of a form submission. after submitting, it should be parsed by a vision lang model and then the parsed output should be written as a doc file and then emailed to a default email id

ChatGPT accomplished the request, but the first response fell short of expectations, so I had to send two follow-up messages: one to get all the dependency script files, and another to wrap the entire project into a single shell script. I saved that shell script locally and ran it with the following commands to set up the project:

chmod +x flask_vlm_app.sh
./flask_vlm_app.sh

Code with Explanation

Let us go through the code snippets, following the working flow of the application.

app.py

from flask import Flask, render_template, request, send_file
import os
from werkzeug.utils import secure_filename
from image_processor import process_image
from document_generator import create_doc
from email_sender import send_email

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads'
app.config['OUTPUT_FOLDER'] = 'outputs'
DEFAULT_EMAIL = "amrs.tech@gmail.com"

os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)
os.makedirs(app.config['OUTPUT_FOLDER'], exist_ok=True)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/upload', methods=['POST'])
def upload():
    if 'image' not in request.files:
        return "No file part", 400
    file = request.files['image']
    if file.filename == '':
        return "No selected file", 400

    filename = secure_filename(file.filename)
    filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
    file.save(filepath)

    extracted_text = process_image(filepath)

    doc_path = os.path.join(app.config['OUTPUT_FOLDER'], f"{os.path.splitext(filename)[0]}.docx")
    create_doc(extracted_text, doc_path)

    send_email(DEFAULT_EMAIL, doc_path)

    return send_file(doc_path, as_attachment=True)

if __name__ == '__main__':
    app.run(debug=True)

This is the main file that drives the Flask application (server); it imports our dependency scripts (image_processor, document_generator and email_sender). There are two endpoints: one for the home page with the file upload form, and one that handles the form submission (POST request) and invoice parsing. The latter saves the uploaded file to a directory on the server, parses it with our VLM, writes the parsed content to a .docx document, and sends that document to an email address. It also returns the document to the browser as a download.
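As written, the upload route accepts any file name that reaches it. A minimal sketch of an extension check that could run before the file is saved is shown below; the helper name `is_allowed_image` and the `ALLOWED_EXTENSIONS` set are assumptions for illustration and are not part of the generated project.

```python
import os

# Hypothetical allow-list; adjust it to the image formats you expect.
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def is_allowed_image(filename: str) -> bool:
    """Return True if the file name ends with an allowed image extension."""
    _, ext = os.path.splitext(filename)
    return ext.lower() in ALLOWED_EXTENSIONS


# Possible use inside upload(), after the empty-filename check:
#     if not is_allowed_image(file.filename):
#         return "Unsupported file type", 400
```

This only inspects the file name, so it is a convenience check rather than a guarantee that the upload really is an image.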

image_processor.py

from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import torch

def process_image(image_path):
    processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
    model = AutoModelForImageTextToText.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")

    conversation = [
        {
            "role": "user",
            "content":[
                {"type": "image", "url": image_path},
                {"type": "text", "text": "Describe this image in detail."}
            ]
        }
    ]

    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=400)
    generated_texts = processor.batch_decode(output_ids, skip_special_tokens=True)
    print("OUTPUT==>",generated_texts)
    return generated_texts

This script processes the image uploaded by the user. Hugging Face’s SmolVLM-500M-Instruct model is used for image parsing; in my experience it worked better for this task than the Salesforce BLIP image-captioning model that ChatGPT originally put in the code, which did not do well. For more accurate, task-specific results, we can replace the generic prompt with an invoice-specific one and add a few example shots to the conversation list.
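One way to add example shots is to put earlier user/assistant turns in front of the real request. The sketch below only builds the conversation list in the same shape that process_image already passes to apply_chat_template; the function name `build_conversation`, the default prompt text, and the example paths are hypothetical, and whether a model this small benefits from the extra turns is something to check on your own invoices.

```python
def build_conversation(image_path, examples=(), prompt=None):
    """Build a chat-style message list with optional few-shot examples.

    examples is a sequence of (example_image_path, expected_answer) pairs.
    """
    if prompt is None:
        # Hypothetical invoice-specific prompt; tune it for your documents.
        prompt = "Extract the invoice number, date, vendor and total amount."

    def user_turn(path):
        # Same structure as the single user message in process_image.
        return {
            "role": "user",
            "content": [
                {"type": "image", "url": path},
                {"type": "text", "text": prompt},
            ],
        }

    conversation = []
    for example_path, expected_answer in examples:
        conversation.append(user_turn(example_path))
        conversation.append(
            {"role": "assistant", "content": [{"type": "text", "text": expected_answer}]}
        )
    # The real request always comes last so the model answers for this image.
    conversation.append(user_turn(image_path))
    return conversation
```

The returned list can be passed to processor.apply_chat_template in place of the hard-coded conversation.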

email_sender.py

import smtplib
import os
from email.message import EmailMessage

def send_email(recipient, file_path):
    EMAIL_ADDRESS = os.getenv("EMAIL_ADDRESS")
    EMAIL_PASSWORD = os.getenv("EMAIL_PASSWORD")
    # print('emailauth==>',EMAIL_ADDRESS, EMAIL_PASSWORD)

    msg = EmailMessage()
    msg['Subject'] = 'Extracted Text Document'
    msg['From'] = EMAIL_ADDRESS
    msg['To'] = recipient
    msg.set_content("Please find the extracted text document attached.")

    with open(file_path, 'rb') as f:
        msg.add_attachment(
            f.read(),
            maintype='application',
            subtype='octet-stream',
            filename=os.path.basename(file_path),
            disposition='inline'
        )

    with smtplib.SMTP('smtp.gmail.com', 587) as server:
        server.starttls()
        server.login(EMAIL_ADDRESS, EMAIL_PASSWORD)
        server.send_message(msg)

This script sends an email to the DEFAULT_EMAIL we set at the beginning in app.py (it could be collected from the user instead; it is hard-coded here only as an example). We construct the EmailMessage object with the Subject, From and To headers and the body, attach the document (the invoice content), then log in to the SMTP server and send the message. The sender address and password are read from the EMAIL_ADDRESS and EMAIL_PASSWORD environment variables. If you use a virtual environment, you can edit env/bin/activate to export them; otherwise, add them to the system environment directly. (NOTE: You need an app password to send email through the Gmail server. Ref: Google Support Answer for App Passwords)
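If either variable is unset, os.getenv returns None and the failure only shows up later, at the SMTP login. A minimal sketch of a helper that fails early with a clearer message is shown below; the function name `load_email_credentials` is an assumption and does not exist in the generated project.

```python
import os


def load_email_credentials(environ=os.environ):
    """Return (address, password), raising early if either variable is missing."""
    required = ("EMAIL_ADDRESS", "EMAIL_PASSWORD")
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return environ["EMAIL_ADDRESS"], environ["EMAIL_PASSWORD"]


# Possible use at the top of send_email():
#     EMAIL_ADDRESS, EMAIL_PASSWORD = load_email_credentials()
```

Passing the mapping as a parameter keeps the helper easy to test without touching the real environment.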

Great! Now, with the project code all set, the files and folder structure of the project should be something like this:

flask_vlm_app/        # Project root
│── uploads/          # Folder for uploaded images (created dynamically)
│── outputs/          # Folder for generated documents (created dynamically)
│── templates/        # HTML template directory
│   └── index.html    # Webpage for image upload
│── app.py            # Main Flask application
│── image_processor.py # Processes image using a vision-language model
│── document_generator.py # Creates .docx file from extracted text
│── email_sender.py   # Sends extracted text document via email
│── requirements.txt  # List of required dependencies (if needed)

Make sure that all the required libraries are installed (inside the activated virtual environment, if you use one) with pip install -r requirements.txt, or with pip install <pkg> for each library.

Flask==3.0.3
torch==2.6.0
transformers
pillow==11.1.0
python-docx==1.1.2

Let it begin! Start the Flask app server with the command python app.py inside the project folder. You should see a page like the screenshot below after opening localhost:5000 in the web browser (5000 is Flask's default port; it will differ only if your app config sets another one).
You can then upload an image file of an invoice and click Upload. Model inference and sending the email should take a couple of minutes.

UI Screenshot

Voila! You should have received an email (provided the recipient email address is yours 😁) with the parsed content as a docx attachment.
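Part of the waiting time comes from process_image calling from_pretrained on every request. A minimal sketch of loading the processor and model once and reusing them is shown below, assuming the same model ID as in image_processor.py; the function name `load_vlm` is hypothetical, and the first request still pays the full loading cost.

```python
from functools import lru_cache

MODEL_ID = "HuggingFaceTB/SmolVLM-500M-Instruct"


@lru_cache(maxsize=1)
def load_vlm():
    """Load the processor and model once; later calls return the cached pair."""
    # Imported here so the heavy library loads only when the model is first needed.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageTextToText.from_pretrained(MODEL_ID)
    return processor, model


# Possible use at the top of process_image():
#     processor, model = load_vlm()
```

With this in place, only the first upload after the server starts loads the model; later uploads go straight to inference.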

This shows how easily generative AI tools can be used to boost productivity with less effort. Feel free to react 😜 and leave feedback in the comments. Thanks.

GitHub link: https://github.com/amrs-tech/invoice-reader

Happy Learning!
