Let us build a simple web application with the help of Generative AI tools. We will use ChatGPT to generate the basic project setup, and then refine the generated code further as required.
Pre-Requisites
Below are the pre-requisites (the environment and tools used) for developing the application:
- Python 3.10
- Linux (in this case)
- IDE or Editor
- ≥ 2GB memory
Once everything is set, we can use a prompt like the one below to get the entire code setup for our application from ChatGPT. (Please ignore “huggingface smolagents” in the prompt; it was intended for a different activity 😄)
you are an expert in python programming with high skills of huggingface smolagents library and other required libraries. Generate a complete project with any number of required python program files and functions to implement a web application (using flask and basic html css and js) that requires no user auth, with a page having a mandatory image file upload option part of a form submission. after submitting, it should be parsed by a vision lang model and then the parsed output should be written as a doc file and then emailed to a default email id
This accomplished the request, but not as well as expected, so I had to send two follow-up messages (one to get all the dependency script files, and the other to wrap the entire project into a shell script). I saved the resulting shell script locally and executed it with the following commands to set up the project:
chmod +x flask_vlm_app.sh
./flask_vlm_app.sh
Code with Explanation
Let us go through the code snippets in the order of the application's workflow.
app.py
from flask import Flask, render_template, request, send_file
import os
from werkzeug.utils import secure_filename
from image_processor import process_image
from document_generator import create_doc
from email_sender import send_email

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads'
app.config['OUTPUT_FOLDER'] = 'outputs'
DEFAULT_EMAIL = "amrs.tech@gmail.com"

os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)
os.makedirs(app.config['OUTPUT_FOLDER'], exist_ok=True)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/upload', methods=['POST'])
def upload():
    if 'image' not in request.files:
        return "No file part", 400
    file = request.files['image']
    if file.filename == '':
        return "No selected file", 400
    filename = secure_filename(file.filename)
    filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
    file.save(filepath)
    extracted_text = process_image(filepath)
    doc_path = os.path.join(app.config['OUTPUT_FOLDER'], f"{os.path.splitext(filename)[0]}.docx")
    create_doc(extracted_text, doc_path)
    send_email(DEFAULT_EMAIL, doc_path)
    return send_file(doc_path, as_attachment=True)

if __name__ == '__main__':
    app.run(debug=True)
This is the main file that drives the Flask application (server); our dependency scripts (image_processor, document_generator and email_sender) are imported here. There are two endpoints: one for the home page with the file upload form, and the other for the form submission (POST request) and invoice parsing. The latter saves the uploaded file to a directory on the server, parses it with our VLM, writes the parsed content to a document, sends that document to an email address, and returns it to the browser as a download.
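Since the upload endpoint accepts any file the browser sends, a small extension check before saving may be worth adding. The following is a minimal sketch, not part of the generated app.py: the `ALLOWED_EXTENSIONS` set and the `allowed_file` helper are hypothetical names chosen for illustration.

```python
import os

# Hypothetical allow-list of image extensions; adjust to what your model accepts.
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def allowed_file(filename: str) -> bool:
    """Return True if the filename ends with an extension from the allow-list."""
    _, ext = os.path.splitext(filename)
    return ext.lower() in ALLOWED_EXTENSIONS
```

In upload(), this could be called right after the empty-filename check, returning a 400 response when it yields False.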
image_processor.py
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import torch

def process_image(image_path):
    processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
    model = AutoModelForImageTextToText.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_path},
                {"type": "text", "text": "Describe this image in detail."}
            ]
        }
    ]
    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=400)
    generated_texts = processor.batch_decode(output_ids, skip_special_tokens=True)
    print("OUTPUT==>", generated_texts)
    return generated_texts
This script processes the image uploaded by the user. HuggingFace’s SmolVLM-500M-Instruct model is used for image parsing; for this task it worked better than the Salesforce BLIP image-captioning model in the code ChatGPT originally gave, which did not perform well. For more accurate, task-specific results, we can add a few example shots to the messages list.
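The few-shot idea could be sketched as below. This is only an illustration under assumptions: `build_conversation` is a hypothetical helper, the example image paths and answers are placeholders you would supply yourself, and whether extra shots actually improve a 500M-parameter model on your invoices is something to verify. The message layout mirrors the one already used in process_image.

```python
def build_conversation(image_path, examples=(), prompt="Describe this image in detail."):
    """Build a chat message list with optional few-shot examples.

    `examples` is a sequence of (example_image_path, expected_answer) pairs.
    Each pair becomes a user turn (image + prompt) followed by an assistant
    turn holding the expected answer; the real query is appended last.
    """
    conversation = []
    for example_path, answer in examples:
        conversation.append({
            "role": "user",
            "content": [
                {"type": "image", "url": example_path},
                {"type": "text", "text": prompt},
            ],
        })
        conversation.append({
            "role": "assistant",
            "content": [{"type": "text", "text": answer}],
        })
    # The actual image to parse always comes last.
    conversation.append({
        "role": "user",
        "content": [
            {"type": "image", "url": image_path},
            {"type": "text", "text": prompt},
        ],
    })
    return conversation
```

The returned list can be passed to processor.apply_chat_template in place of the hard-coded conversation.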
email_sender.py
import smtplib
import os
from email.message import EmailMessage

def send_email(recipient, file_path):
    EMAIL_ADDRESS = os.getenv("EMAIL_ADDRESS")
    EMAIL_PASSWORD = os.getenv("EMAIL_PASSWORD")
    # print('emailauth==>',EMAIL_ADDRESS, EMAIL_PASSWORD)
    msg = EmailMessage()
    msg['Subject'] = 'Extracted Text Document'
    msg['From'] = EMAIL_ADDRESS
    msg['To'] = recipient
    msg.set_content("Please find the extracted text document attached.")
    with open(file_path, 'rb') as f:
        msg.add_attachment(
            f.read(),
            maintype='application',
            subtype='octet-stream',
            filename=os.path.basename(file_path),
            disposition='inline'
        )
    with smtplib.SMTP('smtp.gmail.com', 587) as server:
        server.starttls()
        server.login(EMAIL_ADDRESS, EMAIL_PASSWORD)
        server.send_message(msg)
This script sends an email to the DEFAULT_EMAIL set in app.py (it could be taken from the user instead; a fixed address is used here only as an example). We construct the EmailMessage object with Subject, From, To and the body, attach the document (the invoice content), then log in and send the message. The sender's address and password are read from the EMAIL_ADDRESS and EMAIL_PASSWORD environment variables. If a virtual environment is used, you can edit env/bin/activate to add the email id and password to the environment; otherwise, you can add them directly to the system environment. (NOTE: You need an app password to send email through the Gmail server. Ref: Google Support Answer for App Passwords)
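If either variable is missing, os.getenv returns None and the failure only shows up later at login. A minimal sketch of an early check follows; `load_email_credentials` is a hypothetical helper, not part of the generated email_sender.py.

```python
import os


def load_email_credentials(env=os.environ):
    """Return (address, password) from the environment, or raise if one is missing."""
    address = env.get("EMAIL_ADDRESS")
    password = env.get("EMAIL_PASSWORD")
    # Collect the names of variables that are unset or empty.
    missing = [
        name
        for name, value in (("EMAIL_ADDRESS", address), ("EMAIL_PASSWORD", password))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return address, password
```

send_email could call this at the top instead of the two os.getenv lines, so a misconfigured environment fails with a clear message before any SMTP connection is opened.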
Great! Now, with the project code all set, the files and folder structure of the project should be something like this:
flask_vlm_app/ # Project root
│── uploads/ # Folder for uploaded images (created dynamically)
│── outputs/ # Folder for generated documents (created dynamically)
│── templates/ # HTML template directory
│ └── index.html # Webpage for image upload
│── app.py # Main Flask application
│── image_processor.py # Processes image using a vision-language model
│── document_generator.py # Creates .docx file from extracted text
│── email_sender.py # Sends extracted text document via email
│── requirements.txt # List of required dependencies (if needed)
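The document_generator.py file is listed here but not shown in this walkthrough. As a rough idea of what it might contain, here is a minimal sketch using python-docx; it is an assumption, not the file from the repository. The `to_paragraphs` helper and the "Extracted Text" heading are hypothetical, and only `create_doc(extracted_text, doc_path)` matches the call made in app.py. Because process_image returns a list of strings, the sketch accepts either a string or a list.

```python
def to_paragraphs(extracted_text):
    """Normalize model output (a string or a list of strings) into non-empty lines."""
    if isinstance(extracted_text, str):
        extracted_text = [extracted_text]
    paragraphs = []
    for chunk in extracted_text:
        for line in str(chunk).splitlines():
            line = line.strip()
            if line:
                paragraphs.append(line)
    return paragraphs


def create_doc(extracted_text, doc_path):
    """Write the extracted text to a .docx file, one paragraph per line."""
    from docx import Document  # provided by the python-docx package

    document = Document()
    document.add_heading("Extracted Text", level=1)
    for paragraph in to_paragraphs(extracted_text):
        document.add_paragraph(paragraph)
    document.save(doc_path)
```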
Make sure that all the required libraries are installed (inside the virtual environment, if one is used, after activating it) with pip install -r requirements.txt or pip install <pkg> for each library.
Flask==3.0.3
torch==2.6.0
transformers
pillow==11.1.0
python-docx==1.1.2
Let's begin! Start the Flask app server with the command python app.py inside the project folder. You should be able to view a page as in the below screenshot after going to localhost:5000 (sometimes the port might be 8000, depending on your Flask app config) in the web browser.
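If you want the port to be configurable rather than fixed, one option is to read it from the environment. This is a small sketch under assumptions: `APP_PORT` is a hypothetical variable name and `resolve_port` a hypothetical helper; neither exists in the generated project.

```python
import os


def resolve_port(env=os.environ, default=5000):
    """Return the port from the hypothetical APP_PORT variable, else the default."""
    raw = env.get("APP_PORT", "")
    try:
        port = int(raw)
    except ValueError:
        # Unset or non-numeric value: fall back to the default port.
        return default
    # Reject values outside the valid TCP port range.
    return port if 1 <= port <= 65535 else default
```

In app.py, the last line would then become app.run(debug=True, port=resolve_port()).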
You can then upload an image file of an invoice and click on Upload. Model inference and sending the email should take a couple of minutes.
Voila! You should have received an email (provided the recipient email address is yours 😁) with the parsed content as a docx attachment.
This shows how easily Generative AI tools can be used to improve productivity with less effort. Feel free to react 😜 and leave feedback in the comments. Thanks.
Github Link : https://github.com/amrs-tech/invoice-reader
Happy Learning!