A recent project concept: a serverless application powered by Docling's document ingestion and preparation capabilities.
Introduction
As part of my professional activities, I am often engaged in helping our business partners gain hands-on technical experience with the technologies and tools we recommend to them. What follows is part of a larger project in which we provided our partner with code samples to accelerate the first phase of their project.
> The code provided below is meant to be used as a starter or helper and is then adapted to the real use case. It should not be considered a finished, end-to-end project, but a project starter/helper.
The main idea is:
- Users upload documents through an application onto a cloud file system.
- A serverless job application using Docling fetches the documents, prepares them for later use, and drops the results in another cloud file system.

The serverless application, deployed on IBM Code Engine, fetches its source code and updates from a private GitHub repository.
What is Docling and what is it used for?
Docling simplifies document processing: it parses diverse formats, including advanced PDF understanding, and provides seamless integrations with the gen AI ecosystem. A minimal usage sketch follows the feature list below.
Features
- Parsing of multiple document formats incl. PDF, DOCX, XLSX, HTML, images, and more
- Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
- Unified, expressive DoclingDocument representation format
- Various export formats and options, including Markdown, HTML, and lossless JSON
- Local execution capabilities for sensitive data and air-gapped environments
- Plug-and-play integrations incl. LangChain, LlamaIndex, CrewAI & Haystack for agentic AI
- Extensive OCR support for scanned PDFs and images
- Simple and convenient CLI
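Before diving into the project code, here is a minimal, self-contained sketch of the Docling API this post relies on. It uses the same DocumentConverter shown later; "sample.pdf" is just a placeholder file name:

from docling.document_converter import DocumentConverter

# "sample.pdf" is a placeholder; any supported format (PDF, DOCX, HTML, images, ...) works.
converter = DocumentConverter()
result = converter.convert("sample.pdf")

# Export the unified DoclingDocument to Markdown (other exports include dict/JSON and plain text).
print(result.document.export_to_markdown())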
The file uploading application
I proposed two simple applications to upload and store files. At first, I wrote an application using FastAPI.
File uploading using FastAPI
import os
from fastapi import FastAPI, Request, File, UploadFile, HTTPException
from fastapi.responses import HTMLResponse, RedirectResponse
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="templates")

UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)

def get_uploaded_files():
    try:
        files = os.listdir(UPLOAD_DIR)
        files.sort()
        return files
    except FileNotFoundError:
        return []

@app.get("/", response_class=HTMLResponse)
async def read_root(request: Request):
    uploaded_files = get_uploaded_files()
    return templates.TemplateResponse("index.html", {"request": request, "filename": None, "message": None, "uploaded_files": uploaded_files})

@app.post("/upload", response_class=HTMLResponse)
async def upload_file(request: Request, file: UploadFile = File(...)):
    filename = file.filename
    filepath = os.path.join(UPLOAD_DIR, filename)
    if os.path.exists(filepath):
        return templates.TemplateResponse("confirm.html", {"request": request, "filename": filename})
    else:
        with open(filepath, "wb") as f:
            contents = await file.read()
            f.write(contents)
        uploaded_files = get_uploaded_files()  # Refresh file list
        return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": f"File '{filename}' uploaded successfully.", "uploaded_files": uploaded_files})

@app.post("/confirm_replace", response_class=HTMLResponse)
async def confirm_replace(request: Request):
    form = await request.form()
    filename = form.get("filename")
    replace = form.get("replace")
    if not filename or not replace:
        return templates.TemplateResponse("index.html", {"request": request, "message": "Missing filename or replace value.", "uploaded_files": get_uploaded_files()})
    filepath = os.path.join(UPLOAD_DIR, filename)
    if replace == "yes":
        try:
            file = form.get("file")  # The replacement file arrives in the same multipart form
            if not file:
                return templates.TemplateResponse("index.html", {"request": request, "message": "No file uploaded for replacement.", "uploaded_files": get_uploaded_files()})
            contents = await file.read()
            with open(filepath, "wb") as f:
                f.write(contents)
            uploaded_files = get_uploaded_files()  # Refresh file list
            return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": f"File '{filename}' replaced successfully.", "uploaded_files": uploaded_files})
        except Exception as e:
            return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": f"Error replacing file: {e}", "uploaded_files": get_uploaded_files()})
    elif replace == "no":
        uploaded_files = get_uploaded_files()  # Refresh file list
        return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": f"No action taken for '{filename}'. File already exists.", "uploaded_files": uploaded_files})
    else:
        return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": "Invalid response.", "uploaded_files": get_uploaded_files()})

@app.post("/delete", response_class=RedirectResponse)
async def delete_files(request: Request):
    form = await request.form()
    files_to_delete = form.getlist("files")
    if files_to_delete:
        for file_to_delete in files_to_delete:
            filepath = os.path.join(UPLOAD_DIR, file_to_delete)
            try:
                os.remove(filepath)
            except Exception as e:
                print(f"Error deleting {file_to_delete}: {e}")
        return RedirectResponse("/", status_code=303)
    return RedirectResponse("/", status_code=303)
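To quickly exercise these endpoints without a browser, a small test sketch like the one below can help. It assumes the code above is saved as main.py (the Dockerfile below launches main:app), that the test is run from the project root so the templates/ directory is found, and that the httpx test dependency used by FastAPI's TestClient is installed:

from fastapi.testclient import TestClient

from main import app  # assumes the FastAPI code above lives in main.py

client = TestClient(app)

def test_upload_roundtrip():
    # Upload a small in-memory file through the /upload endpoint.
    response = client.post("/upload", files={"file": ("hello.txt", b"hello docling", "text/plain")})
    assert response.status_code == 200
    # The rendered index.html template lists the uploaded file.
    assert "hello.txt" in response.text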
Index.html
<!-- templates/index.html -->
<!DOCTYPE html>
<html>
<head>
<title>File Upload</title>
<style>
body {
font-family: sans-serif;
background-color: #f4f4f4;
color: #333;
margin: 20px;
display: flex;
flex-direction: column;
align-items: center; /* Center content horizontally */
}
h1 {
color: #007bff; /* Blue heading */
margin-bottom: 20px;
}
form {
background-color: #fff;
padding: 20px;
border-radius: 8px;
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
margin-bottom: 20px;
width: 400px; /* Set a fixed width for the form */
}
input[type="file"] {
margin-bottom: 10px;
}
input[type="submit"] {
background-color: #007bff;
color: #fff;
padding: 10px 15px;
border: none;
border-radius: 4px;
cursor: pointer;
}
input[type="submit"]:hover {
background-color: #0056b3;
}
h2 {
margin-top: 20px;
color: #343a40; /* Darker heading */
}
ul {
list-style: none;
padding: 0;
}
li {
margin-bottom: 5px;
display: flex; /* Align checkbox and label */
align-items: center;
}
input[type="checkbox"] {
margin-right: 5px;
}
p {
color: #d9534f; /* Red message for errors or feedback */
margin-top: 10px;
}
.uploaded-file-list { /* Style the uploaded files list */
background-color: #fff;
padding: 15px;
border-radius: 8px;
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
width: 400px; /* Match the form width */
}
</style>
<script>
function validateForm() {
const fileInput = document.querySelector('input[type="file"]');
if (fileInput.files.length === 0) {
alert("No files selected!");
return false; // Prevent form submission
}
return true; // Allow form submission
}
function validateDeleteForm() {
const checkboxes = document.querySelectorAll('input[type="checkbox"]:checked');
if (checkboxes.length === 0) {
alert("No files selected for deletion!");
return false;
}
return true;
}
</script>
</head>
<body>
<h1>Upload a File</h1>
<form action="/upload" method="post" enctype="multipart/form-data" onsubmit="return validateForm();">
<input type="file" name="file">
<input type="submit" value="Upload">
</form>
{% if filename %}
<h2>Uploaded File: {{ filename }}</h2>
{% endif %}
{% if message %}
<p>{{ message }}</p>
{% endif %}
<div class="uploaded-file-list"> <h2>Uploaded Files:</h2>
<form action="/delete" method="post">
<ul>
{% for file in uploaded_files %}
<li>
<input type="checkbox" name="files" value="{{ file }}" id="{{ file }}">
<label for="{{ file }}">{{ file }}</label>
</li>
{% endfor %}
</ul>
<input type="submit" value="Delete Selected">
</form>
</div>
</body>
</html>
Confirm.html
<!-- templates/confirm.html -->
<!DOCTYPE html>
<html>
<head>
<title>Confirm Replace</title>
<style>
body {
font-family: sans-serif;
background-color: #f4f4f4;
color: #333;
margin: 20px;
display: flex;
flex-direction: column;
align-items: center; /* Center content horizontally */
}
h1 {
color: #d9534f; /* Red heading for warning */
margin-bottom: 20px;
}
p {
margin-bottom: 20px;
}
form {
background-color: #fff;
padding: 20px;
border-radius: 8px;
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
width: 400px; /* Set a fixed width for the form */
}
input[type="file"] {
margin-bottom: 10px;
width: calc(100% - 10px); /* Ensures the file input doesn't overflow */
}
label {
margin-right: 10px; /* Space between radio button and label */
}
input[type="radio"] {
margin-right: 5px;
}
input[type="submit"] {
background-color: #007bff;
color: #fff;
padding: 10px 15px;
border: none;
border-radius: 4px;
cursor: pointer;
margin-top: 10px; /* Space above the button */
}
input[type="submit"]:hover {
background-color: #0056b3;
}
</style>
</head>
<body>
<h1>File Already Exists</h1>
<p>The file '{{ filename }}' already exists. Do you want to replace it?</p>
<form action="/confirm_replace" method="post" enctype="multipart/form-data">
<input type="hidden" name="filename" value="{{ filename }}">
<input type="file" name="file" required><br> <input type="radio" id="yes" name="replace" value="yes" required>
<label for="yes">Yes</label>
<input type="radio" id="no" name="replace" value="no">
<label for="no">No</label><br>
<input type="submit" value="Confirm">
</form>
</body>
</html>
The Dockerfile that builds an image for the application:
# Use a Python base image
FROM python:3.11-slim-buster
# Set the working directory inside the container
WORKDIR /app
# Copy the requirements file (if you have one)
# --- Create this file if you use external packages
COPY requirements.txt .
# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt # Install from requirements.txt
# Or install dependencies directly (if you don't have a requirements.txt file)
# RUN pip install --no-cache-dir fastapi uvicorn Jinja2 python-multipart
# Copy the application code
COPY . .
# Expose the port that Uvicorn will run on
EXPOSE 8000
# Start the Uvicorn server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
And a sample YAML file for the deployment part (which does not represent the actual cluster):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-fastapi-deployment
  namespace: files # Deploy to the "files" namespace
spec:
  replicas: 3 # Number of pods (adjust as needed)
  selector:
    matchLabels:
      app: my-fastapi-app
  template:
    metadata:
      labels:
        app: my-fastapi-app
    spec:
      containers:
      - name: my-fastapi-container
        image: my-fastapi-image:latest # Replace with your Docker image name and tag
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: uploads-volume
          mountPath: /app/uploads # Mount the volume to the uploads directory
        resources: # Resource requests and limits
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
      volumes:
      - name: uploads-volume
        persistentVolumeClaim: # Use a PersistentVolumeClaim for persistent storage
          claimName: my-fastapi-pvc # Create this PVC separately
---
apiVersion: v1
kind: Service
metadata:
  name: my-fastapi-service
  namespace: files
spec:
  selector:
    app: my-fastapi-app
  ports:
  - protocol: TCP
    port: 80 # External port
    targetPort: 8000 # Container port
  type: LoadBalancer # Use a LoadBalancer to expose the service externally
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-fastapi-pvc
  namespace: files
spec:
  accessModes: [ "ReadWriteOnce" ] # Or ReadWriteMany (needed if the 3 replicas are scheduled on different nodes)
  resources:
    requests:
      storage: 1Gi # Adjust storage size as needed
File uploading using Streamlit
However, a framework like Streamlit seemed handier and easier to deploy as a containerized application in a cluster-based deployment.
import os
import streamlit as st
from pathlib import Path

UPLOAD_DIR = Path("uploads")  # Use Path for better path handling
UPLOAD_DIR.mkdir(exist_ok=True)  # Create uploads directory if it doesn't exist

def get_uploaded_files():
    return sorted([f.name for f in UPLOAD_DIR.iterdir() if f.is_file()])

st.title("File Upload and Management")

uploaded_file = st.file_uploader("Choose a file", type=None)  # Allow any file type

if uploaded_file is not None:
    filepath = UPLOAD_DIR / uploaded_file.name
    if filepath.exists():
        replace = st.radio(f"File '{uploaded_file.name}' already exists. Replace?", ("Yes", "No"))
        if replace == "Yes":
            with open(filepath, "wb") as f:
                f.write(uploaded_file.getbuffer())
            st.success(f"File '{uploaded_file.name}' replaced successfully.")
        else:
            st.info(f"No action taken for '{uploaded_file.name}'. File already exists.")
    else:
        with open(filepath, "wb") as f:
            f.write(uploaded_file.getbuffer())
        st.success(f"File '{uploaded_file.name}' uploaded successfully.")

st.subheader("Uploaded Files:")
uploaded_files = get_uploaded_files()
if uploaded_files:
    for file in uploaded_files:
        if st.checkbox(file):  # Checkbox for each file
            if st.button(f"Delete {file}"):  # Delete button next to checkbox
                try:
                    (UPLOAD_DIR / file).unlink()  # Delete the file
                    st.success(f"File '{file}' deleted successfully.")
                    st.experimental_rerun()  # Refresh the app (st.rerun() in newer Streamlit versions)
                except Exception as e:
                    st.error(f"Error deleting '{file}': {e}")
else:
    st.info("No files uploaded yet.")
Building a container for the code above!
# Use a Python base image
FROM python:3.11-slim-buster
# Set the working directory
WORKDIR /app
# Copy requirements.txt (recommended)
COPY requirements.txt .
# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Expose the Streamlit port (default is 8501)
EXPOSE 8501
# Run Streamlit
CMD ["streamlit", "run", "main_st.py"] # Replace app.py with your Streamlit file name
Sample Docling application using the Streamlit framework
Below is starter code used as a helper for a Docling web-based application.
import json
import logging
import time
from pathlib import Path
import os
import shutil  # For copying directories

import streamlit as st

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)

# Define the mount paths
KUBERNETES_VOLUME_MOUNT_PATH = "/app/uploads"
SCRATCH_VOLUME_MOUNT_PATH = "/app/scratch"

def process_pdf(input_doc_path, scratch_dir, pipeline_options):
    """Processes a single PDF file."""
    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    try:
        conv_result = doc_converter.convert(input_doc_path)
        doc_filename = conv_result.input.file.stem
        with (scratch_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
            json.dump(conv_result.document.export_to_dict(), fp)
        with (scratch_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
            fp.write(conv_result.document.export_to_text())
        with (scratch_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
            fp.write(conv_result.document.export_to_markdown())
        with (scratch_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
            fp.write(conv_result.document.export_to_document_tokens())
        return True  # Indicate success
    except Exception as e:
        st.error(f"Error processing {input_doc_path}: {e}")
        return False  # Indicate failure

def main():
    logging.basicConfig(level=logging.INFO)
    st.title("Docling Document Conversion")

    # Kubernetes volume directory
    kubernetes_volume_dir = Path(KUBERNETES_VOLUME_MOUNT_PATH)
    if not kubernetes_volume_dir.exists():
        st.error(f"Kubernetes volume not found at {KUBERNETES_VOLUME_MOUNT_PATH}")
        return

    # Scratch directory
    scratch_dir = Path(SCRATCH_VOLUME_MOUNT_PATH)
    scratch_dir.mkdir(parents=True, exist_ok=True)

    # ... (pipeline options, OCR language, number of threads - same as before)
    # pipeline_options must be defined here; an illustrative configuration follows.
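    # NOTE: the block below is an illustrative assumption, adapted from the standard
    # Docling pipeline-options examples; tune OCR, table structure, and thread count
    # to the real use case.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True  # Run OCR on scanned pages
    pipeline_options.do_table_structure = True  # Recover table structure
    pipeline_options.accelerator_options = AcceleratorOptions(
        num_threads=4, device=AcceleratorDevice.AUTO
    )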
    if st.button("Convert Documents in Volume"):
        with st.spinner("Converting documents..."):
            start_time = time.time()
            success_count = 0
            fail_count = 0

            for file_path in kubernetes_volume_dir.rglob("*.pdf"):  # Recursive search for PDFs
                if process_pdf(file_path, scratch_dir, pipeline_options):
                    success_count += 1
                else:
                    fail_count += 1

            end_time = time.time() - start_time
            st.write(f"Conversion completed in {end_time:.2f} seconds.")
            st.write(f"Successfully converted {success_count} PDFs.")
            st.write(f"Failed to convert {fail_count} PDFs.")
            st.write(f"Files saved to {SCRATCH_VOLUME_MOUNT_PATH}")

if __name__ == "__main__":
    main()
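Since the target deployment is a serverless job on IBM Code Engine rather than an interactive UI, the same conversion logic can also be packaged as a headless batch script. The sketch below is one possible shape for such a job entry point, not part of the original project; INPUT_DIR and OUTPUT_DIR are hypothetical environment variable names that the job configuration would have to provide:

import os
import sys
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

def run_job():
    # Hypothetical environment variables; defaults mirror the mount paths used above.
    input_dir = Path(os.environ.get("INPUT_DIR", "/app/uploads"))
    output_dir = Path(os.environ.get("OUTPUT_DIR", "/app/scratch"))
    output_dir.mkdir(parents=True, exist_ok=True)

    pipeline_options = PdfPipelineOptions()  # defaults; tune OCR/table options as needed
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )

    failures = 0
    for pdf_path in input_dir.rglob("*.pdf"):
        try:
            result = converter.convert(pdf_path)
            # Write the Markdown export, mirroring the Streamlit application above.
            stem = result.input.file.stem
            (output_dir / f"{stem}.md").write_text(result.document.export_to_markdown(), encoding="utf-8")
        except Exception as exc:
            print(f"Failed to convert {pdf_path}: {exc}")
            failures += 1

    sys.exit(1 if failures else 0)  # a non-zero exit code marks the job run as failed

if __name__ == "__main__":
    run_job()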
And a Dockerfile to build an image.
FROM python:3.11-slim-buster
WORKDIR /app
# Create a requirements.txt with docling and its dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["streamlit", "run", "Docling_st.py"]
A YAML helper in case the Docling application is deployed inside a cluster later (for the time being it is a serverless test application):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: docling-deployment
  namespace: files # Deploy to the same "files" namespace
spec:
  replicas: 1 # Adjust as needed
  selector:
    matchLabels:
      app: docling-app
  template:
    metadata:
      labels:
        app: docling-app
    spec:
      containers:
      - name: docling-container
        image: docling-image:latest # Replace with your Docling Docker image
        ports:
        - containerPort: 8501 # Streamlit default port
        volumeMounts:
        - name: scratch-volume
          mountPath: /app/scratch # Mount the scratch volume
        - name: uploads-volume # Mount the existing uploads volume
          mountPath: /app/uploads # Or another suitable path
        resources:
          requests:
            cpu: 200m # Adjust as needed
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
      volumes:
      - name: scratch-volume
        persistentVolumeClaim:
          claimName: docling-pvc # Create this PVC separately
      - name: uploads-volume # Use the existing uploads volume
        persistentVolumeClaim:
          claimName: my-fastapi-pvc # The existing PVC
---
apiVersion: v1
kind: Service
metadata:
  name: docling-service
  namespace: files
spec:
  selector:
    app: docling-app
  ports:
  - protocol: TCP
    port: 8501 # External port
    targetPort: 8501 # Container port
  type: LoadBalancer # Or ClusterIP if internal access is sufficient
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: docling-pvc
  namespace: files
spec:
  accessModes: [ "ReadWriteOnce" ] # Or ReadWriteMany if needed
  resources:
    requests:
      storage: 1Gi # Adjust storage size as needed
Conclusion
The code samples provided here are building blocks for a web-based application that prepares a file repository/volume holding document types such as images, PDFs, and Word files. These documents are ingested and converted to Markdown files by Docling, which makes them ready for a generative AI application.
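As a small illustration of that last step, the Markdown files dropped into the scratch volume can be split into heading-based chunks before being embedded by a downstream gen AI pipeline. This is only a standard-library sketch; a real pipeline would more likely use Docling's own chunking utilities or one of the framework integrations mentioned above. The scratch path matches the mount used earlier:

from pathlib import Path

SCRATCH_DIR = Path("/app/scratch")  # same output volume as the Docling application above

def markdown_chunks(md_text: str):
    """Split a Markdown export into rough chunks at top- and second-level headings."""
    chunk, chunks = [], []
    for line in md_text.splitlines():
        if line.startswith(("# ", "## ")) and chunk:
            chunks.append("\n".join(chunk).strip())
            chunk = []
        chunk.append(line)
    if chunk:
        chunks.append("\n".join(chunk).strip())
    return [c for c in chunks if c]

for md_file in SCRATCH_DIR.glob("*.md"):
    for i, chunk in enumerate(markdown_chunks(md_file.read_text(encoding="utf-8"))):
        print(f"{md_file.name} chunk {i}: {len(chunk)} characters")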
Again, this is not an end-to-end project, but portions of code to be enhanced, industrialized and deployed.
Thanks for reading!
Useful links
- Docling: https://github.com/DS4SD/docling
- Docling documentation: https://ds4sd.github.io/docling/