DEV Community

Alain Airom

Yet another document ingestion project with Docling and IBM Cloud Code Engine (serverless)

A recent project concept: a serverless application powered by Docling's document ingestion and preparation capabilities.

Introduction

As part of my professional activities, I often help our business partners gain hands-on technical experience with the technologies and tools we recommend to them. What follows is part of a larger project in which we provided our partner with code samples to accelerate the first phase of their project.

> The code provided below is meant to be used as a starter or helper and must be adapted to the real use case. It should not be considered a finished or end-to-end project.

The main idea is:

  • An application lets users upload documents to a cloud file system.
  • A serverless job application using Docling fetches the documents, prepares them for later use, and drops the results in another cloud file system.

The serverless application, deployed on IBM Cloud Code Engine, fetches its source code and updates from a private GitHub repository.
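
The flow described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function name `run_ingestion_job`, the two directories, and the `fake_convert` stand-in are hypothetical and are not part of the partner project. In the real job the conversion step would be the Docling call.

```python
from pathlib import Path
from typing import Callable


def run_ingestion_job(
    input_dir: Path,
    output_dir: Path,
    convert: Callable[[Path], str],
) -> list[Path]:
    """Convert every not-yet-processed file in input_dir and write the result to output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(input_dir.iterdir()):
        if not source.is_file():
            continue
        target = output_dir / f"{source.stem}.md"
        if target.exists():
            continue  # Already prepared by an earlier run of the job
        target.write_text(convert(source), encoding="utf-8")
        written.append(target)
    return written


def fake_convert(path: Path) -> str:
    # Hypothetical stand-in for the Docling conversion step
    return f"# {path.name}\n"
```

Skipping files whose output already exists keeps repeated job runs cheap, which suits a job that is triggered on a schedule or on demand.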

What is Docling and what is it used for

Docling simplifies document processing, parsing diverse formats (including advanced PDF understanding) and providing seamless integrations with the gen AI ecosystem.

Features

  • 🗂️ Parsing of multiple document formats incl. PDF, DOCX, XLSX, HTML, images, and more
  • 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
  • 🧬 Unified, expressive DoclingDocument representation format
  • ↪️ Various export formats and options, including Markdown, HTML, and lossless JSON
  • 🔒 Local execution capabilities for sensitive data and air-gapped environments
  • 🤖 Plug-and-play integrations incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
  • 🔍 Extensive OCR support for scanned PDFs and images
  • 💻 Simple and convenient CLI
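
To make the feature list concrete, here is a minimal sketch of a Docling conversion, following the basic usage pattern of the library (`DocumentConverter().convert(...)` followed by `export_to_markdown()`). The wrapper function `convert_to_markdown` is a hypothetical name introduced for illustration, and the import is done lazily so that the module still loads on a machine where `docling` is not installed.

```python
def convert_to_markdown(source: str) -> str:
    """Convert one document (local path or URL) to Markdown with Docling."""
    # Imported lazily so this module loads even where docling is not installed
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()
```

Calling `convert_to_markdown("report.pdf")` (a placeholder file name) should return the Markdown text of the document, provided `docling` is installed and the file exists.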

The file uploading application

I proposed two simple applications to upload and store files. At first, I wrote an application using FastAPI.

File uploading using FastAPI

import os
from fastapi import FastAPI, Request, File, UploadFile, HTTPException
from fastapi.responses import HTMLResponse, RedirectResponse
from fastapi.templating import Jinja2Templates

app = FastAPI()

templates = Jinja2Templates(directory="templates")
UPLOAD_DIR = "uploads"

os.makedirs(UPLOAD_DIR, exist_ok=True)

def get_uploaded_files():
    try:
        files = os.listdir(UPLOAD_DIR)
        files.sort()
        return files
    except FileNotFoundError:
        return []

@app.get("/", response_class=HTMLResponse)
async def read_root(request: Request):
    uploaded_files = get_uploaded_files()
    return templates.TemplateResponse("index.html", {"request": request, "filename": None, "message": None, "uploaded_files": uploaded_files})

@app.post("/upload", response_class=HTMLResponse)
async def upload_file(request: Request, file: UploadFile = File(...)):
    filename = os.path.basename(file.filename)  # Drop any directory components sent by the client
    filepath = os.path.join(UPLOAD_DIR, filename)

    if os.path.exists(filepath):
        return templates.TemplateResponse("confirm.html", {"request": request, "filename": filename})
    else:
        with open(filepath, "wb") as f:
            contents = await file.read()
            f.write(contents)
        uploaded_files = get_uploaded_files()  # Refresh file list
        return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": f"File '{filename}' uploaded successfully.", "uploaded_files": uploaded_files})

@app.post("/confirm_replace", response_class=HTMLResponse)
async def confirm_replace(request: Request):
    form = await request.form()
    filename = form.get("filename")
    replace = form.get("replace")

    if not filename or not replace:
        return templates.TemplateResponse("index.html", {"request": request, "message": "Missing filename or replace value."})

    filepath = os.path.join(UPLOAD_DIR, os.path.basename(filename))  # Ignore any directory components

    if replace == "yes":
        try:
            # request.form() already holds uploaded files; Request has no files() method
            file = form.get("file")
            if not file:
                return templates.TemplateResponse("index.html", {"request": request, "message": "No file uploaded for replacement."})
            contents = await file.read()
            with open(filepath, "wb") as f:
                f.write(contents)
            uploaded_files = get_uploaded_files() # Refresh file list
            return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": f"File '{filename}' replaced successfully.", "uploaded_files": uploaded_files})
        except Exception as e:
            return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": f"Error replacing file: {e}"})

    elif replace == "no":
        uploaded_files = get_uploaded_files() # Refresh file list
        return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": f"No action taken for '{filename}'. File already exists.", "uploaded_files": uploaded_files})
    else:
        return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": "Invalid response."})


@app.post("/delete", response_class=RedirectResponse)
async def delete_files(request: Request):
    form = await request.form()
    files_to_delete = form.getlist("files")

    if files_to_delete:
        for file_to_delete in files_to_delete:
            filepath = os.path.join(UPLOAD_DIR, file_to_delete)
            try:
                os.remove(filepath)
            except Exception as e:
                print(f"Error deleting {file_to_delete}: {e}")

        return RedirectResponse("/", status_code=303)

    return RedirectResponse("/", status_code=303)
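
Because the upload and delete handlers above build paths from client-supplied names, it can help to funnel every name through one small helper. The sketch below is an assumption-laden illustration, not part of the original application: `safe_upload_path` is a hypothetical name, and it simply keeps the last path component so that a name such as `../../etc/passwd` cannot escape the upload directory.

```python
import os
from pathlib import Path

UPLOAD_DIR = Path("uploads")


def safe_upload_path(filename: str, upload_dir: Path = UPLOAD_DIR) -> Path:
    """Return a path inside upload_dir, discarding any directory components."""
    # Normalise Windows separators, then keep only the final path component
    name = os.path.basename(filename.replace("\\", "/"))
    if name in ("", ".", ".."):
        raise ValueError(f"Invalid filename: {filename!r}")
    return upload_dir / name
```

The same helper could be reused in the `/upload`, `/confirm_replace` and `/delete` handlers in place of the direct `os.path.join(UPLOAD_DIR, ...)` calls.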

Index.html

<!-- index.html -->
<!DOCTYPE html>
<html>
<head>
    <title>File Upload</title>
    <style>
        body {
            font-family: sans-serif;
            background-color: #f4f4f4;
            color: #333;
            margin: 20px;
            display: flex;
            flex-direction: column;
            align-items: center; /* Center content horizontally */
        }

        h1 {
            color: #007bff; /* Blue heading */
            margin-bottom: 20px;
        }

        form {
            background-color: #fff;
            padding: 20px;
            border-radius: 8px;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
            margin-bottom: 20px;
            width: 400px; /* Set a fixed width for the form */
        }

        input[type="file"] {
            margin-bottom: 10px;
        }

        input[type="submit"] {
            background-color: #007bff;
            color: #fff;
            padding: 10px 15px;
            border: none;
            border-radius: 4px;
            cursor: pointer;
        }

        input[type="submit"]:hover {
            background-color: #0056b3;
        }

        h2 {
            margin-top: 20px;
            color: #343a40; /* Darker heading */
        }

        ul {
            list-style: none;
            padding: 0;
        }

        li {
            margin-bottom: 5px;
            display: flex; /* Align checkbox and label */
            align-items: center;
        }

        input[type="checkbox"] {
            margin-right: 5px;
        }

        p {
            color: #d9534f; /* Red message for errors or feedback */
            margin-top: 10px;
        }

        .uploaded-file-list { /* Style the uploaded files list */
            background-color: #fff;
            padding: 15px;
            border-radius: 8px;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
            width: 400px; /* Match the form width */
        }

    </style>
    <script>
        function validateForm() {
            const fileInput = document.querySelector('input[type="file"]');
            if (fileInput.files.length === 0) {
                alert("No files selected!");
                return false; // Prevent form submission
            }
            return true; // Allow form submission
        }

        function validateDeleteForm() {
            const checkboxes = document.querySelectorAll('input[type="checkbox"]:checked');
            if (checkboxes.length === 0) {
                alert("No files selected for deletion!");
                return false;
            }
            return true;
        }
    </script>

</head>
<body>
    <h1>Upload a File</h1>
    <form action="/upload" method="post" enctype="multipart/form-data" onsubmit="return validateForm();">
        <input type="file" name="file">
        <input type="submit" value="Upload">
    </form>

    {% if filename %}
    <h2>Uploaded File: {{ filename }}</h2>
    {% endif %}

    {% if message %}
    <p>{{ message }}</p>
    {% endif %}

    <div class="uploaded-file-list">  <h2>Uploaded Files:</h2>
        <form action="/delete" method="post">
            <ul>
                {% for file in uploaded_files %}
                <li>
                    <input type="checkbox" name="files" value="{{ file }}" id="{{ file }}">
                    <label for="{{ file }}">{{ file }}</label>
                </li>
                {% endfor %}
            </ul>
            <input type="submit" value="Delete Selected">
        </form>
    </div>

</body>
</html>

Confirm.html

<!-- confirm.html -->
<!DOCTYPE html>
<html>
<head>
    <title>Confirm Replace</title>
    <style>
        body {
            font-family: sans-serif;
            background-color: #f4f4f4;
            color: #333;
            margin: 20px;
            display: flex;
            flex-direction: column;
            align-items: center; /* Center content horizontally */
        }

        h1 {
            color: #d9534f; /* Red heading for warning */
            margin-bottom: 20px;
        }

        p {
            margin-bottom: 20px;
        }

        form {
            background-color: #fff;
            padding: 20px;
            border-radius: 8px;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
            width: 400px; /* Set a fixed width for the form */
        }

        input[type="file"] {
            margin-bottom: 10px;
            width: calc(100% - 10px); /* Ensures the file input doesn't overflow */
        }

        label {
            margin-right: 10px; /* Space between radio button and label */
        }

        input[type="radio"] {
            margin-right: 5px;
        }


        input[type="submit"] {
            background-color: #007bff;
            color: #fff;
            padding: 10px 15px;
            border: none;
            border-radius: 4px;
            cursor: pointer;
            margin-top: 10px; /* Space above the button */
        }

        input[type="submit"]:hover {
            background-color: #0056b3;
        }
    </style>
</head>
<body>
    <h1>File Already Exists</h1>
    <p>The file '{{ filename }}' already exists. Do you want to replace it?</p>
    <form action="/confirm_replace" method="post" enctype="multipart/form-data">
        <input type="hidden" name="filename" value="{{ filename }}">
        <input type="file" name="file" required><br>
        <input type="radio" id="yes" name="replace" value="yes" required>
        <label for="yes">Yes</label>
        <input type="radio" id="no" name="replace" value="no">
        <label for="no">No</label><br>
        <input type="submit" value="Confirm">
    </form>
</body>
</html>

The Dockerfile which builds an image for the application.

# Use a Python base image
FROM python:3.11-slim-buster  

# Set the working directory inside the container
WORKDIR /app

# Copy the requirements file (if you have one)
# --- Create this file if you use external packages
COPY requirements.txt .  

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt  # Install from requirements.txt

# Or install dependencies directly (if you don't have a requirements.txt file)
# RUN pip install --no-cache-dir fastapi uvicorn Jinja2 python-multipart

# Copy the application code
COPY . .

# Expose the port that Uvicorn will run on
EXPOSE 8000

# Start the Uvicorn server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

And a sample YAML file for the deployment part (which does not represent the actual cluster).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-fastapi-deployment
  namespace: files  # Deploy to the "files" namespace
spec:
  replicas: 3 # Number of pods (adjust as needed); pods on different nodes can only share the PVC with ReadWriteMany
  selector:
    matchLabels:
      app: my-fastapi-app
  template:
    metadata:
      labels:
        app: my-fastapi-app
    spec:
      containers:
      - name: my-fastapi-container
        image: my-fastapi-image:latest # Replace with your Docker image name and tag
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: uploads-volume
          mountPath: /app/uploads # Mount the volume to the uploads directory
        resources: # Resource requests and limits
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
      volumes:
      - name: uploads-volume
        persistentVolumeClaim: # Use a PersistentVolumeClaim for persistent storage
          claimName: my-fastapi-pvc # Create this PVC separately

---

apiVersion: v1
kind: Service
metadata:
  name: my-fastapi-service
  namespace: files
spec:
  selector:
    app: my-fastapi-app
  ports:
  - protocol: TCP
    port: 80 # External port
    targetPort: 8000 # Container port
  type: LoadBalancer # Use a LoadBalancer to expose the service externally

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-fastapi-pvc
  namespace: files
spec:
  accessModes: [ "ReadWriteOnce" ] # Or ReadWriteMany if needed
  resources:
    requests:
      storage: 1Gi # Adjust storage size as needed

File uploading using Streamlit

However, a framework like Streamlit seemed handier and easier to deploy as a containerized application in a cluster-based deployment.

import os
import streamlit as st
from pathlib import Path

UPLOAD_DIR = Path("uploads")  # Use Path for better path handling
UPLOAD_DIR.mkdir(exist_ok=True)  # Create uploads directory if it doesn't exist

def get_uploaded_files():
    return sorted([f.name for f in UPLOAD_DIR.iterdir()])

st.title("File Upload and Management")

uploaded_file = st.file_uploader("Choose a file", type=None)  # Allow any file type

if uploaded_file is not None:
    filepath = UPLOAD_DIR / uploaded_file.name

    if filepath.exists():
        replace = st.radio(f"File '{uploaded_file.name}' already exists. Replace?", ("Yes", "No"))
        if replace == "Yes":
            with open(filepath, "wb") as f:
                f.write(uploaded_file.getbuffer())
            st.success(f"File '{uploaded_file.name}' replaced successfully.")
        else:
            st.info(f"No action taken for '{uploaded_file.name}'. File already exists.")
    else:
        with open(filepath, "wb") as f:
            f.write(uploaded_file.getbuffer())
        st.success(f"File '{uploaded_file.name}' uploaded successfully.")

st.subheader("Uploaded Files:")
uploaded_files = get_uploaded_files()
if uploaded_files:
    for file in uploaded_files:
        if st.checkbox(file):  # Checkbox for each file
            if st.button(f"Delete {file}"):  # Delete button next to checkbox
                try:
                    (UPLOAD_DIR / file).unlink() # Delete the file
                    st.rerun()  # Restart the script to refresh the list; replaces the deprecated st.experimental_rerun()
                except Exception as e:
                    st.error(f"Error deleting '{file}': {e}")
else:
    st.info("No files uploaded yet.")
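
Since a separate job later reads the same upload directory, one possible refinement is to write each upload to a temporary file and rename it into place, so the job never picks up a half-written document. The following is a sketch under assumptions: `save_atomically` is a hypothetical helper, and the rename is atomic on POSIX local filesystems, while network or object-storage mounts may behave differently.

```python
import os
import tempfile
from pathlib import Path


def save_atomically(data: bytes, destination: Path) -> None:
    """Write data to destination so readers never observe a partial file."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Create the temporary file in the same directory so the rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=destination.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_name, destination)  # Atomic rename on POSIX filesystems
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

In the Streamlit code above, the two `open(filepath, "wb")` blocks could then become `save_atomically(uploaded_file.getvalue(), filepath)`.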

Building a container for the code above!

# Use a Python base image
FROM python:3.11-slim-buster 
# Set the working directory
WORKDIR /app

# Copy requirements.txt (recommended)
COPY requirements.txt .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose the Streamlit port (default is 8501)
EXPOSE 8501

# Run Streamlit (replace main_st.py with your Streamlit file name)
CMD ["streamlit", "run", "main_st.py"]

Sample Docling application using Streamlit framework

Below is starter code that serves as a helper for a Docling web-based application.

import json
import logging
import time
from pathlib import Path
import os
import shutil  # For copying directories

import streamlit as st

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)

# Define the mount paths
KUBERNETES_VOLUME_MOUNT_PATH = "/app/uploads"
SCRATCH_VOLUME_MOUNT_PATH = "/app/scratch"

def process_pdf(input_doc_path, scratch_dir, pipeline_options):
    """Processes a single PDF file."""
    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    try:
        conv_result = doc_converter.convert(input_doc_path)
        doc_filename = conv_result.input.file.stem

        with (scratch_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
            json.dump(conv_result.document.export_to_dict(), fp)
        with (scratch_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
            fp.write(conv_result.document.export_to_text())
        with (scratch_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
            fp.write(conv_result.document.export_to_markdown())
        with (scratch_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
            fp.write(conv_result.document.export_to_document_tokens())
        return True  # Indicate success

    except Exception as e:
        st.error(f"Error processing {input_doc_path}: {e}")
        return False  # Indicate failure

def main():
    logging.basicConfig(level=logging.INFO)

    st.title("Docling Document Conversion")

    # Kubernetes volume directory
    kubernetes_volume_dir = Path(KUBERNETES_VOLUME_MOUNT_PATH)
    if not kubernetes_volume_dir.exists():
        st.error(f"Kubernetes volume not found at {KUBERNETES_VOLUME_MOUNT_PATH}")
        return

    # Scratch directory
    scratch_dir = Path(SCRATCH_VOLUME_MOUNT_PATH)
    scratch_dir.mkdir(parents=True, exist_ok=True)


    # ... (pipeline options, OCR language, number of threads - same as before)
    # Assumption: library defaults stand in for the elided options above, so that the name is defined
    pipeline_options = PdfPipelineOptions()

    if st.button("Convert Documents in Volume"):
        with st.spinner("Converting documents..."):
            start_time = time.time()
            success_count = 0
            fail_count = 0

            for file_path in kubernetes_volume_dir.rglob("*.pdf"):  # Recursive search for PDFs
                if process_pdf(file_path, scratch_dir, pipeline_options):
                    success_count += 1
                else:
                    fail_count += 1

            end_time = time.time() - start_time
            st.write(f"Conversion completed in {end_time:.2f} seconds.")
            st.write(f"Successfully converted {success_count} PDFs.")
            st.write(f"Failed to convert {fail_count} PDFs.")
            st.write(f"Files saved to {SCRATCH_VOLUME_MOUNT_PATH}")

if __name__ == "__main__":
    main()
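
The loop above only picks up `*.pdf` files. If the volume also holds Word documents or images, the discovery step could be generalised along the following lines. This is a sketch only: the helper name `iter_supported_documents` and the extension set are assumptions chosen for illustration and should be adjusted to the formats the pipeline is actually meant to handle.

```python
from pathlib import Path
from typing import Iterator

# Assumed set of extensions; adjust to the formats your pipeline should handle
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".png", ".jpg", ".jpeg"}


def iter_supported_documents(root: Path, extensions=SUPPORTED_EXTENSIONS) -> Iterator[Path]:
    """Yield files under root whose suffix (case-insensitive) is in extensions."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            yield path
```

The `for file_path in kubernetes_volume_dir.rglob("*.pdf")` line could then iterate over `iter_supported_documents(kubernetes_volume_dir)` instead, with `process_pdf` renamed accordingly.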

And a Dockerfile to build an image.

FROM python:3.11-slim-buster

WORKDIR /app

# Create a requirements.txt with docling and its dependencies
COPY requirements.txt .  
RUN pip install -r requirements.txt

COPY . .

CMD ["streamlit", "run", "Docling_st.py"] 

A YAML helper in case the Docling application is deployed inside a cluster later (for the time being, it is a serverless test application).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: docling-deployment
  namespace: files  # Deploy to the same "files" namespace
spec:
  replicas: 1  # Adjust as needed
  selector:
    matchLabels:
      app: docling-app
  template:
    metadata:
      labels:
        app: docling-app
    spec:
      containers:
      - name: docling-container
        image: docling-image:latest  # Replace with your Docling Docker image
        ports:
        - containerPort: 8501 # Streamlit default port
        volumeMounts:
        - name: scratch-volume
          mountPath: /app/scratch # Mount the scratch volume
        - name: uploads-volume # Mount the existing uploads volume
          mountPath: /app/uploads # Or another suitable path
        resources:
          requests:
            cpu: 200m # Adjust as needed
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
      volumes:
      - name: scratch-volume
        persistentVolumeClaim:
          claimName: docling-pvc # Create this PVC separately
      - name: uploads-volume # Use the existing uploads volume
        persistentVolumeClaim:
          claimName: my-fastapi-pvc # The existing PVC

---

apiVersion: v1
kind: Service
metadata:
  name: docling-service
  namespace: files
spec:
  selector:
    app: docling-app
  ports:
  - protocol: TCP
    port: 8501 # External port
    targetPort: 8501 # Container port
  type: LoadBalancer  # Or ClusterIP if internal access is sufficient

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: docling-pvc
  namespace: files
spec:
  accessModes: [ "ReadWriteOnce" ] # Or ReadWriteMany if needed
  resources:
    requests:
      storage: 1Gi # Adjust storage size as needed

Conclusion

The code samples provided here are the building blocks for a web-based application that prepares a file repository/volume containing document types such as images, PDFs, and Word files. Docling ingests these documents and converts them to Markdown files, which makes them ready for a generative AI application.

Again, this is not an end-to-end project, but portions of code to be enhanced, industrialized and deployed.

Thanks for reading 🤟
