
Alain Airom


Building an AI-Powered Document Retrieval System with Docling and Granite 3.1


Introduction & motivation

In this article, I share my experience running a Jupyter notebook that uses both IBM Granite and Docling to provide RAG capabilities.

While working on a project with a business partner, I wanted to showcase Docling's document conversion capabilities together with an IBM Granite LLM.

Docling is an open-source toolkit developed by IBM Research that helps developers convert various document formats (like PDFs, Word documents, presentations, and more) into structured data.
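
As a quick illustration (a minimal sketch, not taken from the original notebook), converting a single document with Docling and exporting it to Markdown looks roughly like this, using the UFC rules PDF that the notebook indexes later on:

# Minimal Docling sketch: convert one document and preview it as Markdown
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("https://media.ufc.tv/discover-ufc/Unified_Rules_MMA.pdf")
print(result.document.export_to_markdown()[:500])  # first 500 characters of the structured output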

My starting point was a basic sample provided on the IBM Granite Community site, namely this notebook: https://github.com/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/RAG/Granite_Docling_RAG.ipynb.

This short article describes the steps I followed to run the sample notebook and build my use case on top of it.

Running the notebook locally

The environment I used:

  • Intel-based MacBook Pro
  • VS Code 1.96.4
  • Jupyter notebook support extension provided by Microsoft

I would have preferred a local Anaconda notebook installation, but I couldn't make it work on my laptop; that is left for future troubleshooting.

Making a virtual environment for Python

Running the notebook as-is, without a virtual environment, failed. And… creating a virtual environment is a best practice anyway. 😅

# venv ships with the Python 3 standard library, so no separate install is needed
# create the virtual environment
python3 -m venv <your_env_name>

# activate it
source <your_env_name>/bin/activate
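
If the VS Code Jupyter extension does not list the freshly created environment, registering it as a kernel with ipykernel (an extra step, not part of the original recipe) should make it selectable:

# register the virtual environment as a Jupyter kernel so VS Code can pick it
python3 -m pip install ipykernel
python3 -m ipykernel install --user --name <your_env_name>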

Running and debugging the notebook in my configuration

Below is the original notebook.

Building an AI-Powered Document Retrieval System with Docling and Granite 3.1
Using IBM Granite Models

Recipe Overview
Welcome to this Granite recipe. In it, you'll learn to harness the power of advanced tools to build AI-powered document retrieval systems. It will guide you through:

  • Document Processing: Learn to handle documents from various sources, parse and transform them into usable formats, and store them in vector databases using Docling.
  • Retrieval-Augmented Generation (RAG): Understand how to connect large language models (LLMs) like Granite 3.1 with external knowledge bases to enhance query responses and generate valuable insights.
  • LangChain for Workflow Integration: Discover how to use LangChain to streamline and orchestrate document processing and retrieval workflows, enabling seamless interaction between different components of the system.
This workshop leverages three cutting-edge technologies:

  • Docling: An open-source toolkit for parsing and converting documents.
  • Granite™ 3.1: A state-of-the-art LLM available via an API through Replicate, providing robust natural language capabilities.
  • LangChain: A powerful framework for building applications powered by language models, designed to simplify complex workflows and integrate external tools seamlessly.
By the end of this recipe, you will:

  • Gain proficiency in document processing and chunking.
  • Integrate vector databases to enhance retrieval capabilities.
  • Utilize RAG to perform efficient and accurate data retrieval for real-world applications.
This recipe is designed for AI developers, researchers, and enthusiasts looking to enhance their knowledge of document management and advanced NLP techniques.

Prerequisites
  • Familiarity with Python programming.
  • Basic understanding of large language models and natural language processing concepts.
Step 1: Setting up the environment
Ensure you are running python 3.10 or 3.11 in a freshly-created virtual environment.

import sys
assert sys.version_info >= (3, 10) and sys.version_info < (3, 12), "Use Python 3.10 or 3.11 to run this notebook."
Step 2: Install dependencies
! pip install "git+https://github.com/ibm-granite-community/utils.git" \
    transformers \
    langchain_community \
    langchain_huggingface \
    langchain_milvus \
    docling \
    replicate

Step 3: Selecting System Components
Choose your Embeddings Model
Specify the model to use for generating embedding vectors from text. Here we will be using one of the new Granite Embeddings models

To use a model from another provider, replace this code cell with one from this Embeddings Model recipe.

from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.1-8b-instruct")

embeddings_model = HuggingFaceEmbeddings(model_name="ibm-granite/granite-embedding-30m-english")
embeddings_tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-embedding-30m-english")
Use the Granite 3.1 model
Select a Granite model from the ibm-granite org on Replicate. Here we use the Replicate Langchain client to connect to the model.

To get set up with Replicate, see Getting Started with Replicate.

To connect to a model on a provider other than Replicate, substitute this code cell with one from the LLM component recipe.

from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import get_env_var

model = Replicate(
    model="ibm-granite/granite-3.1-8b-instruct",
    replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
    model_kwargs={
        "max_tokens": 1000, # Set the maximum number of tokens to generate as output.
        "min_tokens": 100, # Set the minimum number of tokens to generate as output.
    },
)
Now that we have the model set up, let's try asking it a question

query = "Who won in the Pantoja vs Asakura fight at UFC 310?"
prompt_guide_template = """\
<|start_of_role|>user<|end_of_role|>{prompt}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"""
prompt = prompt_guide_template.format(prompt=query)

output = model.invoke(prompt)

print(output)
Now, I know that UFC 310 happened in 2024, and this does not seem to be the right Pantoja. The model doesn't seem to know the answer but at least understands that this matchup did not occur. Let's see if it has some specific UFC rules info.

query1 = "How much weight allowance is allowed in non championship fights in the UFC?"

prompt = prompt_guide_template.format(prompt=query1)
output = model.invoke(prompt)

print(output)
Based on the official UFC rules, this is also incorrect. Let's try getting some documents that contain this information for the model.

Choose your Vector Database
Specify the database to use for storing and retrieving embedding vectors.

To connect to a vector database other than Milvus, replace this code cell with one from this Vector Store recipe.

from langchain_milvus import Milvus
import tempfile

db_file = tempfile.NamedTemporaryFile(prefix="milvus_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")

vector_db = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    enable_dynamic_field=True,
    index_params={"index_type": "AUTOINDEX"},
)
Step 4: Building the Vector Database
In this example, from a set of source documents, we use Docling to convert the documents into text and then split the text into chunks, derive embedding vectors using the embedding model, and load them into the vector database. Creating this vector database will allow us to easily search across our documents, enabling us to use RAG.

Use Docling to download the documents, convert to text, and split into chunks
Here we have found a website that gives us information on UFC 310, as well as a PDF of the official UFC rules. Below we will see how Docling can convert both documents and chunk them.

# Docling imports
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.types.doc.labels import DocItemLabel
from langchain_core.documents import Document

#Here are our documents, feel free to add more documents in formats that Docling supports
sources = [
    "https://www.ufc.com/news/main-card-results-highlights-winner-interviews-ufc-310-pantoja-vs-asakura",
    "https://media.ufc.tv/discover-ufc/Unified_Rules_MMA.pdf",

]

converter = DocumentConverter()

# Convert and chunk our documents
i = 0
texts: list[Document] = [
    Document(page_content=chunk.text, metadata={"doc_id": (i:=i+1), "source": source})
    for source in sources
    for chunk in HybridChunker(tokenizer=embeddings_tokenizer).chunk(converter.convert(source=source).document)
    if any(filter(lambda c: c.label in [DocItemLabel.TEXT, DocItemLabel.PARAGRAPH], iter(chunk.meta.doc_items)))
]

print(f"{len(texts)} document chunks created")
# Print all created documents
for document in texts:
    print(f"Document ID: {document.metadata['doc_id']}")
    print(f"Source: {document.metadata['source']}")
    print(f"Content:\n{document.page_content}")
    print("=" * 80)  # Separator for clarity
Populate the vector database
NOTE: Population of the vector database may take over a minute depending on your embedding model and service.

ids = vector_db.add_documents(texts)
print(f"{len(ids)} documents added to the vector database")
Step 5: RAG with Granite
Now that we have successfully converted our documents and vectorized them, we can set up our RAG pipeline.

Retrieve relevant chunks
Here we will test the as_retriever method to search through our newly created vector database for chunks that are relevant to our original query

retriever = vector_db.as_retriever()

docs = retriever.invoke(query)
print(docs)
Looks like it pulled some chunks that would have the information we are looking for. Let's go ahead and construct our RAG pipeline.

Create the prompt for Granite 3.1
Next, we construct the prompt pipeline. This creates the prompt, which holds the retrieved chunks from our previous search and feeds them to the model as context for answering our question.

from langchain.prompts import PromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Create a prompt template for question-answering with the retrieved context.
prompt_template = """<|start_of_role|>system<|end_of_role|>\
Context information is below.
---------------------
{context}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {input}
Answer:<|end_of_text|>
<|start_of_role|>user<|end_of_role|>{input}"""

# Assemble the retrieval-augmented generation chain.
qa_chain_prompt = PromptTemplate.from_template(prompt_template)
combine_docs_chain = create_stuff_documents_chain(model, qa_chain_prompt)
rag_chain = create_retrieval_chain(vector_db.as_retriever(), combine_docs_chain)
Generate a retrieval-augmented response to a question
Using the chunks from the similarity search as context, we invoke the RAG chain with our original question and print the answer generated by Granite 3.1.

output = rag_chain.invoke({"input": query})

print(output['answer'])
Awesome! It looks like the model figured out our first question. Let's see if it figures out the rule we were looking for.

output = rag_chain.invoke({"input": query1})

print(output['answer'])
Awesome! We can now see that we have created a pipeline that can successfully leverage knowledge from multiple document types for generation.

What I changed…

Below are the changes I had to make to get it running on my configuration.

# separation of the pip installs, which for some reason couldn't run as a single command on my configuration
! pip install "git+https://github.com/ibm-granite-community/utils.git" \
    tf-keras \
    langchain_community


! pip install backports.tarfile
! pip install milvus
! pip install replicate
! pip install langchain_milvus
! pip install docling 
! pip install langchain_huggingface
! pip install transformers
# added as encountered some problems in the notebook
! pip install 'docling-core[chunking]'
! pip install docling transformers 

It is also necessary, if the sample is to be run as-is, to obtain a Replicate API key.

from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import get_env_var

model = Replicate(
    model="ibm-granite/granite-3.1-8b-instruct",
    replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
    model_kwargs={
        "max_tokens": 1000, # Set the maximum number of tokens to generate as output.
        "min_tokens": 100, # Set the minimum number of tokens to generate as output.
    },
)
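
The get_env_var helper reads REPLICATE_API_TOKEN from the environment, so one way to provide the key (a minimal sketch, adapt it to your own setup) is:

# option 1: export the token in the shell before launching VS Code / Jupyter
# export REPLICATE_API_TOKEN=<your_replicate_api_token>

# option 2: set it from within the notebook, before the Replicate cell runs
import os
os.environ["REPLICATE_API_TOKEN"] = "<your_replicate_api_token>"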

Otherwise, the rest of the code runs fine as-is. For instance:

from langchain_milvus import Milvus
import tempfile

db_file = tempfile.NamedTemporaryFile(prefix="milvus_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")

vector_db = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    enable_dynamic_field=True,
    index_params={"index_type": "AUTOINDEX"},
)


# Output: The vector database will be saved to /var/folders/bd/3l7ymgv145g1_9blgv56_9qr0000gn/T/milvus_2sqv7l31.db

Once the sample code works, it can be tuned and adapted to any other use case.
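
For instance, pointing the pipeline at your own documents is mostly a matter of swapping the sources list fed to Docling; the URLs below are purely hypothetical placeholders:

# hypothetical example: index your own documents instead of the UFC material
sources = [
    "https://example.com/your-product-guide.pdf",
    "https://example.com/your-release-notes.html",
]
# the rest of the pipeline (conversion, chunking, embedding, retrieval) stays the same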

Conclusion

In this article, I shared my experience running a Jupyter notebook that uses both IBM Granite and Docling to provide RAG capabilities.

Thanks for reading 🙇
