
Alain Airom

Simple Knowledge Retrieval: A RAG Implementation and Query Using IBM Granite on Hugging Face

A simple RAG implementation and query using IBM Granite on Hugging Face.


Introduction

This application demonstrates a very simple Retrieval Augmented Generation (RAG) system, designed to answer user queries based on information extracted from a provided text file.

It begins by processing the input text, breaking it into manageable chunks, and generating numerical representations (embeddings) of these chunks using a pre-trained sentence transformer model. These embeddings are then indexed using Faiss for efficient similarity search. When a user enters a query, the application converts the query into an embedding and retrieves the most relevant text chunks from the index.

These chunks are then used as context for a large language model (LLM), accessed through the Hugging Face Inference API, to generate a coherent and informative response. The final answer is displayed on the command line as a natural-language response grounded in the information present in the original input text.
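At its core, the retrieval step is only a few lines. Here is a minimal sketch of that core (not the full application), assuming the same all-mpnet-base-v2 model and a Faiss inner-product index; normalizing the embeddings makes the inner-product scores behave like cosine similarity, something the full listing below does not do:

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

chunks = ["first text chunk", "second text chunk"]            # stand-in chunks
vectors = model.encode(chunks, normalize_embeddings=True)     # unit-length embeddings
index = faiss.IndexFlatIP(vectors.shape[1])                   # inner product == cosine on unit vectors
index.add(np.asarray(vectors, dtype="float32"))

query = model.encode(["my question"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), k=2)
context = [chunks[i] for i in ids[0]]                         # most relevant chunks first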

Disclaimer: I initially tried to use Milvus as the vector database, but after several attempts I was unable to get it working on my Intel-based macOS laptop!

The application

The simple application (main.py) is provided hereafter.

import os
import time
import uuid
from typing import List
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
from langchain.text_splitter import RecursiveCharacterTextSplitter
import requests
import json

# Configuration
EMBEDDING_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
LLM_MODEL_NAME = "ibm-granite/granite-3.2-2b-instruct"
INPUT_FILE_PATH = "./input.txt"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100
HUGGINGFACE_API_TOKEN = "YOUR-HF-TOKEN"

API_URL = f"https://api-inference.huggingface.co/models/{LLM_MODEL_NAME}"
headers = {"Authorization": f"Bearer {HUGGINGFACE_API_TOKEN}"}

def load_text_from_file(file_path: str) -> str:
    """Loads text from a file."""
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Chunking text into smaller pieces."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return text_splitter.split_text(text)

def embed_text(texts: List[str], model: SentenceTransformer) -> np.ndarray:
    """Embeding text using Sentence Transformers."""
    embeddings = model.encode(texts)
    return embeddings

def generate_response_api(query_text, context):
    """Generateing a response using the Hugging Face Inference API."""
    prompt = f"Answer the question based on the context below.\n\nContext:\n{', '.join(context)}\n\nQuestion: {query_text}\nAnswer:"
    payload = {
        "inputs": prompt,
        "options": {"wait_for_model": True},
        "parameters": {"max_new_tokens": 256}
    }
    data = json.dumps(payload)
    headers = {"Authorization": f"Bearer {HUGGINGFACE_API_TOKEN}", "Content-Type": "application/json"} # ensure content type is set.
    response = requests.request("POST", API_URL, headers=headers, data=data)
    try:
        return json.loads(response.content.decode("utf-8"))[0]['generated_text']
    except (json.JSONDecodeError, KeyError, IndexError) as e:
        print(f"Error processing API response: {e}, Response: {response.content}")
        return "Sorry, I could not generate a response."


def main():
    """Main function to run the application."""
    embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME)
    embedding_dim = embedding_model.get_sentence_embedding_dimension()

    text = load_text_from_file(INPUT_FILE_PATH)
    chunks = chunk_text(text, CHUNK_SIZE, CHUNK_OVERLAP)
    embeddings = embed_text(chunks, embedding_model)
    embeddings = np.array(embeddings).astype('float32')

    print("Embeddings generated.") #Added print statement.

    # IndexFlatIP ranks by inner product; the embeddings are not normalized here,
    # so this is a dot-product ranking rather than strict cosine similarity
    index = faiss.IndexFlatIP(embedding_dim)
    index.add(embeddings)

    print("Faiss index built.") #Added print statement.

    while True:
        print("Entering query loop...") #Added print statement.
        query_text = input("Enter your query (or 'exit' to quit): ")
        if query_text.lower() == "exit":
            break

        query_embedding = embed_text([query_text], embedding_model)[0]
        query_embedding = np.array([query_embedding]).astype('float32')

        try:
            distances, indices = index.search(query_embedding, k=5)
            context = [chunks[i] for i in indices[0]]
            response = generate_response_api(query_text, context)
            print("Response:", response)
        except Exception as e:
            print(f"Error during search or API call: {e}")

if __name__ == "__main__":
    main()

    prompt = f"Answer the question based on the context below.\n\nContext:\n{', '.join(context)}\n\nQuestion: {query_text}\nAnswer:"
    payload = {
        "inputs": prompt,
        "options": {"wait_for_model": True},
        "parameters": {"max_new_tokens": 256}
    }
    data = json.dumps(payload)
    headers = {"Authorization": f"Bearer {HUGGINGFACE_API_TOKEN}", "Content-Type": "application/json"} # ensure content type is set.
    response = requests.request("POST", API_URL, headers=headers, data=data)
    try:
        return json.loads(response.content.decode("utf-8"))[0]['generated_text']
    except (json.JSONDecodeError, KeyError, IndexError) as e:
        print(f"Error processing API response: {e}, Response: {response.content}")
        return "Sorry, I could not generate a response."


def main():

    embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME)
    embedding_dim = embedding_model.get_sentence_embedding_dimension()

    text = load_text_from_file(INPUT_FILE_PATH)
    chunks = chunk_text(text, CHUNK_SIZE, CHUNK_OVERLAP)
    embeddings = embed_text(chunks, embedding_model)
    embeddings = np.array(embeddings).astype('float32')

    print("Embeddings generated.") #Added print statement.

    index = faiss.IndexFlatIP(embedding_dim)
    index.add(embeddings)

    print("Faiss index built.") #Added print statement.

    while True:
        print("Entering query loop...") #Added print statement.
        query_text = input("Enter your query (or 'exit' to quit): ")
        if query_text.lower() == "exit":
            break

        query_embedding = embed_text([query_text], embedding_model)[0]
        query_embedding = np.array([query_embedding]).astype('float32')

        try:
            distances, indices = index.search(query_embedding, k=5)
            context = [chunks[i] for i in indices[0]]
            response = generate_response_api(query_text, context)
            print("Response:", response)
        except Exception as e:
            print(f"Error during search or API call: {e}")

if __name__ == "__main__":
    main()
Enter fullscreen mode Exit fullscreen mode

This is my sample text, which I used to write a comment regarding a book I’m reading.

This chapter is about cloud anti-patterns. An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive.   

The following anti-patterns are the focus of this chapter:

Lack of clear objectives & strategy
Lack of migration strategy
Outsourcing of cloud knowledge and governance
Lack of a partnership strategy
Gaps in our Cloud Adoption Framework
Adopting cloud-native technologies involves more than just implementing new tools and platforms. It requires a fundamental change in an organization's culture, approach to learning, and governance practices. Many organizations make the mistake of thinking they can just add cloud technologies to their current ways of doing things, but this often leads to problems.


Code execution preparation

  • Build a virtual environment.
#the name I gave
python3.11 -m venv pytorch_test_env_311 
source pytorch_test_env_311/bin/activate
  • Package installation.
pip install torch torchvision torchaudio
pip install "numpy<2.0"
pip install faiss-cpu
pip install langchain
pip install sentence-transformers
pip install requests
# test installation as I experienced several failures
python -c "import langchain; print(langchain.__version__ if hasattr(langchain, '__version__') else 'LangChain installed')"

# again...
source pytorch_test_env_311/bin/activate
  • The test and output:
python main.py   # took 3m 40s
Embeddings generated.
Faiss index built.
Entering query loop...
Enter your query (or 'exit' to quit): what says the chapter
Response: Answer the question based on the context below.

Context:
Lack of clear objectives & strategy
Lack of migration strategy
Outsourcing of cloud knowledge and governance
Lack of a partnership strategy
Gaps in our Cloud Adoption Framework, Lack of a partnership strategy
Gaps in our Cloud Adoption Framework
Adopting cloud-native technologies involves more than just implementing new tools and platforms. It requires a fundamental change in an organization's culture, approach to learning, and governance practices. Many organizations make the mistake of thinking they can just add cloud technologies to their current ways of doing things, but this often leads to problems., This chapter is about cloud anti-patterns. An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive.   

The following anti-patterns are the focus of this chapter:, Lack of a partnership strategy
Gaps in our Cloud Adoption Framework
Adopting cloud-native technologies involves more than just implementing new tools and platforms. It requires a fundamental change in an organization's culture, approach to learning, and governance practices. Many organizations make the mistake of thinking they can just add cloud technologies to their current ways of doing things, but this often leads to problems., Lack of a partnership strategy
Gaps in our Cloud Adoption Framework
Adopting cloud-native technologies involves more than just implementing new tools and platforms. It requires a fundamental change in an organization's culture, approach to learning, and governance practices. Many organizations make the mistake of thinking they can just add cloud technologies to their current ways of doing things, but this often leads to problems.

Question: quit
Answer: quit

Explanation: The user's request is to quit, so the assistant responds with "quit" to acknowledge and end the current interaction.
Entering query loop...
Enter your query (or 'exit' to quit): exit
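A couple of things stand out in this output. The repeated context chunks come from asking Faiss for 5 neighbours when this short sample only yields a few chunks: Faiss pads the result with -1 indices, which chunks[i] then maps back to the last chunk, so capping the request with k=min(5, len(chunks)) avoids that. Separately, the text-generation endpoint returns the prompt followed by the completion, so the printed response echoes the whole prompt back. A small helper of my own (not part of the original listing) strips that echo before printing; setting "return_full_text": False in the request parameters should achieve the same on the API side.

def extract_answer(generated_text: str, prompt: str) -> str:
    """Strips the echoed prompt so only the model's completion remains."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()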

Once the code was working correctly, I moved on to packaging it.

Containerization

  • Creating a “requirements.txt”
sentence-transformers
faiss-cpu
langchain
requests
numpy<2.0
torch
  • And the corresponding Dockerfile
# Use a Python base image
FROM python:3.11-slim

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file into the container
COPY requirements.txt .

# Install the required packages
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application files into the container
COPY . .

# Optional -> Set environment variables 
# Replace with your actual token
ENV HUGGINGFACE_API_TOKEN=your_huggingface_api_token  

# Run the application
CMD ["python", "main.py"]
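To try the image locally, the usual build-and-run pair is enough. The image name below is arbitrary, -it is needed because main.py reads queries from stdin, and passing the token with -e keeps it out of the image (see the Appendix for the matching code change):

docker build -t granite-rag .
docker run -it --rm -e HUGGINGFACE_API_TOKEN=your_huggingface_api_token granite-rag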

And that’s it ✌️, thanks for reading.

Conclusion

In conclusion, this application provides a functional demonstration of a Retrieval Augmented Generation (RAG) system, showcasing the power of combining vector search with large language models to create contextually relevant responses. While initially challenged by dependency conflicts and environment-specific issues, the final implementation effectively utilizes Faiss for efficient vector indexing and the Hugging Face Inference API for language generation. This simple project highlights the potential of RAG for building question-answering systems that can provide accurate and informative responses based on specific knowledge sources.

Links

Appendix

If the Hugging Face API token is set in the Dockerfile as an environment variable, the code should be modified as in the sample below.

import os

# ...
# Retrieve Hugging Face API token from environment variable
HUGGINGFACE_API_TOKEN = os.environ.get("HUGGINGFACE_API_TOKEN")

if not HUGGINGFACE_API_TOKEN:
    raise ValueError("HUGGINGFACE_API_TOKEN environment variable not set.")

API_URL = f"https://api-inference.huggingface.co/models/{LLM_MODEL_NAME}"
headers = {"Authorization": f"Bearer {HUGGINGFACE_API_TOKEN}", "Content-Type": "application/json"}

# ...
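With that change in place, the same code also runs outside the container by exporting the variable first (the token value below is just a placeholder):

export HUGGINGFACE_API_TOKEN=your_huggingface_api_token
python main.py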
