A simple RAG implementation and query using IBM Granite on Hugging Face.
Introduction
This application demonstrates a very simple Retrieval Augmented Generation (RAG) system, designed to answer user queries based on information extracted from a provided text file.
It begins by processing the input text, breaking it into manageable chunks, and generating numerical representations (embeddings) of these chunks using a pre-trained sentence transformer model. These embeddings are then indexed using Faiss for efficient similarity search. When a user enters a query, the application converts the query into an embedding and retrieves the most relevant text chunks from the index.
These chunks are then used as context for a large language model (LLM), accessed through the Hugging Face Inference API, to generate a coherent and informative response. The final answer is then displayed to the user on the command line, providing a natural language response grounded in the information present in the original input text.
Disclaimer: I first tried to use Milvus as the vector database, but after several attempts I could not get it running on my Intel-based macOS laptop, so this example uses Faiss instead.
The application
The complete application (main.py) is shown below.
from typing import List
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
from langchain.text_splitter import RecursiveCharacterTextSplitter
import requests
import json
# Configuration
EMBEDDING_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
LLM_MODEL_NAME = "ibm-granite/granite-3.2-2b-instruct"
INPUT_FILE_PATH = "./input.txt"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100
HUGGINGFACE_API_TOKEN = "YOUR-HF-TOKEN"
API_URL = f"https://api-inference.huggingface.co/models/{LLM_MODEL_NAME}"
headers = {"Authorization": f"Bearer {HUGGINGFACE_API_TOKEN}"}
def load_text_from_file(file_path: str) -> str:
    """Load text from a file."""
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Chunk text into smaller pieces."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return text_splitter.split_text(text)

def embed_text(texts: List[str], model: SentenceTransformer) -> np.ndarray:
    """Embed text using Sentence Transformers."""
    embeddings = model.encode(texts)
    return embeddings
def generate_response_api(query_text, context):
    """Generate a response using the Hugging Face Inference API."""
    prompt = f"Answer the question based on the context below.\n\nContext:\n{', '.join(context)}\n\nQuestion: {query_text}\nAnswer:"
    payload = {
        "inputs": prompt,
        "options": {"wait_for_model": True},
        "parameters": {"max_new_tokens": 256}
    }
    data = json.dumps(payload)
    headers = {"Authorization": f"Bearer {HUGGINGFACE_API_TOKEN}", "Content-Type": "application/json"}  # ensure the content type is set
    response = requests.post(API_URL, headers=headers, data=data)
    try:
        return json.loads(response.content.decode("utf-8"))[0]['generated_text']
    except (json.JSONDecodeError, KeyError, IndexError) as e:
        print(f"Error processing API response: {e}, Response: {response.content}")
        return "Sorry, I could not generate a response."
def main():
    """Main function to run the application."""
    embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME)
    embedding_dim = embedding_model.get_sentence_embedding_dimension()
    text = load_text_from_file(INPUT_FILE_PATH)
    chunks = chunk_text(text, CHUNK_SIZE, CHUNK_OVERLAP)
    embeddings = embed_text(chunks, embedding_model)
    embeddings = np.array(embeddings).astype('float32')
    print("Embeddings generated.")
    index = faiss.IndexFlatIP(embedding_dim)
    index.add(embeddings)
    print("Faiss index built.")
    while True:
        print("Entering query loop...")
        query_text = input("Enter your query (or 'exit' to quit): ")
        if query_text.lower() == "exit":
            break
        query_embedding = embed_text([query_text], embedding_model)[0]
        query_embedding = np.array([query_embedding]).astype('float32')
        try:
            # Cap k so Faiss never returns -1 indices when there are
            # fewer than 5 chunks in the index.
            distances, indices = index.search(query_embedding, k=min(5, len(chunks)))
            context = [chunks[i] for i in indices[0]]
            response = generate_response_api(query_text, context)
            print("Response:", response)
        except Exception as e:
            print(f"Error during search or API call: {e}")

if __name__ == "__main__":
    main()
prompt = f"Answer the question based on the context below.\n\nContext:\n{', '.join(context)}\n\nQuestion: {query_text}\nAnswer:"
payload = {
"inputs": prompt,
"options": {"wait_for_model": True},
"parameters": {"max_new_tokens": 256}
}
data = json.dumps(payload)
headers = {"Authorization": f"Bearer {HUGGINGFACE_API_TOKEN}", "Content-Type": "application/json"} # ensure content type is set.
response = requests.request("POST", API_URL, headers=headers, data=data)
try:
return json.loads(response.content.decode("utf-8"))[0]['generated_text']
except (json.JSONDecodeError, KeyError, IndexError) as e:
print(f"Error processing API response: {e}, Response: {response.content}")
return "Sorry, I could not generate a response."
def main():
embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME)
embedding_dim = embedding_model.get_sentence_embedding_dimension()
text = load_text_from_file(INPUT_FILE_PATH)
chunks = chunk_text(text, CHUNK_SIZE, CHUNK_OVERLAP)
embeddings = embed_text(chunks, embedding_model)
embeddings = np.array(embeddings).astype('float32')
print("Embeddings generated.") #Added print statement.
index = faiss.IndexFlatIP(embedding_dim)
index.add(embeddings)
print("Faiss index built.") #Added print statement.
while True:
print("Entering query loop...") #Added print statement.
query_text = input("Enter your query (or 'exit' to quit): ")
if query_text.lower() == "exit":
break
query_embedding = embed_text([query_text], embedding_model)[0]
query_embedding = np.array([query_embedding]).astype('float32')
try:
distances, indices = index.search(query_embedding, k=5)
context = [chunks[i] for i in indices[0]]
response = generate_response_api(query_text, context)
print("Response:", response)
except Exception as e:
print(f"Error during search or API call: {e}")
if __name__ == "__main__":
main()
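One caveat on the search itself: IndexFlatIP ranks results by raw inner product, and if the embedding model does not return unit-length vectors those scores are not cosine similarities. A minimal sketch of the fix, reusing the embeddings and query_embedding variables from main() above (it is harmless if the vectors are already normalized):
# L2-normalize in place so that inner-product search in IndexFlatIP
# is equivalent to cosine similarity.
faiss.normalize_L2(embeddings)       # before index.add(embeddings)
faiss.normalize_L2(query_embedding)  # before index.search(...)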
This is my sample text (the contents of input.txt), a comment I wrote about a book I am reading.
This chapter is about cloud anti-patterns. An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive.
The following anti-patterns are the focus of this chapter:
Lack of clear objectives & strategy
Lack of migration strategy
Outsourcing of cloud knowledge and governance
Lack of a partnership strategy
Gaps in our Cloud Adoption Framework
Adopting cloud-native technologies involves more than just implementing new tools and platforms. It requires a fundamental change in an organization's culture, approach to learning, and governance practices. Many organizations make the mistake of thinking they can just add cloud technologies to their current ways of doing things, but this often leads to problems.
Code execution preparation
- Build a virtual environment.
# the name I gave the environment
python3.11 -m venv pytorch_test_env_311
source pytorch_test_env_311/bin/activate
- Package installation.
pip install torch torchvision torchaudio
pip install "numpy<2.0"
pip install sentence-transformers
pip install faiss-cpu
pip install langchain
pip install requests
# test installation as I experienced several failures
python -c "import langchain; print(langchain.__version__ if hasattr(langchain, '__version__') else 'LangChain installed')"
# re-activate the environment if needed
source pytorch_test_env_311/bin/activate
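Since I experienced several installation failures, a quick smoke test helps confirm that sentence-transformers and Faiss work together before running the full app. A minimal sketch (the file name smoke_test.py is my choice):
# smoke_test.py
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
emb = np.asarray(model.encode(["hello world", "cloud anti-patterns"]), dtype="float32")
index = faiss.IndexFlatIP(emb.shape[1])  # index dimension matches the model
index.add(emb)
_, ids = index.search(emb[:1], k=1)
print("OK, nearest neighbour of sentence 0 is sentence", ids[0][0])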
- The test and output:
python main.py  # took 3m 40s
Embeddings generated.
Faiss index built.
Entering query loop...
Enter your query (or 'exit' to quit): what says the chapter
Response: Answer the question based on the context below.
Context:
Lack of clear objectives & strategy
Lack of migration strategy
Outsourcing of cloud knowledge and governance
Lack of a partnership strategy
Gaps in our Cloud Adoption Framework, Lack of a partnership strategy
Gaps in our Cloud Adoption Framework
Adopting cloud-native technologies involves more than just implementing new tools and platforms. It requires a fundamental change in an organization's culture, approach to learning, and governance practices. Many organizations make the mistake of thinking they can just add cloud technologies to their current ways of doing things, but this often leads to problems., This chapter is about cloud anti-patterns. An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive.
The following anti-patterns are the focus of this chapter:, Lack of a partnership strategy
Gaps in our Cloud Adoption Framework
Adopting cloud-native technologies involves more than just implementing new tools and platforms. It requires a fundamental change in an organization's culture, approach to learning, and governance practices. Many organizations make the mistake of thinking they can just add cloud technologies to their current ways of doing things, but this often leads to problems., Lack of a partnership strategy
Gaps in our Cloud Adoption Framework
Adopting cloud-native technologies involves more than just implementing new tools and platforms. It requires a fundamental change in an organization's culture, approach to learning, and governance practices. Many organizations make the mistake of thinking they can just add cloud technologies to their current ways of doing things, but this often leads to problems.
Question: quit
Answer: quit
Explanation: The user's request is to quit, so the assistant responds with "quit" to acknowledge and end the current interaction.
Entering query loop...
Enter your query (or 'exit' to quit): exit
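Note that the response echoes the whole prompt: by default, the Hugging Face text-generation endpoint returns the prompt together with the completion. If you only want the answer, the API accepts a return_full_text parameter; a small tweak to the payload in generate_response_api:
payload = {
    "inputs": prompt,
    "options": {"wait_for_model": True},
    # return_full_text=False asks the API to return only the
    # generated completion, without echoing the prompt.
    "parameters": {"max_new_tokens": 256, "return_full_text": False}
}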
Once the code was working correctly, I moved on to packaging it.
Containerization
- Creating a “requirements.txt”
sentence-transformers
faiss-cpu
langchain
requests
numpy<2.0
torch
- And the corresponding Dockerfile
# Use a Python base image
FROM python:3.11-slim
# Set the working directory in the container
WORKDIR /app
# Copy the requirements file into the container
COPY requirements.txt .
# Install the required packages
RUN pip install --no-cache-dir -r requirements.txt
# Copy the application files into the container
COPY . .
# Optional -> Set environment variables
# Replace with your actual token
ENV HUGGINGFACE_API_TOKEN=your_huggingface_api_token
# Run the application
CMD ["python", "main.py"]
And that’s it ✌️, thanks for reading.
Conclusion
This application provides a functional demonstration of a Retrieval Augmented Generation (RAG) system, showcasing the power of combining vector search with large language models to create contextually relevant responses. Although I was initially slowed down by dependency conflicts and environment-specific issues, the final implementation effectively uses Faiss for efficient vector indexing and the Hugging Face Inference API for language generation. This simple project highlights the potential of RAG for building question-answering systems that provide accurate, informative responses grounded in specific knowledge sources.
Links
- Granite models on Hugging Face: https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a
Appendix
If the Hugging Face token is set as an environment variable in the Dockerfile, the code should be modified as in the sample below.
import os
# ...
# Retrieve Hugging Face API token from environment variable
HUGGINGFACE_API_TOKEN = os.environ.get("HUGGINGFACE_API_TOKEN")
if not HUGGINGFACE_API_TOKEN:
raise ValueError("HUGGINGFACE_API_TOKEN environment variable not set.")
API_URL = f"https://api-inference.huggingface.co/models/{LLM_MODEL_NAME}"
headers = {"Authorization": f"Bearer {HUGGINGFACE_API_TOKEN}", "Content-Type": "application/json"}
# ...
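With this change the token no longer needs to be baked into the image; it can be passed at run time instead (assuming the rag-demo tag from above):
docker run -it -e HUGGINGFACE_API_TOKEN=your_huggingface_api_token rag-demo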