Large Language Models (LLMs) are impressive, having seemingly absorbed all the knowledge in their training data. However, they have limitations. One major issue is "hallucinations," where they confidently give incorrect answers. This happens because LLMs are trained on a fixed dataset with a cutoff date, leaving them without access to current information or specific data sources.
So how do we fix this? Enter RAG, or Retrieval Augmented Generation. Think of it as giving LLMs a timely and tailored education. RAG updates their knowledge and makes LLM-powered applications more domain-specific, all while being less expensive than fine-tuning the entire model.
How RAG Works: A Two-Step Process
1. Preprocessing: We take our data and convert it into "vector embeddings," numerical representations of text that capture its meaning so semantically similar passages end up close together. These embeddings are then stored in a Vector Database (VectorDB) for fast retrieval. Popular tools for this include OpenAI's text-embedding-ada-002 and Sentence Transformers.
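To make the idea concrete, here is a minimal sketch of turning text chunks into embeddings, using the sentence-transformers library and the all-MiniLM-L6-v2 model that appears later in this post (the sample chunks are made up for illustration):

```python
from sentence_transformers import SentenceTransformer

# Load a small, general-purpose embedding model
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# A couple of illustrative text chunks
chunks = [
    "The Mahabharata is an ancient Indian epic.",
    "Vyasa is traditionally credited as its compiler.",
]

# Each chunk becomes a fixed-length numerical vector (384 dimensions for this model)
embeddings = encoder.encode(chunks)
print(embeddings.shape)  # (2, 384)
```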
2. Retrieval: When you ask a question, we convert your query into a vector embedding and perform a "semantic search" in the VectorDB. This retrieves the most relevant documents, which are then ranked and sent to the LLM along with your prompt. The LLM, acting as the "generator," uses this information to create a response.
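And a matching sketch of the retrieval step, assuming the chunks above have already been stored in a Qdrant collection (the collection name my_docs is just for illustration):

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient("http://localhost:6333")

# Embed the user's question and fetch the closest stored chunks
query_vector = encoder.encode("Who compiled the Mahabharata?").tolist()
hits = client.query_points(collection_name="my_docs", query=query_vector, limit=3).points

# The retrieved text is stitched into the prompt that is sent to the LLM
context = "\n".join(hit.payload["page_content"] for hit in hits)
```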
In this blog, we'll be using Qdrant as our VectorDB, hosted locally via Docker.
Why Use RAG?
RAG is perfect for building Q&A chatbots for specific domains. While fine-tuning is an alternative, it's costly and requires a lot of data preparation. Fine-tuning is the better choice if you need maximum accuracy and strict control over the response format.
Now, you might wonder, with models like Google's Gemini accepting up to 2 million tokens, is RAG becoming obsolete? Absolutely not!
RAG offers a crucial advantage: data control. With RAG, your data stays on your premises, giving you complete control over updates and modifications. Sending all your data to a third-party model means less control. RAG ensures data privacy and allows for agile updates, making it the go-to framework for Q&A systems across various data sources.
In short, RAG is not dead; it's the smart way to build powerful and reliable LLM applications.
RAG Workflow
Deploying Qdrant on Docker Desktop with API and Dashboard Access
To deploy Qdrant on Docker Desktop, start by pulling the official Qdrant image with `docker pull qdrant/qdrant`. Once the image is downloaded, run the container with `docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant`, which exposes both the HTTP API (`6333`) and gRPC API (`6334`) ports. After the container is up and running, you can verify its status by navigating to http://localhost:6333/healthz, where a healthy instance returns an HTTP 200 response. Additionally, Qdrant provides a built-in Dashboard to explore collections and manage vector data; you can access it by visiting http://localhost:6333/dashboard in your browser. This setup gives you a local Qdrant instance for vector similarity search, AI applications, and data exploration through its dashboard.
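With the container running, here is a minimal connectivity check from Python (a sketch, assuming the qdrant-client package is installed):

```python
from qdrant_client import QdrantClient

# Connect to the local Qdrant instance on the HTTP port
client = QdrantClient("http://localhost:6333")

# A fresh install should simply return an empty list of collections
print(client.get_collections())
```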
RAG Implementation
```python
from dotenv import load_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from qdrant_client import QdrantClient
from qdrant_client.http.models import PointStruct, VectorParams, Distance
from sentence_transformers import SentenceTransformer

# Load the OpenAI API key from a .env file
load_dotenv()

# Embedding model used for both documents and queries
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Initialize Qdrant client
client = QdrantClient("http://localhost:6333")

# Load the PDF and split it into overlapping chunks
loader = PyPDFLoader("mahabharata.pdf")
docs = loader.load_and_split()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)
docs = text_splitter.split_documents(docs)

# Recreate the collection if it already exists
collections = client.get_collections()
if "mahabharata_embeddings" in [collection.name for collection in collections.collections]:
    client.delete_collection(collection_name="mahabharata_embeddings")

# Create a collection in Qdrant sized to the embedding model's output
client.create_collection(
    collection_name="mahabharata_embeddings",
    vectors_config=VectorParams(
        size=encoder.get_sentence_embedding_dimension(),
        distance=Distance.COSINE
    ),
)

# Embed each chunk and insert it into Qdrant with its text as payload
points = []
for i, doc in enumerate(docs):
    embedding = encoder.encode(doc.page_content)
    point = PointStruct(id=i, vector=embedding.tolist(), payload={"page_content": doc.page_content})
    points.append(point)

client.upsert(
    collection_name="mahabharata_embeddings",
    points=points
)
print("Embeddings inserted successfully")

template = """Answer the question based only on the following context. If you do not know the answer,
just say 'I do not know'. Provide clear and concise answers, at most one line:
Context: {context}
Question: {question}"""
prompt = ChatPromptTemplate.from_template(template)

# Initialize the chat model with OpenAI
chat_model = ChatOpenAI()

while True:
    question = input("Ask a question: ")
    if question == "exit":
        break

    # Encode the question
    query_vector = encoder.encode(question)

    # Query the collection for the three most similar chunks
    search_result = client.query_points(
        collection_name="mahabharata_embeddings",
        query=query_vector.tolist(),
        limit=3  # Number of results to return
    ).points

    # Join the retrieved chunks into a single context string
    text = "\n".join(result.payload['page_content'] for result in search_result)
    for result in search_result:
        print(f"Document ID: {result.id}, Score: {result.score}")
        print(f"Page Content: {result.payload['page_content']}")

    # Ask the chat model the question based on the retrieved context
    response = chat_model.invoke(prompt.format(context=text, question=question))

    # Print the response
    print(response.content)
```
In our vector search configuration, we retrieve three documents per query based on their cosine similarity score. Each document's ID and similarity score are printed alongside the retrieved results. Before returning the final response, we can also apply re-ranking techniques to refine the order of results based on relevance, as sketched after the snippet below.
```python
search_result = client.query_points(
    collection_name="mahabharata_embeddings",
    query=query_vector.tolist(),
    limit=3  # Number of results to return
).points
```
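As one way to apply such a ranking step, here is a minimal sketch of re-ranking the retrieved chunks with a cross-encoder. It assumes the `question` and `search_result` variables from the loop above, and the cross-encoder/ms-marco-MiniLM-L-6-v2 model from sentence-transformers, which is not part of the original implementation:

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores (query, passage) pairs directly, which is typically
# more precise than the bi-encoder similarity used for the initial retrieval
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Score each retrieved chunk against the question
pairs = [(question, result.payload["page_content"]) for result in search_result]
scores = reranker.predict(pairs)

# Reorder the chunks by the re-ranker's relevance score before building the context
reranked = [doc for _, doc in sorted(zip(scores, search_result), key=lambda x: x[0], reverse=True)]
text = "\n".join(result.payload["page_content"] for result in reranked)
```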
Ask a question: who is BHAGAVAN VYASA
Document ID: 12, Score: 0.5401658
Document ID: 555, Score: 0.5282817
Document ID: 1027, Score: 0.52122784
Answer: BHAGAVAN VYASA is the celebrated compiler of the Vedas and the son of the great sage Parasara.
Ask a question: Who is Parikshit?
Document ID: 1016, Score: 0.40114498
Document ID: 994, Score: 0.3752214
Document ID: 581, Score: 0.36482185
Parikshit is the son of Abhimanyu and the grandson of the Pandavas who was crowned king after the holocaust claimed the Kauravas and the Pandavas.
Ask a question: who is VIDURA?
Document ID: 86, Score: 0.6633505
Document ID: 1030, Score: 0.6146341
Document ID: 87, Score: 0.5774884
Vidura is the incarnation of Lord Dharma and was known for his unparalleled knowledge of dharma, sastras, and statesmanship.
Ask a question: who is BHIMA?
Document ID: 157, Score: 0.67709696
Document ID: 723, Score: 0.63936245
Document ID: 146, Score: 0.6344832
Bhima is one of the Pandavas, known for his strength and bravery.
Ask a question: who is Duryodhana?
Document ID: 230, Score: 0.6566757
Document ID: 553, Score: 0.60688937
Document ID: 269, Score: 0.6039754
Duryodhana is the son of Dhritarashtra and the main antagonist in the Indian epic Mahabharata.
Here’s how we can improve RAG accuracy:
- Optimize Chunking Strategies: Experiment with different chunk sizes and overlap percentages to better preserve context and improve retrieval accuracy.
- Test Different Embedding Models: Try various embedding models and fine-tune them on your domain-specific data to enhance retrieval precision.
- Use Advanced Retrieval Techniques: Implement RAG-Fusion, combining multiple retrieval methods for more comprehensive and accurate responses.
- Incorporate Self-Reflective RAG: Add reflection and critique tokens to help the model decide when to retrieve more information and assess response quality.
- Apply Corrective RAG (CRAG): Use an evaluator to assess the quality of documents and expand searches when necessary for more relevant results.
- Regularly Update Knowledge Sources: Keep your information updated and curated to ensure high-quality, relevant data is being used for retrieval.
- Experiment with Query Transformations: Rephrase queries or use techniques like Hypothetical Document Embeddings (HyDE) to improve retrieval (a small HyDE sketch follows this list).
- Use Re-ranking Models: Refine the selection of retrieved chunks by applying re-ranking models before the final generation step.
- Play with Prompt Engineering: Try different prompt phrasings or include few-shot examples to better guide the language model’s responses.
These steps help in improving the precision and relevance of results in a RAG setup.
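As a concrete example of the query-transformation idea, here is a minimal sketch of HyDE: the LLM first writes a hypothetical answer, and that answer (rather than the raw question) is embedded and used for the search. It reuses the `encoder`, `client`, `chat_model`, and `question` objects from the implementation above; the prompt wording is illustrative:

```python
# HyDE: generate a hypothetical answer, then search with its embedding
hyde_prompt = ChatPromptTemplate.from_template(
    "Write a short passage that answers the question: {question}"
)
hypothetical_answer = chat_model.invoke(hyde_prompt.format(question=question)).content

# Embed the hypothetical answer and query Qdrant with it
hyde_vector = encoder.encode(hypothetical_answer).tolist()
search_result = client.query_points(
    collection_name="mahabharata_embeddings",
    query=hyde_vector,
    limit=3
).points
```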
Please refer to my blog for Advanced Techniques in RAG:
Thanks
Sreeni Ramadorai