Retrieval Augmented Generation (RAG) is a process where we augment the knowledge of a Large Language Model (LLM). A regular LLM is trained on a specific dataset, and its knowledge is cut off at some point in time.
RAG enables the introduction of additional knowledge-based information into an LLM to provide more accurate, current, or specific information.
How RAG Works
RAG works by retrieving relevant information from a knowledge store using the user's query (optionally rewritten by the LLM). This information is then added to the LLM's input, helping it produce more accurate and context-aware responses.
The retrieval process typically relies on a semantic search engine, which uses embeddings stored in vector databases along with advanced ranking and query rewriting to ensure the results align with the query and address the user's needs.
Concepts
Indexing
Indexing is the process of collecting and loading data from a source and processing it for further use.
Here are the steps used for indexing data:
Load / Extract data: We need to load the data from an external source such as a database, local files, web pages, or an API. LangChain uses Document Loaders for this process.
Split: We split the data into smaller chunks using Text Splitters. This stage is necessary to let the LLM retrieve and process data more effectively. For example, OpenAI GPT models have a limited number of tokens they can process in a single input (context windows on the order of a few thousand to tens of thousands of tokens). Splitting documents ensures that the chunks remain within these token limits, enabling efficient processing without truncation or errors.
Store: After chunking the document, the splits are stored and indexed for searching. This is accomplished using a vector store and an embedding model. LangChain provides VectorStore functionality and integration with various text embedding models. A minimal sketch of this pipeline follows below.
(Indexing pipeline diagram; image courtesy of the LangChain website.)
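Put together, a minimal sketch of the indexing pipeline in LangChain JS might look like the code below. This is only an illustration: it assumes a local text file called notes.txt and the Google embedding model we configure later in this article; any loader and embedding model will do.

import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { GoogleGenerativeAIEmbeddings } from "@langchain/google-genai";

// 1. Load: read raw documents from a source (a local text file here, just for illustration)
const loader = new TextLoader("./notes.txt");
const docs = await loader.load();

// 2. Split: break the documents into overlapping chunks
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 200 });
const splits = await splitter.splitDocuments(docs);

// 3. Store: embed each chunk and index it in a vector store
const embeddings = new GoogleGenerativeAIEmbeddings({ apiKey: process.env.GOOGLE_KEY });
const vectorStore = await MemoryVectorStore.fromDocuments(splits, embeddings);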
Retrieval and Generation
- Retrieval: The system uses the user's input to search for and retrieve related records/splits from the store.
- Generation: An LLM combines the user's input with the retrieved information, integrating external data with its training data. A sketch of this stage follows below.
(Retrieval and generation diagram; image courtesy of the LangChain website.)
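Continuing from the indexing sketch above, and assuming a chat model instance such as the ChatGroq model we configure later in this article, the retrieval and generation stage boils down to a few lines:

// Retrieval: find the stored chunks most similar to the user's question
const question = "What does the candidate do?";
const retriever = vectorStore.asRetriever();
const relevantDocs = await retriever.invoke(question);

// Generation: combine the user's question with the retrieved chunks and ask the LLM
const context = relevantDocs.map((doc) => doc.pageContent).join("\n\n");
const response = await model.invoke(
  `Answer the question using only this context:\n${context}\n\nQuestion: ${question}`
);
console.log(response.content);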
Quick Overview
RAG (Retrieval-Augmented Generation) implementation involves loading documents from external sources, which are then refined through processes like splitting or chunking. Each chunk is embedded and stored in a vector database to enable efficient search and retrieval.
During the retrieval and generation stage, relevant records are fetched from the vector database. These retrieved chunks are combined with the user's prompt and processed by the language model. The LLM generates an output that integrates the user's query, the augmented data, and its pre-trained knowledge.
Personal Profile Chatbot
In this article we are going to build a personal Q&A chatbot: we feed the LLM our resume and can then ask questions based on the resume we’ve provided.
Pre-requisite:
- Knowledge of Expressjs
- Express.js server setup
Installation:
Here are the LangChain packages we need for our project:
npm i langchain @langchain/core @langchain/community
Picking an LLM
This implementation uses the "llama3-8b-8192" model from Meta, served via Groq.
Groq is a company that offers fast AI inference, powered by its LPU™ AI inference technology, which delivers fast, affordable, and energy-efficient AI. You can sign up to obtain your API key, and you can choose any model you like in the playground.
LangChain provides a consistent interface for working with chat models from different providers while offering additional features for monitoring, debugging, and optimizing the performance of applications that use LLMs.
With the right LLM choice out of the way, the next step is to install the chat model interface, which in our case is ChatGroq:
npm i @langchain/groq
Add your API key as an environment variable:
GROQ_API_KEY="my_groq_api_key"
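If you keep this key in a local .env file (an assumption about your setup; any way of setting environment variables works), one common approach is to load it with the dotenv package before anything reads process.env:

npm i dotenv

// at the very top of your server entry file
import "dotenv/config";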
Create a new file model.js
import { ChatGroq } from "@langchain/groq";

// Now we can instantiate our model object and generate chat completions:
const model = new ChatGroq({
  apiKey: process.env.GROQ_API_KEY,
  model: "llama3-8b-8192",
  temperature: 0.9,
});

export default model;
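As a quick sanity check (optional, not part of the final controller), you can invoke the model directly with a plain string and inspect the AIMessage it returns:

import model from "./model.js"; // adjust the path to wherever you saved model.js

const reply = await model.invoke("Say hello in one short sentence.");
console.log(reply.content); // e.g. "Hello there!"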
Personal information details:
To ground the LLM in our personal information, we need some data we can feed to it so it can answer questions based on that data. In my case I used my resume as a PDF file downloaded from LinkedIn. Create a new folder called public and add the PDF document to the assets dir.
In the src dir., create a chat.controller.js in the controller dir.
├── src
│   ├── controller
│   │   ├── chat.controller.js
├── public
│   ├── assets
│   │   ├── profile.pdf
chat.controller.js
import { WebPDFLoader } from "@langchain/community/document_loaders/web/pdf";
import fs from "fs/promises";
import path from "path";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import embeddings from "../lib/embeddingModel.js";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import model from "../lib/model.js";

export default async function myProfile(req, res) {
  const profilePdfPath = path.join(
    process.cwd(),
    "public",
    "assets",
    "profile.pdf"
  );

  const buffer = await fs.readFile(profilePdfPath);
  const profileBlob = new Blob([buffer], { type: "application/pdf" });

  const loader = new WebPDFLoader(profileBlob, {
    parsedItemSeparator: "", // handle extra white-space in the document
  });

  const textSplitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });

  try {
    const docs = await loader.load();
    const splits = await textSplitter.splitDocuments(docs);

    const vectorStore = await MemoryVectorStore.fromDocuments(
      splits,
      embeddings
    );
    const retriever = vectorStore.asRetriever();

    const prompt = ChatPromptTemplate.fromTemplate(
      `You are an assistant for question-answering tasks related to Bolaji Bolajoko profile information. Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}`
    );

    const llm = model;
    const ragChain = await createStuffDocumentsChain({
      llm,
      prompt,
    });

    const retrievedDocs = await retriever.invoke("What are his experiences?");

    const message = await ragChain.invoke({
      question: "Who's Bolaji Bolajoko?",
      context: retrievedDocs,
    });

    return res.status(200).json({ success: true, message: message });
  } catch (error) {
    return res.status(500).json({ success: false, message: "internal server error" });
  }
}
Detailed Walkthrough
Now, let’s go through each step of the code above for a better understanding.
1. Indexing: Load
We need a way to load the documents that will be used to ground the LLM. In the code above we used a DocumentLoader, an object that helps us load content from a source. DocumentLoaders return a list of Document objects, each containing some pageContent (string) and metadata (Record).
In our code we used WebPDFLoader, which is one of the DocumentLoaders provided by LangChain. WebPDFLoader accepts the PDF file as a Blob, which we created via the Node.js Blob API by passing in the buffered file contents read with the fs module.
We can view the content and metadata of the documents:
const docs = await loader.load();

// page content
console.log(docs[0].pageContent); // Bolaji Bolajoko full-stack software developer...

console.log(docs[0].metadata);
/** page metadata
pdf: {
  version: '1.10.100',
  info: {
    PDFFormatVersion: '1.4',
    IsAcroFormPresent: false,
    IsXFAPresent: false,
    Title: 'Resume',
    Author: 'LinkedIn',
**/

// number of characters
console.log(docs[0].pageContent.length); // 1612
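As an aside, if you would rather load the PDF straight from disk instead of constructing a Blob yourself, LangChain also ships a filesystem-based PDFLoader that should produce equivalent Documents (it relies on the pdf-parse package; the path below is the same profile PDF used in this article):

import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

// requires the pdf-parse peer dependency: npm i pdf-parse
const loader = new PDFLoader("public/assets/profile.pdf");
const docs = await loader.load();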
2. Indexing: Split
Splitting text into smaller chunks is an important step in a RAG implementation. Text splitting comes with a lot of advantages, such as easier context retrieval, managing token limits, scalability, and more.
To handle this, we split our Documents into chunks so we can embed them and store the resulting vectors in a VectorStore.
In our code we used the RecursiveCharacterTextSplitter to split the Documents into chunks of 1,000 characters with an overlap of 200. A TextSplitter is an object that splits Documents into smaller chunks.
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
The chunkSize specifies the number of characters each text chunk will contain, in this case 1,000 characters per chunk. The chunkOverlap defines the number of characters from the previous chunk to include at the beginning of the next chunk, ensuring context continuity.
const splits = await textSplitter.splitDocuments(docs); // returns Document<Record<string, any>>[]
console.log(splits.length) // 3
console.log(splits[0].pageContent.length) // 966
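To see the overlap behaviour in isolation, you can run a splitter on a short throwaway string with tiny limits (the numbers below are just for illustration; exact chunk boundaries depend on where the splitter finds separators such as spaces):

const demoSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 40,
  chunkOverlap: 10,
});
const chunks = await demoSplitter.splitText(
  "RAG pipelines split long documents into small overlapping chunks before embedding them."
);
console.log(chunks);
// Each chunk is at most ~40 characters, and consecutive chunks
// repeat roughly the last 10 characters of the previous one.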
3. Indexing: Store
After splitting our documents into chunks, we need a way to index and store them so that we can search for and retrieve the relevant data. To store the content of each document split, we embed the chunked texts and store the resulting vectors in a vector store or vector database. (Embedding converts text into a numerical vector that captures its meaning.) *Check my blog post on Embedding.*
When we want to search for similar text in a vector database, we take our prompt query, embed it, and perform a similarity search against the embedded vectors in the database. We use cosine similarity, which measures the angle between vectors, to find the most similar vectors in the vector database.
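The vector store does this for us, but as a mental model, cosine similarity between two embedding vectors can be sketched in a few lines of plain JavaScript (illustrative only; not code you need for this project):

// cosine similarity = dot(a, b) / (|a| * |b|); 1 means same direction, 0 means unrelated
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 2, 3], [1, 2, 3])); // 1 (identical direction)
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 (unrelated)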
We can embed and store the vectors using the MemoryVectorStore and the GoogleGenerativeAIEmbeddings model.
embeddingModel.js
import { GoogleGenerativeAIEmbeddings } from "@langchain/google-genai";
import { TaskType } from "@google/generative-ai";

const embeddings = new GoogleGenerativeAIEmbeddings({
  model: "text-embedding-004", // 768 dimensions
  taskType: TaskType.RETRIEVAL_DOCUMENT,
  title: "Document title",
  apiKey: process.env.GOOGLE_KEY,
});

export default embeddings;
import embeddings from "../lib/embeddingModel.js";

const vectorStore = await MemoryVectorStore.fromDocuments(
  splits,
  embeddings
);
Vector stores
Vector stores are specialized data stores that enable indexing and retrieving information based on vector representations. These vectors are called embeddings, and they capture the semantic meaning of the data that has been embedded.
LangChain provides a standard interface for working with vector stores, allowing users to switch between different vector store implementations. The interface consists of basic methods for writing, deleting, and searching documents in a vector store.
In our case we used the MemoryVectorStore, an in-memory vector store provided by LangChain that stores embeddings in memory and performs an exact linear search for the most similar embeddings.
MemoryVectorStore accepts an embedding model, and its fromDocuments method embeds the documents with that model and adds them to the store.
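You can also query the store directly before turning it into a retriever; similaritySearch takes a text query and the number of results to return:

// returns the k chunks whose embeddings are closest to the query's embedding
const topChunks = await vectorStore.similaritySearch("work experience", 2);
console.log(topChunks.map((doc) => doc.pageContent));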
Moving Forward
With the indexing pipeline out of the way, the next step is to query for similar embeddings by running a similarity search of our query against the embeddings in the store, and then generate a response using the LLM.
4. Retrieval and Generation
a. Retriever
To retrieve semantically similar content, we take the user’s question or query, perform a similarity search against the embeddings in the store, and then combine the user’s initial prompt with the search results to provide more context to the LLM.
In addition to its storage capabilities, the VectorStore also provides a retrieval system that performs a similarity search on the embeddings it holds. A VectorStore can easily be turned into a retriever with VectorStore.asRetriever().
const retriever = vectorStore.asRetriever({
  k: 2, // number of documents to retrieve per search query; default is 4
  searchType: "similarity", // search approach "similarity" | "mmr"; default is "similarity"
});

const retrievedDocument = await retriever.invoke("Past experience?");

console.log(retrievedDocument.length); // 2
console.log(retrievedDocument[0].pageContent); // Bolaji Bolajoko is a full-stack web developer with expertise in...
b. Generate
Before we move any further, let’s pick the chat model we will use to generate the response.
In our example, we are using Groq as our model provider. Install the Groq package:
npm i @langchain/groq
Obtain your API key by signing up with Groq.
llmModel.js
import { ChatGroq } from "@langchain/groq";

const llmModel = new ChatGroq({
  apiKey: process.env.GROQ_API_KEY,
  model: "llama3-8b-8192",
  temperature: 0.9,
});

export default llmModel;
Now, let’s create our prompt template using ChatPromptTemplate.
const prompt = ChatPromptTemplate.fromTemplate(
  `You are an assistant for question-answering tasks related to Bolaji Bolajoko profile information. Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}`
);
- ChatPromptTemplate: This is a class that helps you define and manage the structure of a prompt for a chat model. It makes it easier to fill in dynamic parts of the prompt with actual data when sending it to the model.
- fromTemplate method: The fromTemplate method allows you to create a prompt template from a string. The string can have placeholders (e.g., {question}, {context}) that will be replaced with actual values at runtime.
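For instance, formatting the template with concrete values fills in the placeholders before anything is sent to the model (a quick illustration, not part of the controller):

const filled = await prompt.format({
  question: "Who's Bolaji Bolajoko?",
  context: "Bolaji Bolajoko is a full-stack web developer...",
});
console.log(filled); // the full prompt text with {question} and {context} replaced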
import model from "../lib/model.js";

const llm = model;

const ragChain = await createStuffDocumentsChain({
  llm,
  prompt,
});
The purpose of the code above is to generate answers by combining the retrieved context with the question.
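Under the hood, the chain does something roughly like the following (an illustrative sketch, not the library's actual implementation), where docs are the retrieved Documents and query is the user's question:

// 1. "Stuff" all retrieved chunks into a single context string
const contextText = docs.map((doc) => doc.pageContent).join("\n\n");
// 2. Fill the prompt template with the question and that context
const messages = await prompt.formatMessages({ question: query, context: contextText });
// 3. Call the chat model and read back its text output
const aiMessage = await llm.invoke(messages);
console.log(aiMessage.content);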
const retrievedDocs = await retriever.invoke(query);

// generate a message from the prompt
const message = await ragChain.invoke({
  question: query, // e.g. "What are his past experiences?"
  context: retrievedDocs,
});

return res.status(200).json({ success: true, message: message });
We generate the answer by passing in our question query (e.g. "What are his past experiences?"). retrievedDocs is the relevant chunk of data obtained from the VectorStore via similarity search. Then we return the message as the response to the client.
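Finally, the controller needs to be reachable over HTTP. A minimal way to wire it into the Express server from the prerequisites might look like this (the route path and file layout are assumptions; adjust them to your own setup):

import express from "express";
import myProfile from "./controller/chat.controller.js";

const app = express();
app.use(express.json());

// Ask questions about the profile via GET /api/profile
app.get("/api/profile", myProfile);

app.listen(3000, () => console.log("Server running on http://localhost:3000"));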
Conclusion
RAG implementation with LangChain provides a powerful way to enhance LLM capabilities by incorporating external knowledge sources. This implementation demonstrates how to:
- Create an effective indexing pipeline for document processing
- Implement efficient vector storage and retrieval systems
- Generate contextually relevant responses using LLMs
The combination of document loading, text splitting, vector storage, and retrieval systems enables the creation of sophisticated applications that can understand and respond to queries with both general knowledge and specific context-aware information.
Key benefits of this approach include:
- Enhanced accuracy through contextual awareness
- Ability to incorporate current or domain-specific information
- Scalable architecture for handling various document types
- Flexible implementation that can be adapted for different use cases
For production implementations, consider:
- Implementing proper error handling
- Adding caching mechanisms for frequently accessed data
- Monitoring system performance and response times
- Regular updates to the knowledge base
This implementation serves as a foundation that can be extended and customized based on specific application requirements and use cases.