Eze Lanza

Understanding Retrieval Augmented Generation (RAG)

Learn what a RAG system is and how to deploy it using OPEA’s open source tools and frameworks

Originally posted on https://medium.com/p/4d1d08f736b3

By this point, most of us have used a large language model (LLM), like ChatGPT, to try to find quick answers to questions that rely on general knowledge and information. These questions range from the practical (What’s the best way to learn a new skill?) to the philosophical (What is the meaning of life?).
A screenshot listing the most common questions asked to an AI. The questions include topics like AI, homework help, the meaning of life, weather updates, creative writing, productivity, technology trends, and book or movie recommendations

But how do you get answers to questions that are personal? How much does your LLM know about you? Or your family?
Let’s test ChatGPT and see how much it knows about my parents.

A screenshot of a conversation where a user asks, ‘Do you know who is my mum?’ and the AI responds, explaining that it doesn’t have access to personal information about individuals.

It’s understandable to feel frustrated when a model doesn’t recognize you, but it’s important to remember that these models don’t have much information about our personal lives. Unless you’re a celebrity or have your own Wikipedia page (as Tom Cruise has), the training dataset used for these models likely doesn’t include our information, which is why they can’t provide specific answers about us.

A screenshot displays a question asking whether the model knows who Tom Cruise's mum is; the model answers that Mary Lee Pfeiffer is his mother.

So, how do we get our LLMs to know us better?

That’s the million-dollar question facing enterprises looking to boost productivity with GenAI. They need models that provide context-based results. In this post, we’ll explain the basics of how retrieval augmented generation (RAG) improves your LLM’s responses and show you how to easily deploy your RAG-based model using a modular approach with the open source building blocks that are part of the new Open Platform for Enterprise AI (OPEA).

What is RAG?

We know that LLMs can help us complete an extensive range of tasks, such as writing, learning, programming, translating, and more. However, the result we receive depends on what we ask the model, in other words, on how carefully we build our prompts. For that reason, we spend a lot of time searching for the perfect prompt to get the answer we want, and we're all gradually becoming experts in prompting.

Let’s return to the above question: “Who is my mum?” We know who our mum is, we have memories, and that information lives in our “mental” knowledge base, our brain.

When building the prompt, we need to somehow provide the model with memories of our mum and guide it to use that information to creatively answer the question: Who is my mum? We'll give it some of mum's history and ask it to take her past into account when answering.

A screenshot displays a question asking for a creative response based on provided details: the user’s mother was born in the US, is 60, has strong Italian roots, and loves pizza. Below, a response discusses how her Italian heritage influences her personality, family traditions, and love for Italian cuisine, emphasizing her vibrant and lively nature.

As we can see, the model successfully gave us an answer that described my mum. Congratulations, we have used RAG!

Let’s inspect what we did.

Given the initial question, we tweaked the prompt to guide the model in how to use the information (context) we provided.

We can think of the RAG process as three parts:

A screenshot shows a section divided into three parts. The top part labeled “Instruct” gives instructions asking to answer the question creatively, considering how roots influence the person’s behavior. The middle section, labeled “Context,” lists facts about the user’s mother: born in the US, age 60, strong Italian roots, and a love for pizza. The bottom part, labeled “Initial Question,” asks “Who is my mum?”

  • Instruct: Guide the model. We guided the model to use the information we provided (the documents) to give us a creative answer that takes my mum's history into account. Those instructions were just one example; we could have given different guidance depending on the outcome we wanted. If we don't want a creative answer, for instance, this is the place to say so.
  • Context: Provide the context. In this example, we already knew the information about my mother because we retrieved it from my own memories. In a real scenario, the challenge is finding the relevant data in a knowledge base to feed the model, so it has the context needed to give us an accurate response. This process is called "retrieval."
  • Initial Question: The initial question we want answered.
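
To make those three parts concrete, here's a minimal sketch in Python of how the prompt above could be assembled programmatically. The instruction wording and the list of facts simply mirror the example; how the resulting string is sent to an LLM depends on the client you use and is left out.

```python
# Minimal sketch: assembling a RAG-style prompt from its three parts.
# The instruction, facts, and question mirror the example above.

instruction = (
    "Answer the question creatively, taking into account how the person's "
    "roots influence her behavior."
)

context_facts = [
    "She was born in the US.",
    "She is 60 years old.",
    "She has strong Italian roots.",
    "She loves pizza.",
]

question = "Who is my mum?"

prompt = (
    "Instruct: " + instruction + "\n\n"
    "Context:\n" + "\n".join("- " + fact for fact in context_facts) + "\n\n"
    "Question: " + question
)

print(prompt)  # This string is what gets sent to the LLM.
```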

Let’s explore how an enterprise can implement a real-life RAG example using open source tools and models. We’ll deploy it using the standardized frameworks and tools made available through OPEA, which was created to help streamline the implementation of enterprise AI.

Exploring the OPEA Architecture

Here’s the architecture we used for the previous example:

A diagram of the RAG architecture: the user's question flows through an embedding model and retriever backed by a knowledge base (vector database), then through a reranker, before the refined prompt reaches the LLM.

RAG can be understood simply as the steps mentioned above:

  1. Initial Question
  2. Context
  3. Instruct

However, implementing the process in practice can be challenging because multiple components are needed: retrievers, embedding models, and a knowledge base, as shown in the image above. Let’s explore how those parts can work together.

The key lies in providing the right context. You can compare the process to how our memories help us answer questions. For a company, this might mean drawing from a knowledge base of historical financial data or other relevant documents.

For example, when a user asks a chatbot a question, the RAG application must first dive into a knowledge base and extract the most relevant information before the LLM can produce an answer (the retrieval process). But even before retrieval happens, an embedding model plays a crucial role in converting the data in the knowledge base into vector representations, meaningful numerical embeddings that capture the essence of the information. These embeddings live in the knowledge base (vector database) and allow the retriever to efficiently match the user's query with the most relevant documents.
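
As a rough sketch of that retrieval step, the snippet below embeds a tiny toy knowledge base and matches a query against it with cosine similarity. It assumes the sentence-transformers library and uses a plain Python list in place of a real vector database; the model choice and the documents are illustrative only.

```python
# Rough sketch of retrieval: embed a toy knowledge base, then find the
# documents closest to the user's query. A real deployment would store the
# embeddings in a vector database instead of a Python list.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding library

documents = [
    "Mum was born in the US and is 60 years old.",
    "Mum has strong Italian roots and loves pizza.",
    "The company reported strong Q3 financial results.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

query = "Who is my mum?"
query_vector = embedder.encode([query], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity is just a dot product.
scores = doc_vectors @ query_vector
top_k = np.argsort(scores)[::-1][:2]       # indices of the two best matches
retrieved = [documents[i] for i in top_k]
print(retrieved)
```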

Once the RAG application finds the relevant documents, it performs a reranking step that checks the quality of the retrieved information and re-orders it by relevance. It then builds a new prompt from the refined context of the top-ranked documents and sends that prompt to the LLM, enabling the model to generate a high-quality, contextually informed response. Easy, right?
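
Below is a similarly hedged sketch of the reranking and prompt-building steps. It assumes sentence-transformers' CrossEncoder for the reranker, and the final llm_generate() call is a hypothetical placeholder for whichever LLM client the pipeline uses.

```python
# Rough sketch of reranking and prompt building. The cross-encoder scores each
# (query, document) pair; we keep the top results and build the final prompt.
from sentence_transformers import CrossEncoder  # assumed reranker library

query = "Who is my mum?"
retrieved = [
    "Mum has strong Italian roots and loves pizza.",
    "Mum was born in the US and is 60 years old.",
    "The company reported strong Q3 financial results.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model
scores = reranker.predict([(query, doc) for doc in retrieved])
ranked = [doc for _, doc in sorted(zip(scores, retrieved), reverse=True)]
top_context = ranked[:2]  # keep only the most relevant documents

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n".join("- " + doc for doc in top_context) + "\n\n"
    "Question: " + query
)

# llm_generate() is a hypothetical placeholder for whichever LLM client
# (local model, OpenAI-compatible endpoint, etc.) the pipeline uses.
# answer = llm_generate(prompt)
```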

As you can see, the RAG architecture isn't about just one tool or one framework; it's composed of multiple moving pieces, which makes it difficult to keep track of each component. When deploying a RAG system in the enterprise, we face multiple challenges, such as ensuring scalability, handling data security, and integrating with existing infrastructure.

The Open Platform for Enterprise AI (OPEA) aims to solve those problems by treating each component in the RAG pipeline as a building block that is easily interchangeable. Say, for example, you're using Mistral but want to replace it with Falcon. Or say you want to swap out the vector database on the fly. You don't want to rebuild the entire application; that would be a nightmare. OPEA makes deployment easier by providing robust tools and frameworks designed to streamline these processes and facilitate seamless integration.
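
To illustrate the building-block idea (and only the idea; this is not OPEA's actual API, whose components live in the GenAIComps repository), here's a hypothetical sketch of a pipeline that depends on a small interface, so the LLM component can be swapped without touching the rest of the application.

```python
# Hypothetical illustration of the interchangeable building-block idea; this
# is NOT OPEA's actual API. The pipeline depends only on a tiny interface,
# so swapping Mistral for Falcon does not require rebuilding the application.
from typing import Protocol


class LLMBlock(Protocol):
    def generate(self, prompt: str) -> str: ...


class MistralBlock:
    def generate(self, prompt: str) -> str:
        return f"[Mistral's answer to: {prompt}]"  # stand-in for a real model call


class FalconBlock:
    def generate(self, prompt: str) -> str:
        return f"[Falcon's answer to: {prompt}]"   # stand-in for a real model call


def rag_answer(llm: LLMBlock, prompt: str) -> str:
    # Retrieval and reranking would run before this call in a full pipeline.
    return llm.generate(prompt)


print(rag_answer(MistralBlock(), "Who is my mum?"))
print(rag_answer(FalconBlock(), "Who is my mum?"))  # same pipeline, different block
```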

You can see this process in action by running the ChatQnA example: https://github.com/opea-project/GenAIExamples/tree/main/ChatQnA. There, you'll find all the steps needed to create the building blocks for your RAG application on your own server or your AI PC.

Call to Action

We have shown you the basics of how RAG works and how to deploy a RAG pipeline using the OPEA framework. While the process is straightforward, deploying a RAG system at scale can introduce complexities. Here’s what you can do next:

  • Explore GenAIComps: Gain insights into how generative AI components work together and how you can leverage them for real-world applications. OPEA provides detailed examples and documentation to guide your exploration.
  • Explore the RAG demo (ChatQnA): Each part of a RAG system presents its own challenges, including ensuring scalability, handling data security, and integrating with existing infrastructure. OPEA, as an open source platform, offers tools and frameworks designed to address these issues and make the deployment process more efficient. Explore our demos to see how these solutions come together in practice.
  • Explore GenAI Examples: OPEA is not focused only on RAG; it is about generative AI as a whole. Multiple other demos, such as VisualQnA, showcase different GenAI capabilities. These examples demonstrate how OPEA can be leveraged across various tasks, expanding beyond RAG into other innovative GenAI applications.
  • Contribute to the project! OPEA is built by a growing community of developers and AI professionals. Whether you’re interested in contributing code, improving documentation, or building new features, your involvement is key to our success.

Join us on the OPEA GitHub to start contributing or explore our issues list for ideas on where to start.
