
Chloe Williams for Zilliz

Originally published at zilliz.com

3 Key Patterns to Building Multimodal RAG: A Comprehensive Guide

Large Language Models (LLMs) are highly regarded for their versatility, as we can use them in numerous AI applications, such as personalized chatbots, document summarization, document question-answering, document classification, and more.

However, one key problem when using LLMs is the risk of hallucination. Hallucination refers to a phenomenon where LLMs produce highly convincing yet untruthful responses to our queries. It's quite tricky to spot hallucinations from LLMs, especially if we're asking them questions about topics we're not really familiar with.

Among many other methods, Retrieval Augmented Generation (RAG) is an approach that can help us mitigate the risk of LLMs' hallucinations. In its early implementation, RAG was more commonly used for text input only. With the advancement of AI technologies, we can now use RAG with different modalities of data, such as images, audio, videos, etc., which we refer to as multimodal RAG.

In this article, we're going to discuss different approaches to how we can implement multimodal RAG in our AI applications. Before we delve into multimodal RAG, let's first recap the fundamentals of RAG.

The Fundamentals of RAG

RAG is a novel approach that helps mitigate the risk of LLM hallucination by providing relevant context to a user query in the prompt. Before responding to a user query, the LLM can use this relevant context as the basis for its response, resulting in more contextualized answers.

As the name suggests, RAG has three main components: retrieval, augmentation, and generation.

  • Retrieval: In this component, the most relevant contexts for a user query are fetched. There are two stages in this component: candidate retrieval and reranking. In the candidate retrieval stage, the top-n most promising contexts are fetched based on similarity metrics such as cosine similarity or Euclidean distance. In the reranking stage, these candidates are re-scored and reordered, often with a more precise model such as a cross-encoder, so that the most relevant contexts come first.

  • Augmentation: In this component, the most promising contexts are integrated with the original user query to form one final coherent prompt. This final prompt will then serve as input for our LLM.

  • Generation: In this component, the LLM generates the response based on the input prompt, which contains the promising contexts to answer the user's query. The response is then sent back to the user.

Figure: The complete workflow of RAG.

However, we need to set up a few things before implementing RAG in our application.

For example, we need an efficient and scalable storage system to store all the possible contexts before we can fetch them. Since the contexts typically useful for RAG are unstructured data (text, image, etc.), vector databases are the most common storage systems used in RAG applications.

In a vector database, we typically store the embedding representation of contexts instead of the raw contexts. With embeddings, we can perform similarity searches to find the most promising contexts for any given query. Therefore, we also need a deep learning model (embedding model) to transform our raw contexts into embeddings.
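To make this concrete, here is a minimal sketch of how an embedding model maps text to vectors whose cosine similarity reflects semantic closeness. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model; the example sentences are placeholders:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a small, general-purpose text embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A smiling dog", "A happy puppy", "Quarterly revenue report"]
embeddings = model.encode(sentences)  # shape: (3, 384)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically similar sentences score higher than unrelated ones
print(cosine_similarity(embeddings[0], embeddings[1]))  # relatively high
print(cosine_similarity(embeddings[0], embeddings[2]))  # relatively low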

Figure: Embeddings of similar words in a two-dimensional vector space.

The end-to-end workflow of the RAG pipeline is then as follows (a minimal code sketch follows the list):

  1. Transform raw contexts into embeddings with the help of an embedding model.

  2. Store and index these embeddings in a vector database.

  3. For any given query, transform the query into an embedding using the same embedding model we used for raw contexts.

  4. Perform a similarity search between the query embedding and context embeddings inside the vector database.

  5. Fetch the top-n most relevant contexts and integrate those contexts with the original query into one coherent prompt as input for our LLM.

  6. The LLM generates a response to the query using the provided relevant contexts to give a more accurate result.
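As a rough sketch of these six steps, the snippet below uses pymilvus with Milvus Lite (a local file) and the all-MiniLM-L6-v2 embedding model; the collection name, file name, and example contexts are placeholders, and the final LLM call is left out:

from pymilvus import MilvusClient, model

# 1. Embedding model (SentenceTransformers wrapper shipped with pymilvus[model])
ef = model.dense.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

# 2. Store and index context embeddings (Milvus Lite writes to a local file)
client = MilvusClient("rag_demo.db")
client.create_collection(collection_name="contexts", dimension=ef.dim)

contexts = [
    "Milvus is an open-source vector database built for similarity search.",
    "RAG grounds LLM answers in retrieved context to reduce hallucination.",
]
vectors = ef.encode_documents(contexts)
client.insert(
    collection_name="contexts",
    data=[{"id": i, "vector": vectors[i], "text": contexts[i]} for i in range(len(contexts))],
)

# 3-4. Embed the query with the same model and run a similarity search
query = "How does RAG reduce hallucination?"
hits = client.search(
    collection_name="contexts",
    data=ef.encode_queries([query]),
    limit=2,
    output_fields=["text"],
)

# 5. Build one coherent prompt from the retrieved contexts and the query
retrieved = "\n".join(hit["entity"]["text"] for hit in hits[0])
prompt = f"Answer the question using only the context below.\n\nContext:\n{retrieved}\n\nQuestion: {query}"
# 6. Send `prompt` to your LLM of choice to generate the final answer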

The Fundamentals of Multimodal RAG

The RAG implementation we discussed in the previous section is very helpful in mitigating the risk of LLM hallucination and improving the overall quality of LLM responses. However, when we talk about contexts in RAG, we normally mean text. Meanwhile, we know that in real-world applications, we might want to provide other modalities as contexts instead of just text.

Let's say that we want to use a collection of documents as contexts for an internal chatbot application. As we already know, a document typically consists of not only text, but also images, charts, and tables, which contain a lot of useful information to answer users' queries. With text-based RAG, we can't store information contained in these images, charts, and tables as contexts.

Multimodal RAG is the solution to this problem, as with this RAG method, we can store all of the contexts from different sources of information, thus also improving the overall accuracy of LLM responses.

Figure: Multimodal RAG pipeline.

Thanks to the inception and rise of multimodal embedding models as well as multimodal LLMs, it's now possible for us to implement multimodal RAG. The idea of multimodal RAG is exactly the same as regular RAG, but now we're able to store embeddings from different modalities of data, such as images, audio, and videos. However, we need to make sure that we're using multimodal embedding models as well as multimodal LLMs if we want to implement multimodal RAG.

In general, we can implement multimodal RAG in various ways. Specifically, there are three different patterns that we'll talk about in detail in this article:

  1. Ground all modalities into one primary modality.

  2. Embed all modalities into the same vector space.

  3. A hybrid retrieval with raw image access.

Let's discuss these patterns one by one.

Pattern 1: Ground All Modalities into One Primary Modality (Multimedia to Text)

The first pattern involves transforming all modalities into one primary modality. Although you can choose any modality as the primary one, text is the most commonly used in multimodal RAG. Therefore, we're going to use text as our primary modality throughout this section.

To transform different modalities into text, the trick is to use a multimodal LLM to generate a text summary of our data. For example, let's say we have a document that contains a bunch of text and an image. Since text is our primary modality, we don't need to do anything with the text in the document. Meanwhile, we can use a Vision Language Model (VLM) like LLaVA, Gemini, Claude Sonnet, Qwen-VL, Pixtral, etc., which accepts both images and text as inputs, to generate a text summary of our image.

Once we have the text summary of our image, we can transform this text alongside other text in the document into embeddings using a text-based embedding model. There are many text-based embedding models we can choose from, such as those from SentenceTransformers, OpenAI, VoyageAI, etc. The embeddings of these texts are then stored and indexed in a vector database.
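As an illustration, here is a minimal sketch of this captioning-then-embedding step. It assumes the OpenAI Python SDK with a vision-capable model (gpt-4o-mini is used here only as an example) and the pymilvus model subpackage; the image path is a placeholder:

import base64
from openai import OpenAI
from pymilvus import model

llm_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask a vision-capable LLM to describe the image in text (our primary modality)
with open("chart.png", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode()

response = llm_client.chat.completions.create(
    model="gpt-4o-mini",  # any VLM that accepts image input works here
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail for retrieval purposes."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
image_summary = response.choices[0].message.content

# Embed the summary with the same text embedding model used for the document text
ef = model.dense.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
summary_embedding = ef.encode_documents([image_summary])[0]
# `summary_embedding` is then inserted into the vector database alongside the other text chunks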

Figure: Pattern 1 workflow.

Now, for any given query, we can transform it into an embedding using the same text-based embedding model we used before. After that, we can perform a similarity search to find the most relevant contexts and then use the text-based contexts as part of the prompt for our text-based or multimodal LLMs.

If you'd like to learn more about the implementation details of this pattern, we have a dedicated article that will walk you through the steps to build a multimodal RAG with this pattern.

This pattern would be perfect to use if you don't need access to the raw, non-text data in your use case. Your application might accept images as inputs, but the output is always text-based. For example, you might build an application that has the functionality to explain the content of images in an internal document.

However, we're still relying on text-based contexts with this pattern, just like the usual RAG system. In real-world applications, we might want to use images or other modalities as contexts. Therefore, let's talk about the second pattern.

Pattern 2: Embed All Modalities into the Same Vector Space

The second pattern involves transforming data in all modalities into embeddings in the same vector space. The secret behind this approach is the implementation of multimodal embedding models such as CLIP and ALIGN. Let's take CLIP as an example.

CLIP is a model developed by OpenAI that takes a text and an image as a pair of inputs and has been trained to determine the similarity between them. As a result, CLIP will give a high similarity score if the text aligns with the image, and a low score if it doesn't.

Figure: Embeddings of data with different modalities with CLIP in three dimensional vector space.

As you can see above, let's say we have the sentence "A smiling dog" and an image of a smiling dog. CLIP will transform both the text and the image into embeddings of the same dimensionality in a shared vector space, and if we check that vector space, the two embeddings will likely be placed close to each other.
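Here is a minimal sketch of that idea, assuming the Hugging Face transformers implementation of CLIP (openai/clip-vit-base-patch32); the image path is a placeholder:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("smiling_dog.jpg")  # placeholder image path
inputs = processor(text=["A smiling dog"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

text_embedding = outputs.text_embeds[0]    # 512-dim vector
image_embedding = outputs.image_embeds[0]  # 512-dim vector, same space

# Cosine similarity between the caption and the image
similarity = torch.nn.functional.cosine_similarity(
    text_embedding.unsqueeze(0), image_embedding.unsqueeze(0)
).item()
print(similarity)  # matching text/image pairs yield higher scores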

Since we have a multimodal embedding model, the first step of this pattern is transforming our data in different modalities into embeddings with this multimodal embedding model. Next, we store and index these embeddings inside a vector database like Milvus or Zilliz Cloud. Once we have a user query, we transform it using the same multimodal embedding model, and then we can perform a similarity search to find the most relevant contexts.
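Continuing the CLIP sketch above (reusing `model`, `processor`, `text_embedding`, and `image_embedding`), the snippet below indexes embeddings of both modalities in a single Milvus collection and searches them with a text query; the collection and file names are placeholders:

import torch
from pymilvus import MilvusClient

client = MilvusClient("multimodal_rag.db")  # Milvus Lite local file
client.create_collection(collection_name="multimodal_contexts", dimension=512)  # CLIP ViT-B/32 output size

# Insert embeddings of both modalities into the same collection,
# keeping track of what each vector refers to
client.insert(
    collection_name="multimodal_contexts",
    data=[
        {"id": 0, "vector": image_embedding.tolist(), "modality": "image", "source": "smiling_dog.jpg"},
        {"id": 1, "vector": text_embedding.tolist(), "modality": "text", "source": "A smiling dog"},
    ],
)

# Embed the user query with the same CLIP text encoder and search across all modalities
query_inputs = processor(text=["a happy dog"], return_tensors="pt", padding=True)
with torch.no_grad():
    query_embedding = model.get_text_features(**query_inputs)[0].tolist()

hits = client.search(
    collection_name="multimodal_contexts",
    data=[query_embedding],
    limit=2,
    output_fields=["modality", "source"],
)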

Figure: Pattern 2 workflow.

The retrieved contexts when applying this pattern can be data with a variety of modalities, such as images and text. Therefore, we need to use a multimodal LLM to take these contexts into account and generate the final response. If our data consists of images and text, we can use a Vision Language Model (VLM) like LLaVA, Gemini, Claude Sonnet, Qwen-VL, Pixtral, etc.

If you'd like to learn more about the implementation details of this pattern, we have a dedicated article that will walk you through the steps to build a multimodal RAG with this pattern. However, keep in mind that in that article, the raw images are not directly stored inside the vector database, but rather stored in local memory.

The main advantage of this pattern is its versatility and simplicity. The implementation of a multimodal embedding model means that we don't need an additional step to convert the content of all modalities into a primary modality like we did in the first pattern. Also, the contexts retrieved after similarity search can be data of any modalities instead of just one particular modality.

However, since we're able to use any modality of data as relevant context for our multimodal LLM, we also need to store the raw data when implementing this pattern. The problem is that non-text data such as images takes up a lot of space, and storing it directly in a vector database can lead to inefficient use of resources. This eventually also leads to slower query times and higher storage costs.

Therefore, we recommend using this pattern if you need to use data with different modalities as contexts, but scalability is not a concern for your use case.

Pattern 3: A Hybrid Retrieval with Raw Image Access

If you need to use data with various modalities as contexts, and scalability is also a concern, then you can implement this pattern. The main idea of this pattern is the separation of concerns: we use a vector database to perform fast and efficient similarity searches to find relevant contexts, and use dedicated object storage systems like AWS S3 or Google Cloud Storage to store the raw data.

During the implementation of this pattern, we need to perform two different steps. First, we store the actual raw data in a dedicated object storage system like AWS S3 or Google Cloud Storage. Second, we store the metadata of our raw data inside the vector database, such as the URL of our image that resides in the dedicated object storage system.

Figure: Pattern 3 workflow.

Since we use a separate system to store our raw data, the way we perform RAG is very similar to the first pattern. Let's say that text is our primary modality. The first thing we'll need to do is use a multimodal LLM to generate text summaries of our raw data. Next, we can use a text-based embedding model to transform the text summaries into embeddings. We then store the embeddings as well as the metadata of our raw data (the URL of the raw data in the dedicated storage system) in a vector database.
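A minimal sketch of the indexing side of this pattern is shown below, assuming boto3 with an existing S3 bucket and the pymilvus model subpackage; the bucket, file names, and the example summary (which would normally come from a multimodal LLM, as in Pattern 1) are placeholders:

import boto3
from pymilvus import MilvusClient, model

# 1. Upload the raw image to object storage (bucket and key are placeholders)
s3 = boto3.client("s3")
bucket, key = "my-rag-assets", "docs/chart.png"
s3.upload_file("chart.png", bucket, key)
image_url = f"s3://{bucket}/{key}"

# 2. Embed a text summary of the image (generated by a multimodal LLM, as in Pattern 1)
image_summary = "A bar chart showing quarterly revenue growth from 2021 to 2023."  # placeholder summary
ef = model.dense.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
vector = ef.encode_documents([image_summary])[0]

# 3. Store the embedding plus lightweight metadata, not the raw image, in the vector database
client = MilvusClient("hybrid_rag.db")
client.create_collection(collection_name="hybrid_contexts", dimension=ef.dim)
client.insert(
    collection_name="hybrid_contexts",
    data=[{"id": 0, "vector": vector, "summary": image_summary, "url": image_url}],
)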

For any given query, we transform it into an embedding using the same text-based embedding model (if the query itself contains non-text data, we first use the multimodal LLM to summarize that data as text). Next, we can perform a similarity search and fetch the text summaries as well as the URLs of the relevant contexts. Finally, we can retrieve the raw data via those URLs and pass it to a multimodal LLM.
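And a sketch of the query side, continuing the snippet above (reusing `client`, `ef`, `s3`, `bucket`, and `key`) and assuming the OpenAI Python SDK as the multimodal LLM; the model name is only an example:

from openai import OpenAI

# Embed the user query and retrieve the most relevant summary plus its raw-data URL
query = "How did revenue change over the last three years?"
hits = client.search(
    collection_name="hybrid_contexts",
    data=ef.encode_queries([query]),
    limit=1,
    output_fields=["summary", "url"],
)
top = hits[0][0]["entity"]

# Turn the s3:// reference into an HTTPS link the multimodal LLM can fetch
presigned_url = s3.generate_presigned_url(
    "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=3600
)

# Pass both the retrieved summary and the raw image to a vision-capable LLM
llm = OpenAI()
response = llm.chat.completions.create(
    model="gpt-4o-mini",  # any multimodal LLM works here
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": f"Context: {top['summary']}\n\nQuestion: {query}"},
            {"type": "image_url", "image_url": {"url": presigned_url}},
        ],
    }],
)
print(response.choices[0].message.content)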

Again, you can refer to this article that will walk you through the steps to build a multimodal RAG with this pattern. However, keep in mind that in that article, the raw images are not stored in a production-ready object storage system like AWS S3 or Google Cloud Storage, but rather in local memory.

Out of the three options, this pattern is the most scalable one due to the separation of raw data storage. Vector databases are optimized for indexing and searching embeddings of unstructured data, not for storing and serving large binary objects like images. In fact, retrieving binary objects from a vector database is often slower than retrieving them from a dedicated object storage system.

Therefore, we recommend you use this pattern if you want to use data with various modalities as contexts and scalability is a big concern for your use case.

How Milvus Vector Database Supports Multimodal RAG

As mentioned in the previous sections, vector databases play a crucial role in the application of Retrieval Augmented Generation (RAG). Milvus is a vector database that would be perfect for use in your RAG system or other AI applications due to its advanced features.

Milvus offers indexing methods ranging from the simplest to more advanced ones such as IVF_FLAT, HNSW, and SCANN, enabling us to store huge collections of data in a fast and efficient manner. These advanced indexing methods also speed up the similarity search used to find relevant contexts in a RAG implementation.
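For example, here is a rough sketch of configuring an HNSW index through the MilvusClient API, assuming a Milvus instance running at localhost:19530 and a collection named `contexts` with a vector field named `vector`; the parameter values are illustrative:

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Build an HNSW index on the vector field (parameter values are illustrative)
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 16, "efConstruction": 200},
)
client.create_index(collection_name="contexts", index_params=index_params)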

Figure: Easy integration of Milvus with popular tools for multimodal RAG.

Milvus also offers easy integration with all of the RAG components we have discussed in the previous section, such as embedding models, LLMs, and orchestration tools. In terms of embedding models, you can directly use popular options from OpenAI, Cohere, SentenceTransformers, HuggingFace, VoyageAI, and others with the Python SDK of Milvus called pymilvus. You can install pymilvus with a simple pip command:


pip install -U pymilvus


Now, let's say you want to use an embedding model from SentenceTransformers. First install the model subpackage, then you can use it with pymilvus as follows:


pip install "pymilvus[model]"

from pymilvus import model

sentence_transformer_ef = model.dense.SentenceTransformerEmbeddingFunction(
    model_name='all-MiniLM-L6-v2',  # Specify the model name
    device='cpu'  # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)

doc = [
    "Artificial intelligence was founded as an academic discipline in 1956."
]
doc_embedding = sentence_transformer_ef.encode_documents(doc)


You can learn more about the different kinds of embedding models supported by pymilvus on this integration page.

In terms of LLMs and orchestration tools, Milvus integrates easily with popular frameworks like vLLM, Ollama, Gemini, LlamaIndex, and LangChain. If you'd like to learn more about integrating Milvus with these tools, we have a collection of tutorials on this page. We also have a simple tutorial on building a multimodal RAG with Milvus on this doc page.

Conclusion

Multimodal RAG represents a significant advancement in utilizing diverse data modalities to improve the accuracy of LLM responses. We have discussed three key patterns to implement multimodal RAG in this article: grounding all modalities into a primary modality, embedding them into a unified vector space, or employing hybrid retrieval with raw data access. The choice of a suitable pattern depends on your AI application's specific needs.

With its advanced indexing methods and easy integration with embedding models, LLMs, and orchestration tools, the Milvus vector database offers a suitable system for implementing multimodal RAG systems. As AI applications expand in scope and complexity, utilizing a scalable vector database system like Milvus becomes increasingly crucial.
