Mikhail Borodin
Running DeepSeek-R1 Locally for RAG

Introduction

For the last two weeks the internet has been abuzz with the emergence of the new DeepSeek generative model. The biggest surprise was that, while matching the quality of ChatGPT's answers, it cost an order of magnitude less to train; this has already affected Nvidia's stock price, which lost about 20% of its value in one go. The model can be used for free either through the web or through a mobile app. But today I would like to highlight another way of using this model: running it locally. Let's look at it through a small project. Imagine we have documentation and we need to search through it or analyze it. For this purpose we can apply RAG.

What is RAG?

Retrieval-Augmented Generation (RAG) is an advanced AI technique designed to improve the accuracy and reliability of language models by integrating a search over external information into the response generation process. Unlike traditional generative models that rely solely on pre-trained knowledge, RAG dynamically retrieves relevant information before generating a response, reducing hallucinations and improving factual accuracy.

RAG's workflow consists of three key steps. First, it retrieves relevant documents or data from a knowledge base, which may include structured databases, vector stores, or even real-time APIs. The retrieved information is then combined with the model's internal knowledge to ensure that the answers are based on relevant and reliable sources. Finally, the model generates a reasoned answer using both its trained language capabilities and the newly acquired data.
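Before moving to the actual implementation, here is a minimal, self-contained sketch of that three-step loop in Python. The retrieve and generate functions are hypothetical placeholders that a real system would replace with a vector store lookup and an LLM call:

from typing import List

def retrieve(query: str, top_k: int = 3) -> List[str]:
    # Placeholder retriever: a real system would query a vector store here
    knowledge_base = [
        "Doc chunk about ACID properties of transactions...",
        "Doc chunk about vector indexes and semantic search...",
    ]
    return knowledge_base[:top_k]

def generate(prompt: str) -> str:
    # Placeholder generator: a real system would call an LLM here
    return f"(model answer based on a prompt of {len(prompt)} characters)"

def rag_answer(query: str) -> str:
    # 1. Retrieval: pull relevant chunks from the knowledge base
    context = "\n\n".join(retrieve(query))
    # 2. Augmentation: combine the retrieved context with the user query
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    # 3. Generation: produce an answer grounded in that context
    return generate(prompt)

print(rag_answer("What is ACID?"))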

This approach offers significant advantages. By basing answers on external sources, RAG minimizes inaccuracies and ensures that information is up-to-date, making it particularly useful in fast-changing fields such as finance, healthcare, and law. It is widely used in chatbots, AI systems with advanced search, enterprise knowledge discovery, and AI-driven research assistants where accuracy and factual validity are critical.

What will the architecture of the solution look like?

A knowledge base is a collection of relevant and up-to-date information that serves as the basis for RAG. In our case, it is a set of documents stored in a directory.


Before you start implementing this architecture, install the following libraries, for example with pip (tested on Python 3.11):

llama-index   
transformers   
torch   
sentence-transformers   
llama-index-llms-ollama 

Here's how you can upload your documents to LlamaIndex as objects:

from llama_index.core import SimpleDirectoryReader

# Directory that contains your documents (adjust to your knowledge base)
input_dir_path = "./docs"

loader = SimpleDirectoryReader(
    input_dir=input_dir_path,
    required_exts=[".pdf"],
    recursive=True,
)
docs = loader.load_data()
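As a quick optional sanity check (continuing from the snippet above), you can inspect how many document chunks were loaded and what they contain:

# Optional sanity check on the loaded documents
print(f"Loaded {len(docs)} document chunks")
print(docs[0].metadata)      # e.g. file name and page label of the first chunk
print(docs[0].text[:200])    # preview of the first 200 characters of text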

The next step is to build a vector store index. Vector store indexes are a key component of retrieval-augmented generation (RAG), so you will use them in almost every application built with LlamaIndex, either directly or indirectly.

Vector stores take a list of Node objects and build an index from them.

VectorStoreIndex in RAG

In Retrieval-Augmented Generation (RAG), a VectorStoreIndex is used to store and retrieve vector embeddings of documents. This allows the system to find relevant information based on semantic similarity rather than exact keyword matches.

The process starts with document embedding. Textual data is converted into vector embeddings using an embedding model. These embeddings represent the meaning of the text in numerical form, making it easier to compare and find similar content. Once created, these embeddings are stored in a vector database such as FAISS, Pinecone, Weaviate, Qdrant or Chroma.

When a user submits a query, it is also converted into an embedding using the same model. The system then searches for similar embeddings by comparing the query with the stored vectors using similarity metrics such as cosine similarity or Euclidean distance. The most relevant documents are retrieved based on their similarity scores.
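To make the similarity step concrete, here is a small standalone sketch using the sentence-transformers library from the install list; the model name all-MiniLM-L6-v2 and the example chunks are assumptions for illustration only:

from sentence_transformers import SentenceTransformer, util

# Embed a few document chunks and a query with the same model
# (model name is only an example)
model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = [
    "ACID stands for Atomicity, Consistency, Isolation, Durability.",
    "A vector index stores embeddings for semantic search.",
]
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)
query_embedding = model.encode("What is ACID?", convert_to_tensor=True)

# Cosine similarity between the query and every chunk
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
best = int(scores.argmax())
print(f"Best match (score {float(scores[best]):.2f}): {chunks[best]}")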

The retrieved documents are passed as context to a large language model (LLM), which uses them to generate a response. This process ensures that the LLM has access to relevant information, improving accuracy and reducing hallucinations.

The use of VectorStoreIndex improves RAG systems by providing semantic search, improving scalability and supporting real-time search. This ensures that answers are contextually accurate and based on the most relevant information available.

Building a vector index is easy:

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(docs)

# Save the index to disk so it can be reloaded later
index.storage_context.persist(persist_dir="./index")
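One note: by default LlamaIndex uses OpenAI embeddings. If you want the whole pipeline to run locally, you can point Settings.embed_model at a local sentence-transformers model before building the index. The sketch below assumes the extra llama-index-embeddings-huggingface package (not in the install list above), and the model name is only an example:

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Use a local sentence-transformers model for embeddings instead of the
# OpenAI default (model name is just an example)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")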

The next step is to download and run the deepseek-r1 model using Ollama. To do this, install Ollama from https://ollama.com/download and pull the deepseek-r1 model from https://ollama.com/library/deepseek-r1 (for example, by running ollama pull deepseek-r1:1.5b).

After that, it is enough to execute the following code for RAG to start working:

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core import Settings, PromptTemplate
from llama_index.llms.ollama import Ollama

# Rebuild the storage context from the persisted index
storage_context = StorageContext.from_defaults(persist_dir="./index")

# Load the index
index = load_index_from_storage(storage_context)

# Create a prompt template
qa_prompt_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information above I want you\n"
    "to think step by step to answer the query in a crisp\n"
    "manner, in case you don't know the answer say\n"
    "'I don't know!'\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_prompt_tmpl = PromptTemplate(qa_prompt_tmpl_str)

# Set up the LLM served locally by Ollama
llm = Ollama(model="deepseek-r1:1.5b", request_timeout=360.0)
Settings.llm = llm  # specify the LLM to be used

# Set up a query engine on the index created previously
query_engine = index.as_query_engine(
    similarity_top_k=10
)
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": qa_prompt_tmpl}
)

response = query_engine.query("What is ACID?")
print(response)
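If you want to see which chunks the answer was grounded in, the response object also exposes the retrieved source nodes; this is an optional check that continues from the code above:

# Show the chunks that were retrieved to ground the answer
for node_with_score in response.source_nodes:
    print("score:", node_with_score.score)
    print("file:", node_with_score.node.metadata.get("file_name"))
    print(node_with_score.node.get_content()[:150])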

Final step. Interface

We can embed this solution into a desktop client by building a user interface with Streamlit, which lets the user interact with our RAG application via chat (a minimal sketch is shown below). Alternatively, you can make a Telegram bot, which makes creating an interface even simpler.
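For illustration, a minimal Streamlit chat front-end could look like the sketch below. Streamlit is not in the dependency list above, so treat this as an assumption; query_engine is the engine built earlier in the article:

import streamlit as st

st.title("Chat with your documents (DeepSeek-R1 + RAG)")

# Keep the chat history across Streamlit reruns
if "history" not in st.session_state:
    st.session_state.history = []

# Replay the conversation so far
for role, text in st.session_state.history:
    with st.chat_message(role):
        st.write(text)

# Read a new question and answer it with the query engine built earlier
if question := st.chat_input("Ask a question about your documents"):
    st.session_state.history.append(("user", question))
    with st.chat_message("user"):
        st.write(question)
    answer = str(query_engine.query(question))
    st.session_state.history.append(("assistant", answer))
    with st.chat_message("assistant"):
        st.write(answer)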
