Introduction
For the last two weeks the internet has been abuzz with the release of the new DeepSeek generative model. The biggest surprise is that, while its answers are comparable in quality to ChatGPT's, it reportedly cost an order of magnitude less to train; the news even hit Nvidia's stock price, which lost about 20% of its value in a single day. The model can be used for free through the web or a mobile app. Today, however, I would like to highlight another way of using it: running it locally. Let's explore this with a small project. Imagine we have a set of documentation and need to search it or analyze its contents. For this we can apply RAG.
What is RAG?
Retrieval-Augmented Generation (RAG) is an advanced AI technique designed to improve the accuracy and reliability of language models by integrating the search for external information into the response generation process. Unlike traditional generative models that rely solely on pre-trained knowledge, RAG dynamically searches for relevant information before generating a response, reducing hallucinations and improving fact accuracy.
RAG's workflow consists of three key steps. First, it retrieves relevant documents or data from a knowledge base, which may include structured databases, vector stores, or even real-time APIs. The retrieved information is then combined with the model's internal knowledge to ensure that the answers are based on relevant and reliable sources. Finally, the model generates a reasoned answer using both its trained language capabilities and the newly acquired data.
This approach offers significant advantages. By basing answers on external sources, RAG minimizes inaccuracies and ensures that information is up-to-date, making it particularly useful in fast-changing fields such as finance, healthcare, and law. It is widely used in chatbots, AI systems with advanced search, enterprise knowledge discovery, and AI-driven research assistants where accuracy and factual validity are critical.
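The same three steps can be sketched in a few lines of Python. This is purely a conceptual illustration: the function names (knowledge_base.search, llm.generate) are placeholders, not a real library API.

def answer_with_rag(query, knowledge_base, llm):
    # 1. Retrieve: find the documents most relevant to the query
    relevant_docs = knowledge_base.search(query, top_k=5)
    # 2. Augment: combine the retrieved text with the user's question
    prompt = "Context:\n" + "\n".join(relevant_docs) + "\nQuestion: " + query
    # 3. Generate: let the language model answer using that context
    return llm.generate(prompt)

The rest of this article implements exactly this pattern with LlamaIndex and a locally running DeepSeek model.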
What will the architecture of the solution look like?
A knowledge base is a collection of relevant and up-to-date information that serves as the basis for RAG. In our case, it is a set of documents stored in a local directory.
Before implementing this architecture, install the following libraries with pip (tested on Python 3.11):
llama-index
transformers
torch
sentence-transformers
llama-index-llms-ollama
Here's how you can upload your documents to LlamaIndex as objects:
from llama_index.core import SimpleDirectoryReader

# input_dir_path is the path to the directory containing your PDF documents
loader = SimpleDirectoryReader(
    input_dir=input_dir_path,
    required_exts=[".pdf"],
    recursive=True
)
docs = loader.load_data()
The next step is to build a vector store index. Vector store indexes are a key component of retrieval-augmented generation (RAG), so you will use them in almost every LlamaIndex application, either directly or indirectly. A vector store index takes a list of Node objects and builds an index from them.
VectorStoreIndex in RAG
In Retrieval-Augmented Generation (RAG), a VectorStoreIndex is used to store and retrieve vector embeddings of documents. This allows the system to find relevant information based on semantic similarity rather than exact keyword matches.
The process starts with document embedding. Textual data is converted into vector embeddings using an embedding model. These embeddings represent the meaning of the text in numerical form, making it easier to compare and find similar content. Once created, these embeddings are stored in a vector database such as FAISS, Pinecone, Weaviate, Qdrant or Chroma.
When a user submits a query, it is also converted into an embedding using the same model. The system then searches for similar embeddings by comparing the query with the stored vectors using similarity metrics such as cosine similarity or Euclidean distance. The most relevant documents are retrieved based on their similarity scores.
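To make the similarity step concrete, here is a minimal sketch of comparing a query embedding against document embeddings with sentence-transformers (one of the libraries installed above). The model name and example texts are only illustrative; LlamaIndex performs an equivalent comparison internally.

from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model; any sentence-transformers model works similarly
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "ACID stands for Atomicity, Consistency, Isolation, Durability.",
    "Streamlit lets you build data apps in pure Python.",
]
query = "What does ACID mean in databases?"

doc_embeddings = model.encode(documents)   # one vector per document
query_embedding = model.encode(query)      # vector for the query

# Cosine similarity between the query and each document
scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)  # the first document should score higher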
The retrieved documents are passed as context to a large language model (LLM), which uses them to generate a response. This process ensures that the LLM has access to relevant information, improving accuracy and reducing hallucinations.
The use of VectorStoreIndex improves RAG systems by providing semantic search, improving scalability and supporting real-time search. This ensures that answers are contextually accurate and based on the most relevant information available.
Building a vector index is straightforward:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(docs)

# Save the index to a file
index.storage_context.persist(persist_dir="./index")
The next step is to download and run the DeepSeek-R1 model using Ollama. To do this, install Ollama (https://ollama.com/download) and pull the deepseek-r1 model, for example with ollama pull deepseek-r1:1.5b (https://ollama.com/library/deepseek-r1).
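Before wiring the model into the index, you can do a quick sanity check from Python. This is a minimal sketch that assumes the Ollama server is already running locally and the deepseek-r1:1.5b model has been pulled.

from llama_index.llms.ollama import Ollama

# Assumes Ollama is running locally and deepseek-r1:1.5b has been pulled
llm = Ollama(model="deepseek-r1:1.5b", request_timeout=360.0)
print(llm.complete("Say hello in one sentence."))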
After that, it is enough to execute the following code for the RAG pipeline to start working:
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core import Settings, PromptTemplate
from llama_index.llms.ollama import Ollama
# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="./index")
# load index
index = load_index_from_storage(storage_context)
# Creating a prompt template
qa_prompt_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information above, I want you\n"
    "to think step by step to answer the query in a crisp\n"
    "manner. In case you don't know the answer, say\n"
    "'I don't know!'.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_prompt_tmpl = PromptTemplate(qa_prompt_tmpl_str)
# Setting up a query engine
llm = Ollama(model="deepseek-r1:1.5b", request_timeout=360.0)
# Setup a query engine on the index previously created
Settings.llm = llm # specifying the llm to be used
query_engine = index.as_query_engine(
similarity_top_k=10
)
query_engine.update_prompts({"response_synthesizer:text_qa_template": qa_prompt_tmpl})
response = query_engine.query('What is the ACID?')
print(response)
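As an optional follow-up, you can inspect which document chunks the retriever actually used. The source_nodes attribute is part of the standard LlamaIndex response object; the slicing below is just to keep the output short.

# Show the retrieved chunks and their similarity scores
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.get_content()[:200])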
Final step. Interface
We can embed this solution into a desktop client by building a user interface with Streamlit, letting the user interact with our RAG application via chat. Alternatively, you can wrap it in a Telegram bot, which makes building an interface even simpler.
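Here is a minimal Streamlit sketch of such a chat interface. It assumes the index has already been persisted to ./index as shown above; the file name app.py and the overall layout are only an illustration, not the only way to structure it.

# app.py - run with: streamlit run app.py
import streamlit as st
from llama_index.core import StorageContext, load_index_from_storage, Settings
from llama_index.llms.ollama import Ollama

@st.cache_resource
def get_query_engine():
    # Load the persisted index and attach the local DeepSeek model
    Settings.llm = Ollama(model="deepseek-r1:1.5b", request_timeout=360.0)
    storage_context = StorageContext.from_defaults(persist_dir="./index")
    index = load_index_from_storage(storage_context)
    return index.as_query_engine(similarity_top_k=10)

st.title("Chat with your documents")
question = st.chat_input("Ask a question about the documents")

if question:
    with st.chat_message("user"):
        st.write(question)
    with st.chat_message("assistant"):
        answer = get_query_engine().query(question)
        st.write(str(answer))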