I am trying to use the new AI tools like Github Copilot and Aider to develop my applications faster. But I have one problem with them: they don't know how to use the recent versions of the frameworks/libraries I am using. This is normal, because the new documentation was not part of their training data.
A trivial solution to this problem would be to just include the new documentation in the context. But this would be costly and make the responses slow (and sometimes inaccurate). It's also not possible when I am solving a problem that involves multiple tools (their docs would exceed the context size of the model).
I need a way to extract only the relevant parts of the documentation and add them to the prompt; I believe this is the idea behind RAG. So in this series of articles, I will share my journey of building a simple CLI tool that helps me store documentation and retrieve the relevant parts from it. I can then integrate this tool with AI assistants like Copilot and Aider, or just use it as an offline semantic search engine over the documentation of the tools I am using.
About this series
Why am I writing these articles?
I will publish the resulting tool as an open source project, and these articles will serve as a sort of documentation for that project. I believe this type of documentation is very interesting but rare. For example, I wish I could read the story of Linus building the first version of Git: what was his thinking process? What problems did he face? What designs and solutions did he try? I think I would learn a lot from such a story. So my goal here is to share my thinking process while building this tool, the different problems I face, and how I solve them. I hope you learn something from it; otherwise, feel free to point out the things I am doing wrong, I also want to learn.
The writing style
One easy way for me to explain my thinking process is to imagine that you are working with me on this project, and asking me questions like "why did you do this?", "did you consider this alternative?", etc. So whenever I think you may have a question, I will write it down as a quote and answer it, like the following:
Don't you think this will be distracting for the reader? I think it would be better to write in a normal way.
I know it might be distracting for some people, but others may like it, and I like it. So I will try it and see what people think in the comments.
Analyzing the problem
Let's start by defining what we are going to build and what tools we will be using.
Goal: Given a prompt, retrieve the relevant parts of the up-to-date documentation of multiple frameworks/libraries and add them to the context of the AI tool (e.g., Github Copilot).
To achieve this goal, I need to:
- Download the up-to-date documentation
- Do semantic search and extract the relevant parts
- Add the relevant parts to the context of the AI tool
Let's analyze each of these steps in more detail.
1. Download the up-to-date documentation
For this step, I will assume that the documentation is available as markdown files in a Git repository. This assumption will make it easy to create the first version of the tool; we can add support for other types of docs and sources in the future. With this assumption, downloading the documentation comes down to cloning the Git repository and extracting all the markdown files from its docs directory.
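For example, here is a rough sketch of what this step could look like in Python. The repository URL, the target directory, and the docs/ location are placeholder assumptions, not a final design:

import subprocess
from pathlib import Path

def download_docs(repo_url: str, target_dir: str) -> list[Path]:
    # A shallow clone is enough since we only need the latest version of the docs.
    subprocess.run(["git", "clone", "--depth", "1", repo_url, target_dir], check=True)
    # Collect every markdown file under the docs/ directory.
    return sorted(Path(target_dir, "docs").rglob("*.md"))

files = download_docs("https://github.com/some-framework/docs-repo.git", "/tmp/docs-repo")
print(f"Found {len(files)} markdown files")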
2. Do semantic search and extract the relevant parts
What do you mean by semantic search?
From Wikipedia: Semantic search denotes search with meaning, as distinguished from lexical search where the search engine looks for literal matches of the query words or variants of them, without understanding the overall meaning of the query.
In short:
- Lexical search: finding literal matches of the query words or variants of them.
- Semantic search: understanding the meaning of the query and finding the relevant parts of the documentation.
I get lexical search, it's just looking for the parts that include some specific words. But I still don't get semantic search, what do you mean by "understanding the meaning of the query"?
Here is a simplified explanation of how semantic search works:
Imagine that we can associate with each word a number that represents its meaning, so that words with similar meanings have close numbers. Then, to test whether two words have similar meanings, we can simply compare their numbers. We can also represent the meaning of a sentence (from the query or the documentation) as the list of numbers of its words. To test whether two sentences have similar meanings, we compute the distance between their lists of numbers: the smaller the distance, the more similar the meanings.
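To make this concrete, here is a toy illustration in Python. The "meaning numbers" below are completely made up, and real embedding models produce vectors of hundreds of dimensions per chunk rather than one number per word:

# Made-up numbers: similar meanings get close numbers.
meaning = {"cat": 1.0, "kitten": 1.2, "dog": 2.0, "car": 9.0, "truck": 9.3}

def sentence_meaning(sentence: str) -> list[float]:
    # The "meaning" of a sentence is the list of the numbers of its words.
    return [meaning[word] for word in sentence.split()]

def distance(a: list[float], b: list[float]) -> float:
    # Naive distance: average absolute difference, word by word.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

print(distance(sentence_meaning("cat kitten"), sentence_meaning("dog cat")))    # small distance: similar
print(distance(sentence_meaning("cat kitten"), sentence_meaning("car truck")))  # large distance: different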
Ok, but how do you come up with the number that represents the meaning of a word? Wouldn't it change depending on the context?
Yes, it would; the explanation above is just to give you the idea. In practice, we use a library or an LLM to generate the list of numbers for a chunk of text, called an embedding. The embedding captures the overall meaning of the chunk, taking into account the contextual meaning of the words and the relationships between them. The AI models specialized in this are called embedding models. To generate embeddings, we can use an API like OpenAI's embedding models, or run an open source embedding model locally using a tool like Ollama.
I will go with the offline approach for this tool, because I am already familiar with Ollama and don't want the tool to require an API key from OpenAI or another service to work. I understand that this adds a new requirement to run the tool, but installing and running Ollama is easy, and we can automate it if needed (I am thinking of a setup command that installs all the requirements of the tool: Ollama, Git, etc.).
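As an example, here is a minimal sketch of generating an embedding through the REST API of a locally running Ollama server. The model name nomic-embed-text is just an assumption; any embedding model pulled with Ollama should work:

import requests

def embed(text: str) -> list[float]:
    # Ask the local Ollama server to embed the given text.
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
    )
    response.raise_for_status()
    return response.json()["embedding"]

vector = embed("How do I define a route in the latest version of the framework?")
print(len(vector))  # The dimension depends on the chosen embedding model.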
Now that we have an idea of how to generate embeddings, here is how we can do semantic search:
Initialization:
- Split the documentation markdown files into chunks.
- Generate embeddings for all chunks.
- Store each chunk and its embedding into a database.
Semantic search:
- Generate an embedding for the prompt.
- Query the database for chunks with similar embeddings.
So the next questions we should think about are:
- How do we split the documentation into chunks?
- Which database should we use to store the embeddings and query them?
Splitting the documentation into chunks
Input: The content of a markdown file
Output: A list of chunks
Constraints:
- The number of tokens in a chunk should not exceed the limit of the embedding model.
- We should not cut a heading or a sentence in the middle.
- We should avoid cutting a paragraph, a code block, a table or a list in the middle as much as possible.
- We should add metadata to each chunk to give context like the name of the framework/library, the version, the headings/titles that lead to the chunk, etc.
"The number of tokens in a chunk", what is a token?
A token is the unit of text used by LLMs, typically representing a word, part of a word, or character.
So we will need a way to count the number of tokens in a chunk to make sure it does not exceed the limit, right? How?
Yes, we will need to count the number of tokens in each chunk. The counting method depends on the embedding model we choose; we will explore how to use the Ollama API to count tokens during the implementation step.
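In the meantime, here is a first naive sketch of the splitting step. It only cuts at headings and at a rough size budget, uses a word count as a stand-in for real token counting, and does not yet handle code blocks, tables, or the metadata constraints listed above:

def split_markdown(text: str, max_words: int = 200) -> list[dict]:
    chunks, current, heading = [], [], ""

    def flush():
        # Close the current chunk and remember the heading it belongs to.
        if current:
            chunks.append({"heading": heading, "text": "\n".join(current)})
            current.clear()

    for line in text.splitlines():
        if line.startswith("#"):
            # A new section starts: close the previous chunk first.
            flush()
            heading = line.lstrip("#").strip()
        current.append(line)
        if sum(len(l.split()) for l in current) > max_words:
            # Rough size budget reached (word count as a proxy for tokens).
            flush()
    flush()
    return chunks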
Choosing a database
I would like the tool to be usable offline and easy to install, so the first database that comes to mind is SQLite. I can use an extension like sqlite-vec to enable vector search. Using SQLite makes it possible for users to back up their data or move it to another device by simply copying the database file. They can also put the database file on a server and share it between multiple devices/users.
What about performance? And the accuracy of the search?
I am not sure, to be honest; I will have to run some benchmarks comparing SQLite against other databases to get a better idea. My thinking is to keep things simple and go with SQLite for the first version of the tool. We can always add support for other databases in the future.
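To give an idea of what this could look like, here is a rough sketch of storing and querying embeddings with SQLite and the sqlite-vec extension. The table names and the 768-dimensional embeddings are assumptions (768 matches nomic-embed-text), embed is the helper from the Ollama sketch above, and the actual schema is still to be designed:

import sqlite3
import sqlite_vec
from sqlite_vec import serialize_float32

db = sqlite3.connect("docs.db")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

# One table for the chunk contents, one vector table for their embeddings.
db.execute("CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, content TEXT)")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunk_embeddings USING vec0(embedding float[768])")

def store(chunk_id: int, content: str) -> None:
    db.execute("INSERT INTO chunks (id, content) VALUES (?, ?)", (chunk_id, content))
    db.execute(
        "INSERT INTO chunk_embeddings (rowid, embedding) VALUES (?, ?)",
        (chunk_id, serialize_float32(embed(content))),
    )
    db.commit()

def search(query: str, k: int = 5) -> list[str]:
    # K-nearest-neighbors search on the embeddings, then join back to the contents.
    rows = db.execute(
        """SELECT chunks.content FROM chunk_embeddings
           JOIN chunks ON chunks.id = chunk_embeddings.rowid
           WHERE embedding MATCH ? AND k = ?
           ORDER BY distance""",
        (serialize_float32(embed(query)), k),
    ).fetchall()
    return [row[0] for row in rows]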
3. Add the relevant parts to the context of the AI tool
Once we retrieve the relevant parts of the documentation, we need a way to add them to the context of the AI tool. I will focus on two tools for now: Github Copilot and Aider. The same idea works for both of them: write the chunks to a file and add that file to the context.
Github Copilot
We can configure Github Copilot to automatically include specific files in the chat context by adding the following to VSCode user settings:
{
  // ...
  "github.copilot.chat.codeGeneration.instructions": [
    { "file": ".ai/docs.md" }
  ]
}
Then we can run our RAG tool and redirect the chunks to that file, then ask questions to Github Copilot.
Is there a way to let Github Copilot run our RAG tool automatically on every prompt, instead of having to run it manually?
This seems to be possible by building a Github Copilot extension; we can look into that in detail once we finish developing the tool.
Aider
We can configure Aider to automatically include specific files in the context using the read property of its config file:
read:
- .ai/docs.md
And there is a feature request in the Aider repository to enable integrating Aider with external tools, so we could make it call our tool automatically to add the chunks to the context.
Summary
We are building a CLI tool that stores the documentation of different frameworks/libraries and lets us do semantic search to extract the relevant parts from it.
Requirements:
- Git: will be used to clone the documentation repositories
- Ollama: will be used to generate embeddings
Assumptions / Limitations:
- We are assuming that the documentation is available as markdown files in a Git repository
Database:
We chose to go with SQLite for now; support for other databases can be added in the future.
Next steps
In the next article, we will design the CLI interface and start the implementation of the tool.