This is the fourth article of a series in which I am building an open source RAG CLI tool. In the previous article we designed the database schema and added a configuration file.
In this article, we will begin to implement the commands for managing documentations: docs add
, docs update
, and docs remove
. Let's start by organizing our code better.
Creating the commands files
To keep our code organized, let's move each command implementation to its own file.
src/
commands/
docs/
add.ts # Implementation of `rag docs add`
update.ts # Implementation of `rag docs update`
remove.ts # Implementation of `rag docs remove`
index.ts # exports all docs commands
get.ts # Implementation of `rag get`
index.ts # exports all commands
We only have placeholder implementations for now.
src/commands/docs/add.ts
type Options = { subdir: string; branch: string }
export async function add(name: string, repo_url: string, options: Options) {
console.log(`Adding documentation ${name} from ${repo_url} with options ${JSON.stringify(options)}`)
}
similarly for docs/update.ts
, docs/remove.ts
and get.ts
.
src/commands/docs/index.ts
export * from './add'
export * from './update'
export * from './remove'
src/commands/index.ts
export * as docs from './docs'
export * from './get'
With that, we can import the commands in the main file src/main.ts
:
import * as commands from './commands'
// ...
docs
.command('add')
// ...
.action(commands.docs.add)
docs
.command('update')
// ...
.action(commands.docs.update)
docs.command('remove')
// ...
.action(commands.docs.remove)
program
.command('get')
// ...
.action(commands.get)
Implementing the docs commands
Implementing rag docs add
rag docs add <name> <repo_url> [--subdir <subdir>] [--branch <branch>]
We want this command to do the following:
- clone the repository.
- get the markdown files from the specified subdir and branch.
- create the documentation in the database and add the files and chunks to it.
- if the documentation already exists, show an error message.
Here is the implementation:
type Options = {
subdir: string
branch: string
}
export async function add(name: string, repo_url: string, options: Options) {
if (await db.documentations.has(name)) {
throw new Error(`Documentation ${name} already exists. Use 'rag docs update ${name}' to update it.`)
}
const clone_path = await clone_repo(repo_url, options)
const documentation = await db.documentations.create({ name, repo_url, ...options })
const md_files = get_md_files(clone_path)
for await (const filename of md_files) {
await add_file_to_documentation(documentation, filename)
}
}
Notice that I am using some functions that are not implemented yet. The goal is to outline the functions we need and how we are going to use them, then implement them. So far we need:
-
db.documentations.has
: to check if a documentation already exists in the database. -
db.documentations.create
: to create a new documentation in the database. -
clone_repo
: to clone a specific branch and sub-directory of a repository and return its path. -
get_md_files
: to scan a directory recursively for markdown files and return an iterator over their relative paths. -
add_file_to_documentation
: to split a markdown file into chunks and add them to a documentation in the database.
Before implementing these functions, let's write the code for other commands.
This seems like a lazy thing to do, why don't you finish the implementation of the first command before starting the next one?
Writing the outline code of other commands will give me more visibility on the functions I need to implement. I may find out that I can reuse some functions on multiple commands if I modify their arguments slightly. I will be able to implement these functions once, instead of implementing them for the first command then refactoring them for the second one, and so on.
Implementing rag docs update
rag docs update <name> [--repo_url <repo_url>] [--subdir <subdir>] [--branch <branch>]
We want this command to do the following:
- get the documentation by name from the database.
- if the documentation does not exist, show an error message.
- update the documentation with the given options.
- clone the repository.
- get the markdown files from the updated subdir and branch.
- update files and chunks in the database:
- if the database file has the same hash as the new file, do nothing.
- if hashes are different, delete the old file and its chunks and insert the new one.
- if a file does not exist on database, add it.
- if a file exists on database but is not found on the repo, delete it.
This a bit complicated then the previous command, so let's write the code step by step.
We start with this empty function:
type Options = {
repo_url: string
subdir: string
branch: string
}
export async function update(name: string, options: Options) {
}
First we fetch the documentation by name from the database. and throw an error if it does not exist.
export async function update(name: string, options: Options) {
const documentation = await db.documentations.get(name)
if (!documentation) {
throw new Error(`Documentation ${name} not found. Use 'rag docs add ${name}' to create it.`)
}
}
Then we update the documentation with the given options and clone the repository.
export async function update(name: string, options: Options) {
// ...
await db.documentations.update(name, options)
const clone_path = await clone_repo(documentation.repo_url, documentation)
}
Then we fetch the existing files on the database and define a set to keep track of the seen filenames. This will be useful to detect files that are not found on the repo later.
export async function update(name: string, options: Options) {
// ...
const files = await db.files.find({ documentation_id: documentation.id })
const seen_filenames = new Set<string>()
}
Now we loop over the markdown files on the cloned repo:
export async function update(name: string, options: Options) {
// ...
for await (const filename of get_md_files(clone_path)) {
}
}
We track the filename as seen and find the corresponding file in the database.
export async function update(name: string, options: Options) {
// ...
for await (const filename of get_md_files(clone_path)) {
seen_filenames.add(filename)
const file = files.find((f) => f.filename === filename)
}
}
if the file exists on the database, we compare its hash with the new file and skip it if they match. Otherwise, we delete the file and its chunks from the database.
export async function update(name: string, options: Options) {
// ...
for await (const filename of get_md_files(clone_path)) {
// ...
if (file) {
const hash = await hash_file(path.join(clone_path, filename))
if (file.hash === hash) continue
await db.chunks.remove({ file_id: file.id })
await db.files.remove({ id: file.id })
}
}
}
Finally, if we didn't skip the file, we add it to the database, using the same function we used in the add
command.
export async function update(name: string, options: Options) {
// ...
for await (const filename of get_md_files(clone_path)) {
// ...
if (file) {
// ...
}
await add_file_to_documentation(documentation, filename)
}
}
Then at the end, we delete the files we didn't see on the repo.
export async function update(name: string, options: Options) {
// ...
for await (const filename of get_md_files(clone_path)) {
// ...
}
const unseen_files = files.filter((f) => !seen_filenames.has(f.filename))
await delete_files_and_chunks(unseen_files)
}
I see some performance optimizations we can do here to make this command faster ...
I know that this implementation is slow. My plan is to have a first working version, do some benchmarking, then demonstrate how we can optimize it and compare the performance before/after.
So in this second command, we reused the following functions from the add
command: clone_repo
, get_md_files
, add_file_to_documentation
. And added the following functions:
-
db.documentations.get
: to get a documentation by name from the database. -
db.documentations.update
: to update a documentation in the database. -
db.files.find
: to get a list of files from the database. -
hash_file
: to read a file and compute its hash. -
db.chunks.remove
: to delete chunks from the database. -
db.files.remove
: to delete files from the database. -
delete_files_and_chunks
: to delete files and their chunks from the database.
Let's move on to the next command.
Implementing rag docs remove
rag docs remove <name>
This function will simply remove a documentation from the database, with all associated files and chunks.
export async function remove(name: string) {
const documentation = await db.documentations.get(name)
if (!documentation) return
const files = await db.files.find({ documentation_id: documentation.id })
await delete_files_and_chunks(files)
await db.documentations.remove(name)
}
Here we reused some functions from the update
command and added a new one:
-
db.documentations.remove
: to remove a documentation from the database.
Organizing functions
Here is the list of functions we need to implement so far:
database functions
db.documentations.has
db.documentations.get
db.documentations.create
db.documentations.update
db.documentations.remove
db.files.find
db.chunks.remove
db.files.remove
files functions
hash_file
get_md_files
other functions
clone_repo
add_file_to_documentation
delete_files_and_chunks
I want to start with the database functions but these two functions bother me: add_file_to_documentation
and delete_files_and_chunks
. Why, because they hide other database functions inside. So let's start by them:
Implementing add_file_to_documentation
async function add_file_to_documentation(doc: db.Documentation, filename: string) {
const chunks = await get_file_chunks(filename) // Oops, we need to pass the absolute path here
await db.files.create({
documentation_id: doc.id,
path: filename,
hash: await hash_file(filename) // Oops, we need to pass the absolute path here
})
// ...
}
Here is a problem, we need to pass the absolute path to the get_file_chunks
and hash_file
functions. But the filename
is a relative path.
Why don't you just pass the absolute path directly?
Because we need to store the relative path in the files table on the database.
Hmm, so we need to pass the
clone_path
and relative path to the function, that way we can compute the absolute path?
Yes, that would be a valid solution, let's do it.
First we modify the code of add
and update
commands to pass the additional argument:
- await add_file_to_documentation(documentation, filename)
+ await add_file_to_documentation(documentation, clone_path, filename)
Ok, now let's continue with the add_file_to_documentation
function:
export async function add_file_to_documentation(doc: db.Documentation, clone_path: string, filename: string) {
const absolute_path = path.join(clone_path, filename)
const chunks = await get_file_chunks(absolute_path)
const file = await db.files.create({
documentation_id: doc.id,
path: filename,
hash: await hash_file(absolute_path),
})
for (let index = 0; index < chunks.length; index++) {
const { metadata, content } = chunks[index]
const text = `${yaml.stringify(metadata)}\n---\n${content}`
const embedding = await embed(text)
await db.chunks.create({ file_id: file.id, index, metadata, content, embedding })
}
}
This give us the following additional functions to implement:
-
get_file_chunks
: to read a file and split it into chunks. it returns the content and metadata of each chunk. -
db.files.create
anddb.chunks.create
: to create a new file and its chunks in the database. -
embed
: to create the embedding vector of a string.
What is the
db.Documentation
type?
It's the type that represents a documentation in the database. We didn't define it yet, we will do with the database functions.
What is the purpose of embedding
${yaml.stringify(metadata)}\n---\n${content}
instead of just embedding the content directly?
I am adding the metadata (that will contain things like the filename, the current section in the file, ...) to the embedded text to give more context in hope of improving the search results.
Again I am aware that this will be slow, I should probably use streams to do some work while reading files, I should also compute embeddings in batches, ... I will to those optimizations in the future.
Implementing delete_files_and_chunks
Here is the most basic implementation
async function delete_files_and_chunks(files: db.File[]) {
for (const file of files) {
await db.chunks.remove({ file_id: file.id })
await db.files.remove({ id: file.id })
}
}
With that we have the following remaining functions to implement:
database functions
db.documentations.has
db.documentations.get
db.documentations.create
db.documentations.update
db.documentations.remove
db.files.find
db.files.remove
db.files.create
db.chunks.create
db.chunks.remove
files functions
hash_file
get_md_files
get_file_chunks
other functions
clone_repo
embed
I will implement the database functions in the next article.
Summary
In this article, we wrote the outline code for the commands:
rag docs add
rag docs update
rag docs remove
And listed the internal functions we need to implement for those commands to work.
You can check the current code in Github: https://github.com/webNeat/rag
What's next
The next steps are:
- Implement the missing functions
- Implement the
rag get
command - Add tests
- Measure and improve performance
- Create CI/CD pipeline
- Release the first version!
Feel free to comment if you have any suggestions or feedback. See you in the next article!
Top comments (0)