DEV Community

Cover image for RAG - Implementing the docs commands
Amine Ben hammou for This is Learning

Posted on

RAG - Implementing the docs commands

This is the fourth article of a series in which I am building an open source RAG CLI tool. In the previous article we designed the database schema and added a configuration file.

In this article, we will begin to implement the commands for managing documentations: docs add, docs update, and docs remove. Let's start by organizing our code better.

Creating the commands files

To keep our code organized, let's move each command implementation to its own file.

src/
  commands/
    docs/
      add.ts     # Implementation of `rag docs add`
      update.ts  # Implementation of `rag docs update`
      remove.ts  # Implementation of `rag docs remove`
      index.ts   # exports all docs commands
    get.ts       # Implementation of `rag get`
    index.ts     # exports all commands
Enter fullscreen mode Exit fullscreen mode

We only have placeholder implementations for now.

src/commands/docs/add.ts

type Options = { subdir: string; branch: string }
export async function add(name: string, repo_url: string, options: Options) {
  console.log(`Adding documentation ${name} from ${repo_url} with options ${JSON.stringify(options)}`)
}
Enter fullscreen mode Exit fullscreen mode

similarly for docs/update.ts, docs/remove.ts and get.ts.

src/commands/docs/index.ts

export * from './add'
export * from './update'
export * from './remove'
Enter fullscreen mode Exit fullscreen mode

src/commands/index.ts

export * as docs from './docs'
export * from './get'
Enter fullscreen mode Exit fullscreen mode

With that, we can import the commands in the main file src/main.ts:

import * as commands from './commands'

// ...
docs
  .command('add')
  // ...
  .action(commands.docs.add)

docs
  .command('update')
  // ...
  .action(commands.docs.update)

docs.command('remove')
  // ...
  .action(commands.docs.remove)

program
  .command('get')
  // ...
  .action(commands.get)
Enter fullscreen mode Exit fullscreen mode

Implementing the docs commands

Implementing rag docs add

rag docs add <name> <repo_url> [--subdir <subdir>] [--branch <branch>]
Enter fullscreen mode Exit fullscreen mode

We want this command to do the following:

  • clone the repository.
  • get the markdown files from the specified subdir and branch.
  • create the documentation in the database and add the files and chunks to it.
  • if the documentation already exists, show an error message.

Here is the implementation:

type Options = {
  subdir: string
  branch: string
}
export async function add(name: string, repo_url: string, options: Options) {
  if (await db.documentations.has(name)) {
    throw new Error(`Documentation ${name} already exists. Use 'rag docs update ${name}' to update it.`)
  }

  const clone_path = await clone_repo(repo_url, options)
  const documentation = await db.documentations.create({ name, repo_url, ...options })
  const md_files = get_md_files(clone_path)
  for await (const filename of md_files) {
    await add_file_to_documentation(documentation, filename)
  }
}
Enter fullscreen mode Exit fullscreen mode

Notice that I am using some functions that are not implemented yet. The goal is to outline the functions we need and how we are going to use them, then implement them. So far we need:

  • db.documentations.has: to check if a documentation already exists in the database.
  • db.documentations.create: to create a new documentation in the database.
  • clone_repo: to clone a specific branch and sub-directory of a repository and return its path.
  • get_md_files: to scan a directory recursively for markdown files and return an iterator over their relative paths.
  • add_file_to_documentation: to split a markdown file into chunks and add them to a documentation in the database.

Before implementing these functions, let's write the code for other commands.

This seems like a lazy thing to do, why don't you finish the implementation of the first command before starting the next one?

Writing the outline code of other commands will give me more visibility on the functions I need to implement. I may find out that I can reuse some functions on multiple commands if I modify their arguments slightly. I will be able to implement these functions once, instead of implementing them for the first command then refactoring them for the second one, and so on.

Implementing rag docs update

rag docs update <name> [--repo_url <repo_url>] [--subdir <subdir>] [--branch <branch>]
Enter fullscreen mode Exit fullscreen mode

We want this command to do the following:

  • get the documentation by name from the database.
  • if the documentation does not exist, show an error message.
  • update the documentation with the given options.
  • clone the repository.
  • get the markdown files from the updated subdir and branch.
  • update files and chunks in the database:
    • if the database file has the same hash as the new file, do nothing.
    • if hashes are different, delete the old file and its chunks and insert the new one.
    • if a file does not exist on database, add it.
    • if a file exists on database but is not found on the repo, delete it.

This a bit complicated then the previous command, so let's write the code step by step.

We start with this empty function:

type Options = {
  repo_url: string
  subdir: string
  branch: string
}
export async function update(name: string, options: Options) {

}
Enter fullscreen mode Exit fullscreen mode

First we fetch the documentation by name from the database. and throw an error if it does not exist.

export async function update(name: string, options: Options) {
  const documentation = await db.documentations.get(name)
  if (!documentation) {
    throw new Error(`Documentation ${name} not found. Use 'rag docs add ${name}' to create it.`)
  }
}
Enter fullscreen mode Exit fullscreen mode

Then we update the documentation with the given options and clone the repository.

export async function update(name: string, options: Options) {
  // ...
  await db.documentations.update(name, options)
  const clone_path = await clone_repo(documentation.repo_url, documentation)
}
Enter fullscreen mode Exit fullscreen mode

Then we fetch the existing files on the database and define a set to keep track of the seen filenames. This will be useful to detect files that are not found on the repo later.

export async function update(name: string, options: Options) {
  // ...
  const files = await db.files.find({ documentation_id: documentation.id })
  const seen_filenames = new Set<string>()
}
Enter fullscreen mode Exit fullscreen mode

Now we loop over the markdown files on the cloned repo:

export async function update(name: string, options: Options) {
  // ...
  for await (const filename of get_md_files(clone_path)) {

  }
}
Enter fullscreen mode Exit fullscreen mode

We track the filename as seen and find the corresponding file in the database.

export async function update(name: string, options: Options) {
  // ...
  for await (const filename of get_md_files(clone_path)) {
    seen_filenames.add(filename)
    const file = files.find((f) => f.filename === filename)
  }
}
Enter fullscreen mode Exit fullscreen mode

if the file exists on the database, we compare its hash with the new file and skip it if they match. Otherwise, we delete the file and its chunks from the database.

export async function update(name: string, options: Options) {
  // ...
  for await (const filename of get_md_files(clone_path)) {
    // ...
    if (file) {
      const hash = await hash_file(path.join(clone_path, filename))
      if (file.hash === hash) continue
      await db.chunks.remove({ file_id: file.id })
      await db.files.remove({ id: file.id })
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

Finally, if we didn't skip the file, we add it to the database, using the same function we used in the add command.

export async function update(name: string, options: Options) {
  // ...
  for await (const filename of get_md_files(clone_path)) {
    // ...
    if (file) {
      // ...
    }
    await add_file_to_documentation(documentation, filename)
  }
}
Enter fullscreen mode Exit fullscreen mode

Then at the end, we delete the files we didn't see on the repo.

export async function update(name: string, options: Options) {
  // ...
  for await (const filename of get_md_files(clone_path)) {
    // ...
  }
  const unseen_files = files.filter((f) => !seen_filenames.has(f.filename))
  await delete_files_and_chunks(unseen_files)
}
Enter fullscreen mode Exit fullscreen mode

I see some performance optimizations we can do here to make this command faster ...

I know that this implementation is slow. My plan is to have a first working version, do some benchmarking, then demonstrate how we can optimize it and compare the performance before/after.

So in this second command, we reused the following functions from the add command: clone_repo, get_md_files, add_file_to_documentation. And added the following functions:

  • db.documentations.get: to get a documentation by name from the database.
  • db.documentations.update: to update a documentation in the database.
  • db.files.find: to get a list of files from the database.
  • hash_file: to read a file and compute its hash.
  • db.chunks.remove: to delete chunks from the database.
  • db.files.remove: to delete files from the database.
  • delete_files_and_chunks: to delete files and their chunks from the database.

Let's move on to the next command.

Implementing rag docs remove

rag docs remove <name>
Enter fullscreen mode Exit fullscreen mode

This function will simply remove a documentation from the database, with all associated files and chunks.

export async function remove(name: string) {
  const documentation = await db.documentations.get(name)
  if (!documentation) return
  const files = await db.files.find({ documentation_id: documentation.id })
  await delete_files_and_chunks(files)
  await db.documentations.remove(name)
}
Enter fullscreen mode Exit fullscreen mode

Here we reused some functions from the update command and added a new one:

  • db.documentations.remove: to remove a documentation from the database.

Organizing functions

Here is the list of functions we need to implement so far:

database functions

  • db.documentations.has
  • db.documentations.get
  • db.documentations.create
  • db.documentations.update
  • db.documentations.remove
  • db.files.find
  • db.chunks.remove
  • db.files.remove

files functions

  • hash_file
  • get_md_files

other functions

  • clone_repo
  • add_file_to_documentation
  • delete_files_and_chunks

I want to start with the database functions but these two functions bother me: add_file_to_documentation and delete_files_and_chunks. Why, because they hide other database functions inside. So let's start by them:

Implementing add_file_to_documentation

async function add_file_to_documentation(doc: db.Documentation, filename: string) {
  const chunks = await get_file_chunks(filename) // Oops, we need to pass the absolute path here
  await db.files.create({
    documentation_id: doc.id,
    path: filename,
    hash: await hash_file(filename) // Oops, we need to pass the absolute path here
  })
  // ...
}
Enter fullscreen mode Exit fullscreen mode

Here is a problem, we need to pass the absolute path to the get_file_chunks and hash_file functions. But the filename is a relative path.

Why don't you just pass the absolute path directly?

Because we need to store the relative path in the files table on the database.

Hmm, so we need to pass the clone_path and relative path to the function, that way we can compute the absolute path?

Yes, that would be a valid solution, let's do it.

First we modify the code of add and update commands to pass the additional argument:

- await add_file_to_documentation(documentation, filename)
+ await add_file_to_documentation(documentation, clone_path, filename)
Enter fullscreen mode Exit fullscreen mode

Ok, now let's continue with the add_file_to_documentation function:

export async function add_file_to_documentation(doc: db.Documentation, clone_path: string, filename: string) {
  const absolute_path = path.join(clone_path, filename)
  const chunks = await get_file_chunks(absolute_path)
  const file = await db.files.create({
    documentation_id: doc.id,
    path: filename,
    hash: await hash_file(absolute_path),
  })
  for (let index = 0; index < chunks.length; index++) {
    const { metadata, content } = chunks[index]
    const text = `${yaml.stringify(metadata)}\n---\n${content}`
    const embedding = await embed(text)
    await db.chunks.create({ file_id: file.id, index, metadata, content, embedding })
  }
}
Enter fullscreen mode Exit fullscreen mode

This give us the following additional functions to implement:

  • get_file_chunks: to read a file and split it into chunks. it returns the content and metadata of each chunk.
  • db.files.create and db.chunks.create: to create a new file and its chunks in the database.
  • embed: to create the embedding vector of a string.

What is the db.Documentation type?

It's the type that represents a documentation in the database. We didn't define it yet, we will do with the database functions.

What is the purpose of embedding ${yaml.stringify(metadata)}\n---\n${content} instead of just embedding the content directly?

I am adding the metadata (that will contain things like the filename, the current section in the file, ...) to the embedded text to give more context in hope of improving the search results.

Again I am aware that this will be slow, I should probably use streams to do some work while reading files, I should also compute embeddings in batches, ... I will to those optimizations in the future.

Implementing delete_files_and_chunks

Here is the most basic implementation

async function delete_files_and_chunks(files: db.File[]) {
  for (const file of files) {
    await db.chunks.remove({ file_id: file.id })
    await db.files.remove({ id: file.id })
  }
}
Enter fullscreen mode Exit fullscreen mode

With that we have the following remaining functions to implement:

database functions

  • db.documentations.has
  • db.documentations.get
  • db.documentations.create
  • db.documentations.update
  • db.documentations.remove
  • db.files.find
  • db.files.remove
  • db.files.create
  • db.chunks.create
  • db.chunks.remove

files functions

  • hash_file
  • get_md_files
  • get_file_chunks

other functions

  • clone_repo
  • embed

I will implement the database functions in the next article.

Summary

In this article, we wrote the outline code for the commands:

  • rag docs add
  • rag docs update
  • rag docs remove

And listed the internal functions we need to implement for those commands to work.

You can check the current code in Github: https://github.com/webNeat/rag

What's next

The next steps are:

  • Implement the missing functions
  • Implement the rag get command
  • Add tests
  • Measure and improve performance
  • Create CI/CD pipeline
  • Release the first version!

Feel free to comment if you have any suggestions or feedback. See you in the next article!

Top comments (0)