Amine Ben hammou for This is Learning

Posted on Jan 28

RAG - Implementing the docs commands

#ai #rag #cli #typescript

This is the fourth article of a series in which I am building an open source RAG CLI tool. In the previous article we designed the database schema and added a configuration file.

In this article, we will begin to implement the commands for managing documentations: docs add, docs update, and docs remove. Let's start by organizing our code better.

Creating the commands files

To keep our code organized, let's move each command implementation to its own file.

src/
  commands/
    docs/
      add.ts     # Implementation of `rag docs add`
      update.ts  # Implementation of `rag docs update`
      remove.ts  # Implementation of `rag docs remove`
      index.ts   # exports all docs commands
    get.ts       # Implementation of `rag get`
    index.ts     # exports all commands

We only have placeholder implementations for now.

src/commands/docs/add.ts

type Options = { subdir: string; branch: string }
export async function add(name: string, repo_url: string, options: Options) {
  console.log(`Adding documentation ${name} from ${repo_url} with options ${JSON.stringify(options)}`)
}

similarly for docs/update.ts, docs/remove.ts and get.ts.

src/commands/docs/index.ts

export * from './add'
export * from './update'
export * from './remove'

src/commands/index.ts

export * as docs from './docs'
export * from './get'

With that, we can import the commands in the main file src/main.ts:

import * as commands from './commands'

// ...
docs
  .command('add')
  // ...
  .action(commands.docs.add)

docs
  .command('update')
  // ...
  .action(commands.docs.update)

docs.command('remove')
  // ...
  .action(commands.docs.remove)

program
  .command('get')
  // ...
  .action(commands.get)

Implementing the docs commands

Implementing `rag docs add`

rag docs add <name> <repo_url> [--subdir <subdir>] [--branch <branch>]

We want this command to do the following:

clone the repository.
get the markdown files from the specified subdir and branch.
create the documentation in the database and add the files and chunks to it.
if the documentation already exists, show an error message.

Here is the implementation:

type Options = {
  subdir: string
  branch: string
}
export async function add(name: string, repo_url: string, options: Options) {
  if (await db.documentations.has(name)) {
    throw new Error(`Documentation ${name} already exists. Use 'rag docs update ${name}' to update it.`)
  }

  const clone_path = await clone_repo(repo_url, options)
  const documentation = await db.documentations.create({ name, repo_url, ...options })
  const md_files = get_md_files(clone_path)
  for await (const filename of md_files) {
    await add_file_to_documentation(documentation, filename)
  }
}

Notice that I am using some functions that are not implemented yet. The goal is to outline the functions we need and how we are going to use them, then implement them. So far we need:

db.documentations.has: to check if a documentation already exists in the database.
db.documentations.create: to create a new documentation in the database.
clone_repo: to clone a specific branch and sub-directory of a repository and return its path.
get_md_files: to scan a directory recursively for markdown files and return an iterator over their relative paths.
add_file_to_documentation: to split a markdown file into chunks and add them to a documentation in the database.

Before implementing these functions, let's write the code for other commands.

This seems like a lazy thing to do, why don't you finish the implementation of the first command before starting the next one?

Writing the outline code of other commands will give me more visibility on the functions I need to implement. I may find out that I can reuse some functions on multiple commands if I modify their arguments slightly. I will be able to implement these functions once, instead of implementing them for the first command then refactoring them for the second one, and so on.

Implementing `rag docs update`

rag docs update <name> [--repo_url <repo_url>] [--subdir <subdir>] [--branch <branch>]

We want this command to do the following:

get the documentation by name from the database.
if the documentation does not exist, show an error message.
update the documentation with the given options.
clone the repository.
get the markdown files from the updated subdir and branch.
update files and chunks in the database:
- if the database file has the same hash as the new file, do nothing.
- if hashes are different, delete the old file and its chunks and insert the new one.
- if a file does not exist on database, add it.
- if a file exists on database but is not found on the repo, delete it.

This a bit complicated then the previous command, so let's write the code step by step.

We start with this empty function:

type Options = {
  repo_url: string
  subdir: string
  branch: string
}
export async function update(name: string, options: Options) {

}

First we fetch the documentation by name from the database. and throw an error if it does not exist.

export async function update(name: string, options: Options) {
  const documentation = await db.documentations.get(name)
  if (!documentation) {
    throw new Error(`Documentation ${name} not found. Use 'rag docs add ${name}' to create it.`)
  }
}

Then we update the documentation with the given options and clone the repository.

export async function update(name: string, options: Options) {
  // ...
  await db.documentations.update(name, options)
  const clone_path = await clone_repo(documentation.repo_url, documentation)
}

Then we fetch the existing files on the database and define a set to keep track of the seen filenames. This will be useful to detect files that are not found on the repo later.

export async function update(name: string, options: Options) {
  // ...
  const files = await db.files.find({ documentation_id: documentation.id })
  const seen_filenames = new Set<string>()
}

Now we loop over the markdown files on the cloned repo:

export async function update(name: string, options: Options) {
  // ...
  for await (const filename of get_md_files(clone_path)) {

  }
}

We track the filename as seen and find the corresponding file in the database.

export async function update(name: string, options: Options) {
  // ...
  for await (const filename of get_md_files(clone_path)) {
    seen_filenames.add(filename)
    const file = files.find((f) => f.filename === filename)
  }
}

if the file exists on the database, we compare its hash with the new file and skip it if they match. Otherwise, we delete the file and its chunks from the database.

export async function update(name: string, options: Options) {
  // ...
  for await (const filename of get_md_files(clone_path)) {
    // ...
    if (file) {
      const hash = await hash_file(path.join(clone_path, filename))
      if (file.hash === hash) continue
      await db.chunks.remove({ file_id: file.id })
      await db.files.remove({ id: file.id })
    }
  }
}

Finally, if we didn't skip the file, we add it to the database, using the same function we used in the add command.

export async function update(name: string, options: Options) {
  // ...
  for await (const filename of get_md_files(clone_path)) {
    // ...
    if (file) {
      // ...
    }
    await add_file_to_documentation(documentation, filename)
  }
}

Then at the end, we delete the files we didn't see on the repo.

export async function update(name: string, options: Options) {
  // ...
  for await (const filename of get_md_files(clone_path)) {
    // ...
  }
  const unseen_files = files.filter((f) => !seen_filenames.has(f.filename))
  await delete_files_and_chunks(unseen_files)
}

I see some performance optimizations we can do here to make this command faster ...

I know that this implementation is slow. My plan is to have a first working version, do some benchmarking, then demonstrate how we can optimize it and compare the performance before/after.

So in this second command, we reused the following functions from the add command: clone_repo, get_md_files, add_file_to_documentation. And added the following functions:

db.documentations.get: to get a documentation by name from the database.
db.documentations.update: to update a documentation in the database.
db.files.find: to get a list of files from the database.
hash_file: to read a file and compute its hash.
db.chunks.remove: to delete chunks from the database.
db.files.remove: to delete files from the database.
delete_files_and_chunks: to delete files and their chunks from the database.

Let's move on to the next command.

Implementing `rag docs remove`

rag docs remove <name>

This function will simply remove a documentation from the database, with all associated files and chunks.

export async function remove(name: string) {
  const documentation = await db.documentations.get(name)
  if (!documentation) return
  const files = await db.files.find({ documentation_id: documentation.id })
  await delete_files_and_chunks(files)
  await db.documentations.remove(name)
}

Here we reused some functions from the update command and added a new one:

db.documentations.remove: to remove a documentation from the database.

Organizing functions

Here is the list of functions we need to implement so far:

database functions

db.documentations.has
db.documentations.get
db.documentations.create
db.documentations.update
db.documentations.remove
db.files.find
db.chunks.remove
db.files.remove

files functions

hash_file
get_md_files

other functions

clone_repo
add_file_to_documentation
delete_files_and_chunks

I want to start with the database functions but these two functions bother me: add_file_to_documentation and delete_files_and_chunks. Why, because they hide other database functions inside. So let's start by them:

Implementing `add_file_to_documentation`

async function add_file_to_documentation(doc: db.Documentation, filename: string) {
  const chunks = await get_file_chunks(filename) // Oops, we need to pass the absolute path here
  await db.files.create({
    documentation_id: doc.id,
    path: filename,
    hash: await hash_file(filename) // Oops, we need to pass the absolute path here
  })
  // ...
}

Here is a problem, we need to pass the absolute path to the get_file_chunks and hash_file functions. But the filename is a relative path.

Why don't you just pass the absolute path directly?

Because we need to store the relative path in the files table on the database.

Hmm, so we need to pass the clone_path and relative path to the function, that way we can compute the absolute path?

Yes, that would be a valid solution, let's do it.

First we modify the code of add and update commands to pass the additional argument:

- await add_file_to_documentation(documentation, filename)
+ await add_file_to_documentation(documentation, clone_path, filename)

Ok, now let's continue with the add_file_to_documentation function:

export async function add_file_to_documentation(doc: db.Documentation, clone_path: string, filename: string) {
  const absolute_path = path.join(clone_path, filename)
  const chunks = await get_file_chunks(absolute_path)
  const file = await db.files.create({
    documentation_id: doc.id,
    path: filename,
    hash: await hash_file(absolute_path),
  })
  for (let index = 0; index < chunks.length; index++) {
    const { metadata, content } = chunks[index]
    const text = `${yaml.stringify(metadata)}\n---\n${content}`
    const embedding = await embed(text)
    await db.chunks.create({ file_id: file.id, index, metadata, content, embedding })
  }
}

This give us the following additional functions to implement:

get_file_chunks: to read a file and split it into chunks. it returns the content and metadata of each chunk.
db.files.create and db.chunks.create: to create a new file and its chunks in the database.
embed: to create the embedding vector of a string.

What is the db.Documentation type?

It's the type that represents a documentation in the database. We didn't define it yet, we will do with the database functions.

What is the purpose of embedding ${yaml.stringify(metadata)}\n---\n${content} instead of just embedding the content directly?

I am adding the metadata (that will contain things like the filename, the current section in the file, ...) to the embedded text to give more context in hope of improving the search results.

Again I am aware that this will be slow, I should probably use streams to do some work while reading files, I should also compute embeddings in batches, ... I will to those optimizations in the future.

Implementing `delete_files_and_chunks`

Here is the most basic implementation

async function delete_files_and_chunks(files: db.File[]) {
  for (const file of files) {
    await db.chunks.remove({ file_id: file.id })
    await db.files.remove({ id: file.id })
  }
}

With that we have the following remaining functions to implement:

database functions

db.documentations.has
db.documentations.get
db.documentations.create
db.documentations.update
db.documentations.remove
db.files.find
db.files.remove
db.files.create
db.chunks.create
db.chunks.remove

files functions

hash_file
get_md_files
get_file_chunks

other functions

clone_repo
embed

I will implement the database functions in the next article.

Summary

In this article, we wrote the outline code for the commands:

rag docs add
rag docs update
rag docs remove

And listed the internal functions we need to implement for those commands to work.

You can check the current code in Github: https://github.com/webNeat/rag

What's next

The next steps are:

Implement the missing functions
Implement the rag get command
Add tests
Measure and improve performance
Create CI/CD pipeline
Release the first version!

Feel free to comment if you have any suggestions or feedback. See you in the next article!

DEV Community

RAG - Implementing the docs commands

Creating the commands files

Implementing the docs commands

Implementing `rag docs add`

Implementing `rag docs update`

Implementing `rag docs remove`

Organizing functions

Implementing `add_file_to_documentation`

Implementing `delete_files_and_chunks`

Summary

What's next

Top comments (0)

Read next

Exploring TypeScript 5.8 Beta: Smarter Returns, Improved Module Support, and More

Supabase: Setting up Auth and Database

A Beginner’s Guide to AIOps: The Future of Automated DevOps

Treating AI Chats Like Git Branches

Creating the commands files

Implementing the docs commands

Implementing rag docs add

Implementing rag docs update

Implementing rag docs remove

Organizing functions

Implementing add_file_to_documentation

Implementing delete_files_and_chunks

Summary

What's next

Read next

Exploring TypeScript 5.8 Beta: Smarter Returns, Improved Module Support, and More

Supabase: Setting up Auth and Database

A Beginner’s Guide to AIOps: The Future of Automated DevOps

Treating AI Chats Like Git Branches

Implementing `rag docs add`

Implementing `rag docs update`

Implementing `rag docs remove`

Implementing `add_file_to_documentation`

Implementing `delete_files_and_chunks`