David Mezzetti for NeuML

Posted on Mar 2, 2022 • Edited on Apr 25, 2024 • Originally published at neuml.hashnode.dev

Anatomy of a txtai index

#ai #llm #rag #vectordatabase

This article inspects the filesystem of a txtai embeddings index and gives an overview of the structure.

Install dependencies

Install txtai and all dependencies.

pip install txtai

Create index

Let's first create an index to inspect. We'll use the classic txtai example.

from txtai.embeddings import Embeddings

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True, "objects": True})

# Create an index for the list of text
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

# Run a search
embeddings.search("feel good story", 1)

[{'id': '4',
  'score': 0.08329004049301147,
  'text': 'Maine man wins $1M from $25 lottery ticket'}]

Print index info

Embeddings indexes have an info method which prints metadata about the index. This can be used to see when the index was build, what settings were used and when it was last updated.

# Print metadata
embeddings.info()

{
  "backend": "faiss",
  "build": {
    "create": "2022-03-02T15:18:41Z",
    "python": "3.7.12",
    "settings": {
      "components": "IDMap,Flat"
    },
    "system": "Linux (x86_64)",
    "txtai": "4.3.0"
  },
  "content": "sqlite",
  "dimensions": 768,
  "objects": true,
  "offset": 6,
  "path": "sentence-transformers/nli-mpnet-base-v2",
  "update": "2022-03-02T15:18:41Z"
}

Save index and review file structure

Next let's save the index and review the file structure. This section prints each file, and runs commands to show

# Save the index
embeddings.save("index")

# Show basic details about index files
for f in ["config", "documents", "embeddings"]:
  !ls -l "index/{f}"
  !xxd "index/{f}" | head -5
  !file "index/{f}"
  !echo

-rw-r--r-- 1 root root 295 Mar  2 15:18 index/config
00000000: 8004 951c 0100 0000 0000 007d 9428 8c04  ...........}.(..
00000010: 7061 7468 948c 2773 656e 7465 6e63 652d  path..'sentence-
00000020: 7472 616e 7366 6f72 6d65 7273 2f6e 6c69  transformers/nli
00000030: 2d6d 706e 6574 2d62 6173 652d 7632 948c  -mpnet-base-v2..
00000040: 0763 6f6e 7465 6e74 948c 0673 716c 6974  .content...sqlit
index/config: data

-rw-r--r-- 1 root root 28672 Mar  2 15:18 index/documents
00000000: 5351 4c69 7465 2066 6f72 6d61 7420 3300  SQLite format 3.
00000010: 1000 0101 0040 2020 0000 0001 0000 0007  .....@  ........
00000020: 0000 0000 0000 0000 0000 0001 0000 0004  ................
00000030: 0000 0000 0000 0000 0000 0001 0000 0000  ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
index/documents: SQLite 3.x database, last written using SQLite version 3022000

-rw-r--r-- 1 root root 18570 Mar  2 15:18 index/embeddings
00000000: 4978 4d70 0003 0000 0600 0000 0000 0000  IxMp............
00000010: 0000 1000 0000 0000 0000 1000 0000 0000  ................
00000020: 0100 0000 0049 7846 4900 0300 0006 0000  .....IxFI.......
00000030: 0000 0000 0000 0010 0000 0000 0000 0010  ................
00000040: 0000 0000 0001 0000 0000 0012 0000 0000  ................
index/embeddings: data

The directory has three files: config, documents and embeddings.

config - The input configuration passed into the Embeddings object. Serialized with Python's pickle format.
documents - SQLite database. Stores the input text content and associated data.
embeddings - The embeddings index file. This is an Approximate Nearest Neighbor (ANN) index with either Faiss (default), Hnswlib or Annoy, depending on the settings.

Config

Given that the configuration file is serialized with Python pickle, it can be loaded in Python.

import json
import pickle

with open("index/config", "rb") as config:
  print(json.dumps(pickle.load(config), sort_keys=True, indent=2))

{
  "backend": "faiss",
  "build": {
    "create": "2022-03-02T15:18:41Z",
    "python": "3.7.12",
    "settings": {
      "components": "IDMap,Flat"
    },
    "system": "Linux (x86_64)",
    "txtai": "4.3.0"
  },
  "content": "sqlite",
  "dimensions": 768,
  "objects": true,
  "offset": 6,
  "path": "sentence-transformers/nli-mpnet-base-v2",
  "update": "2022-03-02T15:18:41Z"
}

Notice how this is the same output as embeddings.info().

Documents

The documents file is a SQLite database with three tables, documents, objects and sections. Let's take a look inside.

import pandas as pd
import sqlite3

from IPython.display import display, Markdown

# Print details of a txtai SQLite document database
def showdb(path):
  db = sqlite3.connect(path)

  display(Markdown("## Tables"))
  df = pd.read_sql_query("select name FROM sqlite_master where type='table'", db)
  display(df.style.hide_index())

  for table in df["name"]:
    display(Markdown(f"## {table}"))
    df = pd.read_sql_query(f"select * from {table}", db)

    # Truncate large binary objects
    if "object" in df:
      df["object"] = df["object"].str.slice(0, 25)

    display(df.style.hide_index())

showdb("index/documents")

Tables

name
documents
objects
sections

documents

id	data	tags	entry

objects

id	object	tags	entry

sections

indexid	id	text	tags	entry
0	0	US tops 5 million confirmed virus cases	None	2022-03-02 15:18:40.591760
1	1	Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg	None	2022-03-02 15:18:40.591760
2	2	Beijing mobilises invasion craft along coast as Taiwan tensions escalate	None	2022-03-02 15:18:40.591760
3	3	The National Park Service warns against sacrificing slower friends in a bear attack	None	2022-03-02 15:18:40.591760
4	4	Maine man wins $1M from $25 lottery ticket	None	2022-03-02 15:18:40.591760
5	5	Make huge profits without work, earn up to $100,000 a day	None	2022-03-02 15:18:40.591760

documents stores additional text fields as JSON, objects stores binary content and sections stores indexed text. The only table with data as of now is sections. sections stores the input (id, text, tags) elements along with internal ids and entry dates.

We'll come back to documents and objects.

Embeddings

Embeddings is the ANN index and what is queried when running similarity search. The default setting is to use Faiss. Let's inspect!

import faiss
import numpy as np

# Query
query = "feel good story"

# Read index
index = faiss.read_index("index/embeddings")
print(index)
print(f"Total records: {index.ntotal}, dimensions: {index.d}")
print()

# Generate query embeddings and run query
queries = np.array([embeddings.transform((None, query, None))])
scores, ids = index.search(queries, 1)

# Lookup query result from original data array
result = data[ids[0][0]]

# Show results
print("Query:", query)
print("Results:", result, ids, scores)

<faiss.swigfaiss.IndexIDMap; proxy of <Swig Object of type 'faiss::IndexIDMapTemplate< faiss::Index > *' at 0x7f68631cd750> >
Total records: 6, dimensions: 768

Query: feel good story
Results: Maine man wins $1M from $25 lottery ticket [[4]] [[0.08329004]]

Index compression

txtai normally saves index files to a directory. Indexes can also be compressed. Nothing is different other than the files being in an compressed file format vs a directory.

# Save index as tar.xz
embeddings.save("index.tar.xz")
!tar -tvJf index.tar.xz
!echo
!xz -l index.tar.xz
!echo

# Reload index
embeddings.load("index.tar.xz")

# Test search matches
embeddings.search("feel good story", 1)

drwx------ root/root         0 2022-03-02 15:18 ./
-rw-r--r-- root/root       295 2022-03-02 15:18 ./config
-rw-r--r-- root/root     28672 2022-03-02 15:18 ./documents
-rw-r--r-- root/root     18570 2022-03-02 15:18 ./embeddings

Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1       1     18.1 KiB     50.0 KiB  0.361  CRC64   index.tar.xz

[{'id': '4',
  'score': 0.08329004049301147,
  'text': 'Maine man wins $1M from $25 lottery ticket'}]

Content storage

Let's add additional metadata and binary content to the index and see how that is stored in the SQLite database.

import urllib

from IPython.display import Image

# Get an image
request = urllib.request.urlopen("https://raw.githubusercontent.com/neuml/txtai/master/demo.gif")

# Get data
data = request.read()

# Upsert new record having both text and an object
embeddings.upsert([("txtai", {"text": "txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.", "size": len(data), "object": data}, None)])

embeddings.save("index")
showdb("index/documents")

Tables

name
documents
objects
sections

documents

id	data	tags	entry
txtai	{"text": "txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.", "size": 47189}	None	2022-03-02 15:19:00.708223

objects

id	object	tags	entry
txtai	b'GIF89a\x9b\x04\x18\x03\xf5\x00\x00\x12\x13\x14\xcc\xcc\xcc\x13\x14\x15\xbd\xbd\xbd'	None	2022-03-02 15:19:00.708223

sections

indexid	id	text	tags	entry
0	0	US tops 5 million confirmed virus cases	None	2022-03-02 15:18:40.591760
1	1	Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg	None	2022-03-02 15:18:40.591760
2	2	Beijing mobilises invasion craft along coast as Taiwan tensions escalate	None	2022-03-02 15:18:40.591760
3	3	The National Park Service warns against sacrificing slower friends in a bear attack	None	2022-03-02 15:18:40.591760
4	4	Maine man wins $1M from $25 lottery ticket	None	2022-03-02 15:18:40.591760
5	5	Make huge profits without work, earn up to $100,000 a day	None	2022-03-02 15:18:40.591760
6	txtai	txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.	None	2022-03-02 15:19:00.708223

This section added a new record with metadata and binary content (truncated when printed here). The documents table enables additional fielded search with SQL.

embeddings.search("select * from txtai where size > 0")

[{'data': '{"text": "txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.", "size": 47189}',
  'entry': '2022-03-02 15:19:00.708223',
  'id': 'txtai',
  'indexid': 6,
  'object': <_io.BytesIO at 0x7f6861408a70>,
  'score': None,
  'tags': None,
  'text': 'txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.'}]

Metadata fields can also be selected and combined with similarity queries.

embeddings.search("select text, size, score from txtai where similar('machine learning') and score > 0.25 and size > 0")

[{'score': 0.5479326844215393,
  'size': 47189,
  'text': 'txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.'}]

The objects table enables additional binary content to be stored alongside an embeddings index. In some cases (image search), the object content is used to build embeddings.

Otherwise, it's the text field from sections. In both cases, associated binary objects are available at search time.

embeddings.search("select object from txtai where object is not null")

[{'object': <_io.BytesIO at 0x7f6863246470>}]

Wrapping up

This article gave an overview of the txtai embeddings index file format. This hopefully gives a basic understanding of the architecture and/or helps with debugging when running into issues.

See the following links for more information.

DEV Community

Anatomy of a txtai index

Install dependencies

Create index

Print index info

Save index and review file structure

Config

Documents

Tables

documents

objects

sections

Embeddings

Index compression

Content storage

Tables

documents

objects

sections

Wrapping up

Top comments (0)

Read next

How to Use Twitter API v2 (X API Free): A Complete Guide for Developers

AI for 3D Object Manufacturing: Innovate Your Workflow

Understanding Cody AI: A Powerful Programming Assistant

AI Assistant Learns to Google: New Model Combines Language AI with Autonomous Web Search