
David Mezzetti for NeuML

Originally published at neuml.hashnode.dev

Embeddings index components

The main components of txtai are embeddings, pipeline, workflow and an API. The following shows the top-level view of the txtai src tree.

Abbreviated listing of src/txtai
 ann
 api
 database
 embeddings
 pipeline
 scoring
 vectors
 workflow

One might ask why ann, database, scoring and vectors are top-level packages and not under the embeddings package. The reason is that each of these packages is modular and can be used on its own! The embeddings package provides the glue between these components, making everything easy to use.

This article will go through a series of examples demonstrating how these components can be used standalone as well as combined together to build custom search indexes.

Note: This is intended as a deep dive into txtai embeddings components. There are much simpler high-level APIs for standard use cases.

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai datasets

Load dataset

This example will use the ag_news dataset, which is a collection of news article headlines with short descriptions.

from datasets import load_dataset

dataset = load_dataset("ag_news", split="train")

Approximate nearest neighbor (ANN) and Vectors

In this section, we'll use the ann and vectors packages to build a similarity index over the ag_news dataset.

The first step is vectorizing the text. We'll use a sentence-transformers model.

import numpy as np

from txtai.vectors import VectorsFactory

model = VectorsFactory.create({"path": "sentence-transformers/all-MiniLM-L6-v2"}, None)

# List of all text elements
texts = dataset["text"]

# Create embeddings buffer, vector model has 384 features
embeddings = np.zeros(dtype=np.float32, shape=(len(texts), 384))

# Vectorize text in batches
batch, index, batchsize = [], 0, 128
for text in texts:
  batch.append(text)

  if len(batch) == batchsize:
    vectors = model.encode(batch)
    embeddings[index : index + vectors.shape[0]] = vectors
    index += vectors.shape[0]
    batch = []

# Last batch
if batch:
  vectors = model.encode(batch)
  embeddings[index : index + vectors.shape[0]] = vectors

# Normalize embeddings
embeddings /= np.linalg.norm(embeddings, axis=1)[:, np.newaxis]

# Print shape
embeddings.shape
(120000, 384)
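
The batching loop above can also be factored into a small generator, which removes the separate "last batch" step. The following is a minimal sketch using only the standard library; `batches` is a hypothetical helper name and is not part of txtai.

```python
from itertools import islice

def batches(items, size):
    """Yield successive lists of at most `size` elements from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Example: 10 items in batches of 4 gives batch sizes 4, 4 and 2
sizes = [len(batch) for batch in batches(range(10), 4)]
print(sizes)
```

With a helper like this, each batch could be passed to `model.encode` and written into the embeddings buffer inside a single loop.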

Next we'll build a vector index using these embeddings!

from txtai.ann import ANNFactory

# Create Faiss index using normalized embeddings
ann = ANNFactory.create({"backend": "faiss"})
ann.index(embeddings)

# Show total
ann.count()
120000

Now let's run a search.

query = model.encode(["best planets to explore for life"])
query /= np.linalg.norm(query)

for uid, score in ann.search(query, 3)[0]:
  print(uid, texts[uid], score)
17752 Rocky Road: Planet hunting gets closer to Earth Astronomers have discovered the three lightest planets known outside the solar system, moving researchers closer to the goal of finding extrasolar planets that resemble Earth. 0.599043607711792
16158 Earth #39;s  #39;big brothers #39; floating around stars Washington - A new class of planets has been found orbiting stars besides our sun, in a possible giant leap forward in the search for Earth-like planets that might harbour life. 0.5688529014587402
45029 Coming Soon: "Good" Jupiters Most of the extrasolar planets discovered to date are gas giants like Jupiter, but their orbits are either much closer to their parent stars or are highly eccentric. Planet hunters are on the verge of confirming the discovery of Jupiter-size planets with Jupiter-like orbits. Solar systems that contain these "good" Jupiters may harbor habitable Earth-like planets as well. 0.5606889724731445

And there it is, a full vector search system without using the embeddings package.
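
To make the role of the ANN index concrete, it helps to look at the exhaustive search it approximates. Because both the query and the indexed vectors are normalized to unit length, the inner product between them is their cosine similarity. The sketch below is a toy illustration with made-up 2-dimensional vectors; `normalize` and `exact_search` are hypothetical helper names, not txtai or Faiss functions.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def exact_search(query, vectors, limit):
    """Return the top (index, score) pairs by inner product, highest first."""
    scores = [
        (uid, sum(q * v for q, v in zip(query, vector)))
        for uid, vector in enumerate(vectors)
    ]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:limit]

# Toy 2-dimensional "embeddings", normalized like the real ones
vectors = [normalize(v) for v in [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]]
query = normalize([2.0, 0.1])

# Vector 0 points in nearly the same direction as the query, so it ranks first
results = exact_search(query, vectors, 2)
print(results)
```

An exhaustive scan like this compares the query with every vector. An ANN index trades a small amount of accuracy for much faster lookups on large collections.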

Just as a reminder, the following much simpler code does the same thing with an Embeddings instance.

from txtai.embeddings import Embeddings

embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2"})
embeddings.index((x, text, None) for x, text in enumerate(texts))

for uid, score in embeddings.search("best planets to explore for life"):
  print(uid, texts[uid], score)
17752 Rocky Road: Planet hunting gets closer to Earth Astronomers have discovered the three lightest planets known outside the solar system, moving researchers closer to the goal of finding extrasolar planets that resemble Earth. 0.599043607711792
16158 Earth #39;s  #39;big brothers #39; floating around stars Washington - A new class of planets has been found orbiting stars besides our sun, in a possible giant leap forward in the search for Earth-like planets that might harbour life. 0.568852961063385
45029 Coming Soon: "Good" Jupiters Most of the extrasolar planets discovered to date are gas giants like Jupiter, but their orbits are either much closer to their parent stars or are highly eccentric. Planet hunters are on the verge of confirming the discovery of Jupiter-size planets with Jupiter-like orbits. Solar systems that contain these "good" Jupiters may harbor habitable Earth-like planets as well. 0.560688853263855

Database

When the content parameter is enabled, an Embeddings instance stores both vector content and raw content in a database. But the database package can be used standalone too.

from txtai.database import DatabaseFactory

# Load content into database
database = DatabaseFactory.create({"content": True})
database.insert((x, row, None) for x, row in enumerate(dataset))

# Show total
database.search("select count(*) from txtai")
[{'count(*)': 120000}]

The full txtai SQL query syntax is available, including working with dynamically created fields.

database.search("select count(*), label from txtai group by label")
[{'count(*)': 30000, 'label': 0},
 {'count(*)': 30000, 'label': 1},
 {'count(*)': 30000, 'label': 2},
 {'count(*)': 30000, 'label': 3}]

Let's run a query to find text containing the word planets.

for row in database.search("select id, text from txtai where text like '%planets%' limit 3"):
  print(row["id"], row["text"])
100 Comets, Asteroids and Planets around a Nearby Star (SPACE.com) SPACE.com - A nearby star thought to harbor comets and asteroids now appears to be home to planets, too. The presumed worlds are smaller than Jupiter and could be as tiny as Pluto, new observations suggest.
102 Redesigning Rockets: NASA Space Propulsion Finds a New Home (SPACE.com) SPACE.com - While the exploration of the Moon and other planets in our solar system is nbsp;exciting, the first task for astronauts and robots alike is to actually nbsp;get to those destinations.
272 Sharpest Image Ever Obtained of a Circumstellar Disk Reveals Signs of Young Planets MAUNA KEA, Hawaii -- The sharpest image ever taken of a dust disk around another star has revealed structures in the disk which are signs of unseen planets.     Dr...

Since this is just a SQL database, text search is quite limited. The query above only retrieved rows whose text contains the literal word planets, with no ranking by relevance.
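
The limitation is easy to reproduce with a plain SQL database. The sketch below uses Python's built-in sqlite3 module with an in-memory table; the `articles` table and its rows are made up for illustration and do not reflect txtai's internal schema.

```python
import sqlite3

# In-memory table with a layout chosen for illustration only, not txtai's schema
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, text TEXT)")
connection.executemany(
    "INSERT INTO articles (id, text) VALUES (?, ?)",
    [
        (0, "Astronomers find new planets orbiting a nearby star"),
        (1, "A planet like Earth could support life"),
        (2, "Stock markets rally on strong earnings"),
    ],
)

# LIKE matches the literal substring only: row 1 says "planet", not "planets"
rows = connection.execute(
    "SELECT id FROM articles WHERE text LIKE '%planets%' ORDER BY id"
).fetchall()
print(rows)
```

Row 1 is clearly on topic but is not returned, because a substring match has no notion of word forms or meaning.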

Scoring

Since the original txtai release, there has been a scoring package. The main use case for this package is building a weighted sentence embeddings vector when using word vector models. But this package can also be used standalone to build BM25, TF-IDF and/or SIF text indexes.

from txtai.scoring import ScoringFactory

# Build index
scoring = ScoringFactory.create({"method": "bm25", "terms": True, "content": True})
scoring.index((x, text, None) for x, text in enumerate(texts))

# Show total
scoring.count()
120000

Next, run a search against the index.

for row in scoring.search("planets explore life earth", 3):
  print(row["id"], row["text"], row["score"])
16327 3 Planets Are Found Close in Size to Earth, Making Scientists Think 'Life' A trio of newly discovered worlds are much smaller than any other planets previously discovered outside of the solar system. 17.768332448130707
16158 Earth #39;s  #39;big brothers #39; floating around stars Washington - A new class of planets has been found orbiting stars besides our sun, in a possible giant leap forward in the search for Earth-like planets that might harbour life. 17.65941968170793
16620 New Planets could advance search for Life Astronomers in Europe and the United States have found two new planets about 20 times the size of Earth beyond the solar system. The discovery might be a giant leap forward in  17.65941968170793

The search above ran a BM25 search across the dataset. BM25 matches on the literal query terms, so the results are more keyword-driven than the vector search results. With proper query construction, the results can be decent.

Comparing the earlier vector search results with these results is a good lesson in the differences between keyword and vector search.
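
For readers who want to see what a BM25 score is made of, here is a minimal sketch of the textbook Okapi BM25 formula. It is not txtai's implementation: the whitespace tokenizer, the `bm25_scores` name and the parameter defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with the textbook Okapi BM25 formula."""
    tokenized = [document.lower().split() for document in documents]
    avgdl = sum(len(tokens) for tokens in tokenized) / len(tokenized)

    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency weight
            idf = math.log(1 + (len(tokenized) - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length with b
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(tokens) / avgdl))
        scores.append(score)
    return scores

documents = [
    "new planets found close in size to earth",
    "astronomers search for life on other worlds",
    "stock markets rally on strong earnings",
]
scores = bm25_scores("planets life earth", documents)
print(scores)
```

The first document matches two query terms and scores highest, the second matches one, and the third matches none and scores zero. Only exact term overlap counts, which is the behavior seen in the search results above.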

Database and Scoring

Earlier we showed how the ann and vectors components can be combined to build a vector search engine. Can we combine the database and scoring components to add keyword search to a database? Yes!

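def search(query, limit=3):
  # Get similar clauses, if any
  similar = database.parse(query).get("similar")

  # Run a scoring search for each similar clause, fetching extra candidates
  results = [scoring.search(args[0], limit * 10) for args in similar] if similar else None

  # Run the SQL query, joined with the scoring results
  return database.search(query, results, limit)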

# Rebuild scoring - only need terms index
scoring = ScoringFactory.create({"method": "bm25", "terms": True})
scoring.index((x, text, None) for x, text in enumerate(texts))

for row in search("select id, text, score from txtai where similar('planets explore life earth') and label = 0"):
  print(row["id"], row["text"], row["score"])
15363 NASA to Announce New Class of Planets Astronomers have discovered four new planets in a week's time, an exciting end-of-summer flurry that signals a sharper era in the hunt for new worlds.    While none of these new bodies would be mistaken as Earth's twin, some appear to be noticeably smaller and more solid - more like Earth and Mars - than the gargantuan, gaseous giants identified before... 12.582923259697132
15900 Astronomers Spot Smallest Planets Yet American astronomers say they have discovered the two smallest planets yet orbiting nearby stars, trumping a small planet discovery by European scientists five days ago and capping the latest round in a frenzied hunt for other worlds like Earth.    All three of these smaller planets belong to a new class of "exoplanets" - those that orbit stars other than our sun, the scientists said in a briefing Tuesday... 12.563928231067155
15879 Astronomers see two new planets US astronomers find the smallest worlds detected circling other stars and say it is a breakthrough in the search for life in space. 12.078383982352994

And there it is, scoring-based similarity search with the same syntax as standard txtai vector queries, including additional filters!
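
The `search` function above requests `limit * 10` scoring results rather than `limit`. One plausible reason, which is an assumption and not stated in the article, is that additional filters such as `label = 0` discard some candidates, so extra rows are needed to still fill the limit. The sketch below illustrates that idea with made-up ids, scores and labels; `filtered_search` is a hypothetical helper, not a txtai function.

```python
def filtered_search(candidates, labels, label, limit):
    """Keep candidates whose label matches, preserving score order, up to limit rows."""
    return [(uid, score) for uid, score in candidates if labels[uid] == label][:limit]

# Hypothetical keyword results as (id, score), sorted by score descending
candidates = [(4, 9.1), (0, 8.7), (3, 8.2), (1, 7.5), (2, 6.9), (5, 6.0)]

# Hypothetical label for each id
labels = {0: 1, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0}

# Fetching only 3 candidates before filtering leaves a single label 0 row
narrow = filtered_search(candidates[:3], labels, 0, 3)
print(narrow)

# Over-fetching gives the filter enough rows to fill the limit
wide = filtered_search(candidates, labels, 0, 3)
print(wide)
```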

txtai is built on vector search, machine learning and finding results based on semantic meaning. From a functionality standpoint, the advantages of vector search over keyword search have been well discussed. The one advantage keyword search has is speed.

Wrapping up

This notebook walked through each of the packages used by an Embeddings index. The Embeddings index makes this all transparent and easy to use. But each of the components stands on its own and can be individually integrated into a project!
