txtai provides many default settings to help a developer quickly get started. For example, metadata is stored in SQLite, dense vectors in Faiss, sparse vectors in a terms index and graph data with NetworkX.
Each of these components is customizable and can be swapped with alternate implementations. This has been covered in several previous articles.
This article will introduce how to store metadata in client-server RDBMS systems. In addition to SQLite and DuckDB, any SQLAlchemy-supported database with JSON support can now be used.
Install dependencies
Install txtai
and all dependencies.
# Install txtai
pip install txtai[database] elasticsearch==7.10.1 datasets
Install Postgres
Next, we'll install Postgres and start a Postgres instance.
# Install and start Postgres
apt-get update && apt-get install postgresql
service postgresql start
sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"
Load a dataset
Now we're ready to load a dataset. We'll use the ag_news
dataset. This dataset consists of 120,000 news headlines.
from datasets import load_dataset
# Load dataset
ds = load_dataset("ag_news", split="train")
Build an Embeddings instance with Postgres
Let's load this dataset into an embeddings database. We'll configure this instance to store metadata in Postgres. Note that the content parameter below is a SQLAlchemy connection string.
This embeddings database will use the default vector settings and build that index locally.
import txtai
# Create embeddings
embeddings = txtai.Embeddings(
content="postgresql+psycopg2://postgres:postgres@localhost/postgres",
)
# Index dataset
embeddings.index(ds["text"])
Let's run a search query and see what comes back.
embeddings.search("red sox defeat yankees")
[{'id': '63561',
'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory Sunday over the Yankees that avoided a four-game sweep in the AL championship series...',
'score': 0.8104304671287537},
{'id': '63221',
'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory over the Yankees on Sunday night that avoided a four-game sweep in the AL championship series...',
'score': 0.8097385168075562},
{'id': '66861',
'text': 'Record-Breaking Red Sox Clinch World Series Berth NEW YORK (Reuters) - The Boston Red Sox crushed the New York Yankees 10-3 Wednesday to complete an historic comeback victory over their arch-rivals by four games to three in the American League Championship Series.',
'score': 0.8003846406936646}]
As expected, we get the standard id, text, score
fields with the top matches for the query. The difference though is that all the database metadata normally stored in a local SQLite file is now stored in a Postgres server.
This opens up several possibilities such as row-level security. If a row isn't returned by the database, it won't be shown here. Alternatively, this search could optionally return only the ids and scores, which lets the user know a record exists they don't have access to.
As with other supported databases, underlying database functions can be called from txtai SQL.
embeddings.search("SELECT id, text, md5(text), score FROM txtai WHERE similar('red sox defeat yankees')")
[{'id': '63561',
'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory Sunday over the Yankees that avoided a four-game sweep in the AL championship series...',
'md5': '1e55a78fdf0cb3be3ef61df650f0a50f',
'score': 0.8104304671287537},
{'id': '63221',
'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory over the Yankees on Sunday night that avoided a four-game sweep in the AL championship series...',
'md5': 'a0417e1fc503a5a2945c8755b6fb18d5',
'score': 0.8097385168075562},
{'id': '66861',
'text': 'Record-Breaking Red Sox Clinch World Series Berth NEW YORK (Reuters) - The Boston Red Sox crushed the New York Yankees 10-3 Wednesday to complete an historic comeback victory over their arch-rivals by four games to three in the American League Championship Series.',
'md5': '398a8508692aed109bd8c56f067a8083',
'score': 0.8003846406936646}]
Note the addition of the Postgres md5
function to the query.
Let's save and show the files in the embeddings database.
embeddings.save("vectors")
!ls -l vectors
total 183032
-rw-r--r-- 1 root root 355 Sep 7 16:38 config
-rw-r--r-- 1 root root 187420123 Sep 7 16:38 embeddings
Only the configuration and the local vectors index are stored in this case.
External indexing
As mentioned previously, all of the main components of txtai can be replaced with custom components. For example, there are external integrations for storing dense vectors in Weaviate and Qdrant to name a few.
Next, we'll build an example that stores metadata in Postgres and builds a sparse index with Elasticsearch.
Scoring component for Elasticsearch
First, we need to define a custom scoring component for Elasticsearch. While could have used an existing integration, it's important to show that creating a new component isn't a large LOE (~70 lines of code). See below.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from txtai.scoring import Scoring
class Elastic(Scoring):
def __init__(self, config=None):
# Scoring configuration
self.config = config if config else {}
# Server parameters
self.url = self.config.get("url", "http://localhost:9200")
self.indexname = self.config.get("indexname", "testindex")
# Elasticsearch connection
self.connection = Elasticsearch(self.url)
self.terms = True
self.normalize = self.config.get("normalize")
def insert(self, documents, index=None):
rows = []
for uid, document, tags in documents:
rows.append((index, document))
# Increment index
index = index + 1
bulk(self.connection, ({"_index": self.indexname, "_id": uid, "text": text} for uid, text in rows))
def index(self, documents=None):
self.connection.indices.refresh(index=self.indexname)
def search(self, query, limit=3):
return self.batchsearch([query], limit)
def batchsearch(self, queries, limit=3):
# Generate bulk queries
request = []
for query in queries:
req_head = {"index": self.indexname, "search_type": "dfs_query_then_fetch"}
req_body = {
"_source": False,
"query": {"multi_match": {"query": query, "type": "best_fields", "fields": ["text"], "tie_breaker": 0.5}},
"size": limit,
}
request.extend([req_head, req_body])
# Run ES query
response = self.connection.msearch(body=request, request_timeout=600)
# Read responses
results = []
for resp in response["responses"]:
result = resp["hits"]["hits"]
results.append([(r["_id"], r["_score"]) for r in result])
return results
def count(self):
response = self.connection.cat.count(self.indexname, params={"format": "json"})
return int(response[0]["count"])
def load(self, path):
# No local storage
pass
def save(self, path):
# No local storage
pass
Elasticsearch server
As with Postgres, we'll install and start an Elasticsearch instance.
import os
# Download and extract elasticsearch
os.system("wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz")
os.system("tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz")
os.system("chown -R daemon:daemon elasticsearch-7.10.1")
from subprocess import Popen, PIPE, STDOUT
# Start and wait for serverw
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))
!sleep 30
Let's build the index. The only difference from the previous example is setting the custom scoring
component.
import txtai
# Creat embeddings
embeddings = txtai.Embeddings(
keyword=True,
content="postgresql+psycopg2://postgres:postgres@localhost/postgres",
scoring= "__main__.Elastic"
)
# Index dataset
embeddings.index(ds["text"])
Below is the same search as shown before.
embeddings.search("red sox defeat yankees")
[{'id': '66954',
'text': 'Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees.',
'score': 21.451942},
{'id': '69577',
'text': 'Passing thoughts on Yankees-Red Sox series The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game. The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game and it wasn #39;t even close.',
'score': 20.923117},
{'id': '67253',
'text': 'Sox Victorious At Last!! BOSTON -- After suffering decades of defeat and disappointment, the 2004 Boston Red Sox made history Wednesday night, beating the Yankees in the house that Ruth built and claiming the American League championship trophy.',
'score': 20.865997}]
And once again we get the top matches. This time though the index is in Elasticsearch. Why are results and scores different? This is because this is a keyword index and it's using Elasticsearch's raw BM25 scores.
One enhancement to this component would be adding score normalization as seen in the standard scoring components.
For good measure, let's also show that the md5
function can be called here too.
embeddings.search("SELECT id, text, md5(text), score FROM txtai WHERE similar('red sox defeat yankees')")
[{'id': '66954',
'text': 'Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees.',
'md5': '29084f8640d4d72e402e991bc9fdbfa0',
'score': 21.451942},
{'id': '69577',
'text': 'Passing thoughts on Yankees-Red Sox series The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game. The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game and it wasn #39;t even close.',
'md5': '056983d301975084b49a5987185f2ddf',
'score': 20.923117},
{'id': '67253',
'text': 'Sox Victorious At Last!! BOSTON -- After suffering decades of defeat and disappointment, the 2004 Boston Red Sox made history Wednesday night, beating the Yankees in the house that Ruth built and claiming the American League championship trophy.',
'md5': '7838fcf610f0b569829c9bafdf9012f2',
'score': 20.865997}]
Same results with the additional md5
column, as expected.
Explore the data stores
The last thing we'll do is see where and how this data is stored in Postgres and Elasticsearch.
Let's connect to the local Postgres instance and sample content from the sections
table.
select id, text from sections where text like '%Red Sox%' and text like '%Yankees%' and text like '%defeat%' limit 3;
[('66954', 'Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees.'),
('62732', "BoSox, Astros Play for Crucial Game 4 Wins (AP) AP - The Boston Red Sox entered this AL championship series hoping to finally overcome their bitter r ... (50 characters truncated) ... n-game defeat last October. Instead, they've been reduced to trying to prevent the Yankees from completing a humiliating sweep in their own ballpark."),
('62752', "BoSox, Astros Play for Crucial Game 4 Wins The Boston Red Sox entered this AL championship series hoping to finally overcome their bitter rivals from ... (42 characters truncated) ... game defeat last October. Instead, they've been reduced to trying to prevent the Yankees from completing a humiliating sweep in their own ballpark...")]
As expected, we can see content stored directly in Postgres!
Now let's check Elasticsearch.
import json
import requests
response = requests.get("http://localhost:9200/_search?q=red+sox+defeat+yankees&size=3")
print(json.dumps(response.json(), indent=2))
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3297,
"relation": "eq"
},
"max_score": 21.451942,
"hits": [
{
"_index": "testindex",
"_type": "_doc",
"_id": "66954",
"_score": 21.451942,
"_source": {
"text": "Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees."
}
},
{
"_index": "testindex",
"_type": "_doc",
"_id": "69577",
"_score": 20.923117,
"_source": {
"text": "Passing thoughts on Yankees-Red Sox series The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game. The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game and it wasn #39;t even close."
}
},
{
"_index": "testindex",
"_type": "_doc",
"_id": "67253",
"_score": 20.865997,
"_source": {
"text": "Sox Victorious At Last!! BOSTON -- After suffering decades of defeat and disappointment, the 2004 Boston Red Sox made history Wednesday night, beating the Yankees in the house that Ruth built and claiming the American League championship trophy."
}
}
]
}
}
Same query results as what was run through the embeddings database.
Let's save the embeddings database and review what's stored.
embeddings.save("elastic")
!ls -l elastic
total 4
-rw-r--r-- 1 root root 155 Sep 7 16:39 config
And all we have is the configuration. No database
, embeddings
or scoring
files. That data is in Postgres and Elasticsearch!
Wrapping up
This article showed how external databases and other external integrations can be used with embeddings databases. This architecture ensures that as new ways to index and store data become available, txtai can easily adapt.
This article also showed how creating a custom component is a low level of effort and can easily be done for a component without an existing integration.
Top comments (0)