DEV Community

David Mezzetti for NeuML

Originally published at neuml.hashnode.dev

External database integration

txtai provides many default settings to help a developer quickly get started. For example, metadata is stored in SQLite, dense vectors in Faiss, sparse vectors in a terms index and graph data with NetworkX.

Each of these components is customizable and can be swapped with alternate implementations. This has been covered in several previous articles.

This article introduces how to store metadata in client-server relational databases. In addition to SQLite and DuckDB, any SQLAlchemy-supported database with JSON support can now be used.

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai[database] elasticsearch==7.10.1 datasets

Install Postgres

Next, we'll install Postgres and start a Postgres instance.

# Install and start Postgres
apt-get update && apt-get install -y postgresql
service postgresql start
sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"

Load a dataset

Now we're ready to load a dataset. We'll use the ag_news dataset. Its train split, which is what we load here, consists of 120,000 news headlines.

from datasets import load_dataset

# Load dataset
ds = load_dataset("ag_news", split="train")

Build an Embeddings instance with Postgres

Let's load this dataset into an embeddings database. We'll configure this instance to store metadata in Postgres. Note that the content parameter below is a SQLAlchemy connection string.
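To make the format of that connection string concrete, here's a minimal sketch that splits the URL used in this article into its parts with Python's standard library. The dictionary keys are labels picked for illustration only. They aren't txtai or SQLAlchemy settings, and SQLAlchemy does its own parsing internally.

```python
from urllib.parse import urlsplit

# Connection string from the example in this article
url = "postgresql+psycopg2://postgres:postgres@localhost/postgres"

parts = urlsplit(url)

# The scheme carries the SQLAlchemy dialect and, optionally, the driver
dialect, _, driver = parts.scheme.partition("+")

# Illustrative labels (assumption), not actual configuration keys
config = {
    "dialect": dialect,
    "driver": driver,
    "username": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "database": parts.path.lstrip("/"),
}

print(config)
```

Changing the dialect, driver, host or database name in the URL should be all it takes to point at another SQLAlchemy-supported database, assuming the matching driver is installed.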

This embeddings database will use the default vector settings and build that index locally.

import txtai

# Create embeddings
embeddings = txtai.Embeddings(
    content="postgresql+psycopg2://postgres:postgres@localhost/postgres",
)

# Index dataset
embeddings.index(ds["text"])

Let's run a search query and see what comes back.

embeddings.search("red sox defeat yankees")
[{'id': '63561',
  'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory Sunday over the Yankees that avoided a four-game sweep in the AL championship series...',
  'score': 0.8104304671287537},
 {'id': '63221',
  'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory over the Yankees on Sunday night that avoided a four-game sweep in the AL championship series...',
  'score': 0.8097385168075562},
 {'id': '66861',
  'text': 'Record-Breaking Red Sox Clinch World Series Berth  NEW YORK (Reuters) - The Boston Red Sox crushed the New  York Yankees 10-3 Wednesday to complete an historic comeback  victory over their arch-rivals by four games to three in the  American League Championship Series.',
  'score': 0.8003846406936646}]

As expected, we get the standard id, text and score fields with the top matches for the query. The difference, though, is that all the metadata normally stored in a local SQLite file is now stored in a Postgres server.

This opens up several possibilities, such as row-level security. If a row isn't returned by the database, it won't be shown here. Alternatively, the search could return only the ids and scores, which tells the user that a matching record exists even though they don't have access to it.
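The filtering idea can be sketched without a Postgres server. The example below is an illustration only: an in-memory SQLite table stands in for Postgres, and the owner column, the visible function and the hard-coded hits are all hypothetical. It isn't how txtai joins results internally, and a real deployment would rely on the database's own row-level security policies instead of a WHERE clause.

```python
import sqlite3

# In-memory SQLite stands in for the Postgres server in this sketch
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sections (id TEXT PRIMARY KEY, text TEXT, owner TEXT)")
db.executemany(
    "INSERT INTO sections VALUES (?, ?, ?)",
    [
        ("1", "Red Sox beat Yankees", "alice"),
        ("2", "Yankees win series", "bob"),
        ("3", "Red Sox clinch berth", "alice"),
    ],
)

# Hypothetical vector index hits: (id, score) pairs, best match first
hits = [("2", 0.91), ("1", 0.85), ("3", 0.80)]

def visible(hits, user):
    """Return only the hits whose rows the database exposes to this user."""
    results = []
    for uid, score in hits:
        row = db.execute(
            "SELECT id, text FROM sections WHERE id = ? AND owner = ?", (uid, user)
        ).fetchone()

        # Rows the database does not return are dropped from the results
        if row:
            results.append({"id": row[0], "text": row[1], "score": score})

    return results

print(visible(hits, "alice"))
```

Here the best-scoring hit belongs to another user, so it never reaches alice's results even though the vector index matched it.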

As with other supported databases, underlying database functions can be called from txtai SQL.

embeddings.search("SELECT id, text, md5(text), score FROM txtai WHERE similar('red sox defeat yankees')")
[{'id': '63561',
  'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory Sunday over the Yankees that avoided a four-game sweep in the AL championship series...',
  'md5': '1e55a78fdf0cb3be3ef61df650f0a50f',
  'score': 0.8104304671287537},
 {'id': '63221',
  'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory over the Yankees on Sunday night that avoided a four-game sweep in the AL championship series...',
  'md5': 'a0417e1fc503a5a2945c8755b6fb18d5',
  'score': 0.8097385168075562},
 {'id': '66861',
  'text': 'Record-Breaking Red Sox Clinch World Series Berth  NEW YORK (Reuters) - The Boston Red Sox crushed the New  York Yankees 10-3 Wednesday to complete an historic comeback  victory over their arch-rivals by four games to three in the  American League Championship Series.',
  'md5': '398a8508692aed109bd8c56f067a8083',
  'score': 0.8003846406936646}]

Note the addition of the Postgres md5 function to the query.
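For comparison, the same kind of hash can be computed client-side. This is a small sketch that assumes the database uses UTF-8 encoding, in which case the Postgres md5 function and the code below hash the same bytes. The md5_hex helper name is made up for this example.

```python
import hashlib

def md5_hex(text):
    """Hex MD5 digest of the UTF-8 bytes of text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

print(md5_hex("hello"))
# → 5d41402abc4b2a76b9719d911017c592
```

Running the hash in the database avoids pulling the full text back to the client just to compute it.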

Let's save and show the files in the embeddings database.

embeddings.save("vectors")
!ls -l vectors
total 183032
-rw-r--r-- 1 root root       355 Sep  7 16:38 config
-rw-r--r-- 1 root root 187420123 Sep  7 16:38 embeddings

Only the configuration and the local vectors index are stored in this case.

External indexing

As mentioned previously, all of the main components of txtai can be replaced with custom components. For example, there are external integrations for storing dense vectors in Weaviate and Qdrant to name a few.

Next, we'll build an example that stores metadata in Postgres and builds a sparse index with Elasticsearch.

Scoring component for Elasticsearch

First, we need to define a custom scoring component for Elasticsearch. While we could have used an existing integration, it's important to show that creating a new component doesn't take much effort (~70 lines of code). See below.

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

from txtai.scoring import Scoring

class Elastic(Scoring):
  def __init__(self, config=None):
    # Scoring configuration
    self.config = config if config else {}

    # Server parameters
    self.url = self.config.get("url", "http://localhost:9200")
    self.indexname = self.config.get("indexname", "testindex")

    # Elasticsearch connection
    self.connection = Elasticsearch(self.url)

    self.terms = True
    self.normalize = self.config.get("normalize")

  def insert(self, documents, index=None):
    rows = []
    for uid, document, tags in documents:
      rows.append((index, document))

      # Increment index
      index = index + 1

    # Bulk load rows into Elasticsearch, using the index position as the document id
    bulk(self.connection, ({"_index": self.indexname, "_id": uid, "text": text} for uid, text in rows))

  def index(self, documents=None):
    self.connection.indices.refresh(index=self.indexname)

  def search(self, query, limit=3):
    # Single query: return the first (and only) list of results
    return self.batchsearch([query], limit)[0]

  def batchsearch(self, queries, limit=3):
    # Generate bulk queries
    request = []
    for query in queries:
      req_head = {"index": self.indexname, "search_type": "dfs_query_then_fetch"}
      req_body = {
        "_source": False,
        "query": {"multi_match": {"query": query, "type": "best_fields", "fields": ["text"], "tie_breaker": 0.5}},
        "size": limit,
      }
      request.extend([req_head, req_body])

    # Run ES query once for all queries
    response = self.connection.msearch(body=request, request_timeout=600)

    # Read responses, one list of (id, score) per query
    results = []
    for resp in response["responses"]:
      result = resp["hits"]["hits"]
      results.append([(r["_id"], r["_score"]) for r in result])

    return results

  def count(self):
    response = self.connection.cat.count(self.indexname, params={"format": "json"})
    return int(response[0]["count"])

  def load(self, path):
    # No local storage
    pass

  def save(self, path):
    # No local storage
    pass
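The multi-search request built in batchsearch is a flat list that alternates a header and a body for each query. Here's that step pulled out as a standalone sketch that runs without an Elasticsearch server. The build_msearch helper is a hypothetical name used for illustration, and the header and body fields are copied from the component above.

```python
def build_msearch(queries, indexname="testindex", limit=3):
    """Build the alternating header/body list used by a multi-search request."""
    request = []
    for query in queries:
        # Header: which index to search and how
        head = {"index": indexname, "search_type": "dfs_query_then_fetch"}

        # Body: the query itself, returning ids and scores only
        body = {
            "_source": False,
            "query": {"multi_match": {"query": query, "type": "best_fields", "fields": ["text"], "tie_breaker": 0.5}},
            "size": limit,
        }
        request.extend([head, body])

    return request

request = build_msearch(["red sox defeat yankees", "world series"])
print(len(request))
# → 4
```

Two queries produce four entries, and the responses come back in the same order as the queries. That's why the request is best sent once, after the loop has collected every query.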

Elasticsearch server

As with Postgres, we'll install and start an Elasticsearch instance.

import os

# Download and extract elasticsearch
os.system("wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz")
os.system("tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz")
os.system("chown -R daemon:daemon elasticsearch-7.10.1")
from subprocess import Popen, PIPE, STDOUT

# Start and wait for server
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))
!sleep 30

Let's build the index. The only difference from the previous example is setting the custom scoring component.

import txtai

# Create embeddings
embeddings = txtai.Embeddings(
    keyword=True,
    content="postgresql+psycopg2://postgres:postgres@localhost/postgres",
    scoring="__main__.Elastic"
)

# Index dataset
embeddings.index(ds["text"])

Below is the same search as shown before.

embeddings.search("red sox defeat yankees")
[{'id': '66954',
  'text': 'Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees.',
  'score': 21.451942},
 {'id': '69577',
  'text': 'Passing thoughts on Yankees-Red Sox series The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game. The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game and it wasn #39;t even close.',
  'score': 20.923117},
 {'id': '67253',
  'text': 'Sox Victorious At Last!! BOSTON -- After suffering decades of defeat and disappointment, the 2004 Boston Red Sox made history Wednesday night, beating the Yankees in the house that Ruth built and claiming the American League championship trophy.',
  'score': 20.865997}]

And once again we get the top matches. This time, though, the index is in Elasticsearch. Why are the results and scores different? This is a keyword index, and the scores are Elasticsearch's raw BM25 scores rather than vector similarity scores.
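To see why BM25 scores aren't bounded the way similarity scores are, here's a rough sketch of the textbook BM25 formula with whitespace tokenization. It's an approximation for illustration: the k1 and b defaults, the tiny corpus and the bm25 function are assumptions, and Elasticsearch's analyzer and exact scoring details differ.

```python
import math
from collections import Counter

def bm25(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with textbook BM25."""
    tokens = [doc.lower().split() for doc in docs]
    avgdl = sum(len(t) for t in tokens) / len(tokens)
    n = len(tokens)

    scores = []
    for doc in tokens:
        freqs = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            # Number of documents containing the term
            df = sum(1 for t in tokens if term in t)
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))

            # Term frequency, dampened by k1 and adjusted for document length
            tf = freqs[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))

        scores.append(score)

    return scores

# Made-up corpus for illustration
docs = [
    "red sox defeat yankees",
    "yankees sign new pitcher",
    "stock markets close higher",
]
scores = bm25("red sox defeat yankees", docs)
print(scores)
```

Each matching term adds to the total, so a document that matches several rare terms can score well above 1.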

One enhancement to this component would be adding score normalization as seen in the standard scoring components.
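One simple way to do that is to divide each score by the top score, as sketched below. This is only one possible approach and not necessarily how txtai's built-in scoring components normalize. The normalize function is a hypothetical helper, not part of the component above.

```python
def normalize(results):
    """Scale (id, score) pairs so the top score is 1.0."""
    if not results:
        return []

    top = max(score for _, score in results)
    if top <= 0:
        return [(uid, 0.0) for uid, _ in results]

    return [(uid, score / top) for uid, score in results]

# Ids and raw BM25 scores from the Elasticsearch search above
raw = [("66954", 21.451942), ("69577", 20.923117), ("67253", 20.865997)]
scaled = normalize(raw)
print(scaled)
```

Scaling by the top score keeps the ranking intact. The tradeoff is that scores are only comparable within a single query, since every query's best match ends up at 1.0.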

For good measure, let's also show that the md5 function can be called here too.

embeddings.search("SELECT id, text, md5(text), score FROM txtai WHERE similar('red sox defeat yankees')")
[{'id': '66954',
  'text': 'Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees.',
  'md5': '29084f8640d4d72e402e991bc9fdbfa0',
  'score': 21.451942},
 {'id': '69577',
  'text': 'Passing thoughts on Yankees-Red Sox series The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game. The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game and it wasn #39;t even close.',
  'md5': '056983d301975084b49a5987185f2ddf',
  'score': 20.923117},
 {'id': '67253',
  'text': 'Sox Victorious At Last!! BOSTON -- After suffering decades of defeat and disappointment, the 2004 Boston Red Sox made history Wednesday night, beating the Yankees in the house that Ruth built and claiming the American League championship trophy.',
  'md5': '7838fcf610f0b569829c9bafdf9012f2',
  'score': 20.865997}]

Same results with the additional md5 column, as expected.

Explore the data stores

The last thing we'll do is see where and how this data is stored in Postgres and Elasticsearch.

Let's connect to the local Postgres instance and sample content from the sections table.

select id, text from sections where text like '%Red Sox%' and text like '%Yankees%' and text like '%defeat%' limit 3;
[('66954', 'Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees.'),
 ('62732', "BoSox, Astros Play for Crucial Game 4 Wins (AP) AP - The Boston Red Sox entered this AL championship series hoping to finally overcome their bitter r ... (50 characters truncated) ... n-game defeat last October. Instead, they've been reduced to trying to prevent the Yankees from completing a humiliating sweep in their own ballpark."),
 ('62752', "BoSox, Astros Play for Crucial Game 4 Wins The Boston Red Sox entered this AL championship series hoping to finally overcome their bitter rivals from ... (42 characters truncated) ... game defeat last October. Instead, they've been reduced to trying to prevent the Yankees from completing a humiliating sweep in their own ballpark...")]

As expected, we can see content stored directly in Postgres!

Now let's check Elasticsearch.

import json
import requests

response = requests.get("http://localhost:9200/_search?q=red+sox+defeat+yankees&size=3")
print(json.dumps(response.json(), indent=2))
{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3297,
      "relation": "eq"
    },
    "max_score": 21.451942,
    "hits": [
      {
        "_index": "testindex",
        "_type": "_doc",
        "_id": "66954",
        "_score": 21.451942,
        "_source": {
          "text": "Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees."
        }
      },
      {
        "_index": "testindex",
        "_type": "_doc",
        "_id": "69577",
        "_score": 20.923117,
        "_source": {
          "text": "Passing thoughts on Yankees-Red Sox series The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game. The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game and it wasn #39;t even close."
        }
      },
      {
        "_index": "testindex",
        "_type": "_doc",
        "_id": "67253",
        "_score": 20.865997,
        "_source": {
          "text": "Sox Victorious At Last!! BOSTON -- After suffering decades of defeat and disappointment, the 2004 Boston Red Sox made history Wednesday night, beating the Yankees in the house that Ruth built and claiming the American League championship trophy."
        }
      }
    ]
  }
}

These are the same results that came back from the embeddings database search.

Let's save the embeddings database and review what's stored.

embeddings.save("elastic")
!ls -l elastic
total 4
-rw-r--r-- 1 root root 155 Sep  7 16:39 config

And all we have is the configuration. No database, embeddings or scoring files. That data is in Postgres and Elasticsearch!

Wrapping up

This article showed how external databases and other external integrations can be used with embeddings databases. This architecture ensures that as new ways to index and store data become available, txtai can easily adapt.

This article also showed that creating a custom component takes little effort and can easily be done when there is no existing integration.
