Part 2 and Part 3 of this series showed how to index and search data in txtai. Part 2 indexed and searched a Hugging Face Dataset; Part 3 indexed and searched an external data source.
txtai is modular in design and its components can be used individually. txtai has a similarity function that works on lists of text. This method can be integrated with any external search service, such as a REST API, a SQL query or anything else that returns text search results.
In this article, we'll take the same Hugging Face Dataset used in Part 2, index it in Elasticsearch and rank the search results using a semantic similarity function from txtai.
Install dependencies
Install txtai, datasets and Elasticsearch.
# Install txtai, datasets and the Elasticsearch Python client (pinned to match the 7.10.1 server below)
pip install txtai datasets elasticsearch==7.10.1
# Download and extract elasticsearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.10.1
Start an instance of Elasticsearch.
import os
import time
from subprocess import Popen, PIPE, STDOUT
# Start the server as a non-root user (uid 1), then wait for it to come up
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))
time.sleep(30)
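A fixed 30-second wait can be too long on a fast machine and too short on a slow one. As a hedged alternative, the sketch below polls the HTTP port until the server answers. The helper name `wait_for_http` and its parameters are illustrative assumptions, not part of txtai or the Elasticsearch client.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=60.0, interval=1.0):
    """Poll url until it answers over HTTP or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered with an error status, so it is up
            return True
        except (urllib.error.URLError, OSError):
            # Not accepting connections yet
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Assuming the default local port, calling `wait_for_http("http://localhost:9200")` after starting the process would return as soon as the server responds, or `False` if it never does.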
Load data into Elasticsearch
The following block loads the dataset into Elasticsearch.
from datasets import load_dataset
from elasticsearch import Elasticsearch, helpers
# Connect to ES instance
es = Elasticsearch(hosts=["http://localhost:9200"], timeout=60, retry_on_timeout=True)
# Load HF dataset
dataset = load_dataset("ag_news", split="train")["text"][:50000]
# Elasticsearch bulk buffer
buffer = []
rows = 0
for x, text in enumerate(dataset):
    # Article record
    article = {"_id": x, "_index": "articles", "title": text}
    # Buffer article
    buffer.append(article)
    # Increment number of articles processed
    rows += 1
    # Bulk load every 1000 records
    if rows % 1000 == 0:
        helpers.bulk(es, buffer)
        buffer = []
        print("Inserted {} articles".format(rows), end="\r")
# Load any remaining records
if buffer:
    helpers.bulk(es, buffer)
print("Total articles inserted: {}".format(rows))
Total articles inserted: 50000
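The buffering pattern above can also be written as two small generators, which keeps the batching logic separate from record construction. This is a sketch only; the helper names `batches` and `actions` are hypothetical and exist in neither txtai nor the Elasticsearch client.

```python
from itertools import islice


def batches(iterable, size):
    """Yield consecutive lists of up to size items from iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def actions(texts, index="articles"):
    """Build one bulk action per text, mirroring the article record above."""
    for x, text in enumerate(texts):
        yield {"_id": x, "_index": index, "title": text}
```

With these in place, the load loop would reduce to `for batch in batches(actions(dataset), 1000): helpers.bulk(es, batch)`.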
Query data with Elasticsearch
Elasticsearch is a token-based search system. Queries and documents are parsed into tokens and the most relevant query-document matches are calculated using a scoring algorithm. The default scoring algorithm is BM25. Powerful queries can be built using a rich query syntax and Query DSL.
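To make token-based scoring concrete, here is a minimal sketch of a textbook BM25 formula. It assumes whitespace tokenization and the commonly cited parameters k1=1.2 and b=0.75; Elasticsearch's own analysis and scoring differ in detail, so treat this as an illustration rather than its implementation. The function name `bm25_scores` and the sample documents are made up for the example.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against query with a textbook BM25 formula."""
    docs = [doc.lower().split() for doc in documents]
    terms = query.lower().split()
    total = len(docs)
    avgdl = sum(len(doc) for doc in docs) / total

    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for term in terms:
            tf = counts[term]
            if tf == 0:
                continue
            # Number of documents containing the term
            n = sum(1 for other in docs if term in other)
            idf = math.log(1 + (total - n + 0.5) / (n + 0.5))
            norm = tf + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Illustrative documents, not taken from the dataset
documents = [
    "yankees lose to the red sox",
    "red sox lose for the first time",
    "stocks rally as economy grows",
]
scores = bm25_scores("yankees lose", documents)
```

The first document scores highest because it contains both tokens, the second matches only `lose`, and the third scores zero. A document that said "yankees defeated" would get no credit for `lose`, since scoring only sees exact tokens.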
The following section runs a query against Elasticsearch, finds the top 5 matches and returns the corresponding documents associated with each match.
from IPython.display import display, HTML
def table(category, query, rows):
    html = """
    <style type='text/css'>
    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');
    table {
      border-collapse: collapse;
      width: 900px;
    }
    th, td {
      border: 1px solid #9e9e9e;
      padding: 10px;
      font: 15px Oswald;
    }
    </style>
    """
    html += "<h3>[%s] %s</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead>" % (category, query)
    for score, text in rows:
        html += "<tr><td>%.4f</td><td>%s</td></tr>" % (score, text)
    html += "</table>"
    display(HTML(html))
def search(query, limit):
    # Build a query_string request that returns at most limit hits
    query = {
        "size": limit,
        "query": {
            "query_string": {"query": query}
        }
    }
    results = []
    for result in es.search(index="articles", body=query)["hits"]["hits"]:
        source = result["_source"]
        # Cap the raw score at 18 and scale it to the 0-1 range for display
        results.append((min(result["_score"], 18) / 18, source["title"]))
    return results
limit = 3
query = "+yankees lose"
table("Elasticsearch", query, search(query, limit))
[Elasticsearch] +yankees lose
Score | Text |
---|---|
0.5817 | El Duque adds to gloomy NY forecast The Yankees #39; staff infection has spread to the one man the team can #39;t afford to lose. Orlando Hernandez was scratched from last night #39;s scheduled start because |
0.5697 | Rangers Derail Red Sox The Red Sox lose for the first time in 11 games, falling to the Rangers 8-6 Saturday and missing a chance to pull within 1 1/2 games of the Yankees in the AL East. |
0.5069 | Rout leaves Yanks #39; lead at 3 Royals gain control with 10-run 5th Against a nothing-to-lose team such as the Kansas City Royals, the Yankees #39; manager wanted his team to put down the hammer early and not let baseball #39;s second worst team believe it had a chance. |
The table above shows the results for the query +yankees lose. The + prefix makes the token yankees required, while lose is optional and only affects ranking. The search doesn't understand the semantic meaning of the query; it returns the most relevant results containing those tokens.
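To illustrate why a required token does not capture meaning, here is a toy matcher for the `+` prefix. It is a simplified stand-in that assumes whitespace tokenization and is not Elasticsearch's query parser. The function names and sample sentences are hypothetical.

```python
def parse_query(query):
    """Split a query into required (+prefixed) and optional tokens."""
    required, optional = [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        else:
            optional.append(token)
    return required, optional


def matches(query, text):
    """Return (is_match, optional_hits) for text under the toy rules."""
    required, optional = parse_query(query)
    tokens = set(text.lower().split())
    # Every required token must be present
    if not all(term in tokens for term in required):
        return False, 0
    # Optional tokens only count toward ranking
    return True, sum(1 for term in optional if term in tokens)


# Matches, but gets no credit for "lose" because "loss" is a different token
a = matches("+yankees lose", "Yankees suffered the largest loss in their history")
# Contains "lose" but not the required "yankees", so it is rejected
b = matches("+yankees lose", "Red Sox lose to the Rangers")
# Contains both tokens, even though the Yankees did not actually lose here
c = matches("+yankees lose", "the yankees cannot afford to lose")
```

Under these rules the third sentence outranks the first, even though only the first describes a Yankees defeat.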
In this case, the results don't capture the meaning of the search. Let's try adding semantic similarity to the search!
Ranking search results with txtai
txtai has a similarity module that computes the similarity between a query and a list of strings. Of course, txtai can also build a full index as shown in the previous articles but in this case we'll just use the ad-hoc similarity function.
The code below creates a Similarity instance and defines a ranking function to order search results based on the computed similarity.
ranksearch queries Elasticsearch for a larger set of results, ranks the results using the similarity instance and returns the top n results.
from txtai.pipeline import Similarity
def ranksearch(query, limit):
    results = [text for _, text in search(query, limit * 10)]
    return [(score, results[x]) for x, score in similarity(query, results)][:limit]
# Create similarity instance for re-ranking
similarity = Similarity("valhalla/distilbart-mnli-12-3")
Now let's re-run the previous search.
# Run the search
table("Elasticsearch + txtai", query, ranksearch(query, limit))
[Elasticsearch + txtai] +yankees lose
Score | Text |
---|---|
0.9929 | Ouch! Yankees hit new low INDIANS 22, YANKEES 0---At New York, Omar Vizquel went 6-for-7 to tie the American League record for hits as Cleveland handed the Yankees the largest loss in their history last night. |
0.9874 | Vazquez and Yankees Buckle Early Because Javier Vazquez fizzled while Brad Radke flourished, the Yankees sustained their first regular-season defeat by the Minnesota Twins since 2001. |
0.9542 | Slide of the Yankees: Pinstripes Punished George Steinbrenner watched from his box as his Yankees suffered the most one-sided loss in the franchise's long history. |
The results above do a much better job of finding results semantically similar to the query. Instead of just finding matches with yankees and lose, it finds matches where the yankees lose.
This combination is effective and powerful. It takes advantage of the high performance of Elasticsearch while adding a semantic search capability. We may already have a large Elasticsearch cluster with terabytes (or petabytes) of data and years of engineering investment that solves most use cases. In that situation, semantically re-ranking search results is a practical approach.
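The same over-fetch and re-rank pattern should carry over to any backend that returns text. The sketch below makes the two moving parts explicit as callables. The names `rerank`, `overlap` and `fake_search` are hypothetical, and the word-overlap scorer is only a toy stand-in so the example runs without a model; it is not txtai's Similarity.

```python
def rerank(query, limit, search, score, factor=10):
    """Over-fetch candidates from search, then keep the top limit by score."""
    candidates = search(query, limit * factor)
    ranked = sorted(((score(query, text), text) for text in candidates),
                    key=lambda pair: pair[0], reverse=True)
    return ranked[:limit]


def overlap(query, text):
    """Toy scorer: fraction of query words that appear in text."""
    words = query.lower().split()
    tokens = set(text.lower().split())
    return sum(1 for word in words if word in tokens) / len(words)


def fake_search(query, size):
    """Stand-in for a REST API or SQL query that returns candidate texts."""
    corpus = ["red sox lose again", "yankees lose big", "markets close higher"]
    return corpus[:size]


top = rerank("yankees lose", 2, fake_search, overlap)
```

Swapping `fake_search` for a real search call and `overlap` for a semantic scorer keeps the structure unchanged. Only the two callables differ.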
More examples
Now for some more examples comparing the results from Elasticsearch vs Elasticsearch + txtai.
for query in ["good news +economy", "bad news +economy"]:
    table("Elasticsearch", query, search(query, limit))
    table("Elasticsearch + txtai", query, ranksearch(query, limit))
[Elasticsearch] good news +economy
Score | Text |
---|---|
0.8756 | Surprise drop US wholesale prices is mixed news for economy (AFP) AFP - A surprise drop in US wholesale prices in August showed inflation apparently in check, but analysts said this was good and bad news for the US economy. |
0.7379 | China investment slows Good news for officials who are trying to cool an overheated economy; austerity measures to remain. BEIJING (Reuters) - China reported a marked slowdown in investment and money supply growth Monday, but stubbornly |
0.7145 | Spending Rebounds, Good News for Growth WASHINGTON (Reuters) - U.S. consumer spending rebounded sharply July, government data showed on Monday, erasing the disappointment of June and bolstering hopes that the U.S. economy has recovered from its recent soft spot. |
[Elasticsearch + txtai] good news +economy
Score | Text |
---|---|
0.9996 | Spending Rebounds, Good News for Growth WASHINGTON (Reuters) - U.S. consumer spending rebounded sharply in July, the government said on Monday, erasing the disappointment of June and bolstering hopes that the U.S. economy has recovered from its recent soft spot. |
0.9996 | Spending Rebounds, Good News for Growth WASHINGTON (Reuters) - U.S. consumer spending rebounded sharply July, government data showed on Monday, erasing the disappointment of June and bolstering hopes that the U.S. economy has recovered from its recent soft spot. |
0.9993 | Home building surges Housing construction in August jumped to its highest level in five months, a dose of encouraging news for the economy #39;s expansion. |
[Elasticsearch] bad news +economy
Score | Text |
---|---|
0.9228 | Surprise drop US wholesale prices is mixed news for economy (AFP) AFP - A surprise drop in US wholesale prices in August showed inflation apparently in check, but analysts said this was good and bad news for the US economy. |
0.6405 | Field Poll: Californians liking economy Bee Staff Writer. Californians are slowly growing more optimistic about the health of the economy, but a majority still feels the state is in bad economic times, according to a new Field Poll. |
0.6188 | ADB says China should raise rates to cool economy China should raise interest rates to cool the economy and prevent a future buildup of bad loans in the banking system, the Asian Development Bank #39;s (ADB) Bei-jing representative Bruce Murray said. |
[Elasticsearch + txtai] bad news +economy
Score | Text |
---|---|
0.9977 | Aging society hits Japan #39;s economy Japan #39;s economy will be the most severely affected among industrialized nations by population aging, Kyodo News said Thursday. |
0.9963 | Funds: Fund Mergers Can Hurt Investors (Reuters) Reuters - Mergers and acquisitions have\played an enormous role in the U.S. economy during the past\several decades, but sometimes the results have been bad for\consumers. Similarly, consolidation in the mutual fund\business has sometimes hurt fund investors. |
0.9958 | Signs of listless economy persist In a sign of persistent weakness in the US economy, a widely watched measure of business activity declined in August for the third consecutive month. |
Once again, while Elasticsearch usually returns quality results, it occasionally matches results that aren't semantically relevant. The power of semantic search is that it finds not only direct matches but also matches with the same meaning.