Semantic search is a new category of search built on recent advances in Natural Language Processing (NLP). Traditional search systems use keywords to find data. Semantic search has an understanding of natural language and identifies results that have the same meaning, not necessarily the same keywords.
While semantic search adds amazing capabilities, sparse keyword indexes can still add value. There may be cases where finding an exact match is important or we just want a fast index to quickly do an initial scan of a dataset.
Both methods have their merits. What if we combine them to build a unified hybrid search capability? Can we get the best of both worlds?
This article will explore the benefits of hybrid search.
Install dependencies
Install txtai and all dependencies.
```bash
# Install txtai
pip install txtai pytrec_eval rank-bm25 elasticsearch
```
Introducing semantic, keyword and hybrid search
Before diving into the benchmarks, let's briefly discuss how semantic and keyword search work.
Semantic search uses large language models to vectorize inputs into arrays of numbers. Similar concepts will have similar values. The vectors are typically stored in a vector database, which is a system that specializes in storing these numerical arrays and finding matches. Vector search transforms an input query into a vector and then runs a search to find the best conceptual results.
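To make that concrete, here's a minimal sketch of a dense vector search with txtai. The model path and sample documents are just placeholders, not what the benchmarks below use.

```python
from txtai.embeddings import Embeddings

# Placeholder documents to index
data = [
    "US tops 5 million confirmed virus cases",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "Maine man wins $1M from $25 lottery ticket"
]

# Build a dense vector index; the model path is an example sentence-transformers model
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

# Matches on meaning, no keyword overlap with the documents is required
print(embeddings.search("public health story", 1))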
Keyword search tokenizes text into lists of tokens per document. These tokens are aggregated into token frequencies per document and stored in sparse term frequency arrays. At search time, the query is tokenized and its tokens are compared to the tokens in the dataset. This is a more literal process. Keyword search is like string matching: it has no conceptual understanding and matches on characters and bytes.
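For comparison, here's a small sketch of keyword scoring using the rank-bm25 package installed above. The documents, query and whitespace tokenizer are placeholders.

```python
from rank_bm25 import BM25Okapi

documents = [
    "US tops 5 million confirmed virus cases",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "Maine man wins $1M from $25 lottery ticket"
]

# Simple whitespace tokenization (a stand-in for a real tokenizer)
tokenized = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized)

# Only documents sharing tokens with the query get a non-zero score
print(bm25.get_scores("lottery ticket".lower().split()))
```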
Hybrid search combines the scores from semantic and keyword indexes. Given that semantic search scores are typically 0 - 1 and keyword search scores are unbounded, a method is needed to combine the results.
The two methods supported in txtai are:
- Convex Combination when sparse scores are normalized
- Reciprocal Rank Fusion (RRF) when sparse scores aren't normalized
The default method in txtai is convex combination and we'll use that.
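As an illustration of the two approaches, a fusion step could look like the sketch below. These are the generic formulas, not txtai's internal implementation; the document ids and scores are made up.

```python
def convex(dense, sparse, weight=0.5):
    """Convex combination: weighted sum of scores, assuming sparse scores are normalized to 0-1."""
    ids = set(dense) | set(sparse)
    return {uid: weight * dense.get(uid, 0.0) + (1 - weight) * sparse.get(uid, 0.0) for uid in ids}

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: uses ranks only, so raw scores don't need to be normalized."""
    scores = {}
    for ranking in rankings:
        for rank, uid in enumerate(ranking, start=1):
            scores[uid] = scores.get(uid, 0.0) + 1.0 / (k + rank)
    return scores

# Fuse dense and sparse results for a handful of documents
dense = {"doc1": 0.82, "doc2": 0.61, "doc3": 0.40}
sparse = {"doc2": 0.95, "doc3": 0.50, "doc4": 0.10}

print(convex(dense, sparse, weight=0.5))
print(rrf([["doc1", "doc2", "doc3"], ["doc2", "doc3", "doc4"]]))
```

In txtai itself, hybrid scoring is enabled through the `hybrid` option on the embeddings index configuration.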
Evaluating performance
Now it's time to benchmark the results. For these tests, we'll use the BEIR dataset. We'll also use the benchmarks script from the txtai project, which has methods to work with the BEIR dataset.
We'll select a subset of the BEIR sources for brevity. For each source, we'll benchmark a bm25 index, an embeddings index and a hybrid or combined index.
```python
import os

# Get benchmarks script
os.system("wget https://raw.githubusercontent.com/neuml/txtai/master/examples/benchmarks.py")

# Create output directory
os.makedirs("beir", exist_ok=True)

# Download subset of BEIR datasets
datasets = ["nfcorpus", "fiqa", "arguana", "scidocs", "scifact"]
for dataset in datasets:
    url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
    os.system(f"wget {url}")
    os.system(f"mv {dataset}.zip beir")
    os.system(f"unzip -d beir beir/{dataset}.zip")

# Remove existing benchmark data
if os.path.exists("benchmarks.json"):
    os.remove("benchmarks.json")
```
Now let's run the benchmarks.
```python
# Remove existing benchmark data
if os.path.exists("benchmarks.json"):
    os.remove("benchmarks.json")

# Runs benchmark evaluation
def evaluate(method):
    for dataset in datasets:
        command = f"python benchmarks.py beir {dataset} {method}"
        print(command)
        os.system(command)

# Calculate benchmarks
for method in ["bm25", "embed", "hybrid"]:
    evaluate(method)
```
```python
import json
import pandas as pd

def benchmarks():
    # Read JSON lines data
    with open("benchmarks.json") as f:
        data = f.read()

    df = pd.read_json(data, lines=True).sort_values(by=["source", "ndcg_cut_10"], ascending=[True, False])
    return df[["source", "method", "ndcg_cut_10", "map_cut_10", "recall_10", "P_10", "index", "search", "memory"]].reset_index(drop=True)

# Load benchmarks dataframe
df = benchmarks()
df[df.source == "nfcorpus"].reset_index(drop=True)
```
source | method | ndcg_cut_10 | map_cut_10 | recall_10 | P_10 | index | search | memory |
---|---|---|---|---|---|---|---|---|
nfcorpus | hybrid | 0.34531 | 0.13369 | 0.17437 | 0.25480 | 29.46 | 3.57 | 2900 |
nfcorpus | embed | 0.30917 | 0.10810 | 0.15327 | 0.23591 | 33.64 | 3.33 | 2876 |
nfcorpus | bm25 | 0.30639 | 0.11728 | 0.14891 | 0.21734 | 2.72 | 0.96 | 652 |
```python
df[df.source == "fiqa"].reset_index(drop=True)
```
source | method | ndcg_cut_10 | map_cut_10 | recall_10 | P_10 | index | search | memory |
---|---|---|---|---|---|---|---|---|
fiqa | hybrid | 0.36642 | 0.28846 | 0.43799 | 0.10340 | 233.90 | 68.42 | 3073 |
fiqa | embed | 0.36071 | 0.28450 | 0.43188 | 0.10216 | 212.30 | 58.83 | 2924 |
fiqa | bm25 | 0.23559 | 0.17619 | 0.29855 | 0.06559 | 19.78 | 12.84 | 76 |
```python
df[df.source == "arguana"].reset_index(drop=True)
```
source | method | ndcg_cut_10 | map_cut_10 | recall_10 | P_10 | index | search | memory |
---|---|---|---|---|---|---|---|---|
arguana | hybrid | 0.48467 | 0.40101 | 0.75320 | 0.07532 | 37.80 | 21.22 | 2924 |
arguana | embed | 0.47781 | 0.38781 | 0.76671 | 0.07667 | 34.11 | 10.21 | 2910 |
arguana | bm25 | 0.45713 | 0.37118 | 0.73471 | 0.07347 | 3.39 | 10.95 | 663 |
```python
df[df.source == "scidocs"].reset_index(drop=True)
```
source | method | ndcg_cut_10 | map_cut_10 | recall_10 | P_10 | index | search | memory |
---|---|---|---|---|---|---|---|---|
scidocs | embed | 0.21718 | 0.12982 | 0.23217 | 0.1146 | 127.63 | 4.41 | 2929 |
scidocs | hybrid | 0.21104 | 0.12450 | 0.22938 | 0.1134 | 138.00 | 6.43 | 2999 |
scidocs | bm25 | 0.15063 | 0.08756 | 0.15637 | 0.0772 | 13.07 | 1.42 | 722 |
```python
df[df.source == "scifact"].reset_index(drop=True)
```
source | method | ndcg_cut_10 | map_cut_10 | recall_10 | P_10 | index | search | memory |
---|---|---|---|---|---|---|---|---|
scifact | hybrid | 0.71305 | 0.66773 | 0.83722 | 0.09367 | 39.51 | 2.35 | 2918 |
scifact | bm25 | 0.66324 | 0.61764 | 0.78761 | 0.08700 | 4.40 | 0.93 | 658 |
scifact | embed | 0.65149 | 0.60193 | 0.78972 | 0.08867 | 35.15 | 1.48 | 2889 |
The sections above show the metrics per source and method.
The table headers list the source (dataset), index method, the NDCG@10 / MAP@10 / RECALL@10 / P@10 accuracy metrics, index time in seconds, search time in seconds and memory usage in MB. The tables are sorted by NDCG@10 descending.
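For reference, these metrics are computed with pytrec_eval, which was installed above. A rough sketch of the evaluation call, with hypothetical relevance judgments and search results, looks like this:

```python
import pytrec_eval

# Hypothetical relevance judgments: query id -> {document id: relevance grade}
qrels = {"q1": {"d1": 1, "d2": 0}}

# Hypothetical search results: query id -> {document id: retrieval score}
run = {"q1": {"d1": 2.3, "d2": 1.1, "d3": 0.4}}

# Request the same measure families reported in the tables above
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut", "map_cut", "recall", "P"})
results = evaluator.evaluate(run)

print(results["q1"]["ndcg_cut_10"], results["q1"]["map_cut_10"],
      results["q1"]["recall_10"], results["q1"]["P_10"])
```

The benchmarks script aggregates these per-query values into the per-source numbers shown above.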
Looking at the results, we can see that hybrid search often performs better than embeddings or bm25 individually. In some cases, as with scidocs, the combination performs worse, but in the aggregate the scores are better. This holds true for the entire BEIR dataset. For some sources bm25 does best and for others embeddings does, but overall the combined hybrid scores are the best.
Hybrid search isn't free though; it is slower since there is extra logic to combine the results from the two indexes. For individual queries, the added latency is often negligible.
Wrapping up
This article covered ways to improve search accuracy using a hybrid approach. We evaluated performance over a subset of the BEIR dataset to show how hybrid search, in many situations, can improve overall accuracy.
Custom datasets can also be evaluated using this method as specified in this link. This article and the associated benchmarks script can be reused to evaluate what method works best on your data.
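As a rough sketch of what that might involve, a custom dataset would follow the same BEIR-style layout as the sources downloaded earlier. Treat the file names, fields and final command below as assumptions based on the standard BEIR format; the linked documentation is authoritative.

```python
import json
import os

# Hypothetical custom dataset following the standard BEIR layout:
# beir/custom/corpus.jsonl, beir/custom/queries.jsonl, beir/custom/qrels/test.tsv
os.makedirs("beir/custom/qrels", exist_ok=True)

corpus = [{"_id": "d1", "title": "Example", "text": "Hybrid search combines sparse and dense retrieval"}]
queries = [{"_id": "q1", "text": "what is hybrid search"}]

with open("beir/custom/corpus.jsonl", "w") as f:
    f.write("\n".join(json.dumps(row) for row in corpus))

with open("beir/custom/queries.jsonl", "w") as f:
    f.write("\n".join(json.dumps(row) for row in queries))

with open("beir/custom/qrels/test.tsv", "w") as f:
    f.write("query-id\tcorpus-id\tscore\nq1\td1\t1\n")

# With that layout in place, the same command pattern used earlier should apply
os.system("python benchmarks.py beir custom hybrid")
```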