DEV Community

David Mezzetti for NeuML

Posted on • Edited on • Originally published at neuml.hashnode.dev

Query translation

txtai supports two main types of queries: natural language statements and SQL statements. Natural language queries handles a search engine like query. SQL statements enable more complex filtering, sorting and column selection. Query translation bridges the gap between the two and enables filtering for natural language queries.

For example, the query:

Tell me a feel good story since yesterday
Enter fullscreen mode Exit fullscreen mode

becomes

select * from txtai where similar("Tell me a feel good story") and
entry >= date('now', '-1 day')
Enter fullscreen mode Exit fullscreen mode

Install dependencies

Install txtai and all dependencies.

pip install txtai[pipeline]
Enter fullscreen mode Exit fullscreen mode

Create index

Let's first recap how to create an index. We'll use the classic txtai example.

from txtai.embeddings import Embeddings

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True})

# Create an index for the list of text
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

# Run a search
embeddings.search("feel good story", 1)
Enter fullscreen mode Exit fullscreen mode
[{'id': '4',
  'score': 0.08329011499881744,
  'text': 'Maine man wins $1M from $25 lottery ticket'}]
Enter fullscreen mode Exit fullscreen mode

Query translation models

Next we'll explore how query translation models work with examples.

from txtai.pipeline import Sequences

sequences = Sequences("NeuML/t5-small-txtsql")

queries = [
  "feel good story",
  "feel good story since yesterday",
  "feel good story with lottery in text",
  "how many feel good story",
  "feel good story translated to fr",
  "feel good story summarized"
]

# Prefix to pass to T5 model
prefix = "translate English to SQL: "

for query in queries:
  print(f"Input: {query}")
  print(f"SQL: {sequences(query, prefix)}")
  print()
Enter fullscreen mode Exit fullscreen mode
Input: feel good story
SQL: select id, text, score from txtai where similar('feel good story')

Input: feel good story since yesterday
SQL: select id, text, score from txtai where similar('feel good story') and entry >= date('now', '-1 day')

Input: feel good story with lottery in text
SQL: select id, text, score from txtai where similar('feel good story') and text like '% lottery%'

Input: how many feel good story
SQL: select count(*) from txtai where similar('feel good story')

Input: feel good story translated to fr
SQL: select id, translate(text, 'fr') text, score from txtai where similar('feel good story')

Input: feel good story summarized
SQL: select id, summary(text) text, score from txtai where similar('feel good story')
Enter fullscreen mode Exit fullscreen mode

Looking at the query translations above gives an idea on how this model works.

t5-small-txtsql is the default model. Custom domain query syntax languages can be created using this same methodology, including for other languages. Natural language can be translated to functions, query clauses, column selection and more!

Natural language filtering

Now it's time for this in action! Let's first initialize the embeddings index with the appropriate settings.

from txtai.pipeline import Translation

def translate(text, lang):
  return translation(text, lang)

translation = Translation()

# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2",
                         "content": True,
                         "query": {"path": "NeuML/t5-small-txtsql"},
                         "functions": [translate]})

# Create an index for the list of text
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

query = "select id, score, translate(text, 'de') 'text' from txtai where similar('feel good story')"

# Run a search using a custom SQL function
embeddings.search(query)[0]
Enter fullscreen mode Exit fullscreen mode
{'id': '4',
 'score': 0.08329011499881744,
 'text': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket'}
Enter fullscreen mode Exit fullscreen mode

Note how the query model was provided as a embeddings index configuration parameter. Custom SQL functions were also added in. Let's now see if the same SQL statement can be run with a natural language query.

embeddings.search("feel good story translated to de")[0]
Enter fullscreen mode Exit fullscreen mode
{'id': '4',
 'score': 0.08329011499881744,
 'text': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket'}
Enter fullscreen mode Exit fullscreen mode

Same result. Let's try a few more.

embeddings.search("feel good story since yesterday")[0]
Enter fullscreen mode Exit fullscreen mode
{'id': '4',
 'score': 0.08329011499881744,
 'text': 'Maine man wins $1M from $25 lottery ticket'}
Enter fullscreen mode Exit fullscreen mode
embeddings.search("feel good story with lottery in text")[0]
Enter fullscreen mode Exit fullscreen mode
{'id': '4',
 'score': 0.08329011499881744,
 'text': 'Maine man wins $1M from $25 lottery ticket'}
Enter fullscreen mode Exit fullscreen mode

For good measure, a couple queries with filters that return no results.

embeddings.search("feel good story with missing in text")
Enter fullscreen mode Exit fullscreen mode
[]
Enter fullscreen mode Exit fullscreen mode
embeddings.search("feel good story with field equal 14")
Enter fullscreen mode Exit fullscreen mode
[]
Enter fullscreen mode Exit fullscreen mode

Query translation with applications

Of course this is all available with YAML-configured applications.

config = """
translation:

writable: true
embeddings:
  path: sentence-transformers/nli-mpnet-base-v2
  content: true
  query:
    path: NeuML/t5-small-txtsql
  functions:
    - {name: translate, argcount: 2, function: translation}
"""

from txtai.app import Application

# Build application and index data
app = Application(config)
app.add([{"id": x, "text": row} for x, row in enumerate(data)])
app.index()

# Run search query
app.search("feel good story translated to de")[0]
Enter fullscreen mode Exit fullscreen mode
{'id': '4',
 'score': 0.08329011499881744,
 'text': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket'}
Enter fullscreen mode Exit fullscreen mode

Wrapping up

This article introduced natural language filtering with query translation models. This powerful feature adds filtering and pipelines to natural language statements. Custom domain-specific query languages can be created to enable rich queries natively in txtai.

Top comments (0)