
David Mezzetti for NeuML

Posted on • Edited on • Originally published at neuml.hashnode.dev

Extractive QA with Elasticsearch

txtai is datastore agnostic; the library analyzes sets of text. The following example shows how extractive question-answering can be added on top of an Elasticsearch system.

Install dependencies

Install txtai and Elasticsearch.

# Install txtai and the Elasticsearch Python client (pinned to match the 7.10.1 server below)
pip install txtai elasticsearch==7.10.1

# Download and extract elasticsearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.10.1

Start an instance of Elasticsearch.

import os
from subprocess import Popen, PIPE, STDOUT

# Start and wait for server
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))

# Shell command: give the server time to finish starting
sleep 30
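
A fixed 30-second sleep can be too short on a slow machine and wasteful on a fast one. As a minimal sketch of an alternative, the helper below polls until the server answers, assuming Elasticsearch responds over HTTP at `http://localhost:9200` once it is ready (the same address used later in this article). `wait_for` and `es_ready` are hypothetical names for illustration, not part of txtai or Elasticsearch.

```python
import time
import urllib.error
import urllib.request


def wait_for(check, timeout=60.0, interval=1.0):
    """Call check() until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def es_ready(url="http://localhost:9200"):
    """Return True when the given URL answers with HTTP 200 (assumed readiness signal)."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the server is not up yet
        return False
```

Calling `wait_for(es_ready, timeout=120)` after the `Popen` call would then return as soon as the server responds, or `False` if it never does.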

Download data

This example works off a subset of the CORD-19 dataset. The COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family of viruses.

The following download is a SQLite database generated from a Kaggle notebook. More information on this data format can be found in the CORD-19 Analysis notebook.

wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz
gunzip tests.gz
mv tests articles.sqlite
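
Before loading the data, it can help to see what the SQLite file contains. The sketch below lists each table and its column names using only the standard library. `describe` is a hypothetical helper, and the demonstration runs on an in-memory database with made-up columns; the only schema knowledge assumed about the real file is that it has the `sections` and `articles` tables queried later in this article.

```python
import sqlite3


def describe(connection):
    """Return {table name: [column names]} for a SQLite connection."""
    schema = {}
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name
        columns = connection.execute('PRAGMA table_info("{}")'.format(table)).fetchall()
        schema[table] = [column[1] for column in columns]
    return schema


# Demonstration on an in-memory database with illustrative columns;
# pass sqlite3.connect("articles.sqlite") to inspect the downloaded file instead
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE articles (Id TEXT, Title TEXT)")
demo.execute("CREATE TABLE sections (Id INTEGER, Article TEXT, Text TEXT)")
print(describe(demo))
```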

Load data into Elasticsearch

The following block copies rows from SQLite to Elasticsearch.

import sqlite3

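# Third-party regex module: it supports the variable-length lookbehind used below, which the standard re module rejects
import regex as re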

from elasticsearch import Elasticsearch, helpers

# Connect to ES instance
es = Elasticsearch(hosts=["http://localhost:9200"], timeout=60, retry_on_timeout=True)

# Connection to database file
db = sqlite3.connect("articles.sqlite")
cur = db.cursor()

# Elasticsearch bulk buffer
buffer = []
rows = 0

# Select tagged sentences without an NLP label. NLP labels are set for non-informative sentences.
cur.execute("SELECT s.Id, Article, Title, Published, Reference, Name, Text FROM sections s JOIN articles a on s.article=a.id WHERE (s.labels is null or s.labels NOT IN ('FRAGMENT', 'QUESTION')) AND s.tags is not null")
for row in cur:
  # Build dict of name-value pairs for fields
  article = dict(zip(("id", "article", "title", "published", "reference", "name", "text"), row))
  name = article["name"]

  # Skip background, introduction and reference sections, plus discussion sections not preceded by "results"
  if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
    # Bulk action fields
    article["_id"] = article["id"]
    article["_index"] = "articles"

    # Buffer article
    buffer.append(article)

    # Increment number of articles processed
    rows += 1

    # Bulk load every 1000 records
    if rows % 1000 == 0:
      helpers.bulk(es, buffer)
      buffer = []

      print("Inserted {} articles".format(rows), end="\r")

if buffer:
  helpers.bulk(es, buffer)

print("Total articles inserted: {}".format(rows))
Total articles inserted: 21499
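
The section-name check in the loading code relies on a variable-length lookbehind, which is easy to misread. As a sketch, the same rule can be written with only the standard library, and the row-to-action mapping can be expressed as a generator. `keep_section` and `actions` are hypothetical helpers, and the sketch assumes section names are single-line strings.

```python
import re

# Section names containing any of these words are always skipped
SKIP = re.compile(r"background|introduction|reference")


def keep_section(name):
    """Mirror the article's filter: True when a section should be indexed."""
    if not name:
        return True

    name = name.lower()
    if SKIP.search(name):
        return False

    # "discussion" is skipped unless "results" appears earlier in the name
    position = name.find("discussion")
    if position >= 0 and "results" not in name[:position]:
        return False

    return True


def actions(rows, index="articles"):
    """Yield Elasticsearch bulk actions for rows that pass the filter."""
    fields = ("id", "article", "title", "published", "reference", "name", "text")
    for row in rows:
        article = dict(zip(fields, row))
        if keep_section(article["name"]):
            # Bulk action fields, as in the loading code above
            article["_id"] = article["id"]
            article["_index"] = index
            yield article
```

Since `helpers.bulk` accepts any iterable of actions and sends them in chunks, a call such as `helpers.bulk(es, actions(cur), chunk_size=1000)` should be able to stand in for the manual buffer, although that variant was not run against this dataset.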

Query data

The following runs a query against Elasticsearch for the terms "risk factors". It finds the top 5 matches and returns the corresponding documents associated with each match.

import pandas as pd

from IPython.display import display, HTML

pd.set_option("display.max_colwidth", None)

query = {
    "_source": ["article", "title", "published", "reference", "text"],
    "size": 5,
    "query": {
        "query_string": {"query": "risk factors"}
    }
}

results = []
for result in es.search(index="articles", body=query)["hits"]["hits"]:
  source = result["_source"]
  results.append((source["title"], source["published"], source["reference"], source["text"]))

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match"])

display(HTML(df.to_html(index=False)))
| Title | Published | Reference | Match |
| --- | --- | --- | --- |
| Management of osteoarthritis during COVID‐19 pandemic | 2020-05-21 00:00:00 | https://doi.org/10.1002/cpt.1910 | Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) . |
| Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection | 2020-04-24 00:00:00 | http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1 | This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors. |
| Does apolipoprotein E genotype predict COVID-19 severity? | 2020-04-27 00:00:00 | https://doi.org/10.1093/qjmed/hcaa142 | Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors . |
| COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants | 2020-07-23 00:00:00 | https://www.ncbi.nlm.nih.gov/pubmed/32705587/ | BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease. |
| COVID-19: what has been learned and to be learned about the novel coronavirus disease | 2020-03-15 00:00:00 | https://doi.org/10.7150/ijbs.45134 | • Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia. |
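
The search body is a plain dictionary, so it can be wrapped in a small function to try other terms or result sizes. The following is a sketch only: `build_query` is a hypothetical helper, and the `operator` argument maps to the `default_operator` option of `query_string`, which defaults to `OR` (so "risk factors" matches documents containing either term).

```python
import json


def build_query(terms, size=5, fields=("article", "title", "published", "reference", "text"), operator="OR"):
    """Build a search body like the one in the article for arbitrary terms."""
    return {
        "_source": list(fields),
        "size": size,
        "query": {
            "query_string": {"query": terms, "default_operator": operator}
        }
    }


# Print the body that would be passed to es.search(index="articles", body=...)
print(json.dumps(build_query("risk factors"), indent=2))
```

Passing `operator="AND"` would require both terms to be present, which narrows the matches.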

Derive columns with Extractive QA

The next section uses Extractive QA to derive additional columns. For each article, the full text is retrieved and a series of questions is asked of the document. The answers are added as derived columns, one per question, for each article.

from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

# Create extractor instance using qa model designed for the CORD-19 dataset
extractor = Extractor(embeddings, "NeuML/bert-small-cord19qa")
document = {
    "_source": ["id", "name", "text"],
    "size": 1000,
    "query": {
        "term": {"article": None}
    },
    "sort" : ["id"]
}

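import copy

def sections(article):
  rows = []

  # Deep copy so the nested query template in document is not mutated between calls
  search = copy.deepcopy(document)
  search["query"]["term"]["article"] = article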

  for result in es.search(index="articles", body=search)["hits"]["hits"]:
    source = result["_source"]
    name, text = source["name"], source["text"]

    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
      rows.append(text)

  return rows

results = []
for result in es.search(index="articles", body=query)["hits"]["hits"]:
  source = result["_source"]

  # Use QA extractor to derive additional columns
  answers = extractor([("Risk factors", "risk factor", "What are names of risk factors?", False),
                       ("Locations", "city country state", "What are names of locations?", False)], sections(source["article"]))

  results.append((source["title"], source["published"], source["reference"], source["text"]) + tuple([answer[1] for answer in answers]))

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match", "Risk Factors", "Locations"])

display(HTML(df.to_html(index=False)))
| Title | Published | Reference | Match | Risk Factors | Locations |
| --- | --- | --- | --- | --- | --- |
| Management of osteoarthritis during COVID‐19 pandemic | 2020-05-21 00:00:00 | https://doi.org/10.1002/cpt.1910 | Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) . | Comorbidities | extrapulmonary sites |
| Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection | 2020-04-24 00:00:00 | http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1 | This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors. | CVD, risk factors but no CVD, and neither CVD | None |
| Does apolipoprotein E genotype predict COVID-19 severity? | 2020-04-27 00:00:00 | https://doi.org/10.1093/qjmed/hcaa142 | Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors . | socioeconomic inequalities and risk factors | None |
| COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants | 2020-07-23 00:00:00 | https://www.ncbi.nlm.nih.gov/pubmed/32705587/ | BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease. | Frailty and multimorbidity | comorbidity groupings |
| COVID-19: what has been learned and to be learned about the novel coronavirus disease | 2020-03-15 00:00:00 | https://doi.org/10.7150/ijbs.45134 | • Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia. | age and underlying disease are strongly correlated | cities, provinces, and countries |
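
The code above takes `answer[1]` from each extractor result, which implies one `(name, answer)` pair per question. The sketch below shows that row-building step in isolation with stub data, so it runs without loading any model. `to_row` is a hypothetical helper, and the stub values are made up for illustration rather than taken from the dataset.

```python
def to_row(source, answers, fields=("title", "published", "reference", "text")):
    """Combine Elasticsearch _source fields with extractor answers into one tuple."""
    base = tuple(source[field] for field in fields)
    # Each answer is a (name, value) pair; keep only the value, in question order
    return base + tuple(value for _, value in answers)


# Stub data standing in for a search hit and for the extractor's output
source = {
    "title": "Example article",
    "published": "2020-01-01 00:00:00",
    "reference": "https://example.org/article",
    "text": "Example matching sentence.",
}
answers = [("Risk factors", "older age"), ("Locations", None)]

row = to_row(source, answers)
print(row)
```

A question with no answer yields `None` in its column, which matches the `None` entries in the Locations column above.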
