DEV Community

David Mezzetti for NeuML

Posted on • Edited on • Originally published at neuml.hashnode.dev

Extractive QA to build structured data

Traditional ETL/data parsing systems establish rules to extract information of interest. Regular expressions, string parsing and similar methods define fixed rules. This works in many cases but what if you are working with unstructured data containing numerous variations? The rules can be cumbersome and hard to maintain over time.

This notebook uses machine learning and extractive question-answering (QA) to utilize the vast knowledge built into large language models. These models have been trained on extremely large datasets, learning the many variations of natural language.

Install dependencies

Install txtai and all dependencies.

pip install txtai[pipeline-train]
Enter fullscreen mode Exit fullscreen mode

Train a QA model with few-shot learning

The code below trains a new QA model using a few examples. These examples gives the model hints on the type of questions that will be asked and the type of answers to look for. It doesn't take a lot of examples to do this as shown below.

import pandas as pd
from txtai.pipeline import HFTrainer, Questions, Labels

# Training data for few-shot learning
data = [
    {"question": "What is the url?",
     "context": "Faiss (https://github.com/facebookresearch/faiss) is a library for efficient similarity search.",
     "answers": "https://github.com/facebookresearch/faiss"},
    {"question": "What is the url", "context": "The last release was Wed Sept 25 2021", "answers": None},
    {"question": "What is the date?", "context": "The last release was Wed Sept 25 2021", "answers": "Wed Sept 25 2021"},
    {"question": "What is the date?", "context": "The order total comes to $44.33", "answers": None},
    {"question": "What is the amount?", "context": "The order total comes to $44.33", "answers": "$44.33"},
    {"question": "What is the amount?", "context": "The last release was Wed Sept 25 2021", "answers": None},
]

# Fine-tune QA model
trainer = HFTrainer()
model, tokenizer = trainer("distilbert-base-cased-distilled-squad", data, task="question-answering")
Enter fullscreen mode Exit fullscreen mode

Parse data into a structured table

The next section takes a series of rows of text and runs a set of questions against each row. The answers are then used to build a pandas DataFrame.

# Input data
context = ["Released on 6/03/2021",
           "Release delayed until the 11th of August",
           "Documentation can be found here: neuml.github.io/txtai",
           "The stock price fell to three dollars",
           "Great day: closing price for March 23rd is $33.11, for details - https://finance.google.com"]

# Define column queries
queries = ["What is the url?", "What is the date?", "What is the amount?"]

# Extract fields
questions = Questions(path=(model, tokenizer), gpu=True)
results = [questions([question] * len(context), context) for question in queries]
results.append(context)

# Load into DataFrame
pd.DataFrame(list(zip(*results)), columns=["URL", "Date", "Amount", "Text"])
Enter fullscreen mode Exit fullscreen mode
URL Date Amount Text
0 None 6/03/2021 None Released on 6/03/2021
1 None 11th of August None Release delayed until the 11th of August
2 neuml.github.io/txtai None None Documentation can be found here: neuml.github....
3 None None three dollars The stock price fell to three dollars
4 https://finance.google.com March 23rd $33.11 Great day: closing price for March 23rd is $33...

Add additional columns

This method can be combined with other models to categorize, group or otherwise derive additional columns. The code below derives an additional sentiment column.

# Add sentiment
labels = Labels(path="distilbert-base-uncased-finetuned-sst-2-english", dynamic=False)
labels = ["POSITIVE" if x[0][0] == 1 else "NEGATIVE" for x in labels(context)]
results.insert(len(results) - 1, labels)

# Load into DataFrame
pd.DataFrame(list(zip(*results)), columns=["URL", "Date", "Amount", "Sentiment", "Text"])
Enter fullscreen mode Exit fullscreen mode
URL Date Amount Sentiment Text
0 None 6/03/2021 None POSITIVE Released on 6/03/2021
1 None 11th of August None NEGATIVE Release delayed until the 11th of August
2 neuml.github.io/txtai None None NEGATIVE Documentation can be found here: neuml.github....
3 None None three dollars NEGATIVE The stock price fell to three dollars
4 https://finance.google.com March 23rd $33.11 POSITIVE Great day: closing price for March 23rd is $33...

Top comments (0)