txtai has a robust training pipeline that can fine-tune large language models (LLMs) for downstream tasks such as labeling text. txtai also has the ability to train language models from scratch.
The vast majority of the time, fine-tuning an LLM yields the best results. But when making significant changes to the structure of a model, training from scratch is often required.
Examples of significant changes are:
- Changing the vocabulary size
- Changing the number of hidden dimensions
- Changing the number of attention heads or layers
This article will show how to build a new tokenizer and train a small language model (known as a micromodel) from scratch.
Install dependencies
Install txtai and all dependencies.
# Install txtai
pip install txtai[pipeline-train] datasets sentence-transformers onnxruntime onnx
Load dataset
This example will use the ag_news dataset, which is a collection of news article headlines with short descriptions.
from datasets import load_dataset
dataset = load_dataset("ag_news", split="train")
Train the tokenizer
The first step is to train the tokenizer. We could use an existing tokenizer, but in this case we want a smaller vocabulary.
from transformers import AutoTokenizer
def stream(batch=10000):
    # Yield the dataset text in batches
    for x in range(0, len(dataset), batch):
        yield dataset[x: x + batch]["text"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = tokenizer.train_new_from_iterator(stream(), vocab_size=500, length=len(dataset))
tokenizer.model_max_length = 512
tokenizer.save_pretrained("bert")
Let's test the tokenizer.
print(tokenizer.tokenize("Red Sox defeat Yankees 5-3"))
['re', '##d', 'so', '##x', 'de', '##f', '##e', '##at', 'y', '##ank', '##e', '##es', '5', '-', '3']
With a limited vocabulary size of 500, most words require multiple tokens. This limited vocabulary lowers the number of token representations the model needs to learn.
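To make the splitting above more concrete, here is a minimal sketch of the greedy longest-match strategy that WordPiece-style tokenizers use at inference time. The `wordpiece` function and the toy vocabulary are hypothetical and only for illustration; the real tokenizer also handles normalization, punctuation and special tokens.

```python
def wordpiece(word, vocab, unknown="[UNK]"):
    """Split one lowercase word into pieces using greedy longest-match."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unknown]  # no piece fits, so the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

# Hypothetical toy vocabulary, not the one trained above
vocab = {"re", "##d", "so", "##x", "y", "##ank", "##e", "##es"}

print(wordpiece("red", vocab))
print(wordpiece("yankees", vocab))
```

The smaller the vocabulary, the more often the longest match is only a few characters, which is why most words above need several tokens.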
Train the language model
Now it's time to train the model. We'll train a micromodel, which is an extremely small language model with a limited vocabulary. Micromodels, when paired with a limited vocabulary, have the potential to work in limited compute environments like edge devices and microcontrollers.
from transformers import AutoTokenizer, BertConfig, BertForMaskedLM
from txtai.pipeline import HFTrainer
config = BertConfig(
    vocab_size=500,
    hidden_size=50,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=100,
)
model = BertForMaskedLM(config)
model.save_pretrained("bert")
tokenizer = AutoTokenizer.from_pretrained("bert")
train = HFTrainer()
# Train model
train((model, tokenizer), dataset, task="language-modeling", output_dir="bert",
      fp16=True, per_device_train_batch_size=128, num_train_epochs=10,
      dataloader_num_workers=2)
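Since the model is a BertForMaskedLM, training here means hiding some tokens and asking the model to predict them. Below is a simplified sketch of that idea. The `mask_tokens` helper is hypothetical, the 15% rate is the value from the BERT paper, and real data collators typically also swap in random tokens or leave some selected tokens unchanged, which this sketch skips.

```python
import random

def mask_tokens(token_ids, mask_id, probability=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling over one sequence."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token_id in token_ids:
        if rng.random() < probability:
            inputs.append(mask_id)   # hide the token from the model
            labels.append(token_id)  # and ask the model to predict it
        else:
            inputs.append(token_id)
            labels.append(-100)      # conventional label for positions to ignore
    return inputs, labels

# Hypothetical token ids and mask id, for illustration only
inputs, labels = mask_tokens(list(range(10, 30)), mask_id=4)
print(inputs)
print(labels)
```

Only the masked positions contribute to the loss, so the model learns to fill in tokens from their surrounding context.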
Sentence embeddings
Next let's take the language model and fine-tune it to build sentence embeddings.
wget https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/examples/training/nli/training_nli_v2.py
python training_nli_v2.py bert
mv output/* bert-nli
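Sentence embedding models commonly turn the per-token vectors from the language model into a single sentence vector by averaging them. The sketch below assumes that mean pooling strategy; the training script may be configured differently, and the `mean_pool` helper and sample values are hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask value 0)."""
    totals = [0.0] * len(token_vectors[0])
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # Avoid dividing by zero for an all-padding input
    return [total / max(count, 1) for total in totals]

# Three token vectors, the last one is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))
```

The padding vector is ignored, so the result is the average of the first two token vectors only.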
Embeddings search
Now we'll build a txtai embeddings index using the fine-tuned model. We'll index the ag_news dataset.
from txtai.embeddings import Embeddings
# Get list of all text
texts = dataset["text"]
embeddings = Embeddings({"path": "bert-nli", "content": True})
embeddings.index((x, text, None) for x, text in enumerate(texts))
Let's run a search and see how much the model has learned.
embeddings.search("Boston Red Sox Cardinals World Series")
[{'id': '76733',
'text': 'Red Sox sweep Cardinals to win World Series The Boston Red Sox ended their 86-year championship drought with a 3-0 win over the St. Louis Cardinals in Game Four of the World Series.',
'score': 0.8008379936218262},
{'id': '71169',
'text': 'Red Sox lead 2-0 over Cardinals of World Series The host Boston Red Sox scored a 6-2 victory over the St. Louis Cardinals, helped by Curt Schilling #39;s pitching through pain and seeping blood, in World Series Game 2 on Sunday night.',
'score': 0.7896029353141785},
{'id': '70100',
'text': 'Sports: Red Sox 9 Cardinals 7 after 7 innings BOSTON Boston has scored twice in the seventh inning to take an 9-to-7 lead over the St. Louis Cardinals in the World Series opener at Fenway Park.',
'score': 0.7735188603401184}]
Not too bad. It's far from perfect, but we can tell that it has some knowledge! This model was trained for 5 minutes, so there is certainly room for improvement by training longer and/or with a larger dataset.
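For intuition on the scores above, here is a small sketch of ranking by cosine similarity, which is a common way to compare sentence embeddings. It assumes the scores are cosine similarities. The `cosine` and `search` helpers and the toy vectors are hypothetical and are not txtai's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query, vectors, limit=3):
    """Return the top (id, score) pairs, best match first."""
    scores = [(uid, cosine(query, vector)) for uid, vector in vectors.items()]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:limit]

# Hypothetical 2-dimensional embeddings keyed by id
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(search([1.0, 0.0], vectors))
```

A score close to 1 means the query and the text point in nearly the same direction in embedding space.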
The standard bert-base-uncased model has 110M parameters and is around 440MB. Let's see how many parameters this model has.
# Show number of parameters
parameters = sum(p.numel() for p in embeddings.model.model.parameters())
print(f"Number of parameters:\t\t{parameters:,}")
print(f"% of bert-base-uncased\t\t{(parameters / 110000000) * 100:.2f}%")
Number of parameters: 94,450
% of bert-base-uncased 0.09%
ls -lh bert-nli/pytorch_model.bin
-rw-r--r-- 1 root root 386K Jan 11 20:52 bert-nli/pytorch_model.bin
This model is 386KB and has roughly 0.1% of the parameters of bert-base-uncased. With proper vocabulary selection, a small language model has potential.
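The parameter count can also be worked out by hand from the configuration. The sketch below assumes a standard BERT encoder with a pooler and no task head, 512 positions and 2 token types (the BertConfig defaults). The `bert_parameters` helper is hypothetical.

```python
def bert_parameters(vocab, hidden, layers, intermediate, positions=512, types=2):
    """Count parameters in a BERT encoder with a pooler and no task head."""
    def linear(inputs, outputs):
        return inputs * outputs + outputs  # weights + bias

    norm = 2 * hidden  # LayerNorm weight + bias
    embeddings = (vocab + positions + types) * hidden + norm
    attention = 4 * linear(hidden, hidden) + norm  # query, key, value, output
    feedforward = linear(hidden, intermediate) + linear(intermediate, hidden) + norm
    pooler = linear(hidden, hidden)
    return embeddings + layers * (attention + feedforward) + pooler

micro = bert_parameters(vocab=500, hidden=50, layers=2, intermediate=100)
base = bert_parameters(vocab=30522, hidden=768, layers=12, intermediate=3072)

print(f"micromodel: {micro:,}")
print(f"bert-base:  {base:,}")
print(f"float32 size: {micro * 4 / 1024:.0f} KiB")
```

The micromodel total matches the 94,450 reported above. At 4 bytes per parameter that is about 369 KiB before any file overhead, which is in the same range as the 386K file. Note that the number of attention heads does not change the count, since the heads split the same hidden dimensions.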
Quantization
If 386KB isn't small enough, we can quantize the model to get it down even further.
from txtai.pipeline import HFOnnx
onnx = HFOnnx()
onnx("bert-nli", task="pooling", output="bert-nli.onnx", quantize=True)
embeddings = Embeddings({"path": "bert-nli.onnx", "tokenizer": "bert-nli", "content": True})
embeddings.index((x, text, None) for x, text in enumerate(texts))
embeddings.search("Boston Red Sox Cardinals World Series")
[{'id': '76733',
'text': 'Red Sox sweep Cardinals to win World Series The Boston Red Sox ended their 86-year championship drought with a 3-0 win over the St. Louis Cardinals in Game Four of the World Series.',
'score': 0.8008379936218262},
{'id': '71169',
'text': 'Red Sox lead 2-0 over Cardinals of World Series The host Boston Red Sox scored a 6-2 victory over the St. Louis Cardinals, helped by Curt Schilling #39;s pitching through pain and seeping blood, in World Series Game 2 on Sunday night.',
'score': 0.7896029353141785},
{'id': '70100',
'text': 'Sports: Red Sox 9 Cardinals 7 after 7 innings BOSTON Boston has scored twice in the seventh inning to take an 9-to-7 lead over the St. Louis Cardinals in the World Series opener at Fenway Park.',
'score': 0.7735188603401184}]
ls -lh bert-nli.onnx
-rw-r--r-- 1 root root 187K Jan 11 20:53 bert-nli.onnx
We're down to 187KB with a quantized model!
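As a rough illustration of why quantization shrinks a model, here is a sketch of mapping 32-bit floats onto 8-bit integers with a shared scale and offset. This is a generic affine quantization example with hypothetical helper names. It is not the exact procedure used by HFOnnx or ONNX Runtime.

```python
def quantize(values, bits=8):
    """Map floats onto unsigned integers in [0, 2**bits - 1]."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    # Fall back to 1.0 for a constant input, where hi == lo
    scale = (hi - lo) / levels or 1.0
    quantized = [round((v - lo) / scale) for v in values]
    return quantized, scale, lo

def dequantize(quantized, scale, lo):
    """Recover approximate floats from the stored integers."""
    return [q * scale + lo for q in quantized]

# Hypothetical weights, for illustration only
weights = [-0.5, -0.1, 0.0, 0.25, 0.5]
quantized, scale, lo = quantize(weights)

print(quantized)
print(dequantize(quantized, scale, lo))
```

Each stored value takes one byte instead of four, at the cost of a small rounding error of at most half the scale per value.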
Train on BERT dataset
The BERT paper has all the information regarding training parameters and datasets used. Hugging Face Datasets hosts the bookcorpus and wikipedia datasets.
Training on a dataset of this size is out of scope for this article, but example code showing how to build the BERT dataset is below.
from datasets import concatenate_datasets, load_dataset
bookcorpus = load_dataset("bookcorpus", split="train")
wiki = load_dataset("wikipedia", "20220301.en", split="train")
# Keep only the text column so both datasets share the same schema
wiki = wiki.remove_columns([col for col in wiki.column_names if col != "text"])
dataset = concatenate_datasets([bookcorpus, wiki])
Then the same steps to train the tokenizer and model can be run. The dataset is 25GB compressed, so it will take some space and time to process!
Wrapping up
This article covered how to build micromodels from scratch with txtai. Micromodels can be fully rebuilt in hours using the most up-to-date knowledge available. If properly constructed, prepared and trained, micromodels have the potential to be a viable choice for limited resource environments. They can also help when realtime response is more important than having the highest accuracy scores.
It's our hope that further research and exploration into micromodels leads to productive and useful models.