Introduction
Natural Language Processing (NLP) is expanding its reach into underserved languages. In this blog, we’ll explore how to fine-tune the T5-Small model to translate between English and Limbu, a Tibeto-Burman language spoken in Nepal and neighboring regions.
Preparing the Data
We created an English-Limbu translation dataset in JSON format, containing over 1,500 pairs. Below is a sample of the data:
[
  {
    "id": 1,
    "translation": {
      "en": "hi",
      "lim": "ᤜᤠᤤ ॥"
    }
  },
  {
    "id": 2,
    "translation": {
      "en": "Let's eat.",
      "lim": "ᤀᤠᤏᤡ᤹ ᤆᤠᤶ ॥"
    }
  },
  {
    "id": 3,
    "translation": {
      "en": "We saw it.",
      "lim": "ᤀᤏᤡᤃᤧ ᤁᤴ ᤏᤡᤔᤠᤏᤠ ॥"
    }
  },
  ...
]
The dataset was saved as limbu-english.json.
Setting Up the Environment
Install the required libraries in Google Colab:
!pip install transformers datasets evaluate sacrebleu
!pip install transformers[sentencepiece]
!pip install sentencepiece
Load the dataset:
from datasets import load_dataset
path = 'limbu-english.json'
translations = load_dataset('json', data_files=path)
translations = translations["train"].train_test_split(test_size=0.2)
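A quick, optional sanity check confirms the split sizes and the structure of each example:
# Inspect the train/test split and one raw example.
print(translations)
print(translations["train"][0])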
Loading the Pretrained Model
We initialized the T5-Small model and its tokenizer:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM
checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)
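Before moving on, it helps to see how the stock t5-small tokenizer handles Limbu script. Its SentencePiece vocabulary was never trained on Limbu, so most characters are likely to fall back to unknown pieces, which is what motivates training a dedicated tokenizer in the next step. A quick check (not part of the original pipeline) on a sentence from the dataset:
# How does the pretrained T5 tokenizer split a Limbu sentence?
# Expect mostly unknown pieces, since Limbu script is outside its vocabulary.
print(tokenizer.tokenize("ᤀᤠᤏᤡ᤹ ᤆᤠᤶ ॥"))
print(tokenizer("ᤀᤠᤏᤡ᤹ ᤆᤠᤶ ॥").input_ids)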
Tokenizing the Dataset
We trained a new tokenizer on the Limbu corpus and used it to tokenize the dataset:
def get_training_corpus():
    dataset = translations["train"]
    for start_idx in range(0, len(dataset), 1000):
        yield [item["lim"] for item in dataset[start_idx:start_idx + 1000]["translation"]]

lim_tokenizer = tokenizer.train_new_from_iterator(get_training_corpus(), 52000)

# The new tokenizer has a larger vocabulary than t5-small, so the model's
# embedding matrix must be resized to match it.
model.resize_token_embeddings(len(lim_tokenizer))
source_lang = "en"
target_lang = "lim"
prefix = "translate English to Limbu: "
def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    return lim_tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
tokenized_translations = translations.map(preprocess_function, batched=True)
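To verify the preprocessing, we can decode one tokenized example back to text (an optional sanity check; the column names come from the map call above):
# input_ids hold the prefixed English source; labels hold the Limbu target.
sample = tokenized_translations["train"][0]
print(lim_tokenizer.decode(sample["input_ids"]))
print(lim_tokenizer.decode(sample["labels"]))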
Preparing for Training
The tokenized data was prepared for the TensorFlow model:
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=lim_tokenizer, model=checkpoint, return_tensors="tf")

tf_train_set = model.prepare_tf_dataset(
    tokenized_translations["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_translations["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
Training the Model
We used AdamWeightDecay for optimization. No loss argument is passed to compile() because Hugging Face TensorFlow models compute their task loss internally:
from transformers import AdamWeightDecay
import tensorflow as tf
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model.compile(optimizer=optimizer)
Next, let's define the metrics to track during training:
import numpy as np
import evaluate
from transformers.keras_callbacks import KerasMetricCallback

# SacreBLEU is used as the evaluation metric.
metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = lim_tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 (ignored label positions) with the pad token before decoding.
    labels = np.where(labels != -100, labels, lim_tokenizer.pad_token_id)
    decoded_labels = lim_tokenizer.batch_decode(labels, skip_special_tokens=True)
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != lim_tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set, predict_with_generate=True)
These metrics also appear in the training logs, but we will push them, along with the model, to the Hugging Face Hub. First, log in:
from huggingface_hub import notebook_login
notebook_login()
Then add a callback that pushes checkpoints and logs to the Hub, and start training:
from transformers.keras_callbacks import PushToHubCallback
push_to_hub_callback = PushToHubCallback(output_dir="eng-limbu-t5-001", tokenizer=lim_tokenizer)
callbacks = [metric_callback, push_to_hub_callback, tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)]
history = model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=500, callbacks=callbacks)
Visualizing Training Progress
We visualized the training and validation loss over epochs:
import matplotlib.pyplot as plt
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
Testing the Model
We tested the fine-tuned model using the pipeline API:
from transformers import pipeline
translator = pipeline("text2text-generation", model="bedus-creation/eng-limbu-t5-001")
result = translator("translate English to Limbu: Hello")
print(result)
Evaluating with BLEU Score
Finally, we calculated a BLEU score for translation accuracy. The prediction should be the model's generated Limbu text, and the reference the gold Limbu translation from the dataset:
import evaluate

bleu = evaluate.load("bleu")

# The model's output for "Hi" is compared against the gold Limbu translation.
predictions = [translator("translate English to Limbu: Hi")[0]["generated_text"]]
references = [
    ["ᤜᤠᤤ ॥"],
]

results = bleu.compute(predictions=predictions, references=references)
print(results)
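A single sentence only gives a rough signal. For a corpus-level estimate, the same idea can be extended over the held-out test split; the sketch below assumes the translator pipeline, prefix, and dataset splits defined earlier, and scores with sacreBLEU:
import evaluate

# Corpus-level BLEU over the test split (sketch; sentences are translated
# one at a time, so this can be slow on a large split).
sacrebleu = evaluate.load("sacrebleu")
test_pairs = translations["test"]["translation"]
preds = [translator(prefix + pair["en"])[0]["generated_text"] for pair in test_pairs]
refs = [[pair["lim"]] for pair in test_pairs]
print(sacrebleu.compute(predictions=preds, references=refs))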
Conclusion
Fine-tuning the T5-Small model for Limbu demonstrates the potential of NLP in preserving and advancing underrepresented languages. With more training data and optimization, such models can become invaluable tools for language preservation and cross-cultural communication.