Yuval

Originally published at thecodingnotebook.com

AI for Developers: RNN encoder-decoder seq2seq translation

Training an RNN encoder-decoder to translate Spanish to English. This is a simple demonstration of a sequence-to-sequence model.

This post is a summary of this Google tutorial.

NOTE: This post is intended for developers, if you are an aspiring data scientist or AI researcher this post will not dig deep enough for you.

Overview

Sequence-to-sequence (Seq2Seq) models are used where the goal is to transform one sequence into another, as in language translation.

The key components are:

  1. Encoder - An RNN model that encodes the input sequence (in our case numeric tokens that represent Spanish words) into a vector (aka the Context Vector). The encoder captures the essence of the input sequence.
  2. Decoder - An RNN model that, given the encoded input, is responsible for generating the output sequence (in our case numeric tokens that represent English words).

Architecture

Encoder

The encoder has 2 layers:

  1. Embedding - takes the input (Spanish words as integers) and, for each word, creates a vector (size 256) that represents its meaning.
  2. GRU - An RNN layer with 1024 hidden units. The RNN maintains a state that captures information about the entire sequence.

The encoder's RNN state will be passed to the decoder as input.

Decoder

The decoder gets 2 inputs:

  1. The encoder's RNN state.
  2. An input text in English. This text differs between training and inference, as discussed later.

The decoder has 3 layers:

  1. Embedding - Like the encoder's embedding layer, the English input will be embedded into a vector (size 256).
  2. GRU - An RNN layer with 1024 hidden units, same as the encoder.
  3. Dense - A "probabilities" layer: it takes the decoder output and computes the probability of each word in the English vocabulary being the next word. Our data set has 9219 English words, so the output of this layer is a vector of size 9219. It is basically a softmax layer applied over the logits (out of the scope of this post).

(Figure: encoder-decoder architecture)
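
To make the shapes concrete, here is a minimal sketch of one batch flowing through both stacks. The numbers are illustrative only (the Spanish vocabulary size of 24000 is made up for this sketch; 9219 is the English vocabulary size mentioned above), and the real, trainable model is built in the "Build the model" section below.

import tensorflow as tf
from tensorflow.keras.layers import Embedding, GRU, Dense

BATCH, SRC_LEN, TGT_LEN = 64, 16, 11   # toy batch size and sentence lengths
SRC_VOCAB, TGT_VOCAB = 24000, 9219     # Spanish / English vocabulary sizes
EMB, UNITS = 256, 1024                 # embedding size and GRU hidden units

src_tokens = tf.zeros((BATCH, SRC_LEN), dtype=tf.int32)  # Spanish word ids
tgt_tokens = tf.zeros((BATCH, TGT_LEN), dtype=tf.int32)  # English word ids

# Encoder: embed the Spanish tokens, run the GRU, keep only its final state
# (the context vector)
src_emb = Embedding(SRC_VOCAB, EMB)(src_tokens)            # (64, 16, 256)
_, encoder_state = GRU(UNITS, return_state=True)(src_emb)  # state: (64, 1024)

# Decoder: embed the English tokens, run the GRU seeded with the encoder state,
# then project every time step onto the English vocabulary
tgt_emb = Embedding(TGT_VOCAB, EMB)(tgt_tokens)            # (64, 11, 256)
dec_out, _ = GRU(UNITS, return_sequences=True, return_state=True)(
    tgt_emb, initial_state=encoder_state
)                                                          # (64, 11, 1024)
probs = Dense(TGT_VOCAB, activation="softmax")(dec_out)    # (64, 11, 9219)
print(probs.shape)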

Training

Preparing the data

As mentioned above, the input to the decoder, aside from the encoder's output, is the English text. This text is different during training and inference.

During inference we will call the decoder in a loop: on each iteration the decoder generates one English word, which is fed back as input to the decoder on the next iteration. But what is the first word we start with?
For this we use a special <start> token. And when do we stop the loop? For that, the decoder has to learn to generate a special <end> token.

So when we prepare our data we will prepend the special <start> token and append the special <end> token to every sentence.
For example: I love machine learning becomes <start> I love machine learning <end> (the same is done for the Spanish sentences).
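
For instance, a bare-bones version of this wrapping step could look like the sketch below (the real preprocess_sentence in the code section also lowercases, strips accents and separates punctuation):

def add_start_end(sentence):
    # Only the wrapping part of the preprocessing; see preprocess_sentence
    # below for the full version.
    return "<start> " + sentence + " <end>"

print(add_start_end("i love machine learning"))
# <start> i love machine learning <end>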

Tokenization

Obviously the models cannot work with text, they work with numbers. To handle this we create a dictionary that holds an integer value for each word in our vocabulary. This process is called "tokenization". We do this for both languages (keeping 2 different dictionaries).
Note that the special <start> and <end> tokens will get tokenized as well.
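
As a quick illustration of what tokenization produces with the Keras Tokenizer used later in this post (fit here on two tiny sentences, so the exact indices will differ from the real run):

import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer(filters="")
tokenizer.fit_on_texts([
    "<start> i love machine learning <end>",
    "<start> i love winter <end>",
])

print(tokenizer.word_index)
# {'<start>': 1, 'i': 2, 'love': 3, '<end>': 4, 'machine': 5, 'learning': 6, 'winter': 7}
print(tokenizer.texts_to_sequences(["<start> i love winter <end>"]))
# [[1, 2, 3, 7, 4]]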

Training dataset

In order to train our encoder-decoder model we need 3 pieces of data:

  1. An input to the encoder: the tokenized Spanish sentence (with the special start/end tokens).
  2. An input to the decoder: the tokenized English sentence (with the special start/end tokens).
  3. The training objective (or "target", or "label"): what we want our decoder to learn to generate. For this we use the tokenized English sentence BUT WITHOUT the <start> token (remember, during inference we will provide the <start> token to signal the decoder to generate the first translated word). See the sketch right after this list.
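
Here is a minimal sketch of that shift (in the actual code below it is done with tf.roll inside to_dataset_item, and the padding value is 0):

decoder_input = ["<start>", "i", "love", "machine", "learning", "<end>"]

# The target is the same sentence shifted one step to the left and padded at
# the end, so at every time step the decoder learns to predict the NEXT word.
target = decoder_input[1:] + ["<pad>"]

for current, expected in zip(decoder_input, target):
    print(current, "->", expected)
# <start> -> i
# i -> love
# love -> machine
# machine -> learning
# learning -> <end>
# <end> -> <pad>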

Running translations (inference)

Translation works a little differently than training. While during training the decoder input was the entire translated sentence, during inference we don't have it; all we have is the input sentence in Spanish.
What we'll do is this:

  1. Run the encoder on the input sentence and get the encoder's RNN hidden state.
  2. We call the decoder and pass it the encoder's state from above, along with the special <start> token.
  3. The decoder generates the next (translated) word and a new hidden state; we call the decoder again using these two outputs.
  4. We continue to call the decoder N times, N being the length of the longest sentence we had during training. This may not be ideal as in the real world there may be longer sentences; checking for the <end> token may work better (with some safeguard to prevent an infinite loop).

The Code

Dependencies

!pip uninstall tensorflow -y
!pip uninstall tf-keras -y
!pip install tensorflow==2.15.1
!pip install keras

Imports

import os
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import GRU, Dense, Embedding, Input
from tensorflow.keras.models import Model, load_model

print(tf.__version__)

Load Data

# Download data
DATA_URL = (
    "http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip"
)
path_to_zip = tf.keras.utils.get_file(
    "spa-eng.zip", origin=DATA_URL, extract=True
)

path_to_file = os.path.join(os.path.dirname(path_to_zip), "spa-eng/spa.txt")
print("Translation data stored at:", path_to_file)

# Load into dataframe
data = pd.read_csv(
    path_to_file, sep="\t", header=None, names=["english", "spanish"]
)

# Load sentences into tensors
target_lang_sentences = data.pop('english')
input_lang_sentences = data.pop('spanish')

Helper Utils

import re
import unicodedata

def unicode_to_ascii(s):
    """Transforms an ascii string into unicode."""
    normalized = unicodedata.normalize("NFD", s)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")


def preprocess_sentence(w):
    """Lowers, strips, and adds <start> and <end> tags to a sentence."""
    w = unicode_to_ascii(w.lower().strip())

    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ."
    w = re.sub(r"([?.!,¿])", r" \1 ", w)

    w = re.sub(r'[" "]+', " ", w)

    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

    w = w.rstrip().strip()

    # adding a start and an end token to the sentence
    # so that the model knows when to start and stop predicting.
    w = "<start> " + w + " <end>"
    return w


def tokenize(lang, lang_tokenizer=None):
    """Given a list of sentences, return an integer representation

    Arguments:
    lang -- a python list of sentences
    lang_tokenizer -- keras_preprocessing.text.Tokenizer, if None
        this will be created for you

    Returns:
    tensor -- int tensor of shape (NUM_EXAMPLES,MAX_SENTENCE_LENGTH)
    lang_tokenizer -- keras_preprocessing.text.Tokenizer
    """
    if lang_tokenizer is None:
        lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters="")
        lang_tokenizer.fit_on_texts(lang)

    tensor = lang_tokenizer.texts_to_sequences(lang)

    tensor = tf.keras.preprocessing.sequence.pad_sequences(
        tensor, padding="post"
    )

    return tensor, lang_tokenizer


def preprocess(sentences, tokenizer):
    """Preprocesses then tokenizes text

    Arguments:
    sentences -- a python list of strings
    tokenizer -- Tokenizer for mapping words to integers

    Returns:
    tokens -- int tensor of shape (NUM_EXAMPLES, MAX_SENTENCE_LENGTH)
    """
    sentences = [preprocess_sentence(sentence) for sentence in sentences]
    tokens, _ = tokenize(sentences, tokenizer)
    return tokens


def int2word(tokenizer, int_sequence):
    """Converts integer representation to natural language representation

    Arguments:
    tokenizer -- keras_preprocessing.text.Tokenizer
    int_sequence -- an iterable or rank 1 tensor of integers

    Returns list of string tokens
    """
    return [tokenizer.index_word[t] if t != 0 else "" for t in int_sequence]

Prepare datasets

# Training on 120000 samples and 3 epochs took 15 minutes on T4
NUM_EXAMPLES = 120000

# Preprocess the sentences (add <start> <end> etc)
preprocessed_input_lang_sentences = input_lang_sentences.head(NUM_EXAMPLES).map(preprocess_sentence)
preprocessed_target_lang_sentences = target_lang_sentences.head(NUM_EXAMPLES).map(preprocess_sentence)

# Tokenize the preprocessed sentences, note it makes all sentences the same length (padded with zeros)
tokenized_input_lang_tensor, input_lang_tokenizer = tokenize(preprocessed_input_lang_sentences)
tokenized_target_lang_tensor, target_lang_tokenizer = tokenize(preprocessed_target_lang_sentences)

# Save the max sentence lengths
max_sentence_length_input_lang = tokenized_input_lang_tensor.shape[1]
max_sentence_length_target_lang = tokenized_target_lang_tensor.shape[1]

# Load into dataset
dataset = tf.data.Dataset.from_tensor_slices(
    (tokenized_input_lang_tensor, tokenized_target_lang_tensor)
)

# Split the data into train/validation
TEST_PROP = 0.2
validation_size = int(TEST_PROP * len(dataset))
train_size = len(dataset) - validation_size

# Shuffle once (no reshuffling between iterations) so the take/skip split
# below gives a stable, non-overlapping train/validation split
shuffled_dataset = dataset.shuffle(10000, reshuffle_each_iteration=False)
train_dataset = shuffled_dataset.take(train_size)
validation_dataset = shuffled_dataset.skip(train_size).take(validation_size)

# Convert the dataset items to what we need for training, the training inputs
# are the Spanish/English sentences that go to the encoder and decoder as input.
# The "target" (or "label") is the shifted English sentence (remove the <start>)
def to_dataset_item(input_lang_tensor, target_lang_tensor):
    encoder_input = input_lang_tensor
    decoder_input = target_lang_tensor

    # The train target should not have the first <start> token, shift it left
    target = tf.roll(decoder_input, -1, 0)

    # roll is cyclic, so the <start> token is now the last token in the tensor, replace it with 0
    zeros = tf.zeros([1], dtype=tf.int32)
    target = tf.concat([target[:-1], zeros], axis=-1)

    return ((encoder_input, decoder_input), target)

train_dataset = train_dataset.map(to_dataset_item)
validation_dataset = validation_dataset.map(to_dataset_item)

# Create training batches
BUFFER_SIZE = len(train_dataset)
BATCH_SIZE = 64

train_dataset = (
    train_dataset
    .shuffle(BUFFER_SIZE)
    .repeat()
    .batch(BATCH_SIZE, drop_remainder=True)
)

validation_dataset = validation_dataset.batch(
    BATCH_SIZE, drop_remainder=True
)

Build the model

EMBEDDING_DIM = 256
HIDDEN_UNITS = 1024

INPUT_VOCAB_SIZE = len(input_lang_tokenizer.word_index) + 1
TARGET_VOCAB_SIZE = len(target_lang_tokenizer.word_index) + 1

# Encoder
# Input layer
encoder_inputs = Input(shape=(None,), name="encoder_input")

# Embedding layer
encoder_inputs_embedded = Embedding(
    input_dim=INPUT_VOCAB_SIZE,
    output_dim=EMBEDDING_DIM,
    input_length=max_sentence_length_input_lang,
)(encoder_inputs)

# RNN
encoder_rnn = GRU(
    units=HIDDEN_UNITS,
    return_sequences=True,
    return_state=True,
    recurrent_initializer="glorot_uniform",
)

# Exec the RNN and get the encoder_state which will be the input to the decoder
encoder_outputs, encoder_state = encoder_rnn(encoder_inputs_embedded)

# Decoder
# Input layer
decoder_inputs = Input(shape=(None,), name="decoder_input")

# Embedding layer
decoder_inputs_embedded = Embedding(
    input_dim=TARGET_VOCAB_SIZE,
    output_dim=EMBEDDING_DIM,
    input_length=max_sentence_length_target_lang,
)(decoder_inputs)

# RNN
decoder_rnn = GRU(
    units=HIDDEN_UNITS,
    return_sequences=True,
    return_state=True,
    recurrent_initializer="glorot_uniform",
)

# Run the RNN, note the inputs are the decoder's embeddings and the encoder's state
decoder_outputs, decoder_state = decoder_rnn(
    decoder_inputs_embedded, initial_state=encoder_state
)

# Dense layer
decoder_dense = Dense(TARGET_VOCAB_SIZE, activation="softmax")

# Get the prediction probabilities (the Dense layer applies a softmax)
predictions = decoder_dense(decoder_outputs)

# The entire training model
train_model = Model(inputs=[encoder_inputs, decoder_inputs], outputs=predictions)
train_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
train_model.summary()
tf.keras.utils.plot_model(train_model, show_shapes=True)

Training

STEPS_PER_EPOCH = train_size // BATCH_SIZE
EPOCHS = 3

history = train_model.fit(
    train_dataset,
    steps_per_epoch=STEPS_PER_EPOCH,
    validation_data=validation_dataset,
    epochs=EPOCHS,
)

Inference

To generate text we will have to call the decoder in a loop to generate words, each time passing it the word generated in the previous step and its own RNN hidden state.
The decoder's initial state is the encoder's state for the Spanish sentence.

For that we will create a new encoder model that we will call once, and a decoder model that we will call in a loop. Note that we use the layers of the train_model so we are using trained weights.

Here we reuse the weights from the training we just did. To use this in production, the new encoder/decoder models have to be exported and saved; this saves them along with the trained weights so they can be loaded and used in production.
The two tokenizers should be exported as well, since tokenization in production must use the same tokens we had in training.

# reuse the encoder's training weights
# The output (encoder rnn state) will be decoder's initial state
encoder_model = Model(inputs=encoder_inputs, outputs=encoder_state)

# This will hold the encoder's output and be used as the decoder initial state
decoder_state_input = Input(
    shape=(HIDDEN_UNITS,), name="decoder_state_input"
)

# Run the train_model decoder_rnn (reuse the weights)
decoder_outputs, decoder_state = decoder_rnn(
    decoder_inputs_embedded, initial_state=decoder_state_input
)

# Reuses weights from the decoder_dense layer
predictions = decoder_dense(decoder_outputs)

decoder_model = Model(
    inputs=[decoder_inputs, decoder_state_input],
    outputs=[predictions, decoder_state],
)

tf.keras.utils.plot_model(decoder_model, show_shapes=True)

# In order to run translation in production the encoder_model and decoder_model
# should be exported and saved, along with the tokenizers.
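
For completeness, here is a hedged sketch of what that export could look like, using standard Keras model saving and the Tokenizer's JSON serialization (the file names are placeholders):

# Save the inference models together with their trained weights
encoder_model.save("encoder_model.keras")
decoder_model.save("decoder_model.keras")

# Save the tokenizers (the Keras Tokenizer can be serialized to JSON)
with open("input_lang_tokenizer.json", "w") as f:
    f.write(input_lang_tokenizer.to_json())
with open("target_lang_tokenizer.json", "w") as f:
    f.write(target_lang_tokenizer.to_json())

# Later, in production:
# encoder_model = tf.keras.models.load_model("encoder_model.keras")
# decoder_model = tf.keras.models.load_model("decoder_model.keras")
# with open("target_lang_tokenizer.json") as f:
#     target_lang_tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(f.read())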

Translation Process

def decode_sequences(input_seqs, output_tokenizer, max_decode_length=50):
    """
    Arguments:
    input_seqs: tokenized tensor of Spanish sentences, shape (BATCH_SIZE, SEQ_LEN)
    output_tokenizer: Tokenizer used to convert from int to words (English tokenizer)

    Returns translated sentences
    """
    # Encode the input as state vectors.
    states_value = encoder_model(input_seqs)

    # Populate the first token of the target sequence with the <start> token.
    batch_size = input_seqs.shape[0]
    start_token_id = output_tokenizer.word_index["<start>"]
    translated_seq = tf.fill([batch_size, 1], start_token_id)

    translated_sentences = [[] for _ in range(batch_size)]

    # Decode word-by-word (theoretically we could have stopped after getting <end>)
    for i in range(max_decode_length):
        output_tokens, decoder_state = decoder_model([translated_seq, states_value])

        # Sample the resulting token.
        # The model outputs a probability vector over the entire vocabulary for
        # each generated word; we take the word with the highest probability.
        # output_tokens shape: [BATCH_SIZE, number of generated words (1 in our case), vocabulary size]
        sampled_token_index = np.argmax(output_tokens[:, 0, :], axis=-1)

        # Convert the output token to word
        tokens = int2word(output_tokenizer, sampled_token_index)
        for j in range(batch_size):
            translated_sentences[j].append(tokens[j])

        # Use the generated token as the input for the next run.
        # sampled_token_index is a 1D array of the generated token per batch input,
        # convert it to a 2D tensor of shape [BATCH_SIZE, 1].
        translated_seq = tf.expand_dims(tf.constant(sampled_token_index), axis=-1)

        # Update states for next run
        states_value = decoder_state

    return translated_sentences

# Translate these sentences
sentences = [
    "No estamos comiendo.",
    "Está llegando el invierno.",
    "El invierno se acerca.",
    "Tom no comio nada.",
    "Su pierna mala le impidió ganar la carrera.",
    "Su respuesta es erronea.",
    "¿Qué tal si damos un paseo después del almuerzo?",
]

reference_translations = [
    "We're not eating.",
    "Winter is coming.",
    "Winter is coming.",
    "Tom ate nothing.",
    "His bad leg prevented him from winning the race.",
    "Your answer is wrong.",
    "How about going for a walk after lunch?",
]

machine_translations = decode_sequences(
    preprocess(sentences, input_lang_tokenizer), target_lang_tokenizer, max_sentence_length_target_lang
)

for i in range(len(sentences)):
    print("-")
    print("INPUT:")
    print(sentences[i])
    print("REFERENCE TRANSLATION:")
    print(reference_translations[i])
    print("MACHINE TRANSLATION:")
    print(machine_translations[i])

Output:

-
INPUT:
No estamos comiendo.
REFERENCE TRANSLATION:
We're not eating.
MACHINE TRANSLATION:
['we', 're', 'not', 'eating', '.', '<end>', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
-
INPUT:
Está llegando el invierno.
REFERENCE TRANSLATION:
Winter is coming.
MACHINE TRANSLATION:
['the', 'rain', 'is', 'cold', '.', '<end>', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
-
INPUT:
El invierno se acerca.
REFERENCE TRANSLATION:
Winter is coming.
MACHINE TRANSLATION:
['winter', 'is', 'approaching', '.', '<end>', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
-
INPUT:
Tom no comio nada.
REFERENCE TRANSLATION:
Tom ate nothing.
MACHINE TRANSLATION:
['tom', 'ate', 'nothing', '.', '<end>', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
-
INPUT:
Su pierna mala le impidió ganar la carrera.
REFERENCE TRANSLATION:
His bad leg prevented him from winning the race.
MACHINE TRANSLATION:
['his', 'hair', 'turned', 'down', 'to', 'the', 'bottom', 'of', 'the', 'snow', '.', '<end>', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
-
INPUT:
Su respuesta es erronea.
REFERENCE TRANSLATION:
Your answer is wrong.
MACHINE TRANSLATION:
['your', 'answer', 'is', 'incorrect', '.', '<end>', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
-
INPUT:
¿Qué tal si damos un paseo después del almuerzo?
REFERENCE TRANSLATION:
How about going for a walk after lunch?
MACHINE TRANSLATION:
['how', 'about', 'we', 'spend', 'a', 'little', 'day', '?', '<end>', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


While the results are funny, I am quite impressed that some translations are not that bad, considering we used only 120k sentences and 3 epochs.
For comparison, the paper "Attention Is All You Need" used the WMT 2014 English-French dataset, consisting of 36M sentences!!


