Prepare data to train NLP Transformer

Preparing Data for Effective Transformer Training: From Raw Text Selection and Download to Crafting Custom Tokenizers and Establishing Efficient Data Pipelines

In the previous three tutorials, we got familiar with the Transformer architecture and implemented it step by step in TensorFlow. At this point, we have a complete Transformer model that we could train with the standard TensorFlow "fit" function. There is one catch, though: we first need to prepare our data so that it is optimized for the training process.

For this task, I chose a dataset from OPUS (a collection of translated texts from the web).

I chose a dataset with 1 000 000 sentences in English and Spanish. You can download this data from HERE. It is a large dataset, and you should keep in mind that training a Transformer on this data may take several days. So, for testing purposes, you should use a smaller dataset, for example, 100 000 sentences.

To avoid downloading it manually, I wrote a script that downloads the data and saves it to the Datasets folder:

import os
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup

# URL to the directory containing the files to be downloaded
language = "en-es"
url = f"{language}/"
save_directory = f"./Datasets/{language}"

# Create the save directory if it doesn't exist
os.makedirs(save_directory, exist_ok=True)

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML response
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the anchor tags in the HTML
links = soup.find_all('a')

# Extract the href attribute from each anchor tag
file_links = [link['href'] for link in links if '.' in link['href']]

# Download each file
for file_link in tqdm(file_links):
    file_url = url + file_link
    save_path = os.path.join(save_directory, file_link)
    print(f"Downloading {file_url}")
    # Send a GET request for the file
    file_response = requests.get(file_url)
    if file_response.status_code == 404:
        print(f"Could not download {file_url}")
        continue  # skip saving when the file is missing
    # Save the file to the specified directory
    with open(save_path, 'wb') as file:
        file.write(file_response.content)
    print(f"Saved {file_link}")

print("All files have been downloaded.")

If you want to download a different dataset, change the language variable to the language pair you want to translate between. Make sure that this pair exists in the OPUS dataset list.
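One caveat with the '.' in link['href'] filter used in the script: a directory listing may also contain a parent-directory link such as "../", which contains a dot and would be treated as a file. A slightly stricter filter could look like the sketch below. The helper name select_file_links and the sample hrefs are hypothetical and only serve as an illustration.

```python
def select_file_links(hrefs):
    """Keep hrefs that look like file names, skipping directory-style links."""
    return [href for href in hrefs if "." in href and not href.endswith("/")]


# Hypothetical hrefs as they might appear in a directory listing
sample_hrefs = ["../", "opus.en-es-train.en", "opus.en-es-dev.en", "subdir/"]
file_links = select_file_links(sample_hrefs)
print(file_links)
```

In the download script, this would replace the list comprehension that builds file_links.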

When we have downloaded our dataset, we need to read it into memory; we do it with the following code:

en_training_data_path = "Datasets/en-es/opus.en-es-train.en"
en_validation_data_path = "Datasets/en-es/opus.en-es-dev.en"
es_training_data_path = "Datasets/en-es/"
es_validation_data_path = "Datasets/en-es/"

def read_files(path):
    with open(path, "r", encoding="utf-8") as f:
        en_train_dataset = f.read().split("\n")[:-1]
    return en_train_dataset

en_training_data = read_files(en_training_data_path)
en_validation_data = read_files(en_validation_data_path)
es_training_data = read_files(es_training_data_path)
es_validation_data = read_files(es_validation_data_path)

max_length = 500
train_dataset = [[es_sentence, en_sentence] for es_sentence, en_sentence in zip(es_training_data, en_training_data) if len(es_sentence) <= max_length and len(en_sentence) <= max_length]
val_dataset = [[es_sentence, en_sentence] for es_sentence, en_sentence in zip(es_validation_data, en_validation_data) if len(es_sentence) <= max_length and len(en_sentence) <= max_length]
es_training_data, en_training_data = zip(*train_dataset)
es_validation_data, en_validation_data = zip(*val_dataset)

# print the first three sentence pairs as a sanity check
print(es_training_data[:3])
print(en_training_data[:3])

The provided code performs the following steps:

  1. File Paths: Four file paths are defined, representing the locations of different data files. The files are named according to their language pairs, where "en" denotes English and "es" denotes Spanish. These files are used for training and validation datasets;
  2. read_files Function: This function reads the contents of a file given its path. It uses the open function with "r" mode (read) and the "utf-8" encoding to handle text data. It then reads the file and splits it into lines using the split method with the newline character ("\n") as the delimiter. When the file ends with a newline, the last element of the resulting list is an empty string, and [:-1] removes it. The function returns the list of lines as the content of the file;
  3. Reading Data Files: The read_files function is used to read the contents of the four data files for English training, English validation, Spanish training, and Spanish validation, respectively. The data from each file is stored in separate variables: en_training_data, en_validation_data, es_training_data, and es_validation_data;
  4. Filtering Dataset: The code sets a maximum sentence length (max_length). It then creates two new datasets, train_dataset and val_dataset, by zipping the Spanish and English sentences from the training and validation datasets. However, only those pairs of sentences are included in the new datasets where both the Spanish and English sentences have lengths less than or equal to the specified max_length;
  5. Unzipping Datasets: After filtering the datasets, the code uses the zip function in combination with the * operator to "unzip" the train_dataset and val_dataset into separate lists for Spanish and English sentences. This results in es_training_data, en_training_data, es_validation_data, and en_validation_data containing the filtered Spanish and English sentences for training and validation, respectively.
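Step 2 can be checked on an in-memory string instead of a real file. The sketch below shows what [:-1] removes, and why it silently drops the last sentence if a file does not end with a newline. The helper split_lines is a hypothetical alternative, not part of the original script.

```python
def split_lines(text):
    """Split on newlines and drop the final element only when it is empty."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()
    return lines


with_newline = "hola\nadiós\n"
without_newline = "hola\nadiós"

print(with_newline.split("\n"))            # ['hola', 'adiós', '']
print(with_newline.split("\n")[:-1])       # ['hola', 'adiós']
print(without_newline.split("\n")[:-1])    # ['hola'], the last sentence is lost
print(split_lines(without_newline))        # ['hola', 'adiós']
```

For parallel files this matters because a single dropped line in one language shifts nothing, but makes the two lists differ in length, and zip then silently ignores the extra sentence.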

The overall purpose of this code is to read text data from files, filter out sentences that exceed a specified maximum length, and then organize the filtered data into separate lists for training and validation. This filtered and organized data is intended to be used as input for training a language model or another natural language processing task.
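The filtering and unzipping steps are easier to follow on a toy example. The sketch below uses three made-up sentence pairs and a tiny limit of 10 characters, so the effect is visible. Note that len counts characters here, not words or tokens.

```python
es_sentences = ["hola", "una frase bastante larga", "adiós"]
en_sentences = ["hello", "a fairly long sentence", "bye"]
max_length = 10  # tiny limit, chosen only for this illustration

# keep a pair only when BOTH sides are short enough
pairs = [
    [es, en]
    for es, en in zip(es_sentences, en_sentences)
    if len(es) <= max_length and len(en) <= max_length
]

# "unzip" the list of pairs back into two aligned tuples
es_filtered, en_filtered = zip(*pairs)
print(es_filtered)  # ('hola', 'adiós')
print(en_filtered)  # ('hello', 'bye')
```

Because the filter is applied to pairs, the two resulting tuples stay aligned. If every pair is filtered out, zip(*pairs) has nothing to unpack and raises a ValueError, so the limit should not be set too low.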

When we run the above code, we should see the following output:

('Fueron los asbestos aquí. ¡Eso es lo que ocurrió!', 'Me voy de aquí.', 'Una vez, juro que cagué una barra de tiza.')
("It was the asbestos in here, that's what did it!", "I'm out of here.", 'One time, I swear I pooped out a stick of chalk.')

After filtering, we have 995249 sentence pairs in the training dataset and 1990 in the validation dataset. The output above shows the first three Spanish sentences, which I can't read myself, together with their English translations.
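For the smaller test dataset suggested earlier, one option is to sample the same positions from both languages. This is a minimal sketch: the function name subsample_pairs and the fixed seed are my assumptions, not part of the original code.

```python
import random


def subsample_pairs(source, target, n, seed=0):
    """Pick the same n random positions from two aligned sequences."""
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    indices = sorted(rng.sample(range(len(source)), min(n, len(source))))
    return [source[i] for i in indices], [target[i] for i in indices]


es_small, en_small = subsample_pairs(["a", "b", "c", "d"], ["A", "B", "C", "D"], n=2)
print(es_small, en_small)
```

With the real data, the call would take es_training_data and en_training_data with n=100000.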

Setting up the Tokenizer

To handle sentences, I created a custom Tokenizer. This Tokenizer is similar to the Tokenizer from the tensorflow.keras.preprocessing.text module. The difference is that when I'm ready to use the trained Transformer model, I won't need to install the huge TensorFlow library just to use the Tokenizer class.

Here is the code for the CustomTokenizer object:

import os
import json
import typing
from tqdm import tqdm

class CustomTokenizer:
    """ Custom Tokenizer class to tokenize and detokenize text data into sequences of integers

    Args:
        split (str, optional): Split token to use when tokenizing text. Defaults to " ".
        char_level (bool, optional): Whether to tokenize at character level. Defaults to False.
        lower (bool, optional): Whether to convert text to lowercase. Defaults to True.
        start_token (str, optional): Start token to use when tokenizing text. Defaults to "<start>".
        end_token (str, optional): End token to use when tokenizing text. Defaults to "<eos>".
        filters (list, optional): List of characters to filter out. Defaults to 
            ['!', "'", '"', '#', '$', '%', '&', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', 
            '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '\t', '\n'].
        filter_nums (bool, optional): Whether to filter out numbers. Defaults to True.
        start (int, optional): Index to start tokenizing from. Defaults to 1.
    """
    def __init__(
            self,
            split: str=" ", 
            char_level: bool=False,
            lower: bool=True, 
            start_token: str="<start>", 
            end_token: str="<eos>",
            filters: list = ['!', "'", '"', '#', '$', '%', '&', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '\t', '\n'],
            filter_nums: bool = True,
            start: int=1,
        ) -> None:
        self.split = split
        self.char_level = char_level
        self.lower = lower
        self.index_word = {}
        self.word_index = {}
        self.max_length = 0
        self.start_token = start_token
        self.end_token = end_token
        self.filters = filters
        self.filter_nums = filter_nums
        self.start = start

    @property
    def start_token_index(self):
        return self.word_index[self.start_token]

    @property
    def end_token_index(self):
        return self.word_index[self.end_token]

    def sort(self):
        """ Sorts the word_index and index_word dictionaries"""
        self.index_word = dict(enumerate(dict(sorted(self.word_index.items())), start=self.start))
        self.word_index = {v: k for k, v in self.index_word.items()}

    def split_line(self, line: str):
        """ Splits a line of text into tokens

        Args:
            line (str): Line of text to split

        Returns:
            list: List of string tokens
        """
        line = line.lower() if self.lower else line

        if self.char_level:
            return [char for char in line]

        # split line with split token and check for filters
        line_tokens = line.split(self.split)

        new_tokens = []
        for index, token in enumerate(line_tokens):
            filtered_tokens = ['']
            for c_index, char in enumerate(token):
                if char in self.filters or (self.filter_nums and char.isdigit()):
                    filtered_tokens += [char, ''] if c_index != len(token) -1 else [char]
                else:
                    filtered_tokens[-1] += char

            new_tokens += filtered_tokens
            if index != len(line_tokens) -1:
                new_tokens += [self.split]

        new_tokens = [token for token in new_tokens if token != '']

        return new_tokens

    def fit_on_texts(self, lines: typing.List[str]):
        """ Fits the tokenizer on a list of lines of text
        This function will update the word_index and index_word dictionaries and set the max_length attribute

        Args:
            lines (typing.List[str]): List of lines of text to fit the tokenizer on
        """
        self.word_index = {key: value for value, key in enumerate([self.start_token, self.end_token, self.split] + self.filters)}

        for line in tqdm(lines, desc="Fitting tokenizer"):
            line_tokens = self.split_line(line)
            self.max_length = max(self.max_length, len(line_tokens) +2) # +2 for start and end tokens

            for token in line_tokens:
                if token not in self.word_index:
                    self.word_index[token] = len(self.word_index)

        # rebuild index_word (and re-number word_index) from the collected tokens
        self.sort()

    def update(self, lines: typing.List[str]):
        """ Updates the tokenizer with new lines of text
        This function will update the word_index and index_word dictionaries and set the max_length attribute

        Args:
            lines (typing.List[str]): List of lines of text to update the tokenizer with
        """
        new_tokens = 0
        for line in tqdm(lines, desc="Updating tokenizer"):
            line_tokens = self.split_line(line)
            self.max_length = max(self.max_length, len(line_tokens) +2) # +2 for start and end tokens
            for token in line_tokens:
                if token not in self.word_index:
                    self.word_index[token] = len(self.word_index)
                    new_tokens += 1

        # keep index_word in sync; note that sorting re-numbers existing tokens
        self.sort()
        print(f"Added {new_tokens} new tokens")

    def detokenize(self, sequences: typing.List[int], remove_start_end: bool=True):
        """ Converts a list of sequences of tokens back into text

        Args:
            sequences (typing.list[int]): List of sequences of tokens to convert back into text
            remove_start_end (bool, optional): Whether to remove the start and end tokens. Defaults to True.

        Returns:
            typing.List[str]: List of strings of the converted sequences
        """
        lines = []
        for sequence in sequences:
            line = ""
            for token in sequence:
                if token == 0:
                    # index 0 is padding, nothing meaningful follows it
                    break

                if remove_start_end and (token == self.start_token_index or token == self.end_token_index):
                    continue

                line += self.index_word[token]

            lines.append(line)

        return lines

    def texts_to_sequences(self, lines: typing.List[str], include_start_end: bool=True):
        """ Converts a list of lines of text into a list of sequences of tokens

        Args:
            lines (typing.list[str]): List of lines of text to convert into tokenized sequences
            include_start_end (bool, optional): Whether to include the start and end tokens. Defaults to True.

        Returns:
            typing.List[typing.List[int]]: List of sequences of tokens
        """
        sequences = []
        for line in lines:
            line_tokens = self.split_line(line)
            sequence = [self.word_index[word] for word in line_tokens if word in self.word_index]
            if include_start_end:
                sequence = [self.word_index[self.start_token]] + sequence + [self.word_index[self.end_token]]

            sequences.append(sequence)

        return sequences

    def save(self, path: str, type: str="json"):
        """ Saves the tokenizer to a file

        Args:
            path (str): Path to save the tokenizer to
            type (str, optional): Type of file to save the tokenizer to. Defaults to "json".
        """
        serialised_dict = self.dict()
        if type == "json":
            if os.path.dirname(path):
                os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as f:
                json.dump(serialised_dict, f)

    def dict(self):
        """ Returns a dictionary of the tokenizer

        Returns:
            dict: Dictionary of the tokenizer
        """
        return {
            "split": self.split,
            "lower": self.lower,
            "char_level": self.char_level,
            "index_word": self.index_word,
            "max_length": self.max_length,
            "start_token": self.start_token,
            "end_token": self.end_token,
            "filters": self.filters,
            "filter_nums": self.filter_nums,
            "start": self.start
        }

    @staticmethod
    def load(path: typing.Union[str, dict], type: str="json"):
        """ Loads a tokenizer from a file

        Args:
            path (typing.Union[str, dict]): Path to load the tokenizer from or a dictionary of the tokenizer
            type (str, optional): Type of file to load the tokenizer from. Defaults to "json".

        Returns:
            CustomTokenizer: Loaded tokenizer
        """
        if isinstance(path, str):
            if type == "json":
                with open(path, "r") as f:
                    load_dict = json.load(f)

        elif isinstance(path, dict):
            load_dict = path

        tokenizer = CustomTokenizer()
        tokenizer.split = load_dict["split"]
        tokenizer.lower = load_dict["lower"]
        tokenizer.char_level = load_dict["char_level"]
        tokenizer.index_word = {int(k): v for k, v in load_dict["index_word"].items()}
        tokenizer.max_length = load_dict["max_length"]
        tokenizer.start_token = load_dict["start_token"]
        tokenizer.end_token = load_dict["end_token"]
        tokenizer.filters = load_dict["filters"]
        tokenizer.filter_nums = bool(load_dict["filter_nums"])
        tokenizer.start = load_dict["start"]
        tokenizer.word_index = {v: int(k) for k, v in tokenizer.index_word.items()}

        return tokenizer

    def lenght(self):
        return len(self.index_word)

    def __len__(self):
        return len(self.index_word)

I'll include this object in the MLTU package that you can install from PyPI, so you won't need to copy and paste it. We'll come to this later.

It is a custom implementation of a text tokenizer, which takes raw text data and converts it into sequences of integers (tokens). The class provides several methods to perform tokenization and detokenization.

The tokenizer can be initialized with various parameters, such as the split token (used to split the input text), whether tokenization should be done at the character or word level, whether to convert the text to lowercase, and more. It also allows you to specify start and end tokens, which are added to the tokenized sequences. Additionally, you can filter out specific characters and numbers during tokenization.

The class contains methods like split_line, which splits a line of text into tokens, fit_on_texts, which fits the tokenizer on a list of lines of text and updates its internal dictionaries; and detokenize, which converts sequences of tokens back into text. Other methods include texts_to_sequences for converting lines of text into sequences of tokens and save/load for saving and loading the tokenizer to/from files.
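To make the word-level behaviour of split_line more concrete, here is a simplified standalone sketch of the same idea. It is not the class method itself: the function name split_words and the short filter list are assumptions for illustration, and digit handling is left out. As far as I can tell from the listing, characters in the filter list are not dropped but emitted as separate tokens, and the split token is kept as well, so the text can be rebuilt.

```python
def split_words(line, split=" ", filters=(",", ".", "?", "!"), lower=True):
    """Split a line into words, separators and filtered characters."""
    line = line.lower() if lower else line
    tokens = []
    words = line.split(split)
    for index, word in enumerate(words):
        current = ""
        for char in word:
            if char in filters:
                # a filtered character ends the current word and becomes its own token
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(char)
            else:
                current += char
        if current:
            tokens.append(current)
        if index != len(words) - 1:
            tokens.append(split)  # keep the separator so the text can be rebuilt
    return tokens


tokens = split_words("Hello world, how are you?")
print(tokens)
```

Joining the tokens with an empty string gives back the lowercased sentence, which is what makes detokenization a simple concatenation.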

Overall, this CustomTokenizer class provides a flexible and customizable way to preprocess text data for machine-learning models requiring integer sequences as input. It enables you to tokenize and detokenize text while handling various preprocessing options according to the requirements of your specific application.
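The core idea of a character-level tokenizer fits in a few lines. The sketch below is a stripped-down stand-in for the class above, with hypothetical function names (build_char_vocab, encode, decode). It only shows the mapping between characters and integers, with index 0 left free for padding.

```python
def build_char_vocab(lines, start_token="<start>", end_token="<eos>"):
    """Map every character (plus the two special tokens) to an integer >= 1."""
    symbols = {start_token, end_token}
    for line in lines:
        symbols.update(line.lower())
    # index 0 is left unused so it can serve as padding
    return {symbol: index for index, symbol in enumerate(sorted(symbols), start=1)}


def encode(line, vocab, start_token="<start>", end_token="<eos>"):
    """Convert text to ids, skipping unknown characters, wrapped in start/end ids."""
    ids = [vocab[char] for char in line.lower() if char in vocab]
    return [vocab[start_token]] + ids + [vocab[end_token]]


def decode(ids, vocab):
    """Convert ids back to text, ignoring padding zeros."""
    index_to_symbol = {index: symbol for symbol, index in vocab.items()}
    return "".join(index_to_symbol[i] for i in ids if i != 0)


vocab = build_char_vocab(["hola", "hello"])
ids = encode("hola", vocab)
print(ids)
print(decode(ids, vocab))
```

Characters that were never seen during fitting are silently skipped when encoding, which mirrors the "if word in self.word_index" check in texts_to_sequences.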

So, we have two languages and need to create two tokenizers, one for each language. Let's do it:

# prepare Spanish tokenizer, this is the input language
tokenizer = CustomTokenizer(char_level=True)
tokenizer.fit_on_texts(es_training_data)

# prepare English tokenizer, this is the output language
detokenizer = CustomTokenizer(char_level=True)
detokenizer.fit_on_texts(en_training_data)

The code demonstrates the preparation of tokenizers for two languages, Spanish and English. These tokenizers play a crucial role in natural language processing tasks, where they are responsible for converting raw text data into sequences of tokens, which are essential for training machine learning models.

The first section of the code focuses on preparing the Spanish tokenizer, which will be used as the input language tokenizer. The tokenizer is initialized using the CustomTokenizer class, configured to tokenize text at the character level, meaning each character in the input text will be treated as a separate token. The fit_on_texts method is then applied to the Spanish training data (es_training_data), which fits the tokenizer on a list of Spanish sentences, updating its internal dictionaries and settings. By doing this, the tokenizer learns the mapping between characters and their corresponding integer representations. Finally, the tokenizer is saved to a file named "tokenizer.json" in the specified path.

The second part of the code focuses on preparing the English tokenizer, which will be used as the output language tokenizer. It is identical to the first part, except that we use it for English sentences.

When we run the above code, we should see similar output:

Fitting tokenizer: 100%|██████████| 995249/995249 [00:10<00:00, 95719.57it/s] 
Fitting tokenizer: 100%|██████████| 995249/995249 [00:07<00:00, 134446.71it/s]

In the above output, the tokenizer fitting progress is displayed.

These tokenizers can later be used to convert text data into sequences of characters, enabling further natural language processing tasks such as sequence-to-sequence translation, text generation, or any other task requiring tokenized input and output. The saved tokenizer files can be loaded in subsequent stages of the NLP pipeline to maintain consistency and facilitate inference and evaluation of new data.
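One detail of saving and loading is worth a short illustration: JSON object keys are always strings, so the integer keys of index_word come back as strings after a round trip. That is why the load method converts them with int(k). The sketch below shows this with a two-entry dictionary and an in-memory string instead of a file.

```python
import json

index_word = {1: "a", 2: "b"}

# serialize and parse again, as save() and load() would do through a file
restored = json.loads(json.dumps({"index_word": index_word}))
print(restored["index_word"])  # {'1': 'a', '2': 'b'}

# convert the keys back to integers and rebuild the reverse mapping
index_word_restored = {int(k): v for k, v in restored["index_word"].items()}
word_index_restored = {v: k for k, v in index_word_restored.items()}
print(index_word_restored == index_word)  # True
```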

Let's try to use the detokenizer from above to convert the sentence into tokens and convert it back to a sentence:

tokenized_sentence = detokenizer.texts_to_sequences(["Hello world, how are you?"])[0]
print(tokenized_sentence)

detokenized_sentence = detokenizer.detokenize([tokenized_sentence], remove_start_end=False)
print(detokenized_sentence)

detokenized_sentence = detokenizer.detokenize([tokenized_sentence])
print(detokenized_sentence)

By running the above code, it should give us the following output:

[33, 51, 48, 55, 55, 58, 3, 66, 58, 61, 55, 47, 15, 3, 51, 58, 66, 3, 44, 61, 48, 3, 68, 58, 64, 36, 32]
['<start>hello world, how are you?<eos>']
['hello world, how are you?']

So, we tokenized our "Hello world, how are you?" sentence into tokens and then detokenized it. I also demonstrated the difference the remove_start_end argument makes when we toggle it.
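The same logic can be traced by hand on a tiny example. The sketch below uses a hypothetical seven-entry vocabulary (not the one fitted above) and a padded sequence, to show how detokenization stops at the padding index 0 and optionally skips the start and end tokens.

```python
# Hypothetical character-level vocabulary; index 0 is reserved for padding
index_word = {1: " ", 2: "<eos>", 3: "<start>", 4: "a", 5: "h", 6: "l", 7: "o"}
START_INDEX, END_INDEX = 3, 2


def detokenize_one(sequence, remove_start_end=True):
    """Turn one padded sequence of token ids back into a string."""
    text = ""
    for token in sequence:
        if token == 0:
            break  # padding reached, the sentence is over
        if remove_start_end and token in (START_INDEX, END_INDEX):
            continue  # skip the special tokens
        text += index_word[token]
    return text


padded = [3, 5, 7, 6, 4, 2, 0, 0]  # <start> h o l a <eos> pad pad
print(detokenize_one(padded, remove_start_end=False))  # <start>hola<eos>
print(detokenize_one(padded))                          # hola
```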

Set up a data pipeline

When we have our tokenizers, we can create a data pipeline. The pipeline will be responsible for reading data from files, tokenizing it, and batching it. Let's import the DataProvider class from the mltu package:

from mltu.tensorflow.dataProvider import DataProvider

We will have two data providers, one for training data and one for validation data. And while iterating them, we should receive prepared data for our model. Let's create them:

from mltu.tensorflow.dataProvider import DataProvider
import numpy as np

def preprocess_inputs(data_batch, label_batch):
    encoder_input = np.zeros((len(data_batch), tokenizer.max_length)).astype(np.int64)
    decoder_input = np.zeros((len(label_batch), detokenizer.max_length)).astype(np.int64)
    decoder_output = np.zeros((len(label_batch), detokenizer.max_length)).astype(np.int64)

    data_batch_tokens = tokenizer.texts_to_sequences(data_batch)
    label_batch_tokens = detokenizer.texts_to_sequences(label_batch)

    for index, (data, label) in enumerate(zip(data_batch_tokens, label_batch_tokens)):
        encoder_input[index][:len(data)] = data
        decoder_input[index][:len(label)-1] = label[:-1] # Drop the [END] tokens
        decoder_output[index][:len(label)-1] = label[1:] # Drop the [START] tokens

    return (encoder_input, decoder_input), decoder_output

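# The arguments below are reconstructed from the description that follows:
# batch_size=4 matches the four-sentence sample batch shown later, and
# batch_postprocessors is assumed to be the DataProvider parameter that
# applies preprocess_inputs to every batch.
train_dataProvider = DataProvider(
    train_dataset,
    batch_size=4,
    batch_postprocessors=[preprocess_inputs],
    use_cache=True,
)

val_dataProvider = DataProvider(
    val_dataset,
    batch_size=4,
    batch_postprocessors=[preprocess_inputs],
    use_cache=True,
)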

We created a Python function named preprocess_inputs and two instances of a DataProvider class, train_dataProvider and val_dataProvider. The function preprocess_inputs serves as a preprocessing step for the input data and label batches to be used in a machine learning model, specifically in sequence-to-sequence tasks.

The preprocess_inputs function takes two arguments, data_batch and label_batch, representing batches of input data and corresponding label data, respectively. Within the function, three arrays, encoder_input, decoder_input, and decoder_output, are initialized as zero-filled arrays to store the processed data.

The function first tokenizes the input and labels data batches using the previously prepared tokenizers, tokenizer and detokenizer. It converts the text data into sequences of integers, which are required for training the model.

Next, it iterates through the data and label batches, and for each data-label pair, it populates the encoder_input array with the integer sequences of the input data. Similarly, it fills the decoder_input array with the integer sequences of the label data but with the last token removed (representing the [END] token). The decoder_output array is populated with the integer sequences of the label data but with the [START] token removed. These arrays are essential for training the sequence-to-sequence model, as they form the input and target sequences during training.
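The shift between decoder_input and decoder_output is the key point here, so a small worked example may help. This sketch uses plain Python lists instead of NumPy arrays and hypothetical token ids (1 for the start token, 2 for the end token, 0 for padding).

```python
def pad(sequence, length, pad_value=0):
    """Right-pad a list of token ids with pad_value up to the given length."""
    return sequence + [pad_value] * (length - len(sequence))


START, END = 1, 2            # hypothetical ids of the special tokens
label = [START, 7, 8, 9, END]
max_length = 6

decoder_input = pad(label[:-1], max_length)   # drop the end token
decoder_output = pad(label[1:], max_length)   # drop the start token

print(decoder_input)   # [1, 7, 8, 9, 0, 0]
print(decoder_output)  # [7, 8, 9, 2, 0, 0]
```

At every position the target is simply the next token of the input: after seeing 1 the model should predict 7, after 1, 7 it should predict 8, and after 1, 7, 8, 9 it should predict the end token 2.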

The DataProvider object handles data batching during model training. It takes the training and validation datasets (train_dataset and val_dataset) and processes the data batches using the preprocess_inputs function. The batch_size parameter determines the number of data samples in each batch. Additionally, the use_cache parameter is set to true, which means the DataProvider will cache the preprocessed batches for efficient data loading during training.

In summary, the above code uses a DataProvider class to handle data batching and caching during training, making it easier to feed data into the model in batches, a common practice in machine learning to enhance training efficiency.
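To show what batching with a post-processing step means in practice, here is a minimal pure-Python sketch. It is not how the DataProvider from mltu is implemented; the generator iterate_batches and the toy post-processor are hypothetical and only illustrate the idea of slicing the dataset and handing each slice to a function.

```python
def iterate_batches(dataset, batch_size, postprocess):
    """Yield post-processed batches from a list of [data, label] pairs."""
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        data_batch = [pair[0] for pair in batch]
        label_batch = [pair[1] for pair in batch]
        yield postprocess(data_batch, label_batch)


def uppercase_labels(data_batch, label_batch):
    """Toy post-processor standing in for preprocess_inputs."""
    return data_batch, [label.upper() for label in label_batch]


toy_dataset = [["hola", "hello"], ["adiós", "bye"], ["sí", "yes"]]
batches = list(iterate_batches(toy_dataset, 2, uppercase_labels))
print(batches)
```

The last batch is smaller than batch_size when the dataset size is not a multiple of it, which is also why preprocess_inputs sizes its arrays with len(data_batch) instead of a fixed number.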

Let's check what would be a single output of our dataProvider:

for data_batch in train_dataProvider:
    (encoder_inputs, decoder_inputs), decoder_outputs = data_batch

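    encoder_inputs_str = tokenizer.detokenize(encoder_inputs)
    decoder_inputs_str = detokenizer.detokenize(decoder_inputs, remove_start_end=False)
    decoder_outputs_str = detokenizer.detokenize(decoder_outputs, remove_start_end=False)
    print(encoder_inputs_str)
    print(decoder_inputs_str)
    print(decoder_outputs_str)

    break  # one batch is enough for this check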

In the terminal output, we should see similar output:

['fueron los asbestos aquí. ¡eso es lo que ocurrió!', 'me voy de aquí.', 'una vez, juro que cagué una barra de tiza.', 'y prefiero mudarme, ¿entiendes?']
["<start>it was the asbestos in here, that's what did it!", "<start>i'm out of here.", '<start>one time, i swear i pooped out a stick of chalk.', '<start>and i will move, do you understand me?']
["it was the asbestos in here, that's what did it!<eos>", "i'm out of here.<eos>", 'one time, i swear i pooped out a stick of chalk.<eos>', 'and i will move, do you understand me?<eos>']

This is an example of the input our Transformer model receives during training. To make it readable, we detokenized the tokens back into text.


In summary, we explored the critical steps required to transform raw text data into a format suitable for training a Transformer model, focusing on sequence-to-sequence tasks (language translation).

In this tutorial we:

🎯 Introduced the data preparation requirements for Transformers;

🔧 Built a Flexible CustomTokenizer object;

📚 Created Language-Specific Tokenizers for Spanish and English languages;

🔗 Established a Data Pipeline for training utilizing the DataProvider object.

I demonstrated how to prepare data for Transformer training. Having created custom tokenizers, organized a data pipeline, and streamlined the preprocessing steps, you should now have the tools and knowledge to set your NLP projects on a path to success.