In the previous three tutorials, we got familiar with the Transformer architecture and implemented it step by step in TensorFlow. So, at this point, we have a complete Transformer model that we could train with the standard TensorFlow "fit" function. But there is one "but": we first need to prepare our data so that it is optimized for the training process.
For this task, I chose a dataset from the OPUS collection (translated texts gathered from the web).
The dataset I picked contains 1 000 000 sentence pairs in English and Spanish. You can download this data from HERE. It is a large dataset, and you should keep in mind that training a Transformer on it may take several days. So, for testing purposes, you may want to use a smaller subset, for example, 100 000 sentences.
To avoid downloading it manually, I wrote a script that downloads the data and saves it to the Datasets folder:
import os
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup
# URL to the directory containing the files to be downloaded
language = "en-es"
url = f"https://data.statmt.org/opus-100-corpus/v1.0/supervised/{language}/"
save_directory = f"./Datasets/{language}"
# Create the save directory if it doesn't exist
os.makedirs(save_directory, exist_ok=True)
# Send a GET request to the URL
response = requests.get(url)
# Parse the HTML response
soup = BeautifulSoup(response.content, 'html.parser')
# Find all the anchor tags in the HTML
links = soup.find_all('a')
# Extract the href attribute from each anchor tag
file_links = [link['href'] for link in links if '.' in link['href']]
# Download each file
for file_link in tqdm(file_links):
    file_url = url + file_link
    save_path = os.path.join(save_directory, file_link)
    print(f"Downloading {file_url}")

    # Send a GET request for the file
    file_response = requests.get(file_url)
    if file_response.status_code == 404:
        print(f"Could not download {file_url}")
        continue

    # Save the file to the specified directory
    with open(save_path, 'wb') as file:
        file.write(file_response.content)

    print(f"Saved {file_link}")

print("All files have been downloaded.")
If you want to download a different dataset, change the language variable to the language pair you want to translate. Make sure that pair exists in the OPUS-100 dataset list.
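For example, to download the German-English pair instead (assuming that pair is listed in OPUS-100), you would only need to change one line, since the url and save_directory strings are built from the language variable:
language = "de-en"  # German-English; check the OPUS-100 list to confirm this pair exists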
Once the dataset is downloaded, we need to read it into memory, which we do with the following code:
en_training_data_path = "Datasets/en-es/opus.en-es-train.en"
en_validation_data_path = "Datasets/en-es/opus.en-es-dev.en"
es_training_data_path = "Datasets/en-es/opus.en-es-train.es"
es_validation_data_path = "Datasets/en-es/opus.en-es-dev.es"
def read_files(path):
    with open(path, "r", encoding="utf-8") as f:
        dataset = f.read().split("\n")[:-1]
    return dataset
en_training_data = read_files(en_training_data_path)
en_validation_data = read_files(en_validation_data_path)
es_training_data = read_files(es_training_data_path)
es_validation_data = read_files(es_validation_data_path)
max_length = 500
train_dataset = [[es_sentence, en_sentence] for es_sentence, en_sentence in zip(es_training_data, en_training_data) if len(es_sentence) <= max_length and len(en_sentence) <= max_length]
val_dataset = [[es_sentence, en_sentence] for es_sentence, en_sentence in zip(es_validation_data, en_validation_data) if len(es_sentence) <= max_length and len(en_sentence) <= max_length]
es_training_data, en_training_data = zip(*train_dataset)
es_validation_data, en_validation_data = zip(*val_dataset)
print(len(es_training_data))
print(len(es_validation_data))
print(es_training_data[:3])
print(en_training_data[:3])
The provided code performs the following steps:
- File Paths: Four file paths are defined, representing the locations of the data files. The files are named according to their language pair, where "en" denotes English and "es" denotes Spanish. These files hold the training and validation datasets;
- read_files Function: This function reads the contents of a file given its path. It uses the open function in "r" (read) mode with "utf-8" encoding to handle text data. It then reads the file and splits it into lines using the split method with the newline character ("\n") as the delimiter. The last element of the resulting list is removed with [:-1] to exclude the trailing empty string produced by the final newline. The function returns the list of lines;
- Reading Data Files: The read_files function is used to read the contents of the four data files: English training, English validation, Spanish training, and Spanish validation. The data from each file is stored in a separate variable: en_training_data, en_validation_data, es_training_data, and es_validation_data;
- Filtering Dataset: The code sets a maximum sentence length (max_length). It then creates two new datasets, train_dataset and val_dataset, by zipping the Spanish and English sentences from the training and validation data. Only those sentence pairs are included where both the Spanish and English sentences have lengths (in characters) less than or equal to the specified max_length;
- Unzipping Datasets: After filtering, the code uses the zip function in combination with the * operator to "unzip" train_dataset and val_dataset back into separate lists of Spanish and English sentences. This results in es_training_data, en_training_data, es_validation_data, and en_validation_data containing the filtered sentences for training and validation, respectively.
The overall purpose of this code is to read text data from files, filter out sentences that exceed a specified maximum length, and then organize the filtered data into separate lists for training and validation. This filtered and organized data is intended to be used as input for training a language model or another natural language processing task.
When we run the above code, we should see the following output:
995249
1990
('Fueron los asbestos aquí. ¡Eso es lo que ocurrió!', 'Me voy de aquí.', 'Una vez, juro que cagué una barra de tiza.')
("It was the asbestos in here, that's what did it!", "I'm out of here.", 'One time, I swear I pooped out a stick of chalk.')
This shows us that we have 995249 sentence pairs in the training dataset and 1990 in the validation dataset. It also prints out three Spanish sentences that I need help understanding, along with their English translations.
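If you want to follow the earlier advice and test things with a smaller subset first, a minimal sketch (the 100 000 figure is just an example) is to slice the filtered training pairs before fitting the tokenizers:
# Optional: keep only part of the training pairs for quicker experiments
subset_size = 100_000
train_dataset = train_dataset[:subset_size]
es_training_data, en_training_data = zip(*train_dataset)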
Setting up the Tokenizer
To handle sentences, I created a custom Tokenizer. This Tokenizer is similar to the Tokenizer from the tensorflow.keras.preprocessing.text module. The difference is that when I'm ready to use the trained Transformer model, I won't need to install the huge TensorFlow library just to use the Tokenizer class.
Here is the code for the CustomTokenizer object:
import os
import json
import typing
from tqdm import tqdm
class CustomTokenizer:
    """ Custom Tokenizer class to tokenize and detokenize text data into sequences of integers

    Args:
        split (str, optional): Split token to use when tokenizing text. Defaults to " ".
        char_level (bool, optional): Whether to tokenize at character level. Defaults to False.
        lower (bool, optional): Whether to convert text to lowercase. Defaults to True.
        start_token (str, optional): Start token to use when tokenizing text. Defaults to "<start>".
        end_token (str, optional): End token to use when tokenizing text. Defaults to "<eos>".
        filters (list, optional): List of characters to filter out. Defaults to
            ['!', "'", '"', '#', '$', '%', '&', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>',
            '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '\t', '\n'].
        filter_nums (bool, optional): Whether to filter out numbers. Defaults to True.
        start (int, optional): Index to start tokenizing from. Defaults to 1.
    """
    def __init__(
        self,
        split: str=" ",
        char_level: bool=False,
        lower: bool=True,
        start_token: str="<start>",
        end_token: str="<eos>",
        filters: list = ['!', "'", '"', '#', '$', '%', '&', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '\t', '\n'],
        filter_nums: bool = True,
        start: int=1,
    ) -> None:
        self.split = split
        self.char_level = char_level
        self.lower = lower
        self.index_word = {}
        self.word_index = {}
        self.max_length = 0
        self.start_token = start_token
        self.end_token = end_token
        self.filters = filters
        self.filter_nums = filter_nums
        self.start = start

    @property
    def start_token_index(self):
        return self.word_index[self.start_token]

    @property
    def end_token_index(self):
        return self.word_index[self.end_token]

    def sort(self):
        """ Sorts the word_index and index_word dictionaries"""
        self.index_word = dict(enumerate(dict(sorted(self.word_index.items())), start=self.start))
        self.word_index = {v: k for k, v in self.index_word.items()}

    def split_line(self, line: str):
        """ Splits a line of text into tokens

        Args:
            line (str): Line of text to split

        Returns:
            list: List of string tokens
        """
        line = line.lower() if self.lower else line

        if self.char_level:
            return [char for char in line]

        # split line with split token and check for filters
        line_tokens = line.split(self.split)
        new_tokens = []
        for index, token in enumerate(line_tokens):
            filtered_tokens = ['']
            for c_index, char in enumerate(token):
                if char in self.filters or (self.filter_nums and char.isdigit()):
                    filtered_tokens += [char, ''] if c_index != len(token) -1 else [char]
                else:
                    filtered_tokens[-1] += char

            new_tokens += filtered_tokens
            if index != len(line_tokens) -1:
                new_tokens += [self.split]

        new_tokens = [token for token in new_tokens if token != '']

        return new_tokens

    def fit_on_texts(self, lines: typing.List[str]):
        """ Fits the tokenizer on a list of lines of text

        This function will update the word_index and index_word dictionaries and set the max_length attribute

        Args:
            lines (typing.List[str]): List of lines of text to fit the tokenizer on
        """
        self.word_index = {key: value for value, key in enumerate([self.start_token, self.end_token, self.split] + self.filters)}

        for line in tqdm(lines, desc="Fitting tokenizer"):
            line_tokens = self.split_line(line)
            self.max_length = max(self.max_length, len(line_tokens) +2) # +2 for start and end tokens
            for token in line_tokens:
                if token not in self.word_index:
                    self.word_index[token] = len(self.word_index)

        self.sort()

    def update(self, lines: typing.List[str]):
        """ Updates the tokenizer with new lines of text

        This function will update the word_index and index_word dictionaries and set the max_length attribute

        Args:
            lines (typing.List[str]): List of lines of text to update the tokenizer with
        """
        new_tokens = 0
        for line in tqdm(lines, desc="Updating tokenizer"):
            line_tokens = self.split_line(line)
            self.max_length = max(self.max_length, len(line_tokens) +2) # +2 for start and end tokens
            for token in line_tokens:
                if token not in self.word_index:
                    self.word_index[token] = len(self.word_index)
                    new_tokens += 1

        self.sort()
        print(f"Added {new_tokens} new tokens")

    def detokenize(self, sequences: typing.List[int], remove_start_end: bool=True):
        """ Converts a list of sequences of tokens back into text

        Args:
            sequences (typing.List[int]): List of sequences of tokens to convert back into text
            remove_start_end (bool, optional): Whether to remove the start and end tokens. Defaults to True.

        Returns:
            typing.List[str]: List of strings of the converted sequences
        """
        lines = []
        for sequence in sequences:
            line = ""
            for token in sequence:
                if token == 0:
                    break
                if remove_start_end and (token == self.start_token_index or token == self.end_token_index):
                    continue
                line += self.index_word[token]

            lines.append(line)

        return lines

    def texts_to_sequences(self, lines: typing.List[str], include_start_end: bool=True):
        """ Converts a list of lines of text into a list of sequences of tokens

        Args:
            lines (typing.List[str]): List of lines of text to convert into tokenized sequences
            include_start_end (bool, optional): Whether to include the start and end tokens. Defaults to True.

        Returns:
            typing.List[typing.List[int]]: List of sequences of tokens
        """
        sequences = []
        for line in lines:
            line_tokens = self.split_line(line)
            sequence = [self.word_index[word] for word in line_tokens if word in self.word_index]
            if include_start_end:
                sequence = [self.word_index[self.start_token]] + sequence + [self.word_index[self.end_token]]

            sequences.append(sequence)

        return sequences

    def save(self, path: str, type: str="json"):
        """ Saves the tokenizer to a file

        Args:
            path (str): Path to save the tokenizer to
            type (str, optional): Type of file to save the tokenizer to. Defaults to "json".
        """
        serialised_dict = self.dict()
        if type == "json":
            if os.path.dirname(path):
                os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as f:
                json.dump(serialised_dict, f)

    def dict(self):
        """ Returns a dictionary of the tokenizer

        Returns:
            dict: Dictionary of the tokenizer
        """
        return {
            "split": self.split,
            "lower": self.lower,
            "char_level": self.char_level,
            "index_word": self.index_word,
            "max_length": self.max_length,
            "start_token": self.start_token,
            "end_token": self.end_token,
            "filters": self.filters,
            "filter_nums": self.filter_nums,
            "start": self.start
        }

    @staticmethod
    def load(path: typing.Union[str, dict], type: str="json"):
        """ Loads a tokenizer from a file

        Args:
            path (typing.Union[str, dict]): Path to load the tokenizer from or a dictionary of the tokenizer
            type (str, optional): Type of file to load the tokenizer from. Defaults to "json".

        Returns:
            CustomTokenizer: Loaded tokenizer
        """
        if isinstance(path, str):
            if type == "json":
                with open(path, "r") as f:
                    load_dict = json.load(f)
        elif isinstance(path, dict):
            load_dict = path

        tokenizer = CustomTokenizer()
        tokenizer.split = load_dict["split"]
        tokenizer.lower = load_dict["lower"]
        tokenizer.char_level = load_dict["char_level"]
        tokenizer.index_word = {int(k): v for k, v in load_dict["index_word"].items()}
        tokenizer.max_length = load_dict["max_length"]
        tokenizer.start_token = load_dict["start_token"]
        tokenizer.end_token = load_dict["end_token"]
        tokenizer.filters = load_dict["filters"]
        tokenizer.filter_nums = bool(load_dict["filter_nums"])
        tokenizer.start = load_dict["start"]
        tokenizer.word_index = {v: int(k) for k, v in tokenizer.index_word.items()}

        return tokenizer

    @property
    def lenght(self):
        return len(self.index_word)

    def __len__(self):
        return len(self.index_word)
I'll include this object in the MLTU package, which you can install from PyPI, so you won't need to copy and paste it. We'll come back to this later.
It is a custom implementation of a text tokenizer, which takes raw text data and converts it into sequences of integers (tokens). The class provides several methods to perform tokenization and detokenization.
The tokenizer can be initialized with various parameters, such as the split token (used to split the input text), whether tokenization should be done at the character or word level, whether to convert the text to lowercase, and more. It also allows you to specify start and end tokens, which are added to the tokenized sequences. Additionally, you can filter out specific characters and numbers during tokenization.
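As a quick illustration of these options, here is a minimal sketch (the parameter combinations are arbitrary examples) of constructing tokenizers with different settings:
# Word-level tokenizer that keeps the original casing and does not filter out digits
word_tokenizer = CustomTokenizer(char_level=False, lower=False, filter_nums=False)

# Character-level tokenizer with custom start/end tokens
char_tokenizer = CustomTokenizer(char_level=True, start_token="<s>", end_token="</s>")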
The class contains methods like split_line, which splits a line of text into tokens; fit_on_texts, which fits the tokenizer on a list of lines of text and updates its internal dictionaries; and detokenize, which converts sequences of tokens back into text. Other methods include texts_to_sequences for converting lines of text into sequences of tokens and save/load for saving and loading the tokenizer to/from files.
Overall, this CustomTokenizer class provides a flexible and customizable way to preprocess text data for machine-learning models requiring integer sequences as input. It enables you to tokenize and detokenize text while handling various preprocessing options according to the requirements of your specific application.
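To see how the char_level flag changes tokenization in practice, here is a small sketch that calls split_line directly; the outputs in the comments are what the class above should produce:
# Word-level (default): characters from the filters list become separate tokens
print(CustomTokenizer().split_line("Hello, world!"))
# ['hello', ',', ' ', 'world', '!']

# Character-level: every character is its own token
print(CustomTokenizer(char_level=True).split_line("¿Qué?"))
# ['¿', 'q', 'u', 'é', '?']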
So, we have two languages and need to create two tokenizers, one for each language. Let's do it:
# prepare Spanish tokenizer, this is the input language
tokenizer = CustomTokenizer(char_level=True)
tokenizer.fit_on_texts(es_training_data)
tokenizer.save("tokenizer.json")
# prepare English tokenizer, this is the output language
detokenizer = CustomTokenizer(char_level=True)
detokenizer.fit_on_texts(en_training_data)
detokenizer.save("detokenizer.json")
The code demonstrates the preparation of tokenizers for two languages, Spanish and English. These tokenizers play a crucial role in natural language processing tasks, where they are responsible for converting raw text data into sequences of tokens, which are essential for training machine learning models.
The first section of the code focuses on preparing the Spanish tokenizer, which will serve as the input-language tokenizer. The tokenizer is initialized using the CustomTokenizer class, configured to tokenize text at the character level, meaning each character in the input text is treated as a separate token. The fit_on_texts method is then applied to the Spanish training data (es_training_data), fitting the tokenizer on the list of Spanish sentences and updating its internal dictionaries and settings. By doing this, the tokenizer learns the mapping between characters and their corresponding integer representations. Finally, the tokenizer is saved to a file named "tokenizer.json" at the specified path.
The second part of the code focuses on preparing the English tokenizer, which will be used as the output language tokenizer. It is identical to the first part, except that we use it for English sentences.
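After fitting, it is worth sanity-checking both tokenizers, for example by printing their vocabulary sizes and the longest tokenized sentence each one has seen (the exact numbers will depend on your data):
# Vocabulary size (via __len__) and maximum sequence length seen during fitting
print(len(tokenizer), tokenizer.max_length)
print(len(detokenizer), detokenizer.max_length)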
When we run the above code, we should see output similar to the following:
Fitting tokenizer: 100%|██████████| 995249/995249 [00:10<00:00, 95719.57it/s]
Fitting tokenizer: 100%|██████████| 995249/995249 [00:07<00:00, 134446.71it/s]
In the above output, the tokenizer fitting progress is displayed.
These tokenizers can later be used to convert text data into sequences of characters, enabling further natural language processing tasks such as sequence-to-sequence translation, text generation, or any other task requiring tokenized input and output. The saved tokenizer files can be loaded in subsequent stages of the NLP pipeline to maintain consistency and facilitate inference and evaluation of new data.
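For example, in a later inference script, the saved tokenizers could be restored with the load method shown earlier:
# Restore the tokenizers from the JSON files saved above
tokenizer = CustomTokenizer.load("tokenizer.json")
detokenizer = CustomTokenizer.load("detokenizer.json")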
Let's try to use the detokenizer from above to convert a sentence into tokens and then back into a sentence:
tokenized_sentence = detokenizer.texts_to_sequences(["Hello world, how are you?"])[0]
print(tokenized_sentence)
detokenized_sentence = detokenizer.detokenize([tokenized_sentence], remove_start_end=False)
print(detokenized_sentence)
detokenized_sentence = detokenizer.detokenize([tokenized_sentence])
print(detokenized_sentence)
Running the above code should give us the following output:
[33, 51, 48, 55, 55, 58, 3, 66, 58, 61, 55, 47, 15, 3, 51, 58, 66, 3, 44, 61, 48, 3, 68, 58, 64, 36, 32]
['<start>hello world, how are you?<eos>']
['hello world, how are you?']
So, we tokenized our "Hello world, how are you?" sentence into tokens and then detokenized it back. I also demonstrated the difference the remove_start_end flag makes when we toggle it.
Setting up the Data Pipeline
Now that we have our tokenizers, we can create a data pipeline. The pipeline will be responsible for serving our sentence pairs, tokenizing them, and batching them. Let's import the DataProvider class from the mltu package:
from mltu.tensorflow.dataProvider import DataProvider
We will have two data providers, one for training data and one for validation data. While iterating over them, we will receive batches already prepared for our model. Let's create them:
from mltu.tensorflow.dataProvider import DataProvider
import numpy as np

def preprocess_inputs(data_batch, label_batch):
    encoder_input = np.zeros((len(data_batch), tokenizer.max_length)).astype(np.int64)
    decoder_input = np.zeros((len(label_batch), detokenizer.max_length)).astype(np.int64)
    decoder_output = np.zeros((len(label_batch), detokenizer.max_length)).astype(np.int64)

    data_batch_tokens = tokenizer.texts_to_sequences(data_batch)
    label_batch_tokens = detokenizer.texts_to_sequences(label_batch)

    for index, (data, label) in enumerate(zip(data_batch_tokens, label_batch_tokens)):
        encoder_input[index][:len(data)] = data
        decoder_input[index][:len(label)-1] = label[:-1] # Drop the [END] tokens
        decoder_output[index][:len(label)-1] = label[1:] # Drop the [START] tokens

    return (encoder_input, decoder_input), decoder_output

train_dataProvider = DataProvider(
    train_dataset,
    batch_size=4,
    batch_postprocessors=[preprocess_inputs],
    use_cache=True
    )

val_dataProvider = DataProvider(
    val_dataset,
    batch_size=4,
    batch_postprocessors=[preprocess_inputs],
    use_cache=True
    )
We created a Python function named preprocess_inputs and two instances of the DataProvider class, train_dataProvider and val_dataProvider. The preprocess_inputs function serves as a preprocessing step for the input and label batches used by the model in this sequence-to-sequence task.
The preprocess_inputs function takes two arguments, data_batch and label_batch, representing a batch of input data and the corresponding batch of labels. Within the function, three arrays, encoder_input, decoder_input, and decoder_output, are initialized as zero-filled arrays to store the processed data.
The function first tokenizes the input and label batches using the previously prepared tokenizers, tokenizer and detokenizer. This converts the text data into the sequences of integers required for training the model.
Next, it iterates through the tokenized data and label batches, and for each data-label pair, it populates the encoder_input array with the integer sequence of the input data. Similarly, it fills the decoder_input array with the integer sequence of the label data but with the last token (the [END] token) removed. The decoder_output array is populated with the integer sequence of the label data but with the [START] token removed. These arrays form the input and target sequences used to train the sequence-to-sequence model.
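To make the shifting concrete, here is a tiny hypothetical example (the token IDs are invented for illustration) of how a single tokenized label is split into decoder input and decoder output:
# Hypothetical tokenized label: [<start>, 'h', 'i', <eos>] -> [1, 33, 40, 2]
label = [1, 33, 40, 2]

decoder_input = label[:-1]   # [1, 33, 40] - keeps <start>, drops <eos>
decoder_output = label[1:]   # [33, 40, 2] - shifted by one, drops <start>, keeps <eos>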
The DataProvider object handles data batching during model training. It takes the training and validation datasets (train_dataset and val_dataset) and processes the data batches using the preprocess_inputs function. The batch_size parameter determines the number of data samples in each batch. Additionally, the use_cache parameter is set to True, which means the DataProvider will cache the preprocessed batches for efficient data loading during training.
In summary, the above code uses the DataProvider class to handle data batching and caching during training, making it easier to feed data into the model in batches, a common practice in machine learning that improves training efficiency.
Let's check what a single batch from our train_dataProvider looks like:
for data_batch in train_dataProvider:
    (encoder_inputs, decoder_inputs), decoder_outputs = data_batch

    encoder_inputs_str = tokenizer.detokenize(encoder_inputs)
    decoder_inputs_str = detokenizer.detokenize(decoder_inputs, remove_start_end=False)
    decoder_outputs_str = detokenizer.detokenize(decoder_outputs, remove_start_end=False)

    print(encoder_inputs_str)
    print(decoder_inputs_str)
    print(decoder_outputs_str)
    break
In the terminal, we should see output similar to this:
['fueron los asbestos aquí. ¡eso es lo que ocurrió!', 'me voy de aquí.', 'una vez, juro que cagué una barra de tiza.', 'y prefiero mudarme, ¿entiendes?']
["<start>it was the asbestos in here, that's what did it!", "<start>i'm out of here.", '<start>one time, i swear i pooped out a stick of chalk.', '<start>and i will move, do you understand me?']
["it was the asbestos in here, that's what did it!<eos>", "i'm out of here.<eos>", 'one time, i swear i pooped out a stick of chalk.<eos>', 'and i will move, do you understand me?<eos>']
This is an example of the input our Transformer model will receive during training; to make it easier to interpret, we detokenized the tokens back into text.
Conclusion:
In summary, we explored the critical steps required to transform raw text data into a format suitable for training a Transformer model, focusing on sequence-to-sequence tasks (language translation).
In this tutorial we:
🎯 Introduced the Data Preparation requirements for Transformers;
🔧 Built a Flexible CustomTokenizer object;
📚 Created Language-Specific Tokenizers for Spanish and English;
🔗 Established a Data Pipeline for training utilizing the DataProvider object.
I demonstrated how to unlock the power of data manipulation for Transformer training. Having delved into creating custom tokenizers, organizing data pipelines, and streamlining preprocessing steps, you should now have the tools and knowledge to set your NLP projects on a path to success.