In the first tutorial, we got familiar with Transformers: what they are and what advantages and limitations they bring. We also implemented the PositionalEmbedding layer in TensorFlow. In the second tutorial, we implemented the Add & Norm, BaseAttention, CrossAttention, GlobalSelfAttention, CausalSelfAttention, and FeedForward layers.
In this tutorial, using the layers from the previous tutorials, we'll implement the Encoder and Decoder layers that will be used to build a complete Transformer model.
While implementing and testing the content in this tutorial, I am not showing the essential imports from the previous tutorials, so if you are testing everything by yourself, follow the previous tutorials to import the necessary code.
EncoderLayer layer
Let's start with the EncoderLayer layer. Why is it called EncoderLayer? Because it is a single layer of the Encoder, and the Encoder is composed of multiple EncoderLayers. We'll see a similar structure in the Decoder:
The EncoderLayer consists of two sublayers: a MultiHeadAttention layer (more specifically, a GlobalSelfAttention layer) and a FeedForward layer. Each of these sublayers has a residual connection around it, followed by layer normalization. Residual connections help avoid the vanishing gradient problem in deep networks.
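As a reminder of what this pattern looks like in code, here is a minimal, hypothetical sketch of a residual "Add & Norm" wrapper. It is for illustration only; in our implementation, this logic already lives inside the attention and FeedForward layers from the previous tutorial:
import tensorflow as tf

class ResidualNorm(tf.keras.layers.Layer):
    """Hypothetical wrapper, for illustration only: output = LayerNorm(x + Sublayer(x))."""
    def __init__(self, sublayer: tf.keras.layers.Layer):
        super().__init__()
        self.sublayer = sublayer
        self.add = tf.keras.layers.Add()
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, x: tf.Tensor) -> tf.Tensor:
        # Residual connection followed by layer normalization.
        return self.norm(self.add([x, self.sublayer(x)]))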
Let's implement this layer:
class EncoderLayer(tf.keras.layers.Layer):
    """
    A single layer of the Encoder. Usually there are multiple layers stacked on top of each other.

    Methods:
        call: Performs the forward pass of the layer.

    Attributes:
        self_attention (GlobalSelfAttention): The global self-attention layer.
        ffn (FeedForward): The feed-forward layer.
    """
    def __init__(self, d_model: int, num_heads: int, dff: int, dropout_rate: float=0.1):
        """
        Constructor of the EncoderLayer.

        Args:
            d_model (int): The dimensionality of the model.
            num_heads (int): The number of heads in the multi-head attention layer.
            dff (int): The dimensionality of the feed-forward layer.
            dropout_rate (float): The dropout rate.
        """
        super().__init__()
        self.self_attention = GlobalSelfAttention(
            num_heads=num_heads,
            key_dim=d_model,
            dropout=dropout_rate
            )
        self.ffn = FeedForward(d_model, dff)

    def call(self, x: tf.Tensor) -> tf.Tensor:
        """
        The call function that performs the forward pass of the layer.

        Args:
            x (tf.Tensor): The input sequence of shape (batch_size, seq_length, d_model).

        Returns:
            tf.Tensor: The output sequence of shape (batch_size, seq_length, d_model).
        """
        x = self.self_attention(x)
        x = self.ffn(x)

        return x
Now let's test it out. We will use the same random input as in the previous tutorial. The output shape should be the same as the input shape:
encoder_vocab_size = 1000
d_model = 512
encoder_embedding_layer = PositionalEmbedding(encoder_vocab_size, d_model)
random_encoder_input = np.random.randint(0, encoder_vocab_size, size=(1, 100))
encoder_embeddings = encoder_embedding_layer(random_encoder_input)
print("encoder_embeddings shape", encoder_embeddings.shape)
encoder_layer = EncoderLayer(d_model, num_heads=2, dff=2048)
encoder_layer_output = encoder_layer(encoder_embeddings)
print("encoder_layer_output shape", encoder_layer_output.shape)
We'll see the following output:
encoder_embeddings shape (1, 100, 512)
encoder_layer_output shape (1, 100, 512)
Great! We have implemented the EncoderLayer layer. The output shape is the same as the input shape because the output of the EncoderLayer is the output of the FeedForward layer, which has the same shape as its input. Now, let's combine multiple EncoderLayers to create the Encoder layer.
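Because the input and output shapes match, we can already chain such layers by hand. Here is a quick, illustrative sanity check that reuses the objects from the test above:
# Quick sanity check (illustrative): feed the output of one EncoderLayer into another one.
second_encoder_layer = EncoderLayer(d_model, num_heads=2, dff=2048)
stacked_output = second_encoder_layer(encoder_layer_output)
print("stacked_output shape", stacked_output.shape)  # expected: (1, 100, 512)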
Encoder layer
Now let's implement the Encoder layer. The Encoder combines the PositionalEmbedding layer, multiple EncoderLayer layers, and a Dropout layer. The output of each EncoderLayer is passed to the next EncoderLayer, and the output of the last EncoderLayer is the output of the Encoder. In the following image, you can see the Encoder marked in red:
The Nx represents how many EncoderLayers we stack in the whole Encoder. Let's implement the Encoder layer in code:
class Encoder(tf.keras.layers.Layer):
    """
    A custom TensorFlow layer that implements the Encoder. This layer is mostly used in the Transformer models
    for natural language processing tasks, such as machine translation, text summarization or text classification.

    Methods:
        call: Performs the forward pass of the layer.

    Attributes:
        d_model (int): The dimensionality of the model.
        num_layers (int): The number of layers in the encoder.
        pos_embedding (PositionalEmbedding): The positional embedding layer.
        enc_layers (list): The list of encoder layers.
        dropout (tf.keras.layers.Dropout): The dropout layer.
    """
    def __init__(self, num_layers: int, d_model: int, num_heads: int, dff: int, vocab_size: int, dropout_rate: float=0.1):
        """
        Constructor of the Encoder.

        Args:
            num_layers (int): The number of layers in the encoder.
            d_model (int): The dimensionality of the model.
            num_heads (int): The number of heads in the multi-head attention layer.
            dff (int): The dimensionality of the feed-forward layer.
            vocab_size (int): The size of the vocabulary.
            dropout_rate (float): The dropout rate.
        """
        super().__init__()
        self.d_model = d_model
        self.num_layers = num_layers

        self.pos_embedding = PositionalEmbedding(vocab_size=vocab_size, d_model=d_model)
        self.enc_layers = [
            EncoderLayer(d_model=d_model,
                         num_heads=num_heads,
                         dff=dff,
                         dropout_rate=dropout_rate)
            for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x: tf.Tensor) -> tf.Tensor:
        """
        The call function that performs the forward pass of the layer.

        Args:
            x (tf.Tensor): The input sequence of shape (batch_size, seq_length).

        Returns:
            tf.Tensor: The output sequence of shape (batch_size, seq_length, d_model).
        """
        x = self.pos_embedding(x)
        # Here x has shape `(batch_size, seq_len, d_model)`.

        # Add dropout.
        x = self.dropout(x)

        for i in range(self.num_layers):
            x = self.enc_layers[i](x)

        return x  # Shape `(batch_size, seq_len, d_model)`.
Following this and the previous tutorials step by step, you should already see that we have combined everything covered so far to construct the Encoder layer. Let's test it out. As input, we'll generate a random sequence, which in real life would be a sequence of tokenized words:
encoder_vocab_size = 1000
d_model = 512
encoder = Encoder(num_layers=2, d_model=d_model, num_heads=2, dff=2048, vocab_size=encoder_vocab_size)
random_encoder_input = np.random.randint(0, encoder_vocab_size, size=(1, 100))
encoder_output = encoder(random_encoder_input)
print("random_encoder_input shape", random_encoder_input.shape)
print("encoder_output shape", encoder_output.shape)
You should see the following output:
random_encoder_input shape (1, 100)
encoder_output shape (1, 100, 512)
Now we have completely implemented the Encoder layer. If everything is clear up to this point, you can move on to the decoder part. But if you still need clarification, go back and reread the previous sections.
DecoderLayer layer
The DecoderLayer is similar to the EncoderLayer, but it has an additional CrossAttention layer between the CausalSelfAttention layer and the FeedForward layer:
The CrossAttention layer calculates the attention weights between the decoder input and the encoder output. The CausalSelfAttention layer calculates attention weights within the decoder sequence itself, using a causal mask so that each position can only attend to earlier positions. The FeedForward layer transforms the representation of the input sequence into a more suitable form for the task at hand.
Let's implement the DecoderLayer layer:
class DecoderLayer(tf.keras.layers.Layer):
    """
    A single layer of the Decoder. Usually there are multiple layers stacked on top of each other.

    Methods:
        call: Performs the forward pass of the layer.

    Attributes:
        causal_self_attention (CausalSelfAttention): The causal self-attention layer.
        cross_attention (CrossAttention): The cross-attention layer.
        ffn (FeedForward): The feed-forward layer.
    """
    def __init__(self, d_model: int, num_heads: int, dff: int, dropout_rate: float=0.1):
        """
        Constructor of the DecoderLayer.

        Args:
            d_model (int): The dimensionality of the model.
            num_heads (int): The number of heads in the multi-head attention layer.
            dff (int): The dimensionality of the feed-forward layer.
            dropout_rate (float): The dropout rate.
        """
        super(DecoderLayer, self).__init__()
        self.causal_self_attention = CausalSelfAttention(
            num_heads=num_heads,
            key_dim=d_model,
            dropout=dropout_rate)
        self.cross_attention = CrossAttention(
            num_heads=num_heads,
            key_dim=d_model,
            dropout=dropout_rate)
        self.ffn = FeedForward(d_model, dff)

    def call(self, x: tf.Tensor, context: tf.Tensor) -> tf.Tensor:
        """
        The call function that performs the forward pass of the layer.

        Args:
            x (tf.Tensor): The input sequence of shape (batch_size, seq_length, d_model). x is usually the output of the previous decoder layer.
            context (tf.Tensor): The context sequence of shape (batch_size, seq_length, d_model). Context is usually the output of the encoder.
        """
        x = self.causal_self_attention(x=x)
        x = self.cross_attention(x=x, context=context)

        # Cache the last attention scores for plotting later.
        self.last_attn_scores = self.cross_attention.last_attn_scores

        x = self.ffn(x)  # Shape `(batch_size, seq_len, d_model)`.

        return x
Let's do a short analysis of what we have done here. The DecoderLayer consists of three sublayers: a CausalSelfAttention layer, a CrossAttention layer, and a FeedForward layer. Each of these sublayers has a residual connection around it, followed by layer normalization. The output of each sublayer is LayerNormalization(x + Sublayer(x)). The output of the DecoderLayer is the output of the FeedForward layer, which has the same shape as the input.
If we take, for example, the translation task from Spanish to English, here context would be a Spanish sentence, and x would be an English sentence. At the first iteration, we don't have any input to the decoder except the <start> token, while the encoder receives the complete Spanish sentence. The layer calculates the attention weights within the decoder input, then the attention weights between the decoder input and the encoder output, transforms the result into a more suitable representation for the task at hand, and outputs it.
After the first iteration, we have, for example, <start> Hello as output from the decoder. We repeat the above steps until the end of the sentence, and after all iterations, we have the translated sentence, for example, <start> Hello, how are you? <end>. That's the whole idea of iterations in the decoder.
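To make this iteration idea concrete, here is a hypothetical sketch of the loop; predict_next_token, start_id, and end_id are placeholders standing in for the full model and tokenizer we haven't built yet:
# Hypothetical sketch of autoregressive decoding (illustration only).
# `predict_next_token` stands in for the complete model we will build later.
def translate(spanish_token_ids, predict_next_token, start_id, end_id, max_len=50):
    english_token_ids = [start_id]            # first iteration: only the <start> token
    for _ in range(max_len):
        next_id = predict_next_token(spanish_token_ids, english_token_ids)
        english_token_ids.append(next_id)     # e.g. <start> Hello ...
        if next_id == end_id:                 # stop once <end> is produced
            break
    return english_token_ids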
As before, we need to test this layer. We'll generate a random list of integers, which will be our tokenized sentence, push it through an embedding layer to get an embedding for each token, and then feed these embeddings into the DecoderLayer layer along with the encoder output (which we obtained while testing the Encoder) to get the output. Let's do it:
# Test DecoderLayer layer
decoder_vocab_size = 1000
d_model = 512
dff = 2048
num_heads = 8
decoder_layer = DecoderLayer(d_model, num_heads, dff)
random_decoderLayer_input = np.random.randint(0, decoder_vocab_size, size=(1, 110))
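# Note: we reuse the encoder's embedding layer here just for this test, since both
# vocabularies have size 1000; in a real model the decoder would have its own embedding.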
decoder_embeddings = encoder_embedding_layer(random_decoderLayer_input)
decoderLayer_output = decoder_layer(decoder_embeddings, encoder_output)
print("random_decoder_input shape", random_decoderLayer_input.shape)
print("decoder_embeddings shape", decoder_embeddings.shape)
print("decoder_output shape", decoderLayer_output.shape)
You should see the following output:
random_decoder_input shape (1, 110)
decoder_embeddings shape (1, 110, 512)
decoder_output shape (1, 110, 512)
Great, it works as expected. Our DecoderLayer output shape is the same as the embedding shape, meaning we can stack as many of these layers as we want sequentially.
Decoder layer
Now let's implement the Decoder layer, which is very similar to the Encoder layer. The Decoder combines the PositionalEmbedding layer, multiple DecoderLayer layers, and a Dropout layer. The output of each DecoderLayer is passed to the next DecoderLayer, and the output of the last DecoderLayer is the output of the Decoder. In the following image, you can see the Decoder marked in red:
Let's implement the Decoder layer in code:
class Decoder(tf.keras.layers.Layer):
    """
    A custom TensorFlow layer that implements the Decoder. This layer is mostly used in the Transformer models
    for natural language processing tasks, such as machine translation, text summarization or text classification.

    Methods:
        call: Performs the forward pass of the layer.

    Attributes:
        d_model (int): The dimensionality of the model.
        num_layers (int): The number of layers in the decoder.
        pos_embedding (PositionalEmbedding): The positional embedding layer.
        dec_layers (list): The list of decoder layers.
        dropout (tf.keras.layers.Dropout): The dropout layer.
    """
    def __init__(self, num_layers: int, d_model: int, num_heads: int, dff: int, vocab_size: int, dropout_rate: float=0.1):
        """
        Constructor of the Decoder.

        Args:
            num_layers (int): The number of layers in the decoder.
            d_model (int): The dimensionality of the model.
            num_heads (int): The number of heads in the multi-head attention layer.
            dff (int): The dimensionality of the feed-forward layer.
            vocab_size (int): The size of the vocabulary.
            dropout_rate (float): The dropout rate.
        """
        super(Decoder, self).__init__()
        self.d_model = d_model
        self.num_layers = num_layers

        self.pos_embedding = PositionalEmbedding(vocab_size=vocab_size, d_model=d_model)
        self.dropout = tf.keras.layers.Dropout(dropout_rate)
        self.dec_layers = [
            DecoderLayer(
                d_model=d_model,
                num_heads=num_heads,
                dff=dff,
                dropout_rate=dropout_rate) for _ in range(num_layers)]

        self.last_attn_scores = None

    def call(self, x: tf.Tensor, context: tf.Tensor) -> tf.Tensor:
        """
        The call function that performs the forward pass of the layer.

        Args:
            x (tf.Tensor): The input sequence of shape (batch_size, target_seq_len).
            context (tf.Tensor): The context sequence of shape (batch_size, input_seq_len, d_model).
        """
        # `x` is the token-IDs tensor of shape (batch_size, target_seq_len).
        x = self.pos_embedding(x)  # (batch_size, target_seq_len, d_model)

        x = self.dropout(x)

        for i in range(self.num_layers):
            x = self.dec_layers[i](x, context)

        self.last_attn_scores = self.dec_layers[-1].last_attn_scores

        # The shape of x is (batch_size, target_seq_len, d_model).
        return x
The Decoder class requires two inputs: a token-ID sequence representing the target sequence, and an encoded input sequence, also known as the context. The Decoder holds multiple DecoderLayer instances that perform various operations on the input sequence to produce an output sequence.
When the Decoder object is instantiated, it sets up several layers, such as the PositionalEmbedding layer, responsible for adding positional information to the input token IDs, a Dropout layer for regularization, and a stack of DecoderLayer instances.
The input token IDs go through the positional embedding and dropout layers during a forward pass. Then, for each DecoderLayer, the input undergoes causal self-attention, cross-attention, and a feed-forward neural network layer. The output of the last DecoderLayer is returned as the final output of the Decoder.
The last_attn_scores attribute of the Decoder instance stores the attention scores from the last decoder layer, which can be valuable for visualization and debugging purposes.
Now, let's write a simple code snippet to test the Decoder layer. As before, we'll generate a random sequence as input, which in real life would be a sequence of tokenized words:
# Test decoder layer
decoder_vocab_size = 1000
d_model = 512
decoder_layer = Decoder(num_layers=2, d_model=d_model, num_heads=2, dff=2048, vocab_size=decoder_vocab_size)
random_decoder_input = np.random.randint(0, decoder_vocab_size, size=(1, 100))
decoder_output = decoder_layer(random_decoder_input, encoder_output)
print("random_decoder_input shape", random_decoder_input.shape)
print("decoder_output shape", decoder_output.shape)
You should see the following output:
random_decoder_input shape (1, 100)
decoder_output shape (1, 100, 512)
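We can also peek at the cached cross-attention scores from the last decoder layer. Assuming the CrossAttention layer from the previous tutorial stores the scores returned by Keras MultiHeadAttention, their shape should be (batch_size, num_heads, target_seq_len, input_seq_len):
# Inspect the cross-attention scores cached during the forward pass above
# (shape assumption: (batch_size, num_heads, target_seq_len, input_seq_len), here (1, 2, 100, 100)).
print("last_attn_scores shape", decoder_layer.last_attn_scores.shape)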
Now we have tested it with random data. But imagine it were actual data: we would have, for example, a Spanish sentence as input and an English sentence as output. To translate Spanish sentences to English, we would feed the Spanish sentence to the encoder and the English sentence generated so far to the decoder, and iterate over the decoder until we get the <end> token. Then we would have the translated sentence.
As we can see, the decoder output is a (1, 100, 512) tensor. To turn it into words, we would pass it through a final Dense layer to get logits over the vocabulary, apply the argmax function to pick the most probable token at each position, and look the tokens up in the dictionary to get the final words. We will do that as part of the complete model below.
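As a quick, hedged illustration of that last step (the Transformer model below does this with its final Dense layer), using the decoder_output from the test above and a hypothetical id_to_word lookup:
# Illustration only: map the decoder output to the most probable token IDs.
final_dense = tf.keras.layers.Dense(decoder_vocab_size)  # (1, 100, 512) -> (1, 100, 1000) logits
logits = final_dense(decoder_output)
token_ids = tf.argmax(logits, axis=-1)                   # (1, 100), most probable token per position
# words = [id_to_word[int(i)] for i in token_ids[0]]     # `id_to_word` is a hypothetical vocabulary lookup
print("token_ids shape", token_ids.shape)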
The Transformer
Finally, we have implemented all the layers we need to build the Transformer. The Transformer consists of an Encoder, a Decoder, and a final linear layer. The output of the Decoder is the input to the final linear layer, and its result is returned as the output of the Transformer. This final Dense layer converts the resulting sequence into logits over the output vocabulary, which a softmax can turn into a probability distribution.
In the following image, you can see the Transformer model that we will implement:
Now let's implement the Transformer model in TensorFlow:
def Transformer(
    input_vocab_size: int,
    target_vocab_size: int,
    encoder_input_size: int = None,
    decoder_input_size: int = None,
    num_layers: int=6,
    d_model: int=512,
    num_heads: int=8,
    dff: int=2048,
    dropout_rate: float=0.1,
    ) -> tf.keras.Model:
    """
    A custom TensorFlow model that implements the Transformer architecture.

    Args:
        input_vocab_size (int): The size of the input vocabulary.
        target_vocab_size (int): The size of the target vocabulary.
        encoder_input_size (int): The size of the encoder input sequence.
        decoder_input_size (int): The size of the decoder input sequence.
        num_layers (int): The number of layers in the encoder and decoder.
        d_model (int): The dimensionality of the model.
        num_heads (int): The number of heads in the multi-head attention layer.
        dff (int): The dimensionality of the feed-forward layer.
        dropout_rate (float): The dropout rate.

    Returns:
        A TensorFlow Keras model.
    """
    inputs = [
        tf.keras.layers.Input(shape=(encoder_input_size,), dtype=tf.int64),
        tf.keras.layers.Input(shape=(decoder_input_size,), dtype=tf.int64)
        ]
    encoder_input, decoder_input = inputs

    encoder = Encoder(num_layers=num_layers, d_model=d_model, num_heads=num_heads, dff=dff, vocab_size=input_vocab_size, dropout_rate=dropout_rate)(encoder_input)
    decoder = Decoder(num_layers=num_layers, d_model=d_model, num_heads=num_heads, dff=dff, vocab_size=target_vocab_size, dropout_rate=dropout_rate)(decoder_input, encoder)

    output = tf.keras.layers.Dense(target_vocab_size)(decoder)

    return tf.keras.Model(inputs=inputs, outputs=output)
The Transformer incorporates both the Encoder and Decoder components to implement the Transformer architecture.
The Encoder is an instance of the Encoder class, responsible for taking a sequence of tokens as input and producing a sequence of contextual vectors, each representing information about a specific token in the input sequence.
The Decoder is also an instance of the Decoder class, which takes both a sequence of target tokens and the contextual information generated by the Encoder as input. It then generates a sequence of contextual vectors corresponding to each target token in the output sequence.
The final layer is a Dense layer that takes the output of the Decoder and maps it to a sequence of logits over the target tokens.
When we construct the Transformer model, we provide it with an inputs list containing two tensors: the encoder input (the context, representing the input sequence for the Encoder) and the decoder input (the target sequence for the Decoder). When we call the Transformer model, it processes the encoder input through the Encoder to obtain contextual information for each token in the input sequence. It then uses this information, together with the decoder input, to generate the output sequence through the Decoder. Finally, the model passes the output of the Decoder through the final Dense layer to obtain logits for the target tokens.
To make this example more efficient, we reduced the number of layers, the number of attention heads, and the internal dimensionality of the FeedForward layer in the Transformer model. The original Transformer paper used a base model with num_layers=6, d_model=512, num_heads=8, and dff=2048. However, for testing purposes, we reduced these numbers:
encoder_input_size = 100
decoder_input_size = 110
encoder_vocab_size = 1000
decoder_vocab_size = 1000
model = Transformer(
input_vocab_size=encoder_vocab_size,
target_vocab_size=decoder_vocab_size,
encoder_input_size=encoder_input_size,
decoder_input_size=decoder_input_size,
num_layers=2,
d_model=512,
num_heads=2,
dff=512,
dropout_rate=0.1)
model.summary()
In the output, it should print the summary of the model:
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_7 (InputLayer) [(None, 100)] 0 []
input_8 (InputLayer) [(None, 110)] 0 []
encoder_4 (Encoder) (None, 100, 512) 5768192 ['input_7[0][0]']
decoder_5 (Decoder) (None, 110, 512) 9971712 ['input_8[0][0]',
'encoder_4[0][0]']
dense_51 (Dense) (None, 110, 1000) 513000 ['decoder_5[0][0]']
==================================================================================================
Total params: 16,252,904
Trainable params: 16,252,904
Non-trainable params: 0
__________________________________________________________________________________________________
So, we have implemented the Transformer model, which we can use with the standard TensorFlow fit and evaluate methods. Remember that the larger your decoder vocabulary, the larger the model will be because of the last Dense layer. So, if you have an extensive vocabulary, you can use a smaller d_model to keep the model size and training time reasonable, or use a character-level vocabulary instead of a word-level one.
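For example, a minimal training sketch on random data could look like the following. It ignores padding masks and the custom learning-rate schedule from the original paper, and the random targets are only there to show the expected shapes; real data preparation is covered in the next tutorial:
# Minimal sketch (illustration only): compile and fit the model on random data,
# without padding masks or the learning-rate schedule from the original paper.
random_encoder_input = np.random.randint(0, encoder_vocab_size, size=(1, encoder_input_size))
random_decoder_input = np.random.randint(0, decoder_vocab_size, size=(1, decoder_input_size))
random_targets = np.random.randint(0, decoder_vocab_size, size=(1, decoder_input_size))

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

model.fit([random_encoder_input, random_decoder_input], random_targets, epochs=1)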
Conclusion:
Walking through this Transformer tutorial series, I provided a comprehensive journey through Transformers, from understanding their basics and limitations to building essential layers like Add & Norm, BaseAttention, CrossAttention, and GlobalSelfAttention. We then constructed the Encoder layer, showcasing the power of residual connections.
The DecoderLayer introduction highlighted its role in sequence-to-sequence tasks, especially with the CrossAttention layer. This set the stage for developing the complete Decoder layer, merging PositionalEmbedding and Dropout for a robust design.
Finally, the tutorial series culminated in a fully-fledged Transformer model, combining the Encoder and Decoder layers. This journey equipped you with the skills to leverage Transformers effectively in various natural language processing tasks.
In the next tutorial, I'll show you how to prepare data to train the Transformer model on language translation tasks.