The technical ABCs of transformers in deep learning

Following the somewhat recent explosion of ChatGPT onto the world stage, the architecture behind the model, namely the Transformer, has come into the limelight. ChatGPT is built on a special type of transformer called a Generative Pre-trained Transformer (GPT), but to understand that kind of neural network, we first have to understand how the general transformer architecture is laid out. This article will take you through just that, but will not deal with actually coding one up from scratch.

Note: use the diagram below this section as a reference while reading.

The transformer takes a sequence as an input and outputs another sequence. Throughout this article, I will be explaining the transformer in terms of one of its most common applications, the translation of sentences.

The input embedding takes the input sequence and turns it into numbers the machine can work with; in other words, it transforms every word of the input sequence into a word embedding. The positional encoding then adds information to each word embedding expressing where the word sits in the sentence, so the model can use word order when deducing meaning.

The word embeddings are fed through an encoder stack, where the number of encoders can vary but is usually around 6. Each encoder includes a multi-head self-attention sublayer and a feed-forward sublayer. The former helps capture dependencies between different word embeddings (input elements), meaning which words are semantically important to which other words, while the latter non-linearly transforms the input while taking the dependencies captured by the self-attention sublayer into account. The Add & Norm blocks will be explained later. The output of one encoder is passed to the next (so the word embeddings are only directly passed to the first encoder) until the final encoder is finished, at which point we turn to the decoders. Important: the encoder stack is run only once per input.
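
As a rough sketch (not actual library code), the encoder stack is just a loop in which each encoder refines the output of the previous one:

```python
def run_encoder_stack(embeddings, encoders):
    """Pass the (position-encoded) word embeddings through every encoder once.

    `encoders` is assumed to be a list of callables, each containing a
    multi-head self-attention sublayer and a feed-forward sublayer
    (each followed by an Add & Norm block)."""
    x = embeddings
    for encoder in encoders:
        x = encoder(x)  # the output of one encoder is the input of the next
    return x            # the final output is later fed to every decoder's encoder-decoder attention
```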

The decoder stack outputs one word each time it is run. In other words, the decoder stack is run as many times as there are elements in the output sequence. The first decoder takes as input the entire output sequence generated up to that point; this means that during the first run-through, the first decoder only takes a “start token” indicating that no outputs have been generated yet. The same embedding and positional encoding process used for the input to the encoders is applied to this sequence before it is fed into the decoder stack. Instead of the sequence generated up to that point, the decoders following the first take the output of the decoder before them as input. Additionally, each decoder takes the output generated at the end of the encoder stack as input to its “encoder-decoder attention” sublayer. The sublayers of a decoder are as follows: a masked multi-head (self-)attention layer, an encoder-decoder attention layer (Multi-Head Attention in the diagram), and a feed-forward layer. At the end of the decoder stack, a linear transformation and a softmax function create a probability distribution over which word the model thinks should come next, and that word gets appended to the output sequence. Then the decoder stack is run again, until the next “word” decided on is an <end token>, signifying the end of the output.
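
To make this run-until-<end> loop concrete, here is a minimal greedy-decoding sketch in Python; decoder_stack, embed, to_logits, and the token ids are hypothetical stand-ins for illustration, not anything prescribed by the architecture:

```python
def greedy_decode(encoder_output, decoder_stack, embed, to_logits,
                  start_id, end_id, max_len=50):
    """Run the decoder stack repeatedly, appending one predicted token per pass."""
    output_ids = [start_id]                       # begin with only the <start> token
    for _ in range(max_len):
        x = embed(output_ids)                     # output embedding + positional encoding
        x = decoder_stack(x, encoder_output)      # every decoder also sees the encoder output
        probs = to_logits(x[-1]).softmax(dim=-1)  # linear + softmax over the vocabulary
        next_id = int(probs.argmax())             # pick the most likely next word
        output_ids.append(next_id)
        if next_id == end_id:                     # the <end> token stops the loop
            break
    return output_ids
```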

The input embedding takes the input sequence as input. It first tokenizes the sequence, which means a sentence like “Hey there, John” would be turned into a list of its constituent elements, plus a start and end token, like this: [“<start>”, “Hey”, “there”, “,”, “John”, “<end>”]. Then this list is one-hot encoded, so each word is represented by a vector with a 1 at that word’s index and 0s everywhere else. Subsequently, a vocabulary is used to turn this one-hot encoding of each word into a semantically meaningful vector. The vectors are usually of size d=512; whatever d is equal to is called the dimensionality of the model as a whole. The vocabulary consists of every possible input element, in our case every possible word, with an associated vector for each word, into which it transforms them. The vocabulary is trained alongside the model, which means the creator of the model doesn’t directly determine what vectors the words are transformed into; the training process does. The output from feeding the one-hot encoded sequence through this vocabulary is a matrix whose rows are the vectors/embeddings representing the input words, each row of length d=512. The matrix therefore has dimensions (number of words in the input sequence) × d=512.
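
As a minimal sketch of that lookup, where a toy word-to-index mapping and PyTorch's nn.Embedding stand in for the trained vocabulary:

```python
import torch
import torch.nn as nn

token_to_id = {"<start>": 0, "Hey": 1, "there": 2, ",": 3, "John": 4, "<end>": 5}  # toy vocabulary
d_model = 512

# The embedding table is trained alongside the model; looking up an index is
# equivalent to multiplying a one-hot vector by the embedding matrix.
embedding = nn.Embedding(num_embeddings=len(token_to_id), embedding_dim=d_model)

tokens = ["<start>", "Hey", "there", ",", "John", "<end>"]
ids = torch.tensor([token_to_id[t] for t in tokens])  # integer indices instead of explicit one-hot vectors
embedded = embedding(ids)                             # shape: (number of words) x d_model
print(embedded.shape)                                 # torch.Size([6, 512])
```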

Illustration of the embedding process. Source: Myself

The output embedding is identical to this, except that it doesn’t require the tokenization step. The output embedding takes as input the sequence of outputs predicted up to that point. The “shifted right” label is just a way of saying that outputs which have not been generated yet aren’t fed into the output embedding. This means that for the first run-through of the decoder stack, it just takes the <start> token.

The linear transformation at the top of the decoder stack maps the decoder output back onto the vocabulary: it produces one score per word in the vocabulary, which the softmax then turns into a probability distribution. In the original paper this layer shares its weight matrix with the embedding layers, which is why it can be thought of as the embedding step in reverse; it turns the numbers it receives from the decoder stack back into words contained in the vocabulary.
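
A small sketch of that final step; the weight sharing with the embedding matrix is how the original paper does it and is shown here only as a comment:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 6
to_logits = nn.Linear(d_model, vocab_size, bias=False)    # the "Linear" block at the top of the decoder stack
# to_logits.weight = embedding.weight                     # optional: share weights with the embedding matrix

decoder_output = torch.randn(d_model)                     # stand-in for the last decoder's output for the newest position
probs = torch.softmax(to_logits(decoder_output), dim=-1)  # one probability per word in the vocabulary
next_word_id = int(probs.argmax())                        # index of the predicted next word
```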

Positional encoding is performed so that the model knows where words are located in the sentence in relation to one another. This is important so the model can create meaningful relationships between the words in the sentence. A positional encoding vector is used for this. It is of equal dimension to the input/output embeddings, d=512, and is added to each word embedding. The positional encoding vector is determined per the following formulas:
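
With pos the position of the word in the sequence and i the dimensional index (both explained below), the sinusoidal formulas from the original paper can be written as:

$$PE_{(pos,\,i)} = \sin\!\left(\frac{pos}{10000^{\,i/d}}\right) \quad \text{for even } i$$

$$PE_{(pos,\,i)} = \cos\!\left(\frac{pos}{10000^{\,(i-1)/d}}\right) \quad \text{for odd } i$$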

We need to generate as many values for the positional encoding vector as the vector has dimensions. The above formulas are applied to generate each of these values independently. In the formulas, i is the dimensional index, i.e. which dimension of the vector we’re currently filling in. For a 512-dimensional positional encoding vector, the first element would have a dimensional index of 0, the second an index of 1, and so on, up to the last element having i=d-1. pos is just the position of the word we’re dealing with in the sequence of input words; for “Hey there, John”, for example, “Hey” would have pos=0 and “there” would have pos=1. The first formula is used for even dimensional indices and the second for odd ones.
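
A compact sketch of the same computation, filling the even and odd dimensional indices with the two formulas:

```python
import torch

def positional_encoding(num_positions, d_model=512):
    """Sinusoidal positional encodings: one d_model-dimensional vector per position."""
    pe = torch.zeros(num_positions, d_model)
    pos = torch.arange(num_positions, dtype=torch.float).unsqueeze(1)  # word positions 0, 1, 2, ...
    i = torch.arange(0, d_model, 2, dtype=torch.float)                 # even dimensional indices
    angle = pos / (10000 ** (i / d_model))
    pe[:, 0::2] = torch.sin(angle)   # first formula: even dimensional indices
    pe[:, 1::2] = torch.cos(angle)   # second formula: odd dimensional indices
    return pe

embeddings = torch.randn(6, 512)                  # stand-in for the embedded "Hey there, John" sequence
embeddings = embeddings + positional_encoding(6)  # positional information is simply added to each embedding
```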

To describe the multi-head attention used in the transformer, we have to explain self-attention first. There are multiple methods to calculate self-attention, but the one we will discuss here is Scaled Dot-Product Attention. Scaled Dot-Product Attention works by creating a key, value, and query vector for each embedding, each of equal dimension, by multiplying the embedding with trained matrices: one for keys, one for values, and one for queries. The dimension of these three vectors is not necessarily the same as the dimension of the embedding. We then compute the dot product of the query vector of the embedding in question with the key vectors of all embeddings, including itself, keeping these dot products separate. After this, we divide all these dot products by the square root of the dimension of the key vectors. We then apply a softmax function to these values, getting back a weight distribution that denotes how important each word in the sequence is to the word we’re calculating self-attention for. Each weight in this distribution is multiplied by its respective value vector, and the weighted value vectors are summed to produce the new representation of the word. The division by the root of the key dimension is where the method gets its “scaled” name from. The technique was first introduced in the same paper proposing the transformer and was designed to keep the dot products of the key and query vectors from growing too large, as large key vector dimensions can create very high values, which when passed through a softmax lead to overly skewed weight distributions. In other words, the scaling keeps the softmax from saturating, which would otherwise produce extremely small gradients.
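
A compact sketch of that computation; the random tensors stand in for embeddings already multiplied by the trained query, key, and value matrices:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: (sequence_length x dimension) matrices built from the embeddings."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))  # dot products, scaled by sqrt(d_k)
    if mask is not None:
        scores = scores + mask                                # -inf at masked (future) positions
    weights = torch.softmax(scores, dim=-1)                   # weight distribution per word
    return weights @ V                                        # weighted sum of value vectors

seq_len, d_k = 6, 64
Q = torch.randn(seq_len, d_k)  # in practice: embeddings multiplied by the trained query matrix
K = torch.randn(seq_len, d_k)  # ... by the trained key matrix
V = torch.randn(seq_len, d_k)  # ... by the trained value matrix
out = scaled_dot_product_attention(Q, K, V)  # one new vector per word in the sequence
```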

The self-attention layer in the decoder (Masked Multi-Head Attention) works identically to the encoder self-attention layer, except that it masks future outputs. Because the decoders don’t take a finished input sequence as input, but rather an incomplete output sequence, we mask future positions by setting their attention scores to -inf before the softmax, so the self-attention layer assigns them zero weight and the model pays no attention to outputs that haven’t been generated yet. This is where the “Mask (opt.)” in the above diagram fits in.
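
A one-line way to build such a mask, usable as the mask argument in the sketch above:

```python
import torch

seq_len = 5
# Future (upper-triangular) positions get -inf, so the softmax assigns them zero weight.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
```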

The “Multi-Head” part of the attention blocks indicates a process of calculating multiple of these attention weight distributions and then combining them to gain a more complete picture of the dependencies within the sequence. After the one key, value, and query vector are generated for each embedding, usually h=8 different vectors (heads) are generated from these, using h=8 different trained key, value, and query matrices (so a total of 24 different key, value, and query vectors and 24 different matrices, plus the 3 used to create the original key, value, and query vectors). Then Scaled Dot-Product Attention is run for each head, the results are concatenated, and the concatenation is run through a linear transformation, using a matrix trained alongside the model, to end up with the complete multi-head weight distribution.

The formulas below explain the process:
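
As written in the original “Attention Is All You Need” paper:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O$$

$$\text{head}_i = \text{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$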

Here, W^O is the matrix performing the linear transformation after concatenation, and head_i is the i’th head inserted into the upper formula, for i from 1 to h. The bottom formula explains the process of creating the h=8 different key, value, and query vectors from the initial one of each, by multiplying by the W_i^Q, W_i^K, and W_i^V matrices respectively, where h=8 of each exist.

The above would become very computationally expensive with high-dimensional key, value, and query vectors. To counteract this, the dimensions these vectors would have in a single-head attention model are divided by h, giving comparable total computation time.

The effect of the multi-head process is that the weight distribution is less concentrated on a few words, because each key, value, and query vector set is created from distinct matrices that pick up different attributes during training. A less concentrated distribution is preferable because it allows the model to take more inter-word dependencies into account than if one word received nearly all the weight.
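
A sketch of the whole multi-head computation, reusing the scaled_dot_product_attention function from the earlier sketch and keeping each head’s dimension at d_model / h = 64:

```python
import torch

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h=8):
    """W_q, W_k, W_v: lists of h trained projection matrices; W_o: the trained output matrix."""
    heads = []
    for i in range(h):
        head = scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
        heads.append(head)                 # each head captures different dependencies
    return torch.cat(heads, dim=-1) @ W_o  # concatenate and linearly transform with W^O

d_model, h = 512, 8
d_head = d_model // h                      # 64: keeps the computation comparable to one big head
W_q = [torch.randn(d_model, d_head) for _ in range(h)]
W_k = [torch.randn(d_model, d_head) for _ in range(h)]
W_v = [torch.randn(d_model, d_head) for _ in range(h)]
W_o = torch.randn(h * d_head, d_model)
x = torch.randn(6, d_model)                               # position-encoded embeddings
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)   # self-attention: Q, K, V all come from x
```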

Encoder-decoder attention is the layer in the decoder named “Multi-Head Attention”, which takes inputs from the encoder stack and from the attention layer below it. This layer works identically to the multi-head self-attention layers described previously, except that only the query vectors are generated from the decoder side (the output of the attention layer below), while the key and value vectors are generated from the output of the encoder stack. This is the only place information is transferred from the encoder stack to the decoder stack, and it is therefore what allows the decoder stack to interact with the input sequence.
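
In terms of the earlier sketch (reusing multi_head_attention and its weight matrices), the only difference is where Q, K, and V come from:

```python
decoder_x = torch.randn(3, 512)    # stand-in for the masked self-attention output (3 words generated so far)
encoder_out = torch.randn(6, 512)  # stand-in for the final encoder-stack output (6 input words)

# Queries come from the decoder; keys and values come from the encoder output.
out = multi_head_attention(decoder_x, encoder_out, encoder_out, W_q, W_k, W_v, W_o)
```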

At the end of each encoder and decoder is a feed-forward network that applies a non-linear transformation (through an activation function like ReLU, or GeLU in later variants) to the embeddings created by the attention layers. Each encoder and decoder has its own feed-forward network that is trained alongside the model.
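
As a sketch, such a network is just two linear layers with an activation in between; the inner dimension of 2048 is the value used in the original paper:

```python
import torch.nn as nn

feed_forward = nn.Sequential(
    nn.Linear(512, 2048),  # expand each position's vector
    nn.ReLU(),             # non-linear activation (the original paper uses ReLU)
    nn.Linear(2048, 512),  # project back down to the model dimension
)
# Applied independently to every position (row) of the attention output.
```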

The Add & Norm layers follow every sublayer in the encoders and decoders. They take a residual connection from before the sublayer, alongside the output of the sublayer, as inputs. These are added together and normalized through layer normalization.
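
In formula form, as given in the original paper, the output of each Add & Norm block is:

$$\text{LayerNorm}(x + \text{Sublayer}(x))$$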

In the above formula, x is the input into the sublayer in question, and Sublayer() is the function that sublayer computes. Layer normalization works by performing the following operations on its input: subtracting the mean and dividing by the standard deviation. This changes the mean to 0 and the variance to 1. In a deep architecture like the transformer, the problem of internal covariate shift pops up: changes made to earlier parts of the network through backpropagation change the values that later parts of the architecture receive as input. These changes propagate through deep architectures, confusing the later layers and making them more difficult and time-consuming to train. Normalizing the values so they stay somewhat “similar” helps the following layers cope with otherwise large, shifting inputs. This is why layer normalization decreases training time.
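
A small sketch of the Add & Norm computation; nn.LayerNorm is the standard PyTorch implementation, which additionally learns a scale and shift:

```python
import torch
import torch.nn as nn

x = torch.randn(6, 512)             # the sublayer's input (carried over via the residual connection)
sublayer_out = torch.randn(6, 512)  # stand-in for the sublayer's output
added = x + sublayer_out            # the "Add" part

# The "Norm" part: per position, subtract the mean and divide by the standard deviation.
normalized = (added - added.mean(dim=-1, keepdim=True)) / added.std(dim=-1, keepdim=True)

layer_norm = nn.LayerNorm(512)      # the usual way to do this in practice
out = layer_norm(added)
```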

Understanding Transformers can feel pretty daunting. They combine an assortment of different complex techniques to create a model that has proven itself to work extremely well. Regardless, I thank you for taking the time to read my article, and I hope it has helped you understand transformers a little bit better.
