The Illustrated Transformer (Translated)
In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model on specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud's recommendation to use The Transformer as a reference model for their Cloud TPU offering. So let's try to break the model apart and look at how it functions.
The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard's NLP group created a guide annotating the paper with a PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one, to hopefully make it easier to understand for people without in-depth knowledge of the subject matter.
A High-Level Look
Let's begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.
Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.
The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there's nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.
The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:
The encoder's inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We'll look closer at self-attention later in the post.
The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).
Bringing The Tensors Into The Picture
Now that we've seen the major components of the model, let's start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.
As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
Each word is embedded into a vector of size 512. We'll represent those vectors with these simple boxes.
The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512 – in the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that's directly below. The size of this list is a hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.
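To make the embedding step concrete, here is a minimal NumPy sketch of an embedding lookup. The toy vocabulary, the sample sentence, and the random table are assumptions for illustration only; the vector size of 512 is the one used throughout this post.

```python
import numpy as np

d_model = 512                                  # embedding size used in this post
vocab = {"je": 0, "suis": 1, "etudiant": 2}    # toy vocabulary (assumption)

rng = np.random.default_rng(0)
# In a real model this table is learned during training; random here for illustration.
embedding_table = rng.normal(size=(len(vocab), d_model))

sentence = ["je", "suis", "etudiant"]
x = np.stack([embedding_table[vocab[w]] for w in sentence])
print(x.shape)   # (3, 512): one 512-dimensional vector per input word
```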
After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
Next, we'll switch up the example to a shorter sentence and we'll look at what happens in each sub-layer of the encoder.
Now We’re Encoding!
As we've mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a 'self-attention' layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.
The word at each position passes through a self-attention process. Then, they each pass through a feed-forward neural network -- the exact same network with each vector flowing through it separately.
Self-Attention at a High Level
Don't be fooled by me throwing around the word "self-attention" like it's a concept everyone should be familiar with. I had personally never come across the concept until reading the Attention is All You Need paper. Let us distill how it works.
Say the following sentence is an input sentence we want to translate:
"The animal didn't cross the street because it was too tired"
What does "it" in this sentence refer to? Is it referring to the street or to the animal? It's a simple question to a human, but not as simple to an algorithm.
When the model is processing the word "it", self-attention allows it to associate "it" with "animal".
As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.
If you're familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it's processing. Self-attention is the method the Transformer uses to bake the "understanding" of other relevant words into the one we're currently processing.
As we are encoding the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism was focusing on "The Animal", and baked a part of its representation into the encoding of "it".
Be sure to check out the Tensor2Tensor notebook where you can load a Transformer model, and examine it using this interactive visualization.
Self-Attention in Detail
Let's first look at how to calculate self-attention using vectors, then proceed to look at how it's actually implemented – using matrices.
The first step in calculating self-attention is to create three vectors from each of the encoder's input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.
Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don't HAVE to be smaller, this is an architecture choice to make the computation of multiheaded attention (mostly) constant.
Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
What are the "query", "key", and "value" vectors?
They're abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you'll know pretty much all you need to know about the role each of these vectors plays.
The second step in calculating self-attention is to calculate a score. Say we're calculating the self-attention for the first word in this example, "Thinking". We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
The score is calculated by taking the dot product of the query vector with the key vector of the respective word we're scoring. So if we're processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.
The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add up to 1.
This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it's useful to attend to another word that is relevant to the current word.
The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).
The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
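Here is a minimal NumPy sketch of those six steps for position #1, assuming a two-word input. The random embeddings and weight matrices are stand-ins for trained parameters; only the dimensions (512 and 64) and the division by 8 come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64

# Toy embeddings for "Thinking" and "Machines" (random stand-ins)
x = rng.normal(size=(2, d_model))

# Trained projection matrices in a real model; randomly initialized here
W_Q, W_K, W_V = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
                 for _ in range(3))

q = x @ W_Q                                   # step 1: query vectors
k = x @ W_K                                   #         key vectors
v = x @ W_V                                   #         value vectors

scores = q[0] @ k.T                           # step 2: score position #1 against every key
scores = scores / np.sqrt(d_k)                # step 3: divide by 8 (= sqrt(64))
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()             # step 4: softmax
z1 = (weights[:, None] * v).sum(axis=0)       # steps 5 and 6: weight the values, then sum
print(z1.shape)                               # (64,): self-attention output for position #1
```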
That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing. So let's look at that now that we've seen the intuition of the calculation on the word level.
Matrix Calculation of Self-Attention
The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we've trained (WQ, WK, WV).
Every row in the X matrix corresponds to a word in the input sentence. We again see the difference in size of the embedding vector (512, or 4 boxes in the figure), and the q/k/v vectors (64, or 3 boxes in the figure).
Finally, since we're dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.
The self-attention calculation in matrix form
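That condensed formula is softmax(Q K^T / sqrt(d_k)) V. A minimal NumPy sketch of it, under the same assumptions as before (two input words, random stand-in weights):

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax, numerically stabilized
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Condenses steps two through six: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
X = rng.normal(size=(2, d_model))                      # one row per input word
W_Q, W_K, W_V = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
                 for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)                                         # (2, 64): one output row per word
```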
The Beast With Many Heads
The paper further refined the self-attention layer by adding a mechanism called "multi-headed" attention. This improves the performance of the attention layer in two ways:
1. It expands the model's ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. It would be useful if we're translating a sentence like "The animal didn't cross the street because it was too tired", we would want to know which word "it" refers to.
2. It gives the attention layer multiple "representation subspaces". As we'll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.
With multi-headed attention, we maintain separate Q/K/V weight matrices for each head, resulting in different Q/K/V matrices. As we did before, we multiply X by the WQ/WK/WV matrices to produce Q/K/V matrices.
If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.
This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it's expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.
How do we do that? We concat the matrices then multiply them by an additional weights matrix WO.
That's pretty much all there is to multi-headed self-attention. It's quite a handful of matrices, I realize. Let me try to put them all in one visual so we can look at them in one place.
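Here is one way that could look in NumPy: eight heads, each with its own randomly initialized stand-in W_Q/W_K/W_V, whose outputs are concatenated and projected with an additional W_O matrix. The initialization scales are assumptions for illustration.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
n_heads, d_model, d_k = 8, 512, 64
X = rng.normal(size=(2, d_model))                      # two input words

head_outputs = []
for _ in range(n_heads):
    # Each head keeps its own Q/K/V weight matrices
    W_Q, W_K, W_V = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
                     for _ in range(3))
    head_outputs.append(attention(X @ W_Q, X @ W_K, X @ W_V))   # Z_i: (2, 64)

W_O = rng.normal(scale=(n_heads * d_k) ** -0.5, size=(n_heads * d_k, d_model))
Z = np.concatenate(head_outputs, axis=-1) @ W_O        # concat, then project back
print(Z.shape)                                         # (2, 512): one row per word
```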
Now that we have touched upon attention heads, let's revisit our example from before to see where the different attention heads are focusing as we encode the word "it" in our example sentence:
As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired" -- in a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".
If we add all the attention heads to the picture, however, things can be harder to interpret:
Representing The Order of The Sequence Using Positional Encoding
One thing that's missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.
To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they're projected into Q/K/V vectors and during dot-product attention.
To give the model a sense of the order of the words, we add positional encoding vectors -- the values of which follow a specific pattern.
If we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:
A real example of positional encoding with a toy embedding size of 4
What might this pattern look like?
In the following figure, each row corresponds to a positional encoding of a vector. So the first row would be the vector we'd add to the embedding of the first word in an input sequence. Each row contains 512 values – each with a value between 1 and -1. We've color-coded them so the pattern is visible.
A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That's because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They're then concatenated to form each of the positional encoding vectors.
The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).
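A minimal sketch of the sinusoidal formula from section 3.5 of the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Note that this version interleaves sine and cosine along the embedding dimension, while the Tensor2Tensor get_timing_signal_1d() variant pictured above concatenates a sine half and a cosine half; both follow the same underlying pattern.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings as described in section 3.5 of the paper."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
    angle = pos / np.power(10000, 2 * i / d_model)    # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                       # even indices use sine
    pe[:, 1::2] = np.cos(angle)                       # odd indices use cosine
    return pe

pe = positional_encoding(max_len=20, d_model=512)
print(pe.shape)              # (20, 512): one encoding vector per position
# These vectors are simply added to the word embeddings:
# x = word_embeddings + pe[:len(sentence)]
```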
The Residuals
One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
If we're to visualize the vectors and the layer-norm operation associated with self-attention, it would look like this:
This goes for the sub-layers of the decoder as well. If we're to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:
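A rough sketch of that wrapper, LayerNorm(x + Sublayer(x)). The stand-in sub-layer and the omission of layer norm's learned gain and bias parameters are simplifications for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 512))                            # two positions, d_model = 512
W = rng.normal(scale=512 ** -0.5, size=(512, 512))       # stand-in weights
ffnn = lambda h: np.maximum(0.0, h @ W)                  # toy feed-forward sub-layer

out = sublayer_connection(x, ffnn)
print(out.shape)                                         # (2, 512)
```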
The Decoder Side
Now that we've covered most of the concepts on the encoder side, we basically know how the components of decoders work as well. But let's take a look at how they work together.
The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its "encoder-decoder attention" layer, which helps the decoder focus on appropriate places in the input sequence:
After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element from the output sequence (the English translation sentence in this case).
The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.
The self-attention layers in the decoder operate in a slightly different way than the one in the encoder:
In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
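A small sketch of that masking step: future positions are set to -inf before the softmax, so their attention weights come out as zero. The toy score matrix is random, for illustration only.

```python
import numpy as np

def masked_softmax(scores):
    """Mask future positions (set them to -inf) before the softmax, so each
    position can only attend to itself and to earlier positions."""
    seq_len = scores.shape[-1]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # above the diagonal
    scores = np.where(future, -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
toy_scores = rng.normal(size=(4, 4))       # decoder self-attention scores for 4 positions
print(np.round(masked_softmax(toy_scores), 2))
# Row i has non-zero weights only in columns 0..i
```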
The "Encoder-Decoder Attention" layer works just like multi-headed self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
The Final Linear and Softmax Layer
The decoder stack outputs a vector of floats. How do we turn that into a word? That's the job of the final Linear layer, which is followed by a Softmax layer.
The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector.
Let's assume that our model knows 10,000 unique English words (our model's "output vocabulary") that it's learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.
The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
This figure starts from the bottom with the vector produced as the output of the decoder stack. It is then turned into an output word.
Recap Of Training
Now that we've covered the entire forward-pass process through a trained Transformer, it would be useful to glance at the intuition of training the model.
During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.
To visualize this, let's assume our output vocabulary only contains six words ("a", "am", "i", "thanks", "student", and "<eos>" (short for 'end of sentence')).
The output vocabulary of our model is created in the preprocessing phase, before we even begin training.
Once we define our output vocabulary, we can use a vector of the same width to indicate each word in our vocabulary. This is also known as one-hot encoding. So for example, we can indicate the word "am" using the following vector:
Example: one-hot encoding of our output vocabulary
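A tiny sketch of that one-hot encoding over the toy six-word vocabulary:

```python
import numpy as np

output_vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def one_hot(word):
    vec = np.zeros(len(output_vocab))
    vec[output_vocab.index(word)] = 1.0
    return vec

print(one_hot("am"))    # [0. 1. 0. 0. 0. 0.]
```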
The Loss Function
Say we are training our model. Say it's our first step in the training phase, and we're training it on a simple example – translating "merci" into "thanks".
What this means is that we want the output to be a probability distribution indicating the word "thanks". But since this model is not yet trained, that's unlikely to happen just yet.
Since the model's parameters (weights) are all initialized randomly, the (untrained) model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output, then tweak all the model's weights using backpropagation to make the output closer to the desired output.
How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback–Leibler divergence.
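In practice that comparison is usually done with cross-entropy between the target distribution and the model's distribution. A small sketch with made-up numbers for the "merci" into "thanks" example over the toy six-word vocabulary:

```python
import numpy as np

def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum(target * log(predicted))."""
    return -np.sum(target * np.log(predicted + eps))

# Target distribution: all of the probability mass on "thanks"
target = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])

# Untrained model: roughly uniform, almost no mass on "thanks"
untrained = np.array([0.20, 0.20, 0.15, 0.05, 0.20, 0.20])
# After training: most of the mass on "thanks"
trained = np.array([0.01, 0.01, 0.02, 0.93, 0.01, 0.02])

print(round(cross_entropy(target, untrained), 3))   # large loss (about 3.0)
print(round(cross_entropy(target, trained), 3))     # small loss (about 0.073)
```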
But note that this is an oversimplified example. More realistically, we'll use a sentence longer than one word. For example – input: "je suis étudiant" and expected output: "i am a student". What this really means is that we want our model to successively output probability distributions where:
- Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 3,000 or 10,000)
- The first probability distribution has the highest probability at the cell associated with the word "i"
- The second probability distribution has the highest probability at the cell associated with the word "am"
- And so on, until the fifth output distribution indicates the '<end of sentence>' symbol, which also has a cell associated with it from the 10,000 element vocabulary.
After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:
Hopefully upon training, the model would output the right translation we expect. Of course it's no real indication if this phrase was part of the training dataset (see: cross validation). Notice that every position gets a little bit of probability even if it's unlikely to be the output of that time step -- that's a very useful property of softmax which helps the training process.
Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That's one way to do it (called greedy decoding). Another way to do it would be to hold on to, say, the top two words (say, 'I' and 'a' for example), then in the next step, run the model twice: once assuming the first output position was the word 'I', and another time assuming the first output position was the word 'a', and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3, and so on. This method is called "beam search", where in our example, beam_size was two (because we compared the results after calculating the beams for positions #1 and #2), and top_beams is also two (since we kept two words). These are both hyperparameters that you can experiment with.
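A sketch of greedy decoding with a scripted stand-in for the model; the step function and its scripted predictions are purely illustrative, not part of any real Transformer API. Beam search would instead keep the top beam_size partial outputs at each step and re-score them, which is omitted here for brevity.

```python
import numpy as np

output_vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def greedy_decode(next_word_probs, max_len=10):
    """Greedy decoding: at every time step keep only the highest-probability word.
    `next_word_probs(prefix)` stands in for a full decoder pass that returns a
    probability distribution over the output vocabulary."""
    output = ["<bos>"]
    for _ in range(max_len):
        probs = next_word_probs(output)
        word = output_vocab[int(np.argmax(probs))]
        output.append(word)
        if word == "<eos>":
            break
    return output[1:]

# Toy stand-in "model": always predicts "i am a student <eos>", step by step
scripted = iter([2, 1, 0, 4, 5])
next_word_probs = lambda prefix: np.eye(len(output_vocab))[next(scripted)]

print(greedy_decode(next_word_probs))   # ['i', 'am', 'a', 'student', '<eos>']
```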
Go Forth And Transform
I hope you've found this a useful place to start to break the ice with the major concepts of the Transformer. If you want to go deeper, I'd suggest these next steps:
- Read the Attention Is All You Need paper, the Transformer blog post (Transformer: A Novel Neural Network Architecture for Language Understanding), and the Tensor2Tensor announcement.
- Watch Łukasz Kaiser's talk walking through the model and its details.
- Play with the Jupyter Notebook provided as part of the Tensor2Tensor repo.
- Explore the Tensor2Tensor repo.
Acknowledgements
Thanks to Illia Polosukhin, Jakob Uszkoreit, Llion Jones, Lukasz Kaiser, Niki Parmar, and Noam Shazeer for providing feedback on earlier versions of this post.
Please hit me up on Twitter for any corrections or feedback.
Written on June 27, 2018
Original post: https://jalammar.github.io/illustrated-transformer/