The Illustrated Transformer (Chinese–English; read the original, as many of the translations are wrong)
In the previous post, we looked at attention – a method that is ubiquitous in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model on specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud's recommendation to use The Transformer as a reference model to use their Cloud TPU offering. So let's try to break the model apart and look at how it functions.
The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard's NLP group created a guide annotating the paper with a PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one, to hopefully make it easier to understand for people without in-depth knowledge of the subject matter.
A High-Level Look
Let's begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language and output its translation in another.
Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.
The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.
The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.
The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).
Bringing The Tensors Into The Picture
Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.
As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512 – in the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. The size of this list is a hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.
After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
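To make that concrete, here is a minimal NumPy sketch (a toy illustration, not the paper's code) of a position-wise feed-forward layer: the same two linear transformations with a ReLU in between are applied to every position's vector independently, so all positions can be processed in parallel. The dimensions follow the paper (d_model = 512, d_ff = 2048); the weights are random stand-ins.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    # The exact same two-layer network is applied to each position's vector
    # independently, so every row of X can be processed in parallel.
    return np.maximum(0, X @ W1 + b1) @ W2 + b2   # ReLU between the two linear layers

np.random.seed(0)
d_model, d_ff, seq_len = 512, 2048, 3            # d_ff = 2048 as in the paper
X = np.random.randn(seq_len, d_model)            # one 512-d vector per word
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)
print(position_wise_ffn(X, W1, b1, W2, b2).shape)  # (3, 512)
```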
Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.
Now We’re Encoding!
As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.
Self-Attention at a High Level
Don’t be fooled by me throwing around the word “self-attention” like it’s a concept everyone should be familiar with. I had personally never come across the concept until reading the Attention is All You Need paper. Let us distill how it works.
Say the following sentence is an input sentence we want to translate:
“The animal didn't cross the street because it was too tired”
What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.
When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.
As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.
If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.
Be sure to check out the Tensor2Tensor notebook where you can load a Transformer model, and examine it using this interactive visualization.
Self-Attention in Detail
Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented – using matrices.
The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.
Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller, this is an architecture choice to make the computation of multiheaded attention (mostly) constant.
What are the “query”, “key”, and “value” vectors?
They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.
The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.
The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.
The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).
The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
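To tie the six steps together, here is a minimal NumPy sketch for a toy two-word sentence. It is an illustration with random stand-in weights, not the reference implementation, but it follows the steps exactly: create q/k/v, score, divide by 8 (√64), softmax, weight the value vectors, and sum.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

d_model, d_k = 512, 64
np.random.seed(0)

# Toy stand-ins for trained weight matrices and two word embeddings ("Thinking", "Machines").
W_Q = np.random.randn(d_model, d_k) * 0.01
W_K = np.random.randn(d_model, d_k) * 0.01
W_V = np.random.randn(d_model, d_k) * 0.01
x1, x2 = np.random.randn(d_model), np.random.randn(d_model)

# Step 1: create query/key/value vectors for each word.
q1, k1, v1 = x1 @ W_Q, x1 @ W_K, x1 @ W_V
k2, v2 = x2 @ W_K, x2 @ W_V

# Steps 2-4: score position #1 against every position, divide by sqrt(d_k) = 8, softmax.
scores = np.array([q1 @ k1, q1 @ k2]) / np.sqrt(d_k)
weights = softmax(scores)

# Steps 5-6: multiply each value vector by its weight and sum to get z1, the output for position #1.
z1 = weights[0] * v1 + weights[1] * v2
print(z1.shape)  # (64,)
```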
Matrix Calculation of Self-Attention
The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).
Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.
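As a rough sketch of that condensed formula – softmax(Q Kᵀ / √d_k) V – here is what the matrix form might look like in NumPy (toy sizes, with random weights standing in for trained parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Steps two through six condensed: softmax(Q K^T / sqrt(d_k)) V
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # one row of scores per position
    return softmax(scores, axis=-1) @ V    # weighted sum of value vectors

# Toy usage: 2 words, d_model = 512, d_k = d_v = 64.
np.random.seed(0)
X = np.random.randn(2, 512)
W_Q, W_K, W_V = (np.random.randn(512, 64) * 0.01 for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)  # (2, 64) -- one z vector per position
```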
The Beast With Many Heads
The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves the performance of the attention layer in two ways:
It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. It would be useful if we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, we would want to know which word “it” refers to.
It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.
If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices
This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.
How do we do that? We concatenate the matrices then multiply them by an additional weights matrix WO.
That’s pretty much all there is to multi-headed self-attention. It’s quite a handful of matrices, I realize. Let me try to put them all in one visual so we can look at them in one place
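Here is one way the whole multi-headed computation might be sketched in NumPy (eight heads of size 64 and random stand-in weights; a real implementation would batch the heads into single tensors rather than looping):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_self_attention(X, heads, W_O):
    # Each head has its own (W_Q, W_K, W_V); run self-attention once per head.
    Zs = [attention(X @ W_Q, X @ W_K, X @ W_V) for (W_Q, W_K, W_V) in heads]
    # Concatenate the eight Z matrices and project back to d_model with W_O.
    return np.concatenate(Zs, axis=-1) @ W_O

np.random.seed(0)
d_model, d_k, n_heads, seq_len = 512, 64, 8, 2
X = np.random.randn(seq_len, d_model)
heads = [tuple(np.random.randn(d_model, d_k) * 0.01 for _ in range(3)) for _ in range(n_heads)]
W_O = np.random.randn(n_heads * d_k, d_model) * 0.01
out = multi_head_self_attention(X, heads, W_O)
print(out.shape)  # (2, 512) -- one d_model-sized vector per position, ready for the feed-forward layer
```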
Now that we have touched upon attention heads, let’s revisit our example from before to see where the different attention heads are focusing as we encode the word “it” in our example sentence:
If we add all the attention heads to the picture, however, things can be harder to interpret:
Representing The Order of The Sequence Using Positional Encoding
One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.
To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.
If we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:
What might this pattern look like?
In the following figure, each row corresponds to the positional encoding of a vector. So the first row would be the vector we’d add to the embedding of the first word in an input sequence. Each row contains 512 values – each with a value between 1 and -1. We’ve color-coded them so the pattern is visible.
The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).
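For illustration, here is a minimal NumPy sketch of the sinusoidal formula from section 3.5 of the paper (interleaving sine and cosine across the depth; get_timing_signal_1d() may arrange the channels slightly differently, so treat this as one possible implementation rather than the reference code):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512) -- one row per position, values between -1 and 1
# embeddings_with_position = word_embeddings + pe[:seq_len]   # added to the embeddings, not concatenated
```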
The Residuals
One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
If we’re to visualize the vectors and the layer-norm operation associated with self attention, it would look like this:
This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:
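A minimal sketch of that add-and-normalize wrapper, LayerNorm(x + Sublayer(x)), with layer normalization's learned gain and bias omitted for brevity (the sub-layer here is just a placeholder linear map):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean / unit variance
    # (the trained gain and bias parameters are omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sub-layer, followed by layer normalization.
    return layer_norm(x + sublayer(x))

np.random.seed(0)
x = np.random.randn(2, 512)
W = np.random.randn(512, 512) * 0.01
out = add_and_norm(x, lambda h: h @ W)   # stand-in for self-attention or the ffnn
print(out.shape)  # (2, 512)
```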
The Decoder Side
Now that we’ve covered most of the concepts on the encoder side, we basically know how the components of decoders work as well. But let’s take a look at how they work together.
The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.
The self attention layers in the decoder operate in a slightly different way than the one in the encoder:
In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
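A small NumPy sketch of that masking step (an illustration, not the reference code): scores for future positions are set to -inf, so they receive zero weight after the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention_weights(Q, K):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Mask future positions: position i may only attend to positions <= i.
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf               # -inf becomes 0 after the softmax
    return softmax(scores, axis=-1)

np.random.seed(0)
Q = np.random.randn(4, 64)
K = np.random.randn(4, 64)
print(np.round(masked_self_attention_weights(Q, K), 2))
# Row 0 puts all its weight on position 0; row 1 only on positions 0-1; and so on.
```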
The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
The Final Linear and Softmax Layer
The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer which is followed by a Softmax Layer.
The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.
Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.
The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
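A toy sketch of the final Linear + Softmax step in NumPy, using a made-up six-word vocabulary and random weights just to show the shapes involved (a real model projects to tens of thousands of logits):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # toy output vocabulary
d_model, vocab_size = 512, len(vocab)

np.random.seed(0)
decoder_output = np.random.randn(d_model)                # vector from the top of the decoder stack
W_proj = np.random.randn(d_model, vocab_size) * 0.01     # the final Linear layer (random stand-in)
b_proj = np.zeros(vocab_size)

logits = decoder_output @ W_proj + b_proj   # one score per word in the vocabulary
probs = softmax(logits)                     # all positive, sums to 1.0
print(vocab[int(np.argmax(probs))])         # the word emitted at this time step
```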
Recap Of Training
Now that we’ve covered the entire forward-pass process through a trained Transformer, it would be useful to glance at the intuition of training the model.
During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.
To visualize this, let’s assume our output vocabulary only contains six words (“a”, “am”, “i”, “thanks”, “student”, and “<eos>” (short for ‘end of sentence’)).
Once we define our output vocabulary, we can use a vector of the same width to indicate each word in our vocabulary. This is also known as one-hot encoding. So for example, we can indicate the word “am” using the following vector:
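A quick sketch of building that one-hot vector, using the toy six-word vocabulary:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("am"))  # [0. 1. 0. 0. 0. 0.]
```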
Following this recap, let’s discuss the model’s loss function – the metric we are optimizing during the training phase to lead up to a trained and hopefully amazingly accurate model.
The Loss Function
Say we are training our model. Say it’s our first step in the training phase, and we’re training it on a simple example – translating “merci” into “thanks”.
What this means, is that we want the output to be a probability distribution indicating the word “thanks”. But since this model is not yet trained, that’s unlikely to happen just yet.
How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback–Leibler divergence.
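As a rough sketch (with made-up numbers), cross-entropy between the one-hot target and the model's predicted distribution can be computed like this:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
target = np.array([0., 0., 0., 1., 0., 0.])            # one-hot for "thanks"
predicted = np.array([0.1, 0.1, 0.2, 0.3, 0.2, 0.1])   # an untrained model's guess (made-up)

# Cross-entropy: -sum(target * log(predicted)); lower is better.
cross_entropy = -np.sum(target * np.log(predicted + 1e-12))
print(round(cross_entropy, 3))  # ~1.204 -- training pushes this toward 0
```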
But note that this is an oversimplified example. More realistically, we’ll use a sentence longer than one word. For example – input: “je suis étudiant” and expected output: “i am a student”. What this really means, is that we want our model to successively output probability distributions where:
- Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 3,000 or 10,000)
- The first probability distribution has the highest probability at the cell associated with the word “i”
- The second probability distribution has the highest probability at the cell associated with the word “am”
- And so on, until the fifth output distribution indicates the ‘<end of sentence>’ symbol, which also has a cell associated with it from the 10,000 element vocabulary.
After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:
Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That’s one way to do it (called greedy decoding). Another way to do it would be to hold on to, say, the top two words (say, ‘I’ and ‘a’ for example), then in the next step, run the model twice: once assuming the first output position was the word ‘I’, and another time assuming the first output position was the word ‘a’, and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3…etc. This method is called “beam search”, where in our example, beam_size was two (because we compared the results after calculating the beams for positions #1 and #2), and top_beams is also two (since we kept two words). These are both hyperparameters that you can experiment with.
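Here is a minimal sketch of greedy decoding over some made-up per-step distributions; beam search would instead keep the top beam_size candidates at each step and re-score them.

```python
import numpy as np

def greedy_decode(step_probabilities, vocab):
    # At each time step, pick the single highest-probability word and discard the rest.
    return [vocab[int(np.argmax(p))] for p in step_probabilities]

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
# Hypothetical per-step output distributions from a trained model for "je suis étudiant".
steps = np.array([
    [0.01, 0.02, 0.93, 0.02, 0.01, 0.01],   # "i"
    [0.02, 0.90, 0.03, 0.02, 0.02, 0.01],   # "am"
    [0.90, 0.02, 0.03, 0.02, 0.02, 0.01],   # "a"
    [0.01, 0.01, 0.02, 0.02, 0.93, 0.01],   # "student"
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.95],   # "<eos>"
])
print(greedy_decode(steps, vocab))  # ['i', 'am', 'a', 'student', '<eos>']
```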
Go Forth And Transform
I hope you’ve found this a useful place to start to break the ice with the major concepts of the Transformer. If you want to go deeper, I’d suggest these next steps:
- Read the Attention Is All You Need paper, the Transformer blog post (Transformer: A Novel Neural Network Architecture for Language Understanding), and the Tensor2Tensor announcement.
- Watch Łukasz Kaiser’s talk walking through the model and its details.
- Play with the Jupyter Notebook provided as part of the Tensor2Tensor repo.
- Explore the Tensor2Tensor repo.