BERT Paper Notes (Part 1): Pre-training of Deep Bidirectional Transformers for Language Understanding
The three core ideas of BERT:
- pre-training
- bidirectional ==> alleviates the unidirectionality constraint of fine-tuning based approaches by using a "masked language model" (MLM) pre-training objective
- fine-tuning
==> Why use a mask?
==> Current methods, e.g. OpenAI GPT, use a left-to-right architecture, which imposes a unidirectionality constraint
=> the masked language model relaxes this unidirectionality and allows bidirectional conditioning
==> but always masking with the real [MASK] token creates a mismatch between pre-training and fine-tuning
==> to mitigate this, the "masked" words are not always replaced with the actual [MASK] token: 80% of the chosen tokens are replaced with [MASK], 10% with a random token, and the remaining 10% are left unchanged
==> language modeling cannot directly capture sentence relationships
==> to train a model that understands sentence relationships, BERT also pre-trains on a binarized next sentence prediction (NSP) task
==> the traditional two steps (encode the text pair + apply bidirectional cross attention) ==> BERT does both in a single step with self-attention
==> to compensate for the weakness of unidirectional (LTR) models, several attempts were made:
attempt 1: add a randomly initialized BiLSTM on top of the LTR model.
==> the BiLSTM on top of the LTR model does improve results, but performance is still far worse than the pre-trained bidirectional models!
attempt 2: LTR + RTL, i.e. train separate left-to-right and right-to-left models and represent each token as the concatenation of the two models, as ELMo does.
problems:
a. LTR + RTL --> the cost per token representation is twice that of a single bidirectional model
b. non-intuitive for tasks like QA, since the right-to-left model cannot condition the answer on the question <-- the answer appears before the question in that direction
c. strictly speaking, it is less powerful than a deep bidirectional model, which can use both left and right context at every layer.
Contents
Abstract
1. Introduction
feature-based approach and fine-tuning approach
The contributions of our paper
2. Related Work
2.1 Unsupervised Feature-based Approaches
2.2 Unsupervised Fine-tuning Approaches
2.3 Transfer Learning from Supervised Data
3. BERT framework
Model Architecture of BERT
Input/Output Representations
tokenization: WordPiece embeddings
we differentiate the sentences in two ways
BERT input representation
3.1 Pre-training BERT
Task #1: Masked LM
Task #2: Next Sentence Prediction (NSP)
3.2 Fine-tuning BERT
4. Experiments
4.1 GLUE
4.2 SQuAD v1.1
in the question answering task
4.3 SQuAD v2.0
4.4 SWAG
5. Ablation Studies
5.1 Effect of Pre-training Tasks
No NSP
LTR & No NSP
+BiLSTM
5.2 Effect of Model Size
5.3 Feature-based approach with BERT
the feature-based approach has certain advantages:
To ablate the fine-tuning approach,
6. Conclusion
Abstract
BERT (Bidirectional Encoder Representations from Transformers) is a new language representation model.
BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left (left-to-right) and right (right-to-left) context in all layers.
The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, without substantial task-specific architecture modifications.
In short, BERT is conceptually simple and empirically powerful.
1. Introduction
Language model pre-training has been shown to be effective for improving many natural language processing tasks, for example:
- sentence-level tasks, e.g. natural language inference (NLI) and paraphrasing --> which aim to predict the relationships between sentences by analyzing them holistically
- token-level tasks, e.g. named entity recognition (NER) and question answering --> where models are required to produce fine-grained output at the token level
feature-based approach and fine-tuning approach
There are two existing strategies for applying pre-trained language representations to downstream tasks:
==> both share the same objective function during pre-training; they both use unidirectional language models to learn general language representations.
- the feature-based approach. ELMo, for example, uses task-specific architectures that include the pre-trained representations as additional features.
- the fine-tuning approach. OpenAI GPT (Generative Pre-trained Transformer) introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters.
==> the drawback of unidirectionality; bidirectionality is needed
Problem: current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers.
Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks, such as question answering, where it is crucial to incorporate context from both directions.
In this paper, we improve the fine-tuning based approaches with BERT.
==> the masked language model relaxes the unidirectionality constraint and allows bidirectional conditioning
solve: BERT alleviates the previously mentioned unidirectionality constraint by using a "masked language model" (MLM) pre-training objective. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context.
Unlike left-to-right language model pre-training,
the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer.
A "next sentence prediction" task that jointly pretrains text-pair representations.
下一句預測任務可以聯(lián)合預訓練文本對表示。
The contributions of our paper
- demonstrate the importance of bidirectional pre-training for language representations <= previous models were unidirectional in the pre-training stage
BERT uses MLM to enable pre-trained deep bidirectional representations.
- pre-trained representations reduce the need for many heavily-engineered task-specific architectures.
BERT is the first fine-tuning based representation model whose performance surpasses many task-specific architectures.
- BERT advances the state of the art for eleven NLP tasks.
https://github.com/google-research/bert
2. Related Work
The history of pre-training general language representations.
2.1 Unsupervised Feature-based Approaches
- word representation
pre-trained word embeddings outperform embeddings learned from scratch
--> to pre-train word embedding vectors, left-to-right language modeling objectives have been used, as well as objectives that discriminate correct from incorrect words in left and right context
- coarser granularities (sentence embeddings, paragraph embeddings)
--> to train sentence representations, prior work has used objectives that rank candidate next sentences, or left-to-right generation of the next sentence's words given a representation of the previous sentence
- contextual word representations: ELMo and its predecessor generalize traditional word embedding research along a different dimension
extract context-sensitive features <-- from a left-to-right and a right-to-left language model
The contextual representation of each token is the concatenation of the left-to-right and right-to-left representations.
problem: like ELMo, these LSTM-based models are feature-based and not deeply bidirectional.
--> further reading: https://www.cnblogs.com/robert-dlut/p/9824346.html
problem: word2vec is context-independent, so it cannot model polysemy (one word with multiple senses)
solve: ELMo (Embeddings from Language Models) yields a context-dependent pre-trained representation
problem: their model is feature-based and not deeply bidirectional
2.2 Unsupervised Fine-tuning Approaches
early work in this direction only pre-trained word embedding parameters from unlabeled text
- sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task.
advantage: few parameters need to be learned from scratch.
2.3 Transfer Learning from Supervised Data
CV research demonstrates the importance of transfer learning from large pre-trained models
an effective recipe is to fine-tune models pre-trained with ImageNet
3. BERT framework
There are two steps in the BERT framework: pre-training and fine-tuning.
- pre-training: the model is trained on unlabeled data over different pre-training tasks
- fine-tuning: the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks.
Each downstream task has its own separate fine-tuned model, even though they are all initialized with the same pre-trained parameters.
A distinctive feature of BERT is its unified architecture across different tasks: there is minimal difference between the pre-trained architecture and the final downstream architecture.
Model Architecture of BERT
BERT's model architecture is a multi-layer bidirectional Transformer encoder.
L, the number of layers; H, the hidden size; A, the number of self-attention heads; the feed-forward/filter size is 4H, i.e. 3072 for H=768 and 4096 for H=1024
two model sizes:
(1) BERTbase (L=12, H=768, A=12, Total Parameters=110M)
(2) BERTlarge (L=24, H=1024, A=16, Total Parameters=340M)
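As a sanity check on these sizes, here is a rough back-of-the-envelope Python sketch (my own, not from the paper) that estimates the parameter count from L and H, assuming a 30k WordPiece vocabulary and 512 position embeddings, and ignoring biases, LayerNorm, and the pooler:

```python
# Rough parameter counts from L and H; the simplifications above mean the totals
# land slightly below the reported 110M / 340M.
def approx_params(L, H, vocab=30000, max_pos=512, segments=2):
    embeddings = (vocab + max_pos + segments) * H   # token + position + segment embeddings
    attention = 4 * H * H                           # Q, K, V and output projections per layer
    ffn = 2 * H * (4 * H)                           # feed-forward/filter size is 4H
    return embeddings + L * (attention + ffn)

print(f"BERTbase : ~{approx_params(12, 768) / 1e6:.0f}M parameters")    # ~108M (paper: 110M)
print(f"BERTlarge: ~{approx_params(24, 1024) / 1e6:.0f}M parameters")   # ~333M (paper: 340M)
```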
Comparison: BERTbase Transformer <--compare--> OpenAI GPT Transformer:
BERT uses bidirectional self-attention, while OpenAI GPT uses constrained self-attention where every token can only attend to the context to its left.
In the literature, the bidirectional Transformer is often referred to as a "Transformer encoder", while the left-context-only version is referred to as a "Transformer decoder", since it can be used for text generation.
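A toy illustration of this difference (my own sketch, not from either paper): bidirectional self-attention uses a full attention mask, while the left-to-right "decoder" style uses a causal (lower-triangular) mask.

```python
import numpy as np

# Attention masks for a 5-token sequence: True means "position i may attend to position j".
# The encoder-style (BERT) mask is full; the decoder-style (GPT-like) mask is lower-triangular,
# so each token only sees the context to its left.
seq_len = 5
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)
left_to_right_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(bidirectional_mask.astype(int))
print(left_to_right_mask.astype(int))
```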
Input/Output Representations
To make BERT able to handle a variety of downstream tasks,
our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. A "sentence" can be an arbitrary span of contiguous text, rather than an actual linguistic sentence; a "sequence" refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.
-
tokenization: WordPiece embeddings with a 30,000 token vocabulary
[CLS] -> a special classification token, the first token of every sequence
the final hidden state corresponding to [CLS] -> is used as the aggregate sequence representation for classification tasks
sentence pairs -> are packed together into a single sequence
-
we differentiate the sentences in two ways
First, we separate them with a special token ([SEP]). (Note: [SEP] separates the sentences; it does not by itself tell them apart.)
Second, we add a learned embedding to every token -> indicating whether it belongs to sentence A or sentence B.
- E ~ the input embeddings
- C ~ the final hidden vector of the special [CLS] token
- Ti ~ the final hidden vector of the i-th input token
-
BERT input representation
Given a token, its BERT input representation = the corresponding token embedding + segment embedding + position embedding
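A toy illustration of this sum (my own sketch with random values and made-up ids, not real WordPiece ids or trained embeddings):

```python
import numpy as np

# Toy sizes; only the element-wise sum of the three embeddings matters here.
vocab_size, max_pos, hidden = 1000, 128, 768
rng = np.random.default_rng(0)
token_emb    = rng.normal(size=(vocab_size, hidden))
segment_emb  = rng.normal(size=(2, hidden))        # 0 = sentence A, 1 = sentence B
position_emb = rng.normal(size=(max_pos, hidden))

# [CLS] my dog is cute [SEP] he likes play ##ing [SEP]  (illustrative ids)
token_ids   = np.array([1, 10, 11, 12, 13, 2, 20, 21, 22, 23, 2])
segment_ids = np.array([0,  0,  0,  0,  0, 0,  1,  1,  1,  1, 1])

input_repr = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[np.arange(len(token_ids))]
print(input_repr.shape)   # (11, 768): one hidden-size vector per input token
```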
3.1 Pre-training BERT
We do not use traditional left-to-right or right-to-left language models to pre-train BERT.
Instead, we pre-train BERT using two unsupervised tasks.
Task #1: Masked LM
problem: standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly "see itself", and the model could trivially predict the target word in a multi-layered context.
==> this is why masking is needed
purpose: in order to train a deep bidirectional representation
==> masking is what enables bidirectionality
solve: we simply mask some percentage of the input tokens at random, and then predict those masked tokens.
called "masked LM"(MLM), the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. ?
掩碼語言模型MLM,與掩碼詞牌相對應的最終隱含向量被輸送到詞匯表上的輸出softmax,就像在標準語言模型中一樣。
15% at random, we only predict the masked words rather than reconstrcting the entire input.?
15%的隨機比例,我們只預測被掩碼的單詞而不是重建整個輸入
problem: although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning.
==> actually masking tokens with [MASK] causes a pre-training / fine-tuning mismatch
solve: to mitigate this, we do not always replace "masked" words with the actual [MASK] token.
- choose 15% of the token positions at random for prediction
if the i-th token is chosen: 80% ~ replace it with the [MASK] token; 10% ~ replace it with a random token; 10% ~ leave the i-th token unchanged
- Ti will then be used to predict the original token with a cross-entropy loss
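A minimal Python sketch of this masking scheme (an illustrative re-implementation, not the authors' code; the toy vocabulary and tokenized sentence are assumptions):

```python
import random

# Pick 15% of the positions, then apply the 80% / 10% / 10% rule to each chosen position.
def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}                                    # position -> original token to predict
    n_to_mask = max(1, int(round(len(tokens) * mask_rate)))
    for pos in rng.sample(range(len(tokens)), n_to_mask):
        targets[pos] = tokens[pos]
        dice = rng.random()
        if dice < 0.8:
            masked[pos] = "[MASK]"                  # 80%: replace with the [MASK] token
        elif dice < 0.9:
            masked[pos] = rng.choice(vocab)         # 10%: replace with a random token
        # else:                                       10%: keep the token unchanged
    return masked, targets

tokens = "my dog is hairy and it likes to play outside".split()
print(mask_tokens(tokens, vocab=["apple", "river", "blue"]))
```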
Task #2: Next Sentence Prediction (NSP)
Many important downstream tasks such as QA and NLI are based on understanding the relationship between two sentences,
problem: sentence relationships are not directly captured by language modeling.
==> language modeling cannot directly capture relationships between sentences
solve: in order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction (NSP) task that can be trivially generated from any monolingual corpus.
when choosing sentences A and B for each pre-training example: 50% ~ B is the actual next sentence -> labeled IsNext; 50% ~ B is a random sentence from the corpus -> labeled NotNext
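A minimal sketch of how such IsNext/NotNext pairs can be generated from a toy corpus (illustrative only; the document/sentence structure here is an assumption, not the authors' data pipeline):

```python
import random

rng = random.Random(0)

# Each document is a list of sentences; 50% of the time B is the true next sentence
# (IsNext), otherwise B is a random sentence from the corpus (NotNext).
def make_nsp_example(documents):
    doc = rng.choice(documents)
    i = rng.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if rng.random() < 0.5:
        sentence_b, label = doc[i + 1], "IsNext"
    else:
        sentence_b, label = rng.choice(rng.choice(documents)), "NotNext"
    return sentence_a, sentence_b, label

docs = [["the man went to the store", "he bought a gallon of milk"],
        ["penguins are flightless birds", "they live mostly in the southern hemisphere"]]
print(make_nsp_example(docs))
```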
C is used for next sentence prediction (NSP), which is closely related to representation learning objectives used in prior work.
- However, in prior work, only the sentence embeddings are transferred to downstream tasks.
- BERT, in contrast, transfers all parameters to initialize the end-task model parameters.
- Pre-training data
For the pre-training corpus, we use the BooksCorpus (800M words) and English Wikipedia. For Wikipedia, we extract only the text passages and ignore lists, tables, and headers.
3.2 Fine-tuning BERT
Fine-tuning is straightforward <-- since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks by simply swapping out the appropriate inputs and outputs.
problem: for applications involving text pairs, a common pattern is to independently encode the text pairs before applying bidirectional cross attention.
==> the traditional two steps (encode the text pair + apply bidirectional cross attention) ==> BERT does both in a single self-attention step
solve: BERT instead uses the self-attention mechanism to unify these two stages, since encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between the two sentences.
For each task, we simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end.
- at the input, sentences A and B are analogous to (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering
- at the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.
4. Experiments
We present BERT fine-tuning results on 11 NLP tasks.
4.1 GLUE
The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks.
To fine-tune on GLUE:
represent the input sequence (a single sentence or a sentence pair) as described in Section 3, and use the final hidden vector C corresponding to the first input token ([CLS]) as the aggregate representation.
The only new parameters introduced during fine-tuning are the classification layer weights W ∈ R^(K×H), where K is the number of labels. We compute a standard classification loss with C and W, i.e., log(softmax(C W^T)).
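A minimal numerical sketch of this classification head (C is a random stand-in for the [CLS] vector; the label count and gold label index are made up):

```python
import numpy as np

H, K = 768, 3                          # hidden size and number of labels (illustrative)
rng = np.random.default_rng(0)
C = rng.normal(size=(H,))              # stand-in for the final [CLS] hidden vector
W = rng.normal(size=(K, H)) * 0.02     # the only new parameters: classification weights in R^(K x H)

logits = W @ C                         # equivalently C W^T
probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # softmax over the K labels
loss = -np.log(probs[1])               # cross-entropy loss, assuming the gold label has index 1
print(probs, loss)
```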
4.2 SQuAD v1.1
The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowdsourced question/answer pairs. Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.
in the question answering task
we represent the input question and passage --> as a single packed sequence
- with the question using the A embeddings
- and the passage using the B embeddings
We only introduce a start vector S and an end vector E during fine-tuning.
The probability of word i being the start of the answer span is computed as the dot product between Ti and S, followed by a softmax over all of the words in the paragraph.
An analogous formula is used for the end of the answer span.
The score of a candidate answer span from position i to position j is defined as S·Ti + E·Tj, and the maximum-scoring span with j >= i is used as the prediction.
The training objective is the sum of the log-likelihoods of the correct start and end positions.
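A minimal sketch of this span scoring (T, S, and E are random stand-ins; the exhaustive search over j >= i is just the straightforward reading of the formula above):

```python
import numpy as np

H, seq_len = 768, 20
rng = np.random.default_rng(0)
T = rng.normal(size=(seq_len, H))      # stand-in for the final hidden vectors of the paragraph tokens
S = rng.normal(size=(H,))              # start vector introduced at fine-tuning
E = rng.normal(size=(H,))              # end vector introduced at fine-tuning

start_scores = T @ S                   # S · Ti for every position i (softmax over these gives P_start)
end_scores = T @ E                     # E · Tj for every position j

# the predicted span maximizes S·Ti + E·Tj subject to j >= i
best = max(((i, j) for i in range(seq_len) for j in range(i, seq_len)),
           key=lambda ij: start_scores[ij[0]] + end_scores[ij[1]])
print("predicted span:", best)
```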
4.3 SQuAD v2.0
The SQuAD 2.0 task extends the SQuAD 1.1 problem definition
- by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic.
- questions that do not have an answer are treated as having an answer span that starts and ends at the [CLS] token.
The probability space for the start and end answer span positions is extended to include the position of the [CLS] token.
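A minimal sketch of how the no-answer case can then be handled: the span starting and ending at [CLS] (index 0) acts as the no-answer score, and a non-null span is only predicted when the best non-null score beats it; the threshold tau here is an assumption (in the paper a threshold tuned on the dev set is used), and all scores are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
start_scores = rng.normal(size=21)     # stand-in scores; index 0 is the [CLS] position
end_scores = rng.normal(size=21)

null_score = start_scores[0] + end_scores[0]            # span starting and ending at [CLS]
spans = [(i, j) for i in range(1, 21) for j in range(i, 21)]
best_i, best_j = max(spans, key=lambda ij: start_scores[ij[0]] + end_scores[ij[1]])
best_non_null = start_scores[best_i] + end_scores[best_j]

tau = 0.0                                               # illustrative threshold
print("no answer" if best_non_null <= null_score + tau else (best_i, best_j))
```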
4.4 SWAG
The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference. Given a sentence, the task is to choose the most plausible continuation among four choices.
When fine-tuning on the SWAG dataset, we construct four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B).
The only task-specific parameter introduced is a vector whose dot product with the [CLS] token representation C denotes a score for each choice, which is normalized with a softmax layer.
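A minimal sketch of this scoring (the four [CLS] vectors C and the task vector V are random stand-ins for BERT outputs and learned parameters):

```python
import numpy as np

H = 768
rng = np.random.default_rng(0)
C = rng.normal(size=(4, H))            # stand-in [CLS] vectors, one per (sentence A, continuation) sequence
V = rng.normal(size=(H,))              # the single task-specific parameter vector

scores = C @ V                         # dot product of V with each choice's [CLS] representation
probs = np.exp(scores - scores.max())
probs /= probs.sum()                   # softmax over the four choices
print("predicted choice:", int(np.argmax(probs)), probs)
```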
5. Ablation Studies
--> An ablation study: when several ideas are proposed at once to improve a model, controlled experiments are run to verify that each idea is individually effective.
We perform ablation experiments over a number of facets of BERT in order to better understand their relative importance.
5.1 Effect of Pre-training Tasks
We demonstrate the importance of the deep bidirectionality of BERT by evaluating two pre-training objectives using exactly the same pre-training data, fine-tuning scheme, and hyperparameters as BERTbase:
-
No NSP
--> MLM without NSP: a bidirectional model which is trained using the "masked LM" (MLM) but without the "next sentence prediction" (NSP) task
result: removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD 1.1
-
LTR & No NSP
--> A left-context-only model which is trained using a standard left-to-right (LTR) LM, rather than an MLM. The left-only constraint was also applied at fine-tuning, because removing it introduced a pre-train/fine-tune mismatch that degraded downstream performance.
--> this model was pre-trained without the NSP task.
result: the LTR model performs worse than the MLM model on all tasks, with large drops on MRPC and SQuAD.
-
+BiLSTM
suspect: the LTR model will perform poorly at token predictions, since the token-level hidden states have no right-side context.
purpose: to test this, the authors made a good-faith attempt at strengthening the LTR system.
==> to compensate for the weakness of this unidirectional model, several attempts were made:
attempt 1: add a randomly initialized BiLSTM on top of the LTR model.
result: this does significantly improve results on SQuAD, but the results are still far worse than those of the pre-trained bidirectional models.
==> adding a BiLSTM on top of the LTR model does help, but performance is still far worse than the pre-trained bidirectional models!
attempt 2 (LTR + RTL): we recognize that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two models, as ELMo does.
problems:
a. this is twice as expensive as a single bidirectional model (the per-token representation costs twice as much)
b. this is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question <-- the answer appears before the question in that direction
c. this is strictly less powerful than a deep bidirectional model, since the deep bidirectional model can use both left and right context at every layer.
5.2 Effect of Model Size
We explore the effect of model size on fine-tuning task accuracy.
We trained a number of BERT models with differing numbers of layers, hidden units, and attention heads.
# L = the number of layers; # H = hidden size; # A = number of attention heads; # LM (ppl) = the masked LM perplexity of held-out training data
It has long been known that increasing the model size will lead to continual improvements on large-scale tasks.
Scaling to extreme model sizes also leads to large improvements on very small-scale tasks, provided that the model has been sufficiently pre-trained.
Prior work found that increasing the hidden dimension size from 200 to 600 helped, but increasing it further to 1,000 did not bring further improvements.
However, those prior works used a feature-based approach; when the model is instead fine-tuned directly on the downstream tasks, the task-specific models can benefit from the larger, more expensive pre-trained representations even when downstream task data is very small.
5.3 Feature-based approach with BERT
All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is added to the pre-trained model and all parameters are jointly fine-tuned on a downstream task.
==> question: what is the relationship between fine-tuning and the binary NSP prediction?
the feature-based approach has certain advantages:
where fixed features are extracted from the pre-trained model
- First, not all tasks can be easily represented by a Transformer encoder architecture, and therefore a task-specific model architecture needs to be added.
> Does the Transformer encoder architecture belong to the feature-based approach, the fine-tuning approach, or a task-specific approach? How do these differ and relate?
(See Section 1, Introduction.)
- Second, there are major computational benefits to pre-computing an expensive representation of the training data once and then running many experiments with cheaper models on top of this representation.
To ablate the fine-tuning approach,
- we apply the feature-based approach by extracting the activations from one or more layers of BERT without fine-tuning any of its parameters.
- these contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.
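A minimal PyTorch sketch of this feature-based setup (illustrative only: the extracted activations are random stand-ins, and the exact BiLSTM dimensions are an assumption based on the "768-dimensional BiLSTM" description):

```python
import torch
import torch.nn as nn

batch, seq_len, hidden, num_labels = 2, 16, 768, 9

# Stand-in for contextual embeddings extracted from one or more BERT layers;
# they are treated as fixed features, so no BERT parameters are fine-tuned.
extracted_features = torch.randn(batch, seq_len, hidden)

# randomly initialized two-layer BiLSTM followed by a classification layer
bilstm = nn.LSTM(input_size=hidden, hidden_size=hidden, num_layers=2,
                 batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * hidden, num_labels)          # per-token classification (e.g. NER tags)

with torch.no_grad():
    lstm_out, _ = bilstm(extracted_features)            # (batch, seq_len, 2 * hidden)
    logits = classifier(lstm_out)
print(logits.shape)                                      # torch.Size([2, 16, 9])
```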
==> BERT is effective for both fine-tuning and feature-based approaches.
6. Conclusion
- rich, unsupervised pre-training is an integral part of many language understanding systems; in particular, these results enable even low-resource tasks to benefit from deep unidirectional architectures.
- our major contribution --> further generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.
--> If you read only the conclusion you learn almost nothing, because much of the important content is spread throughout the body of the paper; perhaps this is a difference between Chinese-language and English-language papers.