【NLP】Reading and Practicing the Transformers Source Code
This article walks through the source code of HuggingFace's open-source transformers library, using BERT as the example, and then puts it into practice. It follows the PyTorch implementation (the TF 2.0 code style is almost identical) and covers the Transformer encoder used by BERT, the pre-training tasks, and the fine-tuning tasks. Finally, it runs a few simple experiments with a pre-trained BERT, such as producing sentence embeddings, predicting a masked target word, and extractive question answering. The article is aimed at BERT newcomers and assumes the reader has already read the original BERT paper.
1. Core Components
Transformers: State-of-the-art Natural Language Processing
As described in the paper above, the transformers library is built around three core components:
「Configuration」: the configuration class, usually inheriting from 「PretrainedConfig」. It stores the hyper-parameters of a model or tokenizer, such as the vocabulary size, hidden dimension, and dropout rate. The configuration class is mainly used to reproduce a model.
「Tokenizer」: the tokenization class, usually inheriting from 「PreTrainedTokenizer」. It mainly stores the vocabulary and the token-to-index mapping. It also handles model-specific details such as special tokens ([SEP], [CLS], etc.), token type ids, and the maximum sequence length, so a tokenizer is usually paired one-to-one with a model; the BERT model, for example, has BertTokenizer. Tokenizers can be implemented at the word level, the character level, or the subword level, where subword-level methods include Byte-Pair Encoding and WordPiece. Subword-level tokenization is currently the mainstream choice for transformer-based models, since it effectively alleviates the OOV problem and can learn relations between affixes. The tokenizer's job is to 「encode raw text into the inputs the model expects」.
「Model」: the model class. It encapsulates the computation graph of the pre-trained model, which follows the same pattern: map token ids through the embedding matrix, encode with several self-attention layers, and predict with a task-specific final layer. Beyond that, the Model can be flexibly extended for downstream tasks, e.g. by adding task-specific heads on top of a pre-trained base model, such as language model heads or sequence classification heads. In the codebase these are usually named 「XXXForSequenceClassification」 or 「XXXForMaskedLM」, where XXX is the model name (e.g. Bert) and the suffix is the pre-training task (MaskedLM) or the downstream task type (SequenceClassification).
In addition, on top of these three classes, transformers provides the 「AutoConfig, AutoTokenizer, AutoModel」 wrappers, which locate the concrete classes from the checkpoint name: given 'bert-base-cased', for instance, the library knows to load the BERT configuration, tokenizer, and model. This is very convenient, and the Auto wrappers are usually the easiest way to load a tokenizer and model when getting started.
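As a minimal sketch (the checkpoint name 'bert-base-cased' is just an example), the Auto classes resolve to the concrete BERT classes like this:

from transformers import AutoConfig, AutoTokenizer, AutoModel

config = AutoConfig.from_pretrained("bert-base-cased")        # resolves to BertConfig
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # resolves to BertTokenizer
model = AutoModel.from_pretrained("bert-base-cased")          # resolves to BertModel

print(type(config).__name__, type(tokenizer).__name__, type(model).__name__)
# BertConfig BertTokenizer BertModel

# the WordPiece tokenizer splits rare words into '##'-prefixed sub-words
print(tokenizer.tokenize("Tokenization handles subwords"))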
2. Transformer-based Pre-trained model
All implemented Transformer-based pre-trained models:

CONFIG_MAPPING = OrderedDict(
    [
        ("retribert", RetriBertConfig,),
        ("t5", T5Config,),
        ("mobilebert", MobileBertConfig,),
        ("distilbert", DistilBertConfig,),
        ("albert", AlbertConfig,),
        ("camembert", CamembertConfig,),
        ("xlm-roberta", XLMRobertaConfig,),
        ("marian", MarianConfig,),
        ("mbart", MBartConfig,),
        ("bart", BartConfig,),
        ("reformer", ReformerConfig,),
        ("longformer", LongformerConfig,),
        ("roberta", RobertaConfig,),
        ("flaubert", FlaubertConfig,),
        ("bert", BertConfig,),
        ("openai-gpt", OpenAIGPTConfig,),
        ("gpt2", GPT2Config,),
        ("transfo-xl", TransfoXLConfig,),
        ("xlnet", XLNetConfig,),
        ("xlm", XLMConfig,),
        ("ctrl", CTRLConfig,),
        ("electra", ElectraConfig,),
        ("encoder-decoder", EncoderDecoderConfig,),
    ]
)

These are the models implemented in the library, including well-known pre-trained language models such as BERT, GPT2, XLNet, RoBERTa, ALBERT, ELECTRA, and T5.
Below, BERT is used as the example for walking through the source code. I recommend reading the 「annotations」 I added in the code carefully, especially the 「breakdown into steps」, and paying attention to the hierarchy, i.e. 「how the different classes relate to each other」.
2.1 BertModel Transformer
「BertModel」: the bare Bert Model transformer outputting 「raw hidden-states」 without any specific head on top. The goal of this class is to use the 「Transformer」 to obtain the encoded vectors of a sequence. It is kept abstract so that it can be reused by different pre-training tasks. For example, the class for the MLM pre-training task is BertForMaskedLM, which holds a BertModel member; BertModel encodes the sequence, and the resulting hidden states are then used by the MaskedLM task for training or prediction.
The core constructor and forward pass:

# BertModel constructor
def __init__(self, config):
    super().__init__(config)
    self.config = config

    self.embeddings = BertEmbeddings(config)
    self.encoder = BertEncoder(config)
    self.pooler = BertPooler(config)

    self.init_weights()

def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
            position_ids=None, head_mask=None, inputs_embeds=None,
            encoder_hidden_states=None, encoder_attention_mask=None,
            output_attentions=None, output_hidden_states=None,):
    # ignore some code here...
    # step 1: obtain sequence embedding, BertEmbeddings
    embedding_output = self.embeddings(
        input_ids=input_ids, position_ids=position_ids,
        token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
    )
    # step 2: transformer encoder, BertEncoder
    encoder_outputs = self.encoder(
        embedding_output,
        attention_mask=extended_attention_mask,
        head_mask=head_mask,
        encoder_hidden_states=encoder_hidden_states,
        encoder_attention_mask=encoder_extended_attention_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
    )
    sequence_output = encoder_outputs[0]
    # step 3: pooling to obtain sequence-level encoding, BertPooler
    pooled_output = self.pooler(sequence_output)

    outputs = (sequence_output, pooled_output,) + encoder_outputs[1:]
    return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)

「The parameters are as follows:」
「input_ids」: the sequence of 「token ids」 including the special tokens ([CLS], [SEP]), e.g. tensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]]), where 101 and 102 are the token ids of [CLS] and [SEP]. Its 「shape」 is B x S, where 「B」 is the batch size and 「S」 is the sequence length; here 1 x 7.
「inputs_embeds」: an alternative to input_ids (「use one or the other」). inputs_embeds directly provides the token embeddings of the input tokens, e.g. word2vec word embeddings, so that no lookup of the (by default randomly initialized) embedding matrix via input_ids is needed.
「attention_mask」: 「used by self-attention」; optional, same shape as input_ids. For self-attention over the encoder-side sequence it defaults to all ones, i.e. every position can be attended to; for self-attention over the decoder-side sequence it defaults to a lower-triangular-like matrix (with ones on the diagonal).
「token_type_ids」: optional, same shape as input_ids. For a single-sentence input all values are 0; for a "sentence pair" input the values are 0 or 1: 0 for the first sentence, 1 for the second.
「head_mask」: 「used by self-attention」; optional. Heads you want to keep are set to 1 (or pass None); heads you want to drop are set to 0. The shape is [num_heads] or [num_hidden_layers x num_heads], i.e. the mask can be set per head and per layer.
「position_ids」: optional, the position ids, defaulting to 0, 1, ..., S-1.
「encoder_hidden_states / encoder_attention_mask」: used when the decoder performs cross-attention over the encoder; in that case K and V are computed from encoder_hidden_states.
In the forward pass:
「Step 1」: 「obtain the sequence embeddings」, corresponding to 「BertEmbeddings」 introduced below.
「Step 2」: 「encode with the Transformer」, corresponding to 「BertEncoder」 introduced below, producing the sequence token-level encoding.
「Step 3」: 「apply a non-linear transform to the hidden state of [CLS]」 to obtain the sequence-level encoding, corresponding to 「BertPooler」 introduced below.
2.2 BertEmbeddings
「Step 1」: obtain the embeddings of the sequence
「token embedding + position embedding + segment embedding」
embedding_output = self.embeddings(
    input_ids=input_ids, position_ids=position_ids,
    token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
)  # embeddings is a BertEmbeddings instance

Token embeddings are obtained from input_ids or inputs_embeds.
Position embeddings are obtained from position_ids; absolute position encoding is used here.
Segment embeddings are obtained from token_type_ids.
Layer normalization and dropout are applied on top of the sum. The output embedding has shape B x S x D, where D defaults to 768; this output is the hidden-states tensor fed into the encoder (see the sketch below).
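A minimal sketch of this step (the class below is a toy re-implementation with bert-base default sizes, not the library class):

import torch
import torch.nn as nn

class ToyBertEmbeddings(nn.Module):
    def __init__(self, vocab_size=28996, max_pos=512, type_vocab=2, hidden=768, p=0.1):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden)
        self.position_embeddings = nn.Embedding(max_pos, hidden)
        self.token_type_embeddings = nn.Embedding(type_vocab, hidden)
        self.LayerNorm = nn.LayerNorm(hidden, eps=1e-12)
        self.dropout = nn.Dropout(p)

    def forward(self, input_ids, token_type_ids=None, position_ids=None):
        B, S = input_ids.shape
        if position_ids is None:  # default: 0 .. S-1
            position_ids = torch.arange(S, device=input_ids.device).unsqueeze(0).expand(B, S)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        # token + position + segment embeddings, then LayerNorm and dropout -> B x S x D
        emb = (self.word_embeddings(input_ids)
               + self.position_embeddings(position_ids)
               + self.token_type_embeddings(token_type_ids))
        return self.dropout(self.LayerNorm(emb))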
2.3 BertEncoder
「Step 2」: encode the sequence with the 「Transformer」

# encoder is a BertEncoder instance
encoder_outputs = self.encoder(
    embedding_output,                                         # sequence embeddings, B x S x D
    attention_mask=extended_attention_mask,                   # used for self-attention over the sequence
    head_mask=head_mask,                                      # used for self-attention over the sequence
    encoder_hidden_states=encoder_hidden_states,              # decoder, cross-attention
    encoder_attention_mask=encoder_extended_attention_mask,   # cross-attention
    output_attentions=output_attentions,                      # whether to output attentions
    output_hidden_states=output_hidden_states)                # whether to output per-layer hidden states

「embedding_output」: the output of BertEmbeddings, the embedding of every token of every sample sequence in the batch.
「extended_attention_mask」: used by 「self-attention」. The attention_mask (B x S) is broadcast to B x 1 x 1 x S, where the singleton dimensions broadcast over the H attention heads (H being the number of heads) and the query positions, and it is converted into an additive mask so that masking can be applied to the logits before the softmax as 「logits + extended_attention_mask」: where attention_mask is 1, extended_attention_mask is 0; where attention_mask is 0, extended_attention_mask is -10000.0 (a very negative number), so that after the softmax the masked positions receive probabilities close to 0, which achieves the masking (a small sketch of building this additive mask appears after this parameter list).
「head_mask」: used by 「self-attention」. The 「original head_mask is likewise broadcast」: before broadcasting its shape is H or L x H; after broadcasting it is 「L x B x H x S x S」, i.e. for every token of every sample sequence its head attentions over the other tokens can be masked, with L x H head attentions in total.
「encoder_hidden_states」: optional, 「used for cross-attention」, i.e. when encoding on the decoder side, the encoder hidden states must be passed in, 「B x S x D」.
「encoder_attention_mask」: optional, 「used for cross-attention」, i.e. the attention mask over the encoder hidden states when encoding on the decoder side; analogous to extended_attention_mask, of shape 「B x S」.
「output_attentions」: whether to output the attention values (bool); useful for visualizing attention scores.
「output_hidden_states」: whether to output the hidden states of every layer (bool).
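To make the masking concrete, here is a minimal sketch of turning a 0/1 attention_mask into the additive extended mask (it mirrors what the library does internally; the helper name is ours, not the library's):

import torch

def build_extended_attention_mask(attention_mask):
    # attention_mask: B x S, 1 = attend, 0 = ignore
    extended = attention_mask[:, None, None, :].float()   # B x 1 x 1 x S, broadcasts over heads and query positions
    return (1.0 - extended) * -10000.0                    # 0.0 where attending, -10000.0 where masked

mask = torch.tensor([[1, 1, 1, 0]])
print(build_extended_attention_mask(mask))
# tensor([[[[    -0.,     -0.,     -0., -10000.]]]])

BertEncoder itself is essentially a loop over the stacked BertLayer modules described next. A simplified sketch of its forward pass, omitting the bookkeeping for output_attentions and output_hidden_states:

def encoder_forward(self, hidden_states, attention_mask=None, head_mask=None,
                    encoder_hidden_states=None, encoder_attention_mask=None):
    # self.layer is an nn.ModuleList of config.num_hidden_layers BertLayer modules (12 for bert-base)
    for i, layer_module in enumerate(self.layer):
        layer_outputs = layer_module(
            hidden_states,
            attention_mask,
            head_mask[i] if head_mask is not None else None,
            encoder_hidden_states,
            encoder_attention_mask,
        )
        hidden_states = layer_outputs[0]   # B x S x D, fed to the next layer
    return (hidden_states,)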
2.4 BertLayer
The most important part above is the iteration over 「BertLayer」 inside the encoder loop; its core code:

def forward(self, hidden_states, attention_mask=None, head_mask=None,
            encoder_hidden_states=None, encoder_attention_mask=None,
            output_attentions=False,):
    # step 1.0: self-attention; the attention instance is a BertAttention
    self_attention_outputs = self.attention(
        hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
    )
    attention_output = self_attention_outputs[0]
    outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights

    # step 1.1: if this is a decoder, do cross-attention. The output of step 1.0 is the
    # self-attention result of the decoder-side sequence and serves as the input of step 1.1;
    # the output of step 1.1 is the decoder-side cross-attention result.
    # The crossattention instance is also a BertAttention.
    if self.is_decoder and encoder_hidden_states is not None:
        cross_attention_outputs = self.crossattention(
            attention_output,
            attention_mask,
            head_mask,
            encoder_hidden_states,
            encoder_attention_mask,
            output_attentions,
        )
        attention_output = cross_attention_outputs[0]
        outputs = outputs + cross_attention_outputs[1:]  # add cross attentions if we output attention weights

    # step 2: intermediate transform, the feed-forward network (FFN) of the original paper
    intermediate_output = self.intermediate(attention_output)
    # step 3: skip-connection
    layer_output = self.output(intermediate_output, attention_output)
    outputs = (layer_output,) + outputs
    return outputs

Here, step 1 is split into two sub-steps. For an encoder (BERT only uses the encoder), only 1.0 is active, i.e. the input sequence only goes through self-attention. For a seq2seq model that also uses the Transformer decoder, 1.0 is the self-attention over the decoder-side sequence, whose attention_mask is effectively a lower-triangular-like matrix; step 1.1 then takes the hidden states produced by 1.0 and performs cross-attention over encoder_hidden_states. This is the key part of this section.
2.4.1 BertAttention
BertAttention is the class of the attention instance in the code above and is the core class in which the Transformer performs self-attention. It consists of a BertSelfAttention member and a BertSelfOutput member.

class BertAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self = BertSelfAttention(config)
        self.output = BertSelfOutput(config)

    def forward(self, hidden_states, attention_mask=None, head_mask=None,
                encoder_hidden_states=None, encoder_attention_mask=None,
                output_attentions=False):
        # step 1: self-attention, B x S x D
        self_outputs = self.self(
            hidden_states, attention_mask, head_mask,
            encoder_hidden_states, encoder_attention_mask, output_attentions
        )
        # step 2: skip-connection, B x S x D
        attention_output = self.output(self_outputs[0], hidden_states)
        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them
        return outputs

「BertSelfAttention」: the 「self-attention」 module. It can be instantiated as encoder-side self-attention, or as decoder-side self-attention, in which case attention_mask is non-empty (a lower-triangular-like mask matrix). It can also be instantiated as decoder-side cross-attention; in that case hidden_states is the self-attention result of the decoder-side sequence, and the encoder-side encoder_hidden_states and encoder_attention_mask must be passed in to perform cross-attention.

def forward(self, hidden_states, attention_mask=None, head_mask=None,
            encoder_hidden_states=None, encoder_attention_mask=None,
            output_attentions=False):
    # step 1: mapping Query/Key/Value to sub-space
    # step 1.1: query mapping
    mixed_query_layer = self.query(hidden_states)  # B x S x (H*d)

    # If this is instantiated as a cross-attention module, the keys
    # and values come from an encoder; the attention mask needs to be
    # such that the encoder's padding tokens are not attended to.
    # step 1.2: key/value mapping
    if encoder_hidden_states is not None:
        mixed_key_layer = self.key(encoder_hidden_states)      # B x S x (H*d)
        mixed_value_layer = self.value(encoder_hidden_states)
        attention_mask = encoder_attention_mask
    else:
        mixed_key_layer = self.key(hidden_states)              # B x S x (H*d)
        mixed_value_layer = self.value(hidden_states)

    query_layer = self.transpose_for_scores(mixed_query_layer)  # B x H x S x d
    key_layer = self.transpose_for_scores(mixed_key_layer)      # B x H x S x d
    value_layer = self.transpose_for_scores(mixed_value_layer)  # B x H x S x d

    # step 2: compute attention scores
    # step 2.1: raw attention scores
    # B x H x S x d  @  B x H x d x S  ->  B x H x S x S
    # Take the dot product between "query" and "key" to get the raw attention scores.
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
    # step 2.2: mask if necessary
    if attention_mask is not None:
        # Apply the attention mask, B x H x S x S
        attention_scores = attention_scores + attention_mask

    # step 2.3: Normalize the attention scores to probabilities, B x H x S x S
    attention_probs = nn.Softmax(dim=-1)(attention_scores)

    # This is actually dropping out entire tokens to attend to, which might
    # seem a bit unusual, but is taken from the original Transformer paper.
    attention_probs = self.dropout(attention_probs)

    # Mask heads if we want to
    if head_mask is not None:
        attention_probs = attention_probs * head_mask

    # B x H x S x S  @  B x H x S x d  ->  B x H x S x d
    # step 4: aggregate values by attention probs to form context encodings
    context_layer = torch.matmul(attention_probs, value_layer)
    # B x S x H x d
    context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
    # B x S x D
    new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
    # B x S x D, equivalent to concatenating the multiple heads
    context_layer = context_layer.view(*new_context_layer_shape)

    outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
    return outputs

The D = 768 dimensions are split evenly across the heads; with 12 heads, each head gets 64 dimensions. In the actual computation they are kept together, i.e. all heads are computed at once (multi-head). Denote the output of this step as the context encoding; the input is the hidden_states argument.
Written out with the shapes used in the code (input $X \in \mathbb{R}^{B \times S \times D}$, $D = 768$, $H = 12$ heads, $d = D / H = 64$):
Projections: $Q = X W^Q$, $K = X W^K$, $V = X W^V$, each of shape $B \times S \times (H \cdot d)$; 「transpose_for_scores」 reshapes them from <Batch Size, Seq Length, Head Num, head dim> to $B \times H \times S \times d$.
Raw attention scores: $\text{scores} = \dfrac{Q K^{\top}}{\sqrt{d}}$, shape $B \times H \times S \times S$. For decoder-side self-attention, the precomputed additive attention_mask of the sequence (effectively a lower-triangular matrix, diagonal included) is added to these logits.
Attention probabilities: $\text{probs} = \mathrm{softmax}(\text{scores})$, shape $B \times H \times S \times S$: within each batch and each head, the attention score of every token over the other tokens of the sequence.
「context_layer」: $\text{context} = \text{probs} \cdot V$, shape $B \times H \times S \times d$: every token aggregates the value vectors of the sequence, weighted by its attention scores, to obtain its contextual encoding. Transposing back to $B \times S \times H \times d$ and reshaping concatenates the heads into the final output of shape $B \times S \times D$.
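The whole multi-head computation can be reproduced with a few lines of plain PyTorch (an illustrative sketch following the shapes above, not the exact source):

import math
import torch

B, S, D, H = 1, 7, 768, 12
d = D // H
X = torch.randn(B, S, D)
W_q, W_k, W_v = (torch.nn.Linear(D, D) for _ in range(3))

def split_heads(t):                                   # B x S x D -> B x H x S x d
    return t.view(B, S, H, d).permute(0, 2, 1, 3)

Q, K, V = split_heads(W_q(X)), split_heads(W_k(X)), split_heads(W_v(X))
scores = Q @ K.transpose(-1, -2) / math.sqrt(d)       # B x H x S x S
probs = scores.softmax(dim=-1)                        # attention probabilities
context = probs @ V                                   # B x H x S x d
context = context.permute(0, 2, 1, 3).reshape(B, S, D)  # concat heads -> B x S x D
print(context.shape)  # torch.Size([1, 7, 768])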
「BertSelfOutput」
BertSelfOutput applies a dense layer and dropout to the context encoding and adds a 「skip-connection」 followed by layer normalization: $\text{attention\_output} = \mathrm{LayerNorm}(\mathrm{Dropout}(\mathrm{Dense}(\text{context})) + \text{hidden\_states})$, keeping the shape $B \times S \times D$.
2.4.2 BertIntermediate
BertIntermediate is the position-wise feed-forward expansion: $\text{intermediate} = \mathrm{GELU}(\text{attention\_output} \cdot W_1 + b_1)$, mapping $D \to D_{ff}$, where the intermediate size $D_{ff}$ defaults to 3072 and the GELU activation is used.
2.4.3 BertOutput
BertOutput projects back to the hidden size and adds another skip-connection: $\text{layer\_output} = \mathrm{LayerNorm}(\mathrm{Dropout}(\text{intermediate} \cdot W_2 + b_2) + \text{attention\_output})$, mapping $D_{ff} \to D$.
The output above becomes the input of the next BertLayer, and so on; the computation iterates through all the BertLayers (12 of them for bert-base) to produce the final encoder output.
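Putting 2.4.2 and 2.4.3 together, a minimal sketch of this FFN sub-layer with its residual connection, using default bert-base sizes (the class name is ours, not the library's):

import torch
import torch.nn as nn

D, D_ff = 768, 3072

class ToyFFN(nn.Module):
    def __init__(self):
        super().__init__()
        self.intermediate = nn.Linear(D, D_ff)        # BertIntermediate.dense
        self.act = nn.GELU()
        self.output = nn.Linear(D_ff, D)              # BertOutput.dense
        self.LayerNorm = nn.LayerNorm(D, eps=1e-12)
        self.dropout = nn.Dropout(0.1)

    def forward(self, attention_output):              # B x S x D
        h = self.act(self.intermediate(attention_output))   # B x S x D_ff
        h = self.dropout(self.output(h))                     # B x S x D
        return self.LayerNorm(h + attention_output)          # skip-connection

x = torch.randn(1, 7, D)
print(ToyFFN()(x).shape)   # torch.Size([1, 7, 768])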
2.5 BertPooler
「Step 3」: obtain the sequence-level embedding.
Take the output of BertEncoder above, of shape B x S x D. For each sample, the first token along the S dimension is the hidden state corresponding to [CLS]. Feeding it through a dense layer and a tanh activation yields the sequence-level representation pooled_output of shape B x D, which is mainly used for fine-tuning on downstream tasks.

def forward(self, hidden_states):
    # We "pool" the model by simply taking the hidden state corresponding
    # to the first token.
    first_token_tensor = hidden_states[:, 0]
    pooled_output = self.dense(first_token_tensor)
    pooled_output = self.activation(pooled_output)  # nn.Tanh()
    return pooled_output

3. Bert Pre-training Tasks
The sections above covered BERT's core Transformer encoder; this section introduces BERT's pre-training tasks.
3.1 BertForMaskedLM
Bert Model with 「a language modeling head」 on top. The previous section walked through the BertModel source; BertModel is mainly used to obtain the sequence encoding. BertForMaskedLM, introduced here, takes the sequence encoding produced by BertModel and pre-trains with the MaskedLM objective.
BERT mainly uses the Transformer encoder and pre-trains on the encoder's sequence encodings; the MLM objective is what allows the encoder to perform bidirectional self-attention.
The constructor of 「BertForMaskedLM」:

def __init__(self, config):
    super().__init__(config)

    assert (
        not config.is_decoder
    ), "If you want to use `BertForMaskedLM` make sure `config.is_decoder=False` for bi-directional self-attention."
    # is_decoder is False; no decoder is needed

    self.bert = BertModel(config)       # BertModel encodes the sequence
    self.cls = BertOnlyMLMHead(config)  # multi-class pre-training task, task-specific head

    self.init_weights()

The core forward code:

def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
            position_ids=None, head_mask=None, inputs_embeds=None, labels=None,
            encoder_hidden_states=None, encoder_attention_mask=None,
            output_attentions=None, output_hidden_states=None, **kwargs):
    # step 1: obtain sequence encoding by BertModel
    outputs = self.bert(
        input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        encoder_hidden_states=encoder_hidden_states,
        encoder_attention_mask=encoder_attention_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
    )
    sequence_output = outputs[0]  # B x S x D

    # step 2: output scores of each token in the sequence
    prediction_scores = self.cls(sequence_output)  # B x S x V, prediction scores over the vocabulary

    outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here

    # step 3: build loss; labels: B x S
    if labels is not None:
        loss_fct = CrossEntropyLoss()  # -100 index = padding token
        masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))  # flatten, (B*S) x V
        outputs = (masked_lm_loss,) + outputs

    return outputs  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)

The parameters are almost identical to BertModel's, with one extra labels argument that is used to compute the MLM loss.
The cls member is a 「BertOnlyMLMHead」 (essentially a 「BertLMPredictionHead」), which performs the MLM multi-class prediction: it applies a transform (dense + GELU + LayerNorm) to the sequence-token-level encoding $H \in \mathbb{R}^{B \times S \times D}$ produced by BertModel and then projects it onto the vocabulary, $\text{prediction\_scores} = \mathrm{transform}(H)\, W^{\top} + b$ with $W \in \mathbb{R}^{V \times D}$.
Here $V$ is the vocabulary size, so prediction_scores has shape $B \times S \times V$.
In particular, the format of the labels:
「labels」 (torch.LongTensor of shape (batch_size, sequence_length), optional, defaults to None) – Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring) Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
That is, positions that should not be predicted have their 「label set to -100」. Typically only the [MASK] positions get real labels and all other positions are set to -100, so the loss is computed only for the tokens at the [MASK] positions to be predicted. -100 is in fact the default value of the ignore_index argument of CrossEntropyLoss.
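As a small illustrative sketch (assuming a transformers version where the MLM loss argument is named labels; the masked position is chosen by hand here), labels for a single masked token can be built like this:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

text = "Nice to meet you"
enc = tokenizer.encode_plus(text, return_tensors="pt")
input_ids = enc["input_ids"].clone()

labels = torch.full_like(input_ids, -100)     # -100 everywhere: ignored by CrossEntropyLoss
masked_index = 3                              # position of "meet" in [CLS] Nice to meet you [SEP]
labels[0, masked_index] = input_ids[0, masked_index]   # real label only at the masked position
input_ids[0, masked_index] = tokenizer.mask_token_id   # replace the input token with [MASK]

outputs = model(input_ids, attention_mask=enc["attention_mask"], labels=labels)
masked_lm_loss = outputs[0]                   # scalar MLM loss over the single masked position
print(masked_lm_loss)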
3.2 BertForPreTraining
Similar to BertForMaskedLM, but with an additional next sentence prediction pre-training task. Bert Model with 「two heads on top」 as done during pre-training: a 「masked language modeling」 head and 「a next sentence prediction」 (classification) head.
The core code of the corresponding heads:

class BertPreTrainingHeads(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.predictions = BertLMPredictionHead(config)
        self.seq_relationship = nn.Linear(config.hidden_size, 2)

    def forward(self, sequence_output, pooled_output):
        prediction_scores = self.predictions(sequence_output)
        seq_relationship_score = self.seq_relationship(pooled_output)
        return prediction_scores, seq_relationship_score

Here, BertLMPredictionHead is the same head as in BertForMaskedLM and yields the MLM loss. In addition there is a seq_relationship head: the pooled encoding is fed into a linear binary classifier that decides whether the second sentence is the actual next sentence, giving a next-sentence loss. The two losses are summed.
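A simplified sketch of how BertForPreTraining's forward combines the two losses (argument names as in recent versions of the library; the snippet condenses the source rather than copying it verbatim):

# inside BertForPreTraining.forward, after obtaining the two heads' scores
# prediction_scores: B x S x V, seq_relationship_score: B x 2
if labels is not None and next_sentence_label is not None:
    loss_fct = CrossEntropyLoss()
    masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
    next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))
    total_loss = masked_lm_loss + next_sentence_loss   # the two pre-training losses are simply added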
3.3 BertForNextSentencePrediction
Bert Model with a next sentence prediction (classification) head on top. It only has the seq_relationship head described above to build the next-sentence loss, so it is not elaborated further.
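A minimal usage sketch (the checkpoint name and sentences are just examples); in the library's convention, index 0 of the two logits corresponds to "sentence B is the actual next sentence":

from transformers import BertTokenizer, BertForNextSentencePrediction
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-cased")

prompt = "The man went to the store."
next_sent = "He bought a gallon of milk."
enc = tokenizer.encode_plus(prompt, next_sent, return_tensors="pt")

with torch.no_grad():
    seq_relationship_scores = model(**enc)[0]           # B x 2 logits
print(torch.softmax(seq_relationship_scores, dim=-1))   # probabilities of [is_next, is_random]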
4. Bert Fine-tuning Tasks
This section introduces how a pre-trained BERT is fine-tuned on downstream tasks. Each fine-tuning model below adds task-specific parameters on top of BERT; you only need to feed it task-specific data and run the optimization to obtain a fine-tuned model.
4.1 BertForSequenceClassification
A sentence-level task. Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. 「for GLUE tasks」.

class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)  # number of classes

        self.init_weights()

    # the forward arguments are the same as for the pre-training tasks above
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
                position_ids=None, head_mask=None, inputs_embeds=None, labels=None,
                output_attentions=None, output_hidden_states=None):
        # step 1: transformer encoding
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states)

        # step 2: use the pooled hidden state corresponding to the [CLS] token
        # B x D
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        # B x N
        logits = self.classifier(pooled_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        # step 3: build loss; labels: (B, )
        if labels is not None:
            if self.num_labels == 1:
                # We are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)

The code above is very clear: BertModel first produces the encodings; since this is sentence-level classification, the pooled hidden state corresponding to the first [CLS] token is passed through a classification layer to get the class logits, and the loss is then built from the logits and labels. This model is mainly used for sentence-level classification, and it can also be used for sentence-pair-level classification.
4.2 BertForMultipleChoice
A 「sentence-pair」-level task. Bert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax), e.g. for 「RocStories/SWAG tasks」.
Given a prompt and several choices (one correct, the others wrong), the task is to decide which choice is correct. 「The input is arranged as [[prompt, choice0], [prompt, choice1], ...]」. A fully connected layer on top of BertModel's pooled output produces a logit for each "sentence pair" [prompt, choice i]; a softmax over the choices is then used to build the cross-entropy loss. A sketch of how the inputs are arranged follows.
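A minimal sketch of the reshaping for num_choices options (this mirrors the flattening done inside BertForMultipleChoice; it assumes a transformers version with the callable batch tokenizer API, and the prompt/choices are made up for illustration):

import torch
from transformers import BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMultipleChoice.from_pretrained("bert-base-cased")

prompt = "The man went to the kitchen."
choices = ["He cooked dinner.", "The ocean is blue."]

# encode each [prompt, choice] pair, padded to a common length -> num_choices x S
enc = tokenizer([prompt] * len(choices), choices, return_tensors="pt", padding=True)

# add the batch dimension: B x num_choices x S (here B = 1)
input_ids = enc["input_ids"].unsqueeze(0)
attention_mask = enc["attention_mask"].unsqueeze(0)
token_type_ids = enc["token_type_ids"].unsqueeze(0)

labels = torch.tensor([0])   # index of the correct choice
outputs = model(input_ids=input_ids, attention_mask=attention_mask,
                token_type_ids=token_type_ids, labels=labels)
loss, classification_scores = outputs[:2]   # classification_scores: B x num_choices
print(classification_scores.shape)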
4.3 BertForTokenClassification
A token-level downstream task. Bert Model with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for 「Named-Entity-Recognition (NER) tasks」.

def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
            position_ids=None, head_mask=None, inputs_embeds=None, labels=None,
            output_attentions=None, output_hidden_states=None):
    # step 1: Transformer
    outputs = self.bert(
        input_ids, attention_mask=attention_mask,
        token_type_ids=token_type_ids, position_ids=position_ids,
        head_mask=head_mask, inputs_embeds=inputs_embeds,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states)

    # step 2: get sequence-token encoding, B x S x D
    sequence_output = outputs[0]

    # step 3: fine-tuning parameters
    sequence_output = self.dropout(sequence_output)
    # B x S x N
    logits = self.classifier(sequence_output)  # nn.Linear(config.hidden_size, config.num_labels)

    outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

    # step 4: build loss; labels: B x S
    if labels is not None:
        loss_fct = CrossEntropyLoss()
        # Only keep active parts of the loss
        if attention_mask is not None:
            active_loss = attention_mask.view(-1) == 1
            active_logits = logits.view(-1, self.num_labels)
            active_labels = torch.where(
                active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)
            )
            loss = loss_fct(active_logits, active_labels)
        else:
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        outputs = (loss,) + outputs

    return outputs  # (loss), scores, (hidden_states), (attentions)

The code above is self-explanatory. It is mainly used for token-level classification tasks such as NER.
4.4 BertForQuestionAnswering
A 「sentence-pair」-level task, specifically extractive question answering. Bert Model with a 「span classification head on top」 for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

class BertForQuestionAnswering(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        # num_labels is 2, for the downstream parameters of start_position and end_position
        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)

        self.init_weights()

    # two extra arguments, start_positions and end_positions: the span labels for
    # extractive QA, both of shape (B, )
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
                position_ids=None, head_mask=None, inputs_embeds=None,
                start_positions=None, end_positions=None,
                output_attentions=None, output_hidden_states=None):
        # step 1: Transformer encoding
        outputs = self.bert(
            input_ids,  # question, passage
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
        )
        # B x S x D
        sequence_output = outputs[0]

        # step 2: split to obtain start and end logits
        # B x S x N (N is the number of labels, here N=2)
        logits = self.qa_outputs(sequence_output)
        # after the split: B x S x 1, B x S x 1
        start_logits, end_logits = logits.split(1, dim=-1)
        # B x S, B x S
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        outputs = (start_logits, end_logits,) + outputs[2:]

        # step 3: build loss; start_positions, end_positions: (B, )
        if start_positions is not None and end_positions is not None:
            # If we are on multi-GPU, split add a dimension
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)
            # sometimes the start/end positions are outside our model inputs, we ignore these terms
            ignored_index = start_logits.size(1)
            start_positions.clamp_(0, ignored_index)
            end_positions.clamp_(0, ignored_index)

            # S-way classification
            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2
            outputs = (total_loss,) + outputs

        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)

In short, the sequence-token-level hidden states are fed through two linear projections (the two output columns of qa_outputs) that produce the logits for the predicted start_position and end_position respectively.
5. Bert Practice
This section puts BERT into practice with three parts:
Use a pre-trained BERT model to output the embeddings of a target sentence.
Use a pre-trained BERT model to predict the real word at the [MASK] position of a target sentence.
Use a pre-trained BERT model for extractive question answering.
The pre-trained BERT checkpoints currently provided include:
bert-base-chinese
bert-base-uncased
bert-base-cased
bert-base-german-cased
bert-base-multilingual-uncased
bert-base-multilingual-cased
bert-large-cased
bert-large-uncased
bert-large-uncased-whole-word-masking
bert-large-cased-whole-word-masking
The main differences between these pre-trained checkpoints are:
the language of the pre-training corpus: Chinese, English, German, multilingual, etc.
whether the model is cased or uncased
the number of layers
whether pre-training masks the WordPiece sub-words or whole words
In what follows, 'bert-base-cased' is used. For the QA part, the checkpoint 'bert-large-uncased-whole-word-masking' fine-tuned on SQuAD is used for inference.
First, load the tokenizer and model:

MODEL_NAME = "bert-base-cased"

# step 1: obtain the tokenizer, a BertTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir='tmp/token')
# step 2: obtain the pre-trained model, a BertModel
model = AutoModel.from_pretrained(MODEL_NAME, cache_dir='tmp/model')

A quick look at the tokenizer (transformers.tokenization_bert.BertTokenizer):

# 28996 tokens in total, including the special symbols: ('[UNK]', 100), ('[PAD]', 0),
# ('[CLS]', 101), ('[SEP]', 102), ('[MASK]', 103)...
tokenizer.vocab

And the network structure of the model:
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1) - (11): eleven more BertLayer blocks identical to (0), omitted here for brevity
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)

The model structure matches the BertModel source walkthrough above.
5.1 Embeddings produced by pre-trained BertModel
text?=?"This?is?an?input?example"#?step?1:?tokenize,?including?add?special?tokens tokens_info?=?tokenizer.encode_plus(text,?return_tensors="pt")? for?key,?value?in?tokens_info.items():print("{}:\n\t{}".format(key,?value)) #?observe?the?enriched?token?sequences print(tokenizer.convert_ids_to_tokens(tokens_info['input_ids'].squeeze(0).numpy()))#?step?2:?BertModel?Transformer?Encoding outputs,?pooled?=?model(**tokens_info) print("Token?wise?output:?{},?Pooled?output:?{}".format(outputs.shape,?pooled.shape))''' step?1:?outputs: ----------------------------------------------------------- input_ids:tensor([[?101,?1188,?1110,?1126,?7758,?1859,??102]]) token_type_ids:tensor([[0,?0,?0,?0,?0,?0,?0]]) attention_mask:tensor([[1,?1,?1,?1,?1,?1,?1]])['[CLS]',?'This',?'is',?'an',?'input',?'example',?'[SEP]']step?2:?outputs: ------------------------------------------------------------ Token?wise?output:?torch.Size([1,?7,?768]),?Pooled?output:?torch.Size([1,?768]) '''5.2 Predict the missing word in a sentence
from?transformers?import?BertForMaskedLMtext?=?"Nice?to?[MASK]?you"?#?target?token?using?[MASK]?to?mask#?step?1:?obtain?pretrained?Bert?Model?using?MLM?Loss maskedLM_model?=?BertForMaskedLM.from_pretrained(MODEL_NAME,?cache_dir='tmp/model') maskedLM_model.eval()?#?close?dropout#?step?2:?tokenize token_info?=?tokenizer.encode_plus(text,?return_tensors='pt') tokens?=?tokenizer.convert_ids_to_tokens(token_info['input_ids'].squeeze().numpy()) print(tokens)?#?['[CLS]',?'Nice',?'to',?'[MASK]',?'you',?'[SEP]']#?step?3:?forward?to?obtain?prediction?scores with?torch.no_grad():outputs?=?maskedLM_model(**token_info)predictions?=?outputs[0]?#?shape,?B?x?S?x?V,?[1,?6,?28996]#?step?4:?top-k?predicted?tokens masked_index?=?tokens.index('[MASK]')?#?3 k?=?10 probs,?indices?=?torch.topk(torch.softmax(predictions[0,?masked_index],?-1),?k)predicted_tokens?=?tokenizer.convert_ids_to_tokens(indices.tolist()) print(list(zip(predicted_tokens,?probs)))''' output:[('meet',?tensor(0.9712)),('see',?tensor(0.0267)),('meeting',?tensor(0.0010)),('have',?tensor(0.0003)),('met',?tensor(0.0002)),('know',?tensor(0.0001)),('join',?tensor(7.0005e-05)),('find',?tensor(5.8323e-05)),('Meet',?tensor(2.7171e-05)),('tell',?tensor(2.4689e-05))] '''可以看出,meet的概率最大,且達到了0.97,非常顯著。
5.3 Extractive QA
This demonstrates a sentence-pair-level downstream task.

from transformers import BertTokenizer, BertForQuestionAnswering
import torch

# step 1: obtain a model fine-tuned on SQuAD
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', cache_dir='tmp/token_qa')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad', cache_dir='tmp/model_qa')

# step 2: tokenize the sentence pair: question, passage
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer.encode_plus(question, text, return_tensors='pt')
input_ids, token_type_ids = encoding["input_ids"], encoding["token_type_ids"]
print(input_ids, token_type_ids)
# observe enriched tokens
all_tokens = tokenizer.convert_ids_to_tokens(input_ids.squeeze().numpy())
print(all_tokens)

# step 3: obtain start/end position scores, B x S
start_scores, end_scores = model(input_ids, token_type_ids=token_type_ids)  # (B, S)
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
print(answer)
assert answer == "a nice puppet"

'''
output:
step 2:
input_ids: tensor([[  101,  2040,  2001,  3958, 27227,  1029,   102,  3958, 27227,  2001,
          1037,  3835, 13997,   102]])
token_type_ids: tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])
all_tokens:
['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', 'henson', 'was', 'a', 'nice', 'puppet', '[SEP]']

step 3:
answer:
a nice puppet
'''

The model predicts the answer exactly: 「a nice puppet」.
6. Summary
I had never found the time to read the BERT source code before; this article serves as a set of rough reading notes on it. For further study, see the article 進擊的 BERT:NLP 界的巨人之力與遷移學習. Overall, building on top of and fine-tuning with HuggingFace's transformers is quite convenient. Next time I will try combining it with AllenNLP, using transformers inside AllenNLP to solve NLP tasks.
7. References
Transformers: State-of-the-art Natural Language Processing
深入理解NLP Subword算法:BPE、WordPiece、ULM
huggingface transformers doc
huggingface transformers source code
進擊的 BERT:NLP 界的巨人之力與遷移學習
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding