
Introduction to the BERT model, reading the BERT source code in transformers, a hands-on classification task, and a summary of difficult points

Published: 2024/7/5


Original post: https://blog.csdn.net/HUSTHY/article/details/105882989

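Contents

1. Introduction to the BERT model

The BERT pre-training process

Inputs to the BERT model

2. A look at the Hugging Face BERT source code

Extracting text embeddings with BERT

Reading the BertModel code

The BertEmbeddings submodule

BertEncoder

BertAttention

BertIntermediate

BertOutput(config)

BertPooler()

3. Hands-on: text classification with BERT

4. Summary of difficult points in BERT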

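A note up front: this post is fairly long because it contains a lot of code and figures. If you are interested, stick with it to the end!

1. Introduction to the BERT model

BERT was proposed by Google in 2018 and achieved state-of-the-art results on 11 NLP tasks. The model is a stack of transformer layers, so here is a quick look at the transformer architecture, using the classic figure:

As the figure shows, the transformer consists of an encoder and a decoder, and BERT uses only the encoder. The smallest BERT model in the paper (BERT-base) has 12 transformer layers, 12 attention heads and a 768-dimensional hidden state. The simplified architecture diagram from the paper:

This bidirectional transformer architecture performs well on most NLP tasks and generalizes well. Because it is pre-trained on a very large corpus, BERT can be applied to all kinds of NLP tasks with the pre-train, then fine-tune approach.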

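The BERT pre-training process

Pre-training consists of two tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP).

Masked Language Model works like a cloze test: 15% of the tokens in the corpus are masked, and the model learns regularities between words by predicting them.

Next Sentence Prediction learns the relationship between two sentences in context, that is, whether the second sentence really follows the first in the corpus.

By training on these two tasks, BERT picks up both syntactic and semantic information from text. The self-attention mechanism lets the model use the other tokens in the text to enrich the representation of each target token, which is a key reason BERT outperformed earlier models.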
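The masking step of the MLM task can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the helper name mask_tokens is made up, every selected token is replaced by [MASK], and the extra cases in the paper (some selected tokens are swapped for random tokens or left unchanged) are omitted.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace roughly `mask_rate` of the tokens with [MASK].

    Returns the masked sequence and a dict {position: original token};
    the dict holds what the model would be trained to predict.
    (Hypothetical helper, simplified from the real procedure.)
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = "[MASK]"
    return masked, targets


tokens = list("我爱武汉我爱中国我爱自然语言处理技术")  # 18 single-character tokens
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

The loss is then computed only at the positions stored in targets, which is what makes the task a cloze test.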

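Inputs to the BERT model

The input to BERT can be a single sentence or a sentence pair. At the code level, each input is represented by three vectors: the token embedding, the segment embedding and the position embedding. Their meanings:

token embedding: the embedding of each token in the sentence

segment embedding: marks which sentence each token belongs to (0 for the first sentence, 1 for the second)

position embedding: a position vector that indicates where each token sits in the sequence.

There are two common ways to obtain position embeddings. One computes them from fixed trigonometric (sine/cosine) formulas, as in the original Transformer. The other learns them during training, which is what BERT does (see the position_embeddings lookup table in the code later on). Both encode absolute positions. They look like this: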
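To make the three inputs concrete, here is a minimal sketch of how the id sequences for a sentence pair could be assembled. The tiny vocabulary and the helper name build_inputs are assumptions for illustration only; in practice the tokenizer of the transformers library does this work.

```python
def build_inputs(tokens_a, tokens_b, vocab):
    """Build input_ids, token_type_ids and position_ids for a sentence pair.

    Layout: [CLS] sentence A [SEP] sentence B [SEP]
    (Hypothetical helper with a toy vocabulary.)
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, input_ids, token_type_ids, position_ids


# Toy vocabulary, not the real BERT vocabulary.
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "我": 3, "爱": 4, "武": 5, "汉": 6}

tokens, input_ids, token_type_ids, position_ids = build_inputs(
    ["我", "爱"], ["武", "汉"], vocab
)
print(tokens)
print(input_ids)
print(token_type_ids)
print(position_ids)
```

Each of the three id lists is later looked up in its own embedding table, and the resulting vectors are summed.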

2. A look at the Hugging Face BERT source code

I mainly use BERT through the Hugging Face transformers library, usually just to obtain embeddings. This section covers how to use BERT and reads parts of the Hugging Face source code.

Extracting text embeddings with BERT

First, a short piece of code that uses the BERT implementation in Hugging Face transformers to extract vectors for the sentence 我愛武漢！我愛中國！ The code is as follows:

from transformers import BertModel, BertTokenizer, BertConfig
import torch

config = BertConfig.from_pretrained('pretrain_model/chinese-bert-wwm')  # step 1: load the model config file
bertmodel = BertModel.from_pretrained('pretrain_model/chinese-bert-wwm', config=config)  # step 2: build the model and load the weights
tokenizer = BertTokenizer.from_pretrained('pretrain_model/chinese-bert-wwm')  # step 3: load the tokenizer

text1 = '我愛武漢！我愛中國！'
tokeniz_text1 = tokenizer.tokenize(text1)
indexed_tokens_1 = tokenizer.convert_tokens_to_ids(tokeniz_text1)
print('len(indexed_tokens_1):', len(indexed_tokens_1))
print(indexed_tokens_1)

input_ids_1 = indexed_tokens_1
segments_ids_1 = [0] * len(input_ids_1)  # optional here because the input is a single sentence
input_masks_1 = [1] * len(input_ids_1)   # optional here because the input is a single sentence

input_ids_1_tensor = torch.tensor([input_ids_1])
# All three inputs could be passed, but for a single sentence the model
# builds the segment ids and the attention mask by itself.
# This version of BertModel returns a tuple: the last layer's last_hidden_state
# and the pooled output computed from the first token of the last layer.
vector1, pooler1 = bertmodel(input_ids_1_tensor)

text2 = '[CLS]我愛武漢！我愛中國！[SEP]'
tokeniz_text2 = tokenizer.tokenize(text2)
indexed_tokens_2 = tokenizer.convert_tokens_to_ids(tokeniz_text2)

input_ids_2 = indexed_tokens_2
segments_ids_2 = [0] * len(input_ids_2)  # optional here because the input is a single sentence
input_masks_2 = [1] * len(input_ids_2)   # optional here because the input is a single sentence

input_ids_2_tensor = torch.tensor([input_ids_2])
vector2, pooler2 = bertmodel(input_ids_2_tensor)
print('pooler2:', pooler2)
print('vector2[:,0:1,:]:', vector2[:, 0:1, :])

text1_encode = tokenizer.encode(text1, add_special_tokens=True)
print('len(text1_encode):', len(text1_encode))
print('text1_encode:', text1_encode)
# text1_encode and indexed_tokens_2 are identical: with add_special_tokens=True,
# encode() adds the [CLS] and [SEP] special tokens to the text automatically.
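The code above is built on PyTorch and the transformers library. As you can see, using BERT is very simple.

Step 1: initialize the BERT model and load the weights. This loads the config file first, then builds the model and loads the weights.

Step 2: map the input text to ids in the vocabulary. These ids are what the embedding layer later turns into vectors.

Step 3: feed the ids into BERT to get the output vectors for the input text.

Note that I passed only one of BERT's inputs here, input_ids, and left out the other two. Because the input is a single sentence, the model fills in the other two internally. They are not absent, which is worth keeping in mind.

One thing puzzled me at first. Since BERT's input starts with the special [CLS] token, I expected last_hidden_state[:, 0:1, :] to equal the pooler output, but the two differ. The BertPooler code later in this post explains why: the pooler takes the first token's hidden state and passes it through a linear layer and a tanh, so its output is a transformed version of that vector, not the vector itself.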
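The gap between the first token's hidden state and the pooler output can be illustrated without loading a model. The sketch below assumes toy numbers and a made-up helper name pooler; it mirrors the dense-plus-tanh step of the BertPooler class quoted later in this post.

```python
import math


def pooler(first_token, weight, bias):
    """tanh(W x + b) applied to the first token's hidden state (toy version)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, first_token)) + b)
        for row, b in zip(weight, bias)
    ]


first_token = [0.5, -1.0, 2.0]      # stands in for last_hidden_state[:, 0]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]          # identity matrix, chosen only for clarity
bias = [0.0, 0.0, 0.0]

pooled = pooler(first_token, weight, bias)
print(first_token)
print(pooled)  # differs from first_token even with identity weights
```

Even with identity weights the tanh squashes every value into (-1, 1), so the two tensors cannot match. With the real model, applying the pooler's dense layer and tanh to vector2[:, 0] by hand should reproduce pooler2 when the model is in eval mode, since that is what the BertPooler code does.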

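Reading the BertModel code

The code above shows roughly how to create and use a BERT model through a few API calls. How Hugging Face implements BertModel matters too, so let's read the implementation carefully. Here is the file layout of the transformers project:

The project wraps the construction of many models. We mainly read modeling_bert.py, which shows in detail how a BERT model is built: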
class BertModel(BertPreTrainedModel):
    """......."""

    def __init__(self, config):
        super().__init__(config)
        self.config = config

        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)

        self.init_weights()

    def get_input_embeddings(self):
        return self.embeddings.word_embeddings

    def set_input_embeddings(self, value):
        self.embeddings.word_embeddings = value

    def _prune_heads(self, heads_to_prune):
        """ Prunes heads of the model.
            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
            See base class PreTrainedModel
        """
        for layer, heads in heads_to_prune.items():
            self.encoder.layer[layer].attention.prune_heads(heads)

    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING)
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
    ):
        r"""......."""
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            input_shape = input_ids.size()
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.size()[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        device = input_ids.device if input_ids is not None else inputs_embeds.device

        if attention_mask is None:
            attention_mask = torch.ones(input_shape, device=device)
        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        # ourselves in which case we just need to make it broadcastable to all heads.
        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(
            attention_mask, input_shape, self.device
        )

        # If a 2D or 3D attention mask is provided for the cross-attention
        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
        if self.config.is_decoder and encoder_hidden_states is not None:
            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
            if encoder_attention_mask is None:
                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
        else:
            encoder_extended_attention_mask = None

        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

        embedding_output = self.embeddings(
            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
        )
        encoder_outputs = self.encoder(
            embedding_output,
            attention_mask=extended_attention_mask,
            head_mask=head_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_extended_attention_mask,
        )
        sequence_output = encoder_outputs[0]
        pooled_output = self.pooler(sequence_output)

        outputs = (sequence_output, pooled_output,) + encoder_outputs[
            1:
        ]  # add hidden_states and attentions if they are here
        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)
That is the complete BertModel code. In the BertModel class, __init__() defines the basic building blocks of the model, and forward() wires those blocks together to implement the BERT logic.

def __init__(self, config):
    super().__init__(config)
    self.config = config

    self.embeddings = BertEmbeddings(config)
    self.encoder = BertEncoder(config)
    self.pooler = BertPooler(config)

    self.init_weights()

__init__() defines three main modules: BertEmbeddings, BertEncoder and BertPooler. In forward(), the input passes through these three modules in order, which gives the result we want: the BERT vectors for the text.

Now let's read forward():
if input_ids is not None and inputs_embeds is not None:
    raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
elif input_ids is not None:
    input_shape = input_ids.size()
elif inputs_embeds is not None:
    input_shape = inputs_embeds.size()[:-1]
else:
    raise ValueError("You have to specify either input_ids or inputs_embeds")

device = input_ids.device if input_ids is not None else inputs_embeds.device

if attention_mask is None:
    attention_mask = torch.ones(input_shape, device=device)
if token_type_ids is None:
    token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
# ourselves in which case we just need to make it broadcastable to all heads.
if attention_mask.dim() == 3:
    extended_attention_mask = attention_mask[:, None, :, :]
elif attention_mask.dim() == 2:
    # Provided a padding mask of dimensions [batch_size, seq_length]
    # - if the model is a decoder, apply a causal mask in addition to the padding mask
    # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
    if self.config.is_decoder:
        batch_size, seq_length = input_shape
        seq_ids = torch.arange(seq_length, device=device)
        causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
        causal_mask = causal_mask.to(
            attention_mask.dtype
        )  # causal and attention masks must have same type with pytorch version < 1.3
        extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
    else:
        extended_attention_mask = attention_mask[:, None, None, :]
else:
    raise ValueError(
        "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
            input_shape, attention_mask.shape
        )
    )

# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

# If a 2D or 3D attention mask is provided for the cross-attention
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
if self.config.is_decoder and encoder_hidden_states is not None:
    encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
    encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
    if encoder_attention_mask is None:
        encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)

    if encoder_attention_mask.dim() == 3:
        encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]
    elif encoder_attention_mask.dim() == 2:
        encoder_extended_attention_mask = encoder_attention_mask[:, None, None, :]
    else:
        raise ValueError(
            "Wrong shape for encoder_hidden_shape (shape {}) or encoder_attention_mask (shape {})".format(
                encoder_hidden_shape, encoder_attention_mask.shape
            )
        )

    encoder_extended_attention_mask = encoder_extended_attention_mask.to(
        dtype=next(self.parameters()).dtype
    )  # fp16 compatibility
    encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -10000.0
else:
    encoder_extended_attention_mask = None

# Prepare head mask if needed
# 1.0 in head_mask indicate we keep the head
# attention_probs has shape bsz x n_heads x N x N
# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
if head_mask is not None:
    if head_mask.dim() == 1:
        head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
        head_mask = head_mask.expand(self.config.num_hidden_layers, -1, -1, -1, -1)
    elif head_mask.dim() == 2:
        head_mask = (
            head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1)
        )  # We can specify head_mask for each layer
    head_mask = head_mask.to(
        dtype=next(self.parameters()).dtype
    )  # switch to float if need + fp16 compatibility
else:
    head_mask = [None] * self.config.num_hidden_layers
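The code above is preprocessing. It validates input_ids, which must not be missing and must not be passed together with inputs_embeds. It then finds out whether the device in use is a CPU or a GPU. It checks attention_mask and token_type_ids and creates defaults when they are None. It turns attention_mask into extended_attention_mask so that it broadcasts to all attention heads. Finally it checks whether the model is configured as a decoder. BERT is encoder-based, so for ordinary use this check seems unnecessary to me: the output of each encoder layer just feeds the next encoder layer and never goes to a decoder. The branch only matters when config.is_decoder is set.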
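The line (1.0 - extended_attention_mask) * -10000.0 is easier to follow with numbers. The sketch below uses plain Python lists instead of tensors and made-up scores; it shows that adding -10000 before the softmax drives the weight of padding positions to zero.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


attention_mask = [1, 1, 1, 0, 0]            # the last two positions are padding
additive_mask = [(1.0 - m) * -10000.0 for m in attention_mask]

raw_scores = [0.2, 1.5, -0.3, 0.9, 2.0]     # illustrative attention scores
probs = softmax([s + a for s, a in zip(raw_scores, additive_mask)])

print(additive_mask)
print(probs)  # padding positions receive zero weight
```

The real positions keep their relative weights, because the mask adds 0.0 to them, and the whole row still sums to 1.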

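Now for the core part.

After the preprocessing above, the inputs are all valid and can be fed into the functional modules of the model. The first one is:

embedding_output = self.embeddings(
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
)

The BertEmbeddings submodule

self.embeddings here is the BertEmbeddings(config) module created in __init__(). It can be seen as a submodel that does the embedding work. Its code: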
class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):
        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]
        device = input_ids.device if input_ids is not None else inputs_embeds.device
        if position_ids is None:
            position_ids = torch.arange(seq_length, dtype=torch.long, device=device)
            position_ids = position_ids.unsqueeze(0).expand(input_shape)
        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = inputs_embeds + position_embeddings + token_type_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings
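What it does: the input_ids, token_type_ids and position_ids we pass in (each is a set of indices into its own lookup table) go through torch.nn.Embedding to get their embeddings. The three embeddings are then added element-wise, followed by layer normalization and dropout. Why they can simply be added deserves a discussion of its own (see question 2 in section 4). Layer normalization is presumably there to avoid numerical and gradient problems, to help convergence and to limit distribution shift. Dropout randomly zeroes part of the features, which can improve generalization.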

BertEncoder

After the processing above we have a tensor, embeddings, of shape [batch_size, sequence_length, hidden_size]. It is then fed into the encoder. The code is below, and the parameters are all self-explanatory:

encoder_outputs = self.encoder(
    embedding_output,
    attention_mask=extended_attention_mask,
    head_mask=head_mask,
    encoder_hidden_states=encoder_hidden_states,
    encoder_attention_mask=encoder_extended_attention_mask,
)

self.encoder is likewise the BertEncoder(config) module created in __init__(). Its full code:
class BertEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.output_attentions = config.output_attentions
        self.output_hidden_states = config.output_hidden_states
        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
    ):
        all_hidden_states = ()
        all_attentions = ()
        for i, layer_module in enumerate(self.layer):
            if self.output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states,)

            layer_outputs = layer_module(
                hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask
            )
            hidden_states = layer_outputs[0]

            if self.output_attentions:
                all_attentions = all_attentions + (layer_outputs[1],)

        # Add last layer
        if self.output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        outputs = (hidden_states,)
        if self.output_hidden_states:
            outputs = outputs + (all_hidden_states,)
        if self.output_attentions:
            outputs = outputs + (all_attentions,)
        return outputs
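The key line in the model definition is:

self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])

From this line and the config value "num_hidden_layers": 12, we can see that BertEncoder is made of 12 BertLayer modules. The for loop in forward() does the following for each BertLayer:

for i, layer_module in enumerate(self.layer):
    if self.output_hidden_states:
        all_hidden_states = all_hidden_states + (hidden_states,)

    layer_outputs = layer_module(
        hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask
    )
    hidden_states = layer_outputs[0]

    if self.output_attentions:
        all_attentions = all_attentions + (layer_outputs[1],)

It updates hidden_states (that is, layer_outputs[0]) and passes the updated hidden_states to the next BertLayer. When output_hidden_states and output_attentions are enabled in the config, it also records the hidden_states going into each layer and the attentions (layer_outputs[1]) of each layer, and returns them together. So the final output holds the hidden_states of the last BertLayer and, optionally, the hidden_states of every layer (the embedding output plus the 12 layer outputs) and the attentions of all 12 layers.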

So what does a BertLayer look like? Here is its implementation:

class BertLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = BertAttention(config)
        self.is_decoder = config.is_decoder
        if self.is_decoder:
            self.crossattention = BertAttention(config)
        self.intermediate = BertIntermediate(config)
        self.output = BertOutput(config)

BertLayer is made of BertAttention(), BertIntermediate() and BertOutput(). Its forward() is simple and has no unusual steps: the input passes through BertAttention(), BertIntermediate() and BertOutput() in order. The interesting part is how these submodels are implemented:

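BertAttention

This module nests one more level: it consists of the BertSelfAttention() and BertSelfOutput() submodels.

Here we finally get to the implementation of the self-attention mechanism, which is exciting! Self-attention uses the attention mechanism to compute how strongly each token relates to every other token. (To be honest, my understanding of it is not very deep.)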
class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
            )
        self.output_attentions = config.output_attentions

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
    ):
        mixed_query_layer = self.query(hidden_states)

        # If this is instantiated as a cross-attention module, the keys
        # and values come from an encoder; the attention mask needs to be
        # such that the encoder's padding tokens are not attended to.
        if encoder_hidden_states is not None:
            mixed_key_layer = self.key(encoder_hidden_states)
            mixed_value_layer = self.value(encoder_hidden_states)
            attention_mask = encoder_attention_mask
        else:
            mixed_key_layer = self.key(hidden_states)
            mixed_value_layer = self.value(hidden_states)

        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)

        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)
        return outputs
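Before reading the code, recall the self-attention formula. The formulas follow the article "Attention機制詳解（二）——Self-Attention與Transformer":

First define Q, K and V from the input X:

Q = X·W_Q, K = X·W_K, V = X·W_V

Then apply them in the formula:

Z = softmax(Q·K^T / sqrt(d_k))·V

That is the formula for single-head self-attention. Multi-head attention computes it several times and then merges the results, which maps naturally onto matrix operations. Note also that Q, K and V each have their own learned parameters, which are not shared. The corresponding code: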
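self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
# Each nn.Linear holds two sets of learned parameters: a weight matrix and a bias vector.
# query, key and value have different parameters, that is, they do not share weights.

The parameters all live inside nn.Linear. self.query holds the Q projection for all 12 heads at once (all_head_size = num_attention_heads × attention_head_size). As noted, query, key and value have separate parameters and do not share them.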

So how is this implemented in forward()?

mixed_query_layer = self.query(hidden_states)  # compute Q
if encoder_hidden_states is not None:
    mixed_key_layer = self.key(encoder_hidden_states)
    mixed_value_layer = self.value(encoder_hidden_states)
    attention_mask = encoder_attention_mask
else:
    mixed_key_layer = self.key(hidden_states)      # compute K
    mixed_value_layer = self.value(hidden_states)  # compute V

# Reshape and permute (a slightly unusual "transpose"):
# mixed_query_layer has shape [batch_size, sequence_length, hidden_size]
# query_layer has shape [batch_size, num_attention_heads, sequence_length, attention_head_size]
query_layer = self.transpose_for_scores(mixed_query_layer)
key_layer = self.transpose_for_scores(mixed_key_layer)
value_layer = self.transpose_for_scores(mixed_value_layer)

# Dot product of Q and K
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
# Divide the result by the square root of the attention head size
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
if attention_mask is not None:
    # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
    attention_scores = attention_scores + attention_mask

# Normalize the attention scores to probabilities with softmax.
attention_probs = nn.Softmax(dim=-1)(attention_scores)

# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)

# Mask heads if we want to
if head_mask is not None:
    attention_probs = attention_probs * head_mask

# Multiply the attention probabilities by V to get the final result, Z in the formula
context_layer = torch.matmul(attention_probs, value_layer)

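The comments in the code above lay out the computation. First compute mixed_query_layer, mixed_key_layer and mixed_value_layer, then "transpose" them (a change of dimensions, to be precise) into per-head form. Next take the dot product of the query and key layers, divide by the square root of the attention head size, and apply softmax to get attention probabilities. Finally multiply the probabilities by the value layer to get the attention output, the context vectors. A few matrix operations implement self-attention neatly.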
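For readers who want to trace the arithmetic by hand, here is a minimal single-head version of the same computation on nested Python lists. It is a sketch for illustration, not the library code: the helper names are made up, and masking, dropout and the multi-head reshape are left out.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    probs = softmax_rows(scores)
    return matmul(probs, v), probs


# Two tokens with a head size of 2; the values are illustrative only.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]

context, probs = attention(q, k, v)
print(probs)
print(context)
```

Each row of probs sums to 1, and each output row is a weighted average of the value vectors, with the largest weight on the token whose key best matches the query.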

That completes self-attention. Next comes BertSelfOutput():

class BertSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states

The BertSelfOutput() code is simple: the self-attention output goes through a linear layer and dropout, is added to the layer's input (a residual connection), and is then layer-normalized. This is the end of BertAttention(), and we move on to the intermediate layer, BertIntermediate().

BertIntermediate

The code for BertIntermediate(), the intermediate layer, is simple:

class BertIntermediate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
        if isinstance(config.hidden_act, str):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.intermediate_act_fn(hidden_states)
        return hidden_states

The input goes through a fully connected layer. Because config.hidden_size < config.intermediate_size, this Linear expands the feature space. It is followed by the gelu activation function, which adds non-linearity to the features.
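The gelu activation mentioned here can be written in one line. The sketch below uses the exact form based on the Gaussian error function; which variant a given checkpoint uses depends on config.hidden_act, so treat this as an illustration of the function's shape only.

```python
import math


def gelu(x):
    """Gaussian Error Linear Unit: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, gelu(x))
```

Unlike ReLU, gelu is smooth around zero and lets small negative values through slightly, while behaving almost like the identity for large positive inputs.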

BertOutput(config)

After the intermediate layer BertIntermediate() comes BertOutput(config), the last submodel of BertLayer().

class BertOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states

The linear layer shrinks the feature space back to hidden_size. Dropout follows, then the residual addition with input_tensor and layer normalization. The resulting hidden_states is the output of one BertLayer().
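The pattern LayerNorm(hidden_states + input_tensor) appears in both BertSelfOutput and BertOutput. Here is a small sketch of it on a single vector. It is simplified on purpose: the function names are made up, and the learned scale and shift parameters that a real layer norm module carries are left out.

```python
import math


def layer_norm(x, eps=1e-12):
    """Normalize one vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(sublayer_output, residual):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(sublayer_output, residual)])


out = add_and_norm([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(out)
```

The residual addition keeps a direct path from the layer input to its output, and the normalization keeps the scale of the activations stable from layer to layer.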

BertPooler()

After 12 BertLayer() modules transform the input layer by layer, the final outputs go into BertPooler():

sequence_output = encoder_outputs[0]
pooled_output = self.pooler(sequence_output)

The pooler code:

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
# To see exactly what the pooler does, debug the shape of hidden_states.

From the code, the pooler takes index 0 along the second (sequence) dimension of the last hidden states, that is, the vector of the first token ([CLS]), and passes it through a linear layer and a tanh. The result has shape [batch_size, hidden_size].

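That is more or less the full implementation of BertModel. Because the code has many levels of nested calls, it can be confusing, so a figure helps to visualize it:

The figure above is a flow chart of the BertModel structure in Hugging Face (a simplified diagram, so please forgive the omissions). It shows BertModel's inputs, its basic submodels and the data flow, which makes the code easier to follow. The yellow shapes are basic torch modules (except Q, K and V), the rectangles in other colors are models, and the parallelograms are data.

This was a brief walk through the BertModel implementation. It involves many details that I did not dig into: the parameters and dimensions of each module, how the dimensions of variables change, and the exact role and meaning of each module. Readers with the energy can explore them on their own.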

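3. Hands-on: text classification with BERT

Here we use the classifier that ships with the transformers project to build a simple text classification task. We do not rewrite the Dataloader or the training loop ourselves. We simply take the BERT classifier from the transformers project and fine-tune it, which is little work and gives fairly good results. It can of course be implemented entirely by hand (I did that earlier for a BERT-based sentence classification task, in the post 使用bert模型做句子分類, for those interested). When time allows, I may write a post comparing different models on text classification, both to get more practice with the code and concepts and to share with everyone.

Now to this post's task: fine-tuning the BERT classifier from the transformers project for text classification. The project already contains all the code, and we only need to convert our text into a form the project can recognize and read. A brief look at the classification code:

The main code for the classification task is in run_glue.py. It defines the main function, the command-line argument parser, model loading and invocation, training and evaluation, and the calls to the data reading and processing modules.

Here is how the classification model is created:

model = AutoModelForSequenceClassification.from_pretrained(
    args.model_name_or_path,
    from_tf=bool(".ckpt" in args.model_name_or_path),
    config=config,
    cache_dir=args.cache_dir,
)

In the end, AutoModelForSequenceClassification.from_pretrained() resolves to the BertForSequenceClassification class in modeling_bert.py, which is the actual classifier implementation: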

class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)

        self.init_weights()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
    ):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )

        pooled_output = outputs[1]

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        if labels is not None:
            if self.num_labels == 1:
                # We are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

        return outputs
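The classification branch of the loss above uses CrossEntropyLoss on the raw logits. For a single example, that quantity can be computed by hand as follows. This is a sketch with made-up logits and a made-up function name; the PyTorch loss additionally averages over the batch by default.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[label]


logits = [2.0, 0.5, -1.0]                # illustrative classifier outputs for 3 classes
loss_if_class_0 = cross_entropy(logits, 0)
loss_if_class_2 = cross_entropy(logits, 2)
print(loss_if_class_0, loss_if_class_2)
```

The loss is small when the largest logit belongs to the true class and grows as the model puts more probability on the wrong classes.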

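The model calls BertModel and then classifies with nn.Linear(config.hidden_size, self.config.num_labels). The loss is the usual cross-entropy loss (or mean squared error when num_labels is 1, which is regression). That is a brief analysis of the classifier. What remains for us is to write a task processor modeled on the code in the project:

Project directory structure: transformerer_local/data/glue.py. Note that transformerer_local was originally transformerer and I renamed it. In glue.py we add the code for our classification task: a processor that reads the text from a file and turns each record into an Example. Pay attention to get_labels(), which must return your own set of classes. The code: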

class MyownProcessor(DataProcessor):
    """Processor for our own classification data set (adapted from the CoLA processor)."""

    def get_example_from_tensor_dict(self, tensor_dict):
        """See base class."""
        return InputExample(
            tensor_dict["idx"].numpy(),
            tensor_dict["sentence"].numpy().decode("utf-8"),
            None,
            str(tensor_dict["label"].numpy()),
        )

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_predict_examples(self, data_dir):
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "predict")

    def get_labels(self):
        """See base class."""
        return ["0", "1", "2", "3", "4", "5", "6", "7"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            if len(line) == 2:
                text_a = line[0]
                label = line[1]
                examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
            else:
                # Skip malformed rows, but print them so they can be inspected.
                print(line)
        return examples

For evaluation, the metric function needs attention too. Our task is not binary, so f1_score must be computed with a different averaging strategy:

In transformerer_local/data/metrics/__init__.py (again, transformerer_local was originally transformerer), add the following:

# Added: metric function for multi-class classification
def acc_and_f1_multi(preds, labels):
    acc = simple_accuracy(preds, labels)
    f1 = f1_score(y_true=labels, y_pred=preds, average='micro')
    return {
        "acc": acc,
        "f1": f1,
        "acc_and_f1": (acc + f1) / 2,
    }


def glue_compute_metrics(task_name, preds, labels):
    assert len(preds) == len(labels)
    if task_name == "cola":
        return {"mcc": matthews_corrcoef(labels, preds)}
    elif task_name == "sst-2":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "mrpc":
        return acc_and_f1(preds, labels)
    elif task_name == "sts-b":
        return pearson_and_spearman(preds, labels)
    elif task_name == "qqp":
        return acc_and_f1(preds, labels)
    elif task_name == "mnli":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "mnli-mm":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "qnli":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "rte":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "wnli":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "hans":
        return {"acc": simple_accuracy(preds, labels)}
    # Added: dispatch for our multi-class task
    elif task_name == "myown":
        return acc_and_f1_multi(preds, labels)
    else:
        raise KeyError(task_name)

The additions are the parts marked with comments.
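One property of average='micro' is worth knowing before reading the results: in a single-label multi-class task, micro-averaged F1 is always equal to accuracy, so the "f1" and "acc" values returned by acc_and_f1_multi will coincide. The sketch below checks this with hand-written helpers (the function names and the toy labels are made up, and no scikit-learn is needed) and shows macro averaging for comparison.

```python
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def class_counts(preds, labels, cls):
    """True positives, false positives and false negatives for one class."""
    tp = sum(1 for p, y in zip(preds, labels) if p == cls and y == cls)
    fp = sum(1 for p, y in zip(preds, labels) if p == cls and y != cls)
    fn = sum(1 for p, y in zip(preds, labels) if p != cls and y == cls)
    return tp, fp, fn


def micro_f1(preds, labels):
    """Pool the counts over all classes, then compute a single F1."""
    classes = sorted(set(labels) | set(preds))
    tp = fp = fn = 0
    for cls in classes:
        c_tp, c_fp, c_fn = class_counts(preds, labels, cls)
        tp, fp, fn = tp + c_tp, fp + c_fp, fn + c_fn
    return 2 * tp / (2 * tp + fp + fn)


def macro_f1(preds, labels):
    """Compute F1 per class, then take the unweighted mean."""
    classes = sorted(set(labels) | set(preds))
    scores = []
    for cls in classes:
        tp, fp, fn = class_counts(preds, labels, cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = [0, 0, 0, 0, 1, 1, 2]   # toy, imbalanced ground truth
preds = [0, 0, 0, 0, 1, 0, 0]    # the model mostly predicts the majority class

acc = accuracy(preds, labels)
micro = micro_f1(preds, labels)
macro = macro_f1(preds, labels)
print(acc, micro, macro)
```

If the goal is to see how the classifier does on the rarer classes, macro averaging (or per-class precision and recall) may be the more informative choice, since it is not dominated by the majority class.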

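OK, the code is done. Next comes the data. Here it is:

The screenshot above shows part of the data. We extract the pat_summary and ipc_class fields. The data quality is fairly good, so the only cleaning needed is dropping very long texts (longer than 510):

The histogram of text lengths shows that almost all texts are shorter than 510. Only 128 records are longer, out of about 248,000 in total, so these few can simply be dropped. The data is then split into a training set and a test set at a ratio of 8:2 and saved in tsv format.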
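The cleaning and splitting steps just described can be sketched as follows. The function names, the shuffling seed and the in-memory rows are assumptions for illustration; the two columns follow the (text, label) order that MyownProcessor expects, and the field names pat_summary and ipc_class come from the data described above.

```python
import csv
import io
import random


def filter_and_split(rows, max_len=510, train_ratio=0.8, seed=42):
    """Drop texts longer than max_len, shuffle, and split into train and dev rows.

    rows: list of (pat_summary, ipc_class) pairs. Hypothetical helper.
    """
    kept = [
        (text.replace("\t", " ").replace("\n", " "), label)
        for text, label in rows
        if len(text) <= max_len
    ]
    rng = random.Random(seed)
    rng.shuffle(kept)
    cut = int(len(kept) * train_ratio)
    return kept[:cut], kept[cut:]


def to_tsv(rows):
    """Serialize rows as tab-separated text, one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Toy data: ten short records and one that is too long.
rows = [("text%d" % i, str(i % 8)) for i in range(10)] + [("x" * 600, "0")]
train_rows, dev_rows = filter_and_split(rows)
train_tsv = to_tsv(train_rows)
print(len(train_rows), len(dev_rows))
print(train_tsv)
```

In a real run, the returned strings would be written to train.tsv and dev.tsv in the data directory passed to the training script.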

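Next comes training. Put the following command in the file train_glue_classification.sh:

export TASK_NAME=myown
python -W ignore ./examples/run_glue.py \
    --model_type bert \
    --model_name_or_path ./pretrain_model/Chinese-BERT-wwm/ \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --data_dir ./data_set/patent/ \
    --max_seq_length 510 \
    --per_gpu_eval_batch_size=8 \
    --per_gpu_train_batch_size=8 \
    --per_gpu_predict_batch_size=48 \
    --learning_rate 2e-5 \
    --num_train_epochs 5.0 \
    --output_dir ./output/

Run the sh file directly in a terminal: bash train_glue_classification.sh. Note that the training GPU needs at least 11 GB of memory, otherwise it will not run, and the batch_size cannot be too large. One epoch takes about 3.5 hours, so training takes quite a while. The final result:

We get acc=0.8508. For an 8-class task, 85% accuracy looks acceptable at first glance. For a detailed analysis, you could compute precision and recall for each class, or look at the ROC, to examine the model's performance more closely. I will not go into that here, nor into how to optimize the model to raise accuracy.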

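Summary: the above fine-tunes the BERT classifier from the transformers project for text classification. The code is essentially already written, and with a few small changes to code and configuration we can train our own classifier quickly.

4. Summary of difficult points in BERT

There are many more details of BERT worth exploring. I recommend some articles on Zhihu, such as 超細節的BERT/Transformer知識點.

1. How does BERT handle long text?

If the text is not much too long, around 511 to 600, the part beyond 510 can simply be cut off. This is the crudest approach.

If the text is very long and its content matters, such crude truncation will not do. The main idea is global norm + passage rank + sliding window, from the Amazon EMNLP paper Multi-passage BERT. Briefly, the sliding window method splits the document into short passages that partly overlap, then concatenates the vectors obtained from these passages or mean-pools them. How well this works has to be checked by experiment.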
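The splitting step of the sliding window method can be sketched like this. The window of 510 matches the length limit used earlier in this post; the stride of 255 (half a window of overlap) and the function name are assumptions chosen for illustration.

```python
def sliding_windows(tokens, window=510, stride=255):
    """Split a long token list into overlapping chunks of at most `window` tokens."""
    if len(tokens) <= window:
        return [list(tokens)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # this chunk already reaches the end of the document
        start += stride
    return chunks


tokens = list(range(1000))          # stands in for a 1000-token document
chunks = sliding_windows(tokens)
print([(chunk[0], chunk[-1], len(chunk)) for chunk in chunks])
```

Each chunk would then be encoded by BERT on its own, and the per-chunk vectors concatenated or mean-pooled as described above. A smaller stride gives more overlap at the cost of more forward passes.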

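2. BERT's input vectors, Token Embedding, Segment Embedding and Position Embedding, each have their own meaning. Why can they be added together before going into the model?

This question has been asked on Zhihu and answered by many experts. I lean toward this explanation: concatenating the one-hot vectors and passing them through a fully connected layer is equivalent to embedding each one and adding the embeddings.

Token Embedding, Segment Embedding and Position Embedding represent the token's meaning, which segment it belongs to, and its position. To train on all three together, concatenation seems the natural way to keep all the information: the one-hot encodings of [input_ids], [token_type_ids] and [position_ids] are concatenated into one new vector [input_ids token_type_ids position_ids] and fed into the model. But before entering the model this concatenated vector has to go through an embedding (a linear map), and embedding [input_ids], [token_type_ids] and [position_ids] separately and then adding the results is equivalent to that. Adding gives the same result while keeping the vector dimension small, which is why direct addition is used.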
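The equivalence claimed in this explanation can be checked numerically on a toy example. The embedding tables below are made-up numbers with an embedding size of 2; stacking the three tables vertically plays the role of the fully connected layer applied to the concatenated one-hot vector.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat(vec, mat):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(vec[r] * mat[r][c] for r in range(len(mat))) for c in range(len(mat[0]))]


# Toy embedding tables (illustrative values only).
token_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # vocabulary of 3 tokens
segment_table = [[1.0, -1.0], [2.0, -2.0]]           # 2 segments
position_table = [[0.01, 0.02], [0.03, 0.04]]        # 2 positions

token_id, segment_id, position_id = 2, 1, 0

# Route 1: look up each embedding and add them, as BertEmbeddings does.
summed = [
    t + s + p
    for t, s, p in zip(
        token_table[token_id], segment_table[segment_id], position_table[position_id]
    )
]

# Route 2: concatenate the one-hot vectors and apply one linear layer
# whose weight matrix is the three tables stacked on top of each other.
concat = one_hot(token_id, 3) + one_hot(segment_id, 2) + one_hot(position_id, 2)
stacked = token_table + segment_table + position_table
via_linear = vec_mat(concat, stacked)

print(summed)
print(via_linear)
```

Both routes give the same vector, because multiplying a one-hot vector by a matrix just selects one row, and the concatenation selects one row from each table and sums them.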

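3. BERT adds a [CLS] token before the first sentence. Why, and what is it for?

The last transformer layer outputs a vector at this position. Because the [CLS] token has no meaning of its own, its vector can blend the meaning of the whole sentence without bias, which makes it well suited for downstream tasks.

In fact, the authors of the Hugging Face BERT implementation consider this vector not very good. For downstream tasks they suggest doing something with the last layer's hidden_states [B, S, D] yourself, for example mean pooling. I have not tested this experimentally and only apply it. I used this idea in the post 使用bert模型做句子分類.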
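Mean pooling over the last layer's hidden states should skip padding positions, otherwise the padding vectors dilute the sentence vector. A minimal sketch for one sequence follows; the function name and the toy numbers are made up, and a tensor version would do the same thing per batch element.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors of one sequence, ignoring padding positions.

    hidden_states: list of S vectors of size D; attention_mask: list of S ints (1 = real token).
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    return [t / count for t in total]


hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]   # the last row is padding
attention_mask = [1, 1, 0]

sentence_vector = masked_mean_pool(hidden_states, attention_mask)
print(sentence_vector)
```

The resulting vector has the same size D as a single hidden state and can be fed to a classifier in place of the pooled [CLS] output.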

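4. Where does BERT's non-linearity come from?

Mainly from the gelu activation function in the feed-forward layer and from self-attention (its softmax is non-linear).

5. Why does BERT use multi-head attention?

The explanation from Google's authors is that multiple heads form multiple subspaces, so the model can attend to different aspects of the information.

One way to think about it: multi-head attention works a bit like multiple convolution kernels and can capture more varied information from the text.

Someone on Zhihu studied this question specifically, compared the similarities and differences between heads, and wrote a fairly comprehensive answer that is worth reading: 為什么Transformer 需要進行 Multi-head Attention？

Closing words:

This is as far as my understanding of BERT goes. BERT is essentially a language model for extracting word vectors. Because those vectors capture the semantics of the text well, it can handle many tasks with good results: NER, relation extraction, text similarity, text classification, reading comprehension and more.

I put a fair amount of effort into this post (about 10 days, from the May Day holiday until now), and as a summary of what I learned about NLP in this period it feels rewarding. Keep going! The post may not be all solid material and may contain mistakes. My knowledge is limited, so corrections are welcome.

References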

一文讀懂bert模型

超細節的BERT/Transformer知識點
