Datawhale Group Learning - NLP News Text Classification - TASK06
Task 6: Text Classification Based on Deep Learning (3)
Text Classification Based on Deep Learning
Learning Objectives
- Understand the principles of the Transformer and word representations based on pre-trained language models (BERT)
- Learn how to use BERT, specifically including pretrain and finetune
Text Representation Methods, Part 4
How the Transformer Works
The Transformer was proposed in "Attention Is All You Need". The encoder part of the model is a stack of encoders (the paper stacks six of them), and the decoder part is a stack of the same number of decoders.
We focus on the encoding part. The encoders all share the same structure but do not share parameters, and each encoder can be broken down into two sub-layers. After the input sequence is turned into word vectors, it first flows through a self-attention layer, which helps the encoder look at other words in the input sequence as it encodes each word. The output of self-attention is fed into a feed-forward neural network; the feed-forward network at each input position is independent of the others. Finally, the output is passed on to the next encoder.
Here we can see a key property of the Transformer: the word at each position flows through its own path in the encoder. In the self-attention layer these paths depend on one another, but the feed-forward layer has no such dependencies, so the different paths can be executed in parallel while flowing through the feed-forward network.
Self-attention uses a multi-head mechanism, so that different attention heads focus on different parts of the input.
When encoding "it", one attention head focuses on "the animal" while another focuses on "tired"; in a sense, the model's representation of "it" combines the representations of "animal" and "tired".
For the detailed computation of self-attention, see Jay Alammar's blog post on the Transformer; we do not expand on it here.
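To make the computation a bit more concrete, below is a minimal PyTorch sketch of scaled dot-product attention for a single head (an illustration only; multi-head attention runs several of these in parallel on projected inputs and concatenates the results). The shapes and random inputs are just for demonstration.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k); each position attends to every other position
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)   # attention distribution for each position
    return torch.matmul(weights, v)       # weighted sum of value vectors

# Toy example: one "sentence" of 5 tokens with 64-dimensional queries/keys/values
q = k = v = torch.randn(1, 5, 64)
out = scaled_dot_product_attention(q, k, v)   # shape: (1, 5, 64)
```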
In addition, to make the model aware of word order, positional encoding vectors are added. As shown in the figure below, each row corresponds to the positional encoding of one position, so the first row is the vector added to the embedding of the first word in the input sequence. Each row contains 512 values, each between -1 and 1. Because the left half is generated with a sine function and the right half with a cosine function, a clear split is visible down the middle.
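A small sketch of how such positional encodings can be generated (one common formulation; the figure described above concatenates the sine and cosine halves, while the variant below interleaves them across dimensions):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model=512):
    # Even dimensions use sine, odd dimensions use cosine, each at a different frequency
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe                                        # all values lie in [-1, 1]

print(sinusoidal_positional_encoding(max_len=100).shape)  # (100, 512)
```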
One detail of the encoder worth noting: each sub-layer (self-attention, FFNN) has a residual connection around it and is followed by layer normalization. Visualizing the vectors and the LayerNorm operation looks like the figure below:
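Putting the pieces together, here is a compact and simplified PyTorch sketch of one encoder layer with these residual connections and LayerNorm, using the post-norm arrangement of the original paper; the dimensions are illustrative only:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sub-layer 1: self-attention, residual connection, then LayerNorm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: position-wise feed-forward, residual connection, then LayerNorm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```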
Word Representations Based on Pre-trained Language Models
Word representations based on pre-trained language models can model contextual information, solving the problem that traditional static word vectors cannot capture polysemy. ELMo, the earliest of these, is based on two unidirectional LSTMs: it concatenates the hidden representations of a left-to-right model and a right-to-left model to learn contextual word embeddings. GPT replaces the LSTM with a Transformer as the encoder; it first pre-trains a language model and then fine-tunes the model parameters on downstream tasks. However, because GPT uses only a unidirectional language model, it has difficulty modeling bidirectional context. To address these issues, researchers proposed BERT. As shown in the figure below, BERT is a multi-layer Transformer encoder that produces deep contextual representations through a series of pre-training tasks.
The "Deep" in the ELMo paper's title refers to its bidirectional, multi-layer LSTM, but the more important word is "context". Traditional methods produce a word-to-vector lookup table: a static vector is learned for each word once, and that representation never changes with the surrounding context. Because of polysemy, such static word vectors have serious drawbacks. Take "bank" as an example: if the training corpus is large enough, the pre-learned vector mixes together all of the word's senses. In a downstream application, even when the context of "bank" in a new sentence contains words such as "money", so that we can be fairly sure it means the financial institution rather than a river bank, the static vector still blends multiple senses because it cannot adapt to the context. To solve this, ELMo first pre-trains a language model and then dynamically adjusts the word embeddings in the downstream task, so the final word representations capture the word's specific meaning in context and thereby address polysemy.
GPT, from OpenAI, is a generative pre-trained model. Beyond replacing ELMo's LSTM with a Transformer (a decoder-style stack with masked self-attention), GPT pioneered the pretrain-finetune paradigm in NLP. Although GPT follows the same two-stage scheme as ELMo, in the first stage it does not use ELMo's structure of concatenating two unidirectional multi-layer LSTMs; instead it uses an autoregressive, unidirectional language model.
Google proposed BERT in a paper published at NAACL 2019 (the preprint appeared in 2018). Like GPT, BERT adopts the two-stage pretrain-finetune scheme. In terms of model structure, however, BERT follows ELMo's idea of using a bidirectional language model instead of GPT's unidirectional one. BERT's authors considered ELMo's approach of simply concatenating two unidirectional language models too crude, so for the first-stage pre-training BERT introduces the masked language model: in a cloze-like fashion, the model predicts a masked word from its surrounding context rather than modeling text strictly left-to-right or right-to-left, which lets every layer freely encode information from both directions. To learn word order, BERT replaces the Transformer's sinusoidal position encodings with learned position embeddings, and to distinguish single-sentence from sentence-pair inputs it adds sentence type (segment) embeddings; BERT's input is shown in the figure. In addition, to fully learn the relationship between sentences, BERT introduces the next sentence prediction task: during training, 50% of the second sentences in each pair are the actual following sentence, while the remaining 50% are randomly sampled from elsewhere. Ablation experiments show that this pre-training task contributes substantially to sentence-pair relationship tasks. Besides the structural differences, BERT is pre-trained on a much larger amount of unlabeled data than GPT.
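To make the input side concrete, here is a hedged sketch of how BERT-style input representations sum token, learned position, and segment embeddings (the sizes follow the BERT-mini configuration used later in this task; the class and variable names are illustrative, not the official implementation):

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=5981, hidden_size=256, max_position=256, type_vocab_size=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        self.position_emb = nn.Embedding(max_position, hidden_size)    # learned, not sinusoidal
        self.segment_emb = nn.Embedding(type_vocab_size, hidden_size)  # sentence A vs. sentence B
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, input_ids, token_type_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device).unsqueeze(0)
        x = self.token_emb(input_ids) + self.position_emb(positions) + self.segment_emb(token_type_ids)
        return self.norm(x)
```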
In the second stage, like GPT, BERT uses fine-tuning to adapt to downstream tasks. As shown in the figure below, unlike GPT, BERT greatly reduces the amount of task-specific re-engineering: adding a linear classifier on top of BERT is enough for most downstream tasks. Concretely, for sentence-pair relationship tasks, as with GPT, one only needs to insert a separator token between the two sentences and add start and end tokens at the two ends; for the output, the position of the start token [CLS] in BERT's last layer is fed into a Softmax + Linear classification layer. For single-sentence classification, similarly, start and end tokens are added at the two ends of the sentence, and the output part is the same as for sentence-pair tasks. For question answering, since the output is the start and end positions of the answer within the given passage, the question and passage are first packed into the input as in the sentence-pair setting, and the output attaches start-position and end-position classifiers to the last-layer representations of each token in the second segment (the passage). Finally, for sequence labeling, the input is the same as for single-sentence classification, except that a classifier is attached to the last-layer representation of every token.
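For the single-sentence classification case, a minimal classification head on top of the [CLS] position might look like the sketch below (assuming a recent version of the Hugging Face transformers library; this is an illustration, not the exact fine-tuning code used later in this task):

```python
import torch.nn as nn
from transformers import BertModel

class BertSentenceClassifier(nn.Module):
    def __init__(self, pretrained_name, num_classes, hidden_size=768):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained_name)
        self.classifier = nn.Linear(hidden_size, num_classes)   # linear head over [CLS]

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        cls_vec = outputs.last_hidden_state[:, 0, :]   # hidden state of the [CLS] token
        return self.classifier(cls_vec)                # logits; pair with softmax / cross-entropy loss
```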
More importantly, BERT established the two-stage "pretrain-finetune" paradigm in NLP. In the first stage, a bidirectional language model is pre-trained on massive amounts of unlabeled text. It is worth noting that the Transformer, as a feature extractor, outperforms traditional RNNs and CNNs in both parallelism and modeling long-range dependencies; through pre-training, the lexical, syntactic, and other linguistic knowledge in the training data is distilled into the network's parameters. In the second stage, the parameters of some or all BERT layers are fine-tuned with the downstream task's data, or BERT is used as a feature extractor to produce BERT embeddings that are fed into the downstream task as additional features. Although this two-stage paradigm originated in computer vision, it had not been well exploited in NLP. BERT, as a culmination of recent breakthroughs in NLP, stands out not only for its strong performance but also because almost any NLP task can be conveniently adapted to it, bringing the linguistic knowledge learned during pre-training into downstream tasks and further improving their performance.
Text Classification Based on BERT
Bert Pretrain
The pre-training uses Google's TensorFlow implementation of BERT. First, training data is created from the raw text. Since the competition data consists of anonymized token IDs, we rebuild the vocabulary and use a whitespace-based tokenizer.
```python
class WhitespaceTokenizer(object):
    """WhitespaceTokenizer with vocab."""

    def __init__(self, vocab_file):
        self.vocab = load_vocab(vocab_file)
        self.inv_vocab = {v: k for k, v in self.vocab.items()}

    def tokenize(self, text):
        split_tokens = whitespace_tokenize(text)
        output_tokens = []
        for token in split_tokens:
            if token in self.vocab:
                output_tokens.append(token)
            else:
                output_tokens.append("[UNK]")
        return output_tokens

    def convert_tokens_to_ids(self, tokens):
        return convert_by_vocab(self.vocab, tokens)

    def convert_ids_to_tokens(self, ids):
        return convert_by_vocab(self.inv_vocab, ids)
```
Because the NSP pre-training task is removed, each document is split into multiple segments with a maximum length of 256; if the last segment is shorter than 256/2 it is discarded. Masked language modeling is applied to each segment following the original BERT paper, and the data is then converted into TFRecord format.
```python
def create_segments_from_document(document, max_segment_length):
    """Split single document to segments according to max_segment_length."""
    assert len(document) == 1
    document = document[0]
    document_len = len(document)

    index = list(range(0, document_len, max_segment_length))
    other_len = document_len % max_segment_length
    if other_len > max_segment_length / 2:
        index.append(document_len)

    segments = []
    for i in range(len(index) - 1):
        segment = document[index[i]: index[i+1]]
        segments.append(segment)

    return segments
```
During pre-training, only the masked language modeling task is performed, so the loss of the next sentence prediction task is no longer computed.
```python
(masked_lm_loss, masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
    bert_config, model.get_sequence_output(), model.get_embedding_table(),
    masked_lm_positions, masked_lm_ids, masked_lm_weights)

total_loss = masked_lm_loss
```
To match the sentence lengths and reduce training time, we adopt a BERT-mini model with the following configuration.
```json
{
  "hidden_size": 256,
  "hidden_act": "gelu",
  "initializer_range": 0.02,
  "vocab_size": 5981,
  "hidden_dropout_prob": 0.1,
  "num_attention_heads": 4,
  "type_vocab_size": 2,
  "max_position_embeddings": 256,
  "num_hidden_layers": 4,
  "intermediate_size": 1024,
  "attention_probs_dropout_prob": 0.1
}
```
Since our overall framework uses PyTorch, the final TensorFlow checkpoint needs to be converted into PyTorch weights.
```python
def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_file, pytorch_dump_path):
    # Initialise PyTorch model
    config = BertConfig.from_json_file(bert_config_file)
    print("Building PyTorch model from configuration: {}".format(str(config)))
    model = BertForPreTraining(config)

    # Load weights from tf checkpoint
    load_tf_weights_in_bert(model, config, tf_checkpoint_path)

    # Save pytorch-model
    print("Save PyTorch model to {}".format(pytorch_dump_path))
    torch.save(model.state_dict(), pytorch_dump_path)
```
Pre-training consumes considerable resources; if your hardware cannot support it, we recommend downloading an open-source pre-trained model instead.
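If you skip pretraining, one typical way to load an off-the-shelf checkpoint with the transformers library is shown below (the checkpoint name is only an example of a public model; it is not trained on this competition's anonymized token IDs):

```python
from transformers import BertTokenizer, BertModel

# Example of loading a public pre-trained checkpoint instead of pretraining from scratch
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
```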
Bert Finetune
For fine-tuning, the hidden vector of the first token of the last layer, i.e. [CLS], is used as the sentence representation and then fed into a softmax layer for classification.
```python
sequence_output, pooled_output = \
    self.bert(input_ids=input_ids, token_type_ids=token_type_ids)

if self.pooled:
    reps = pooled_output
else:
    reps = sequence_output[:, 0, :]  # sen_num x 256

if self.training:
    reps = self.dropout(reps)
```
Chapter Summary
This chapter introduced the principles and usage of BERT, covering both pretraining and fine-tuning.
Chapter Assignments
- Complete the BERT pretrain and finetune process
- Read the official BERT documentation and tune the relevant hyperparameters
Learning BERT with a Simplified Example
A Visual Notebook to Using BERT for the First Time.ipynb
In this notebook, we will use a pre-trained deep learning model to process some text, then use that model's output to classify the text. The text is a list of sentences from film reviews, and we will classify each sentence as speaking either "positively" or "negatively" about its subject.
Models: Sentence Sentiment Classification
Our goal is to create a model that takes a sentence (just like the ones in our dataset) and produces either a 1 (indicating the sentence carries a positive sentiment) or a 0 (indicating the sentence carries a negative sentiment). We can think of it as looking like this:
Under the hood, the model is actually made up of two models.
- DistilBERT processes the sentence and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
- The next model, a basic Logistic Regression model from scikit-learn, will take in the result of DistilBERT's processing and classify the sentence as either positive or negative (1 or 0, respectively).
The data we pass between the two models is a vector of size 768. We can think of this vector as an embedding for the sentence that we can use for classification.
Dataset
The dataset we will use in this example is SST2, which contains sentences from movie reviews, each labeled as either positive (has the value 1) or negative (has the value 0):
| sentence | label |
| --- | --- |
| a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films | 1 |
| apparently reassembled from the cutting room floor of any given daytime soap | 0 |
| they presume their audience won't sit still for a sociology lesson | 0 |
| this is a visually stunning rumination on love , memory , history and the war between art and commerce | 1 |
| jonathan parker 's bartleby should have been the be all end all of the modern office anomie films | 1 |
Installing the transformers library
Let’s start by installing the huggingface transformers library so we can load our deep learning NLP model.
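The install command itself does not appear in the extracted text; in a notebook environment it would typically be something like:

```python
# Notebook shell command to install the Hugging Face transformers library
!pip install transformers
```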
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')
```
Importing the dataset
We’ll use pandas to read the dataset and load it into a dataframe.
```python
df = pd.read_csv('train.tsv', delimiter='\t', header=None)
```
For performance reasons, we'll only use 2,000 sentences from the dataset.
```python
batch_1 = df[:2000]
```
We can ask pandas how many sentences are labeled as "positive" (value 1) and how many are labeled "negative" (having the value 0):
```python
batch_1[1].value_counts()
batch_1.head()
```
Loading the Pre-trained BERT model
Let’s now load a pre-trained BERT model.
```python
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)
```
Right now, the variable model holds a pretrained distilBERT model – a version of BERT that is smaller, but much faster and requires a lot less memory.
Model #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.
Tokenization
Our first step is to tokenize the sentences – break them up into words and subwords in the format BERT is comfortable with.
```python
tokenized = batch_1[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
tokenized.values
```
Padding
After tokenization, tokenized is a list of sentences – each sentence is represented as a list of tokens. We want BERT to process our examples all at once (as one batch); it's just faster that way. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array, rather than a list of lists (of different lengths).
```python
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

np.array(padded).shape
```
Masking
If we directly send padded to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we’ve added when it’s processing its input. That’s what attention_mask is:
```python
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape
```
Model #1: And Now, Deep Learning!
Now that we have our model and inputs ready, let’s run our model!
The model() function runs our sentences through BERT. The results of the processing will be returned into last_hidden_states.
```python
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)
```
Model #2: Train/Test Split
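Before splitting, the notebook slices out the output vector at the [CLS] position (the first token) of each sentence as the features for the classifier, and takes the labels from the dataframe. A sketch, assuming the variable names used in the next cell:

```python
# The [CLS] token's hidden state is used as the sentence embedding
features = last_hidden_states[0][:, 0, :].numpy()   # shape: (2000, 768)
labels = batch_1[1]
```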
Let's now split our dataset into a training set and a testing set (even though we're only using 2,000 sentences from the SST2 training set).
```python
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
```
[Bonus] Grid Search for Parameters
We can dive into Logistic regression directly with the Scikit Learn default parameters, but sometimes it’s worth searching for the best value of the C parameter, which determines regularization strength.
```python
# parameters = {'C': np.linspace(0.0001, 100, 20)}
# grid_search = GridSearchCV(LogisticRegression(), parameters)
# grid_search.fit(train_features, train_labels)

# print('best parameters: ', grid_search.best_params_)
# print('best scores: ', grid_search.best_score_)
```
We now train the LogisticRegression model. If you've chosen to do the grid search, you can plug the value of C into the model declaration (e.g. LogisticRegression(C=5.2)).
```python
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
```
Evaluating Model #2
So how well does our model do in classifying sentences? One way is to check the accuracy against the testing dataset:
```python
lr_clf.score(test_features, test_labels)
```
How good is this score? What can we compare it against? Let's first look at a dummy classifier:
```python
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```
So our model clearly does better than a dummy classifier. But how does it compare against the best models?
Proper SST2 scores
For reference, the highest accuracy score for this dataset is currently 96.8. DistilBERT can be trained to improve its score on this task – a process called fine-tuning which updates BERT’s weights to make it achieve a better performance in this sentence classification task (which we can call the downstream task). The fine-tuned DistilBERT turns out to achieve an accuracy score of 90.7. The full size BERT model achieves 94.9.
And that’s it! That’s a good first contact with BERT. The next step would be to head over to the documentation and try your hand at fine-tuning. You can also go back and switch from distilBERT to BERT and see how that works.