Accurately Labeling Subjective Question-Answer Content Using BERT
Introduction
Kaggle released the Q&A understanding competition at the beginning of 2020. The competition asks each team to build NLP models to predict the subjective ratings of question and answer pairs. We finished in 6th place out of 1,571 teams. Apart from the winning-solution write-up posted on Kaggle, we wrote this more beginner-friendly tutorial to introduce the competition and how we won a gold medal. We also open-sourced our code in this Github repository.
Data
The competition collects question and answer pairs from 70 Stack-Overflow-like websites. The question title, question body and answer serve as text features, alongside some other features such as url and user id. The target labels are 30 dimensions with values between 0 and 1, evaluating aspects of the question and answer such as whether the question is critical and whether the answer is helpful. The raters received minimal guidance and training, and the targets relied largely on their subjective interpretation. In other words, the target score simply comes from the raters' common sense. Each target variable is the average of multiple raters' classifications: if there are four raters and one classifies a sample as positive while the other three classify it as negative, the target value is 0.25.
Here is an example of a question and answer pair:
Question title: What am I losing when using extension tubes instead of a macro lens?
Question body: After playing around with macro photography on-the-cheap (read: reversed lens, rev. lens mounted on a straight lens, passive extension tubes), I would like to get further with this. The problems with …
Answer: I just got extension tubes, so here’s the skinny. …what am I losing when using tubes…? A very considerable amount of light! Increasing that distance from the end of the lens to the sensor …
The training and test sets are distributed as shown below.
Evaluation metrics
Spearman's rank correlation coefficient is used as the evaluation metric in this competition.
Intuitively, Pearson correlation measures the linear correlation of X and Y. Spearman's rank correlation uses the ranks of X and Y in the formula instead of their raw values, so it measures the monotonic relationship between X and Y. For the data given in the chart, the Pearson correlation is 0.88 while the Spearman correlation is 1.
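As a quick illustration (not the exact data from the chart; the numbers below are made up for the example), the two coefficients can be compared with scipy:

```python
from scipy.stats import pearsonr, spearmanr

# y grows monotonically with x, but not linearly
x = [1, 2, 3, 4, 5]
y = [1, 2, 4, 8, 16]

print(pearsonr(x, y)[0])   # ~0.93: the relationship is not linear
print(spearmanr(x, y)[0])  # 1.0: the rankings of x and y match exactly
```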
Why was Spearman used in this Kaggle competition? Considering the subjective and noisy nature of the labels, Spearman correlation tends to be more robust to outliers than, for instance, Pearson correlation. Also, the target value is an understanding of the question and answer based on the raters' common sense. Suppose we have 3 answers and we evaluate whether they are well written: answer A scores 0.5, answer B scores 0.2 and answer C scores 0.1. If we claim answer A is 0.3 better than answer B, does that make sense? Not really. We do not need the exact value differences; it is enough to know that A is better than B and B is better than C.
Source: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

NLP Pipeline
(Image by author)

A general NLP pipeline is shown in the figure above. A typical non-neural-network-based solution could be the following (a minimal sketch appears after the list):
- Use TF-IDF or word embeddings to get token-based vector representations
- Average the token vectors to get a document vector representation
- Use random forest or LightGBM as the classifier or regressor
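A minimal sketch of such a baseline with scikit-learn and LightGBM (the function and argument names are placeholders, and TF-IDF here directly produces the document vector rather than averaging word embeddings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from lightgbm import LGBMRegressor

def train_baseline(train_texts, train_targets, test_texts):
    # TF-IDF yields one sparse vector per document, playing the role
    # of the aggregated token representation.
    vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    # One regressor per target column; a single column is shown for brevity.
    model = LGBMRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(X_train, train_targets)
    return model.predict(X_test)
```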
Due to the emergence of the transformer and BERT in 2017 and 2018, NLP has been experiencing an "ImageNet" moment, and BERT has become the dominant algorithm in NLP competitions. We do not introduce BERT in this blog; there are several good tutorials, such as here, here and here.
Now, we can restructure the NLP pipeline using BERT (a minimal sketch follows the list):
- Use the BERT wordpiece tokenizer to generate (sub)word tokens
- Generate embedding vectors per token from BERT
- Average the token vectors with a neural-network pooling layer
- Use feed-forward layers as the classifier or regressor
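A minimal sketch of this reshaped pipeline with the Huggingface transformers library; the mean pooling and the 30-dimensional regression head below are illustrative rather than our exact competition model:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
backbone = AutoModel.from_pretrained("bert-base-uncased")
head = torch.nn.Linear(backbone.config.hidden_size, 30)  # 30 target columns

# 1. BERT wordpiece tokenization
batch = tokenizer(
    ["What am I losing when using extension tubes instead of a macro lens?"],
    padding=True, truncation=True, return_tensors="pt",
)

with torch.no_grad():
    # 2. one embedding vector per (sub)word token
    tokens = backbone(**batch).last_hidden_state            # (batch, seq_len, hidden)
    # 3. mean-pool the token vectors into one document vector
    mask = batch["attention_mask"].unsqueeze(-1).float()
    doc_vec = (tokens * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden)
    # 4. feed-forward layer as the regressor
    scores = torch.sigmoid(head(doc_vec))                   # 30 values in [0, 1]
```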
Gold Medal Solution
The big picture
As illustrated in the figure below, we use four BERT-based models and a Universal Sentence Encoder model as base models, then stack them to generate the final result. In the rest of this blog, we will only focus on the transformer/BERT models. For more information on the Universal Sentence Encoder, you can read the original paper here, and the code is available here.
(Image by author)

Architecture of BERT-based models
The animation below shows how one base model works. The code is here.
Huggingface packages PyTorch implementations of most state-of-the-art NLP models. In our solution, we selected 4 BERT-based models implemented by Huggingface: Siamese RoBERTa base, Siamese XLNet base, Double ALBERT base V2 and Siamese BERT base uncased.
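As a rough sketch of the Siamese idea behind these models (one backbone with shared weights encodes the question text and the answer text separately, and the pooled vectors are concatenated before the regression head), with illustrative class and attribute names rather than our actual implementation:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SiameseTransformer(nn.Module):
    def __init__(self, model_name="roberta-base", n_targets=30):
        super().__init__()
        # A single backbone with shared weights encodes both text segments.
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden = self.backbone.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 256), nn.ReLU(), nn.Linear(256, n_targets)
        )

    def encode(self, ids, mask):
        out = self.backbone(input_ids=ids, attention_mask=mask).last_hidden_state
        m = mask.unsqueeze(-1).float()
        return (out * m).sum(dim=1) / m.sum(dim=1)  # mean pooling over real tokens

    def forward(self, q_ids, q_mask, a_ids, a_mask):
        q_vec = self.encode(q_ids, q_mask)  # question title + body
        a_vec = self.encode(a_ids, a_mask)  # answer
        return torch.sigmoid(self.head(torch.cat([q_vec, a_vec], dim=-1)))
```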
(Image by author)

Training and experiment setup
We have two-stage training. Stage 1 is end-to-end parameter tuning, and stage 2 only tunes the "head".
In the first stage (a rough sketch of this setup follows the list):
- Train for 4 epochs with the Huggingface AdamW optimiser. The code is here
- Binary cross-entropy loss.
- One-cycle LR schedule. Uses cosine warmup, followed by cosine decay, whilst having a mirrored schedule for momentum (i.e. cosine decay followed by cosine warmup). The code is here
- Max LR of 1e-3 for the regression head, max LR of 1e-5 for the transformer backbones.
- Accumulated batch size of 8
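A rough sketch of how such a stage-1 setup could be wired up in PyTorch. The scheduler below is a plain cosine warmup/decay via LambdaLR standing in for the actual one-cycle schedule with mirrored momentum, and `model.head` / `model.backbone` are assumed attribute names:

```python
import math
import torch

def build_stage1_optimizer(model, total_steps, warmup_frac=0.1):
    # Different max learning rates for the regression head and the transformer backbone.
    optimizer = torch.optim.AdamW([
        {"params": model.head.parameters(), "lr": 1e-3},
        {"params": model.backbone.parameters(), "lr": 1e-5},
    ])

    def lr_lambda(step):
        warmup_steps = max(1, int(warmup_frac * total_steps))
        if step < warmup_steps:  # cosine warmup
            return 0.5 * (1 - math.cos(math.pi * step / warmup_steps))
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Binary cross-entropy on the [0, 1] targets; gradient accumulation to an
# effective batch size of 8 is handled in the training loop.
criterion = torch.nn.BCELoss()
```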
In the second stage (sketched after the list):
- Freeze the transformer backbone and fine-tune the regression head for an additional 5 epochs with a constant LR of 1e-5. The code is here
- Added about 0.002 to CV for most models.
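A minimal sketch of the stage-2 idea, again assuming a model with `backbone` and `head` attributes:

```python
import torch

def build_stage2_optimizer(model):
    # Stage 2: freeze the transformer backbone and tune only the regression head.
    for param in model.backbone.parameters():
        param.requires_grad = False
    # Constant LR of 1e-5 for roughly 5 more epochs.
    return torch.optim.AdamW(model.head.parameters(), lr=1e-5)
```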
Stacking
Stacking is the de-facto ensemble strategy for Kagglers. The animations below illustrate the training and prediction procedure, with 3 folds in the example. To get the meta training data for each fold, we train iteratively on 2 folds and predict on the remaining fold, and the whole out-of-fold prediction is used as features. Then we train the stacking model.
In the prediction stage, we feed the test data to all out-of-fold base models to get their predictions. Then we average the results and pass them to the stacking model to get the final prediction.
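A schematic of this out-of-fold stacking procedure in a simplified scikit-learn style (a single base model, a single target column, and a ridge regression as the hypothetical stacking model):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

def stack(base_model_factory, X, y, X_test, n_splits=3):
    oof = np.zeros(len(X))  # out-of-fold predictions become the meta training data
    test_preds = []
    for train_idx, valid_idx in KFold(n_splits=n_splits).split(X):
        model = base_model_factory()
        model.fit(X[train_idx], y[train_idx])         # train on 2 of the 3 folds
        oof[valid_idx] = model.predict(X[valid_idx])  # predict the remaining fold
        test_preds.append(model.predict(X_test))      # each fold model also scores the test set

    stacker = Ridge().fit(oof.reshape(-1, 1), y)      # train the stacking model on OOF features
    avg_test = np.mean(test_preds, axis=0)            # average the fold models' test predictions
    return stacker.predict(avg_test.reshape(-1, 1))
```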
(Image by author)

(Image by author)

Other tricks
GroupKFold
Let us first look at why a normal KFold split does not work well in this competition. In the dataset, some samples were collected from the same question-answer thread, which means multiple samples share the same question title and body but have different answers.
If we use a normal KFold split function, answers to the same question will be distributed across both the training set and the test set, which introduces an information leakage problem. A better split is to put all question/answer pairs from the same question together in either the training set or the test set.
Fortunately, scikit-learn provides a GroupKFold function to generate non-overlapping groups for cross validation. The question body field is used to indicate the group, as in the code below.
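The embedded snippet does not survive in this write-up, so here is a minimal sketch of that split, assuming a pandas DataFrame `train` with a `question_body` column:

```python
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=10)
# All rows that share the same question body land in the same fold,
# so answers to one question never leak between training and validation.
for train_idx, valid_idx in gkf.split(train, groups=train["question_body"]):
    train_fold, valid_fold = train.iloc[train_idx], train.iloc[valid_idx]
```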
后期處理 (Post-processing)
As for many other teams, one post-processing step had a massive impact on performance. The general idea is to round predictions downwards to a multiple of some fraction 1/d.
So if d=4 and x = [0.12, 0.3, 0.31, 0.24, 0.7], these values get rounded to [0.0, 0.25, 0.25, 0.0, 0.5]. For each target column we did a grid search over values of d in [4, 8, 16, 32, 64, None].
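A sketch of this rounding and of the per-column grid search in numpy (the helper names are hypothetical and the search is simplified):

```python
import numpy as np
from scipy.stats import spearmanr

def round_down(preds, d):
    # Round predictions down to a multiple of 1/d; d=None means no rounding.
    return preds if d is None else np.floor(preds * d) / d

# round_down(np.array([0.12, 0.3, 0.31, 0.24, 0.7]), 4) -> [0., 0.25, 0.25, 0., 0.5]

def best_d(oof_preds, targets, grid=(4, 8, 16, 32, 64, None)):
    # Pick the denominator that maximises out-of-fold Spearman for one target column.
    return max(grid, key=lambda d: spearmanr(round_down(oof_preds, d), targets)[0])
```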
In our ensemble we exploited this technique even further, applying the rounding first to individual model predictions and again after taking a linear combination of the model predictions. In doing so we found that, with a separate rounding parameter for each model, out-of-fold score improvements would no longer translate to the leaderboard. We addressed this by reducing the number of rounding parameters, using the same d_local across all models.
All ensembling parameters (2 rounding parameters and the model weights) were set using a small grid search optimising the Spearman rank correlation coefficient on out-of-fold predictions, while ignoring question targets for rows with duplicate questions. In the end, this post-processing improved our 10-fold GroupKFold CV by ~0.05.
Zhe Sun, Robin Niesert, Ahmet Erdem and Jing Qin
Originally published at: https://towardsdatascience.com/accurately-labeling-subjective-question-answer-content-using-bert-bffe7c6e7c4