Evolution of NLP - Part 3 - Transfer Learning using ULMFiT
This is the third part of a series of posts showing the improvements in NLP modeling approaches. We have seen the use of traditional techniques like Bag of Words and TF-IDF, then moved on to RNNs and LSTMs. This time we'll look into one of the pivotal shifts in approaching NLP tasks: Transfer Learning!
The complete code for this tutorial is available at this Kaggle Kernel
ULMFiT
The idea of using Transfer Learning is quite new in NLP tasks, while it has been used quite prominently in Computer Vision tasks! This new way of looking at NLP was first proposed by Jeremy Howard, and it has transformed the way we look at data!
The core idea is two-fold: a generatively pre-trained language model plus task-specific fine-tuning. This was first explored in ULMFiT (Howard & Ruder, 2018), directly motivated by the success of using ImageNet pre-training for computer vision tasks. The base model is AWD-LSTM.
A Language Model is exactly what it sounds like: the model's objective is to predict the next word of a sentence. The goal is to have a model that can understand the semantics, grammar, and unique structure of a language.
ULMFiT follows three steps to achieve good transfer learning results on downstream language classification tasks:
1. General-domain language model pre-training on a large corpus.
2. Target-task language model fine-tuning on the dataset at hand.
3. Target-task classifier fine-tuning on top of the fine-tuned language model encoder.
Using fast.ai for NLP
fast.ai's motto, "Making Neural Networks Uncool Again", tells you a lot about their approach ;) Implementation of these models is remarkably simple and intuitive, and with good documentation you can easily find a solution if you get stuck anywhere. For this reason, and a few others I elaborate on below, I decided to try out the fast.ai library, which is built on top of PyTorch instead of Keras. Despite being used to working in Keras, I didn't find it difficult to navigate fast.ai, and the learning curve for implementing advanced things is quite short as well!
In addition to its simplicity, there are some advantages of using fast.ai’s implementation -
Discriminative fine-tuning is motivated by the fact that different layers of the LM capture different types of information (see the discussion above). ULMFiT proposes to tune each layer with a different learning rate, {η^1, …, η^ℓ, …, η^L}, where η^1 is the learning rate of the first layer, η^ℓ is the rate for the ℓ-th layer, and there are L layers in total.
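Concretely, the ULMFiT paper suggests picking the learning rate of the top layer first and then dividing by a factor of 2.6 for each layer below it. A tiny sketch of that rule (the base rate and the number of layer groups here are arbitrary, chosen only for illustration):

# Worked example of the discriminative fine-tuning rule from the ULMFiT paper,
# where each lower layer gets the rate of the layer above divided by 2.6.
eta_top = 0.01                                   # illustrative base rate for the top layer
rates = [eta_top / 2.6 ** k for k in range(4)]   # top layer down: 0.01, ~0.0038, ~0.0015, ~0.00057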
Slanted triangular learning rates (STLR) refer to a learning rate schedule that first increases the learning rate linearly and then linearly decays it. The increase stage is short so that the model can quickly converge to a parameter space suitable for the task, while the decay period is long, allowing for better fine-tuning.
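To make the shape of the schedule concrete, here is a small sketch of the STLR formula from the ULMFiT paper, using the paper's default values for cut_frac and ratio. In practice fast.ai's fit_one_cycle handles the scheduling for you, so this is purely illustrative:

def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    # t: current training iteration, T: total iterations; cut marks where the rate peaks.
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut                                        # short linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))     # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio          # ratio = lr_max / lr_min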
Let’s try to see how well this approach works for our dataset. I would also like to point out that all these ideas and code are available at fast.ai’s free official course for Deep Learning.
Loading the data!
Data in fast.ai is loaded using TextLMDataBunch. This is very similar to ImageDataGenerator in Keras, where the path, labels, etc. are provided, and the method prepares the Train, Test and Validation data depending on the task at hand!
Data Bunch for Language Model
data_lm = TextLMDataBunch.from_csv(path, 'train.csv', text_cols = 3, label_cols = 4)

Data Bunch for Classification Task
data_clas = TextClasDataBunch.from_csv(path, 'train.csv', vocab=data_lm.train_ds.vocab, bs=32, text_cols = 3, label_cols = 4)

As discussed in the steps before, we start out with a language model learner, which basically predicts the next word given a sequence. Intuitively, this model tries to understand what the language and context are. We then use this model and fine-tune it for our specific task: Sentiment Classification.
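If you want to sanity-check what these data bunches actually contain, fast.ai's show_batch gives a quick peek at the processed text. This step is purely optional:

data_lm.show_batch()     # continuous token stream prepared for language modelling
data_clas.show_batch()   # (review text, label) pairs prepared for classification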
Step 1. Training a Language Model
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
learn.fit_one_cycle(1, 1e-2)
By default, we start with a pre-trained model based on the AWD-LSTM architecture. This model is built on top of simple LSTM units but has multiple dropout layers and hyperparameters. With the drop_mult argument, we can scale all of these dropouts at once. I've kept it at 0.5; you can set it higher if you find that the model is overfitting.
Discriminative Fine-Tuning
learn.unfreeze()
learn.fit_one_cycle(3, slice(1e-4, 1e-2))
learn.unfreeze() makes all the layers of the AWD-LSTM trainable. We then set the learning rates using the slice() method: the last layer group is trained at 1e-2, the innermost group at 1e-4, and the groups of layers in between get geometrically scaled learning rates.
Slanted Triangular Learning Rates
This can be achieved simply by using fit_one_cycle() method in fast.ai
Gradual Unfreezing
Though I haven't experimented with this here, the idea is pretty simple: at the start we keep the earlier layers of the model frozen (un-trainable), and then we slowly unfreeze them as we keep on training. I'll cover this in detail in the next post.
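For reference, a minimal sketch of what gradual unfreezing looks like with fast.ai's freeze_to; the number of stages and the learning rates here are illustrative, not tuned values from this post:

# Hypothetical gradual-unfreezing schedule (illustrative learning rates).
learn.freeze_to(-2)                                   # train only the last two layer groups
learn.fit_one_cycle(1, slice(1e-2 / 2.6 ** 4, 1e-2))
learn.freeze_to(-3)                                   # unfreeze one more layer group
learn.fit_one_cycle(1, slice(5e-3 / 2.6 ** 4, 5e-3))
learn.unfreeze()                                      # finally train the whole network
learn.fit_one_cycle(2, slice(1e-3 / 2.6 ** 4, 1e-3))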
Since we've made a language model, we can actually use it to predict the next few words from a given input. This can tell us whether the model has begun to understand our reviews.
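In fast.ai this is a single predict call on the language model learner; the prompt and word count below are arbitrary:

# Ask the fine-tuned language model to continue a review-like prompt.
print(learn.predict("The movie was", n_words=20))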
You can see that, with just a simple starting input, the model is able to generate realistic reviews. So this assures us that we are headed in the right direction.
learn.save(file = Path('language_model'))
learn.save_encoder(Path('language_model_encoder'))
Let's save this model; we will load it later for classification.
Step 2. Classification Task using Language Model as encoder
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5).to_fp16()
learn.model_dir = Path('/kaggle/working/')
learn.load_encoder('language_model_encoder')
Let's get started with training. I'm running it in a similar manner: training only the outer layer for 1 epoch, then unfreezing the whole network and training for 3 epochs.
learn.fit_one_cycle(1, 1e-2)
learn.unfreeze()
learn.fit_one_cycle(3, slice(1e-4, 1e-2))
Accuracy — 90%
With this alone (in just 4 epochs), we are at 90% accuracy! It’s an absolutely amazing result if you consider the amount of effort we’ve put in! Within just a few lines of code and nearly 10 mins of training, we’ve breached the 90% wall.
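If you want to try the trained classifier on your own text, a single predict call is enough; the review string here is made up:

# Predict the sentiment of an arbitrary review with the fine-tuned classifier.
pred_class, pred_idx, probs = learn.predict("This was a surprisingly good movie, I loved it!")
print(pred_class, probs)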
I hope this was helpful for you as well to get started with NLP and Transfer Learning. I’ll catch you later in the 4th blog of this series, where we take this up a notch and explore transformers!
Original article: https://medium.com/analytics-vidhya/evolution-of-nlp-part-3-transfer-learning-using-ulmfit-267d0a73421e