NLP Classification with Universal Language Model Fine-Tuning (ULMFiT)
Text classification is one of the important applications of NLP. Applications such as sentiment analysis and identifying spam, bots, and offensive comments all fall under text classification. Until now, the usual approach to these problems has been to build a machine learning or deep learning model from scratch, train it on your text data, and tune its hyperparameters. Even though such models give decent results for applications like classifying whether a movie review is positive or negative, they may perform terribly when things become more ambiguous, because most of the time there simply isn't enough labeled data to learn from.
But wait a minute: isn't ImageNet classification built on the same approach? How, then, has it been able to achieve such great results? What if, instead of building a model from scratch, we used a model that has already been trained to solve one problem (classifying ImageNet images) as the basis for solving a somewhat similar problem (text classification)? Because the fine-tuned model doesn't have to learn from scratch, it reaches higher accuracy without needing a lot of data. This is the principle of transfer learning, upon which Universal Language Model Fine-tuning (ULMFiT) is built.
Today we are going to see how you can leverage this approach for sentiment analysis. You can read more about ULMFiT, its advantages, and how it compares with other approaches here.
The fastai library provides the modules necessary to train and use ULMFiT models. You can view the library here.
The problem we are going to solve is sentiment analysis of tweets about US airlines. You can download the dataset from here. So without further ado, let's start!
Firstly, let’s import all the libraries.
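The import cell is not reproduced in this extract; for a fastai v1 ULMFiT workflow it would typically look like the sketch below (the star import is the convention used by the fastai v1 docs; the exact list is an assumption):

```python
import pandas as pd
import numpy as np

# fastai v1's text module exports everything this tutorial needs:
# TextList, language_model_learner, text_classifier_learner, AWD_LSTM, ...
from fastai.text import *
```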
Now we will load the CSV file of our data into a pandas DataFrame and take a look at it.
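Loading the data is a one-liner with pandas; `Tweets.csv` is the file name used by the Kaggle release of this dataset (adjust the path to wherever you saved it):

```python
import pandas as pd

# Load the airline tweets and peek at the shape and first few rows.
df = pd.read_csv('Tweets.csv')
print(df.shape)
df.head()
```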
Now we check whether there are any nulls in the DataFrame. We observe that there are 5,462 nulls in the negativereason column. These nulls belong to tweets with positive or neutral sentiment, which makes sense. We verify this by counting all non-negative tweets, and the two numbers match. The reason the negativereason_confidence count doesn't match the negativereason count is that the 0 values in the negativereason_confidence column correspond to blanks in the negativereason column.
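The null check itself can be sketched as below; the tiny DataFrame is made up for illustration — in the notebook you would call the same methods on the full `df`:

```python
import pandas as pd

# Toy frame mimicking the relevant columns: negativereason is blank for
# non-negative tweets, and negativereason_confidence holds 0.0 there.
df = pd.DataFrame({
    'airline_sentiment': ['negative', 'positive', 'neutral', 'negative'],
    'negativereason': ['Late Flight', None, None, 'Bad Flight'],
    'negativereason_confidence': [1.0, 0.0, 0.0, 0.7],
})

nulls = df.isnull().sum()                                  # nulls per column
non_negative = (df['airline_sentiment'] != 'negative').sum()

# The null count in negativereason matches the non-negative tweet count.
print(nulls['negativereason'], non_negative)
```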
If we look at the total count of data samples, it's 14,640. The columns airline_sentiment_gold, negativereason_gold & tweet_coord have large numbers of blanks, in the range of 13,000–14,000 each. We can therefore conclude that these columns will not provide any significant information and can be discarded.
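Dropping those mostly-blank columns is straightforward; the miniature DataFrame below stands in for the real one:

```python
import pandas as pd

# Miniature stand-in for the dataset, with the mostly-empty columns.
df = pd.DataFrame({
    'text': ['@united great flight!', '@united lost my bag'],
    'airline_sentiment': ['positive', 'negative'],
    'airline_sentiment_gold': [None, None],
    'negativereason_gold': [None, None],
    'tweet_coord': [None, None],
})

# Discard the columns that are ~90% blank and carry no useful signal.
df = df.drop(columns=['airline_sentiment_gold',
                      'negativereason_gold',
                      'tweet_coord'])
print(list(df.columns))   # → ['text', 'airline_sentiment']
```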
Now that we have the relevant data, let’s start building our model.
When we build an NLP model with fastai, there are two stages:
- Creating the language model (LM) & fine-tuning it with the pre-trained model
- Using the fine-tuned model as a classifier
Here I'm using TextList, which is part of the data block API, instead of the factory methods TextClasDataBunch and TextLMDataBunch, because the data block API is more flexible and powerful.
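With the data block API, building the language-model databunch might look like this sketch (fastai v1; the column name, 10% validation split, and batch size are my assumptions, not the author's exact code):

```python
from fastai.text import *

# Build the LM databunch from the tweets DataFrame prepared earlier.
data_lm = (TextList.from_df(df, cols='text')
           .split_by_rand_pct(0.1)   # hold out 10% for validation
           .label_for_lm()           # for an LM, the texts label themselves
           .databunch(bs=48))
data_lm.show_batch()
```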
We can see that since we are training a language model, all the texts are concatenated together (with a random shuffle between them at each new epoch).
Now we will fine-tune our model with the weights of a model pre-trained on a larger corpus, Wikitext 103. That model was trained to predict the next word of the sentence given to it as input. As the language of tweets is not always grammatically perfect, we will have to adjust the parameters of our model. Next, we will find the optimal learning rate and visualize it. The visualization will help us spot a suitable range of learning rates to choose from while training our model.
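In fastai v1 this step is a learner built on the pre-trained AWD_LSTM weights followed by a learning-rate sweep; `drop_mult=0.3` is an assumed value:

```python
from fastai.text import *

# Learner seeded with AWD_LSTM weights pre-trained on Wikitext 103.
# drop_mult scales all dropout probabilities (tweets are noisy text).
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

# Sweep learning rates and plot loss vs. rate to pick a good value.
learn_lm.lr_find()
learn_lm.recorder.plot()
```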
By default, the Learner object is frozen, so we need to train the embeddings first. Here, instead of running the cycle for one epoch, I am going to run it for six to see how the accuracy varies. The learning rate was picked with the help of the plot we got above.
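That first frozen run could be sketched as follows; the 1e-2 rate is an assumed pick from the plot:

```python
# Only the new embedding layer is trainable while the learner is frozen.
# Six epochs instead of one, to watch how the accuracy evolves.
learn_lm.fit_one_cycle(6, 1e-2, moms=(0.8, 0.7))
```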
We got very low accuracy, which was expected since the rest of our model is still frozen, but we can see that the accuracy is increasing.
We see that the accuracy improved slightly but still hovers in the same range. This is because, firstly, the model was fine-tuned from a pre-trained model with a different vocabulary, and secondly, there were no labels: we passed the data without specifying any.
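After the frozen warm-up, the usual ULMFiT recipe unfreezes the whole language model, trains a little more, and saves the encoder for the classifier stage (the epoch count, rate, and encoder name here are assumptions):

```python
# Fine-tune the full language model, then keep its encoder.
learn_lm.unfreeze()
learn_lm.fit_one_cycle(4, 1e-3, moms=(0.8, 0.7))
learn_lm.save_encoder('fine_tuned_enc')   # reused by the classifier later
```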
Now we will test our model with random input & see if it’ll accurately complete the sentence.
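Sentence completion uses the language-model learner's `predict`; the prompt, word count, and temperature below are arbitrary choices:

```python
# Ask the fine-tuned LM to continue a prompt with 10 more words.
prompt = "I hated this flight because"
print(learn_lm.predict(prompt, n_words=10, temperature=0.75))
```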
Now, we’ll create a new data object that only grabs the labeled data and keeps those labels.
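The classifier databunch reuses the language model's vocabulary but now keeps the sentiment labels (column names follow the dataset; the split and batch size are assumptions):

```python
from fastai.text import *

# Same texts, same vocab as the LM, but labeled with the sentiment column.
data_clas = (TextList.from_df(df, cols='text', vocab=data_lm.vocab)
             .split_by_rand_pct(0.1)
             .label_from_df(cols='airline_sentiment')
             .databunch(bs=48))
```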
The classifier needs a little less dropout, so we pass drop_mult=0.5 to multiply all the dropouts by this amount. We don’t load the pre-trained model, but instead our fine-tuned encoder from the previous section.
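In code, that is a `text_classifier_learner` with `drop_mult=0.5` that loads the fine-tuned encoder instead of raw pre-trained weights (`'fine_tuned_enc'` is my assumed name; it must match whatever was passed to `save_encoder` in the language-model stage):

```python
from fastai.text import *

# Classifier with lighter dropout, warm-started from our fine-tuned encoder.
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('fine_tuned_enc')
```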
Again we perform steps similar to those for the language model. Here I skip the last 15 data points of the plot, as I'm only interested in learning rates up to 1e-1.
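The sweep and first training cycle for the classifier might look like this; `skip_end=15` drops the last 15 points from the plot, and the 1e-2 rate is an assumed pick:

```python
# Learning-rate sweep for the classifier, ignoring the tail of the plot.
learn_clas.lr_find()
learn_clas.recorder.plot(skip_end=15)

# First cycle with everything but the head frozen.
learn_clas.fit_one_cycle(1, 1e-2, moms=(0.8, 0.7))
```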
Here we see that the accuracy has drastically improved compared with the language model in step 1, now that we provide labels.
Now we will partially train the model by unfreezing one layer group at a time and using discriminative learning rates. Here I use slice, which divides the specified learning rates among three groups of layers.
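A sketch of the gradual-unfreezing schedule, using `slice` for discriminative rates; dividing by 2.6**4 across groups follows the ULMFiT paper, while the base rates are assumptions:

```python
# Unfreeze the last two layer groups, then the last three, lowering the
# base rate each round; slice() spreads rates across the layer groups.
learn_clas.freeze_to(-2)
learn_clas.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2), moms=(0.8, 0.7))

learn_clas.freeze_to(-3)
learn_clas.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3), moms=(0.8, 0.7))
```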
We see that the accuracy is improving gradually, which is expected as we gradually unfreeze the layers; more trainable layers provide more depth.
Finally, we will unfreeze the whole model & visualize the learning rate to choose & use that for final training.
最后,我們將解凍整個模型并可視化學習率,以選擇并用于最終培訓。
We see that we have achieved a maximum accuracy of 80% by the end of training.
For our final results, we’ll take the average of the predictions of the model. Since the samples are sorted by text lengths for batching, we pass the argument ordered=True to get the predictions in the order of the texts.
對于最終結果,我們將取模型預測的平均值。 由于樣本是按文本長度排序以進行批處理的,因此我們傳遞了ordered = True參數以按文本順序獲得預測。
We got an accuracy of 80.09%.
Now it’s time to test our model with new text inputs & see how it performs!
The databunch has converted the text labels into numerical ones. They are as follows:
- 0 => Negative
- 1 => Neutral
- 2 => Positive
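A tiny illustrative helper (not from the original notebook) that maps the class indices above back to readable labels:

```python
# Mapping from the databunch's numeric classes back to sentiment labels.
ID2LABEL = {0: 'negative', 1: 'neutral', 2: 'positive'}

def label_of(class_idx: int) -> str:
    """Translate a predicted class index into its sentiment label."""
    return ID2LABEL[class_idx]

print(label_of(2))   # → positive
```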
We see that our model has performed pretty well!!
You can test the model with negative as well as mixed sentiment text and verify results.
Hope you find this article helpful :D
Also, any suggestions/corrections are welcome.
Happy Coding!!
Translated from: https://towardsdatascience.com/nlp-classification-with-universal-language-model-fine-tuning-ulmfit-4e1d5077372b