Approaching the Quora Insincere Question Classification Problem
Quora Insincere Questions Classification was a challenge organized by Kaggle in the field of natural language processing. The main aim of the challenge was to identify toxic and divisive content. It is a binary classification problem where class 1 represents an insincere question and class 0 a sincere one. This blog deals specifically with the data modelling part.
Preprocessing:
In the first step we read the data using pandas. The following code snippet reads each file into a pandas data frame.
import pandas as pd

train = pd.read_csv('/kaggle/input/quora-insincere-questions-classification/train.csv')
test_df = pd.read_csv('/kaggle/input/quora-insincere-questions-classification/test.csv')
sub = pd.read_csv('/kaggle/input/quora-insincere-questions-classification/sample_submission.csv')
We can check the shape of the data using the shape attribute.
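For instance, a quick sanity check of the sizes (the exact row counts depend on the competition data; the Quora files have columns qid, question_text and target for train, and qid and question_text for test):

print(train.shape)    # (number of training rows, 3)
print(test_df.shape)  # (number of test rows, 2)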
Initially we divide the training dataset into two parts, train and validation. To do so we can take the help of sklearn. The following code snippet achieves it.
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(train, test_size=0.1)

In the next step we fill the questions that contain null values. To do so we can use the fillna method. The code snippet below does exactly that.
train_x = train_df['question_text'].fillna('__na__').values
val_x = val_df['question_text'].fillna('__na__').values
test_x = test_df['question_text'].fillna('__na__').values
Now it is time to choose some important parameters: embedd_size, max_features and max_len. Here embedd_size is the size of the word vector used to represent each word, while max_features is the number of top-frequency words we shall consider. For example, setting max_features to 50000 means we take only the 50000 most frequently occurring words into account while converting them into vectors. Similarly, max_len is the number of words we keep from the beginning of each question; for example, a max_len of 100 means we consider only the first 100 words. The following are the parameters chosen.
embedd_size = 300
max_features = 50000
max_len = 100
Now consider the following code snippet.
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))  # fit on the question texts themselves
train_x = tokenizer.texts_to_sequences(train_x)
val_x = tokenizer.texts_to_sequences(val_x)
test_x = tokenizer.texts_to_sequences(test_x)
Here the first line tells the tokenizer to keep only the 50000 most frequent words. The second line builds the vocabulary, assigning each word a unique integer index based on how often it occurs in the training text. The texts_to_sequences calls in the third, fourth and fifth lines convert each sentence into a sequence of those integers. For example, to convert "India won the match" we look up the integer assigned to each word during fit_on_texts and replace the words accordingly.
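As a minimal sketch of this behaviour on a toy corpus (the actual indices depend on word frequencies in the training data):

toy_tokenizer = Tokenizer(num_words=10)
toy_tokenizer.fit_on_texts(['India won the match', 'the match was close'])
print(toy_tokenizer.word_index)
# e.g. {'the': 1, 'match': 2, 'india': 3, 'won': 4, 'was': 5, 'close': 6}
print(toy_tokenizer.texts_to_sequences(['India won the match']))
# e.g. [[3, 4, 1, 2]]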
from keras.preprocessing.sequence import pad_sequences

train_x = pad_sequences(train_x, maxlen=max_len)
val_x = pad_sequences(val_x, maxlen=max_len)
test_x = pad_sequences(test_x, maxlen=max_len)
The above code snippet ensures that every sequence is brought to the same fixed length. This is done so that, when the sequences are fed to the model in batches, they all fit a common length; hence some sequences are padded and others truncated.
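As a quick illustration of the padding behaviour (Keras pads and truncates at the start of a sequence by default):

print(pad_sequences([[3, 4, 1, 2]], maxlen=6))  # [[0 0 3 4 1 2]] -- zeros prepended
print(pad_sequences([[3, 4, 1, 2]], maxlen=3))  # [[4 1 2]] -- truncated from the front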
Modelling:
We shall use a bidirectional LSTM to build this classification model. Before that we have to convert the text into numbers. We did the initial step of this in the last section, where each word was converted into a unique integer. In the next step we convert the words into vectors using the Keras embedding layer. We also have the option of using a pretrained word embedding, but here we let the embedding layer learn the word embeddings during training: the layer assigns random vectors to the words initially and refines them as the training of the model goes on. The snippet below shows the whole modelling strategy.
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, GlobalMaxPool1D, Dense, Dropout

inp = Input(shape=(max_len,))
x = Embedding(max_features, embedd_size)(inp)
x = Bidirectional(LSTM(128, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(1, activation='sigmoid')(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
After the embedding we use a bidirectional LSTM. Here the parameter return_sequences=True means we want the output of every hidden state; the default is False, which returns the output of the final state only. GlobalMaxPool1D then keeps only the highest value across the sequence for each feature dimension. A dropout layer is added to deal with overfitting. The final dense layer with a sigmoid activation function outputs a value between 0 and 1. The compile method builds the model; no training has been performed yet.
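To verify the layers are wired up as intended we can print the model summary; the output shapes below follow from the parameters chosen earlier (a sketch, not output copied from the original post):

model.summary()
# Embedding:          (None, 100, 300)
# Bidirectional LSTM: (None, 100, 256)  -- 128 units per direction
# GlobalMaxPool1D:    (None, 256)
# Dense(16), Dropout: (None, 16)
# Dense(1, sigmoid):  (None, 1)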
In the compile method we only defined the architecture and initialized it. Now we have to tune the parameters so as to get an optimal model. To do so we pass the training data through the model. The fit method passes the data through the model and computes the loss; based on the loss it performs backpropagation and tunes the parameters. The following code snippet does this.
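The label arrays train_out and val_out used below are not defined in the snippets above; presumably they come from the target column of the split data frames, along these lines (an assumption, not code from the original post):

train_out = train_df['target'].values  # assumed: labels taken from the 'target' column
val_out = val_df['target'].values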
model.fit(train_x, train_out, batch_size=256, epochs=2, validation_data=(val_x, val_out))

After training for 2 epochs the model achieved an accuracy of 93%. We can change the parameters and do hyperparameter tuning to get a better model.
The performance metric for this competition is the F1 score, since the dataset is imbalanced.
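Since the model outputs probabilities, the F1 score is computed after thresholding the predictions. A minimal sketch on the validation split, where the threshold of 0.35 is only an illustrative choice and not from the original post:

from sklearn.metrics import f1_score

val_pred = model.predict(val_x, batch_size=256).ravel()
print(f1_score(val_out, (val_pred > 0.35).astype(int)))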
To generate predictions with the trained model we can write:
pred_y = model.predict([test_x], batch_size=256)

The whole code can be found at https://github.com/mohantyaditya/quora-insincere-classification/blob/master/quora%20insincere.ipynb
This was a fairly basic approach to the problem. We can try pre-trained embedding vectors to get better results, as sketched below.
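A rough sketch of how pre-trained vectors could be plugged into the same model, assuming one of the GloVe files shipped with the competition data (the path and parsing below are assumptions, not part of the original post):

import numpy as np

EMB_PATH = '/kaggle/input/quora-insincere-questions-classification/embeddings/glove.840B.300d/glove.840B.300d.txt'  # assumed path

embeddings_index = {}
with open(EMB_PATH) as f:
    for line in f:
        parts = line.rstrip().rsplit(' ', embedd_size)  # split from the right so multi-word tokens stay intact
        embeddings_index[parts[0]] = np.asarray(parts[1:], dtype='float32')

embedding_matrix = np.zeros((max_features, embedd_size))
for word, i in tokenizer.word_index.items():
    if i < max_features and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]

# Use the pre-trained weights in place of the trainable embedding layer.
x = Embedding(max_features, embedd_size, weights=[embedding_matrix], trainable=False)(inp)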
Translated from: https://medium.com/datadriveninvestor/approaching-the-quora-insincere-question-classification-problem-eb27b0ad3100