sklearn Comprehensive Example 9: One-Hot Encoding and Prediction-Threshold Adjustment for Classification
This article covers:
1. Data preprocessing (sample format, one-hot encoding, train/test split)
2. Model training (logistic regression, SVM, decision tree, TensorFlow)
3. Prediction-threshold adjustment

Sample format
The final sample format is shown below. The first column is the label; the second is a list of features separated by "|", which can be read as which movies the user has watched, which books they like, which Weibo IDs they follow, and so on.
```
label,features
1,20018
0,20006|20025
1,1509|8713|2000341|9010
```
After reading the data, we one-hot encode the features. For example, if there are 10,000 distinct tags in total and a device carries 1,000 of them, the encoded row has 10,000 columns, of which those 1,000 are 1 and the rest are 0.
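Before loading the real data, here is a minimal sketch of what this encoding looks like on the three sample rows above, using the same MultiLabelBinarizer that the real pipeline below relies on:

```python
# Minimal sketch: one-hot encode "|"-separated tag lists with sklearn.
from sklearn.preprocessing import MultiLabelBinarizer

rows = ['20018', '20006|20025', '1509|8713|2000341|9010']
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform([r.split('|') for r in rows])
print(mlb.classes_)   # one column per distinct tag
print(encoded)        # each row: 1 where a tag is present, else 0
```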
Load the data
```python
import numpy as np
import pandas as pd

sample_dir = '/home/ljhn1829/jupyter_data/sklearn_onehot_threshold.csv'
df_sample_all = pd.read_csv(sample_dir)
print(df_sample_all.head())
```

```
   label                                           features
0      1                                            2001841
1      0                                    2000641|2002541
2      1  1509|871305|2000341|901005|147409|132905|13560...
3      1  1034005|20909|9505|1083505|69209|19109|10905|9...
4      1  148009|4109|3809|169105|685006|62409|99805|200...
```

One-hot encoding
Besides the sklearn approach used here, pandas.get_dummies() also works. But when the data is large and sparse, sklearn is recommended.
See "sklearn Series 2: Data Preprocessing" and https://stackoverflow.com/questions/63544536/convert-pd-get-dummies-result-to-df-str-get-dummies
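For comparison, a small sketch of the pandas route (a hypothetical two-row frame; note the result is dense, which is why sklearn's sparse output scales better):

```python
# Pandas alternative: str.get_dummies splits and one-hot encodes in one step,
# but the result is a dense frame, so memory blows up on large sparse tag sets.
import pandas as pd

df = pd.DataFrame({'features': ['20018', '20006|20025']})
print(df['features'].str.get_dummies(sep='|'))
```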
```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
onehot_output = pd.DataFrame.sparse.from_spmatrix(
    mlb.fit_transform(df_sample_all['features'].str.split('|')),
    columns=mlb.classes_)
df_sample_onehot_all = pd.DataFrame()
df_sample_onehot_all['label'] = df_sample_all['label']
df_sample_onehot_all = pd.concat([df_sample_onehot_all, onehot_output], axis=1)
print(df_sample_onehot_all.head())
print(df_sample_onehot_all['label'].value_counts())
# print(df_sample_onehot_all['1000005'].value_counts())
```

The wide head() output is truncated here; the frame holds the label plus 23,719 tag columns:

```
   label  1000005  100008  100009  10001  1000108  10002  1000208  10005  ...
0      1        0       0       0      0        0      0        0      0  ...
1      0        0       0       0      0        0      0        0      0  ...
2      1        0       0       0      0        0      0        0      0  ...
3      1        0       0       0      0        0      0        0      0  ...
4      1        0       0       0      0        0      0        0      0  ...

[5 rows x 23720 columns]

0    5493
1    4506
Name: label, dtype: int64
```

Dataset split
Split the dataset into a training set and a test set.
```python
# from sklearn.model_selection import train_test_split
# train_set, test_set = train_test_split(df_sample_onehot_all, test_size=0.2, random_state=42)
# test_set, train_set = df_sample_onehot_all[0:30], df_sample_onehot_all[30:100]

def split_train_test(data, test_ratio):
    shuffle_indices = np.random.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    print(test_size)
    training_idx, test_idx = shuffle_indices[test_size:], shuffle_indices[:test_size]
    return data.iloc[training_idx], data.iloc[test_idx]

train_set, test_set = split_train_test(df_sample_onehot_all, 0.2)
```

```
1999
```
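A side note on the hand-rolled split: np.random.permutation draws a fresh permutation on every run, so the split is not reproducible. The commented-out sklearn helper fixes that; the stratified variant below is my addition, not in the original:

```python
# Reproducible alternative to the hand-rolled split above.
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(
    df_sample_onehot_all, test_size=0.2, random_state=42,
    stratify=df_sample_onehot_all['label'])  # keep the 0/1 ratio in both sets
```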
Take a look at the training and test sets:

```python
print(train_set['label'].value_counts())
print(test_set['label'].value_counts())
```

```
0    4384
1    3616
Name: label, dtype: int64
0    1109
1     890
Name: label, dtype: int64
```

Separate X and y:
```python
y_train, X_train = train_set['label'], train_set.iloc[:, 1:]
y_test, X_test = test_set['label'], test_set.iloc[:, 1:]
print(train_set.head())
```

```
      label  1000005  100008  100009  10001  1000108  10002  1000208  10005  ...
45        1        0       0       0      0        0      0        0      0  ...
9521      0        0       0       0      0        0      0        0      0  ...
7718      0        0       0       0      0        0      0        0      0  ...
7054      0        0       0       0      0        0      0        0      0  ...
4605      1        0       0       0      0        0      0        0      0  ...

[5 rows x 23720 columns]
```

Check which features correlate most with the label
```python
corr_matrix = train_set.corr()
corr_matrix['label'].sort_values(ascending=False)
```
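One caveat worth adding (my note, not the original's): train_set.corr() materializes a full 23720 x 23720 correlation matrix just to read a single column. If only the correlations against the label are needed, DataFrame.corrwith is a much cheaper sketch:

```python
# Correlate each feature column with the label directly,
# instead of building the full pairwise correlation matrix.
label_corr = train_set.iloc[:, 1:].corrwith(train_set['label'])
print(label_corr.sort_values(ascending=False).head(10))
```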
Model training

We train several models on the data above.
LR
```python
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, precision_recall_curve,
                             roc_auc_score)
from sklearn.model_selection import cross_val_score, cross_val_predict

# clf = SGDClassifier(loss='log')  # logistic regression trained with SGD
clf = LogisticRegression(penalty='l2', C=0.1)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
# print(pred)

accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)
auc = roc_auc_score(y_test, pred)
print(accuracy, precision, recall, f1, auc)
# cross_val_score(clf, X_train, y_train, cv=3, scoring='recall')
```

```
0.5722861430715358 0.5323741007194245 0.4884488448844885 0.5094664371772806 0.5653253398734368
```

Threshold adjustment
Let's adjust the classification threshold. LogisticRegression's decision_function returns an unbounded real-valued score (the log-odds), and the default threshold is 0.0: a sample is predicted positive when its score exceeds 0, which corresponds to a predicted probability above 0.5.
Below we first obtain the raw prediction scores, then change the threshold to adjust the resulting classification.
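Before doing that, a quick sanity check on what the score means (a sketch reusing the clf and X_test fitted above): the probability is the sigmoid of the score, and the default predict() is exactly "score > 0":

```python
import numpy as np

z = clf.decision_function(X_test)          # log-odds, unbounded
p = clf.predict_proba(X_test)[:, 1]        # probability of class 1
assert np.allclose(p, 1.0 / (1.0 + np.exp(-z)))                   # p = sigmoid(z)
assert np.array_equal(clf.predict(X_test), (z > 0).astype(int))   # default rule
```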
```python
y_score = clf.decision_function(X_test)
print(y_score)

threshold = 0.99
y_predict_t = (y_score > threshold)
print(y_predict_t)

accuracy = accuracy_score(y_test, y_predict_t)
precision = precision_score(y_test, y_predict_t)
recall = recall_score(y_test, y_predict_t)
f1 = f1_score(y_test, y_predict_t)
auc = roc_auc_score(y_test, y_predict_t)
print(accuracy, precision, recall, f1, auc)
```

```
[ 0.21320269  0.25052858 -0.47845967 ...  0.83939436  0.36785     0.3116261 ]
[False False False ... False False False]
0.5617808904452226 0.6222222222222222 0.0924092409240924 0.16091954022988506 0.5228101250492021
```

As you can see, raising the threshold lowers recall and raises precision.
In other words, raising the classification threshold from 0 to 0.99 drops recall from 0.49 to 0.09 while precision rises from 0.53 to 0.62 (values rounded; exact figures above):

| threshold | accuracy | precision | recall | f1     | auc    |
|-----------|----------|-----------|--------|--------|--------|
| 0.0       | 0.5723   | 0.5324    | 0.4884 | 0.5095 | 0.5653 |
| 0.99      | 0.5618   | 0.6222    | 0.0924 | 0.1609 | 0.5228 |
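Rather than probing a single hand-picked threshold, precision_recall_curve (imported earlier but unused) evaluates every candidate threshold in one pass. A sketch that finds the first threshold reaching 60% precision (the 0.6 target is an arbitrary illustration, and this assumes such a threshold exists for this model):

```python
# Sweep all candidate thresholds at once on the decision scores.
precisions, recalls, thresholds = precision_recall_curve(y_test, y_score)

# argmax on a boolean array returns the first True, i.e. the first
# threshold whose precision reaches 0.6.
idx = (precisions[:-1] >= 0.6).argmax()
print(thresholds[idx], precisions[idx], recalls[idx])
```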
SVM
```python
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='hinge')  # hinge loss: a linear SVM
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, pred)
auc = roc_auc_score(y_test, pred)
print(accuracy, auc)
cross_val_score(clf, X_train, y_train, cv=3, scoring='recall')
```

```
0.5322661330665333 0.529358807440377
array([0.42452043, 0.5646372 , 0.39366138])
```
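The same thresholding trick applies here. SGDClassifier with hinge loss has no predict_proba, but it does expose decision_function; a sketch reusing the fitted clf (the 1.0 cut-off is arbitrary):

```python
# Threshold the SVM's raw margin instead of using the default predict().
svm_scores = clf.decision_function(X_test)
pred_strict = (svm_scores > 1.0)  # stricter than the default margin of 0
print(precision_score(y_test, pred_strict), recall_score(y_test, pred_strict))
```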
Decision tree

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, pred)
auc = roc_auc_score(y_test, pred)
print(accuracy, auc)
```

```
0.5537768884442221 0.5478048263541951
```

Deep learning with TensorFlow
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

columns_count = X_train.shape[1]
print(columns_count)

# model = keras.models.Sequential()
# model.add(keras.layers.Flatten(input_shape=[columns_count, 1]))  # flatten the 2-D input into a 1-D vector
# model.add(keras.layers.Dense(300, activation='relu'))
# model.add(keras.layers.Dense(100, activation='relu'))
# model.add(keras.layers.Dense(1, activation='sigmoid'))

model = keras.Sequential([
    keras.layers.Flatten(input_shape=[columns_count, 1]),
    layers.Dense(128, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.fit(X_train, y_train)
```

```
23719
250/250 [==============================] - 1s 4ms/step - loss: 0.6861 - accuracy: 0.5537
<tensorflow.python.keras.callbacks.History at 0x7f83b42c5a00>
```

Doing binary text classification with Keras, I kept hitting the error below: my labels are 0 or 1, yet the error claims the value 1 is out of range.
See: "Received a label value of 1 which is outside the valid range of [0, 1)", a Python/Keras loss-function issue.
The model originally used sparse_categorical_crossentropy; switching the loss to binary_crossentropy fixed the problem.
https://blog.csdn.net/The_Time_Runner/article/details/93889004
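To spell out why (my reading of the error, consistent with the link above): sparse_categorical_crossentropy expects one output unit per class, so with a single output unit the valid label range is [0, 1) and the label 1 is rejected. Both of these setups are internally consistent; the softmax variant below is a sketch, not the article's model:

```python
# (a) what the article uses: 1 sigmoid unit + binary_crossentropy, labels 0/1.
# (b) equivalent alternative: 2 softmax units + sparse_categorical_crossentropy.
model_alt = keras.Sequential([
    keras.layers.Flatten(input_shape=[columns_count, 1]),
    layers.Dense(128, activation='relu'),
    layers.Dense(2, activation='softmax'),   # one output unit per class
])
model_alt.compile(loss='sparse_categorical_crossentropy',
                  optimizer='sgd', metrics=['accuracy'])
```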