sklearn Comprehensive Example 9: One-Hot Encoding and Prediction-Threshold Adjustment for Classification
This article covers:
1. Data preprocessing (sample format, one-hot encoding, train/test split)
2. Model training (logistic regression, SVM, decision tree, TensorFlow)
3. Prediction-threshold adjustment

Sample format
The final sample format is shown below. The first column is the label; the second is a list of features separated by "|", which can be read as which movies the user has watched, which books they like, which Weibo IDs they follow, and so on.
```
label,features
1,20018
0,20006|20025
1,1509|8713|2000341|9010
```
After reading the data, we one-hot encode the features. For example, if there are 10,000 distinct tags in total and a device carries 1,000 of them, the encoded row has 10,000 columns, of which those 1,000 are 1 and the rest are 0.
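Before loading the real data, here is a minimal sketch of what this encoding looks like on the three sample rows above, using the same MultiLabelBinarizer that the real pipeline below relies on:

```python
# Minimal sketch: one-hot encode "|"-separated tag lists with sklearn.
from sklearn.preprocessing import MultiLabelBinarizer

rows = ['20018', '20006|20025', '1509|8713|2000341|9010']
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform([r.split('|') for r in rows])
print(mlb.classes_)   # one column per distinct tag
print(encoded)        # each row: 1 where a tag is present, else 0
```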
Load the data
```python
import numpy as np
import pandas as pd

sample_dir = '/home/ljhn1829/jupyter_data/sklearn_onehot_threshold.csv'
df_sample_all = pd.read_csv(sample_dir)
print(df_sample_all.head())
```

```
   label                                           features
0      1                                            2001841
1      0                                    2000641|2002541
2      1  1509|871305|2000341|901005|147409|132905|13560...
3      1  1034005|20909|9505|1083505|69209|19109|10905|9...
4      1  148009|4109|3809|169105|685006|62409|99805|200...
```

One-hot encoding
Besides the sklearn approach used here, pandas.get_dummies() also works. But when the data is large and sparse, sklearn is recommended.
See "sklearn Series 2: Data Preprocessing" and https://stackoverflow.com/questions/63544536/convert-pd-get-dummies-result-to-df-str-get-dummies
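For comparison, a small sketch of the pandas route (a hypothetical two-row frame; note the result is dense, which is why sklearn's sparse output scales better):

```python
# Pandas alternative: str.get_dummies splits and one-hot encodes in one step,
# but the result is a dense frame, so memory blows up on large sparse tag sets.
import pandas as pd

df = pd.DataFrame({'features': ['20018', '20006|20025']})
print(df['features'].str.get_dummies(sep='|'))
```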
```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
onehot_output = pd.DataFrame.sparse.from_spmatrix(
    mlb.fit_transform(df_sample_all['features'].str.split('|')),
    columns=mlb.classes_)
df_sample_onehot_all = pd.DataFrame()
df_sample_onehot_all['label'] = df_sample_all['label']
df_sample_onehot_all = pd.concat([df_sample_onehot_all, onehot_output], axis=1)
print(df_sample_onehot_all.head())
print(df_sample_onehot_all['label'].value_counts())
# print(df_sample_onehot_all['1000005'].value_counts())
```

The wide head() output is truncated here; the frame holds the label plus 23,719 tag columns:

```
   label  1000005  100008  100009  10001  1000108  10002  1000208  10005  ...
0      1        0       0       0      0        0      0        0      0  ...
1      0        0       0       0      0        0      0        0      0  ...
2      1        0       0       0      0        0      0        0      0  ...
3      1        0       0       0      0        0      0        0      0  ...
4      1        0       0       0      0        0      0        0      0  ...

[5 rows x 23720 columns]

0    5493
1    4506
Name: label, dtype: int64
```

Dataset split
Split the dataset into a training set and a test set.
```python
# from sklearn.model_selection import train_test_split
# train_set, test_set = train_test_split(df_sample_onehot_all, test_size=0.2, random_state=42)
# test_set, train_set = df_sample_onehot_all[0:30], df_sample_onehot_all[30:100]

def split_train_test(data, test_ratio):
    shuffle_indices = np.random.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    print(test_size)
    training_idx, test_idx = shuffle_indices[test_size:], shuffle_indices[:test_size]
    return data.iloc[training_idx], data.iloc[test_idx]

train_set, test_set = split_train_test(df_sample_onehot_all, 0.2)
```

```
1999
```
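A side note on the hand-rolled split: np.random.permutation draws a fresh permutation on every run, so the split is not reproducible. The commented-out sklearn helper fixes that; the stratified variant below is my addition, not in the original:

```python
# Reproducible alternative to the hand-rolled split above.
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(
    df_sample_onehot_all, test_size=0.2, random_state=42,
    stratify=df_sample_onehot_all['label'])  # keep the 0/1 ratio in both sets
```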
Take a look at the training and test sets:

```python
print(train_set['label'].value_counts())
print(test_set['label'].value_counts())
```

```
0    4384
1    3616
Name: label, dtype: int64
0    1109
1     890
Name: label, dtype: int64
```

Separate X and y:
```python
y_train, X_train = train_set['label'], train_set.iloc[:, 1:]
y_test, X_test = test_set['label'], test_set.iloc[:, 1:]
print(train_set.head())
```

```
      label  1000005  100008  100009  10001  1000108  10002  1000208  10005  ...
45        1        0       0       0      0        0      0        0      0  ...
9521      0        0       0       0      0        0      0        0      0  ...
7718      0        0       0       0      0        0      0        0      0  ...
7054      0        0       0       0      0        0      0        0      0  ...
4605      1        0       0       0      0        0      0        0      0  ...

[5 rows x 23720 columns]
```

Check which features correlate most with the label
```python
corr_matrix = train_set.corr()
corr_matrix['label'].sort_values(ascending=False)
```
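One caveat worth adding (my note, not the original's): train_set.corr() materializes a full 23720 x 23720 correlation matrix just to read a single column. If only the correlations against the label are needed, DataFrame.corrwith is a much cheaper sketch:

```python
# Correlate each feature column with the label directly,
# instead of building the full pairwise correlation matrix.
label_corr = train_set.iloc[:, 1:].corrwith(train_set['label'])
print(label_corr.sort_values(ascending=False).head(10))
```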
Model training

We train several models on the data above.
LR
```python
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, precision_recall_curve,
                             roc_auc_score)
from sklearn.model_selection import cross_val_score, cross_val_predict

# clf = SGDClassifier(loss='log')  # logistic regression trained with SGD
clf = LogisticRegression(penalty='l2', C=0.1)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
# print(pred)

accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)
auc = roc_auc_score(y_test, pred)
print(accuracy, precision, recall, f1, auc)
# cross_val_score(clf, X_train, y_train, cv=3, scoring='recall')
```

```
0.5722861430715358 0.5323741007194245 0.4884488448844885 0.5094664371772806 0.5653253398734368
```

Threshold adjustment
Let's adjust the classification threshold. LogisticRegression's decision_function returns an unbounded real-valued score (the log-odds), and the default threshold is 0.0: a sample is predicted positive when its score exceeds 0, which corresponds to a predicted probability above 0.5.
Below we first obtain the raw prediction scores, then change the threshold to adjust the resulting classification.
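Before doing that, a quick sanity check on what the score means (a sketch reusing the clf and X_test fitted above): the probability is the sigmoid of the score, and the default predict() is exactly "score > 0":

```python
import numpy as np

z = clf.decision_function(X_test)          # log-odds, unbounded
p = clf.predict_proba(X_test)[:, 1]        # probability of class 1
assert np.allclose(p, 1.0 / (1.0 + np.exp(-z)))                   # p = sigmoid(z)
assert np.array_equal(clf.predict(X_test), (z > 0).astype(int))   # default rule
```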
```python
y_score = clf.decision_function(X_test)
print(y_score)

threshold = 0.99
y_predict_t = (y_score > threshold)
print(y_predict_t)

accuracy = accuracy_score(y_test, y_predict_t)
precision = precision_score(y_test, y_predict_t)
recall = recall_score(y_test, y_predict_t)
f1 = f1_score(y_test, y_predict_t)
auc = roc_auc_score(y_test, y_predict_t)
print(accuracy, precision, recall, f1, auc)
```

```
[ 0.21320269  0.25052858 -0.47845967 ...  0.83939436  0.36785     0.3116261 ]
[False False False ... False False False]
0.5617808904452226 0.6222222222222222 0.0924092409240924 0.16091954022988506 0.5228101250492021
```

As you can see, raising the threshold lowers recall and raises precision.
In other words, raising the classification threshold from 0 to 0.99 drops recall from 0.49 to 0.09 while precision rises from 0.53 to 0.62 (values rounded; exact figures above):

| threshold | accuracy | precision | recall | f1     | auc    |
|-----------|----------|-----------|--------|--------|--------|
| 0.0       | 0.5723   | 0.5324    | 0.4884 | 0.5095 | 0.5653 |
| 0.99      | 0.5618   | 0.6222    | 0.0924 | 0.1609 | 0.5228 |
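Rather than probing a single hand-picked threshold, precision_recall_curve (imported earlier but unused) evaluates every candidate threshold in one pass. A sketch that finds the first threshold reaching 60% precision (the 0.6 target is an arbitrary illustration, and this assumes such a threshold exists for this model):

```python
# Sweep all candidate thresholds at once on the decision scores.
precisions, recalls, thresholds = precision_recall_curve(y_test, y_score)

# argmax on a boolean array returns the first True, i.e. the first
# threshold whose precision reaches 0.6.
idx = (precisions[:-1] >= 0.6).argmax()
print(thresholds[idx], precisions[idx], recalls[idx])
```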
SVM
```python
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='hinge')  # hinge loss: a linear SVM
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, pred)
auc = roc_auc_score(y_test, pred)
print(accuracy, auc)
cross_val_score(clf, X_train, y_train, cv=3, scoring='recall')
```

```
0.5322661330665333 0.529358807440377
array([0.42452043, 0.5646372 , 0.39366138])
```
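The same thresholding trick applies here. SGDClassifier with hinge loss has no predict_proba, but it does expose decision_function; a sketch reusing the fitted clf (the 1.0 cut-off is arbitrary):

```python
# Threshold the SVM's raw margin instead of using the default predict().
svm_scores = clf.decision_function(X_test)
pred_strict = (svm_scores > 1.0)  # stricter than the default margin of 0
print(precision_score(y_test, pred_strict), recall_score(y_test, pred_strict))
```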
Decision tree

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, pred)
auc = roc_auc_score(y_test, pred)
print(accuracy, auc)
```

```
0.5537768884442221 0.5478048263541951
```

Deep learning with TensorFlow
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

columns_count = X_train.shape[1]
print(columns_count)

# model = keras.models.Sequential()
# model.add(keras.layers.Flatten(input_shape=[columns_count, 1]))  # flatten the 2-D input into a 1-D vector
# model.add(keras.layers.Dense(300, activation='relu'))
# model.add(keras.layers.Dense(100, activation='relu'))
# model.add(keras.layers.Dense(1, activation='sigmoid'))

model = keras.Sequential([
    keras.layers.Flatten(input_shape=[columns_count, 1]),
    layers.Dense(128, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.fit(X_train, y_train)
```

```
23719
250/250 [==============================] - 1s 4ms/step - loss: 0.6861 - accuracy: 0.5537
<tensorflow.python.keras.callbacks.History at 0x7f83b42c5a00>
```

Doing binary text classification with Keras, I kept hitting the error below: my labels are 0 or 1, yet the error claims the value 1 is out of range.
See: "Received a label value of 1 which is outside the valid range of [0, 1)", a Python/Keras loss-function issue.
The model originally used sparse_categorical_crossentropy; switching the loss to binary_crossentropy fixed the problem.
https://blog.csdn.net/The_Time_Runner/article/details/93889004
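To spell out why (my reading of the error, consistent with the link above): sparse_categorical_crossentropy expects one output unit per class, so with a single output unit the valid label range is [0, 1) and the label 1 is rejected. Both of these setups are internally consistent; the softmax variant below is a sketch, not the article's model:

```python
# (a) what the article uses: 1 sigmoid unit + binary_crossentropy, labels 0/1.
# (b) equivalent alternative: 2 softmax units + sparse_categorical_crossentropy.
model_alt = keras.Sequential([
    keras.layers.Flatten(input_shape=[columns_count, 1]),
    layers.Dense(128, activation='relu'),
    layers.Dense(2, activation='softmax'),   # one output unit per class
])
model_alt.compile(loss='sparse_categorical_crossentropy',
                  optimizer='sgd', metrics=['accuracy'])
```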