當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

[scikit-learn 机器学习] 6. 逻辑回归

發(fā)布時間：2024/7/5 编程问答 31 豆豆

生活随笔收集整理的這篇文章主要介紹了 [scikit-learn 机器学习] 6. 逻辑回归小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

文章目錄

- 1. 邏輯回歸二分類
- 2. 垃圾郵件過濾
- - 2.1 性能指標
  - 2.2 準確率
  - 2.3 精準率、召回率
  - 2.4 F1值
  - 2.5 ROC、AUC
- 3. 網(wǎng)格搜索調(diào)參
- 4. 多類別分類
- 5. 多標簽分類
- - 5.1 多標簽分類性能指標

本文為 scikit-learn機器學習（第2版）學習筆記

邏輯回歸常用于分類任務

1. 邏輯回歸二分類

《統(tǒng)計學習方法》邏輯斯諦回歸模型（ Logistic Regression，LR）

定義：設(shè) $X$ 是連續(xù)隨機變量， $X$ 服從 logistic 分布是指 $X$ 具有下列分布函數(shù)和密度函數(shù)：

$\leq x) = \frac{1}{1+e^{{-(x-\mu)} / \gamma}}$

$\frac {e^{{-(x-\mu)} / \gamma}}{\gamma {(1+e^{{-(x-\mu)}/\gamma})}^2}$

在邏輯回歸中，當預測概率 >= 閾值，預測為正類，否則預測為負類

2. 垃圾郵件過濾

從信息中提取 TF-IDF 特征，并使用邏輯回歸進行分類

import pandas as pd data = pd.read_csv("SMSSpamCollection", delimiter='\t',header=None) data

data[data[0]=='ham'][0].count() # 4825 條正常信息 data[data[0]=='spam'][0].count() # 747 條垃圾信息 import numpy as np from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.linear_model import LogisticRegression from sklearn.model_selection import train_test_split, cross_val_scoreX = data[1].values y = data[0].values from sklearn.preprocessing import LabelBinarizer lb = LabelBinarizer() y = lb.fit_transform(y)X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=520)vectorizer = TfidfVectorizer() X_train = vectorizer.fit_transform(X_train_raw) X_test = vectorizer.transform(X_test_raw)classifier = LogisticRegression() classifier.fit(X_train, y_train)pred = classifier.predict(X_test) for i, pred_i in enumerate(pred[:5]):print("預測為：%s, 信息為：%s,真實為：%s" %(pred_i,X_test_raw[i],y_test[i])) 預測為：0, 信息為：Aww that's the first time u said u missed me without asking if I missed u first. You DO love me! :),真實為：[0] 預測為：0, 信息為：Poor girl can't go one day lmao,真實為：[0] 預測為：0, 信息為：Also remember the beads don't come off. Ever.,真實為：[0] 預測為：0, 信息為：I see the letter B on my car,真實為：[0] 預測為：0, 信息為：My love ! How come it took you so long to leave for Zaher's? I got your words on ym and was happy to see them but was sad you had left. I miss you,真實為：[0]

2.1 性能指標

混淆矩陣

from sklearn.metrics import confusion_matrix import matplotlib.pyplot as plt confusion_matrix = confusion_matrix(y_test, pred) plt.matshow(confusion_matrix) plt.rcParams["font.sans-serif"]= 'SimHei' # 消除中文亂碼 plt.title("混淆矩陣") plt.ylabel('真實') plt.xlabel('預測') plt.colorbar()

2.2 準確率

scores = cross_val_score(classifier, X_train, y_train, cv=5) print('Accuracies: %s' % scores) print('Mean accuracy: %s' % np.mean(scores)) Accuracies: [0.94976077 0.95933014 0.96650718 0.95215311 0.95688623] Mean accuracy: 0.9569274847434318

準確率不是一個很合適的性能指標，它不能區(qū)分預測錯誤，是正預測為負，還是負預測為正

2.3 精準率、召回率

可以參考 [Hands On ML] 3. 分類（MNIST手寫數(shù)字預測）

單獨只看精準率或者召回率是沒有意義的

from sklearn.metrics import precision_score, recall_score, f1_score precisions = precision_score(y_test, pred) print('Precision: %s' % precisions) recalls = recall_score(y_test, pred) print('Recall: %s' % recalls) Precision: 0.9852941176470589 預測為垃圾信息的基本上真的是垃圾信息Recall: 0.6979166666666666 有30%的垃圾信息預測為了非垃圾信息

2.4 F1值

F1 值是以上精準率和召回率的均衡

f1s = f1_score(y_test, pred) print('F1 score: %s' % f1s) # F1 score: 0.8170731707317074

2.5 ROC、AUC

好的分類器AUC面積越接近1越好，隨機分類器AUC面積為0.5

from sklearn.metrics import roc_curve from sklearn.metrics import roc_auc_scorefalse_positive_rate, recall, thresholds = roc_curve(y_test, pred) roc_auc_score = roc_auc_score(y_test, pred)plt.title('受試者工作特性') plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc_score) plt.legend(loc='lower right') plt.plot([0, 1], [0, 1], 'r--') plt.xlim([0.0, 1.0]) plt.ylim([0.0, 1.0]) plt.ylabel('Recall') plt.xlabel('Fall-out') plt.show()

3. 網(wǎng)格搜索調(diào)參

import pandas as pd from sklearn.preprocessing import LabelEncoder from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.linear_model import LogisticRegression from sklearn.model_selection import GridSearchCV from sklearn.pipeline import Pipeline from sklearn.model_selection import train_test_split from sklearn.metrics import precision_score, recall_score, accuracy_scorepipeline = Pipeline([('vect', TfidfVectorizer(stop_words='english')),('clf', LogisticRegression()) ]) parameters = {'vect__max_df': (0.25, 0.5, 0.75), # 模塊name__參數(shù)name'vect__stop_words': ('english', None),'vect__max_features': (2500, 5000, None),'vect__ngram_range': ((1, 1), (1, 2)),'vect__use_idf': (True, False),'clf__penalty': ('l1', 'l2'),'clf__C': (0.01, 0.1, 1, 10), }if __name__ == "__main__":df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)X = df[1].valuesy = df[0].valueslabel_encoder = LabelEncoder()y = label_encoder.fit_transform(y)X_train, X_test, y_train, y_test = train_test_split(X, y)grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)grid_search.fit(X_train, y_train)print('Best score: %0.3f' % grid_search.best_score_)print('Best parameters set:')best_parameters = grid_search.best_estimator_.get_params()for param_name in sorted(parameters.keys()):print('\t%s: %r' % (param_name, best_parameters[param_name]))predictions = grid_search.predict(X_test)print('Accuracy: %s' % accuracy_score(y_test, predictions))print('Precision: %s' % precision_score(y_test, predictions))print('Recall: %s' % recall_score(y_test, predictions)) Best score: 0.985 Best parameters set:clf__C: 10clf__penalty: 'l2'vect__max_df: 0.5vect__max_features: 5000vect__ngram_range: (1, 2)vect__stop_words: Nonevect__use_idf: True Accuracy: 0.9791816223977028 Precision: 1.0 Recall: 0.8605769230769231

調(diào)整參數(shù)后，提高了召回率

4. 多類別分類

電影情緒評價預測

data = pd.read_csv("./chapter5_movie_train.csv",header=0,delimiter='\t') data

data['Sentiment'].describe() count 156060.000000 mean 2.063578 std 0.893832 min 0.000000 25% 2.000000 50% 2.000000 75% 3.000000 max 4.000000 Name: Sentiment, dtype: float64

平均都是比較中立的情緒

data["Sentiment"].value_counts()/data["Sentiment"].count() 2 0.509945 3 0.210989 1 0.174760 4 0.058990 0 0.045316 Name: Sentiment, dtype: float64

50% 的例子都是中立的情緒

from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.linear_model import LogisticRegression from sklearn.model_selection import train_test_split from sklearn.metrics import classification_report, accuracy_score, confusion_matrix from sklearn.pipeline import Pipeline from sklearn.model_selection import GridSearchCVdf = pd.read_csv('./chapter5_movie_train.csv', header=0, delimiter='\t') X, y = df['Phrase'], df['Sentiment'].values X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)pipeline = Pipeline([('vect', TfidfVectorizer(stop_words='english')),('clf', LogisticRegression()) ]) parameters = {'vect__max_df': (0.25, 0.5),'vect__ngram_range': ((1, 1), (1, 2)),'vect__use_idf': (True, False),'clf__C': (0.1, 1, 10), }grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy') grid_search.fit(X_train, y_train)print('Best score: %0.3f' % grid_search.best_score_) print('Best parameters set:') best_parameters = grid_search.best_estimator_.get_params() for param_name in sorted(parameters.keys()):print('\t%s: %r' % (param_name, best_parameters[param_name])) Best score: 0.619 Best parameters set:clf__C: 10vect__max_df: 0.25vect__ngram_range: (1, 2)vect__use_idf: False

性能指標

predictions = grid_search.predict(X_test)print('Accuracy: %s' % accuracy_score(y_test, predictions)) print('Confusion Matrix:') print(confusion_matrix(y_test, predictions)) print('Classification Report:') print(classification_report(y_test, predictions)) Accuracy: 0.6292323465333846 Confusion Matrix: [[ 1013 1742 682 106 11][ 794 5914 6275 637 49][ 196 3207 32397 3686 222][ 28 488 6513 8131 1299][ 1 59 548 2388 1644]] Classification Report:precision recall f1-score support0 0.50 0.29 0.36 35541 0.52 0.43 0.47 136692 0.70 0.82 0.75 397083 0.54 0.49 0.52 164594 0.51 0.35 0.42 4640accuracy 0.63 78030macro avg 0.55 0.48 0.50 78030 weighted avg 0.61 0.63 0.62 78030

5. 多標簽分類

一個實例可以被貼上多個 labels

問題轉(zhuǎn)換：

實例的標簽(假設(shè)為L1,L2)，轉(zhuǎn)換成（L1 and L2）,以此類推，缺點，產(chǎn)生很多種類的標簽，且模型只能訓練數(shù)據(jù)中包含的類，很多可能無法覆蓋到
對每個標簽，訓練一個二分類器（這個實例是L1嗎，是L2嗎？），缺點，忽略了標簽之間的關(guān)系

5.1 多標簽分類性能指標

漢明損失：不正確標簽的平均比例，0最好
杰卡德相似系數(shù)：預測與真實標簽的交集數(shù)量 / 并集數(shù)量，1最好

from sklearn.metrics import hamming_loss, jaccard_score # help(jaccard_score)print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[0.0, 1.0], [1.0, 1.0]])))print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [1.0, 1.0]])))print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [0.0, 1.0]])))print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[0.0, 1.0], [1.0, 1.0]]),average=None))print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [1.0, 1.0]]),average=None))print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [0.0, 1.0]]),average=None)) 0.0 0.25 0.5 [1. 1.] [0.5 1. ] [0. 1.]

總結(jié)

以上是生活随笔為你收集整理的[scikit-learn 机器学习] 6. 逻辑回归的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇： LeetCode 1276. 不浪费原料
下一篇： LeetCode 769. 最多能完成排