[scikit-learn Machine Learning] 6. Logistic Regression
Contents
- 1. Binary Classification with Logistic Regression
- 2. Spam Filtering
- 2.1 Performance Metrics
- 2.2 Accuracy
- 2.3 Precision and Recall
- 2.4 F1 Score
- 2.5 ROC and AUC
- 3. Hyperparameter Tuning with Grid Search
- 4. Multi-class Classification
- 5. Multi-label Classification
- 5.1 Multi-label Classification Performance Metrics
This article is a set of study notes for scikit-learn Machine Learning (2nd Edition).
Logistic regression is commonly used for classification tasks.
1. Binary Classification with Logistic Regression
From Statistical Learning Methods: the logistic regression (LR) model.
Definition: Let $X$ be a continuous random variable. $X$ follows a logistic distribution if $X$ has the following distribution function and density function:
$$F(x) = P(X \leq x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$
$$f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma\,\left(1 + e^{-(x-\mu)/\gamma}\right)^2}$$
In logistic regression, when the predicted probability is >= the threshold, the instance is predicted as the positive class; otherwise it is predicted as the negative class. A small illustration follows below.
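As a quick sketch, here is the logistic (sigmoid) function, i.e. the $\mu = 0, \gamma = 1$ case of $F(x)$ above, together with the thresholding rule. The toy scores and the 0.5 threshold are illustrative assumptions, not from the book:

```python
import numpy as np

def logistic(z):
    """Standard logistic (sigmoid) function: 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy linear scores (w·x + b) for three instances -- illustrative values only
scores = np.array([-2.0, 0.3, 4.0])
probs = logistic(scores)             # predicted probabilities of the positive class
preds = (probs >= 0.5).astype(int)   # threshold at 0.5: 1 = positive, 0 = negative
print(probs)   # [0.11920292 0.57444252 0.98201379]
print(preds)   # [0 1 1]
```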
2. Spam Filtering
Extract TF-IDF features from the messages and classify them with logistic regression.
```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelBinarizer

data = pd.read_csv("SMSSpamCollection", delimiter='\t', header=None)
data[data[0] == 'ham'][0].count()   # 4825 legitimate (ham) messages
data[data[0] == 'spam'][0].count()  # 747 spam messages

X = data[1].values
y = data[0].values
lb = LabelBinarizer()
y = lb.fit_transform(y)  # ham -> 0, spam -> 1 (note: y becomes a column vector)

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=520)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

pred = classifier.predict(X_test)
for i, pred_i in enumerate(pred[:5]):
    print("Predicted: %s, message: %s, actual: %s" % (pred_i, X_test_raw[i], y_test[i]))
```

```
Predicted: 0, message: Aww that's the first time u said u missed me without asking if I missed u first. You DO love me! :), actual: [0]
Predicted: 0, message: Poor girl can't go one day lmao, actual: [0]
Predicted: 0, message: Also remember the beads don't come off. Ever., actual: [0]
Predicted: 0, message: I see the letter B on my car, actual: [0]
Predicted: 0, message: My love ! How come it took you so long to leave for Zaher's? I got your words on ym and was happy to see them but was sad you had left. I miss you, actual: [0]
```

2.1 Performance Metrics
Confusion matrix:
```python
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, pred)  # renamed to avoid shadowing the confusion_matrix function
plt.matshow(cm)
plt.rcParams["font.sans-serif"] = 'SimHei'  # font setting from the original post, used to render Chinese labels
plt.title("Confusion matrix")
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.colorbar()
plt.show()
```

2.2 Accuracy
```python
scores = cross_val_score(classifier, X_train, y_train, cv=5)
print('Accuracies: %s' % scores)
print('Mean accuracy: %s' % np.mean(scores))
```

```
Accuracies: [0.94976077 0.95933014 0.96650718 0.95215311 0.95688623]
Mean accuracy: 0.9569274847434318
```

Accuracy is not a very suitable performance metric here: it cannot distinguish between the two kinds of errors, misclassifying spam as ham versus ham as spam, and it is inflated by the class imbalance in this dataset.
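To see how misleading accuracy can be on imbalanced data (about 86% of the messages are ham), compare against a majority-class baseline. A minimal sketch using scikit-learn's DummyClassifier, reusing X_train and y_train from above:

```python
from sklearn.dummy import DummyClassifier

# Always predicts the most frequent class ('ham'), so it never catches any spam
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, X_train, y_train.ravel(), cv=5)
# Roughly 0.87 mean accuracy despite zero recall on spam
print('Baseline mean accuracy: %s' % np.mean(baseline_scores))
```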
2.3 Precision and Recall
See also: [Hands On ML] 3. Classification (predicting MNIST handwritten digits)
Looking at precision or recall in isolation is not meaningful: a classifier that predicts positive only for its single most confident instance can achieve perfect precision, while one that predicts everything positive achieves perfect recall.
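For reference, with TP, FP, and FN denoting the true-positive, false-positive, and false-negative counts from the confusion matrix, the standard definitions are:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$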
```python
from sklearn.metrics import precision_score, recall_score, f1_score

precisions = precision_score(y_test, pred)
print('Precision: %s' % precisions)
recalls = recall_score(y_test, pred)
print('Recall: %s' % recalls)
```

```
Precision: 0.9852941176470589
Recall: 0.6979166666666666
```

Precision is high: messages predicted as spam almost always really are spam. Recall is low: about 30% of the spam messages were misclassified as ham.

2.4 F1 Score
The F1 score is the harmonic mean of the precision and recall above, balancing the two: $F_1 = \frac{2PR}{P + R}$.
```python
f1s = f1_score(y_test, pred)
print('F1 score: %s' % f1s)  # F1 score: 0.8170731707317074
```

2.5 ROC and AUC
- The closer a classifier's AUC (area under the ROC curve) is to 1, the better; a random classifier has an AUC of 0.5. See the sketch below.
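The original section has no code here; a minimal sketch of plotting the ROC curve and computing the AUC for the spam classifier above (reusing classifier, X_test, and y_test from section 2):

```python
from sklearn.metrics import roc_curve, auc

# Use the predicted probability of the positive (spam) class, not the hard 0/1 labels
probs = classifier.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(y_test.ravel(), probs[:, 1])
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='ROC curve (AUC = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')  # diagonal = random classifier, AUC = 0.5
plt.xlabel('False positive rate')
plt.ylabel('True positive rate (recall)')
plt.legend(loc='lower right')
plt.show()
```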
3. Hyperparameter Tuning with Grid Search
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),  # key format: <step name>__<parameter name>
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    # note: newer scikit-learn versions require solver='liblinear' or 'saga' for penalty='l1'
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

if __name__ == "__main__":
    df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
    X = df[1].values
    y = df[0].values
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)  # ham -> 0, spam -> 1
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1,
                               scoring='accuracy', cv=3)
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    predictions = grid_search.predict(X_test)
    print('Accuracy: %s' % accuracy_score(y_test, predictions))
    print('Precision: %s' % precision_score(y_test, predictions))
    print('Recall: %s' % recall_score(y_test, predictions))
```

```
Best score: 0.985
Best parameters set:
	clf__C: 10
	clf__penalty: 'l2'
	vect__max_df: 0.5
	vect__max_features: 5000
	vect__ngram_range: (1, 2)
	vect__stop_words: None
	vect__use_idf: True
Accuracy: 0.9791816223977028
Precision: 1.0
Recall: 0.8605769230769231
```

After tuning the hyperparameters, recall improved (from roughly 0.70 in section 2.3 to about 0.86).
4. Multi-class Classification
Predicting the sentiment of movie reviews.
```python
data = pd.read_csv("./chapter5_movie_train.csv", header=0, delimiter='\t')
data['Sentiment'].describe()
```

```
count    156060.000000
mean          2.063578
std           0.893832
min           0.000000
25%           2.000000
50%           2.000000
75%           3.000000
max           4.000000
Name: Sentiment, dtype: float64
```

The mean sentiment is fairly neutral (labels range from 0, most negative, to 4, most positive).
data["Sentiment"].value_counts()/data["Sentiment"].count() 2 0.509945 3 0.210989 1 0.174760 4 0.058990 0 0.045316 Name: Sentiment, dtype: float6450% 的例子都是中立的情緒
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline

df = pd.read_csv('./chapter5_movie_train.csv', header=0, delimiter='\t')
X, y = df['Phrase'], df['Sentiment'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
```

```
Best score: 0.619
Best parameters set:
	clf__C: 10
	vect__max_df: 0.25
	vect__ngram_range: (1, 2)
	vect__use_idf: False
```

- Performance metrics: see the evaluation sketch below.
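The imports above include classification_report and confusion_matrix, but the evaluation snippet appears to have been lost in extraction. A plausible sketch of that step, reusing grid_search, X_test, and y_test from the block above:

```python
# Evaluate the tuned multi-class model on the held-out split
predictions = grid_search.predict(X_test)
print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_test, predictions))
print('Classification Report:')
print(classification_report(y_test, predictions))  # per-class precision, recall, F1
```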
5. Multi-label Classification
- A single instance can be assigned multiple labels at once.
Problem transformation approaches:
- Convert each observed combination of labels (say L1 and L2) into a single new class (L1 and L2), and so on. Drawback: this produces many label classes, and the model can only learn the combinations present in the training data, so many possible combinations may never be covered.
- Train one binary classifier per label (is this instance L1? is it L2?). Drawback: it ignores the relationships between labels. See the sketch after this list.
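A minimal sketch of the second approach, using scikit-learn's OneVsRestClassifier (one of several ways to fit one binary classifier per label) on a tiny made-up dataset; the documents and labels are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = ['funny action movie', 'pure action thriller',
        'funny romantic comedy', 'sad romance drama']
labels = [['action', 'comedy'], ['action'], ['comedy', 'romance'], ['romance']]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)          # binary indicator matrix, one column per label
X = TfidfVectorizer().fit_transform(docs)

# Fits one independent LogisticRegression per label column
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)
print(mlb.classes_)    # ['action' 'comedy' 'romance']
print(clf.predict(X))  # one 0/1 column per label
```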
5.1 Multi-label Classification Performance Metrics
- Hamming loss: the average fraction of labels predicted incorrectly; 0 is best.
- Jaccard similarity coefficient: the size of the intersection of the predicted and true label sets divided by the size of their union; 1 is best. Both metrics are shown in the sketch below.
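A small example of both metrics with scikit-learn, on made-up indicator matrices (assumes scikit-learn >= 0.21, where jaccard_score replaced the older jaccard_similarity_score):

```python
import numpy as np
from sklearn.metrics import hamming_loss, jaccard_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1]])

# 1 wrong label cell out of 6 -> 1/6 ≈ 0.167 (0 is best)
print(hamming_loss(y_true, y_pred))
# mean per-sample |intersection| / |union|: (1/2 + 2/2) / 2 = 0.75 (1 is best)
print(jaccard_score(y_true, y_pred, average='samples'))
```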