Iris Multi-class Classification with sklearn's LogisticRegression
Contents
- 1. Problem Description
- 2. Data Overview
- 2.1 Data Description
- 2.2 The Data
- 2.3 Data Visualization
- 3. Model Selection
- 3.1 Inherently Multiclass Classifiers
- 3.2 One-vs-Rest Multiclass Classifiers
- 3.3 OneVsRestClassifier
- 3.4 OneVsOneClassifier
- 4. Results Analysis
- 5. Complete Code
The iris (Chinese: 鳶尾花, pinyin: yuān wěi huā), also called the blue butterfly, purple butterfly, or flat bamboo flower, is a genus of roughly 300 species native to central China and Japan, and is the national flower of France. Irises are mainly blue-purple, earning the nickname "blue enchantress"; the Chinese name refers to petals shaped like a kite's (鳶) tail. The flowers come in blue, purple, yellow, white, and red, and the English name "iris" is commonly transliterated into Chinese as "爱丽丝" (Alice).
This article uses sklearn's logistic regression model to predict the iris species, a multi-class problem, and compares the predictions of the OvR and OvO multiclass strategies.
1. Problem Description
- Given the iris feature dataset (sepal and petal length and width)
- Predict which species each sample belongs to (Setosa, Versicolor, Virginica); a minimal sketch of the whole task follows
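Before walking through the data and each strategy in detail, here is a minimal end-to-end sketch of the task (the split ratio and solver here are just illustrative defaults, not the article's final setup; each piece is examined below):

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=777)  # default test size: 0.25
clf = LogisticRegression(multi_class='multinomial', solver='newton-cg')
clf.fit(X_train, y_train)                      # fit on the 112 training samples
print("test accuracy:", clf.score(X_test, y_test))
```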
2. Data Overview

```python
from sklearn import datasets

iris = datasets.load_iris()
print(dir(iris))  # list the attributes/methods of the dataset object
# ['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']
```

We can see the dataset object has several attributes; let's look at them one by one:
2.1 Data Description

```python
print(iris.DESCR)  # dataset description
```

- The dataset contains 150 samples (50 per species)
- Each sample has 4 size measurements (sepal/petal length and width) plus its class
- For each of the 4 features, the description gives the value range, mean, standard deviation, and class correlation
- It also notes missing values, author, date, source, applications, and references
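Those per-feature statistics are easy to reproduce from the raw array; a quick check (a sketch of my own, using numpy directly):

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
print(np.bincount(iris.target))  # [50 50 50] -> 50 samples per class
for name, col in zip(iris.feature_names, iris.data.T):
    # per-feature min / max / mean / std, matching the table in iris.DESCR
    print(f"{name}: min={col.min():.1f} max={col.max():.1f} "
          f"mean={col.mean():.2f} std={col.std(ddof=1):.2f}")
```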
2.2 The Data

```python
print(iris.data)           # feature matrix: 150 rows x 4 columns, <class 'numpy.ndarray'>
print(iris.feature_names)  # feature names
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.filename)       # file path, e.g.
# C:\Users\***\AppData\Roaming\Python\Python37\site-packages\sklearn\datasets\data\iris.csv
print(iris.target)         # class labels, size 150
# [0 0 0 ... 0 1 1 1 ... 1 2 2 2 ... 2]  (50 of each label, in order)
print(iris.target_names)   # the 3 species names
# ['setosa' 'versicolor' 'virginica']
```

2.3 Data Visualization
Since a plane can only display 2 feature dimensions, we pick 2 of the features to visualize.

```python
def show_data_set(X, y, data):
    plt.plot(X[y == 0, 0], X[y == 0, 1], 'rs', label=data.target_names[0])
    plt.plot(X[y == 1, 0], X[y == 1, 1], 'bx', label=data.target_names[1])
    plt.plot(X[y == 2, 0], X[y == 2, 1], 'go', label=data.target_names[2])
    plt.xlabel(data.feature_names[0])
    plt.ylabel(data.feature_names[1])
    plt.title("Iris data in 2D")
    plt.legend()
    plt.rcParams['font.sans-serif'] = 'SimHei'  # only needed to render Chinese plot titles
    plt.show()

iris = datasets.load_iris()
# print(dir(iris))   # attributes/methods of the dataset object
# print(iris.data)   # the data
# print(iris.DESCR)  # dataset description
X = iris.data[:, :2]  # first 2 columns: sepal features (a plane can only show 2D)
# X = iris.data[:, 2:4]  # the 2 petal features
# X = iris.data          # all 4 features
y = iris.target  # class labels
show_data_set(X, y, iris)
```
3. Model Selection
Related posts by the author:
- The logistic regression model (Logistic Regression, LR)
- Binary classification practice with sklearn's LogisticRegression
sklearn multiclass and multilabel algorithms:
- Multiclass classification means a classification task with more than two classes, for example classifying a set of images of oranges, apples, and pears. Multiclass classification assumes that each sample has one and only one label: a fruit can be classified as an apple or as a pear, but not as both at the same time.
- Inherently multiclass classifiers: sklearn.linear_model.LogisticRegression (setting multi_class="multinomial")
- One-vs-rest multiclass classifiers: sklearn.linear_model.LogisticRegression (setting multi_class="ovr")
Classifier (wrapper) strategies:
- One-vs-the-rest (OvR), also known as one-vs-all, implemented in OneVsRestClassifier. The strategy consists of fitting one classifier per class; for each classifier, that class is fitted against all the other classes. Besides its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is interpretability: since each class is represented by one and only one classifier, it is possible to gain knowledge about a class by inspecting its corresponding classifier. This is the most commonly used strategy and a fair default choice.
- One-vs-one (OvO), implemented in OneVsOneClassifier, constructs one classifier per pair of classes; at prediction time, the class that receives the most votes is selected. In the event of a tie (two classes with an equal number of votes), it selects the class with the highest aggregate classification confidence, obtained by summing over the pairwise confidence levels computed by the underlying binary classifiers.

Since it requires fitting n_classes * (n_classes - 1) / 2 classifiers, this method is usually slower than one-vs-the-rest, due to its O(n_classes^2) complexity. However, it can be advantageous for kernel methods that do not scale well with n_samples: each individual learning problem only involves a small subset of the data, whereas with one-vs-the-rest the complete dataset is used n_classes times. OvO also tends to be somewhat more accurate than OvR. A hand-rolled sketch of both strategies follows.
3.1 Inherently Multiclass Classifiers
- sklearn.linear_model.LogisticRegression (setting multi_class="multinomial")
Relevant help text for the multi_class parameter:
In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the 'multi_class' option is set to 'ovr', and uses the cross-entropy loss if the 'multi_class' option is set to 'multinomial'. (Currently the 'multinomial' option is supported only by the 'lbfgs', 'sag', 'saga' and 'newton-cg' solvers.)
multi_class : {'auto', 'ovr', 'multinomial'}, default='auto'
If the option chosen is 'ovr', then a binary problem is fit for each label.
For 'multinomial' the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary.
'multinomial' is unavailable when solver='liblinear'.
'auto' selects 'ovr' if the data is binary, or if solver='liblinear', and otherwise selects 'multinomial'.
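The practical difference between the two settings shows up in predict_proba: 'ovr' fits one sigmoid per class and renormalizes, while 'multinomial' fits a single softmax over all three classes. A small comparison sketch (the exact probabilities will depend on your sklearn version):

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

X, y = datasets.load_iris(return_X_y=True)

ovr = LogisticRegression(multi_class='ovr', solver='liblinear').fit(X, y)
softmax = LogisticRegression(multi_class='multinomial', solver='newton-cg').fit(X, y)

sample = X[51:52]  # one versicolor sample
print(ovr.predict_proba(sample))      # per-class sigmoids, renormalized to sum to 1
print(softmax.predict_proba(sample))  # a single softmax over the 3 classes
# Both rows sum to 1, but the probabilities (and occasionally the argmax)
# differ because the two losses are optimized differently.
```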
Set the LogisticRegression parameters directly: multi_class='multinomial', solver='newton-cg'; code below:

```python
def test1(X_train, X_test, y_train, y_test, multi_class='multinomial', solver='newton-cg'):
    # multinomial multiclass; the solver must be newton-cg or lbfgs (etc.)
    log_reg = LogisticRegression(multi_class=multi_class, solver=solver)
    log_reg.fit(X_train, y_train)
    predict_train = log_reg.predict(X_train)
    sys.stdout.write("LR(multi_class = %s, solver = %s) Train Accuracy : %.4g\n" % (
        multi_class, solver, metrics.accuracy_score(y_train, predict_train)))
    predict_test = log_reg.predict(X_test)
    sys.stdout.write("LR(multi_class = %s, solver = %s) Test Accuracy : %.4g\n" % (
        multi_class, solver, metrics.accuracy_score(y_test, predict_test)))
    plot_decision_boundary(4, 8.5, 1.5, 4.5, lambda x: log_reg.predict(x))  # first 2 (sepal) features; comment out when using all 4
    # plot_decision_boundary(0.5, 7.5, 0, 3, lambda x: log_reg.predict(x))  # last 2 (petal) features; comment out when using all 4
    plot_data(X_train, y_train)
```

3.2 One-vs-Rest Multiclass Classifiers
- sklearn.linear_model.LogisticRegression (setting multi_class="ovr")
Set the LogisticRegression parameters directly: multi_class='ovr', solver='liblinear'; code below:

```python
def test1(X_train, X_test, y_train, y_test, multi_class='ovr', solver='liblinear'):
    log_reg = LogisticRegression(multi_class=multi_class, solver=solver)  # OvR multiclass with the liblinear solver
    log_reg.fit(X_train, y_train)
    predict_train = log_reg.predict(X_train)
    sys.stdout.write("LR(multi_class = %s, solver = %s) Train Accuracy : %.4g\n" % (
        multi_class, solver, metrics.accuracy_score(y_train, predict_train)))
    predict_test = log_reg.predict(X_test)
    sys.stdout.write("LR(multi_class = %s, solver = %s) Test Accuracy : %.4g\n" % (
        multi_class, solver, metrics.accuracy_score(y_test, predict_test)))
    plot_decision_boundary(4, 8.5, 1.5, 4.5, lambda x: log_reg.predict(x))  # first 2 (sepal) features; comment out when using all 4
    # plot_decision_boundary(0.5, 7.5, 0, 3, lambda x: log_reg.predict(x))  # last 2 (petal) features; comment out when using all 4
    plot_data(X_train, y_train)
```

3.3 OneVsRestClassifier
`class sklearn.multiclass.OneVsRestClassifier(estimator, n_jobs=None)` takes an estimator object: first define an LR model `log_reg`, then pass it into the OvR classifier with `ovr = OneVsRestClassifier(log_reg)`:

```python
def test2(X_train, X_test, y_train, y_test):
    # multi_class defaults to 'auto':
    #   'auto' selects 'ovr' if the data is binary, or if solver='liblinear',
    #   and otherwise selects 'multinomial'.
    # Since the solver below is liblinear, 'auto' resolves to 'ovr',
    # so test1 and test2 have the same effect, just written differently.
    log_reg = LogisticRegression(solver='liblinear')
    ovr = OneVsRestClassifier(log_reg)  # wrap the LR in the OvR classifier
    ovr.fit(X_train, y_train)
    predict_train = ovr.predict(X_train)
    sys.stdout.write("LR(ovr) Train Accuracy : %.4g\n" % (metrics.accuracy_score(y_train, predict_train)))
    predict_test = ovr.predict(X_test)
    sys.stdout.write("LR(ovr) Test Accuracy : %.4g\n" % (metrics.accuracy_score(y_test, predict_test)))
    plot_decision_boundary(4, 8.5, 1.5, 4.5, lambda x: ovr.predict(x))  # first 2 (sepal) features; comment out when using all 4
    # plot_decision_boundary(0.5, 7.5, 0, 3, lambda x: ovr.predict(x))  # last 2 (petal) features; comment out when using all 4
    plot_data(X_train, y_train)
```

3.4 OneVsOneClassifier
`class sklearn.multiclass.OneVsOneClassifier(estimator, n_jobs=None)` takes an estimator object: first define an LR model `log_reg`, then pass it into the OvO classifier with `ovo = OneVsOneClassifier(log_reg)`:

```python
def test3(X_train, X_test, y_train, y_test):
    # For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss;
    log_reg = LogisticRegression(multi_class='multinomial', solver='newton-cg')
    # OvO multiclass, wrapping LR(multinomial, newton-cg or lbfgs).
    # Choosing multi_class='ovr' here gives identical results; plausibly this is
    # because every pairwise sub-problem inside OvO is binary, and on binary data
    # the multinomial loss essentially reduces to the ordinary logistic loss.
    ovo = OneVsOneClassifier(log_reg)
    ovo.fit(X_train, y_train)
    predict_train = ovo.predict(X_train)
    sys.stdout.write("LR(ovo) Train Accuracy : %.4g\n" % (metrics.accuracy_score(y_train, predict_train)))
    predict_test = ovo.predict(X_test)
    sys.stdout.write("LR(ovo) Test Accuracy : %.4g\n" % (metrics.accuracy_score(y_test, predict_test)))
    plot_decision_boundary(4, 8.5, 1.5, 4.5, lambda x: ovo.predict(x))  # first 2 (sepal) features; comment out when using all 4
    # plot_decision_boundary(0.5, 7.5, 0, 3, lambda x: ovo.predict(x))  # last 2 (petal) features; comment out when using all 4
    plot_data(X_train, y_train)
```

4. Results Analysis
Run the predictions:

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=777)  # default test size: 0.25
test1(X_train, X_test, y_train, y_test, multi_class='ovr', solver='liblinear')
test2(X_train, X_test, y_train, y_test)
test1(X_train, X_test, y_train, y_test, multi_class='multinomial', solver='newton-cg')
test3(X_train, X_test, y_train, y_test)
```

Accuracy (train / test) of the four runs, where "seed" is the random_state passed to train_test_split:

| Data | LR(ovr, liblinear) | OvR(LR(liblinear)) | LR(multinomial, newton-cg) | OvO(LR(multinomial, newton-cg)) |
| --- | --- | --- | --- | --- |
| seed(520), 2 features [sepal L, W] | 0.7679 / 0.8421 | 0.7679 / 0.8421 | 0.7768 / 0.8947 | 0.7768 / 0.8684 |
| seed(777), 2 features [sepal L, W] | 0.7589 / 0.7368 | 0.7589 / 0.7368 | 0.7768 / 0.8158 | 0.7946 / 0.8158 |
| seed(520), 2 features [petal L, W] | 0.8750 / 0.9474 | 0.8750 / 0.9474 | 0.9554 / 1 | 0.9554 / 1 |
| seed(777), 2 features [petal L, W] | 0.9196 / 0.9474 | 0.9196 / 0.9474 | 0.9554 / 1 | 0.9554 / 1 |
| seed(520), 4 features | 0.9464 / 1 | 0.9464 / 1 | 0.9643 / 1 | 0.9732 / 1 |
| seed(777), 4 features | 0.9464 / 1 | 0.9464 / 1 | 0.9643 / 1 | 0.9732 / 1 |
- The first two columns are both OvR multiclass; the code is written differently but the predictions are exactly the same
- The third column is LR's built-in multinomial (softmax) model, and the fourth is the OneVsOneClassifier wrapper (sklearn's LR provides no built-in 'ovo' option); the snippet after this list checks the wrapper internals directly
- Comparing the strategies, OvO predicts better than OvR here, but at the cost of O(n_classes^2) classifiers
- With the sepal length/width as features, the 2D decision plot shows that setosa is linearly separable from the other two classes, while those two are not linearly separable from each other
- Predicting with the petal length/width gives higher accuracy than with the sepal features; the plots also show the boundaries separating the 3 species well
- With all 4 features, OvO beats OvR on training accuracy, both reach 100% test accuracy, and 4-feature prediction is more accurate than 2-feature prediction
As for what parameters the LR model passed into the OvR/OvO wrappers should use, I ran the following tests on top of the table above (if any expert sees this, please advise!). Two configurations were added: an OvR wrapper around LR(multinomial, newton-cg), and an OvO wrapper around LR(ovr, liblinear). Accuracy (train / test):

| Data | LR(ovr, liblinear) | OvR(LR(ovr, liblinear)) | OvR(LR(multinomial, newton-cg)) | LR(multinomial, newton-cg) | OvO(LR(multinomial, newton-cg)) | OvO(LR(ovr, liblinear)) |
| --- | --- | --- | --- | --- | --- | --- |
| seed(520), 2 features [sepal L, W] | 0.7679 / 0.8421 | 0.7679 / 0.8421 | 0.7857 / 0.8947 | 0.7768 / 0.8947 | 0.7768 / 0.8684 | 0.7500 / 0.7105 |
| seed(777), 2 features [sepal L, W] | 0.7589 / 0.7368 | 0.7589 / 0.7368 | 0.7589 / 0.8158 | 0.7768 / 0.8158 | 0.7946 / 0.8158 | 0.7232 / 0.7105 |
| seed(520), 2 features [petal L, W] | 0.8750 / 0.9474 | 0.8750 / 0.9474 | 0.9375 / 1 | 0.9554 / 1 | 0.9554 / 1 | 0.9464 / 1 |
| seed(777), 2 features [petal L, W] | 0.9196 / 0.9474 | 0.9196 / 0.9474 | 0.9464 / 1 | 0.9554 / 1 | 0.9554 / 1 | 0.9554 / 0.9737 |
| seed(520), 4 features | 0.9464 / 1 | 0.9464 / 1 | 0.9464 / 1 | 0.9643 / 1 | 0.9732 / 1 | 0.9732 / 1 |
| seed(777), 4 features | 0.9464 / 1 | 0.9464 / 1 | 0.9464 / 1 | 0.9643 / 1 | 0.9732 / 1 | 0.9821 / 0.9737 |
Based on these numbers, my tentative reading:
- In most cases, OvR < OvO, and LR('ovr') < LR('multinomial')
- Accordingly, under the same OvR or OvO wrapper, passing in LR('multinomial') tends to give higher accuracy
One plausible (unverified) explanation: every sub-problem inside the OvR/OvO wrappers is binary, and on binary data the multinomial loss essentially reduces to the ordinary logistic loss, so the differences between the wrapped variants likely come from the solver (liblinear regularizes the intercept and optimizes differently from newton-cg) rather than from the multi_class setting itself. Corrections from more knowledgeable readers are very welcome!
5. Complete Code

```python
'''
When you run into an unfamiliar library, module, class, or function, try in order:
1) Search the web (Google tends to be more reliable), e.g. "matplotlib.pyplot";
   there are good blog posts to learn from
2) Terminal --> python --> import xx --> help(xx.yy); this seems useless at first,
   but it is an essential skill for a seasoned engineer
3) Tweak some parameters and watch how the output changes; the code below
   demonstrates this approach repeatedly
'''
# written by hitskyer, I just wanna say thank you !
# modified by Michael Ming on 2020.2.20
# Python 3.7
import sys
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multiclass import OneVsOneClassifier


def show_data_set(X, y, data):
    plt.plot(X[y == 0, 0], X[y == 0, 1], 'rs', label=data.target_names[0])
    plt.plot(X[y == 1, 0], X[y == 1, 1], 'bx', label=data.target_names[1])
    plt.plot(X[y == 2, 0], X[y == 2, 1], 'go', label=data.target_names[2])
    plt.xlabel(data.feature_names[0])
    plt.ylabel(data.feature_names[1])
    plt.title("Iris data in 2D")
    plt.legend()
    plt.rcParams['font.sans-serif'] = 'SimHei'  # only needed to render Chinese plot titles
    plt.show()


def plot_data(X, y):
    plt.plot(X[y == 0, 0], X[y == 0, 1], 'rs', label='setosa')
    plt.plot(X[y == 1, 0], X[y == 1, 1], 'bx', label='versicolor')
    plt.plot(X[y == 2, 0], X[y == 2, 1], 'go', label='virginica')
    plt.xlabel("sepal length (cm)")
    plt.ylabel("sepal width (cm)")
    # plt.xlabel("petal length (cm)")
    # plt.ylabel("petal width (cm)")
    plt.title("Predicted decision boundary")
    plt.legend()
    plt.rcParams['font.sans-serif'] = 'SimHei'  # only needed to render Chinese plot titles
    plt.show()


def plot_decision_boundary(x_min, x_max, y_min, y_max, pred_func):
    h = 0.01
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)


def test1(X_train, X_test, y_train, y_test, multi_class='ovr', solver='liblinear'):
    log_reg = LogisticRegression(multi_class=multi_class, solver=solver)  # OvR or multinomial multiclass
    log_reg.fit(X_train, y_train)
    predict_train = log_reg.predict(X_train)
    sys.stdout.write("LR(multi_class = %s, solver = %s) Train Accuracy : %.4g\n" % (
        multi_class, solver, metrics.accuracy_score(y_train, predict_train)))
    predict_test = log_reg.predict(X_test)
    sys.stdout.write("LR(multi_class = %s, solver = %s) Test Accuracy : %.4g\n" % (
        multi_class, solver, metrics.accuracy_score(y_test, predict_test)))
    plot_decision_boundary(4, 8.5, 1.5, 4.5, lambda x: log_reg.predict(x))  # first 2 features; comment out when using all 4
    # plot_decision_boundary(0.5, 7.5, 0, 3, lambda x: log_reg.predict(x))  # last 2 features; comment out when using all 4
    plot_data(X_train, y_train)


def test2(X_train, X_test, y_train, y_test):
    # multi_class defaults to 'auto':
    #   'auto' selects 'ovr' if the data is binary, or if solver='liblinear',
    #   and otherwise selects 'multinomial'.
    # Since the solver below is liblinear, 'auto' resolves to 'ovr',
    # so test1 and test2 have the same effect, just written differently.
    log_reg = LogisticRegression(solver='liblinear')
    ovr = OneVsRestClassifier(log_reg)
    ovr.fit(X_train, y_train)
    predict_train = ovr.predict(X_train)
    sys.stdout.write("LR(ovr) Train Accuracy : %.4g\n" % (metrics.accuracy_score(y_train, predict_train)))
    predict_test = ovr.predict(X_test)
    sys.stdout.write("LR(ovr) Test Accuracy : %.4g\n" % (metrics.accuracy_score(y_test, predict_test)))
    plot_decision_boundary(4, 8.5, 1.5, 4.5, lambda x: ovr.predict(x))  # first 2 features; comment out when using all 4
    # plot_decision_boundary(0.5, 7.5, 0, 3, lambda x: ovr.predict(x))  # last 2 features; comment out when using all 4
    plot_data(X_train, y_train)


def test3(X_train, X_test, y_train, y_test):
    # For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss;
    log_reg = LogisticRegression(multi_class='multinomial', solver='newton-cg')
    ovo = OneVsOneClassifier(log_reg)  # OvO multiclass, wrapping LR(multinomial, newton-cg or lbfgs)
    ovo.fit(X_train, y_train)
    predict_train = ovo.predict(X_train)
    sys.stdout.write("LR(ovo) Train Accuracy : %.4g\n" % (metrics.accuracy_score(y_train, predict_train)))
    predict_test = ovo.predict(X_test)
    sys.stdout.write("LR(ovo) Test Accuracy : %.4g\n" % (metrics.accuracy_score(y_test, predict_test)))
    plot_decision_boundary(4, 8.5, 1.5, 4.5, lambda x: ovo.predict(x))  # first 2 features; comment out when using all 4
    # plot_decision_boundary(0.5, 7.5, 0, 3, lambda x: ovo.predict(x))  # last 2 features; comment out when using all 4
    plot_data(X_train, y_train)


if __name__ == '__main__':
    iris = datasets.load_iris()
    # print(dir(iris))   # attributes/methods of the dataset object
    # print(iris.data)   # the data
    # print(iris.DESCR)  # dataset description
    X = iris.data[:, :2]  # first 2 columns: sepal features (a plane can only show 2D)
    # X = iris.data[:, 2:4]  # the 2 petal features
    # X = iris.data          # all 4 features
    y = iris.target  # class labels
    show_data_set(X, y, iris)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=777)  # default test size: 0.25
    test1(X_train, X_test, y_train, y_test, multi_class='ovr', solver='liblinear')
    test2(X_train, X_test, y_train, y_test)
    test1(X_train, X_test, y_train, y_test, multi_class='multinomial', solver='newton-cg')
    test3(X_train, X_test, y_train, y_test)
```