sklearn Study Notes (Part 2)

Published: 2025/4/16

Learning resource

http://scikit-learn.org/stable/tutorial/statistical_inference/index.html

Statistical learning: the setting and the estimator object in scikit-learn

The following code:

    from sklearn import datasets

    iris = datasets.load_iris()
    print(iris.DESCR)

prints the description of the dataset:

Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments". IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

Handling the data shape

In general, data in sklearn has the shape (n_samples, n_features).
The first number is the number of samples; the second is the number of features.

    from sklearn import datasets

    iris = datasets.load_iris()
    print(iris.data.shape)

Output:

(150, 4)

Some data, however, does not come in this format and has to be processed first.

Take the digits data, for example:

    from sklearn import datasets

    digits = datasets.load_digits()
    print(digits.data.shape)
    print(digits.images.shape)

The output is:

(1797, 64)
(1797, 8, 8)

The latter, the images array, can be reshaped into the standard form:

    from sklearn import datasets

    digits = datasets.load_digits()
    data = digits.images.reshape((digits.images.shape[0], -1))
    print(data.shape)

Output:

(1797, 64)

The -1 tells reshape to work out that dimension by itself; the first argument keeps the number of samples unchanged.
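As a small illustration of the reshape call, here is a sketch on a made-up array standing in for digits.images (the array contents and the variable names `images` and `flat` are assumptions for this example only):

```python
import numpy as np

# Hypothetical stand-in for digits.images: 5 "images" of 8x8 pixels.
images = np.arange(5 * 8 * 8).reshape(5, 8, 8)

# Keep the sample axis fixed; -1 lets NumPy infer the rest (8 * 8 = 64).
flat = images.reshape((images.shape[0], -1))
print(flat.shape)  # (5, 64)

# Each flattened row is the corresponding image read row by row.
print(np.array_equal(flat[0], images[0].ravel()))  # True
```

Passing images.shape[0] explicitly, rather than hard-coding the sample count, keeps the call valid for any number of samples.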

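Supervised learning: predicting an output variable from high-dimensional observations

Supervised learning consists in learning the link between two datasets: the observed data X and an external variable y. In general, y is a one-dimensional array of length n_samples.

Working with the Iris dataset

As the description above shows, there are three kinds of iris: Setosa, Versicolour, and Virginica.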

    from sklearn import datasets
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older matplotlib)
    from sklearn.decomposition import PCA

    iris = datasets.load_iris()
    X = iris.data[:, :2]  # take the first two features
    y = iris.target

    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5

    fontsize = 15

    # 2D
    plt.figure(1, figsize=(8, 6))
    plt.clf()  # clear current figure
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolors='k')
    plt.xlabel('Sepal length', fontsize=fontsize)
    plt.ylabel('Sepal width', fontsize=fontsize)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks([])
    plt.yticks([])

    # 3D
    fig = plt.figure(2, figsize=(8, 6))
    # azim: rotation about the z axis; elev: elevation of the viewpoint.
    # add_subplot replaces Axes3D(fig, ...), which recent matplotlib no longer attaches to the figure.
    ax = fig.add_subplot(111, projection='3d', elev=-150, azim=110)
    # keep the first three principal components
    X_reduced = PCA(n_components=3).fit_transform(iris.data)
    ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y,
               cmap=plt.cm.Set1, edgecolors='k', s=40)  # s sets the marker size
    ax.set_title("First three PCA directions", fontsize=fontsize)
    ax.set_xlabel("1st eigenvector", fontsize=fontsize)
    ax.xaxis.set_ticklabels([])  # w_xaxis etc. were removed from recent matplotlib
    ax.set_ylabel("2nd eigenvector", fontsize=fontsize)
    ax.yaxis.set_ticklabels([])
    ax.set_zlabel("3rd eigenvector", fontsize=fontsize)
    ax.zaxis.set_ticklabels([])

    plt.savefig('2.png')
    plt.show()

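The code above looks complicated, but the work it actually does is very simple, and much of it is repetition.

Speaking of nearest neighbors, the simplest variant is KNN, the k-nearest-neighbors algorithm.
Nearest neighbors is arguably the simplest family of algorithms in ML.

Explanations you may need:

About the weights parameter of KNN
About the np.c_ function

Code

(copied from http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html)

Some explanation:

About the weights: uniform means that all points in the neighborhood count equally; distance means that each point is weighted by the inverse of its distance (see https://zhuanlan.zhihu.com/p/23191325). The simplest illustration: with 5-NN, if the two nearest neighbors belong to class A and the next three to class B, equal weighting picks B. With inverse-distance weighting, closer points get larger weights, so the prediction may well be A, because the A points are closer.
meshgrid builds grid coordinate data; it is also common in MATLAB.
np.c_ concatenates along columns. For example, with A = [1, 2] and B = [3, 4], np.c_[A, B] = [[1, 3], [2, 4]]. Note the square brackets: np.c_ is indexed, not called.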
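The two NumPy helpers can be tried on tiny inputs before reading the plotting code. This is a minimal sketch; the axis values `xs` and `ys` are made up for illustration:

```python
import numpy as np

A = [1, 2]
B = [3, 4]
stacked = np.c_[A, B]        # column-wise concatenation
print(stacked)               # [[1 3]
                             #  [2 4]]

# meshgrid expands two 1-D axes into 2-D coordinate arrays.
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([10.0, 20.0])
xx, yy = np.meshgrid(xs, ys)
print(xx.shape, yy.shape)    # (2, 3) (2, 3)

# ravel + np.c_ turns the grid into one (x, y) row per grid point,
# which is the input shape that predict() expects.
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points.shape)     # (6, 2)
```

This is the pattern the decision-boundary plots rely on: predict on every grid point, then reshape the predictions back to xx.shape for coloring.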

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap
    from sklearn import neighbors, datasets

    n_neighbors = 15

    # import some data to play with
    iris = datasets.load_iris()

    # we only take the first two features. We could avoid this ugly
    # slicing by using a two-dim dataset
    X = iris.data[:, :2]
    y = iris.target

    h = .02  # step size in the mesh

    # Create color maps
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

    for weights in ['uniform', 'distance']:
        # we create an instance of Neighbours Classifier and fit the data.
        clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
        clf.fit(X, y)

        # Plot the decision boundary. For that, we will assign a color to each
        # point in the mesh [x_min, x_max]x[y_min, y_max].
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        plt.figure()
        # color the background by predicted class
        plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

        # Plot also the training points
        plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                    edgecolor='k', s=20)
        plt.xlim(xx.min(), xx.max())
        plt.ylim(yy.min(), yy.max())
        plt.title("3-Class classification (k = %i, weights = '%s')"
                  % (n_neighbors, weights))

    plt.show()

Using the KNN algorithm

Use everything except the last 10 samples as the training set, and the last 10 as the test set.

    import numpy as np
    from sklearn import neighbors, datasets

    n_neighbors = 15
    iris = datasets.load_iris()
    np.random.seed(0)
    iris_X = iris.data
    iris_y = iris.target
    indices = np.random.permutation(len(iris_X))
    iris_X_train = iris_X[indices[:-10]]
    iris_y_train = iris_y[indices[:-10]]
    iris_X_test = iris_X[indices[-10:]]
    iris_y_test = iris_y[indices[-10:]]

    knn = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(iris_X_train, iris_y_train)
    print(knn.predict(iris_X_test))
    print(iris_y_test)

Output (predicted labels compared with the true labels):

[1 2 1 0 0 0 2 1 2 0]
[1 1 1 0 0 0 2 1 2 0]

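When the number of neighbors is increased:
with n_neighbors = 26, the predictions are all correct. (A larger k takes more neighbors into account, which in theory helps, although k should not be set too high either.)

With weights set to distance, however, the predictions only become fully correct at n_neighbors = 40.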
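The difference between the two weighting schemes can be reproduced on the 5-NN situation described earlier (two close neighbors of class A, three farther neighbors of class B). The one-dimensional data below is made up for this sketch, with label 0 standing for A and 1 for B:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training set: class 0 ("A") sits close to the query,
# class 1 ("B") sits farther away.
X = np.array([[0.9], [1.1], [3.0], [3.2], [3.4]])
y = np.array([0, 0, 1, 1, 1])
query = np.array([[0.0]])

uniform = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X, y)
distance = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)

# uniform: 3 votes for B against 2 for A, so B (1) wins.
pred_uniform = uniform.predict(query)[0]
# distance: 1/0.9 + 1/1.1 (about 2.02) for A beats
# 1/3.0 + 1/3.2 + 1/3.4 (about 0.94) for B, so A (0) wins.
pred_distance = distance.predict(query)[0]
print(pred_uniform, pred_distance)  # 1 0
```

Because the two schemes can disagree on the same neighbors, it is not surprising that they reach a perfect test score at different values of n_neighbors.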

Linear models

Linear regression fits a linear model to the data: it adjusts the parameters so that the sum of the squared residuals is as small as possible.
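To make "minimizing the sum of squared residuals" concrete, the sketch below fits the same synthetic data twice: once with LinearRegression and once by solving the least-squares problem directly with np.linalg.lstsq. The data, the true coefficients, and the noise level are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 1.5*x0 - 2.0*x1 + 0.5*x2 + 4.0 + small noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=50)

regr = LinearRegression().fit(X, y)

# Same problem solved directly: append a column of ones for the intercept
# and minimize ||X1 @ w - y||^2.
X1 = np.c_[X, np.ones(len(X))]
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

same_coef = np.allclose(regr.coef_, w[:3])
same_intercept = np.allclose(regr.intercept_, w[3])
print(same_coef, same_intercept)
```

Both routes should agree up to floating-point error, since they solve the same least-squares problem.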

    from sklearn import linear_model
    from sklearn import datasets
    import numpy as np

    diabetes = datasets.load_diabetes()

    N_test = 20
    diabetes_X_train = diabetes.data[:-N_test]
    diabetes_X_test = diabetes.data[-N_test:]
    diabetes_y_train = diabetes.target[:-N_test]
    diabetes_y_test = diabetes.target[-N_test:]

    regr = linear_model.LinearRegression()
    regr.fit(diabetes_X_train, diabetes_y_train)
    print(regr.coef_)

    # mean squared error on the test set
    mean_residual = np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2)
    print(mean_residual)
    print(regr.score(diabetes_X_test, diabetes_y_test))

Output:

[ 3.03499549e-01 -2.37639315e+02  5.10530605e+02  3.27736980e+02
 -8.14131709e+02  4.92814588e+02  1.02848452e+02  1.84606489e+02
  7.43519617e+02  7.60951722e+01]
2004.5676026898218
0.5850753022690572
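The three printed values are the fitted coefficients, the mean squared error on the test set, and the value returned by score, which for regressors is the coefficient of determination R². The sketch below recomputes R² by hand on the same split to show what score measures; the shortened variable names are choices made for this example:

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X_train, X_test = diabetes.data[:-20], diabetes.data[-20:]
y_train, y_test = diabetes.target[:-20], diabetes.target[-20:]

regr = linear_model.LinearRegression().fit(X_train, y_train)
pred = regr.predict(X_test)

# Mean squared error: average squared residual on the test set.
mse = np.mean((pred - y_test) ** 2)

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, r2, regr.score(X_test, y_test))
```

An R² of 1 means perfect prediction, and 0 means the model does no better than always predicting the mean of y.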

Using linear regression to predict the physiological data:

Because the data is high-dimensional, first use principal component analysis to extract the two leading components, then fit a plane to them (a linear regression drawn in three dimensions).

    from sklearn import datasets
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older matplotlib)
    from sklearn.decomposition import PCA
    from sklearn import linear_model
    import numpy as np

    h = 100

    diabetes = datasets.load_diabetes()
    # keep the first two principal components (PCA)
    X_reduced = PCA(n_components=2).fit_transform(diabetes.data)
    regr = linear_model.LinearRegression()
    regr.fit(X_reduced, diabetes.target)

    x_min, x_max = min(X_reduced[:, 0]), max(X_reduced[:, 0])
    y_min, y_max = min(X_reduced[:, 1]), max(X_reduced[:, 1])
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, h),
                         np.linspace(y_min, y_max, h))
    Z = regr.predict(np.c_[xx.ravel(), yy.ravel()])

    xx = xx.ravel()
    yy = yy.ravel()
    Z = Z.ravel()

    fig = plt.figure(1, figsize=(8, 6))
    # add_subplot replaces Axes3D(fig, ...), which recent matplotlib no longer attaches to the figure
    ax = fig.add_subplot(111, projection='3d', elev=30, azim=-30)
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)

    ax.scatter(X_reduced[:, 0], X_reduced[:, 1], diabetes.target, color='k')
    ax.plot(xx, yy, Z)
    plt.show()

http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
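That page has code for the two-dimensional case.

Perturbation and shrinkage:

The idea: add a perturbation (drawn from a normal distribution) to each of the two points x = 0.5 and x = 1, then fit a line to the two perturbed points. (Because each fit uses only two points, the fitted line always passes through both of them.)

Code:

    import matplotlib.pyplot as plt
    from sklearn import linear_model
    import numpy as np

    X = np.c_[.5, 1].T
    y = [.5, 1]
    test = np.c_[0, 2].T
    regr = linear_model.LinearRegression()

    plt.figure()

    np.random.seed(0)
    for _ in range(6):
        this_X = .1 * np.random.normal(size=(2, 1)) + X
        regr.fit(this_X, y)
        # plot the fitted line and the two perturbed points
        plt.plot(test, regr.predict(test))
        plt.scatter(this_X, y, s=3)

    plt.show()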
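The same experiment can be checked numerically without plotting. The sketch below repeats the six perturbed fits, confirms that each ordinary least-squares line passes through its two points, and compares the spread of the slopes with a Ridge fit on the same points. The Ridge comparison and alpha=.1 are additions made for illustration; they are not part of the code above:

```python
import numpy as np
from sklearn import linear_model

X = np.c_[.5, 1].T                      # two training inputs, shape (2, 1)
y = [.5, 1]

ols = linear_model.LinearRegression()
ridge = linear_model.Ridge(alpha=.1)    # alpha=.1 is an illustrative choice

rng = np.random.RandomState(0)
ols_slopes, ridge_slopes, max_residual = [], [], 0.0
for _ in range(6):
    this_X = .1 * rng.normal(size=(2, 1)) + X
    ols.fit(this_X, y)
    ridge.fit(this_X, y)
    # With only two points, the least-squares line interpolates them.
    max_residual = max(max_residual, np.abs(ols.predict(this_X) - y).max())
    ols_slopes.append(ols.coef_[0])
    ridge_slopes.append(ridge.coef_[0])

print("largest OLS residual at the training points:", max_residual)
print("spread of OLS slopes:  ", np.std(ols_slopes))
print("spread of Ridge slopes:", np.std(ridge_slopes))
```

The slopes of the unregularized fits swing widely from one draw to the next, while the penalized fits stay close together. That is the shrinkage effect the heading refers to.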

Linear models: the Ridge model

This model solves a regression model where the loss function is the linear least squares function and regularization is given by the l2-norm. Also known as Ridge Regression or Tikhonov regularization. This estimator has built-in support for multi-variate regression (i.e., when y is a 2d-array of shape [n_samples, n_targets]).

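Fitting easily leads to overfitting, so the ridge model adds an l2 regularization term: while keeping the sum of squared residuals small, it also keeps the l2 norm of the coefficients from growing too large.

Overall, the aim is to avoid overfitting.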
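A quick way to see the penalty at work is to fit Ridge for several values of alpha and watch the norm of the coefficient vector. This is a minimal sketch on the diabetes data; the list of alpha values is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# Ridge minimizes ||y - Xw||^2 + alpha * ||w||^2, so a larger alpha
# puts more pressure on the coefficients to stay small.
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
norms = []
for alpha in alphas:
    reg = linear_model.Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(reg.coef_))
    print(alpha, norms[-1])
```

The printed norms should decrease as alpha grows. Whether a given alpha also predicts better is a separate question that has to be checked on held-out data.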

Rewriting the earlier 3D example:
this simply means switching to the Ridge model.

With everything else unchanged, the fitted plane appears to tilt up slightly more.
The larger alpha becomes, the more pronounced the tilt. (Setting alpha too high does not necessarily give good results, though.)

    from sklearn import datasets
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older matplotlib)
    from sklearn.decomposition import PCA
    from sklearn import linear_model
    import numpy as np

    h = 100

    diabetes = datasets.load_diabetes()
    # keep the first two principal components (PCA)
    X_reduced = PCA(n_components=2).fit_transform(diabetes.data)
    reg = linear_model.Ridge(alpha=.3)
    reg.fit(X_reduced, diabetes.target)

    x_min, x_max = min(X_reduced[:, 0]), max(X_reduced[:, 0])
    y_min, y_max = min(X_reduced[:, 1]), max(X_reduced[:, 1])
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, h),
                         np.linspace(y_min, y_max, h))
    Z = reg.predict(np.c_[xx.ravel(), yy.ravel()])

    xx = xx.ravel()
    yy = yy.ravel()
    Z = Z.ravel()

    fig = plt.figure(1, figsize=(8, 6))
    # add_subplot replaces Axes3D(fig, ...), which recent matplotlib no longer attaches to the figure
    ax = fig.add_subplot(111, projection='3d', elev=30, azim=-30)
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)

    ax.scatter(X_reduced[:, 0], X_reduced[:, 1], diabetes.target, color='k')
    ax.plot(xx, yy, Z)
    plt.show()

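Note: Capturing in the fitted parameters noise that prevents the model to generalize to new data is called overfitting. The bias introduced by the ridge regression is called a regularization.

In other words, when the fitted coefficients pick up the noise in the training data and the model therefore fails to carry over to new data, that is overfitting.
The deliberate bias that ridge regression adds to counter this is called regularization.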

Sparsity:

To improve the conditioning of the problem (for example, to mitigate the curse of dimensionality), it is useful to keep only the informative features. Another penalization approach, Lasso (least absolute shrinkage and selection operator), sets some of the coefficients exactly to 0.

First find the best alpha, then fit with it, and finally report the score on the test set.

    from sklearn import datasets
    from sklearn import linear_model
    import numpy as np

    diabetes = datasets.load_diabetes()
    diabetes_X_train = diabetes.data[:-20]
    diabetes_X_test = diabetes.data[-20:]
    diabetes_y_train = diabetes.target[:-20]
    diabetes_y_test = diabetes.target[-20:]

    regr = linear_model.Lasso()

    alphas = np.logspace(-4, -1, 6)

    # choose the best alpha (note: scored on the test set)
    scores = [regr.set_params(alpha=alpha)
                  .fit(diabetes_X_train, diabetes_y_train)
                  .score(diabetes_X_test, diabetes_y_test)
              for alpha in alphas]
    best_alpha = alphas[scores.index(max(scores))]

    # set the best alpha and refit
    regr.alpha = best_alpha
    regr.fit(diabetes_X_train, diabetes_y_train)
    print(regr.coef_)
    print(regr.score(diabetes_X_test, diabetes_y_test))

The output is:

[   0.         -212.43764548  517.19478111  313.77959962 -160.8303982
   -0.         -187.19554705   69.38229038  508.66011217   71.84239008]
0.5887622418309261
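The zeros in the coefficient vector are the point of Lasso: the corresponding features are dropped from the model. The sketch below counts exact zeros for ordinary least squares and for Lasso on the full diabetes data. The value alpha=0.1 is an illustrative choice, not the tuned value from the run above:

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

ols = linear_model.LinearRegression().fit(X, y)
lasso = linear_model.Lasso(alpha=0.1).fit(X, y)   # alpha chosen for illustration

# Count coefficients that are exactly zero in each model.
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("exact zeros, OLS:  ", n_zero_ols)
print("exact zeros, Lasso:", n_zero_lasso)

# Features that Lasso keeps (non-zero coefficients).
kept = np.array(diabetes.feature_names)[lasso.coef_ != 0]
print("features kept by Lasso:", list(kept))
```

Ridge shrinks coefficients toward zero but rarely makes them exactly zero, while the l1 penalty of Lasso does, which is why Lasso doubles as a feature selector.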

Likewise, draw the 3D figure with Lasso (after optimizing alpha):

    from sklearn import datasets
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older matplotlib)
    from sklearn.decomposition import PCA
    from sklearn import linear_model
    import numpy as np

    h = 100

    diabetes = datasets.load_diabetes()
    # keep the first two principal components (PCA)
    X_reduced = PCA(n_components=2).fit_transform(diabetes.data)
    regr = linear_model.Lasso()

    alphas = np.logspace(-4, -1, 6)

    # choose the best alpha (scored here on the training data itself)
    scores = [regr.set_params(alpha=alpha)
                  .fit(X_reduced, diabetes.target)
                  .score(X_reduced, diabetes.target)
              for alpha in alphas]
    best_alpha = alphas[scores.index(max(scores))]

    # set the best alpha and refit; without the refit, the model from the
    # last alpha tried in the loop would be the one plotted
    regr.alpha = best_alpha
    regr.fit(X_reduced, diabetes.target)

    x_min, x_max = min(X_reduced[:, 0]), max(X_reduced[:, 0])
    y_min, y_max = min(X_reduced[:, 1]), max(X_reduced[:, 1])
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, h),
                         np.linspace(y_min, y_max, h))
    Z = regr.predict(np.c_[xx.ravel(), yy.ravel()])

    xx = xx.ravel()
    yy = yy.ravel()
    Z = Z.ravel()

    fig = plt.figure(1, figsize=(8, 6))
    # add_subplot replaces Axes3D(fig, ...), which recent matplotlib no longer attaches to the figure
    ax = fig.add_subplot(111, projection='3d', elev=30, azim=-30)
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)

    ax.scatter(X_reduced[:, 0], X_reduced[:, 1], diabetes.target, color='k')
    ax.plot(xx, yy, Z)
    plt.show()

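Different algorithms for the same problem:
Different algorithms can be used to solve the same mathematical problem. For example, the Lasso object solves the lasso regression problem with a coordinate descent method, which is efficient on large datasets. The LassoLars object uses the LARS algorithm instead, which suits problems where the estimated weight vector is very sparse (that is, problems with few observations).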
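To see the two solvers side by side, the sketch below fits Lasso and LassoLars with the same alpha on the diabetes data and prints both coefficient vectors. The value alpha=0.1 is an assumption for illustration. On recent scikit-learn releases the two results should be close, but older releases applied a different default normalization in LassoLars, so the numbers may differ there:

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target
alpha = 0.1  # illustrative value, shared by both estimators

cd = linear_model.Lasso(alpha=alpha).fit(X, y)        # coordinate descent
lars = linear_model.LassoLars(alpha=alpha).fit(X, y)  # LARS algorithm

print(np.round(cd.coef_, 1))
print(np.round(lars.coef_, 1))
print("largest coefficient difference:", np.max(np.abs(cd.coef_ - lars.coef_)))
```

Which solver is faster depends on the shape and sparsity of the problem, so it is worth timing both on the data at hand rather than assuming.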

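Classification

For classification, as in labeling the iris flowers, linear regression is not the best approach, because it gives too much weight to data far from the decision frontier.
(One alternative is to have the linear model fit a logistic, or sigmoid, function instead.)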
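The logistic function squashes the linear score into a value between 0 and 1, so points far from the frontier saturate instead of dominating the fit. As a sketch of the relationship, the code below fits a two-class logistic regression and checks that applying the sigmoid to decision_function reproduces predict_proba. Restricting iris to its first two classes and the helper name `sigmoid` are choices made for this example:

```python
import numpy as np
from sklearn import datasets, linear_model

iris = datasets.load_iris()
mask = iris.target < 2                 # keep two classes for a binary problem
X, y = iris.data[mask], iris.target[mask]

logistic = linear_model.LogisticRegression(C=1.0).fit(X, y)

def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

scores = logistic.decision_function(X)     # linear part: X @ w + b
proba = logistic.predict_proba(X)[:, 1]    # probability of the positive class
matches = np.allclose(sigmoid(scores), proba)
print(matches)
```

The decision frontier is where the score is 0, that is, where the predicted probability is 0.5.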

The C parameter of logistic regression is again a coefficient introduced to guard against overfitting.

    import numpy as np
    from sklearn import linear_model, datasets

    iris = datasets.load_iris()
    np.random.seed(0)
    iris_X = iris.data
    iris_y = iris.target
    indices = np.random.permutation(len(iris_X))
    iris_X_train = iris_X[indices[:-10]]
    iris_y_train = iris_y[indices[:-10]]
    iris_X_test = iris_X[indices[-10:]]
    iris_y_test = iris_y[indices[-10:]]

    logistic = linear_model.LogisticRegression(C=1)
    logistic.fit(iris_X_train, iris_y_train)
    print(logistic.predict(iris_X_test))
    print(iris_y_test)

Output:

[1 2 1 0 0 0 2 1 2 0]
[1 1 1 0 0 0 2 1 2 0]

Plotting the colored decision regions:

Copied from
http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html

Only the first two variables are used as features for the plot.
The rest is not much different from the earlier KNN plot.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import linear_model, datasets

    # import some data to play with
    iris = datasets.load_iris()
    X = iris.data[:, :2]  # we only take the first two features.
    Y = iris.target

    h = .02  # step size in the mesh

    logreg = linear_model.LogisticRegression(C=1e5)

    # we create an instance of the logistic regression classifier and fit the data.
    logreg.fit(X, Y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(1, figsize=(4, 3))
    plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')

    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())

    plt.show()

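Multiclass classification, in essence, still works by splitting off one class at a time.

The C parameter controls the amount of regularization in the LogisticRegression object: a large value for C results in less regularization. penalty="l2" gives Shrinkage (i.e. non-sparse coefficients), while penalty="l1" gives Sparsity.

In short, C controls the regularization: the larger C is, the weaker the regularization.

penalty="l2" shrinks the coefficients without zeroing them (non-sparse coefficients).
penalty="l1" yields sparse coefficients.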
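The effect of the two penalties can be compared by counting exactly-zero coefficients. The sketch below uses a two-class subset of iris, C=0.1, and solver="liblinear" (a solver that accepts both penalties); all three are assumptions made for this example, and recent scikit-learn releases may emit deprecation warnings for the penalty argument:

```python
import numpy as np
from sklearn import datasets, linear_model

iris = datasets.load_iris()
mask = iris.target > 0                 # two classes: versicolor vs virginica
X, y = iris.data[mask], iris.target[mask]

zero_counts = {}
for penalty in ("l2", "l1"):
    clf = linear_model.LogisticRegression(C=0.1, penalty=penalty,
                                          solver="liblinear")
    clf.fit(X, y)
    # Count coefficients driven exactly to zero by the penalty.
    zero_counts[penalty] = int(np.sum(clf.coef_ == 0))
    print(penalty, np.round(clf.coef_, 3), "zeros:", zero_counts[penalty])
```

A small C means strong regularization, which is where the difference between the two penalties tends to show most clearly.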

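Support vector machines (SVMs)

Linear SVMs

Support vector machines belong to the family of discriminant models. They try to find a combination of samples that builds a plane maximizing the margin between the two classes.

Regularization is set by the C parameter. A small value of C means the margin is computed from many of the observations around the separating line (more regularization). A large value of C means the margin is computed only from the observations closest to the separating line (less regularization).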
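One way to observe this is to count the support vectors for different values of C. The sketch below does so on two overlapping iris classes using only the first two features; the subset and the three C values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
mask = iris.target > 0                         # versicolor vs virginica, which overlap
X, y = iris.data[mask][:, :2], iris.target[mask]

counts = {}
for C in (0.01, 1.0, 100.0):
    svc = svm.SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class.
    counts[C] = int(svc.n_support_.sum())
    print("C =", C, "support vectors:", counts[C])
```

With a small C the margin is wide and most observations end up as support vectors; with a large C only the points near the separating line remain.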

    import numpy as np
    from sklearn import datasets
    from sklearn import svm

    iris = datasets.load_iris()
    np.random.seed(0)
    iris_X = iris.data
    iris_y = iris.target
    indices = np.random.permutation(len(iris_X))
    iris_X_train = iris_X[indices[:-10]]
    iris_y_train = iris_y[indices[:-10]]
    iris_X_test = iris_X[indices[-10:]]
    iris_y_test = iris_y[indices[-10:]]

    svc = svm.SVC(kernel='linear')
    svc.fit(iris_X_train, iris_y_train)
    print(svc.predict(iris_X_test))
    print(iris_y_test)

    print(svc.score(iris_X_test, iris_y_test))

Warning: for many estimators, SVMs included, the dataset needs to be standardized first in order to get good predictions.
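One common way to apply the standardization is to chain a StandardScaler in front of the classifier, so the scaling is learned on the training data only and reused on the test data. This is a sketch of that pattern on the same iris split, not the only way to do it:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
np.random.seed(0)
indices = np.random.permutation(len(iris.data))
X_train, y_train = iris.data[indices[:-10]], iris.target[indices[:-10]]
X_test, y_test = iris.data[indices[-10:]], iris.target[indices[-10:]]

# The scaler is fitted on the training data only, then applied to any
# data passed to predict() or score().
model = make_pipeline(StandardScaler(), svm.SVC(kernel="linear"))
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print(test_score)

# After scaling, each training feature has zero mean and unit standard deviation.
scaled = model.named_steps["standardscaler"].transform(X_train)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

On iris the features already share similar scales, so the gain may be small; the effect is usually larger when features are measured in very different units.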
