ML with sklearn: a detailed guide to the code and usage of make_pipeline, RobustScaler, KFold, and cross_val_score
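Contents
make_pipeline in sklearn: code walkthrough and usage
Code walkthrough of make_pipeline
Using make_pipeline
1. Using the Pipeline class to express a workflow that scales the data with MinMaxScaler and then trains an SVM
2. Creating a pipeline with make_pipeline
RobustScaler in sklearn: code walkthrough and usage
Code walkthrough of RobustScaler
Using RobustScaler
KFold in sklearn: code walkthrough and usage
Code walkthrough of KFold
Using KFold
cross_val_score in sklearn: code walkthrough and usage
Code walkthrough of cross_val_score
Accepted values for the scoring parameter
Using cross_val_score
1. Regression: the diabetes dataset
2. Classification: the iris dataset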
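make_pipeline in sklearn: code walkthrough and usage
To simplify building chains of transformations and models, scikit-learn provides the Pipeline class, which combines several processing steps into a single scikit-learn estimator. A Pipeline has its own fit, predict, and score methods and behaves like any other model in scikit-learn.
Code walkthrough of make_pipeline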
    def make_pipeline(*steps, **kwargs):

Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names are set to the lowercase of their types automatically.

Parameters
- *steps : list of estimators.
- memory : None, str or object with the joblib.Memory interface, optional. Used to cache the fitted transformers of the pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting, so the transformer instance passed to the pipeline cannot be inspected directly; use the attribute named_steps or steps to inspect the estimators inside the pipeline. Caching the transformers is advantageous when fitting is time consuming.

Returns
- p : Pipeline
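The note about caching and cloning can be illustrated with a small sketch. The temporary cache directory and the iris dataset are assumptions made for this illustration and do not come from the docstring; the point is only that the scaler instance passed in stays unfitted, while the fitted copy is reachable through named_steps.

```python
import shutil
import tempfile

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

cache_dir = tempfile.mkdtemp()   # illustrative cache location (assumption)
scaler = StandardScaler()        # the instance handed to the pipeline
try:
    pipe = make_pipeline(scaler, GaussianNB(), memory=cache_dir)
    pipe.fit(X, y)

    # With caching enabled the pipeline fits a clone, so `scaler` stays unfitted.
    original_is_fitted = hasattr(scaler, "mean_")
    # The fitted copy lives inside the pipeline under its auto-generated name.
    fitted_scaler = pipe.named_steps["standardscaler"]
    cached_is_fitted = hasattr(fitted_scaler, "mean_")
    print(original_is_fitted, cached_is_fitted)
finally:
    shutil.rmtree(cache_dir, ignore_errors=True)
```

Without the memory argument the pipeline fits the passed instance in place, so this distinction only matters once caching is switched on.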
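Using make_pipeline
The docstring example builds a pipeline from a StandardScaler and a GaussianNB classifier; make_pipeline returns a Pipeline object (p : Pipeline). The printed representation is the one given in the docstring quoted here; newer scikit-learn releases print a shorter form.

    >>> from sklearn.naive_bayes import GaussianNB
    >>> from sklearn.preprocessing import StandardScaler
    >>> from sklearn.pipeline import make_pipeline
    >>> make_pipeline(StandardScaler(), GaussianNB(priors=None))
    ... # doctest: +NORMALIZE_WHITESPACE
    Pipeline(memory=None,
             steps=[('standardscaler',
                     StandardScaler(copy=True, with_mean=True, with_std=True)),
                    ('gaussiannb', GaussianNB(priors=None))])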
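1. Using the Pipeline class to express a workflow that scales the data with MinMaxScaler and then trains an SVM

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC

    # X_train, y_train, X_test, y_test are assumed to come from an earlier train/test split
    pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
    pipe.fit(X_train, y_train)    # fit the scaler on the training data, then fit the SVM
    pipe.score(X_test, y_test)    # apply the same scaling to the test data and score the SVM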
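2. Creating a pipeline with make_pipeline
The syntax for building a pipeline with the Pipeline class is somewhat verbose, and we usually do not need a user-specified name for each step. In that case make_pipeline can create the pipeline for us and name each step automatically after its class.

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC

    pipe = make_pipeline(MinMaxScaler(), SVC())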
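The automatic naming can be checked directly by listing the step names of both kinds of pipeline. This is a minimal sketch; the variable names `named` and `auto` are illustrative, and the `svc__C` call only shows how the generated name is used as a parameter prefix.

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Explicit names chosen by the user.
named = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
# Names derived from the lowercased class names.
auto = make_pipeline(MinMaxScaler(), SVC())

named_step_names = [name for name, _ in named.steps]
auto_step_names = [name for name, _ in auto.steps]
print(named_step_names)  # ['scaler', 'svm']
print(auto_step_names)   # ['minmaxscaler', 'svc']

# The generated name is the prefix used to address a step's parameters.
auto.set_params(svc__C=10.0)
```

The generated names matter later when tuning hyperparameters, because parameter grids address steps as `<step name>__<parameter>`.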
References
《Python機器學習基礎教程》, section on building pipelines (make_pipeline)
Python sklearn.pipeline.make_pipeline() Examples
RobustScaler in sklearn: code walkthrough and usage
Code walkthrough of RobustScaler

    class RobustScaler(BaseEstimator, TransformerMixin):

Scale features using statistics that are robust to outliers. This scaler removes the median and scales the data according to the quantile range (defaults to IQR: interquartile range).

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean and variance in a negative way. In such cases, the median and the interquartile range often give better results.

Added in version 0.17. Read more in the User Guide (preprocessing_scaler).

Parameters
- with_scaling : boolean, True by default.
- quantile_range : tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0. Added in version 0.18.
- copy : boolean, optional, default is True.

Attributes
- scale_ : array of floats. Added in version 0.17.

See also
- sklearn.decomposition.PCA

Notes
- https://en.wikipedia.org/wiki/Median_(statistics)

The class body is abridged in the source; its outline is:

    def __init__(self, with_centering=True, with_scaling=True, ...
    def _check_array(self, X, copy):
        if sparse.issparse(X):
            ...
    def fit(self, X, y=None):
        ...
        if self.with_scaling:
            q = np.percentile(X, self.quantile_range, axis=0)
            ...
    def transform(self, X):
        # Can be called on sparse input, provided that ``RobustScaler`` has been ...
        if sparse.issparse(X):
            ...
    def inverse_transform(self, X):
        if sparse.issparse(X):
            ...
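The "remove the median, divide by the interquartile range" rule can be reproduced by hand and compared with the scaler. The toy column below, with one obvious outlier, is an assumption made for illustration, and the default settings of RobustScaler are assumed.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with a single large outlier (illustrative data).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual version: subtract the median, divide by the 25th-75th percentile range.
median = np.median(X, axis=0)
q25, q75 = np.percentile(X, [25.0, 75.0], axis=0)
manual = (X - median) / (q75 - q25)

scaler = RobustScaler()
scaled = scaler.fit_transform(X)

print(scaler.center_, scaler.scale_)  # median and IQR learned by the scaler
print(scaled.ravel())                 # [-1.  -0.5  0.   0.5 48.5]
```

The outlier is still far from the rest after scaling, but it no longer distorts the centre and spread used to scale the other four values.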
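Using RobustScaler
The example below places RobustScaler in front of two regularized linear models:

    from sklearn.linear_model import ElasticNet, Lasso
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import RobustScaler

    lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.5, random_state=1))
    ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.5, l1_ratio=.9, random_state=3))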
KFold in sklearn: code walkthrough and usage
Code walkthrough of KFold

    class KFold(_BaseKFold):    # found at sklearn.model_selection._split

The docstring's usage example appears in the usage section below; its remaining sections read as follows.

Notes
The first n_samples % n_splits folds have size n_samples // n_splits + 1, other folds have size n_samples // n_splits, where n_samples is the number of samples.

See also
- StratifiedKFold: takes class information into account to avoid building folds with imbalanced class distributions (for binary or multiclass classification tasks).
- GroupKFold: K-fold iterator variant with non-overlapping groups.
- RepeatedKFold: repeats K-Fold n times.

The implementation quoted here comes from an older release (the default is n_splits=3 and the code still uses np.int):

    def __init__(self, n_splits=3, shuffle=False, random_state=None):
        super(KFold, self).__init__(n_splits, shuffle, random_state)

    def _iter_test_indices(self, X, y=None, groups=None):
        n_samples = _num_samples(X)
        indices = np.arange(n_samples)
        if self.shuffle:
            check_random_state(self.random_state).shuffle(indices)
        n_splits = self.n_splits
        fold_sizes = (n_samples // n_splits) * np.ones(n_splits, dtype=np.int)
        fold_sizes[:n_samples % n_splits] += 1
        current = 0
        for fold_size in fold_sizes:
            start, stop = current, current + fold_size
            yield indices[start:stop]
            current = stop
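The fold-size rule from the Notes can be restated in a few lines and checked against what KFold actually produces. The helper `expected_fold_sizes` is hypothetical and written only for this sketch; it is not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold


def expected_fold_sizes(n_samples, n_splits):
    """Hypothetical helper restating the rule from the Notes section."""
    sizes = np.full(n_splits, n_samples // n_splits, dtype=int)
    sizes[: n_samples % n_splits] += 1  # the first (n_samples % n_splits) folds get one extra sample
    return sizes.tolist()


X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (illustrative data)
kf = KFold(n_splits=3)
actual_sizes = [len(test_index) for _, test_index in kf.split(X)]

print(expected_fold_sizes(10, 3))  # [4, 3, 3]
print(actual_sizes)                # [4, 3, 3]
```

With 10 samples and 3 splits the remainder is 1, so only the first fold receives the extra sample.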
Using KFold

    >>> import numpy as np
    >>> from sklearn.model_selection import KFold
    >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
    >>> y = np.array([1, 2, 3, 4])
    >>> kf = KFold(n_splits=2)
    >>> kf.get_n_splits(X)
    2
    >>> print(kf)  # doctest: +NORMALIZE_WHITESPACE
    KFold(n_splits=2, random_state=None, shuffle=False)
    >>> for train_index, test_index in kf.split(X):
    ...     print("TRAIN:", train_index, "TEST:", test_index)
    ...     X_train, X_test = X[train_index], X[test_index]
    ...     y_train, y_test = y[train_index], y[test_index]
    TRAIN: [2 3] TEST: [0 1]
    TRAIN: [0 1] TEST: [2 3]
cross_val_score in sklearn: code walkthrough and usage
Code walkthrough of cross_val_score

    # found at sklearn.model_selection._validation
    def cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None,
                        n_jobs=1, verbose=0, fit_params=None,
                        pre_dispatch='2*n_jobs'):

Evaluate a score by cross-validation. Read more in the User Guide. The signature and defaults below are those of the older release quoted in this article.

Parameters
- estimator : estimator object implementing 'fit'. The object to use to fit the data.
- X : array-like. The data to fit. Can be, for example, a list or an array.
- y : array-like, optional, default: None. The target variable to try to predict in the case of supervised learning.
- groups : array-like, with shape (n_samples,), optional. Group labels for the samples used while splitting the dataset into train/test set.
- scoring : string, callable or None, optional, default: None. A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y).
- cv : int, cross-validation generator or an iterable, optional. Determines the cross-validation splitting strategy. Possible inputs for cv are:
    - None, to use the default 3-fold cross validation (newer releases default to 5-fold),
    - an integer, to specify the number of folds in a (Stratified)KFold,
    - an object to be used as a cross-validation generator,
    - an iterable yielding train, test splits.
  For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. Refer to the User Guide (cross_validation) for the various cross-validation strategies that can be used here.
- n_jobs : integer, optional. The number of CPUs to use to do the computation. -1 means 'all CPUs'.
- verbose : integer, optional. The verbosity level.
- fit_params : dict, optional. Parameters to pass to the fit method of the estimator.
- pre_dispatch : int, or string, optional. Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
    - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs,
    - an int, giving the exact number of total jobs that are spawned,
    - a string, giving an expression as a function of n_jobs, as in '2*n_jobs'.

Returns
- scores : array of float, shape=(len(list(cv)),). Array of scores of the estimator for each run of the cross validation.

See also
- sklearn.model_selection.cross_validate: to run cross-validation on multiple metrics and also to return train scores, fit times and score times.
- sklearn.metrics.make_scorer: make a scorer from a performance metric or loss function.

The docstring's example is shown in the usage section below. The function body delegates to cross_validate with a single scorer and returns only the test scores:

    # To ensure multimetric format is not supported
    scorer = check_scoring(estimator, scoring=scoring)
    cv_results = cross_validate(estimator=estimator, X=X, y=y, groups=groups,
                                scoring={'score': scorer}, cv=cv,
                                return_train_score=False,
                                n_jobs=n_jobs, verbose=verbose,
                                fit_params=fit_params,
                                pre_dispatch=pre_dispatch)
    return cv_results['test_score']
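The cv parameter accepts either an integer or a splitter object, which is where the pieces covered so far fit together. The sketch below is one possible combination, not a recommended recipe: the full diabetes dataset, 5 folds, and random_state=42 are assumptions chosen for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

X, y = load_diabetes(return_X_y=True)

# Scaling lives inside the pipeline, so it is refit on the training part of every fold.
model = make_pipeline(RobustScaler(), Lasso(alpha=0.5, random_state=1))

# Passing a KFold object gives control over shuffling and reproducibility.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kfold = cross_val_score(model, X, y, cv=cv)

# Passing an integer uses unshuffled KFold for a regressor such as Lasso.
scores_int = cross_val_score(model, X, y, cv=5)

print(scores_kfold.shape, scores_kfold.mean())
print(scores_int.shape, scores_int.mean())
```

Keeping the scaler inside the pipeline means no statistics from a fold's test part leak into the scaling used for training.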
Accepted values for the scoring parameter
https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
| Scoring | Function | Comment |
| Classification | | |
| 'accuracy' | metrics.accuracy_score | |
| 'balanced_accuracy' | metrics.balanced_accuracy_score | |
| 'average_precision' | metrics.average_precision_score | |
| 'neg_brier_score' | metrics.brier_score_loss | |
| 'f1' | metrics.f1_score | for binary targets |
| 'f1_micro' | metrics.f1_score | micro-averaged |
| 'f1_macro' | metrics.f1_score | macro-averaged |
| 'f1_weighted' | metrics.f1_score | weighted average |
| 'f1_samples' | metrics.f1_score | by multilabel sample |
| 'neg_log_loss' | metrics.log_loss | requires predict_proba support |
| 'precision' etc. | metrics.precision_score | suffixes apply as with 'f1' |
| 'recall' etc. | metrics.recall_score | suffixes apply as with 'f1' |
| 'jaccard' etc. | metrics.jaccard_score | suffixes apply as with 'f1' |
| 'roc_auc' | metrics.roc_auc_score | |
| 'roc_auc_ovr' | metrics.roc_auc_score | |
| 'roc_auc_ovo' | metrics.roc_auc_score | |
| 'roc_auc_ovr_weighted' | metrics.roc_auc_score | |
| 'roc_auc_ovo_weighted' | metrics.roc_auc_score | |
| Clustering | | |
| 'adjusted_mutual_info_score' | metrics.adjusted_mutual_info_score | |
| 'adjusted_rand_score' | metrics.adjusted_rand_score | |
| 'completeness_score' | metrics.completeness_score | |
| 'fowlkes_mallows_score' | metrics.fowlkes_mallows_score | |
| 'homogeneity_score' | metrics.homogeneity_score | |
| 'mutual_info_score' | metrics.mutual_info_score | |
| 'normalized_mutual_info_score' | metrics.normalized_mutual_info_score | |
| 'v_measure_score' | metrics.v_measure_score | |
| Regression | | |
| 'explained_variance' | metrics.explained_variance_score | |
| 'max_error' | metrics.max_error | |
| 'neg_mean_absolute_error' | metrics.mean_absolute_error | |
| 'neg_mean_squared_error' | metrics.mean_squared_error | |
| 'neg_root_mean_squared_error' | metrics.mean_squared_error | |
| 'neg_mean_squared_log_error' | metrics.mean_squared_log_error | |
| 'neg_median_absolute_error' | metrics.median_absolute_error | |
| 'r2' | metrics.r2_score | |
| 'neg_mean_poisson_deviance' | metrics.mean_poisson_deviance | |
| 'neg_mean_gamma_deviance' | metrics.mean_gamma_deviance | |
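The "neg_" prefix on the error metrics in the table reflects the convention that a higher score is always better, so errors are returned with their sign flipped. A minimal sketch, assuming the diabetes dataset, a default Lasso model, and 5 folds (none of which are prescribed by the table):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Each entry is the negated mean squared error of one fold, hence never positive.
neg_mse = cross_val_score(Lasso(), X, y, cv=5, scoring="neg_mean_squared_error")

# Flip the sign back before taking the square root to obtain a per-fold RMSE.
rmse = np.sqrt(-neg_mse)

print(neg_mse)
print(rmse.mean())
```

Forgetting the sign flip is a common source of NaN values when computing the square root.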
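Using cross_val_score
1. Regression: the diabetes dataset
Lasso is a regressor, so each value printed below is the R² score of one fold. Three scores are returned because the release quoted here defaults to 3-fold cross-validation.

    >>> from sklearn import datasets, linear_model
    >>> from sklearn.model_selection import cross_val_score
    >>> diabetes = datasets.load_diabetes()
    >>> X = diabetes.data[:150]
    >>> y = diabetes.target[:150]
    >>> lasso = linear_model.Lasso()
    >>> print(cross_val_score(lasso, X, y))  # doctest: +ELLIPSIS
    [ 0.33150734  0.08022311  0.03531764]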
2. Classification: the iris dataset

    from sklearn import datasets                          # built-in datasets
    from sklearn.model_selection import train_test_split, cross_val_score  # data splitting, cross-validation
    from sklearn.neighbors import KNeighborsClassifier    # a simple classifier whose main hyperparameter is K
    import matplotlib.pyplot as plt

    iris = datasets.load_iris()   # load the iris dataset bundled with sklearn
    X = iris.data                 # features
    y = iris.target               # label of each sample

    # hold out 1/3 of the data for testing; cross-validation runs on the training part only
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=1/3, random_state=3)

    k_range = range(1, 31)
    cv_scores = []                # mean cross-validated accuracy for each K
    for n in k_range:
        # a loop is enough for a single hyperparameter; with several, use GridSearchCV instead
        knn = KNeighborsClassifier(n)
        # cv: number of folds; scoring='accuracy' is the default for classifiers and may be omitted
        scores = cross_val_score(knn, train_X, train_y, cv=10, scoring='accuracy')
        cv_scores.append(scores.mean())

    plt.plot(k_range, cv_scores)
    plt.xlabel('K')
    plt.ylabel('Accuracy')        # choose the best K from the plot
    plt.show()

    best_knn = KNeighborsClassifier(n_neighbors=3)   # use the K chosen from the plot (K=3 here)
    best_knn.fit(train_X, train_y)                   # train the model
    print(best_knn.score(test_X, test_y))            # accuracy on the held-out test set
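The code comment above mentions GridSearchCV for the case of several hyperparameters. A minimal sketch of that alternative follows; the second hyperparameter (weights) and the grid values are assumptions added for illustration, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=1/3, random_state=3)

# Every combination of these values is evaluated with 10-fold cross-validation.
param_grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],  # illustrative second hyperparameter
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
search.fit(train_X, train_y)

print(search.best_params_, search.best_score_)

# By default the best combination is refit on the whole training set,
# so the search object can score the held-out test set directly.
test_accuracy = search.score(test_X, test_y)
print(test_accuracy)
```

This replaces both the manual loop and the visual choice of K, at the cost of fitting every combination in the grid.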