Machine Learning (Supervised Learning) Project Workflow Template
Workflow template
Loading data with the Python standard library
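from csv import reader
from io import StringIO
from urllib.request import urlopen
import numpy as np

filename = 'http://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data'
# open() cannot read a URL, so download the file with urlopen first
with urlopen(filename) as raw_data:
    text = raw_data.read().decode('utf-8')
readers = reader(StringIO(text), delimiter=',')
x = list(readers)
# flag.data mixes text and numeric columns, so the values stay as strings here;
# append .astype('float') only when every column is numeric
data = np.array(x)
print(data.shape)

Loading data with NumPy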
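from io import StringIO
from urllib.request import urlopen
from numpy import loadtxt

filename = 'http://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data'
# open() cannot read a URL, so download the file with urlopen first
with urlopen(filename) as raw_data:
    text = raw_data.read().decode('utf-8')
# dtype=str because flag.data contains text columns; drop it for all-numeric files
data = loadtxt(StringIO(text), delimiter=',', dtype=str)
print(data.shape)

Loading data with Pandas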
from pandas import read_csv

filename = 'http://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data'
names = ['name', 'landmass', 'zone', 'area', 'population', 'language', 'religion', 'bars', 'stripes',
         'colours', 'red', 'green', 'blue', 'gold', 'white', 'black', 'orange', 'mainhue', 'circles',
         'crosses', 'saltires', 'quarters', 'sunstars', 'crescent', 'triangle', 'icon', 'animate',
         'text', 'topleft', 'botright']
# read_csv accepts a URL directly; a comma is already the default separator
data = read_csv(filename, names=names)
print(data.shape)

Descriptive statistics: analyzing the data
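```python
from pandas import set_option

# Take a quick look at the data
print(data.head(10))
# Dimensions of the data
print(data.shape)
# Attributes and their types
print(data.dtypes)
# Descriptive statistics
set_option('display.width', 100)     # display width
set_option('display.precision', 4)   # number of decimal places shown
print(data.describe())
# Class distribution. 'class' is a placeholder for the target column name;
# the flags column list above has no such column, so substitute your own target
print(data.groupby('class').size())
# Pairwise correlations (numeric columns only)
set_option('display.width', 100)
set_option('display.precision', 2)
print(data.corr(method='pearson', numeric_only=True))
# Skew of each numeric attribute relative to a Gaussian distribution
print(data.skew(numeric_only=True))
```

Data visualization: observing the data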
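import matplotlib.pyplot as plt
import numpy as np

# Histograms
data.hist()
# Density plots (layout=(-1, 3) lets pandas work out the number of rows)
data.plot(kind='density', subplots=True, layout=(-1, 3), sharex=False)
# Box plots
data.plot(kind='box', subplots=True, layout=(-1, 3), sharex=False)
# Correlation matrix plot
correlations = data.corr(numeric_only=True)
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
# One tick per numeric column, labelled with that column's name
ticks = np.arange(len(correlations.columns))
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(correlations.columns)
ax.set_yticklabels(correlations.columns)
# Scatter-plot matrix
from pandas.plotting import scatter_matrix
scatter_matrix(data)
plt.show()

Data cleaning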
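Clean the data by removing duplicate records, flagging erroneous values, and even flagging erroneous input data.

Feature selection
Remove redundant features and add new ones.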
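The original gives no code for this step. As a rough sketch of what it could look like with scikit-learn, the block below keeps the three highest-scoring features with SelectKBest and then appends one derived feature. The synthetic dataset from make_classification, the choice of f_classif as the scoring function, k=3, and the product feature are all assumptions made for illustration, not part of the original template.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data (an assumption; use your own x and y in practice)
x, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=7)

# Remove redundant features: keep the 3 features with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=3)
x_selected = selector.fit_transform(x, y)
print(x_selected.shape)
print(selector.get_support(indices=True))  # indices of the columns that were kept

# Add a new feature: a hypothetical interaction term between the first two kept columns
interaction = x_selected[:, 0] * x_selected[:, 1]
x_extended = np.column_stack([x_selected, interaction])
print(x_extended.shape)
```

Whether a univariate filter like this is appropriate depends on the data; recursive feature elimination or model-based importances are common alternatives.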
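Data transformation
Rescale the data or adjust its distribution so that the structure of the problem is better exposed to the algorithms.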
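The original does not show code for this step either. A minimal sketch, assuming scikit-learn's preprocessing module and a tiny made-up array: MinMaxScaler rescales each column to a fixed range, while StandardScaler shifts each column to zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data with two columns on very different scales (for illustration only)
x = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Rescale every column to the range [0, 1]
x_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(x)
print(x_minmax)

# Standardize every column to mean 0 and standard deviation 1
x_standard = StandardScaler().fit_transform(x)
print(x_standard)
```

In a real project the scaler should be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.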
Splitting the dataset
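The snippets from here on use a feature matrix x and a target vector y without showing how they are built. One possible way to obtain them is sketched below, assuming a DataFrame whose last column holds the class label. The synthetic data from make_classification and the column names f0 to f7 and class are placeholders, not taken from the original article.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Synthetic stand-in for a loaded dataset (an assumption for illustration)
features, target = make_classification(n_samples=300, n_features=8, random_state=7)
data = pd.DataFrame(features, columns=['f%d' % i for i in range(8)])
data['class'] = target

# Separate the input features (first 8 columns) from the target (last column)
array = data.values
x = array[:, 0:8]
y = array[:, 8].astype(int)
print(x.shape, y.shape)
```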
from sklearn.linear_model import LogisticRegression

# Split into a training set and an evaluation set
from sklearn.model_selection import train_test_split
test_size = 0.33
seed = 4
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(x_train, y_train)
result = model.score(x_test, y_test)
print('Evaluation result: %.3f%%' % (result * 100))

# K-fold cross-validation: split the data into K groups; each group serves once as the
# validation set while the remaining K-1 groups form the training set. This yields K models,
# and the mean of their validation accuracies is used as the classifier's performance measure.
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
num_folds = 10
seed = 7
# random_state only takes effect (and is only accepted) together with shuffle=True
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
model = LogisticRegression()
result = cross_val_score(model, x, y, cv=kfold)
print('Evaluation result: %.3f%% (%.3f%%)' % (result.mean() * 100, result.std() * 100))

# Leave-one-out cross-validation: each sample serves once as the validation set and the
# other N-1 samples form the training set; the result is the mean accuracy of the N models.
# Advantages over K-fold cross-validation:
# 1. In every round almost all samples are used to train the model
# 2. No random factor affects the experiment, so the procedure is reproducible
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
loocv = LeaveOneOut()
model = LogisticRegression()
result = cross_val_score(model, x, y, cv=loocv)
print('Evaluation result: %.3f%% (%.3f%%)' % (result.mean() * 100, result.std() * 100))

# Repeated random splits into evaluation and training sets
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
n_splits = 10
test_size = 0.33
seed = 7
kfold = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
model = LogisticRegression()
result = cross_val_score(model, x, y, cv=kfold)
print('Evaluation result: %.3f%% (%.3f%%)' % (result.mean() * 100, result.std() * 100))

Classification metrics
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Classification accuracy
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
model = LogisticRegression()
result = cross_val_score(model, x, y, cv=kfold)
print('Accuracy: %.3f (%.3f)' % (result.mean(), result.std()))

# Logarithmic loss
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
model = LogisticRegression()
scoring = 'neg_log_loss'
result = cross_val_score(model, x, y, cv=kfold, scoring=scoring)
print('Logloss %.3f (%.3f)' % (result.mean(), result.std()))

# Area under the ROC curve (AUC), for binary classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
model = LogisticRegression()
scoring = 'roc_auc'
result = cross_val_score(model, x, y, cv=kfold, scoring=scoring)
print('AUC %.3f (%.3f)' % (result.mean(), result.std()))

# Confusion matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
test_size = 0.33
seed = 4
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(x_train, y_train)
predicted = model.predict(x_test)
matrix = confusion_matrix(y_test, predicted)
classes = ['0', '1']  # assumes a binary problem with labels 0 and 1
dataframe = pd.DataFrame(data=matrix, index=classes, columns=classes)
print(dataframe)

# Classification report
# Precision: the proportion of retrieved items that should have been retrieved
# Recall: the proportion of items that should have been retrieved that actually were
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
test_size = 0.33
seed = 4
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(x_train, y_train)
predicted = model.predict(x_test)
report = classification_report(y_test, predicted)
print(report)

Regression metrics
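As a sanity check on the three regression metrics used in this section, the sketch below computes mean absolute error, mean squared error, and the coefficient of determination by hand with NumPy and compares them with scikit-learn's metric functions. The four sample values are made up for illustration and do not come from the original article.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual values and predictions (for illustration only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred
# Mean absolute error: average size of the prediction errors
mae = np.mean(np.abs(errors))
# Mean squared error: average of the squared prediction errors
mse = np.mean(errors ** 2)
# R^2: 1 minus (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print('MAE %.3f, MSE %.3f, R2 %.3f' % (mae, mse, r2))

# The hand-computed values agree with scikit-learn's implementations
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

Note that cross_val_score reports the error metrics under the names neg_mean_absolute_error and neg_mean_squared_error, with the sign flipped, so that a higher score always means a better model.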
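from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
n_splits = 10
seed = 7
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
model = LinearRegression()

# Pick one of the scoring options below (as written, the last assignment wins)
# Mean absolute error: the mean of the absolute differences between predictions and actual values
scoring = 'neg_mean_absolute_error'
# Mean squared error: the mean of the squared differences (its square root is the RMSE)
scoring = 'neg_mean_squared_error'
# Coefficient of determination: the proportion of the variation in the dependent variable
# that is explained by the independent variables through the regression
scoring = 'r2'
result = cross_val_score(model, x, y, cv=kfold, scoring=scoring)
print('%.3f (%.3f)' % (result.mean(), result.std()))

Spot-checking classification algorithms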
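from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)

# Each block below reassigns model; keep only the algorithm you want to evaluate

# Linear algorithms
# Logistic regression: fits a logistic function to predict the probability of an event;
# the output lies between 0 and 1, which suits binary classification well
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

# Linear discriminant analysis: projects high-dimensional samples onto the most discriminative
# vector space, extracting class information and reducing the feature dimension; after
# projection the classes are maximally separable. Like principal component analysis,
# it is widely used for dimensionality reduction
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()

# Nonlinear algorithms
# K-nearest neighbors: if most of the k most similar samples in feature space belong to
# one class, the sample is assigned to that class as well
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()

# Naive Bayes classifier: starts from the prior probability of an object, uses Bayes' formula
# to compute its posterior probability for every class, and picks the class with the largest one
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

# Classification and regression trees: recursively split each feature in two, dividing the
# input space into a finite number of cells and assigning a predicted distribution to each cell
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

# Support vector machine: analyzes data and recognizes patterns; usable for classification and regression
from sklearn.svm import SVC
model = SVC()

result = cross_val_score(model, x, y, cv=kfold)
print(result.mean())

Spot-checking regression algorithms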
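from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)

# Each block below reassigns model; keep only the algorithm you want to evaluate

# Linear algorithms
# Linear regression: a statistical method that uses regression analysis to determine the
# quantitative dependence between two or more variables
from sklearn.linear_model import LinearRegression
model = LinearRegression()

# Ridge regression: a biased estimation method designed for collinear data
# (a refinement of ordinary least squares)
from sklearn.linear_model import Ridge
model = Ridge()

# Lasso regression: similar to ridge regression, but the penalty uses absolute values instead of squares
from sklearn.linear_model import Lasso
model = Lasso()

# Elastic net regression: a hybrid of lasso and ridge regression; useful when several
# features are correlated with one another
from sklearn.linear_model import ElasticNet
model = ElasticNet()

# Nonlinear algorithms
# K-nearest neighbors: predicts the result from the nearest samples by distance
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor()

# Classification and regression trees
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()

# Support vector machine
from sklearn.svm import SVR
model = SVR()

scoring = 'neg_mean_squared_error'
result = cross_val_score(model, x, y, cv=kfold, scoring=scoring)
print('%.3f' % result.mean())

Comparing algorithms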
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from matplotlib import pyplot

num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
models = {}
models['LR'] = LogisticRegression()
models['LDA'] = LinearDiscriminantAnalysis()
models['KNN'] = KNeighborsClassifier()
models['CART'] = DecisionTreeClassifier()
models['SVM'] = SVC()
models['NB'] = GaussianNB()
results = []
for name in models:
    # Same folds for every model, so the scores are directly comparable
    result = cross_val_score(models[name], x, y, cv=kfold)
    results.append(result)
    msg = '%s: %.3f (%.3f)' % (name, result.mean(), result.std())
    print(msg)

# Show the comparison as a box plot
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(list(models.keys()))
pyplot.show()

Tuning parameters with grid search
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Instantiate the algorithm
model = Ridge()
# Parameter values to try
param_grid = {'alpha': [1, 0.1, 0.01, 0.001, 0]}
# Search the grid for the best parameter
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid.fit(x, y)
# Search results
print('Best score: %.3f' % grid.best_score_)
print('Best parameter: %s' % grid.best_estimator_.alpha)

Tuning parameters with random search
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

model = Ridge()
# Distribution to sample the parameter from (uniform on [0, 1])
param_grid = {'alpha': uniform()}
# Search 100 randomly sampled parameter values for the best one
grid = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100, random_state=7)
grid.fit(x, y)
# Search results
print('Best score: %.3f' % grid.best_score_)
print('Best parameter: %s' % grid.best_estimator_.alpha)

Ensemble algorithms
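from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
num_tree = 100

# Each block below reassigns model; keep only the ensemble you want to evaluate

# Bagging: trains models on bootstrap samples and combines their predictions by voting
# Bagged decision trees
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
cart = DecisionTreeClassifier()
# The parameter is called estimator in scikit-learn 1.2+ (formerly base_estimator)
model = BaggingClassifier(estimator=cart, n_estimators=num_tree, random_state=seed)

# Random forest: builds a forest of many decision trees in a randomized way,
# with no dependency between the individual trees
from sklearn.ensemble import RandomForestClassifier
max_features = 3
model = RandomForestClassifier(n_estimators=num_tree, random_state=seed, max_features=max_features)

# Extra trees: similar to random forest, with these differences:
# 1. Random forest uses bagging, while every tree in extra trees is trained on the full training set
# 2. Random forest picks the best split within a random subset of features, while extra trees
#    chooses the split completely at random
from sklearn.ensemble import ExtraTreesClassifier
max_features = 7
model = ExtraTreesClassifier(n_estimators=num_tree, random_state=seed, max_features=max_features)

# Boosting: a way to raise the accuracy of weak classifiers, and more generally of any given learner
# AdaBoost: an iterative algorithm that trains different (weak) classifiers on the same
# training set and combines them into a stronger final classifier
from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier(n_estimators=num_tree, random_state=seed)

# Gradient boosting: adds trees one at a time, each fitted to the gradient of the loss
# of the ensemble built so far
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=num_tree, random_state=seed)

result = cross_val_score(model, x, y, cv=kfold)

# Voting: create two or more models, wrap them in a voting classifier,
# and combine the predictions of the sub-models
cart = DecisionTreeClassifier()
models = []
model_logistic = LogisticRegression()
models.append(('logistic', model_logistic))
model_cart = DecisionTreeClassifier()
models.append(('cart', model_cart))
model_svc = SVC()
models.append(('svm', model_svc))
ensemble_model = VotingClassifier(estimators=models)
result = cross_val_score(ensemble_model, x, y, cv=kfold)
print(result.mean())

Implementation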
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Serialize and deserialize the model with pickle
from pickle import dump
from pickle import load
# Alternative: serialize with joblib, now a standalone package
# (sklearn.externals.joblib has been removed from scikit-learn)
# from joblib import dump
# from joblib import load

test_size = 0.33
seed = 4
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size, random_state=seed)

model = LogisticRegression()
model.fit(x_train, y_train)

model_file = 'finalized_model.sav'
# Save the trained model to disk
with open(model_file, 'wb') as model_f:
    dump(model, model_f)

# Load the model back and evaluate it on the held-out data
with open(model_file, 'rb') as model_f:
    loaded_model = load(model_f)
result = loaded_model.score(x_test, y_test)
print('Evaluation result: %.3f%%' % (result * 100))

The whole workflow is not linear but iterative: expect to spend a lot of time repeating the steps until you find a model that is accurate enough.
Unsupervised machine learning algorithms have no target variable, so they have no training step against labelled targets.
Note: this article is a summary compiled from the book 《機器學習 Python實踐》.
If you repost this article, please credit the source: https://www.cnblogs.com/zhuchenglin/p/10292007.html