Financial Loan Default Modeling 4: Model Tuning
Table of Contents
- I. Task
- II. Overview
  - 1. Parameter description
  - 2. Common methods
- III. Implementation
  - 1. Module imports
  - 2. Model evaluation function
  - 3. Data loading
  - 4. Logistic Regression
    - (1) Tuning
    - (2) Model evaluation
  - 5. SVM
    - (1) Tuning
    - (2) Model evaluation
  - 6. Decision Tree
    - (1) Tuning
    - (2) Model evaluation
  - 7. Random Forest
  - 8. GBDT
  - 9. XGBoost
  - 10. LightGBM
- IV. Problems Encountered
  - 1. UnboundLocalError: local variable 'xxx' referenced before assignment
  - 2. ImportError: [joblib] Attempting to do parallel computing without protecting
  - 3. Low recall
I. Task
Use grid search to tune the 7 models (with 5-fold cross-validation during tuning), evaluate each model, and show the output of running the code.
II. Overview
Almost every machine learning model involves hyperparameter tuning, and different parameter combinations produce different results:
- If the dataset is not very large (a single run does not take long), use GridSearchCV to pick the best combination from the candidate parameters automatically.
- If the dataset is large and each run is expensive in compute and time, an exhaustive GridSearchCV may cost too much; in that case you need a deeper understanding of the model, or more hands-on experience, and have to tune by hand (see the cost sketch after this list).
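A quick way to see why grid search gets expensive: it fits one model per (parameter combination, CV fold) pair, plus one final refit when refit=True. A back-of-the-envelope sketch, using the Logistic Regression grid that appears later in this post:

```python
# Grid search cost = (number of parameter combinations) x (number of CV folds),
# plus one refit on the full training set when refit=True.
n_candidates = 7 * 2        # 7 values of C x 2 penalties (the LR grid below)
n_folds = 5                 # cv=5
total_fits = n_candidates * n_folds + 1
print(total_fits)           # 71 model fits for one small grid
```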
 
1. Parameter description
```python
class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None,
    fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0,
    pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')
```

(1) estimator
The classifier to use, e.g. estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10); pass in every parameter except the ones you want to search over. Every estimator needs either a scoring parameter or a score method.
(2) param_grid
A dict or a list of dicts giving the candidate values for the parameters to optimize, e.g. param_grid=param_test1 with param_test1 = {'n_estimators': range(10, 71, 10)}.
(3) scoring
The evaluation metric. Defaults to None, in which case the estimator's score method is used; otherwise pass e.g. scoring='roc_auc' (the appropriate metric depends on the model). Accepts a string (a metric name) or a callable with the signature scorer(estimator, X, y). The available scoring options are listed here:
http://scikit-learn.org/stable/modules/model_evaluation.html
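As an illustration of passing a callable instead of a string, a sketch using sklearn's make_scorer helper (the F2 choice is illustrative, not from the original post):

```python
from sklearn.metrics import fbeta_score, make_scorer

# make_scorer turns a plain metric into a scorer(estimator, X, y) callable.
# F-beta with beta=2 weights recall above precision, which can suit
# default prediction where missed positives are costly.
f2_scorer = make_scorer(fbeta_score, beta=2)
# grid = GridSearchCV(estimator=clf, param_grid=param, scoring=f2_scorer, cv=5)
```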
(4) cv
The cross-validation setting. Defaults to None, which uses 3-fold cross-validation (the default fold count is 3 in this scikit-learn version); you can also pass a specific number of folds, or a generator that yields train/test splits.
(5) refit
Defaults to True: after the search, the estimator is refit on all of the available training and development data using the best parameters found by cross-validation, and that refit model is the one used for final performance evaluation. In other words, once the search finishes, the best parameters are fit once more on the whole dataset.
(6) iid
Defaults to True. When True, the samples are assumed to be identically distributed across folds, and the loss is estimated as the total over all samples rather than the average over folds.
(7) verbose
Logging verbosity (int): 0 prints nothing during training, 1 prints occasional progress, >1 prints output for every sub-model.
(8) n_jobs
Number of parallel jobs (int): -1 means one job per CPU core; 1 is the default.
(9) pre_dispatch
Caps the total number of parallel jobs dispatched. When n_jobs is greater than 1, the data is copied for each dispatched job, which can cause out-of-memory errors; setting pre_dispatch limits how many jobs are queued up front, so the data is copied at most pre_dispatch times.
2. Common methods
- grid.fit(): run the grid search;
- grid_scores_: the evaluation results for each parameter combination (replaced by cv_results_ in newer scikit-learn versions);
- best_params_: the parameter combination that achieved the best result;
- best_score_: the best score observed during the optimization.
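Putting the parameters and result attributes together, a minimal self-contained sketch (toy data and illustrative values, not the loan dataset used below):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)  # toy data
clf = RandomForestClassifier(random_state=10)
param_test1 = {'n_estimators': range(10, 71, 10)}
grid = GridSearchCV(estimator=clf, param_grid=param_test1,
                    scoring='roc_auc', cv=5, n_jobs=-1, refit=True)
grid.fit(X, y)                      # run the grid search
print(grid.best_params_)            # best parameter combination
print(grid.best_score_)             # best mean CV score
best_model = grid.best_estimator_   # the refit model (requires refit=True)
```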
III. Implementation
1. Module imports
```python
import warnings

import numpy as np
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

warnings.filterwarnings(action='ignore', category=DeprecationWarning)
```

2. Model evaluation function
```python
## Model evaluation
def model_metrics(clf, y_target, y_predict):
    # clf is accepted for interface consistency but not used here
    accuracy = accuracy_score(y_target, y_predict)
    print('The accuracy is ', accuracy)
    precision = precision_score(y_target, y_predict)
    print('The precision is ', precision)
    recall = recall_score(y_target, y_predict)
    print('The recall is ', recall)
```
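The imports above also pull in f1_score and roc_auc_score, which model_metrics never uses. An optional extension that reports them too (a sketch; model_metrics_full is a name introduced here, not from the original code):

```python
## Optional extension of model_metrics: also report F1 and AUC.
## Uses the clf argument (unused above) to obtain probability scores.
def model_metrics_full(clf, x, y_target, y_predict):
    print('The accuracy is ', accuracy_score(y_target, y_predict))
    print('The precision is ', precision_score(y_target, y_predict))
    print('The recall is ', recall_score(y_target, y_predict))
    print('The f1 score is ', f1_score(y_target, y_predict))
    # roc_auc_score needs continuous scores rather than hard labels;
    # note SVC only exposes predict_proba when built with probability=True.
    print('The auc is ', roc_auc_score(y_target, clf.predict_proba(x)[:, 1]))
```

It would be called as, e.g., model_metrics_full(lr, x_test_stand, y_test, y_pre_lr).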
3. Data loading

```python
## Load data
data = pd.read_csv("data_all.csv")
x = data.drop(labels='status', axis=1)
y = data['status']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2018)

## Standardize features (fit the scaler on the training set only)
scaler = StandardScaler()
scaler.fit(x_train)
x_train_stand = scaler.transform(x_train)
x_test_stand = scaler.transform(x_test)
```

4. Logistic Regression
(1) Tuning
```python
lr = LogisticRegression()
# Parameters to tune
param = {'C': [1e-3, 0.01, 0.1, 1, 10, 100, 1e3], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(estimator=lr, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best CV score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
```

==> Best parameters: {'C': 0.1, 'penalty': 'l1'}
(Note: on newer scikit-learn versions, penalty='l1' also requires solver='liblinear' or 'saga'.)
(2) Model evaluation
```python
lr = LogisticRegression(C=0.1, penalty='l1')
lr.fit(x_train_stand, y_train)
y_pre_lr = lr.predict(x_test_stand)
model_metrics(lr, y_test, y_pre_lr)
```

Output:
```
The accuracy is 0.7890679747722494
The precision is 0.6746987951807228
The recall is 0.3119777158774373
```

5. SVM
(1) Tuning
```python
svm = SVC(random_state=2018, probability=True)  # probability=True enables predict_proba
param = {'C': [0.01, 0.1, 1]}
grid = GridSearchCV(estimator=svm, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best CV score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
```

==> Best parameters: {'C': 0.1}
(2) Model evaluation
```python
svm = SVC(C=0.1, random_state=2018, probability=True)
svm.fit(x_train_stand, y_train)
y_pre_svm = svm.predict(x_test_stand)
model_metrics(svm, y_test, y_pre_svm)
```

Output:
```
The accuracy is 0.7575332866152769
The precision is 0.8823529411764706
The recall is 0.04178272980501393
```

6. Decision Tree
(1) Tuning
```python
dt = DecisionTreeClassifier(max_depth=9, min_samples_split=50, min_samples_leaf=90,
                            max_features='sqrt', random_state=2018)
# The grids below were run one stage at a time; each reassignment of
# `param` replaces the previous grid, so only the last one is active here.
param = {'max_depth': range(3, 14, 2), 'min_samples_split': range(100, 801, 200)}
# Best parameters: {'max_depth': 9, 'min_samples_split': 300}
param = {'min_samples_split': range(50, 1000, 100), 'min_samples_leaf': range(60, 101, 10)}
# Best parameters: {'min_samples_leaf': 90, 'min_samples_split': 50}
param = {'max_features': range(7, 20, 2)}
# Best parameters: {'max_features': 9}
grid = GridSearchCV(estimator=dt, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best CV score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
```

(2) Model evaluation
```python
dt = DecisionTreeClassifier(max_depth=9, min_samples_split=50, min_samples_leaf=90,
                            max_features=9, random_state=2018)
dt.fit(x_train_stand, y_train)
y_pre_dt = dt.predict(x_test_stand)
model_metrics(dt, y_test, y_pre_dt)
```

Output:
```
The accuracy is 0.7561317449194114
The precision is 0.5578947368421052
The recall is 0.14763231197771587
```

7. Random Forest
```python
## Random Forest
# param = {'n_estimators': range(1, 200, 5), 'max_features': ['log2', 'sqrt', 'auto']}
# Best parameters: {'max_features': 'sqrt', 'n_estimators': 171}
rf = RandomForestClassifier(n_estimators=171, max_features='sqrt', random_state=2018)
rf.fit(x_train_stand, y_train)
y_pre_rf = rf.predict(x_test_stand)
model_metrics(rf, y_test, y_pre_rf)
```

Output:
```
The accuracy is 0.7848633496846531
The precision is 0.6857142857142857
The recall is 0.2674094707520891
```

8. GBDT
```python
# gbdt = GradientBoostingClassifier(random_state=2018)
# param = {'n_estimators': range(1, 100, 10), 'learning_rate': np.arange(0.1, 1, 0.1)}
# grid = GridSearchCV(estimator=gbdt, param_grid=param, scoring='roc_auc', cv=5)
# grid.fit(x_train_stand, y_train)
# print('Best parameters:', grid.best_params_)
# print('Best CV score on the training set:', grid.best_score_)
# print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'learning_rate': 0.1, 'n_estimators': 41}
gbdt = GradientBoostingClassifier(learning_rate=0.1, n_estimators=41, random_state=2018)
gbdt.fit(x_train_stand, y_train)
y_pre_gbdt = gbdt.predict(x_test_stand)
model_metrics(gbdt, y_test, y_pre_gbdt)
```

9. XGBoost
```python
## Tuning
param = {'n_estimators': range(20, 200, 20)}
# The grids below were searched one stage at a time:
# param = {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 12, 2)}
# param = {'gamma': [i / 10 for i in range(1, 6)]}
# param = {'subsample': [i / 10 for i in range(5, 10)], 'colsample_bytree': [i / 10 for i in range(5, 10)]}
# param = {'reg_alpha': [1e-5, 1e-2, 0.1, 0, 1, 100]}
xgboost = xgb.XGBClassifier(learning_rate=0.1, n_estimators=40, max_depth=3,
                            min_child_weight=11, reg_alpha=0.01, gamma=0.1,
                            subsample=0.7, colsample_bytree=0.7,
                            objective='binary:logistic', nthread=4,
                            scale_pos_weight=1, seed=2018)
# Pass the classifier instance (xgboost), not the xgb module, as the estimator
grid = GridSearchCV(estimator=xgboost, param_grid=param, scoring='roc_auc',
                    n_jobs=4, iid=False, cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best CV score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'n_estimators': 40}
# Best CV score on the training set: 0.8028110571725202
# Score on the test set: 0.7770857458817146

## Model evaluation
xgboost = xgb.XGBClassifier(learning_rate=0.1, n_estimators=40, max_depth=3,
                            min_child_weight=11, reg_alpha=0.01, gamma=0.1,
                            subsample=0.7, colsample_bytree=0.7,
                            objective='binary:logistic', nthread=4,
                            scale_pos_weight=1, seed=2018)
xgboost.fit(x_train_stand, y_train)
y_pre_xgb = xgboost.predict(x_test_stand)
model_metrics(xgboost, y_test, y_pre_xgb)
```

Output:
```
The accuracy is 0.7876664330763841
The precision is 0.6521739130434783
The recall is 0.3342618384401114
```

10. LightGBM
```python
## Tuning
gbm = lgb.LGBMClassifier(seed=2018)
param = {'learning_rate': np.arange(0.1, 0.5, 0.1),
         'max_depth': range(1, 6, 1),
         'n_estimators': range(30, 50, 5)}
grid = GridSearchCV(estimator=gbm, param_grid=param, scoring='roc_auc',
                    n_jobs=4, iid=False, cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best CV score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 40}
# Best CV score on the training set: 0.8007228827289531
# Score on the test set: 0.7729296422647178

## Model evaluation
gbm = lgb.LGBMClassifier(learning_rate=0.1, max_depth=3, n_estimators=40, seed=2018)
gbm.fit(x_train_stand, y_train)
y_pre_gbm = gbm.predict(x_test_stand)
model_metrics(gbm, y_test, y_pre_gbm)
```

Output:
```
The accuracy is 0.7932725998598459
The precision is 0.6839080459770115
The recall is 0.33147632311977715
```

IV. Problems Encountered
1. UnboundLocalError: local variable 'xxx' referenced before assignment
Error:
UnboundLocalError: local variable 'xxx' referenced before assignment
This happens when a variable n is defined outside a function and the function then assigns to it: the assignment makes the interpreter treat n as a local variable everywhere in the function, so reading it before the assignment fails. In other words, the interpreter was never told whether the variable is global or local.
Solution: rename the variable so the local and global names no longer clash (or declare it with `global` inside the function).
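A minimal sketch of the situation (hypothetical names):

```python
n = 1

def bump():
    n = n + 1   # UnboundLocalError: the assignment makes n local to bump,
                # so the read on the right-hand side sees an unassigned local
    return n

def bump_renamed():
    m = n + 1   # fine: n is only read, so it resolves to the global
    return m

def bump_global():
    global n    # explicitly mark n as the global variable
    n = n + 1
    return n
```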
2. ImportError: [joblib] Attempting to do parallel computing without protecting
Error:
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information
Solution: put the code that starts the parallel work under an if __name__ == '__main__': guard.
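On platforms without fork (e.g. Windows), joblib workers re-import the script, so anything that launches parallel work (here, GridSearchCV with n_jobs > 1) must sit under the main guard. A minimal self-contained sketch (toy data, not the loan dataset):

```python
# script.py
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

if __name__ == '__main__':
    # Safe: this block only runs in the main process, not in re-imports
    # performed by the spawned joblib workers.
    X, y = make_classification(n_samples=500, random_state=0)
    grid = GridSearchCV(LogisticRegression(), {'C': [0.1, 1, 10]},
                        scoring='roc_auc', cv=5, n_jobs=-1)
    grid.fit(X, y)
    print(grid.best_params_)
```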
3. Low recall
Why is recall consistently low across all of these models? One likely cause is class imbalance: overdue loans (status = 1) are the minority class in this dataset, and every model above classifies with the default 0.5 probability threshold, so predictions skew toward the majority (non-overdue) class; accuracy then looks acceptable while recall on the positive class stays low.