Building a Loan-Overdue Prediction Model, Part 6: Feature Selection
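Contents
- I. IV (Information Value)
- 1. Overview
- 2. Computing IV
- (1) WOE
- (2) IV calculation
- II. Implementation
- 0. Modules used
- 1. IV
- 2. Random Forest
- 3. Merging features
- 4. Model building
- 5. Model evaluation
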
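Data download (data.csv): https://pan.baidu.com/s/1G1b2QJjYkkk7LDfGorbj5Q
Goal: the dataset is financial data (not anonymized), and the aim is to predict whether a loan customer will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.
Task: select features with IV and with a random forest, then build models (logistic regression, SVM, decision tree, random forest, GBDT, XGBoost and LightGBM) and evaluate them.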
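I. IV (Information Value)
1. Overview
IV stands for Information Value. It measures a variable's predictive power: the larger a feature's IV, the more strongly that feature influences the predicted outcome.
Applicability: supervised models only, and the target must be binary.
Commonly used IV ranges are interpreted as follows:
- IV in [0, 0.02] (IV cannot be negative): no predictive power
- IV in (0.02, 0.1]: weak predictive power
- IV in (0.1, +∞): usable predictive power; in practice, variables with IV greater than 0.1 are the ones kept during screening.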
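The bands listed above can be written as a small lookup. This is only a sketch: the function name `iv_strength` and the label strings are made up for illustration, and the cut-offs are the ones stated in the list.

```python
def iv_strength(iv):
    """Map an IV value to the bands listed above (labels are illustrative)."""
    if iv <= 0.02:
        return "none"
    if iv <= 0.1:
        return "weak"
    return "usable"


# Hypothetical IV values, one per band
print(iv_strength(0.015), iv_strength(0.07), iv_strength(0.35))
```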
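2. Computing IV
WOE is the basis for computing IV.
(1) WOE
WOE (Weight of Evidence) is an encoding of the original independent variable.
- First, split the feature into groups (also called discretization or binning).
- Then, for group $i$, compute $WOE_i$ with the following formula:
 $$WOE_i = \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right)$$
 Here $p_{y_i}$ is the share of all responders in the sample (in a risk model, the defaulting customers) that fall in this group, and $p_{n_i}$ is the share of all non-responders in the sample that fall in this group. $\#y_i$ is the number of responders in the group, $\#n_i$ is the number of non-responders in the group, $\#y_T$ is the total number of responders in the sample, and $\#n_T$ is the total number of non-responders in the sample.
 ==> $WOE_i$ expresses the difference between "the share of all responders that fall in this group" and "the share of all non-responders that fall in this group".
- Rearranging the formula:
 $$WOE_i = \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right) = \ln\left(\frac{\#y_i/\#n_i}{\#y_T/\#n_T}\right)$$
 ==> $WOE_i$ compares the ratio of responders to non-responders inside the group with the same ratio over the whole sample.
 ==> The larger the WOE, the more likely samples in the group are to respond; the smaller (more negative) the WOE, the less likely. A WOE of 0 means the group's ratio equals the overall ratio.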
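As a quick numeric check of the two forms of the WOE formula, here is a minimal sketch with NumPy. The per-group counts are invented for illustration and do not come from data.csv.

```python
import numpy as np

# Hypothetical counts for three groups of one feature
bad = np.array([30, 50, 20])     # responders per group (#y_i)
good = np.array([470, 350, 80])  # non-responders per group (#n_i)

p_y = bad / bad.sum()    # share of all responders that fall in each group
p_n = good / good.sum()  # share of all non-responders that fall in each group
woe = np.log(p_y / p_n)

# Rearranged form: group odds divided by overall odds
woe_alt = np.log((bad / good) / (bad.sum() / good.sum()))

print(np.round(woe, 4))
print(np.allclose(woe, woe_alt))
```

The first group has fewer responders than its size would suggest, so its WOE is negative; the third group has more, so its WOE is positive.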
(2) IV calculation
$$IV_i = (p_{y_i} - p_{n_i}) \cdot WOE_i = (p_{y_i} - p_{n_i}) \cdot \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \left(\#y_i/\#y_T - \#n_i/\#n_T\right) \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right)$$
$$IV = \sum_{i=1}^{n} IV_i$$
 where $n$ is the number of groups of the feature.
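To make the binning step and the IV sum concrete, the sketch below bins a numeric feature into equal-frequency groups with `pd.qcut` and sums $IV_i$ over the groups. The helper name `binned_iv`, the choice of five bins, and the synthetic data are all assumptions for illustration; this is not the implementation used later in this article, which groups by each distinct raw value instead.

```python
import numpy as np
import pandas as pd


def binned_iv(feature, target, bins=5):
    """IV of a numeric feature after equal-frequency binning (illustrative helper)."""
    groups = pd.qcut(feature, q=bins, duplicates="drop")
    counts = pd.crosstab(groups, target)   # columns: 0 = non-responder, 1 = responder
    p_n = counts[0] / counts[0].sum()
    p_y = counts[1] / counts[1].sum()
    keep = (p_n > 0) & (p_y > 0)           # WOE is undefined for empty cells
    woe = np.log(p_y[keep] / p_n[keep])
    return float(((p_y[keep] - p_n[keep]) * woe).sum())


# Synthetic data: one feature shifted by the label, one pure-noise feature
rng = np.random.default_rng(0)
status = pd.Series(rng.integers(0, 2, size=1000), name="status")
informative = pd.Series(status * 1.0 + rng.normal(0, 1, size=1000))
noise = pd.Series(rng.normal(0, 1, size=1000))

iv_informative = binned_iv(informative, status)
iv_noise = binned_iv(noise, status)
print(iv_informative, iv_noise)
```

The feature that depends on the label should come out with a clearly higher IV than the noise feature.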
II. Implementation
0. Modules used

    import pandas as pd
    from pandas import DataFrame as df
    from numpy import log
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    import xgboost as xgb
    import lightgbm as lgb
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_score, f1_score
    from sklearn.metrics import roc_auc_score, recall_score, roc_curve, auc
    import matplotlib.pyplot as plt

1. IV

    def calcWOE(dataset, col, target):
        # Group the samples by each distinct value of the feature
        subdata = df(dataset.groupby(col)[col].count())
        # Number of responders (target == 1) in each group
        suby = df(dataset.groupby(col)[target].sum())
        # Join the group sizes with the responder counts
        data = df(pd.merge(subdata, suby, how='left', left_index=True, right_index=True))
        # Totals: all samples (total), responders (b_total), non-responders (g_total)
        b_total = data[target].sum()
        total = data[col].sum()
        g_total = total - b_total
        # WOE formula
        data["bad"] = data[target] / b_total
        data["good"] = (data[col] - data[target]) / g_total
        data["WOE"] = log(data["bad"] / data["good"])
        return data.loc[:, ["bad", "good", "WOE"]]

    def calcIV(dataset):
        # IV is the sum over groups of (bad - good) * WOE
        return ((dataset["bad"] - dataset["good"]) * dataset["WOE"]).sum()

    file_name = '1.csv'  # local path to the dataset file
    data = pd.read_csv(file_name, encoding='gbk')
    X = data.drop(labels="status", axis=1)
    print(X.shape)
    y = data["status"]
    col_list = [col for col in data.drop(labels=['Unnamed: 0', 'status'], axis=1)]
    data_IV = df()
    fea_iv = []
    for col in col_list:
        col_WOE = calcWOE(data, col, "status")
        # Drop groups containing nan, inf or -inf
        # (axis must be passed by keyword in recent pandas versions)
        col_WOE = col_WOE[~col_WOE.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
        col_IV = calcIV(col_WOE)
        if col_IV > 0.1:
            data_IV[col] = [col_IV]
            fea_iv.append(col)
    data_IV.to_csv('data_IV.csv', index=0)
    print(fea_iv)

Output:

    ['trans_amount_increase_rate_lately', 'trans_activity_day', 'repayment_capability', 'first_transaction_time', 'historical_trans_day', 'rank_trad_1_month', 'trans_amount_3_month', 'abs', 'avg_price_last_12_month', 'trans_fail_top_count_enum_last_1_month', 'trans_fail_top_count_enum_last_6_month', 'trans_fail_top_count_enum_last_12_month', 'max_cumulative_consume_later_1_month', 'pawns_auctions_trusts_consume_last_1_month', 'pawns_auctions_trusts_consume_last_6_month', 'first_transaction_day', 'trans_day_last_12_month', 'apply_score', 'loans_score', 'loans_count', 'loans_overdue_count', 'history_suc_fee', 'history_fail_fee', 'latest_one_month_suc', 'latest_one_month_fail', 'loans_avg_limit', 'consfin_credit_limit', 'consfin_max_limit', 'consfin_avg_limit', 'loans_latest_day']

2. Random Forest
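
    # Note: X still contains 'Unnamed: 0' (the CSV row-index column),
    # so it can appear among the selected features.
    rfc = RandomForestClassifier()
    rfc.fit(X, y)
    # Rank features by impurity-based importance and keep the top 20
    rfc_impc = pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)
    fea_gini = rfc_impc[:20].index.tolist()
    print(fea_gini)

Output: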

    ['trans_fail_top_count_enum_last_1_month', 'history_fail_fee', 'loans_score', 'apply_score', 'latest_one_month_fail', 'trans_fail_top_count_enum_last_12_month', 'Unnamed: 0', 'trans_amount_3_month', 'trans_activity_day', 'max_cumulative_consume_later_1_month', 'repayment_capability', 'historical_trans_amount', 'consfin_credit_limit', 'latest_query_day', 'pawns_auctions_trusts_consume_last_6_month', 'first_transaction_time', 'loans_overdue_count', 'history_suc_fee', 'trans_days_interval', 'number_of_trans_from_2011']

3. Merging features

    # Union of the features chosen by the random forest and by IV
    features = list(set(fea_gini) | set(fea_iv))
    X_final = X[features]
    print(X_final.shape)

Output:

    (4754, 35)

Analysis: feature selection reduces the data from (4754, 92) to (4754, 35), removing a large number of redundant features.
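One detail of the merge step: a Python `set` of strings has no guaranteed iteration order, so the column order of the merged features can differ between runs, which may make results harder to reproduce. A minimal sketch of an order-preserving union follows. The helper `ordered_union` and the two short demo lists are hypothetical, not part of the original code.

```python
def ordered_union(*feature_lists):
    """Union of several feature lists, keeping first-seen order (hypothetical helper)."""
    return list(dict.fromkeys(name for names in feature_lists for name in names))


# Shortened demo lists, not the real selection results
fea_gini_demo = ["loans_score", "history_fail_fee", "apply_score"]
fea_iv_demo = ["apply_score", "loans_score", "loans_count"]

merged = ordered_union(fea_gini_demo, fea_iv_demo)
print(merged)
# ['loans_score', 'history_fail_fee', 'apply_score', 'loans_count']
```

The result contains the same names as the set-based union, only in a stable order.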
4. Model building

    ## Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.3, random_state=2019)

    ## Model 1: Logistic Regression
    lr = LogisticRegression()
    lr.fit(X_train, y_train)

    ## Model 2: SVM
    svm = SVC(kernel='linear', probability=True)
    svm.fit(X_train, y_train)

    ## Model 3: Decision Tree
    dtc = DecisionTreeClassifier(max_depth=8)
    dtc.fit(X_train, y_train)

    ## Model 4: Random Forest
    rfc = RandomForestClassifier()
    rfc.fit(X_train, y_train)

    ## Model 5: GBDT
    gbdt = GradientBoostingClassifier()
    gbdt.fit(X_train, y_train)

    ## Model 6: XGBoost
    xgbc = xgb.XGBClassifier()
    xgbc.fit(X_train, y_train)

    ## Model 7: LightGBM
    lgbc = lgb.LGBMClassifier()
    lgbc.fit(X_train, y_train)

5. Model evaluation

    ## Model evaluation
    def model_metrics(clf, X_train, X_test, y_train, y_test):
        # Hard predictions and positive-class probabilities
        y_train_pred = clf.predict(X_train)
        y_test_pred = clf.predict(X_test)
        y_train_prob = clf.predict_proba(X_train)[:, 1]
        y_test_prob = clf.predict_proba(X_test)[:, 1]
        # Accuracy
        print('Accuracy: ', end=' ')
        print('train: ', '%.4f' % accuracy_score(y_train, y_train_pred), end=' ')
        print('test: ', '%.4f' % accuracy_score(y_test, y_test_pred))
        # Precision
        print('Precision:', end=' ')
        print('train: ', '%.4f' % precision_score(y_train, y_train_pred), end=' ')
        print('test: ', '%.4f' % precision_score(y_test, y_test_pred))
        # Recall
        print('Recall:', end=' ')
        print('train: ', '%.4f' % recall_score(y_train, y_train_pred), end=' ')
        print('test: ', '%.4f' % recall_score(y_test, y_test_pred))
        # F1 score
        print('f1-score:', end=' ')
        print('train: ', '%.4f' % f1_score(y_train, y_train_pred), end=' ')
        print('test: ', '%.4f' % f1_score(y_test, y_test_pred))
        # AUC
        print('auc:', end=' ')
        print('train: ', '%.4f' % roc_auc_score(y_train, y_train_prob), end=' ')
        print('test: ', '%.4f' % roc_auc_score(y_test, y_test_prob))
        # ROC curve
        fpr_train, tpr_train, thred_train = roc_curve(y_train, y_train_prob, pos_label=1)
        fpr_test, tpr_test, thred_test = roc_curve(y_test, y_test_prob, pos_label=1)
        label = ['Train - AUC:{:.4f}'.format(auc(fpr_train, tpr_train)),
                 'Test - AUC:{:.4f}'.format(auc(fpr_test, tpr_test))]
        plt.plot(fpr_train, tpr_train)
        plt.plot(fpr_test, tpr_test)
        plt.plot([0, 1], [0, 1], 'd--')
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.legend(label, loc=4)
        plt.title('ROC Curve')
        # Show each model's curve in its own figure instead of overlaying them
        plt.show()

    model_metrics(lr, X_train, X_test, y_train, y_test)
    model_metrics(svm, X_train, X_test, y_train, y_test)
    model_metrics(dtc, X_train, X_test, y_train, y_test)
    model_metrics(rfc, X_train, X_test, y_train, y_test)
    model_metrics(gbdt, X_train, X_test, y_train, y_test)
    model_metrics(xgbc, X_train, X_test, y_train, y_test)
    model_metrics(lgbc, X_train, X_test, y_train, y_test)

Problems encountered:
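TypeError: 'list' object is not callable (apparently raised where `list(set(...))` is called)
 Cause: the name `list` had been redefined earlier, so the built-in could no longer be called at this point. Tip: never give an object the same name as a built-in, a keyword, or an imported function.
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. This warning appeared during prediction, and for the same model some of the metrics came out as 0. As the message says, it is raised when the model predicts no positive samples, which leaves precision as 0/0. This has not been resolved yet.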
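The shadowing problem is easy to reproduce in isolation. The sketch below uses two made-up functions, `merge_shadowed` and `merge_fixed`, to show the failing pattern and one way around it. It is an illustration, not the code that originally triggered the error.

```python
def merge_shadowed(a, b):
    list = [a, b]                  # BUG: rebinds the name `list` to a list object
    return list(set(a) | set(b))   # TypeError: 'list' object is not callable


def merge_fixed(a, b):
    parts = [a, b]                 # a name that does not collide with a built-in
    return sorted(set().union(*parts))


try:
    merge_shadowed(["x"], ["y"])
except TypeError as err:
    error_message = str(err)
    print(error_message)

print(merge_fixed(["x"], ["y"]))
```

In a notebook, where the shadowing assignment lives at the top level, `del list` or restarting the kernel makes the built-in visible again.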
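The warning can be reproduced with a toy prediction vector that contains no positive labels. The sketch below uses made-up labels and assumes scikit-learn 0.22 or later, where `precision_score` accepts a `zero_division` argument. It silences the warning but does not fix the underlying model, so whether it applies to the case above is an assumption.

```python
import warnings

from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_score

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 0]   # the classifier never predicts the positive class

# Default behaviour: precision is 0/0, reported as 0.0 with a warning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    p_default = precision_score(y_true, y_pred)

# Stating the fallback value explicitly avoids the warning
p_explicit = precision_score(y_true, y_pred, zero_division=0)

print(p_default, p_explicit, len(caught))
```

A quick check such as `sum(y_pred)` shows whether a model predicts any positives at all. If it does not, rescaling the features or adjusting the decision threshold may be worth trying.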
Reference:
 https://blog.csdn.net/kevin7658/article/details/50780391/