
Building a Loan-Default Prediction Model, Part 6: Feature Selection

Published: 2025/3/19


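Contents

- I. IV (Information Value)
  - 1. Overview
  - 2. Computing IV
    - (1) WOE
    - (2) IV calculation
- II. Implementation
  - 0. Modules used
  - 1. IV
  - 2. Random Forest
  - 3. Merging features
  - 4. Building the models
  - 5. Evaluating the models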

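Data (data.csv): https://pan.baidu.com/s/1G1b2QJjYkkk7LDfGorbj5Q
Goal: the dataset is financial data (not de-identified), and the aim is to predict whether a borrower will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.

Task: select features with IV and with a random forest, then build models on the selected features (logistic regression, SVM, decision tree, random forest, GBDT, XGBoost and LightGBM) and evaluate them.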

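I. IV (Information Value)

1. Overview

IV (Information Value) measures a variable's predictive power: the larger a feature's IV, the more strongly that feature influences the predicted outcome.

Applicability: supervised models with a binary target.

Commonly used IV ranges are interpreted as follows:

IV in (-∞, 0.02]: the variable has no predictive power.
IV in (0.02, 0.1]: the variable has weak predictive power.
IV in (0.1, +∞): the variable has usable predictive power. In practice, variables with IV greater than 0.1 are the ones kept during selection.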
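The bands above can be written as a small lookup. This is a minimal sketch: the function name `iv_band` and the label strings are illustrative choices, not part of the original article.

```python
def iv_band(iv: float) -> str:
    """Map an IV value to the bands listed above (hypothetical helper)."""
    if iv <= 0.02:
        return "no predictive power"
    if iv <= 0.1:
        return "weak predictive power"
    return "usable predictive power"


# Illustrative values only
for value in (0.01, 0.05, 0.35):
    print(value, "->", iv_band(value))
```

Both boundaries are inclusive on the lower band, matching the half-open intervals in the list.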


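2. Computing IV

WOE is the basis for computing IV.

(1) WOE

WOE (Weight of Evidence) is a way of encoding an original independent variable.

First, group the feature's values (also called discretization or binning).
Then, for the $i$-th group, compute $WOE_i$ as follows:
$WOE_i = \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right)$
Here $p_{y_i}$ is the share of all responders in the sample (in a risk model, the defaulting customers) that fall in this group, and $p_{n_i}$ is the share of all non-responders in the sample that fall in this group. $\#y_i$ is the number of responders in the group, $\#n_i$ is the number of non-responders in the group, $\#y_T$ is the total number of responders in the sample, and $\#n_T$ is the total number of non-responders in the sample.
So $WOE_i$ expresses the difference between "the group's share of all responders" and "the group's share of all non-responders".
Rearranging the formula:
$WOE_i = \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right) = \ln\left(\frac{\#y_i/\#n_i}{\#y_T/\#n_T}\right)$
So $WOE_i$ also expresses the difference between the responder-to-non-responder ratio inside the group and the same ratio over the whole sample.
The larger the WOE, the more likely samples in the group are to respond; the smaller (more negative) the WOE, the less likely they are to respond. A WOE of 0 means the group's ratio equals that of the whole sample.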
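To make the two forms of the formula concrete, here is a small worked example in plain Python. The group names and counts are made up for illustration and do not come from the article's dataset.

```python
from math import log

# Hypothetical counts per group: (responders y_i, non-responders n_i)
groups = {"A": (10, 90), "B": (30, 70), "C": (60, 40)}

y_total = sum(y for y, _ in groups.values())  # 100 responders overall
n_total = sum(n for _, n in groups.values())  # 200 non-responders overall


def woe(y_i, n_i):
    """First form: ln of (share of responders / share of non-responders)."""
    return log((y_i / y_total) / (n_i / n_total))


def woe_ratio_form(y_i, n_i):
    """Rearranged form: ln of (in-group odds / overall odds)."""
    return log((y_i / n_i) / (y_total / n_total))


woe_by_group = {name: woe(y, n) for name, (y, n) in groups.items()}
for name, value in woe_by_group.items():
    print(name, round(value, 4))
```

Group A, where responders are rarer than in the sample as a whole, gets a negative WOE; group C, where they are more common, gets a positive one. Both functions return the same value up to floating-point rounding.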

(2) IV calculation

$IV_i = (p_{y_i}-p_{n_i}) \cdot WOE_i = (p_{y_i}-p_{n_i}) \cdot \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \left(\#y_i/\#y_T-\#n_i/\#n_T\right) \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right)$
$IV = \sum_{i=1}^{n} IV_i$
where $n$ is the number of groups the feature is split into.
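The sum can be sketched directly from per-group counts. This is an illustrative helper, not the article's implementation: the name `information_value` and the sample counts are assumptions, and groups with a zero count are skipped here because their WOE would be infinite.

```python
from math import log


def information_value(groups):
    """IV from an iterable of (responders, non_responders) counts per group."""
    groups = list(groups)
    y_total = sum(y for y, _ in groups)
    n_total = sum(n for _, n in groups)
    iv = 0.0
    for y_i, n_i in groups:
        if y_i == 0 or n_i == 0:
            # ln(0) or division by zero: skip the group instead of adding inf/nan
            continue
        p_y = y_i / y_total
        p_n = n_i / n_total
        iv += (p_y - p_n) * log(p_y / p_n)
    return iv


# Hypothetical counts for three groups
print(round(information_value([(10, 90), (30, 70), (60, 40)]), 4))
```

Each term $(p_{y_i}-p_{n_i})\ln(p_{y_i}/p_{n_i})$ is non-negative, because the two factors always share the same sign, so IV is never negative and equals 0 when every group has the same responder share as non-responder share.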

II. Implementation

0. Modules used

    import pandas as pd
    from pandas import DataFrame as df
    from numpy import log
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    import xgboost as xgb
    import lightgbm as lgb
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_score, f1_score
    from sklearn.metrics import roc_auc_score, recall_score, roc_curve, auc
    import matplotlib.pyplot as plt

1. IV

    def calcWOE(dataset, col, target):
        ## Group by each distinct value of the feature (each value is its own bin)
        subdata = df(dataset.groupby(col)[col].count())
        ## Number of responders in each group
        suby = df(dataset.groupby(col)[target].sum())
        ## Join subdata and suby on the group index
        data = pd.merge(subdata, suby, how='left', left_index=True, right_index=True)
        ## Totals: all samples (total), responders (b_total), non-responders (g_total)
        b_total = data[target].sum()
        total = data[col].sum()
        g_total = total - b_total
        ## WOE formula
        data["bad"] = data[target] / b_total
        data["good"] = (data[col] - data[target]) / g_total
        data["WOE"] = log(data["bad"] / data["good"])
        return data.loc[:, ["bad", "good", "WOE"]]

    def calcIV(dataset):
        ## Sum (bad - good) * WOE over all groups
        return ((dataset["bad"] - dataset["good"]) * dataset["WOE"]).sum()

    file_name = '1.csv'
    data = pd.read_csv(file_name, encoding='gbk')
    X = data.drop(labels="status", axis=1)
    print(X.shape)
    y = data["status"]
    col_list = [col for col in data.drop(labels=['Unnamed: 0', 'status'], axis=1)]
    data_IV = df()
    fea_iv = []

    for col in col_list:
        col_WOE = calcWOE(data, col, "status")
        ## Drop groups containing nan, inf or -inf
        col_WOE = col_WOE[~col_WOE.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
        col_IV = calcIV(col_WOE)
        if col_IV > 0.1:
            data_IV[col] = [col_IV]
            fea_iv.append(col)

    data_IV.to_csv('data_IV.csv', index=0)
    print(fea_iv)

Output:

['trans_amount_increase_rate_lately', 'trans_activity_day', 'repayment_capability', 'first_transaction_time', 'historical_trans_day', 'rank_trad_1_month', 'trans_amount_3_month', 'abs', 'avg_price_last_12_month', 'trans_fail_top_count_enum_last_1_month', 'trans_fail_top_count_enum_last_6_month', 'trans_fail_top_count_enum_last_12_month', 'max_cumulative_consume_later_1_month', 'pawns_auctions_trusts_consume_last_1_month', 'pawns_auctions_trusts_consume_last_6_month', 'first_transaction_day', 'trans_day_last_12_month', 'apply_score', 'loans_score', 'loans_count', 'loans_overdue_count', 'history_suc_fee', 'history_fail_fee', 'latest_one_month_suc', 'latest_one_month_fail', 'loans_avg_limit', 'consfin_credit_limit', 'consfin_max_limit', 'consfin_avg_limit', 'loans_latest_day']

2. Random Forest

    rfc = RandomForestClassifier()
    rfc.fit(X, y)
    ## Rank features by importance and keep the top 20
    rfc_impc = pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)
    fea_gini = rfc_impc[:20].index.tolist()
    print(fea_gini)

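Output (note that 'Unnamed: 0', the row-index column of the CSV, was not dropped from X and therefore appears among the selected features):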

['trans_fail_top_count_enum_last_1_month', 'history_fail_fee', 'loans_score', 'apply_score', 'latest_one_month_fail', 'trans_fail_top_count_enum_last_12_month', 'Unnamed: 0', 'trans_amount_3_month', 'trans_activity_day', 'max_cumulative_consume_later_1_month', 'repayment_capability', 'historical_trans_amount', 'consfin_credit_limit', 'latest_query_day', 'pawns_auctions_trusts_consume_last_6_month', 'first_transaction_time', 'loans_overdue_count', 'history_suc_fee', 'trans_days_interval', 'number_of_trans_from_2011']

3. Merging features

    ## Union of the features chosen by the random forest and by IV
    features = list(set(fea_gini) | set(fea_iv))
    X_final = X[features]
    print(X_final.shape)

(4754, 35)
Analysis: selection reduces the data from the original (4754, 92) to (4754, 35), removing a large number of redundant features.
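One detail of the merge step is worth noting: a Python `set` of strings has no stable order across interpreter runs, so the column order of `X_final` can differ from run to run. If a reproducible order matters, an order-preserving union is one option. The helper name `ordered_union` and the short example lists below are illustrative assumptions.

```python
def ordered_union(*feature_lists):
    """Union of several feature lists, keeping first-seen order."""
    return list(dict.fromkeys(name for names in feature_lists for name in names))


# Shortened, illustrative lists (the real ones come from the two selection steps)
fea_a = ["loans_score", "apply_score", "history_fail_fee"]
fea_b = ["apply_score", "loans_count", "loans_score"]

merged = ordered_union(fea_a, fea_b)
print(merged)  # ['loans_score', 'apply_score', 'history_fail_fee', 'loans_count']
```

`dict.fromkeys` keeps insertion order, so duplicates are dropped while the first occurrence of each name fixes its position.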

4. Building the models

    ## Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.3, random_state=2019)

    ## Model 1: Logistic Regression
    lr = LogisticRegression()
    lr.fit(X_train, y_train)

    ## Model 2: SVM
    svm = SVC(kernel='linear', probability=True)
    svm.fit(X_train, y_train)

    ## Model 3: Decision Tree
    dtc = DecisionTreeClassifier(max_depth=8)
    dtc.fit(X_train, y_train)

    ## Model 4: Random Forest
    rfc = RandomForestClassifier()
    rfc.fit(X_train, y_train)

    ## Model 5: GBDT
    gbdt = GradientBoostingClassifier()
    gbdt.fit(X_train, y_train)

    ## Model 6: XGBoost
    xgbc = xgb.XGBClassifier()
    xgbc.fit(X_train, y_train)

    ## Model 7: LightGBM
    lgbc = lgb.LGBMClassifier()
    lgbc.fit(X_train, y_train)

5. Evaluating the models

    ## Model evaluation
    def model_metrics(clf, X_train, X_test, y_train, y_test):
        y_train_pred = clf.predict(X_train)
        y_test_pred = clf.predict(X_test)
        y_train_prob = clf.predict_proba(X_train)[:, 1]
        y_test_prob = clf.predict_proba(X_test)[:, 1]

        # Accuracy
        print('accuracy:', end=' ')
        print('train: ', '%.4f' % accuracy_score(y_train, y_train_pred), end=' ')
        print('test: ', '%.4f' % accuracy_score(y_test, y_test_pred))
        # Precision
        print('precision:', end=' ')
        print('train: ', '%.4f' % precision_score(y_train, y_train_pred), end=' ')
        print('test: ', '%.4f' % precision_score(y_test, y_test_pred))
        # Recall
        print('recall:', end=' ')
        print('train: ', '%.4f' % recall_score(y_train, y_train_pred), end=' ')
        print('test: ', '%.4f' % recall_score(y_test, y_test_pred))
        # F1 score
        print('f1-score:', end=' ')
        print('train: ', '%.4f' % f1_score(y_train, y_train_pred), end=' ')
        print('test: ', '%.4f' % f1_score(y_test, y_test_pred))
        # AUC
        print('auc:', end=' ')
        print('train: ', '%.4f' % roc_auc_score(y_train, y_train_prob), end=' ')
        print('test: ', '%.4f' % roc_auc_score(y_test, y_test_prob))

        # ROC curve
        fpr_train, tpr_train, thred_train = roc_curve(y_train, y_train_prob, pos_label=1)
        fpr_test, tpr_test, thred_test = roc_curve(y_test, y_test_prob, pos_label=1)
        label = ['Train - AUC:{:.4f}'.format(auc(fpr_train, tpr_train)),
                 'Test - AUC:{:.4f}'.format(auc(fpr_test, tpr_test))]
        plt.plot(fpr_train, tpr_train)
        plt.plot(fpr_test, tpr_test)
        plt.plot([0, 1], [0, 1], 'd--')
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.legend(label, loc=4)
        plt.title('ROC Curve')
        # Show each model's curve in its own figure
        plt.show()

    model_metrics(lr, X_train, X_test, y_train, y_test)
    model_metrics(svm, X_train, X_test, y_train, y_test)
    model_metrics(dtc, X_train, X_test, y_train, y_test)
    model_metrics(rfc, X_train, X_test, y_train, y_test)
    model_metrics(gbdt, X_train, X_test, y_train, y_test)
    model_metrics(xgbc, X_train, X_test, y_train, y_test)
    model_metrics(lgbc, X_train, X_test, y_train, y_test)

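Problems encountered:

TypeError: 'list' object is not callable (raised on the line that uses set)
Cause: the name list had been reassigned earlier in the session, so the built-in could no longer be called at that point. Tip: never give an object the same name as a built-in, a keyword, or an imported function.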
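The error is easy to reproduce in isolation. The sketch below shadows the built-in inside a function so that it does not affect the rest of a session. The function and variable names are illustrative.

```python
def shadowing_demo():
    list = ["a", "b"]  # this assignment hides the built-in list()
    try:
        return list({"x", "y"})  # now calls the list object, not the built-in
    except TypeError as exc:
        return str(exc)


def fixed_demo():
    names = ["a", "b"]  # a different name leaves the built-in untouched
    return sorted(list({"x", "y"})) + names


message = shadowing_demo()
print(message)       # 'list' object is not callable
print(fixed_demo())  # ['x', 'y', 'a', 'b']
```

In a notebook where the shadowing assignment was made at the top level, `del list` removes the offending name and makes the built-in reachable again.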

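UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. 'precision', 'predicted', average, warn_for)
This warning appears at prediction time when a model predicts no positive samples at all, so some metrics for that model come out as 0. It was not resolved at the time of writing.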
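The warning comes from the definition of precision, TP / (TP + FP): when a model predicts no sample as class 1, the denominator is zero. The plain-Python sketch below reproduces that case with made-up labels. The function name `precision` and its `zero_division` argument are illustrative here, although scikit-learn's `precision_score` (version 0.22 and later) accepts a `zero_division` parameter with the same purpose.

```python
def precision(y_true, y_pred, zero_division=0.0):
    """Precision for the positive class (label 1), with an explicit zero guard."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        # No sample was predicted positive: precision is undefined
        return zero_division
    return tp / (tp + fp)


# Made-up labels: the model never predicts class 1
print(precision([1, 0, 1], [0, 0, 0]))        # 0.0
# Made-up labels: one true positive, one false positive
print(precision([1, 0, 1, 0], [1, 1, 0, 0]))  # 0.5
```

Setting the fallback value only silences the symptom. A model that never predicts the positive class is still worth investigating, for example by checking the class balance of `status` or the scale of the input features, though which of these applies to this dataset is not established by the article.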

Reference:
https://blog.csdn.net/kevin7658/article/details/50780391/
