Alibaba Tianchi Learning Competition - Financial Risk Control - Loan Default Prediction

Published: 2023/12/29

1 Understanding the Task
- 1.1 Competition data
- 1.2 Evaluation metric
2 Exploratory Data Analysis (EDA)
- 2.1 A first look at the data
- 2.2 Share of missing values
- 2.3 Numerical variables
  - 2.3.1 Distributions
  - 2.3.2 Relationships between variables
- 2.4 Discrete variables
  - 2.4.1 Distributions
- 2.5 Differences between positive and negative samples
3 Feature Engineering
- 3.1 Data preprocessing
  - 3.1.1 Handling missing values
  - 3.1.2 Handling date formats
  - 3.1.3 Converting object-type features to numeric
- 3.2 Outlier handling
- 3.3 Binning
- 3.4 Encoding
- 3.5 Feature derivation
- 3.6 Feature selection
4 Modeling and Parameter Tuning
- 4.1 Baseline
- 4.2 Parameter tuning
  - 4.2.1 max_depth
  - 4.2.2 min_child_weight
  - 4.2.3 subsample
- 4.3 Updating the model
- 4.4 Predicting and submitting
5 Model Ensembling
- 5.1 Stacking and blending explained
- 5.2 Stacking code

1 Understanding the Task

Project repository:
https://github.com/datawhalechina/team-learning-data-mining/tree/master/FinancialRiskControl

Competition page:
https://tianchi.aliyun.com/competition/entrance/531830/introduction

1.1 Competition data

The task is to predict financial risk. The dataset can be viewed and downloaded after registering. It consists of loan records from a credit platform: more than 1.2 million rows in total, with 47 columns, 15 of which are anonymized variables. To keep the competition fair, 800,000 rows are drawn as the training set, 200,000 as test set A and 200,000 as test set B, and fields such as employmentTitle, purpose, postCode and title are desensitized.

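The fields are as follows:

Field                 Description

id                    Unique letter-of-credit identifier assigned to the loan listing
loanAmnt              Loan amount
term                  Loan term (years)
interestRate          Loan interest rate
installment           Installment payment amount
grade                 Loan grade
subGrade              Sub-grade of the loan grade
employmentTitle       Job title
employmentLength      Length of employment (years)
homeOwnership         Home ownership status provided by the borrower at registration
annualIncome          Annual income
verificationStatus    Verification status
issueDate             Month in which the loan was issued
purpose               Loan purpose category given by the borrower in the application
postCode              First 3 digits of the postal code given by the borrower in the application
regionCode            Region code
dti                   Debt-to-income ratio
delinquency_2years    Number of delinquencies more than 30 days past due in the borrower's credit file over the past 2 years
ficoRangeLow          Lower bound of the borrower's FICO range at loan issuance
ficoRangeHigh         Upper bound of the borrower's FICO range at loan issuance
openAcc               Number of open credit lines in the borrower's credit file
pubRec                Number of derogatory public records
pubRecBankruptcies    Number of public records cleared (public-record bankruptcies)
revolBal              Total revolving credit balance
revolUtil             Revolving line utilization rate, i.e. the credit the borrower is using relative to all available revolving credit
totalAcc              Total number of credit lines currently in the borrower's credit file
initialListStatus     Initial listing status of the loan
applicationType       Whether the loan is an individual application or a joint application with two co-borrowers
earliesCreditLine     Month in which the borrower's earliest reported credit line was opened
title                 Loan title provided by the borrower
policyCode            policy_code = 1 for publicly available products; policy_code = 2 for new products that are not publicly available
n-series features     Anonymized features n0 to n14, derived from counts of borrower behavior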

1.2 Evaluation metric

The submission is, for each test sample, the predicted probability that y = 1. Models are evaluated by AUC.
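As a small illustration of the metric, AUC can be read as the share of (positive, negative) pairs in which the positive sample gets the higher score. The sketch below computes it by hand on invented labels and probabilities; `pairwise_auc` is a hypothetical helper written for this example, not the competition's scoring code, and in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used.

```python
from itertools import product

def pairwise_auc(y_true, y_prob):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy labels (1 = default) and predicted probabilities, invented for illustration
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc_value = pairwise_auc(y_true, y_prob)
print(auc_value)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Because only the ordering of the scores matters, AUC is unaffected by any rescaling of the probabilities that preserves their order.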

2 Exploratory Data Analysis (EDA)

Exploratory analysis helps us understand the data and the relationships within it, which makes the later cleaning and modeling steps go more smoothly.

2.1 A first look at the data

First, load the data and take a quick look.

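    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns

    # Load test set A and the training set
    test = pd.read_csv("./testA.csv")
    train = pd.read_csv("./train.csv")

    # Drop the id column from the training set
    train.drop("id", axis=1, inplace=True)

    train.head()
    train.info(verbose=True)
    train.describe()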

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 800000 entries, 0 to 799999
    Data columns (total 46 columns):
    loanAmnt              800000 non-null float64
    term                  800000 non-null int64
    interestRate          800000 non-null float64
    installment           800000 non-null float64
    grade                 800000 non-null object
    subGrade              800000 non-null object
    employmentTitle       799999 non-null float64
    employmentLength      753201 non-null object
    homeOwnership         800000 non-null int64
    annualIncome          800000 non-null float64
    verificationStatus    800000 non-null int64
    issueDate             800000 non-null object
    isDefault             800000 non-null int64
    purpose               800000 non-null int64
    postCode              799999 non-null float64
    regionCode            800000 non-null int64
    dti                   799761 non-null float64
    delinquency_2years    800000 non-null float64
    ficoRangeLow          800000 non-null float64
    ficoRangeHigh         800000 non-null float64
    openAcc               800000 non-null float64
    pubRec                800000 non-null float64
    pubRecBankruptcies    799595 non-null float64
    revolBal              800000 non-null float64
    revolUtil             799469 non-null float64
    totalAcc              800000 non-null float64
    initialListStatus     800000 non-null int64
    applicationType       800000 non-null int64
    earliesCreditLine     800000 non-null object
    title                 799999 non-null float64
    policyCode            800000 non-null float64
    n0                    759730 non-null float64
    n1                    759730 non-null float64
    n2                    759730 non-null float64
    n2.1                  759730 non-null float64
    n4                    766761 non-null float64
    n5                    759730 non-null float64
    n6                    759730 non-null float64
    n7                    759730 non-null float64
    n8                    759729 non-null float64
    n9                    759730 non-null float64
    n10                   766761 non-null float64
    n11                   730248 non-null float64
    n12                   759730 non-null float64
    n13                   759730 non-null float64
    n14                   759730 non-null float64
    dtypes: float64(33), int64(8), object(5)
    memory usage: 280.8+ MB

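The data contains both numerical and categorical variables, and quite a few variables have missing values.

    # Positive vs. negative samples
    plt.hist(train['isDefault'])
    plt.title("positive vs negative")
    plt.show()

There are far more negative samples than positive ones. This is common in financial risk-control modeling: most people do not default on their loans.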

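2.2 Share of missing values

    # Share of missing values per column
    missing_val = train.isnull().sum() / train.shape[0]
    missing_val[missing_val > 0].sort_values().plot.bar()

The variable with the most missing values is n11, at roughly 9%. That is not excessive, so the variable can be kept.
There are many ways to handle missing values. If a variable is mostly missing it can be dropped; otherwise the gaps can be filled with a suitable method, such as mean imputation, mode imputation or random-forest imputation, depending on the case.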
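A minimal sketch of the two simplest options mentioned above, mean and mode imputation, using pandas. The column names mirror the dataset, but the values are invented, and the `_filled` column names are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration
df = pd.DataFrame({
    "dti": [10.0, np.nan, 30.0, 20.0],
    "employmentLength": ["2 years", "2 years", None, "5 years"],
})

# Mean imputation for a numerical column
df["dti_filled"] = df["dti"].fillna(df["dti"].mean())

# Mode imputation for a categorical column (mode() ignores missing values)
mode_value = df["employmentLength"].mode()[0]
df["employmentLength_filled"] = df["employmentLength"].fillna(mode_value)

print(df)
```

Both statistics should be computed on the training set only and then reused on the test set, so that the test data does not influence the fill values.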

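2.3 Numerical variables

A closer look at the numerical variables.

2.3.1 Distributions

    # Collect the numerical columns (everything that is not object-typed)
    numerical_cols = []
    for col in train.columns:
        if train[col].dtype != object:
            numerical_cols.append(col)
    numerical_cols.remove("isDefault")

    # One distribution plot per column, four per row
    n_rows = -(-len(numerical_cols) // 4)  # ceiling division, so a partial last row still fits
    f, ax = plt.subplots(n_rows, 4, figsize=(15, 60))
    for i, col in enumerate(numerical_cols):
        # distplot is deprecated in recent seaborn releases; histplot is the suggested replacement
        sns.distplot(train[col], ax=ax[i // 4, i % 4])

A few observations:

Most variables are right-skewed, so the largest values may be outliers.
policyCode takes a single value, so it contributes nothing to prediction and can be dropped; initialListStatus is a binary variable; n2 and n2.1 have very similar distributions and may be duplicate columns.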

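2.3.2 Relationships between variables

A heatmap gives an intuitive view of how the variables relate to each other.

    f, ax = plt.subplots(1, 1, figsize=(20, 20))
    cor = train[numerical_cols].corr()
    sns.heatmap(cor, annot=True, linewidth=0.2, linecolor="white", ax=ax, fmt=".1g")

The plot shows several strongly correlated variables:

The correlation between loanAmnt and installment shows as 1 (the heatmap rounds to one significant digit). One is the total loan amount and the other the installment payment, so a strong correlation is expected.
The correlation between ficoRangeLow and ficoRangeHigh is 1. They are the lower and upper bounds of the FICO range, so they are bound to be strongly correlated.
n2 and n2.1 are also strongly correlated. Given the distribution plots above, the two are almost certainly duplicate columns, and one of them can be dropped.
n1, n2, n4, n5, n7, n9 and n10 are fairly strongly positively correlated.

installment (Y) against loanAmnt (X):

    plt.scatter(train['loanAmnt'], train['installment'])

ficoRangeLow against ficoRangeHigh:

    plt.scatter(train['ficoRangeLow'], train['ficoRangeHigh'])

These two variables are linearly related, so one of them can be dropped as well.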
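Reading pairs off a 40-column heatmap is error-prone, so the same check can be done programmatically. The sketch below lists column pairs whose absolute Pearson correlation exceeds a threshold; `highly_correlated_pairs` and the 0.95 threshold are assumptions made for this example, and the toy data constructs ficoRangeHigh as ficoRangeLow + 4 purely to obtain a perfectly correlated pair.

```python
import pandas as pd

def highly_correlated_pairs(df, threshold=0.95):
    """Return (col_a, col_b, |r|) for column pairs whose absolute Pearson r exceeds threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

# Toy data, invented for illustration
toy = pd.DataFrame({
    "ficoRangeLow": [660, 700, 720, 680, 750],
    "ficoRangeHigh": [664, 704, 724, 684, 754],
    "dti": [12.0, 30.5, 8.1, 22.3, 15.0],
})

pairs = highly_correlated_pairs(toy)
to_drop = sorted({b for _, b, _ in pairs})  # keep the first column of each pair, drop the second
print(pairs)
print(to_drop)
```

Dropping the second column of each pair is one simple convention; which column to keep is a modeling decision the heatmap alone does not settle.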

2.4 Discrete variables

Discrete variables cannot be fed into the model directly; they first have to be converted to numbers. There are many ways to do this: a direct mapping, or encodings such as One-Hot Encoding and Target Encoding. Risk-control models also often assign values through binning.

2.4.1 Distributions

grade

    train['grade'].value_counts().sort_index().plot.bar()

This can be converted with a direct mapping.

subGrade

    train['subGrade'].value_counts().sort_index().plot.bar(figsize=(15, 5))

Mapping is still an option here, as is binning.

issueDate
A date variable giving when the loan was issued. It is converted to the number of days since the earliest issue date in the dataset.

    import datetime

    def transform_issueDate(df):
        df['issueDate'] = pd.to_datetime(df['issueDate'], format='%Y-%m-%d')
        startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
        # Days elapsed since the start date
        df['issueDateDT'] = df['issueDate'].apply(lambda x: x - startdate).dt.days
        return df

    train = transform_issueDate(train)
    test = transform_issueDate(test)
    plt.hist(train['issueDateDT'], label="train")
    plt.hist(test['issueDateDT'], label="test")

earliesCreditLine_Year
When the borrower's earliest reported credit line was opened.
It is converted to the number of years before 2020.

    def transform_earliesCreditLine(df):
        # The last four characters of the raw string are the year
        df['earliesCreditLine_Year'] = df['earliesCreditLine'].apply(lambda x: 2020 - int(x[-4:]))
        return df

    train = transform_earliesCreditLine(train)
    test = transform_earliesCreditLine(test)
    plt.hist(train['earliesCreditLine_Year'], label="train")
    plt.hist(test['earliesCreditLine_Year'], label="test")

2.5 Differences between positive and negative samples

Split the dataset into positive and negative samples and compare the variable distributions.

    train_positive = train[train['isDefault'] == 1]
    train_negative = train[train['isDefault'] != 1]
    f, ax = plt.subplots(len(numerical_cols), 2, figsize=(10, 80))
    for i, col in enumerate(numerical_cols):
        sns.distplot(train_positive[col], ax=ax[i, 0], color="blue")
        ax[i, 0].set_title("positive")
        sns.distplot(train_negative[col], ax=ax[i, 1], color='red')
        ax[i, 1].set_title("negative")
    plt.subplots_adjust(hspace=1)

Overall the distributions differ little; revolUtil shows the largest difference.

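3 Feature Engineering

Feature engineering is one of the more important stages of machine learning. It roughly consists of the following steps:

Data preprocessing
Outlier handling
Binning
Feature derivation
Encoding
Feature selection

3.1 Data preprocessing

Data preprocessing covers roughly three things:

Handling missing values
Handling date formats
Converting object-type features to numeric

3.1.1 Handling missing values

The missing-value check in the previous step showed that quite a few variables have gaps. n10 and n4 have the same number of missing values, and the anonymous variables other than n10, n4 and n11 also share the same count (n8 is off by one row in the listing above), so these values are very likely missing in the same rows.
The check below verifies this guess.

    # Columns that are entirely missing in the rows where n10 is missing
    is_null_index = train['n10'].isnull()
    for col in train.columns:
        if train[col][is_null_index].notnull().sum() == 0:
            print(col)
    # Output: n0 n1 n2 n2.1 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14

    # Columns that are entirely missing in the rows where n1 is missing
    is_null_index = train['n1'].isnull()
    for col in train.columns:
        if train[col][is_null_index].notnull().sum() == 0:
            print(col)
    # Output: n0 n1 n2 n5 n6 n7 n8 n9 n11 n12 n13 n14

The output shows that in rows where n10 is missing, every other anonymous variable is missing too, and in rows where n1 is missing, everything except n10 and n4 is missing. The anonymous variables therefore appear to be related:

If n10 is missing, all anonymous variables are missing;
If n1 is missing, all anonymous variables other than n10 and n4 are missing.

Seen this way, the missing values in the anonymous variables should not be imputed; they should go into the model as a value of their own.
The employmentLength variable also has a fairly large number of missing values.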

3.1.2 Handling date formats

The dataset has two date variables, both already handled during EDA.

3.1.3 Converting object-type features to numeric

The object-type features are "grade", "subGrade" and "employmentLength".

"grade" and "subGrade" both express the loan grade, so they should carry an order, for example A > B and A1 > A2. They can therefore be mapped directly to numbers, which is the same as Label Encoding.

    for colname in ['grade', 'subGrade']:
        # Build the mapping from the sorted unique values of train and test combined
        unique_val = sorted(pd.concat([train, test])[colname].unique())
        map_dict = {x: y for x, y in zip(unique_val, range(len(unique_val)))}
        for data in [train, test]:
            data[colname] = data[colname].map(map_dict)

"employmentLength"

    train['employmentLength'].unique()

    array(['2 years', '5 years', '8 years', '10+ years', nan, '7 years',
           '9 years', '1 year', '3 years', '< 1 year', '4 years', '6 years'],
          dtype=object)

Strip the trailing "years", change "10+" to 10 and "< 1" to 0.

    for data in [train, test]:
        data['employmentLength'] = data['employmentLength'].replace("< 1 year", "0 year")
        data['employmentLength'] = data['employmentLength'].replace("10+ years", "10 years")
        # Keep the leading number; missing values stay missing
        data['employmentLength'] = data['employmentLength'].apply(
            lambda x: int(str(x).split()[0]) if pd.notnull(x) else x)

3.2 Outlier handling

Outliers can well affect the final model, but an outlier should not be deleted the moment it is found. First check whether it has a specific cause; in financial risk control in particular, outliers often carry meaning.
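One way to look before deleting is to flag outliers and inspect them. The sketch below uses the interquartile-range rule; this rule, the helper name `iqr_outlier_mask` and the multiplier k = 1.5 are assumptions chosen for illustration (a common convention), not a method the article applies, and the income values are invented.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy incomes, invented for illustration
income = pd.Series([40_000, 52_000, 48_000, 61_000, 55_000, 2_000_000], name="annualIncome")

mask = iqr_outlier_mask(income)
print(income[mask])  # the flagged rows can now be inspected rather than dropped
```

Keeping the mask as an extra indicator column is one way to let the model decide whether the unusual rows carry signal.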

No outlier handling is applied here for now.

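3.3 Binning

Purpose of binning:
In terms of model performance, binning mainly serves to reduce the complexity of a variable, lessen the impact of noise on the model, and strengthen the association between the independent and dependent variables, which makes the model more stable.
What gets binned:
- Continuous variables, which are discretized
- Discrete variables with many states, which are merged into fewer states
Why bin:
The values of a feature can span a wide range. In supervised and unsupervised methods alike, for example k-means clustering, which uses Euclidean distance as its similarity measure between data points, large values swamp small ones. One remedy is to quantize the values into intervals, known as bucketing or binning, and then use the quantized result.
Advantages of binning:
- Handling missing values: when the data source may contain missing values, null can be treated as a bin of its own.
- Handling outliers: when the data contains outliers, binning discretizes them and makes the variable more robust (resistant to interference). For example, an abnormal age of 200 falls into the "age > 60" bin, which removes its influence.
- Business interpretability: we tend to judge the effect of a variable linearly, expecting y to grow as x grows. In practice x and y are often related nonlinearly, in which case a WOE transformation can be applied.
Pay particular attention to the basic rules of binning:

(1) The smallest bin holds no less than 5% of the samples
(2) A bin must not consist entirely of good customers
(3) Consecutive bins are monotonic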
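The three rules above can be checked mechanically once a binning exists. The sketch below is one possible check, assuming equal-frequency bins from `pd.qcut` and a 0/1 target where 1 marks a bad customer; `check_bins`, `dti_bin` and the toy data are invented for this example.

```python
import pandas as pd

def check_bins(df, bin_col, target, min_share=0.05):
    """Check the three rules: minimum bin share, no all-good bin, monotone bad rate."""
    stats = df.groupby(bin_col)[target].agg(n="count", bad="sum")
    stats["share"] = stats["n"] / len(df)
    stats["bad_rate"] = stats["bad"] / stats["n"]
    rates = stats["bad_rate"]  # ordered by bin label, i.e. by increasing feature value
    return {
        "min_share_ok": bool((stats["share"] >= min_share).all()),
        "no_all_good_bin": bool((stats["bad"] > 0).all()),
        "monotone": bool(rates.is_monotonic_increasing or rates.is_monotonic_decreasing),
    }

# Toy data, invented for illustration
toy = pd.DataFrame({
    "dti": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "isDefault": [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1],
})
toy["dti_bin"] = pd.qcut(toy["dti"], q=3, labels=False)  # three equal-frequency bins: 0, 1, 2

report = check_bins(toy, "dti_bin", "isDefault")
print(report)
```

In the toy data the bad rate rises from 0.25 to 0.5 to 0.75 across the three bins, so all three checks pass; a failed check is a hint to merge neighboring bins.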

I have not yet found a Python package for chi-square binning, so I wrote one by hand based on my own understanding.

    import numpy as np
    import pandas as pd

    class ChiMerge():
        def __init__(self, df, col_name, target):
            self.num_bins = df[col_name].nunique()
            self.sorted_df = df.sort_values(by=col_name)[[target, col_name]].copy()
            self.target = target
            self.unique_val = np.sort(df[col_name].unique())
            self.col_name = col_name
            self.reverse = 1
            self.shape = df.shape[0]

        def check_max_and_min_bin(self, to_merge_df):
            # Share of samples in the largest and in the smallest bin
            counts = to_merge_df[self.col_name].value_counts().values
            return counts[0] / self.shape, counts[-1] / self.shape

        def cal_Chi2(self, bin1, bin2, epsilon=1e-8):
            # Chi-square of two adjacent bins; epsilon guards against division by zero
            bins = pd.concat([bin1, bin2])
            positive_rate = bins[self.target].sum() / bins.shape[0]
            negative_rate = 1 - positive_rate
            chi2_val = 0.0
            for b in (bin1, bin2):
                n, pos = b.shape[0], b[self.target].sum()
                chi2_val += (pos - positive_rate * n) ** 2 / (positive_rate * n + epsilon)
                chi2_val += (n - pos - negative_rate * n) ** 2 / (negative_rate * n + epsilon)
            return chi2_val

        def calculate_every_Chi2(self):
            # chi2_list[i] is the chi-square of bins i and i+1; pairs not visited stay at inf.
            # With many distinct values, many pairs have a chi-square of 0. To save time the
            # scan stops at the first 0 and alternates direction between calls instead of
            # always walking the whole list.
            chi2_list = [np.inf] * (self.num_bins - 1)
            order = range(self.num_bins - 1)
            if self.reverse == -1:
                order = reversed(order)
            col = self.sorted_df[self.col_name]
            for i in order:
                chi2 = self.cal_Chi2(self.sorted_df[col == self.unique_val[i]],
                                     self.sorted_df[col == self.unique_val[i + 1]])
                chi2_list[i] = chi2
                if chi2 == 0:
                    break
            self.reverse = self.reverse * (-1)
            return chi2_list

        def chi2Merge(self, chi2_val):
            # Initial check: if one bin already holds over 95% of the samples, binning is not possible
            max_bin, min_bin = self.check_max_and_min_bin(self.sorted_df)
            if max_bin > 0.95:
                print("The max bin has more than 95% of samples")
                return self.sorted_df
            chi2_list = [0]
            while self.num_bins > 5 and min(chi2_list) < chi2_val:
                chi2_list = self.calculate_every_Chi2()
                col = self.sorted_df[self.col_name]
                merged = False
                # Try pairs from the smallest chi-square upwards; skip any merge that
                # would create a bin holding more than 95% of the samples
                for to_merge in np.argsort(chi2_list):
                    if not np.isfinite(chi2_list[to_merge]):
                        break
                    trial = col.where(col != self.unique_val[to_merge], self.unique_val[to_merge + 1])
                    if trial.value_counts().values[0] / self.shape <= 0.95:
                        self.sorted_df[self.col_name] = trial
                        merged = True
                        break
                if not merged:
                    break
                self.unique_val = np.sort(self.sorted_df[self.col_name].unique())
                self.num_bins -= 1
                if self.num_bins % 1000 == 0:
                    print(self.num_bins)
            _, min_bin = self.check_max_and_min_bin(self.sorted_df)
            if min_bin < 0.05:
                print("too small bin")
            return self.sorted_df

On features with many distinct initial values this is slow, and the resulting bins are not very good. I plan to improve it later or try a different method.
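To make the merging criterion concrete, the sketch below computes the Pearson chi-square for two adjacent bins from their counts alone. `chi2_two_bins` is a hypothetical helper written for this example; it assumes both classes occur in the combined bins, since otherwise an expected count is zero and the division fails.

```python
def chi2_two_bins(pos1, n1, pos2, n2):
    """Pearson chi-square of the 2x2 table formed by two adjacent bins.

    pos1, pos2: number of positive (default) samples in each bin
    n1, n2:     total number of samples in each bin
    """
    pos_rate = (pos1 + pos2) / (n1 + n2)
    neg_rate = 1 - pos_rate
    chi2 = 0.0
    for pos, n in ((pos1, n1), (pos2, n2)):
        exp_pos, exp_neg = pos_rate * n, neg_rate * n  # expected counts if the bins did not differ
        chi2 += (pos - exp_pos) ** 2 / exp_pos + ((n - pos) - exp_neg) ** 2 / exp_neg
    return chi2

different = chi2_two_bins(10, 30, 20, 30)  # default rates 1/3 vs 2/3
similar = chi2_two_bins(10, 30, 10, 30)    # identical default rates

print(different)  # about 6.67
print(similar)    # about 0
```

A value near zero means the two bins have almost the same default rate and are good candidates for merging. With one degree of freedom, the usual 95% critical value is about 3.84, which is one common choice for the stopping threshold.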

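3.4 Encoding

Encoding turns discrete variables into numbers that can express the relationships between categories, so they can be fed into the model. Common methods are:

Label Encoding
A mapping such as {A=1, B=2}.
One-Hot Encoding
Produces a sparse matrix. With three classes A, B and C, for example, they are represented as [0,0,1], [0,1,0] and [1,0,0].
Target Encoding
Assigns the mean of the target to each category. For example:

target  feature

1       A
1       A
1       A
0       A
1       B
1       B
0       B
0       B

For A, three targets are 1 and one is 0, so A = 3/4 = 0.75.
Likewise, B = 2/4 = 0.5.
The drawback of this method is that it overfits easily, so the encoding is usually done with cross-validation or with added noise. Target encoding is what we use here.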
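The worked example above can be reproduced in a few lines of pandas. This is the plain, non-cross-validated form of target encoding, shown only to make the arithmetic concrete; the column name `feature_te` is made up for the example.

```python
import pandas as pd

# The toy table from the text
df = pd.DataFrame({
    "target":  [1, 1, 1, 0, 1, 1, 0, 0],
    "feature": ["A", "A", "A", "A", "B", "B", "B", "B"],
})

# Plain target encoding: each category is replaced by the mean target of that category
means = df.groupby("feature")["target"].mean()
df["feature_te"] = df["feature"].map(means)

print(means.to_dict())  # {'A': 0.75, 'B': 0.5}
```

Because every row here contributes to its own encoding, the encoded value leaks the target; computing the means on other folds, or adding noise, is what addresses that.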

    import numpy as np
    from sklearn import base
    from sklearn.model_selection import KFold

    class KFoldTargetEncoderTrain(base.BaseEstimator, base.TransformerMixin):
        def __init__(self, colnames, targetName, n_fold=5, verbosity=True, discardOriginal_col=False):
            self.colnames = colnames
            self.targetName = targetName
            self.n_fold = n_fold
            self.verbosity = verbosity
            self.discardOriginal_col = discardOriginal_col

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            assert(type(self.targetName) == str)
            assert(type(self.colnames) == str)
            assert(self.colnames in X.columns)
            assert(self.targetName in X.columns)
            mean_of_target = X[self.targetName].mean()
            kf = KFold(n_splits=self.n_fold, shuffle=False)
            col_mean_name = self.colnames + '_' + 'Kfold_Target_Enc'
            X[col_mean_name] = np.nan
            for tr_ind, val_ind in kf.split(X):
                X_tr, X_val = X.iloc[tr_ind], X.iloc[val_ind]
                # Encode each validation fold with category means from the other folds
                X.loc[X.index[val_ind], col_mean_name] = X_val[self.colnames].map(
                    X_tr.groupby(self.colnames)[self.targetName].mean())
            # Categories not seen in the other folds fall back to the global mean
            X[col_mean_name] = X[col_mean_name].fillna(mean_of_target)
            if self.verbosity:
                encoded_feature = X[col_mean_name].values
                print('Correlation between the new feature, {} and, {} is {}.'.format(
                    col_mean_name, self.targetName,
                    np.corrcoef(X[self.targetName].values, encoded_feature)[0][1]))
            if self.discardOriginal_col:
                X = X.drop(self.colnames, axis=1)
            return X

    class KFoldTargetEncoderTest(base.BaseEstimator, base.TransformerMixin):
        def __init__(self, train, colNames, encodedName):
            self.train = train
            self.colNames = colNames
            self.encodedName = encodedName

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            # Average the training set's encoded values per category
            mean = self.train[[self.colNames, self.encodedName]].groupby(self.colNames).mean().reset_index()
            dd = {}
            for index, row in mean.iterrows():
                dd[row[self.colNames]] = row[self.encodedName]
            X[self.encodedName] = X[self.colNames]
            X = X.replace({self.encodedName: dd})
            return X

Apply target encoding to the five variables "purpose", "verificationStatus", "regionCode", "grade" and "subGrade".

    for colname in ['purpose', 'verificationStatus', 'regionCode', 'grade', 'subGrade']:
        targetc = KFoldTargetEncoderTrain(colname, 'isDefault', n_fold=5)
        train = targetc.fit_transform(train)
        test_targetc = KFoldTargetEncoderTest(train, colname, colname + '_' + 'Kfold_Target_Enc')
        test = test_targetc.fit_transform(test)

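3.5 Feature derivation

Section 3.1.1 noted that the missing values in the anonymous variables may have a specific cause. Split the rows into the following three missingness patterns, plus the rows with nothing missing, and compare the share of positive samples in each:

Only n11 missing

Everything missing except n4 and n10

Everything missing

Nothing missing

    for data in [train, test]:
        data['extra_col1'] = 3                                                   # nothing missing
        data.loc[data['n10'].isnull(), 'extra_col1'] = 1                         # everything missing
        data.loc[data['n1'].isnull() & data['n10'].notnull(), 'extra_col1'] = 2  # all but n4 and n10 missing
        data.loc[data['n11'].isnull() & data['n1'].notnull(), 'extra_col1'] = 4  # only n11 missing

    # Share of positive samples in each group
    for i in range(1, 5):
        group = train[train['extra_col1'] == i]['isDefault']
        print(group.sum() / group.count())

    # Output:
    # 0.14362646288997863
    # 0.17678850803584129
    # 0.19927476692849552
    # 0.27382809850078016

This suggests that the extent of missingness affects the share of positive samples, so we can derive a variable like this, even though its relationship with the target is not necessarily linear (Target Encoding can be applied to it afterwards).

LA_ratio (loanAmnt / annualIncome)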
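The article names this ratio feature without showing code for it. A minimal sketch, assuming the ratio is simply the loan amount divided by annual income and that a zero income should yield a missing value rather than infinity; `add_la_ratio` and the toy values are made up for the example.

```python
import numpy as np
import pandas as pd

def add_la_ratio(df):
    """Add LA_ratio = loanAmnt / annualIncome; a zero income gives NaN instead of inf."""
    out = df.copy()
    income = out["annualIncome"].replace(0, np.nan)
    out["LA_ratio"] = out["loanAmnt"] / income
    return out

# Toy rows, invented for illustration
toy = pd.DataFrame({
    "loanAmnt": [10000.0, 5000.0, 8000.0],
    "annualIncome": [50000.0, 0.0, 40000.0],
})
toy = add_la_ratio(toy)
print(toy)
```

Leaving the zero-income rows as NaN fits the approach taken earlier for the anonymous variables, where missing values are handed to the model as they are.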

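3.6 Feature selection

Feature selection aims to shrink the model and shorten training without sacrificing model performance. The dataset here is not especially large, so feature selection is skipped for now; this step will be added later if it turns out to be needed.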

4 Modeling and Parameter Tuning

All the preparation so far serves one goal: producing results. At this step we can start building a model and keep improving it against the evaluation metric.

The first model is XGBoost, a workhorse of machine-learning modeling, used through its sklearn interface. Import the packages that may be needed first.

    from sklearn.metrics import auc, roc_curve
    from sklearn.model_selection import GridSearchCV, KFold, RandomizedSearchCV, train_test_split
    from xgboost import XGBClassifier, plot_importance

4.1 Baseline

    target = train['isDefault']
    train_X = train.drop("isDefault", axis=1)
    # Split into training and validation sets
    X_train, X_test, y_train, y_test = train_test_split(train_X, target, test_size=0.2, random_state=0)

Set some parameters off the cuff:

    def XGB():
        model = XGBClassifier(learning_rate=0.1, n_estimators=600, max_depth=5, min_child_weight=5,
                              gamma=1, subsample=0.8, random_state=27, verbosity=1, nthread=-1)
        return model

For the details of the XGBoost parameters, see the official documentation.

    %%time
    model = XGB()
    # Note: recent xgboost releases expect eval_metric in the XGBClassifier constructor rather than in fit()
    model.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], eval_metric="auc")
    result = model.evals_result()

    # ROC curve and AUC on the training set
    pre = model.predict_proba(X_train)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_train, pre)
    score = auc(fpr, tpr)

    f, [ax1, ax2] = plt.subplots(2, 1, figsize=(7, 15))
    # AUC per boosting round: training set (validation_0) and held-out set (validation_1)
    ax1.plot([i for i in range(1, 600 + 1)], result['validation_0']['auc'])
    ax1.plot([i for i in range(1, 600 + 1)], result['validation_1']['auc'])
    ax2.set_xlim(0, 1)
    ax2.set_ylim(0, 1)
    ax2.plot(fpr, tpr, label="AUC = {:.3f}".format(score))
    ax2.plot([0, 1], [0, 1], linestyle="--")
    plt.legend()

The first plot shows how the AUC on the training set and the held-out set changes with the number of boosting rounds. After roughly 200 rounds the held-out AUC barely changes, so n_estimators can later be set between 200 and 300 to cut training time.
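Instead of reading the plateau off the plot, it can be located from the recorded scores. The sketch below returns the first round after which the AUC gains less than a tolerance over a following window of rounds; `plateau_round`, `window` and `min_gain` are assumptions made for this example, and the score list is invented. With the code above, the list would be `result['validation_1']['auc']`.

```python
def plateau_round(scores, window=50, min_gain=1e-3):
    """Return the first 1-based round after which `window` more rounds gain less than min_gain."""
    for i in range(len(scores) - window):
        if scores[i + window] - scores[i] < min_gain:
            return i + 1
    return len(scores)  # no plateau found within the recorded rounds

# Toy validation AUC per round, invented for illustration
toy_auc = [0.60, 0.65, 0.69, 0.70, 0.7002, 0.7003, 0.7003, 0.7003]

best_round = plateau_round(toy_auc, window=2, min_gain=0.001)
print(best_round)  # 4: from round 4 on, two further rounds add less than 0.001
```

XGBoost also offers built-in early stopping through its `early_stopping_rounds` setting, which halts training once the evaluation metric stops improving.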

4.2 Parameter tuning

With a baseline in place, the model parameters can be tuned starting from it.
XGBoost has many parameters and is slow to train, so a full grid search could take days. We therefore tune one parameter at a time.
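A rough count of model fits shows why. The sketch below compares a full grid over the three parameters tuned in the following sections with tuning them one at a time, using the grid sizes and the 5-fold cross-validation from those sections; it counts cross-validation fits only.

```python
from math import prod

# Eight candidate values per parameter, as in the searches below
grid_sizes = {"max_depth": 8, "min_child_weight": 8, "subsample": 8}
cv_folds = 5

full_grid_fits = prod(grid_sizes.values()) * cv_folds     # every combination of values
one_at_a_time_fits = sum(grid_sizes.values()) * cv_folds  # one search per parameter

print(full_grid_fits, one_at_a_time_fits)  # 2560 120
```

The saving comes at a price: tuning one parameter at a time ignores interactions between parameters, so the result is not guaranteed to match the best point of the full grid.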

First define the following base model:

    model = XGBClassifier(learning_rate=0.1, n_estimators=300, max_depth=5, min_child_weight=6,
                          gamma=1, subsample=0.8, scale_pos_weight=4, random_state=27,
                          verbosity=1, nthread=-1)

4.2.1 max_depth

This parameter sets the maximum depth of each tree.

    param_grid = {"max_depth": [i for i in range(3, 11)]}
    xgb_grid = GridSearchCV(model,
                            param_grid=param_grid,
                            scoring="roc_auc",  # AUC, the competition metric
                            verbose=True,       # print progress
                            cv=5,               # 5-fold cross-validation
                            n_jobs=-1)          # use all CPUs
    xgb_grid.fit(X_train, y_train)
    xgb_grid.best_params_  # best max_depth: 5

4.2.2 min_child_weight

The minimum sum of sample weights required in a leaf node. If a split would produce a leaf whose sample weights sum to less than min_child_weight, the split is not made. This is an effective guard against overfitting to unusual samples.

    param_grid = {"min_child_weight": [i for i in range(3, 11)]}
    xgb_grid = GridSearchCV(model,
                            param_grid=param_grid,
                            scoring="roc_auc",  # AUC, the competition metric
                            verbose=True,       # print progress
                            cv=5,               # 5-fold cross-validation
                            n_jobs=-1)          # use all CPUs
    xgb_grid.fit(X_train, y_train)
    xgb_grid.best_params_  # best min_child_weight: 6

4.2.3 subsample

    param_grid = {"subsample": [round(i * 0.1, 1) for i in range(3, 11)]}
    xgb_grid = GridSearchCV(model,
                            param_grid=param_grid,
                            scoring="roc_auc",  # AUC, the competition metric
                            verbose=True,       # print progress
                            cv=5,               # 5-fold cross-validation
                            n_jobs=-1)          # use all CPUs
    xgb_grid.fit(X_train, y_train)
    xgb_grid.best_params_  # best subsample: 0.6

4.3 Updating the model

Validate again with the tuned parameters.

    def XGB():
        model = XGBClassifier(learning_rate=0.1, n_estimators=600, max_depth=5, min_child_weight=6,
                              gamma=1, subsample=0.6, random_state=27, verbosity=1, nthread=-1)
        return model

    %%time
    model = XGB()
    model.fit(X_train, y_train)
    # AUC on the held-out set
    pre = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, pre)
    score = auc(fpr, tpr)
    print(score)  # 0.7373661712901491

4.4 Predicting and submitting

Submit the predictions of the updated model to see the final score.

    # Keep the same columns, in the same order, as the training features
    test = test[train_X.columns]
    pre = model.predict_proba(test)[:, 1]
    pd.DataFrame({'isDefault': pre}, index=test.index).reset_index().rename(
        columns={"index": "id"}).to_csv('submit.csv', index=False)

The online AUC is 0.7337, which currently ranks roughly in the top 50.

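5 Model Ensembling

There are roughly four kinds of model ensembling: stacking, bagging, blending and boosting.
XGBoost is itself a boosting algorithm and random forest is a bagging algorithm, so those two are not used again here; the models are combined mainly through blending and stacking.

5.1 Stacking and blending explained

Stacking takes the predictions of several base learners and uses them as a new training set for another learner. Feeding the base learners' predictions straight into that model tends to overfit, so when predicting with several base models, K-fold validation can be used to prevent overfitting.
Blending differs from stacking in that the predicted values are merged with the original features as new features, and the combined features are used for prediction. To prevent overfitting, the data is split into two parts, d1 and d2. d1 serves as the training set and d2 as the test set; the predictions obtained are used as a new feature, and d2 together with the new feature then serves as the training set for predicting the test set.
Differences between blending and stacking
- Stacking
  In stacking the two layers use different data, which avoids information leakage.
  In a team competition, there is no need to share your random seed with teammates.
- Blending
  Because blending splits the data into two parts, some of the information in the data is ignored in the final prediction.
  In addition, the second layer may overfit because it has little data.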
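The K-fold idea described for stacking can be sketched with scikit-learn alone. This is a minimal illustration on synthetic data, not the article's pipeline: a decision tree and a logistic regression stand in for the LightGBM and XGBoost base models, and all names and settings here are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

# Stand-in base learners
base_models = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Level 1: out-of-fold probabilities, so each training row is predicted by a model
# that never saw that row during fitting
oof = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Base models refit on all training rows produce the level-1 features for the test rows
test_meta = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])

# Level 2: the meta-model is trained on the out-of-fold predictions only
meta = LogisticRegression().fit(oof, y_tr)
stack_auc = roc_auc_score(y_te, meta.predict_proba(test_meta)[:, 1])
print(round(stack_auc, 3))
```

Blending would replace the cross-validation step with a single holdout split: base models are fit on d1, their predictions on d2 become the second-layer training data, and nothing from d1 reaches the second layer.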

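5.2 Stacking code

Stack the previously trained lgb and xgb models as base classifiers, with logistic regression as the meta-classifier.

    from mlxtend.classifier import StackingClassifier
    from sklearn.linear_model import LogisticRegression

    # lgb_model and xgb_model are the previously trained base models (not defined in this article);
    # LR is assumed here to be a plain LogisticRegression instance
    LR = LogisticRegression()
    sclf = StackingClassifier(classifiers=[lgb_model, xgb_model], meta_classifier=LR,
                              use_probas=True, verbose=1)
    sclf.fit(X_train, y_train)
    pre = sclf.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, pre)
    score = auc(fpr, tpr)
    print(score)  # 0.7390504896093062

The final submission scores 0.7347.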
