PPDai (拍拍贷) "Mojing Cup" Risk-Control Algorithm Competition: a LightGBM-Based Solution
This post follows an article by a Zhihu author; after working through it, I modified parts of the code. Many thanks to the original author for sharing.
Reference:
https://zhuanlan.zhihu.com/p/56864235
Original data source:
https://www.kesci.com/home/competition/56cd5f02b89b5bd026cb39c9/content/1
Dataset composition:
30,000 labeled training samples and 20,000 unlabeled test samples.
Both the training set and the test set consist of three tables:
Master (the main feature table), Log_Info (user login records), and Userupdate_Info (customer information-update records).
(1)
- Master
Each row is one sample (one successfully funded loan), and each sample carries 200+ fields of various kinds.
idx: the unique key of each loan; it matches the idx in the other two files.
UserInfo_*: borrower feature fields
WeblogInfo_*: web-behavior fields
Education_Info*: education and enrollment fields
ThirdParty_Info_PeriodN_*: third-party data fields for time period N
SocialNetwork_*: social-network fields
ListingInfo: loan-funding date
Target: default label (1 = the loan defaulted, 0 = repaid normally).
The test set does not contain the target field.
(2)
- Log_Info
Borrowers' login records.
Listinginfo1: loan-funding date (the column name as it actually appears in this table)
LogInfo1: operation code
LogInfo2: operation category
LogInfo3: login time
idx: the unique key of each loan
(3)
- Userupdate_Info
Borrowers' information-update records.
ListingInfo1: loan-funding date
UserupdateInfo1: updated content
UserupdateInfo2: update time
idx: the unique key of each loan
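The three tables join on this key. As a quick sanity check, here is a minimal sketch (assuming the competition's download layout, matching the paths used in the code later in this post) that loads the training tables and verifies the Idx linkage:

import pandas as pd

# Load the three training-set tables (file names follow the competition download)
master = pd.read_csv('./Training Set/PPD_Training_Master_GBK_3_1_Training_Set.csv', encoding='gbk')
log_info = pd.read_csv('./Training Set/PPD_LogInfo_3_1_Training_Set.csv', encoding='gbk')
userupdate = pd.read_csv('./Training Set/PPD_Userupdate_Info_3_1_Training_Set.csv', encoding='gbk')

# Every login/update record should map back to exactly one loan via Idx
print(master.Idx.is_unique)                      # True: one row per loan
print(log_info.Idx.isin(master.Idx).mean())      # share of login rows with a matching loan
print(userupdate.Idx.isin(master.Idx).mean())    # share of update rows with a matching loan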
The overall workflow of this post:
1) Merge the training and test data (so the features can be processed together).
2) Clean the categorical variables.
3) Derive new features from some categorical variables and from the other tables (the login table and the update table).
4) Leave the numeric variables untouched and do not fill missing values, since LightGBM handles missing values natively (see the sketch after this list).
5) Run feature selection on the feature-engineered dataset.
6) Build a model on the selected features and predict.
7) Tune the LightGBM parameters to improve model accuracy.
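Step 4 leans on the fact that LightGBM treats NaN as a first-class value during split finding, so no imputation is required. A minimal sketch on toy data (not the competition set) illustrating this behavior:

import numpy as np
import lightgbm as lgb

rng = np.random.RandomState(0)
X = rng.rand(500, 5)
X[rng.rand(500, 5) < 0.3] = np.nan     # inject roughly 30% missing values, no imputation
y = rng.randint(0, 2, 500)             # random binary labels, just to exercise the API

clf = lgb.LGBMClassifier(n_estimators=50)   # use_missing defaults to True
clf.fit(X, y)                               # trains directly on NaN-containing features
print(clf.predict_proba(X[:5])[:, 1])       # predicting on rows with NaN works as well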
The code is as follows:
import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
import os

# os.chdir() changes the current working directory
os.chdir(r"D:\Py_Data\拍拍贷“魔镜杯”风控初赛数据")

###################################### Merging the data #########################################
# Training set
train_LogInfo = pd.read_csv(r'.\Training Set\PPD_LogInfo_3_1_Training_Set.csv', encoding='gbk')
train_Master = pd.read_csv(r'.\Training Set\PPD_Training_Master_GBK_3_1_Training_Set.csv', encoding='gbk')
train_Userupdate = pd.read_csv(r'.\Training Set\PPD_Userupdate_Info_3_1_Training_Set.csv', encoding='gbk')

# Test set
test_LogInfo = pd.read_csv(r'.\Test Set\PPD_LogInfo_2_Test_Set.csv', encoding='gbk')
test_Master = pd.read_csv(r'.\Test Set\PPD_Master_GBK_2_Test_Set.csv', encoding='gb18030')
test_Userupdate = pd.read_csv(r'.\Test Set\PPD_Userupdate_Info_2_Test_Set.csv', encoding='gbk')

# Flag which samples come from the training set and which from the test set before merging
train_Master['sample_status'] = 'train'
test_Master['sample_status'] = 'test'

# Concatenate training and test sets (axis=0 stacks rows)
df_Master = pd.concat([train_Master, test_Master], axis=0).reset_index(drop=True)
df_LogInfo = pd.concat([train_LogInfo, test_LogInfo], axis=0).reset_index(drop=True)
df_Userupdate = pd.concat([train_Userupdate, test_Userupdate], axis=0).reset_index(drop=True)

df_Master.to_csv(r"D:\Py_Data\拍拍贷“魔镜杯”风控初赛数据\df_Master.csv", encoding='gb18030', index=False)
df_LogInfo.to_csv(r"D:\Py_Data\拍拍贷“魔镜杯”风控初赛数据\df_LogInfo.csv", encoding='gb18030', index=False)
df_Userupdate.to_csv(r"D:\Py_Data\拍拍贷“魔镜杯”风控初赛数据\df_Userupdate.csv", encoding='gb18030', index=False)

##################################### Exploratory analysis ######################################
# Load the merged data
df_Master = pd.read_csv('df_Master.csv', encoding='gb18030')
df_LogInfo = pd.read_csv('df_LogInfo.csv', encoding='gb18030')
df_Userupdate = pd.read_csv('df_Userupdate.csv', encoding='gb18030')

# Display settings
pd.set_option("display.max_columns", len(train_Master.columns))
df_Master.head(20)
# The columns fall into: education info, third-party info, social-network info, user info,
# weblog info, the target label, and sample_status (our flag for train/test origin)

# Good/bad sample ratio in the training set; 1 = bad
df_Master.target.value_counts()

# Every Idx is unique
len(np.unique(df_Master.Idx))

####################################### (1) Missing values ######################################
# In the raw data many missing values are coded as -1; replace them with np.nan
df_Master = df_Master.replace({-1: np.nan})
df_Master.head(15)

# Visualize missingness: the more white, the more missing values in a column
import missingno as msno
%matplotlib inline
msno.bar(df_Master)

# Columns with 80% or more missing
missing_columns = []
for column in df_Master.columns:
    if sum(pd.isnull(df_Master[column])) / len(df_Master) >= 0.8:
        missing_columns.append(column)
print(len(missing_columns))
print(missing_columns)

# Drop those columns
df_Master = df_Master.loc[:, list(~df_Master.columns.isin(missing_columns))]
df_Master.shape

# Now look at row-wise missingness:
# drop samples that are missing more than 100 features
missing_index = []
for i in np.arange(df_Master.shape[0]):
    if list(df_Master.loc[i, :].isnull()).count(True) > 100:
        missing_index.append(i)
print(missing_index)

df_Master = df_Master.drop(missing_index).reset_index(drop=True)
df_Master.shape

# Single-value concentration analysis
print("Original number of columns:", '\n', len(df_Master.columns))
cols = [col for col in df_Master.columns if col not in ('target', 'sample_status')]
print("Number of columns excluding target and the train/test flag:", '\n', len(cols))

# If one value of a column accounts for more than 90% of samples,
# the column carries little information and can be dropped
drop_cols_simple = []
for col in cols:
    if max(df_Master[col].value_counts()) / len(df_Master) > 0.9:
        drop_cols_simple.append(col)
print(drop_cols_simple)
print(len(drop_cols_simple))

df_Master = df_Master.drop(drop_cols_simple, axis=1)
df_Master.shape
df_Master = df_Master.reset_index(drop=True)

# Types of the remaining columns
df_Master.dtypes.value_counts()

objectcol = df_Master.select_dtypes(include=["object"]).columns
numcol = df_Master.select_dtypes(include=[np.float64]).columns

# Only 12 categorical columns remain; let's look for patterns in them
df_Master[objectcol]
# Observations:
#   provinces: UserInfo_19 and UserInfo_7
#   cities: UserInfo_2, UserInfo_20, UserInfo_4, UserInfo_8
city_feature = ['UserInfo_2', 'UserInfo_20', 'UserInfo_4', 'UserInfo_8']
province_feature = ['UserInfo_7', 'UserInfo_19']

print("City features:")
for col in city_feature:
    print(col, ":", df_Master[col].nunique())
print('\n')
print("Province features:")
for col in province_feature:
    print(col, ":", df_Master[col].nunique())

print(df_Master.UserInfo_8.unique()[:50])
# The same city appears with and without the trailing '市' (city) suffix;
# strip the suffix to normalize
df_Master['UserInfo_8'] = [a[:-1] if a.find('市') != -1 else a for a in df_Master['UserInfo_8']]

# The distinct count drops after cleaning
df_Master['UserInfo_8'].nunique()

# Now the numeric columns
df_Master[numcol].head(20)
# We neither impute nor fill their missing values; they go into the model as-is

# The Userupdate table logs customers' information updates
df_Userupdate
# Normalize the case of the update-content field
df_Userupdate['UserupdateInfo1'] = df_Userupdate.UserupdateInfo1.map(lambda s: s.lower())

###################################### Feature engineering ######################################
# Start with the categorical variables
df_Master[objectcol]

# 1) Province features: presumably one is the registered (hukou) province
#    and the other the current residence province.
# First look at the default rate by province.
def get_badrate(df, col):
    '''Compute the default rate grouped by one column.'''
    group = df.groupby(col)
    df = pd.DataFrame()
    df['total'] = group.target.count()
    df['bad'] = group.target.sum()
    df['badrate'] = round(df['bad'] / df['total'], 4) * 100  # as a percentage
    return df.sort_values('badrate', ascending=False)

# Default rate by registered province
province_original = get_badrate(df_Master, 'UserInfo_19')
province_original
# Default rate by residence province
province_current = get_badrate(df_Master, 'UserInfo_7')
province_current

# Take the top five provinces of each ranking for binarization
province_original.iloc[:5, ]
province_current.iloc[:5, ]
# Binarize the top-five provinces of each ranking
# Registered province (UserInfo_19)
df_Master['is_tianjin_UserInfo_19'] = df_Master.apply(lambda x: 1 if x.UserInfo_19 == '天津市' else 0, axis=1)
df_Master['is_shandong_UserInfo_19'] = df_Master.apply(lambda x: 1 if x.UserInfo_19 == '山东省' else 0, axis=1)
df_Master['is_jilin_UserInfo_19'] = df_Master.apply(lambda x: 1 if x.UserInfo_19 == '吉林省' else 0, axis=1)
df_Master['is_heilongjiang_UserInfo_19'] = df_Master.apply(lambda x: 1 if x.UserInfo_19 == '黑龙江省' else 0, axis=1)
df_Master['is_hunan_UserInfo_19'] = df_Master.apply(lambda x: 1 if x.UserInfo_19 == '湖南省' else 0, axis=1)

# Residence province (UserInfo_7)
df_Master['is_tianjin_UserInfo_7'] = df_Master.apply(lambda x: 1 if x.UserInfo_7 == '天津' else 0, axis=1)
df_Master['is_shandong_UserInfo_7'] = df_Master.apply(lambda x: 1 if x.UserInfo_7 == '山东' else 0, axis=1)
df_Master['is_sichuan_UserInfo_7'] = df_Master.apply(lambda x: 1 if x.UserInfo_7 == '四川' else 0, axis=1)
df_Master['is_hainan_UserInfo_7'] = df_Master.apply(lambda x: 1 if x.UserInfo_7 == '海南' else 0, axis=1)
df_Master['is_hunan_UserInfo_7'] = df_Master.apply(lambda x: 1 if x.UserInfo_7 == '湖南' else 0, axis=1)

# Derive a flag for registered province != residence province
print(df_Master.UserInfo_19.unique())
print('\n')
print(df_Master.UserInfo_7.unique())

# First normalize UserInfo_19 to the short province names used in UserInfo_7
UserInfo_19_change = []
for i in df_Master.UserInfo_19:
    if i in ('内蒙古自治区', '黑龙江省'):
        j = i[:3]
    else:
        j = i[:2]
    UserInfo_19_change.append(j)
print(np.unique(UserInfo_19_change))

# 1 if UserInfo_7 and UserInfo_19 agree, 0 otherwise
is_same_province = []
for i, j in zip(df_Master.UserInfo_7, UserInfo_19_change):
    if i == j:
        a = 1
    else:
        a = 0
    is_same_province.append(a)
df_Master['is_same_province'] = is_same_province

# 2) City features
# There are four city columns, presumably the cities of the IP addresses the user logs in from.
# Derivation ideas:
#   a) pick the important cities with xgboost and binarize them;
#   b) use the distinct count over the four columns as the number of login-IP city changes.

# Binarize cities chosen by xgboost feature importance
df_Master_temp = df_Master[['UserInfo_2', 'UserInfo_4', 'UserInfo_8', 'UserInfo_20', 'target']]
df_Master_temp.head()

area_list = []
# One-hot encode each of the four city columns
# (iterate the city columns explicitly so target is not dummy-encoded as well)
for col in ['UserInfo_2', 'UserInfo_4', 'UserInfo_8', 'UserInfo_20']:
    dummy_df = pd.get_dummies(df_Master_temp[col])
    dummy_df = pd.concat([dummy_df, df_Master_temp['target']], axis=1)
    area_list.append(dummy_df)

df_area1 = area_list[0]
df_area2 = area_list[1]
df_area3 = area_list[2]
df_area4 = area_list[3]
df_area1

# Use xgboost to pick out the important cities
from xgboost.sklearn import XGBClassifier
from xgboost import plot_importance

# Note: drop the merged-in rows without a target label before fitting
x_area1 = df_area1[~(df_area1['target'].isnull())].drop(['target'], axis=1)
y_area1 = df_area1[~(df_area1['target'].isnull())]['target']
x_area2 = df_area2[~(df_area2['target'].isnull())].drop(['target'], axis=1)
y_area2 = df_area2[~(df_area2['target'].isnull())]['target']
x_area3 = df_area3[~(df_area3['target'].isnull())].drop(['target'], axis=1)
y_area3 = df_area3[~(df_area3['target'].isnull())]['target']
x_area4 = df_area4[~(df_area4['target'].isnull())].drop(['target'], axis=1)
y_area4 = df_area4[~(df_area4['target'].isnull())]['target']

xg_area1 = XGBClassifier(random_state=0).fit(x_area1, y_area1)
xg_area2 = XGBClassifier(random_state=0).fit(x_area2, y_area2)
xg_area3 = XGBClassifier(random_state=0).fit(x_area3, y_area3)
xg_area4 = XGBClassifier(random_state=0).fit(x_area4, y_area4)

plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
fig = plt.figure(figsize=(20, 8))
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)
ax4 = fig.add_subplot(2, 2, 4)

plot_importance(xg_area1, ax=ax1, max_num_features=10, height=0.4)
plot_importance(xg_area2, ax=ax2, max_num_features=10, height=0.4)
plot_importance(xg_area3, ax=ax3, max_num_features=10, height=0.4)
plot_importance(xg_area4, ax=ax4, max_num_features=10, height=0.4)
# Binarize the top-three cities from each importance ranking
df_Master['is_zibo_UserInfo_2'] = df_Master.apply(lambda x: 1 if x.UserInfo_2 == '淄博' else 0, axis=1)
df_Master['is_chengdu_UserInfo_2'] = df_Master.apply(lambda x: 1 if x.UserInfo_2 == '成都' else 0, axis=1)
df_Master['is_yantai_UserInfo_2'] = df_Master.apply(lambda x: 1 if x.UserInfo_2 == '烟台' else 0, axis=1)

df_Master['is_zibo_UserInfo_4'] = df_Master.apply(lambda x: 1 if x.UserInfo_4 == '淄博' else 0, axis=1)
df_Master['is_qingdao_UserInfo_4'] = df_Master.apply(lambda x: 1 if x.UserInfo_4 == '青岛' else 0, axis=1)
df_Master['is_shantou_UserInfo_4'] = df_Master.apply(lambda x: 1 if x.UserInfo_4 == '汕头' else 0, axis=1)

df_Master['is_zibo_UserInfo_8'] = df_Master.apply(lambda x: 1 if x.UserInfo_8 == '淄博' else 0, axis=1)
df_Master['is_chengdu_UserInfo_8'] = df_Master.apply(lambda x: 1 if x.UserInfo_8 == '成都' else 0, axis=1)
df_Master['is_heze_UserInfo_8'] = df_Master.apply(lambda x: 1 if x.UserInfo_8 == '菏泽' else 0, axis=1)

df_Master['is_ziboshi_UserInfo_20'] = df_Master.apply(lambda x: 1 if x.UserInfo_20 == '淄博市' else 0, axis=1)
df_Master['is_chengdushi_UserInfo_20'] = df_Master.apply(lambda x: 1 if x.UserInfo_20 == '成都市' else 0, axis=1)
df_Master['is_sanmenxiashi_UserInfo_20'] = df_Master.apply(lambda x: 1 if x.UserInfo_20 == '三门峡市' else 0, axis=1)

# Derived feature: number of distinct login-IP cities per sample
df_Master['UserInfo_20'] = [a[:-1] if a.find('市') != -1 else a for a in df_Master.UserInfo_20]
city_df = df_Master[['UserInfo_2', 'UserInfo_4', 'UserInfo_8', 'UserInfo_20']]

city_change_cnt = []
for i in range(city_df.shape[0]):
    a = list(city_df.iloc[i])
    city_count = len(set(a))
    city_change_cnt.append(city_count)
df_Master['city_count_cnt'] = city_change_cnt

# 3) The mobile operator has only a few levels; dummy-encode it directly
print(df_Master.UserInfo_9.value_counts())
print(set(df_Master.UserInfo_9))
df_Master['UserInfo_9'] = df_Master.UserInfo_9.replace({'中国联通 ': 'china_unicom',
                                                        '中国联通': 'china_unicom',
                                                        '中国移动': 'china_mobile',
                                                        '中国移动 ': 'china_mobile',
                                                        '中国电信': 'china_telecom',
                                                        '中国电信 ': 'china_telecom',
                                                        '不详': 'operator_unknown'})

operator_dummy = pd.get_dummies(df_Master.UserInfo_9)
df_Master = pd.concat([df_Master, operator_dummy], axis=1)

# Drop the original columns
df_Master = df_Master.drop(['UserInfo_9'], axis=1)
df_Master = df_Master.drop(['UserInfo_19', 'UserInfo_2', 'UserInfo_4', 'UserInfo_7', 'UserInfo_8', 'UserInfo_20'], axis=1)

# Which categorical columns remain?
df_Master.dtypes.value_counts()
df_Master.select_dtypes(include='object')
# The remaining object columns are the WeblogInfo ones
# 4) Weblog features
for col in ['WeblogInfo_19', 'WeblogInfo_20', 'WeblogInfo_21']:
    # Turn the literal string 'nan' into a real NaN, then fill with the mode
    df_Master[col] = df_Master[col].replace({'nan': np.nan})
    df_Master[col] = df_Master[col].fillna(df_Master[col].mode()[0])

# How many distinct values does each column have?
for col in ['WeblogInfo_19', 'WeblogInfo_20', 'WeblogInfo_21']:
    print(df_Master[col].value_counts())
    print('\n')

# We guess WeblogInfo_20 is a finer-grained version of WeblogInfo_19 and _21, so drop it
# and dummy-encode the other two (prefix the values to keep the dummy names distinct)
df_Master['WeblogInfo_19'] = ['WeblogInfo_19' + i for i in df_Master.WeblogInfo_19]
df_Master['WeblogInfo_21'] = ['WeblogInfo_21' + i for i in df_Master.WeblogInfo_21]

for col in ['WeblogInfo_19', 'WeblogInfo_21']:
    weibo_dummy = pd.get_dummies(df_Master[col])
    df_Master = pd.concat([df_Master, weibo_dummy], axis=1)

# Drop the originals
df_Master = df_Master.drop(['WeblogInfo_19', 'WeblogInfo_21', 'WeblogInfo_20'], axis=1)

# All categorical variables are handled now
df_Master.dtypes.value_counts()

# Trend of successful listings over time:
# first convert the string dates to timestamps
import datetime
from datetime import datetime
df_Master['ListingInfo'] = pd.to_datetime(df_Master.ListingInfo)
df_Master["Month"] = df_Master.ListingInfo.apply(lambda x: datetime.strftime(x, "%Y-%m"))

plt.figure(figsize=(20, 4))
plt.title("Monthly trend of successful listings")
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
sns.countplot(data=df_Master.sort_values('Month'), x='Month')
plt.show()

# Monthly default-rate trend
month_group = df_Master.groupby('Month')
df_badrate_month = pd.DataFrame()
df_badrate_month['total'] = month_group.target.count()
df_badrate_month['bad'] = month_group.target.sum()
df_badrate_month['badrate'] = df_badrate_month['bad'] / df_badrate_month['total']
df_badrate_month = df_badrate_month.reset_index()

plt.figure(figsize=(12, 4))
plt.title('Monthly default-rate trend')
sns.pointplot(data=df_badrate_month, x='Month', y='badrate', linestyles='-')
plt.show()
# Note: the empty months are the unlabeled prediction samples.
# We leave the numeric columns' missing values untouched.
df_Master = df_Master.drop('Month', axis=1)

# The LogInfo table
df_LogInfo

# Derived variables:
# 1) total login count
# 2) average interval between logins
# 3) days between the last login and the listing date

# 1) total login count
log_cnt = df_LogInfo.groupby('Idx', as_index=False).LogInfo3.count().rename(columns={'LogInfo3': 'log_cnt'})
log_cnt.head(10)
# 2) days between the last login and the listing date
df_LogInfo['Listinginfo1'] = pd.to_datetime(df_LogInfo.Listinginfo1)
df_LogInfo['LogInfo3'] = pd.to_datetime(df_LogInfo.LogInfo3)
time_log_span = df_LogInfo.groupby('Idx', as_index=False).agg({'Listinginfo1': 'max', 'LogInfo3': 'max'})
time_log_span.head()

time_log_span['log_timespan'] = time_log_span['Listinginfo1'] - time_log_span['LogInfo3']
# Extract the day count from the timedelta (cleaner than parsing its string form)
time_log_span['log_timespan'] = time_log_span['log_timespan'].dt.days
time_log_span = time_log_span[['Idx', 'log_timespan']]
time_log_span.head()

# 3) average interval between logins
df_temp_timeinterval = df_LogInfo.sort_values(by=['Idx', 'LogInfo3'], ascending=[True, True])
# Previous login time within each user's history
df_temp_timeinterval['LogInfo4'] = df_temp_timeinterval.groupby('Idx')['LogInfo3'].shift(1)
df_temp_timeinterval
df_temp_timeinterval['time_span'] = df_temp_timeinterval['LogInfo3'] - df_temp_timeinterval['LogInfo4']
# NaT (each user's first login) counts as a zero-day gap
df_temp_timeinterval['time_span'] = df_temp_timeinterval['time_span'].dt.days.fillna(0).astype(int)
df_temp_timeinterval
avg_log_timespan = df_temp_timeinterval.groupby('Idx', as_index=False).time_span.mean().rename(columns={'time_span': 'avg_log_timespan'})

log_info = pd.merge(log_cnt, time_log_span, how='left', on='Idx')
log_info = pd.merge(log_info, avg_log_timespan, how='left', on='Idx')
log_info.head()
log_info.to_csv(r'D:\Py_Data\拍拍贷“魔镜杯”风控初赛数据\log_info_feature.csv', encoding='gbk', index=False)

# The Userupdate table
# Derived variables:
# 1) days between the most recent update and the listing date
# 2) total number of updates
# 3) update count per information category
# 4) number of distinct update dates

# 1) days between the most recent update and the listing date
df_Userupdate['ListingInfo1'] = pd.to_datetime(df_Userupdate['ListingInfo1'])
df_Userupdate['UserupdateInfo2'] = pd.to_datetime(df_Userupdate['UserupdateInfo2'])
time_span = df_Userupdate.groupby('Idx', as_index=False).agg({'UserupdateInfo2': 'max', 'ListingInfo1': 'max'})
time_span['update_timespan'] = (time_span['ListingInfo1'] - time_span['UserupdateInfo2']).dt.days
time_span = time_span[['Idx', 'update_timespan']]

# Count, for each user, the number of update dates per information category
group = df_Userupdate.groupby(['Idx', 'UserupdateInfo1'], as_index=False).agg({'UserupdateInfo2': pd.Series.nunique})

# 3) pivot those counts into one column per update category
user_df_list = []
for idx in group.Idx.unique():
    user_df = group[group.Idx == idx]
    change_cate = list(user_df.UserupdateInfo1)
    change_cnt = list(user_df.UserupdateInfo2)
    user_col = ['Idx'] + change_cate
    user_value = [user_df.iloc[0]['Idx']] + change_cnt
    user_df2 = pd.DataFrame(np.array(user_value).reshape(1, len(user_value)), columns=user_col)
    user_df_list.append(user_df2)
cate_change_df = pd.concat(user_df_list, axis=0)
cate_change_df.head()

# Fill the gaps in cate_change_df with 0
cate_change_df = cate_change_df.fillna(0)
cate_change_df.shape

df_Userupdate

# 2) and 4) total update count and number of distinct update dates
update_cnt = df_Userupdate.groupby('Idx', as_index=False).agg(
    {'UserupdateInfo2': pd.Series.nunique, 'ListingInfo1': pd.Series.count}).\
    rename(columns={'UserupdateInfo2': 'update_time_cnt', 'ListingInfo1': 'update_all_cnt'})
update_cnt.head()

# Join the three temporary feature tables
update_info = pd.merge(time_span, cate_change_df, on='Idx', how='left')
update_info = pd.merge(update_info, update_cnt, on='Idx', how='left')
update_info.head()
# Save to disk
update_info.to_csv(r'D:\Py_Data\拍拍贷“魔镜杯”风控初赛数据\update_feature.csv', encoding='gbk', index=False)
df_Master.to_csv(r'D:\Py_Data\拍拍贷“魔镜杯”风控初赛数据\df_Master_tackled.csv', encoding='gbk', index=False)

# Merge the three tables
df_Master_tackled = pd.read_csv('df_Master_tackled.csv', encoding='gbk')
df_LogInfo_tackled = pd.read_csv('log_info_feature.csv', encoding='gbk')
df_Userupdate_tackled = pd.read_csv('update_feature.csv', encoding='gbk')

df_final = pd.merge(df_Master_tackled, df_LogInfo_tackled, on='Idx', how='left')
df_final = pd.merge(df_final, df_Userupdate_tackled, on='Idx', how='left')
df_final.shape

######################################### Feature selection #####################################
# Select features with LightGBM: train 10 models, average their feature importances,
# then normalize the averaged values.
# The train/test sets were merged only for feature processing; now split them again.
# The 30k labeled samples are split into a training and a test part;
# the 20k unlabeled samples are kept as the prediction set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_final[df_final.sample_status == 'train'].drop(['Idx', 'sample_status', 'target', 'ListingInfo'], axis=1),
    df_final[df_final.sample_status == 'train']['target'],
    test_size=0.3, random_state=0)

train_fea = np.array(X_train)
test_fea = np.array(X_test)
evaluate_fea = np.array(df_final[df_final.sample_status == 'test'].drop(['Idx', 'sample_status', 'target', 'ListingInfo'], axis=1))

# Labels as 1-D arrays (avoids sklearn's column-vector warning)
train_label = np.array(y_train).ravel()
test_label = np.array(y_test).ravel()
evaluate_label = np.array(df_final[df_final.sample_status == 'test']['target']).ravel()

fea_names = list(X_train.columns)
feature_importance_values = np.zeros(len(fea_names))

# Train 10 LightGBM models and average their feature_importances_
import lightgbm as lgb
from lightgbm import plot_importance

for i in np.arange(10):
    model = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05, n_jobs=-1, verbose=-1)
    model.fit(train_fea, train_label, eval_metric='auc',
              eval_set=[(test_fea, test_label)],
              early_stopping_rounds=100, verbose=-1)
    feature_importance_values += model.feature_importances_ / 10

# Store the averaged importances in a temporary table
fea_imp_df1 = pd.DataFrame({'feature': fea_names, 'fea_importance': feature_importance_values})
fea_imp_df1 = fea_imp_df1.sort_values('fea_importance', ascending=False).reset_index(drop=True)
fea_imp_df1['norm_importance'] = fea_imp_df1['fea_importance'] / fea_imp_df1['fea_importance'].sum()  # normalized importance
fea_imp_df1['cum_importance'] = np.cumsum(fea_imp_df1['norm_importance'])  # cumulative importance
fea_imp_df1

# Visualize feature importance
plt.figure(figsize=(16, 16))
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.subplot(3, 1, 1)
plt.title('Feature importance (top 10)')
sns.barplot(data=fea_imp_df1.iloc[:10, :], x='norm_importance', y='feature')

plt.subplot(3, 1, 2)
plt.title('Cumulative feature importance')
plt.xlabel('number of features')
plt.ylabel('cum_importance')
plt.plot(list(range(1, len(fea_names) + 1)), fea_imp_df1['cum_importance'], 'r-')

plt.subplot(3, 1, 3)
plt.title('Normalized importance per feature')
plt.xlabel('feature')
plt.ylabel('norm_importance')
plt.plot(fea_imp_df1.feature, fea_imp_df1['norm_importance'], 'b*-')
plt.show()

# Drop features with zero importance
zero_imp_col = list(fea_imp_df1[fea_imp_df1.fea_importance == 0].feature)
fea_imp_df11 = fea_imp_df1[~(fea_imp_df1.feature.isin(zero_imp_col))]
print('Number of zero-importance features: {}'.format(len(zero_imp_col)))
print(zero_imp_col)

# Drop weakly important features (beyond 99% cumulative importance)
low_imp_col = list(fea_imp_df11[fea_imp_df11.cum_importance >= 0.99].feature)
print('Number of weakly important features: {}'.format(len(low_imp_col)))
print(low_imp_col)
# Drop the zero- and low-importance features
drop_imp_col = zero_imp_col + low_imp_col
mydf_final_fea_selected = df_final.drop(drop_imp_col, axis=1)
mydf_final_fea_selected.shape
# (49701, 160)

mydf_final_fea_selected.to_csv(r'D:\Py_Data\拍拍贷“魔镜杯”风控初赛数据\mydf_final_fea_selected.csv', encoding='gbk', index=False)

############################################## Modeling #########################################
# With features selected, split the data again into training and test parts,
# tune the parameters, and use the best model to predict the 20k unlabeled samples.

# Load the data for modeling
df = pd.read_csv('mydf_final_fea_selected.csv', encoding='gbk')

x_data = df[df.sample_status == 'train'].drop(['Idx', 'sample_status', 'target', 'ListingInfo'], axis=1)
y_data = df[df.sample_status == 'train']['target']

# Train/test split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2)

# Train a baseline model
lgb_sklearn = lgb.LGBMClassifier(random_state=0).fit(x_train, y_train)

# Predict the test part
lgb_sklearn_pre = lgb_sklearn.predict_proba(x_test)

# Compute and plot ROC / AUC
from sklearn.metrics import roc_curve, auc

def acu_curve(y, prob):
    # y: true labels; prob: predicted probabilities
    fpr, tpr, threshold = roc_curve(y, prob)  # true-positive and false-positive rates
    roc_auc = auc(fpr, tpr)                   # AUC value
    plt.figure()
    lw = 2
    plt.figure(figsize=(12, 10))
    plt.plot(fpr, tpr, color='darkorange', lw=lw,
             label='ROC curve (AUC = %0.3f)' % roc_auc)  # FPR on x, TPR on y
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('AUC')
    plt.legend(loc="lower right")
    plt.show()

acu_curve(y_test, lgb_sklearn_pre[:, 1])

# The above uses the sklearn API; below is the native API
import time

lgb_train = lgb.Dataset(x_train, y_train)
lgb_test = lgb.Dataset(x_test, y_test, reference=lgb_train)

lgb_origi_params = {'boosting_type': 'gbdt',
                    'max_depth': -1,
                    'num_leaves': 31,
                    'bagging_fraction': 1.0,
                    'feature_fraction': 1.0,
                    'learning_rate': 0.1,
                    'metric': 'auc'}
start = time.time()
lgb_origi = lgb.train(train_set=lgb_train, early_stopping_rounds=10, num_boost_round=400,
                      params=lgb_origi_params, valid_sets=lgb_test)
end = time.time()
print('Elapsed time: {} s'.format(round(end - start, 0)))

# AUC of the native model
lgb_origi_pre = lgb_origi.predict(x_test)
acu_curve(y_test, lgb_origi_pre)

######################################## LightGBM parameter tuning ##############################
# Determine the best number of boosting rounds with the learning rate fixed at 0.1
base_parmas = {'boosting_type': 'gbdt',           # algorithm; alternatives: rf, dart, goss
               'learning_rate': 0.1,
               'num_leaves': 40,                  # leaves per tree, default 31
               'max_depth': -1,                   # max tree depth, -1 = no limit
               'bagging_fraction': 0.8,           # row subsampling per iteration
               'feature_fraction': 0.8,           # feature subsampling per iteration
               'lambda_l1': 0,                    # L1 regularization
               'lambda_l2': 0,                    # L2 regularization
               'min_data_in_leaf': 20,            # min samples per leaf, default 20; larger values curb overfitting
               'min_sum_hessian_in_leaf': 0.001,  # min hessian sum per leaf; also curbs overfitting
               'metric': 'auc'}

cv_result = lgb.cv(train_set=lgb_train,
                   num_boost_round=200,           # max boosting rounds, default 100
                   early_stopping_rounds=5,       # stop when the metric no longer improves
                   nfold=5, stratified=True, shuffle=True,
                   params=base_parmas, metrics='auc', seed=0)

print('Best number of rounds: {}'.format(len(cv_result['auc-mean'])))
print('Cross-validated AUC: {}'.format(max(cv_result['auc-mean'])))
# Output:
# Best number of rounds: 28
# Cross-validated AUC: 0.7136171096752256

# Tune num_leaves with a step of 5
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

param_find1 = {'num_leaves': range(10, 50, 5)}
cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
start = time.time()
grid_search1 = GridSearchCV(estimator=lgb.LGBMClassifier(learning_rate=0.1, n_estimators=28, max_depth=-1,
                                                         min_child_weight=0.001, min_child_samples=20,
                                                         subsample=0.8, colsample_bytree=0.8,
                                                         reg_lambda=0, reg_alpha=0),
                            cv=cv_fold, n_jobs=-1, param_grid=param_find1, scoring='roc_auc')
grid_search1.fit(x_train, y_train)
end = time.time()
print('Elapsed time: {} s'.format(round(end - start, 0)))
print(grid_search1.best_params_)
print(grid_search1.best_score_)

# Tune num_leaves with a step of 1
param_find2 = {'num_leaves': range(40, 50, 1)}
grid_search2 = GridSearchCV(estimator=lgb.LGBMClassifier(n_estimators=28, learning_rate=0.1,
                                                         min_child_weight=0.001, min_child_samples=20,
                                                         subsample=0.8, colsample_bytree=0.8,
                                                         reg_lambda=0, reg_alpha=0),
                            cv=cv_fold, n_jobs=-1, scoring='roc_auc', param_grid=param_find2)
grid_search2.fit(x_train, y_train)
print(grid_search2.best_params_)
print(grid_search2.best_score_)

# num_leaves settles at 41; next tune min_child_samples (step 5) and min_child_weight
param_find3 = {'min_child_samples': range(15, 35, 5),
               'min_child_weight': [x / 1000 for x in range(1, 4, 1)]}
grid_search3 = GridSearchCV(estimator=lgb.LGBMClassifier(n_estimators=28, learning_rate=0.1, num_leaves=41,
                                                         subsample=0.8, colsample_bytree=0.8,
                                                         reg_lambda=0, reg_alpha=0),
                            cv=cv_fold, scoring='roc_auc', param_grid=param_find3, n_jobs=-1)
start = time.time()
grid_search3.fit(x_train, y_train)
end = time.time()
print('Elapsed time: {} s'.format(round(end - start, 0)))
print(grid_search3.best_params_)
print(grid_search3.best_score_)

# min_child_weight settles at 0.001 and min_child_samples at 20;
# next tune subsample and colsample_bytree
param_find4 = {'subsample': [x / 10 for x in range(5, 11, 1)],
               'colsample_bytree': [x / 10 for x in range(5, 11, 1)]}
grid_search4 = GridSearchCV(estimator=lgb.LGBMClassifier(n_estimators=28, learning_rate=0.1,
                                                         min_child_samples=20, min_child_weight=0.001,
                                                         num_leaves=41, reg_lambda=0, reg_alpha=0),
                            cv=cv_fold, scoring='roc_auc', param_grid=param_find4, n_jobs=-1)
start = time.time()
grid_search4.fit(x_train, y_train)
end = time.time()
print('Elapsed time: {} s'.format(round(end - start, 0)))
print(grid_search4.best_params_)
print(grid_search4.best_score_)

# Then tune reg_lambda and reg_alpha
param_find5 = {'reg_lambda': [0.001, 0.01, 0.03, 0.08, 0.1, 0.3],
               'reg_alpha': [0.001, 0.01, 0.03, 0.08, 0.1, 0.3]}
grid_search5 = GridSearchCV(estimator=lgb.LGBMClassifier(n_estimators=28, learning_rate=0.1,
                                                         min_child_samples=20, min_child_weight=0.001,
                                                         num_leaves=41, subsample=0.5, colsample_bytree=0.8),
                            cv=cv_fold, scoring='roc_auc', param_grid=param_find5, n_jobs=-1)
start = time.time()
grid_search5.fit(x_train, y_train)
end = time.time()
print('Elapsed time: {} s'.format(round(end - start, 0)))
print(grid_search5.best_params_)
print(grid_search5.best_score_)

# Finally tune the learning rate
param_find6 = {'learning_rate': [0.001, 0.002, 0.003, 0.004, 0.005, 0.01, 0.03, 0.08, 0.1, 0.3, 0.5]}
grid_search6 = GridSearchCV(estimator=lgb.LGBMClassifier(n_estimators=28, min_child_samples=20,
                                                         min_child_weight=0.001, num_leaves=41,
                                                         subsample=0.5, colsample_bytree=0.8,
                                                         reg_alpha=0.1, reg_lambda=0.3),
                            cv=cv_fold, scoring='roc_auc', param_grid=param_find6, n_jobs=-1)
start = time.time()
grid_search6.fit(x_train, y_train)
end = time.time()
print('Elapsed time: {} s'.format(round(end - start, 0)))
print(grid_search6.best_params_)
print(grid_search6.best_score_)

# Feed the best parameters back into lgb.cv
best_params = {'boosting_type': 'gbdt',
               'learning_rate': 0.08,
               'num_leaves': 41,
               'max_depth': -1,
               'bagging_fraction': 0.5,
               'feature_fraction': 0.8,
               'min_data_in_leaf': 20,
               'min_sum_hessian_in_leaf': 0.001,
               'lambda_l1': 0.1,
               'lambda_l2': 0.3,
               'metric': 'auc'}

best_cv = lgb.cv(train_set=lgb_train, early_stopping_rounds=5, num_boost_round=200, nfold=5,
                 params=best_params, metrics='auc', stratified=True, shuffle=True, seed=0)

print('Best number of rounds: {}'.format(len(best_cv['auc-mean'])))
print('Cross-validated AUC: {}'.format(max(best_cv['auc-mean'])))
# Best number of rounds: 50
# Cross-validated AUC: 0.7167089545162871

# Final model with the tuned parameters
lgb_single_model = lgb.LGBMClassifier(n_estimators=50, learning_rate=0.08, min_child_weight=0.001,
                                      min_child_samples=20, subsample=0.5, colsample_bytree=0.8,
                                      num_leaves=41, max_depth=-1, reg_lambda=0.3, reg_alpha=0.1,
                                      random_state=0)
lgb_single_model.fit(x_train, y_train)

pre = lgb_single_model.predict_proba(x_test)[:, 1]
acu_curve(y_test, pre)
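The code above stops at the validation AUC. For completeness, here is a hedged sketch of scoring the 20,000 unlabeled samples with the tuned model, reusing df and lgb_single_model from above (the submission file name is illustrative, not from the original post):

# Score the unlabeled samples (sample_status == 'test') with the tuned model
x_eval = df[df.sample_status == 'test'].drop(['Idx', 'sample_status', 'target', 'ListingInfo'], axis=1)
eval_pred = lgb_single_model.predict_proba(x_eval)[:, 1]

submission = pd.DataFrame({'Idx': df[df.sample_status == 'test']['Idx'].values,
                           'score': eval_pred})
submission.to_csv('lgb_submission.csv', index=False)   # illustrative output file name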
Summary
Merging the training and test sets for joint feature processing, deriving features from the login and update tables, selecting features with averaged LightGBM importances, and grid-searching the LightGBM parameters raised the cross-validated AUC from 0.7136 with the initial parameters to 0.7167 with the tuned ones.