2019 Xiamen International Bank "数创金融杯" Data Modeling Competition: A Summary
Competition Overview
The competition was jointly organized by Xiamen International Bank and the Data Mining Research Center of Xiamen University, and hosted by the Xiamen International Bank - Xiamen University Data Mining Research Center "数创金融" Joint Laboratory.
Data download: https://download.csdn.net/download/weixin_35770067/13718841
Data Overview
The data comprises a training set and a test set, delivered as three files: train_x.csv (training-set features), train_target.csv (training-set target variable), and test_x.csv (test-set features). To strengthen generalization, the training set combines samples from two stages, flagged by the isNew field. The test set carries the same feature variables as the training set. The modeling task is to train on the training set and produce predictions for the test set.
Field Descriptions
a) Basic user attributes
id, target, certId, gender, age, dist, edu, job, ethnic, highestEdu, certValidBegin, certValidStop
b) Loan-related information
loanProduct, lmt, basicLevel, bankCard, residentAddr, linkRela, setupHour, weekday
c) User credit-record information
x_0 through x_78, plus ncloseCreditCard, unpayIndvLoan, unpayOtherLoan, unpayNormalLoan, and 5yearBadloan
These fields involve relatively sensitive third-party data, so no further description is given.
Evaluation
Rankings are determined by AUC on the test set.
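For local scoring, the same metric can be reproduced with scikit-learn's roc_auc_score; a minimal sketch with toy labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Toy example: true 0/1 labels and predicted probabilities of target=1
y_true = [0, 0, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.20, 0.80]
print(roc_auc_score(y_true, y_score))  # 0.8333... (5 of 6 positive/negative pairs correctly ordered)
```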
RF-basemodel(0.75+)
We started by running a RandomForest classifier over all the data to see what the default parameters produce.
```python
# Use all features:
# ['id', 'certId', 'loanProduct', 'gender', 'age', 'dist', 'edu', 'job', 'lmt', 'basicLevel',
#  'x_0' ... 'x_79', 'certValidBegin', 'certBalidStop', 'bankCard', 'ethnic', 'residentAddr',
#  'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan',
#  'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan', 'target']
x_columns = [x for x in train_data.columns if x not in ["target", "id"]]
rf = RandomForestClassifier()
```

AUC Score (Train): 0.545862
The resulting AUC was only 0.545862, barely better than random guessing.
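For context, the harness behind these "AUC Score (Train)" numbers looks roughly like the sketch below; the file path and the held-out validation split are our assumptions, not code from the original write-up:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

train_data = pd.read_csv('data/train_data_target.csv')  # assumed path (features joined with target)
train_data.fillna(0, inplace=True)  # random forest cannot handle NaN
x_columns = [x for x in train_data.columns if x not in ["target", "id"]]

# Hold out part of the training data for local AUC
X_tr, X_va, y_tr, y_va = train_test_split(train_data[x_columns], train_data["target"],
                                          test_size=0.2, random_state=10)
rf = RandomForestClassifier()
rf.fit(X_tr, y_tr)
print("AUC Score (Train): %f" % roc_auc_score(y_va, rf.predict_proba(X_va)[:, 1]))
```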
Here are the results after some parameter tuning:
```python
rf = RandomForestClassifier(n_estimators=100, random_state=10)               # AUC Score (Train): 0.632956
rf = RandomForestClassifier(n_estimators=90, random_state=10)                # AUC Score (Train): 0.638696
rf = RandomForestClassifier(n_estimators=80, random_state=10)                # AUC Score (Train): 0.633332
rf = RandomForestClassifier(n_estimators=90, max_depth=4, random_state=10)   # AUC Score (Train): 0.687838
rf = RandomForestClassifier(n_estimators=90, max_depth=6, random_state=10)   # AUC Score (Train): 0.685170
rf = RandomForestClassifier(n_estimators=90, max_depth=8, random_state=10)   # AUC Score (Train): 0.653320
rf = RandomForestClassifier(n_estimators=90, max_depth=10, random_state=10)  # AUC Score (Train): 0.636410
```
These quick tuning runs show that the parameters matter a great deal: AUC ranges from 0.545862 at worst to 0.687838 at best, a spread of roughly 14 percentage points. This is surely not optimal yet; there is still plenty of room to tune.
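Rather than stepping through values by hand, the same search can be automated; a sketch using scikit-learn's GridSearchCV, with a grid chosen to bracket the values tried above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [80, 90, 100], "max_depth": [4, 6, 8, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=10), param_grid,
                      scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X_tr, y_tr)  # X_tr, y_tr: the training split from the harness sketch above
print(search.best_params_, search.best_score_)
```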
That covers the initial tuning; now let's look at the features.
So far we have used every feature. Next we use the random forest's built-in feature_importances_ attribute to select more informative ones. Code:
```python
import numpy as np

importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
feat_labels = X_train.columns
# Inter-tree variability of the importance estimates
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)

# Print each feature's importance, most important first
print("Feature ranking:")
for f in range(X_train.shape[1]):
    print("%d. feature no:%d feature name:%s (%f)"
          % (f + 1, indices[f], feat_labels[indices[f]], importances[indices[f]]))
```

```
Feature ranking:
1. feature no:7 feature name:lmt (0.119897)
2. feature no:90 feature name:certBalidStop (0.070063)
3. feature no:91 feature name:bankCard (0.065635)
4. feature no:89 feature name:certValidBegin (0.061998)
5. feature no:93 feature name:residentAddr (0.055272)
6. feature no:0 feature name:certId (0.054448)
7. feature no:4 feature name:dist (0.048813)
8. feature no:8 feature name:basicLevel (0.042018)
9. feature no:97 feature name:weekday (0.040811)
10. feature no:96 feature name:setupHour (0.040214)
11. feature no:54 feature name:x_45 (0.038700)
12. feature no:3 feature name:age (0.031389)
13. feature no:1 feature name:loanProduct (0.028978)
14. feature no:95 feature name:linkRela (0.027006)
15. feature no:100 feature name:unpayOtherLoan (0.026191)
16. feature no:6 feature name:job (0.018915)
17. feature no:29 feature name:x_20 (0.018539)
18. feature no:55 feature name:x_46 (0.016263)
19. feature no:82 feature name:x_73 (0.015427)
20. feature no:42 feature name:x_33 (0.014756)
21. feature no:44 feature name:x_35 (0.009275)
22. feature no:92 feature name:ethnic (0.008969)
23. feature no:34 feature name:x_25 (0.008467)
24. feature no:71 feature name:x_62 (0.008017)
25. feature no:37 feature name:x_28 (0.007177)
26. feature no:2 feature name:gender (0.007070)
27. feature no:76 feature name:x_67 (0.006776)
28. feature no:85 feature name:x_76 (0.006183)
29. feature no:101 feature name:unpayNormalLoan (0.005641)
30. feature no:72 feature name:x_63 (0.005626)
31. feature no:98 feature name:ncloseCreditCard (0.005433)
32. feature no:81 feature name:x_72 (0.005120)
33. feature no:77 feature name:x_68 (0.004969)
34. feature no:43 feature name:x_34 (0.004652)
35. feature no:70 feature name:x_61 (0.004451)
36. feature no:35 feature name:x_26 (0.003792)
37. feature no:63 feature name:x_54 (0.003617)
38. feature no:60 feature name:x_51 (0.003151)
39. feature no:56 feature name:x_47 (0.003083)
40. feature no:25 feature name:x_16 (0.002995)
41. feature no:23 feature name:x_14 (0.002979)
42. feature no:36 feature name:x_27 (0.002700)
43. feature no:32 feature name:x_23 (0.002591)
44. feature no:99 feature name:unpayIndvLoan (0.002557)
45. feature no:80 feature name:x_71 (0.002379)
46. feature no:83 feature name:x_74 (0.002353)
47. feature no:68 feature name:x_59 (0.002294)
48. feature no:84 feature name:x_75 (0.002284)
49. feature no:61 feature name:x_52 (0.001965)
50. feature no:26 feature name:x_17 (0.001933)
51. feature no:10 feature name:x_1 (0.001912)
52. feature no:9 feature name:x_0 (0.001882)
53. feature no:31 feature name:x_22 (0.001662)
54. feature no:52 feature name:x_43 (0.001651)
55. feature no:74 feature name:x_65 (0.001631)
56. feature no:62 feature name:x_53 (0.001578)
57. feature no:13 feature name:x_4 (0.001530)
58. feature no:57 feature name:x_48 (0.001484)
59. feature no:59 feature name:x_50 (0.001357)
60. feature no:11 feature name:x_2 (0.001116)
61. feature no:16 feature name:x_7 (0.000877)
62. feature no:48 feature name:x_39 (0.000832)
63. feature no:102 feature name:5yearBadloan (0.000797)
64. feature no:64 feature name:x_55 (0.000787)
65. feature no:30 feature name:x_21 (0.000786)
66. feature no:47 feature name:x_38 (0.000759)
67. feature no:19 feature name:x_10 (0.000694)
68. feature no:66 feature name:x_57 (0.000653)
69. feature no:50 feature name:x_41 (0.000548)
70. feature no:20 feature name:x_11 (0.000508)
71. feature no:65 feature name:x_56 (0.000500)
72. feature no:17 feature name:x_8 (0.000400)
73. feature no:15 feature name:x_6 (0.000390)
74. feature no:79 feature name:x_70 (0.000378)
75. feature no:94 feature name:highestEdu (0.000355)
76. feature no:75 feature name:x_66 (0.000229)
77. feature no:53 feature name:x_44 (0.000226)
78. feature no:21 feature name:x_12 (0.000183)
79. feature no:58 feature name:x_49 (0.000129)
80. feature no:38 feature name:x_29 (0.000120)
81. feature no:51 feature name:x_42 (0.000112)
82. feature no:73 feature name:x_64 (0.000096)
83. feature no:39 feature name:x_30 (0.000005)
84. feature no:24 feature name:x_15 (0.000000)
85. feature no:40 feature name:x_31 (0.000000)
86. feature no:88 feature name:x_79 (0.000000)
87. feature no:87 feature name:x_78 (0.000000)
88. feature no:86 feature name:x_77 (0.000000)
89. feature no:5 feature name:edu (0.000000)
90. feature no:41 feature name:x_32 (0.000000)
91. feature no:78 feature name:x_69 (0.000000)
92. feature no:45 feature name:x_36 (0.000000)
93. feature no:22 feature name:x_13 (0.000000)
94. feature no:67 feature name:x_58 (0.000000)
95. feature no:12 feature name:x_3 (0.000000)
96. feature no:46 feature name:x_37 (0.000000)
97. feature no:14 feature name:x_5 (0.000000)
98. feature no:33 feature name:x_24 (0.000000)
99. feature no:49 feature name:x_40 (0.000000)
100. feature no:28 feature name:x_19 (0.000000)
101. feature no:18 feature name:x_9 (0.000000)
102. feature no:27 feature name:x_18 (0.000000)
103. feature no:69 feature name:x_60 (0.000000)
```

Based on this importance ranking, we drop the features with zero importance and test again:
```python
x_columns = ['id', 'certId', 'loanProduct', 'gender', 'age', 'dist', 'job', 'lmt', 'basicLevel',
             'x_0', 'x_1', 'x_2', 'x_4', 'x_6', 'x_7', 'x_8', 'x_10', 'x_11', 'x_12', 'x_14',
             'x_16', 'x_17', 'x_20', 'x_21', 'x_22', 'x_23', 'x_25', 'x_26', 'x_27', 'x_28',
             'x_29', 'x_30', 'x_33', 'x_34', 'x_35', 'x_38', 'x_39', 'x_41', 'x_42', 'x_43',
             'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53',
             'x_54', 'x_55', 'x_56', 'x_57', 'x_59', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65',
             'x_66', 'x_67', 'x_68', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76',
             'certValidBegin', 'certBalidStop', 'bankCard', 'ethnic', 'residentAddr',
             'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard',
             'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan']
rf = RandomForestClassifier(n_estimators=90, max_depth=4, random_state=10)
```

AUC Score (Train): 0.681259
After removing the features the random forest itself rated as unimportant, performance actually got worse.
Proceeding in the same way, we print the feature importances again and drop the next batch of zero-importance features.
```python
x_columns = ['certId', 'loanProduct', 'gender', 'age', 'dist', 'job', 'lmt', 'basicLevel',
             'x_1', 'x_2', 'x_4', 'x_6', 'x_8', 'x_12', 'x_14', 'x_16', 'x_17', 'x_20',
             'x_21', 'x_23', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_33', 'x_34',
             'x_35', 'x_39', 'x_41', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_49', 'x_50',
             'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_57', 'x_61', 'x_62', 'x_63', 'x_64',
             'x_65', 'x_66', 'x_67', 'x_68', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75',
             'x_76', 'certValidBegin', 'certBalidStop', 'bankCard', 'ethnic', 'residentAddr',
             'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard',
             'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan']
rf = RandomForestClassifier(n_estimators=90, max_depth=4, random_state=10)
```

AUC Score (Train): 0.677848
Dropping another round of features that the rf rated unimportant made things worse yet again.
After one more round of the same pruning, however:

AUC Score (Train): 0.690318

The first two rounds steadily hurt the score, yet now the AUC finally improved a little; the interplay between pruning and model behavior is quite subtle.
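This drop-and-retrain loop is easy to automate; a sketch (not the original code, reusing train_data from the harness sketch above) that keeps pruning zero-importance features until none remain:

```python
cols = [c for c in train_data.columns if c not in ("target", "id")]
while True:
    rf = RandomForestClassifier(n_estimators=90, max_depth=4, random_state=10)
    rf.fit(train_data[cols], train_data["target"])
    kept = [c for c, imp in zip(cols, rf.feature_importances_) if imp > 0]
    if len(kept) == len(cols):
        break  # nothing left to prune
    cols = kept  # drop this round's zero-importance features and refit
```

As the swings above show, each round should still be re-scored rather than trusted blindly.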
最后,我們暫且將0.690318 作為最好的結(jié)果(其實(shí)并不是)。性能提升遠(yuǎn)不止于此,初次模型的調(diào)參和特征選擇到此結(jié)束了。如果再使用特征工程、規(guī)則、交叉驗(yàn)證等的一些方法,效果肯定會(huì)更好,初次這里就只是進(jìn)行了簡(jiǎn)單的調(diào)參和特征選取。
XGBoost-basemodel(76+)
The previous base model was built on random forest and reached 75+ online; this time we tried XGBoost, which reaches 76+ online.
First, we run XGBoost's classifier with default parameters to see the effect.
```python
# Use all features:
# ['id', 'certId', 'loanProduct', 'gender', 'age', 'dist', 'edu', 'job', 'lmt', 'basicLevel',
#  'x_0' ... 'x_78', 'certValidBegin', 'certValidStop', 'bankCard', 'ethnic', 'residentAddr',
#  'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan',
#  'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan', 'isNew', 'target']
x_columns = [x for x in train_data.columns if x not in ["target", "id"]]
xgboost = xgb.XGBClassifier()
```

AUC Score (Train): 0.703644
Recall that the random forest's default parameters only reached 0.545862; the gap between the two is considerable.
First, the effect of parameter tuning:
```python
xgboost = xgb.XGBClassifier(max_depth=6, n_estimators=100)  # AUC Score (Train): 0.702864
xgboost = xgb.XGBClassifier(max_depth=6, n_estimators=200)  # AUC Score (Train): 0.688059
```
These quick runs again show that parameters strongly affect AUC, but nothing beat the defaults, so we abandoned manual tuning here. (Later we switch to grid search, which takes longer to train.)
Next, the effect of feature selection on the results.
All the tests above used the full feature set; next we use XGBoost's feature_importances_ attribute to pick out more informative features.
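The listing is analogous to the random-forest version above; a minimal sketch, assuming the xgboost classifier has already been fit on X_train:

```python
import numpy as np

importances = xgboost.feature_importances_  # XGBClassifier exposes the same attribute
indices = np.argsort(importances)[::-1]
for rank, idx in enumerate(indices, start=1):
    print("%d. feature name:%s (%f)" % (rank, X_train.columns[idx], importances[idx]))
```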
After pruning features by importance exactly as with the random forest, the AUC did not change, so we submitted the result directly; it reached 76+ online.
XGBoost-KFold(77+)
Building on the previous XGBoost base model (76+ online), we then tried cross-validation with different numbers of folds, reaching 77+ online.
Below are the code and the best results for 5-fold, 7-fold, and 8-fold cross-validation.
```python
# Use all features (same list as in the previous section, including 'isNew')
x_columns = [x for x in train_data.columns if x not in ["target", "id"]]
# ......
n_splits = 7
kf = KFold(n_splits=n_splits, shuffle=True, random_state=1234)
for train_index, test_index in kf.split(X_train):
    xgboost = xgb.XGBClassifier()
```

5-fold CV: AUC Score (Train): 0.7245306571511836
7-fold CV: AUC Score (Train): 0.7306788309565827
8-fold CV: AUC Score (Train): 0.7511906354858096
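In full, the loop body elided above trains a model on each fold and scores the out-of-fold predictions; a sketch under that reading (variable names such as y_train are assumptions):

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

kf = KFold(n_splits=7, shuffle=True, random_state=1234)
oof = np.zeros(len(X_train))  # out-of-fold predictions
for train_index, test_index in kf.split(X_train):
    model = xgb.XGBClassifier()
    model.fit(X_train.iloc[train_index], y_train.iloc[train_index])
    oof[test_index] = model.predict_proba(X_train.iloc[test_index])[:, 1]
print("AUC Score (Train):", roc_auc_score(y_train, oof))
```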
All of these reached 77+ online. Offline, AUC rising with the fold count is reasonable: each fold's model trains on a larger share of the data.
XGBoost-KFold + Feature Engineering
```python
# Concatenate train and test so transformations are applied to both consistently
train_test_data = pd.concat([X_train, X_predict], axis=0, ignore_index=True)

# Type conversion: turn the raw certificate timestamps into datetimes
train_test_data['certBeginDt'] = pd.to_datetime(train_test_data["certValidBegin"] * 1000000000) \
                                 - pd.offsets.DateOffset(years=70)
print("time >>>", train_test_data['certBeginDt'])
train_test_data = train_test_data.drop(['certValidBegin'], axis=1)
train_test_data['certStopDt'] = pd.to_datetime(train_test_data["certValidStop"] * 1000000000) \
                                - pd.offsets.DateOffset(years=70)
train_test_data = train_test_data.drop(['certValidStop'], axis=1)

# Feature combination: certificate validity duration
train_test_data["certStopDt" + "certBeginDt"] = train_test_data["certStopDt"] - train_test_data["certBeginDt"]
print("train_test_data>>>>>>", train_test_data["certStopDtcertBeginDt"])

print("Binning")
train_test_data["age_bin"] = pd.cut(train_test_data["age"], 20, labels=False)
train_test_data = train_test_data.drop(['age'], axis=1)
train_test_data["dist_bin"] = pd.qcut(train_test_data["dist"], 60, labels=False)
train_test_data = train_test_data.drop(['dist'], axis=1)
train_test_data["lmt_bin"] = pd.qcut(train_test_data["lmt"], 50, labels=False)
train_test_data = train_test_data.drop(['lmt'], axis=1)
train_test_data["setupHour_bin"] = pd.qcut(train_test_data["setupHour"], 10, labels=False)
train_test_data = train_test_data.drop(['setupHour'], axis=1)
train_test_data["certStopDtcertBeginDt_bin"] = pd.cut(train_test_data["certStopDtcertBeginDt"], 30, labels=False)
train_test_data = train_test_data.drop(['certStopDtcertBeginDt'], axis=1)
train_test_data["certBeginDt_bin"] = pd.cut(train_test_data["certBeginDt"], 30, labels=False)
train_test_data = train_test_data.drop(['certBeginDt'], axis=1)
train_test_data["certStopDt_bin"] = pd.cut(train_test_data["certStopDt"], 30, labels=False)
train_test_data = train_test_data.drop(['certStopDt'], axis=1)

# Split back into train and test
X_train = train_test_data.iloc[:X_train.shape[0], :]
X_predict = train_test_data.iloc[X_train.shape[0]:, :]

print("One-hot encoding")
train_data = X_train
test_data = X_predict
# Candidate one-hot columns considered: gender, edu, job, and the x_0 ... x_78 flags;
# the final choice is below
dummy_fea = ["gender", "job", "loanProduct", "basicLevel", "ethnic"]
train_test_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
dummy_df = pd.get_dummies(train_test_data.loc[:, dummy_fea])
dummy_fea_rename_dict = {}
for per_i in dummy_df.columns.values:
    dummy_fea_rename_dict[per_i] = per_i + '_onehot'
print(">>>>>", dummy_fea_rename_dict)
dummy_df = dummy_df.rename(columns=dummy_fea_rename_dict)
train_test_data = pd.concat([train_test_data, dummy_df], axis=1)
column_headers = list(train_test_data.columns.values)
print(column_headers)
train_test_data = train_test_data.drop(dummy_fea, axis=1)
column_headers = list(train_test_data.columns.values)
print(column_headers)
train_train = train_test_data.iloc[:train_data.shape[0], :]
test_test = train_test_data.iloc[train_data.shape[0]:, :]
X_train = train_train
X_predict = test_test

# Cross-validation: same as in the previous section
# ..........

# Grid search
n_splits = 5
cv_params = {'max_depth': [4, 6, 8, 10], 'min_child_weight': [3, 4, 5, 6], 'scale_pos_weight': [5, 8, 10]}
other_params = {'learning_rate': 0.1, 'n_estimators': 4, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 1, 'reg_alpha': 1, 'reg_lambda': 1}
xgboost = xgb.XGBClassifier()
optimized_GBM = GridSearchCV(estimator=xgboost, param_grid=cv_params, scoring='roc_auc',
                             cv=n_splits, verbose=1, n_jobs=4)
xgboost_model = optimized_GBM.fit(X_train, y_train)
y_pp = xgboost_model.predict_proba(X_predict)[:, 1]
```

The gains turned out to be small. A word of advice for fellow competitors: piling feature engineering onto a model without first analyzing the data can easily hurt rather than help; the analysis needs to be targeted at the data.
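One detail in the block above worth unpacking is the timestamp trick: pd.to_datetime reads a bare integer as nanoseconds since 1970-01-01, so multiplying the raw field by 10**9 treats it as seconds, and subtracting DateOffset(years=70) shifts the epoch back to 1900. That is our reading of how certValidBegin/certValidStop are encoded; the organizers do not document it. A self-checking example:

```python
import pandas as pd

# Build a raw value that should decode to 2018-12-13 under the seconds-since-1900 reading
raw = int((pd.Timestamp('2018-12-13') + pd.offsets.DateOffset(years=70)
           - pd.Timestamp('1970-01-01')).total_seconds())
dt = pd.to_datetime(raw * 1000000000) - pd.offsets.DateOffset(years=70)
print(dt)  # 2018-12-13 00:00:00
```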
stacking-KFold(78+)
Below is the stacking (model-ensemble) code. After tuning and repeated refinement, the final online score reached 78+.
```python
# -*- coding: utf-8 -*-
from heamy.dataset import Dataset
from heamy.estimator import Regressor, Classifier
# ModelsPipeline references:
#   https://blog.csdn.net/qiqzhang/article/details/85477242
#   https://cloud.tencent.com/developer/article/1463294
from heamy.pipeline import ModelsPipeline
import pandas as pd
import xgboost as xgb
import datetime
from sklearn.metrics import roc_auc_score
# lightgbm install guide: https://blog.csdn.net/weixin_41843918/article/details/85047492
# lgb example: https://www.jianshu.com/p/c208cac3496f
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
import numpy as np
from pandas.core.frame import DataFrame


# Each first-level model returns its predicted probability of target=1.
# (Optional min-max rescaling of the predictions was tried and left commented out.)
def xgb_feature(X_train, y_train, X_test, y_test=None):
    other_params = {'learning_rate': 0.125, 'max_depth': 3}
    model = xgb.XGBClassifier(**other_params).fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


def xgb_feature2(X_train, y_train, X_test, y_test=None):
    other_params = {'learning_rate': 0.1, 'max_depth': 3}  # 'num_boost_round': 12
    model = xgb.XGBClassifier(**other_params).fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


def xgb_feature3(X_train, y_train, X_test, y_test=None):
    other_params = {'learning_rate': 0.13, 'max_depth': 3}  # 'num_boost_round': 20
    model = xgb.XGBClassifier(**other_params).fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


def rf_model(X_train, y_train, X_test, y_test=None):
    model = RandomForestClassifier(n_estimators=90, max_depth=4, random_state=10).fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


def et_model(X_train, y_train, X_test, y_test=None):
    model = ExtraTreesClassifier(max_features='log2', n_estimators=1000, n_jobs=-1).fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


def gbdt_model(X_train, y_train, X_test, y_test=None):
    model = GradientBoostingClassifier(learning_rate=0.02, max_features=0.7,
                                       n_estimators=100, max_depth=5).fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


def logistic_model(X_train, y_train, X_test, y_test=None):
    model = LogisticRegression(penalty='l2').fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


def lgb_feature(X_train, y_train, X_test, y_test=None):
    model = lgb.LGBMClassifier(boosting_type='gbdt', min_data_in_leaf=5, max_bin=200,
                               num_leaves=25, learning_rate=0.01).fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


VAILD = False

if __name__ == '__main__':
    if VAILD == False:
        train_data = pd.read_csv('data/train_data_target.csv', engine='python')
        # x_columns = [x for x in train_data.columns if x not in ["target", "id"]]
        x_columns = ['certId', 'loanProduct', 'gender', 'age', 'dist', 'edu', 'job', 'lmt', 'basicLevel',
                     'x_12', 'x_14', 'x_16', 'x_20', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_33',
                     'x_34', 'x_41', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_50', 'x_51',
                     'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65',
                     'x_66', 'x_67', 'x_68', 'x_69', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76',
                     'certValidBegin', 'certValidStop', 'bankCard', 'ethnic', 'residentAddr',
                     'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard',
                     'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan', 'isNew']
        train_data.fillna(0, inplace=True)
        test_data = pd.read_csv('data/test.csv', engine='python')
        test_data.fillna(0, inplace=True)
        train_test_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
        train_test_data.fillna(-888, inplace=True)
        # One-hot encoding of ["gender", "edu", "job"] was tried here and then disabled
        dummy_fea = []
        train_train = train_test_data.iloc[:train_data.shape[0], :]
        test_test = train_test_data.iloc[train_data.shape[0]:, :]
        train_train_x = train_train
        test_test_x = test_test
        xgb_dataset = Dataset(X_train=train_train_x, y_train=train_data['target'],
                              X_test=test_test_x, y_test=None, use_cache=False)  # heamy

        print("-" * 80)
        print("Building pipeline: ModelsPipeline(model_xgb, model_xgb2, model_xgb3, model_lgb, model_gbdt)")
        model_xgb = Regressor(dataset=xgb_dataset, estimator=xgb_feature, name='xgb', use_cache=False)
        model_xgb2 = Regressor(dataset=xgb_dataset, estimator=xgb_feature2, name='xgb2', use_cache=False)
        model_xgb3 = Regressor(dataset=xgb_dataset, estimator=xgb_feature3, name='xgb3', use_cache=False)
        model_gbdt = Regressor(dataset=xgb_dataset, estimator=gbdt_model, name='gbdt', use_cache=False)
        model_lgb = Regressor(dataset=xgb_dataset, estimator=lgb_feature, name='lgb', use_cache=False)
        model_rf = Regressor(dataset=xgb_dataset, estimator=rf_model, name='rf', use_cache=False)
        # pipeline = ModelsPipeline(model_xgb, model_xgb2, model_xgb3, model_lgb, model_gbdt, model_rf)
        pipeline = ModelsPipeline(model_xgb, model_xgb2, model_xgb3, model_lgb, model_rf)

        print("-" * 80)
        print("Stacking: pipeline.stack(k=7, seed=111, add_diff=False, full_test=True)")
        # k=7 with (model_xgb, model_xgb2, model_xgb3, model_lgb, model_rf): AUC 0.780043
        stack_ds = pipeline.stack(k=7, seed=111, add_diff=False, full_test=True)
        print("stack_ds:", stack_ds)

        print("-" * 80)
        print("Training the second-level model: LinearRegression(fit_intercept=False)")
        stacker = Regressor(dataset=stack_ds, estimator=LinearRegression,
                            parameters={'fit_intercept': False})

        print("-" * 80)
        print("Predicting:")
        predict_result = stacker.predict()
        id_list = test_data["id"].tolist()
        res = DataFrame({"id": id_list, "target": predict_result})  # dict -> DataFrame
        print(">>>>", res)
        csv_file = 'stacking_res/res_stacking.csv'
        res.to_csv(csv_file)
```

We later tried a lot more feature engineering and model fusion, but, probably owing to our limited understanding of financial risk control (and our own shortcomings), the score stopped there.
Summary
The attempts above show that, even without any feature work or parameter tuning, you can lift the score by:
- switching to a stronger model
- using cross-validation
- using model ensembling (stacking)
Of course, for later-stage improvement there are many more options: data augmentation (expanding the data, handling class imbalance), data cleaning (outliers, distributions, and so on), feature engineering (feature selection, statistical features, normalization, encoding, binning, and so on), model selection, loss functions, and model ensembling. Trying every one of these is genuinely hard; it relies on accumulated experience, such as which preprocessing each model needs, which parameters suit which data sizes, and which feature-selection methods (chi-square, variance, model-based, distribution-based, and so on) to reach for.
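As one concrete instance of the imbalance item: the grid above already searched scale_pos_weight, suggesting the positive (default) class is rare, and a cheap first pass is to weight it by the class ratio (a sketch derived from the training labels, not the original code):

```python
import xgboost as xgb

# Common heuristic: scale_pos_weight = (# negatives) / (# positives)
ratio = float((train_data["target"] == 0).sum()) / (train_data["target"] == 1).sum()
xgboost = xgb.XGBClassifier(scale_pos_weight=ratio)
```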
In short, everything comes down to regular experimentation and accumulated experience.