XGBoost Plotting API and GBDT Feature Combination in Practice

Published: 2025/3/21

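Preface:

I have recently been digging into tree models and wanted to organize my notes. Last night I happened to see that 余音 had shared MachineLearningTrick on GitHub, so I jumped in to learn from it. That repository covers several practical XGBoost tricks, with more apparently still to come, so it is worth following. I mainly looked at the function that wraps up generating new features from XGBoost leaf node positions. I had read blog posts on this before, such as Bryan's (http://blog.csdn.net/bryan__/article/details/51769118), but had never tried it myself. This post records my understanding of GBDT feature combination, with some extensions, based on Bryan's blog and 余音's code. It also explores the XGBoost Plotting API. It is mainly a learning exercise.

Official API reference: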
http://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

1. How GBDT constructs combined features

Bryan's blog and the article on constructing new features with a GBDT model (linked in the references) explain GBDT combined features well:

The idea of the paper is simple: first train a GBDT model on the existing features, then use the trees learned by the GBDT model to construct new features, and finally add these new features to the original ones and train a model on both. The constructed feature vector is 0/1-valued, and each element corresponds to one leaf node of a tree in the GBDT model. When a sample passes through a tree and ends up in one of its leaf nodes, the element for that leaf is 1 and the elements for the other leaves of that tree are 0. The length of the new feature vector equals the total number of leaf nodes across all trees in the GBDT model.

For example, suppose GBDT has learned two trees, as in the original figure: the first tree has 3 leaf nodes and the second has 2. If an input sample x lands in the second leaf of the first tree and in the first leaf of the second tree, the new feature vector obtained from GBDT is [0, 1, 0, 1, 0]. The first three positions correspond to the 3 leaves of the first tree and the last two to the 2 leaves of the second tree.
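The worked example above can be sketched in a few lines of plain Python. The function name `leaf_one_hot` and its arguments are hypothetical, introduced only for illustration; it assumes leaf positions are counted from 0 within each tree.

```python
def leaf_one_hot(leaf_positions, leaves_per_tree):
    """Concatenate one one-hot block per tree.

    leaf_positions[i]  : 0-based position of the leaf the sample reaches in tree i
    leaves_per_tree[i] : number of leaves in tree i
    """
    vector = []
    for position, n_leaves in zip(leaf_positions, leaves_per_tree):
        block = [0] * n_leaves   # one slot per leaf of this tree
        block[position] = 1      # mark the leaf the sample landed in
        vector.extend(block)
    return vector


# Second leaf of a 3-leaf tree, first leaf of a 2-leaf tree
features = leaf_one_hot([1, 0], [3, 2])
print(features)  # [0, 1, 0, 1, 0]
```

The vector length is the sum of the leaf counts (3 + 2 = 5), which matches the statement that the new feature vector is as long as the total number of leaves.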

The key point in practice is finding out which leaf node of each tree every sample lands in after training. I had seen on Zhihu that setting pred_leaf=True returns the leaf index of each sample in each tree, so I opened the official XGBoost documentation to check the API.

It turns out this parameter belongs to predict. After training on the original features with some simple parameter tuning, calling new_feature = bst.predict(d_test, pred_leaf=True) on the training data and on the test data yields an (nsample, ntrees) result matrix, that is, the leaf index of each sample in each tree. Knowing this, I read 余音's code carefully and found that it does not use this parameter.

He uses the apply() method instead. That puzzled me a bit, because I had not seen this method in the official XGBoost API, so I checked the scikit-learn GBDT API, and sure enough it has an apply() method that returns leaf indices.
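As a minimal sketch of the scikit-learn side, the snippet below calls apply() on a small GradientBoostingClassifier. The sample size and hyperparameters are arbitrary choices for illustration and do not come from the original code. For a classifier, apply() returns an array of shape (n_samples, n_estimators, n_classes), and in the binary case the last dimension has size 1.

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

# Small illustrative dataset and model (sizes chosen arbitrarily)
X, y = make_hastie_10_2(n_samples=200, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=5, max_depth=2, random_state=0)
gbdt.fit(X, y)

# Shape (n_samples, n_estimators, n_classes); n_classes is 1 for binary problems
leaves = gbdt.apply(X)
leaf_index = leaves[:, :, 0]  # drop the trailing axis: one leaf index per sample per tree
print(leaf_index.shape)
```

The resulting matrix plays the same role as the (nsample, ntrees) output of predict(..., pred_leaf=True) in XGBoost's native interface.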

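Because XGBoost offers both its native interface and a scikit-learn interface, the code differs between the two. That covers the basics of how to construct combined features with GBDT (XGBoost). Next, let's try it with both interfaces.

2. GBDT feature combination in practice

Here we go!

(1). Imports and data preparation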

    from sklearn.model_selection import train_test_split
    from pandas import DataFrame
    from sklearn import metrics
    from sklearn.datasets import make_hastie_10_2
    from xgboost.sklearn import XGBClassifier
    import xgboost as xgb

    # Prepare the data. y is originally in {-1, 1}; XGBoost's native interface
    # requires labels in {0, 1}, so -1 is mapped to 0.
    X, y = make_hastie_10_2(random_state=0)
    X = DataFrame(X)
    y = DataFrame(y)
    y.columns = ["label"]
    label = {-1: 0, 1: 1}
    y.label = y.label.map(label)
    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    y_train.head()

          label
    843       1
    9450      0
    7766      1
    9802      1
    8555      1

(2). Defining both XGBoost interfaces

    # XGBoost native interface
    params = {
        'eta': 0.3, 'max_depth': 3, 'min_child_weight': 1, 'gamma': 0.3,
        'subsample': 0.8, 'colsample_bytree': 0.8, 'booster': 'gbtree',
        'objective': 'binary:logistic', 'nthread': 12, 'scale_pos_weight': 1,
        'lambda': 1, 'seed': 27,
        'silent': 0,  # older XGBoost releases; newer ones use 'verbosity'
        'eval_metric': 'auc',
    }
    d_train = xgb.DMatrix(X_train, label=y_train)
    d_valid = xgb.DMatrix(X_test, label=y_test)
    d_test = xgb.DMatrix(X_test)
    watchlist = [(d_train, 'train'), (d_valid, 'valid')]

    # sklearn interface
    # (newer releases name nthread and seed as n_jobs and random_state)
    clf = XGBClassifier(
        n_estimators=30,  # thirty trees
        learning_rate=0.3, max_depth=3, min_child_weight=1, gamma=0.3,
        subsample=0.8, colsample_bytree=0.8, objective='binary:logistic',
        nthread=12, scale_pos_weight=1, reg_lambda=1, seed=27)

    model_bst = xgb.train(params, d_train, 30, watchlist, early_stopping_rounds=500, verbose_eval=10)
    model_sklearn = clf.fit(X_train, y_train)

    y_bst = model_bst.predict(d_test)
    y_sklearn = clf.predict_proba(X_test)[:, 1]

    print("XGBoost native interface AUC Score : %f" % metrics.roc_auc_score(y_test, y_bst))
    print("XGBoost sklearn interface AUC Score : %f" % metrics.roc_auc_score(y_test, y_sklearn))

(3). Generating the two sets of new features

    print("Original train shape:", X_train.shape)
    print("Original test shape:", X_test.shape)

    # New features from the XGBoost native interface
    train_new_feature = model_bst.predict(d_train, pred_leaf=True)
    test_new_feature = model_bst.predict(d_test, pred_leaf=True)
    train_new_feature1 = DataFrame(train_new_feature)
    test_new_feature1 = DataFrame(test_new_feature)
    print("New train features (native interface):", train_new_feature1.shape)
    print("New test features (native interface):", test_new_feature1.shape)

    # New features from the sklearn interface
    train_new_feature = clf.apply(X_train)  # leaf index of each sample in each tree
    test_new_feature = clf.apply(X_test)
    train_new_feature2 = DataFrame(train_new_feature)
    test_new_feature2 = DataFrame(test_new_feature)

    print("New train features (sklearn interface):", train_new_feature2.shape)
    print("New test features (sklearn interface):", test_new_feature2.shape)

    train_new_feature1.head()

           0   1   2   3   4   5   6   7   8   9  ...  20  21  22  23  24  25  26  27  28  29
     0   8  11   9   9  10   8  11  12   9   9  ...  10   8  10  11   9  10  10   8  11  12
     1  10  11   9  11  11   9  11  12   9   9  ...  10   9  11  11   9  10  10   8  11  13
     2  10  11   9  12  10   9  11  12   9   9  ...  10   9  14  11  10  13  10   8  11  13
     3  10  11   9  10  10   9  11  14   9   9  ...   8   9  11  12   9  10  10   8  13  12
     4  12  11   9   9  10   7  11  12   9  10  ...  10   8  13  11   9  10  10   8  12  12

    5 rows × 30 columns
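The code in this post feeds the raw leaf indices to the next model as they are, whereas the quoted description in section 1 uses 0/1 vectors. As a rough sketch of the conversion, the hypothetical helper `one_hot_leaves` below (not part of the original code) turns a leaf-index matrix into one-hot columns with NumPy. It takes the first three columns of the five rows printed above as input, and it builds the categories only from the values it sees; a real pipeline would fix the categories on the training set and reuse them for the test set.

```python
import numpy as np


def one_hot_leaves(leaf_idx):
    """One-hot encode each tree's leaf indices and concatenate the blocks."""
    leaf_idx = np.asarray(leaf_idx)
    blocks = []
    for t in range(leaf_idx.shape[1]):
        col = leaf_idx[:, t]
        leaves = np.unique(col)  # leaf ids observed for tree t, sorted
        # Compare every sample against every observed leaf id
        blocks.append((col[:, None] == leaves[None, :]).astype(np.int8))
    return np.hstack(blocks)


# First three trees of the five samples shown above
sample = [[8, 11, 9],
          [10, 11, 9],
          [10, 11, 9],
          [10, 11, 9],
          [12, 11, 9]]
encoded = one_hot_leaves(sample)
print(encoded.shape)  # (5, 5): 3 observed leaves in tree 0, 1 each in trees 1 and 2
print(encoded[0])     # [1 0 0 1 1]
```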

(4). Training and predicting on the new features

    # Train and predict with each of the two new feature sets
    # Train on the new features from the XGBoost native interface
    new_feature1 = clf.fit(train_new_feature1, y_train)
    y_new_feature1 = clf.predict_proba(test_new_feature1)[:, 1]
    # Train on the new features from the sklearn interface
    new_feature2 = clf.fit(train_new_feature2, y_train)
    y_new_feature2 = clf.predict_proba(test_new_feature2)[:, 1]

    print("New features from XGBoost native interface, AUC Score : %f" % metrics.roc_auc_score(y_test, y_new_feature1))
    print("New features from sklearn interface, AUC Score : %f" % metrics.roc_auc_score(y_test, y_new_feature2))

3. Plotting with the Plotting API

Since the new features are the leaf node indices of each tree, it is worth looking at the structure of each tree. The XGBoost Plotting API can do this:

(1). Installing and importing the packages: The XGBoost Plotting API needs graphviz and pydot. My environment is Win10 + Anaconda3. pydot installs directly with pip install pydot or conda install pydot. graphviz is a bit more trouble: after installing it with pip (or conda) the import works, but plotting raises an error that looks like a PATH environment variable problem.
I tried several fixes found online without success, and finally found the solution on Stack Overflow:
http://stackoverflow.com/questions/35064304/runtimeerror-make-sure-the-graphviz-executables-are-on-your-systems-path-aft

http://stackoverflow.com/questions/18334805/graphviz-windows-path-not-set-with-new-installer-issue-when-calling-from-r

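You first need to download the Graphviz installer for Windows. After installing it, add the install directory and its bin folder to the system environment variables, then reboot. Run pip (conda) install graphviz again, then open Jupyter Notebook (all the code in this post was tested in a notebook) or a Python environment and run the following code: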

    from xgboost import plot_tree
    from xgboost import plot_importance
    import matplotlib.pyplot as plt
    from graphviz import Digraph
    import pydot

    # Installation notes:
    # pip install pydot
    # http://www.graphviz.org/Download_windows.php
    # Install the graphviz.msi downloaded from the link above, add its path
    # to the environment variables, and reboot.
    # Then pip3 install graphviz ... and it just works.
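Since the error described above comes from the Graphviz executables not being on PATH, a quick check before plotting can save some guessing. The sketch below uses only the standard library; the helper name `graphviz_on_path` is made up for illustration, and it assumes the executable of interest is `dot`.

```python
import shutil


def graphviz_on_path(executable="dot"):
    """Return True if the given Graphviz executable can be found on PATH."""
    return shutil.which(executable) is not None


if graphviz_on_path():
    print("Graphviz 'dot' found at:", shutil.which("dot"))
else:
    print("Graphviz 'dot' not found on PATH; plot_tree is likely to fail.")
```

If the check fails right after installing Graphviz, the PATH change may simply not have reached the running notebook or shell yet, which is consistent with the reboot step mentioned above.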

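(2). Plotting the models from both interfaces: The models from the two interfaces above are kept separately, and the native interface is somewhat more convenient for setting parameters. I have not explored the functionality in depth, and the plots do not look very good yet.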

    # model_bst = xgb.train(params, d_train, 30, watchlist, early_stopping_rounds=500, verbose_eval=10)
    # model_sklearn = clf.fit(X_train, y_train)

    # model_bst:
    plot_tree(model_bst, num_trees=0)
    plot_importance(model_bst)
    plt.show()

    # model_sklearn:
    plot_tree(model_sklearn)
    plot_importance(model_sklearn)
    plt.show()

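4. Wrap-up

余音 has already packaged the code, so it can be downloaded and called directly. Kudos.

In practice you can implement the feature construction to suit your own needs, and it is not much trouble: the main job is to save the leaf index of each sample in each tree. The parameters can then be tuned as appropriate, and the resulting new features merged into the original ones. Whether this ultimately helps depends on the scenario. I plan to try it in my next competition.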
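The merging step mentioned above is not shown in this post, so the following is only a sketch of one way to do it, assuming both feature sets are row-aligned arrays. The arrays here are small stand-ins, not the data from the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
X_original = rng.normal(size=(4, 10))   # stand-in for the 10 original features
leaf_features = np.array([[8, 11, 9],
                          [10, 11, 9],
                          [10, 11, 9],
                          [12, 11, 9]])  # stand-in leaf indices from 3 trees

# Put the leaf-derived columns next to the original columns, row by row
X_combined = np.hstack([X_original, leaf_features])
print(X_combined.shape)  # (4, 13)
```

With pandas DataFrames, the same thing is usually done with a column-wise concat after resetting the index, because train_test_split leaves shuffled index labels on X_train.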

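In addition, I had not used the XGBoost Plotting API before, and it feels very nice: drawing each tree gives a direct view of what the model has learned. The plots produced by the code in this post are not very clear, though, so more practice is needed.

References: already listed in the text. Thanks again to the authors.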

http://blog.csdn.net/bryan__/article/details/51769118

https://breezedeus.github.io/2014/11/19/breezedeus-feature-mining-gbdt.html#fn:fbgbdt

https://github.com/lytforgood/MachineLearningTrick

http://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.core
