Extremely Randomized Trees: ExtraTreesClassifier
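1 Disclaimer
The data in this article comes from the internet, and parts of the code are adapted from other sources; I have added comments and extensions here. The purpose is technical exchange. If anything here infringes on your work, please contact the author so it can be handled promptly.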
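2 Introduction to ExtraTreesClassifier
The Extremely Randomized Trees Classifier (Extra Trees) is an ensemble learning technique that aggregates the results of multiple de-correlated decision trees collected in a forest to produce a classification result. Each decision tree is built from the original training sample. At each test node, each tree is given a random sample of k features from the feature set, and it must select the best feature from that sample to split the data according to some mathematical criterion (typically the Gini index). In scikit-learn's implementation the candidate split thresholds are also drawn at random, and the best of these random splits is kept. This random sampling of features is what produces multiple de-correlated decision trees.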
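The node-level step described above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the function names (`gini`, `gini_after_split`, `choose_split_feature`) and the toy rows are assumptions made up for this example, and it only covers categorical features with a random feature subset (no random thresholds).

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_after_split(rows, labels, feature):
    """Weighted Gini impurity after grouping the rows by one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    return sum(len(group) / n * gini(group) for group in groups.values())


def choose_split_feature(rows, labels, k, rng):
    """Draw k candidate features at random, then keep the best of them."""
    candidates = rng.sample(sorted(rows[0]), k)
    best = min(candidates, key=lambda f: gini_after_split(rows, labels, f))
    return candidates, best


# Toy rows (hypothetical, for illustration only): feature "a" separates the classes.
rows = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 1, "c": 0},
    {"a": 1, "b": 0, "c": 1},
    {"a": 1, "b": 1, "c": 0},
]
labels = ["No", "No", "Yes", "Yes"]

candidates, best = choose_split_feature(rows, labels, k=2, rng=random.Random(0))
print("candidates:", candidates, "chosen:", best)
```

Because each node only sees a random subset of the features, different trees end up splitting on different features, which is where the de-correlation comes from.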
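While the forest is being built, the normalized total reduction in the splitting criterion (for example, the Gini index) is computed for each feature. This value is called the Gini importance of the feature. Once the Gini importances are sorted in descending order, the top k features can be selected as needed.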
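As a rough sketch of the normalize, sort, and select steps: the helper names `normalized_importance` and `top_k_features` and the raw reduction values below are hypothetical, chosen only to show the mechanics. scikit-learn performs its own normalization internally, so this is not its code.

```python
def normalized_importance(raw_reduction):
    """Scale the total criterion reduction per feature so the values sum to 1."""
    total = sum(raw_reduction.values())
    return {name: value / total for name, value in raw_reduction.items()}


def top_k_features(importance, k):
    """Return the k feature names with the highest importance, best first."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical raw reductions for three made-up features.
raw_reduction = {"f1": 0.30, "f2": 0.10, "f3": 0.60}

importance = normalized_importance(raw_reduction)
print(top_k_features(importance, k=2))
```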
3 ExtraTreesClassifier code example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

df_pre = pd.read_csv('../input/PlayTennis.txt', sep="\t")

# Drop the identifier column; it carries no predictive information
df = df_pre.drop('Day', axis=1)

# Convert the categorical columns to numeric codes via dictionary mapping
weather_mapper = {'Sunny': 1, 'Overcast': 2, 'Rain': 3}
df['Outlook'] = df['Outlook'].map(weather_mapper)
temperature_mapper = {'Hot': 1, 'Mild': 2, 'Cool': 3}
df['Temperature'] = df['Temperature'].map(temperature_mapper)
humidity_mapper = {'High': 1, 'Normal': 2}
df['Humidity'] = df['Humidity'].map(humidity_mapper)
wind_mapper = {'Weak': 1, 'Strong': 0}
df['Wind'] = df['Wind'].map(wind_mapper)
playTennis_mapper = {"Yes": 1, "No": 0}
df['PlayTennis'] = df['PlayTennis'].map(playTennis_mapper)
print(df.head())

# Split into X (independent variables) and y (dependent variable)
y = df['PlayTennis']
X = df.loc[:, 'Outlook':'Wind']

# 5 trees, 2 candidate features per split, entropy as the split criterion
extra_tree_forest = ExtraTreesClassifier(n_estimators=5, criterion='entropy', max_features=2)
extra_tree_forest.fit(X, y)

# Importance of each feature (scikit-learn already normalizes these to sum to 1)
feature_importance = extra_tree_forest.feature_importances_

# Spread of each feature's importance across the individual trees
feature_importance_std = np.std([tree.feature_importances_ for tree in extra_tree_forest.estimators_], axis=0)

# Bar chart comparing the features, with the per-tree spread as error bars
plt.bar(X.columns, feature_importance, yerr=feature_importance_std)
plt.xlabel('Feature')
plt.ylabel('Feature importance')
plt.title('Feature importance comparison')
plt.show()

4 Worked calculation
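Entropy formula:
$$Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
where c is the number of unique class labels and p_i is the proportion of rows that belong to class i.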
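As a quick check of the formula, here is a minimal Python sketch. The helper name `entropy` is an assumption for this example; the class counts (9 Yes, 5 No) are those of the PlayTennis table used in this section.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


# Class column of the PlayTennis table: 9 "Yes" rows and 5 "No" rows.
labels = ["Yes"] * 9 + ["No"] * 5

print(round(entropy(labels), 3))  # 0.94
```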
-- Build the data
CREATE TABLE PlayTennis(
    DayNo varchar(10),
    Outlook varchar(10),
    Temperature varchar(10),
    Humidity varchar(10),
    Wind varchar(10),
    PlayTennis varchar(10)
);

INSERT INTO `PlayTennis` (`DayNo`,`Outlook`,`Temperature`,`Humidity`,`Wind`,`PlayTennis`) VALUES
('D1','Sunny','Hot','High','Weak','No'),
('D2','Sunny','Hot','High','Strong','No'),
('D3','Overcast','Hot','High','Weak','Yes'),
('D4','Rain','Mild','High','Weak','Yes'),
('D5','Rain','Cool','Normal','Weak','Yes'),
('D6','Rain','Cool','Normal','Strong','No'),
('D7','Overcast','Cool','Normal','Strong','Yes'),
('D8','Sunny','Mild','High','Weak','No'),
('D9','Sunny','Cool','Normal','Weak','Yes'),
('D10','Rain','Mild','Normal','Weak','Yes'),
('D11','Sunny','Mild','Normal','Strong','Yes'),
('D12','Overcast','Mild','High','Strong','Yes'),
('D13','Overcast','Hot','Normal','Weak','Yes'),
('D14','Rain','Mild','High','Strong','No');

-- Compute the entropy of the whole table
WITH CTE1 AS (
    SELECT DISTINCT COUNT(PlayTennis) OVER (PARTITION BY PlayTennis) gp, total
    FROM PlayTennis, (SELECT COUNT(*) total FROM PlayTennis) A
)
SELECT SUM(-(gp/total) * LOG(2, gp/total)) entropy_s
FROM (SELECT gp, total FROM CTE1) A;
-- 0.940285959354754

Suppose the first decision tree selects the features Outlook and Temperature. Then:
-- Compute the information gain of the Outlook feature
WITH CTE2 AS (
    SELECT DISTINCT
        COUNT(PlayTennis) OVER (PARTITION BY Outlook, PlayTennis ORDER BY PlayTennis) gp,
        COUNT(1) OVER (PARTITION BY Outlook) num,
        Outlook,
        PlayTennis,
        (SELECT COUNT(*) total FROM PlayTennis) total
    FROM PlayTennis
)
SELECT 0.940285959354754 - SUM(-(num/total) * (gp/num) * LOG(2, gp/num)) Gain_S_OutLook
FROM CTE2;
-- 0.246749820735977

The gain of Temperature is obtained in the same way (0.029).
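The second decision tree selects the features Temperature and Wind: Gain(S, Temperature) = 0.029 and Gain(S, Wind) = 0.048.
The third decision tree selects the features Outlook and Humidity: Gain(S, Outlook) = 0.246 and Gain(S, Humidity) = 0.151.
The fourth decision tree selects the features Temperature and Humidity: Gain(S, Temperature) = 0.029 and Gain(S, Humidity) = 0.151.
The fifth decision tree selects the features Wind and Humidity: Gain(S, Wind) = 0.048 and Gain(S, Humidity) = 0.151.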
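The per-feature gains can be cross-checked in Python. The sketch below copies the 14 rows of the PlayTennis table; the helper names `entropy` and `information_gain` are assumptions for this example. It computes each gain against the full table, as the SQL does, rather than growing actual trees.

```python
import math
from collections import Counter, defaultdict

# Rows of the PlayTennis table: Outlook, Temperature, Humidity, Wind, PlayTennis
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
FEATURES = ["Outlook", "Temperature", "Humidity", "Wind"]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def information_gain(rows, feature_index):
    """Entropy of the whole table minus the weighted entropy of each feature value's rows."""
    labels = [row[-1] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature_index]].append(row[-1])
    remainder = sum(len(group) / len(rows) * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(FEATURES)}
for name, value in gains.items():
    print(f"{name}: {value:.4f}")
```

Truncated to three decimals, the results agree with the values 0.246, 0.029, 0.151 and 0.048 used in this section.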
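The information gain of each feature, summed over the five trees, is then:
Outlook: 0.246 + 0.246 = 0.492
Temperature: 0.029 + 0.029 + 0.029 = 0.087
Humidity: 0.151 + 0.151 + 0.151 = 0.453
Wind: 0.048 + 0.048 = 0.096
So the most important variable identified by the extremely randomized trees is the feature Outlook.
Note: because the features are chosen at random, the feature importances computed here may differ from run to run.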
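The tally above can be reproduced with a few lines of Python. This sketch takes the article's truncated per-feature gains and its five feature pairs as given; the variable names (`gain`, `trees`, `totals`, `ranking`) are assumptions for this example.

```python
from collections import defaultdict

# Truncated information gains, as used in this section.
gain = {"Outlook": 0.246, "Temperature": 0.029, "Humidity": 0.151, "Wind": 0.048}

# Feature pair assumed for each of the five trees in the walkthrough.
trees = [
    ("Outlook", "Temperature"),
    ("Temperature", "Wind"),
    ("Outlook", "Humidity"),
    ("Temperature", "Humidity"),
    ("Wind", "Humidity"),
]

# Add up each feature's gain once per tree that uses it.
totals = defaultdict(float)
for features in trees:
    for feature in features:
        totals[feature] += gain[feature]

# Rank the features from most to least important.
ranking = sorted(totals, key=totals.get, reverse=True)
for name in ranking:
    print(f"{name}: {totals[name]:.3f}")
```

A different random draw of feature pairs would change how often each feature is counted, which is why the ranking can vary between runs.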
5 Summary
References:
https://www.geeksforgeeks.org/ml-extra-tree-classifier-for-feature-selection/
https://machinelearningmastery.com/extra-trees-ensemble-with-python/