The 3 Ways to Compute Feature Importance in the Random Forest
The feature importance describes which features are relevant. It can help with a better understanding of the solved problem and sometimes lead to model improvement by utilizing feature selection. In this post, I will present 3 ways (with code) to compute feature importance for the Random Forest algorithm from the scikit-learn package (in Python).
Built-in Random Forest Importance
The Random Forest algorithm has built-in feature importance which can be computed in two ways:
Gini importance (or mean decrease impurity), which is computed from the Random Forest structure. Let’s look at how the Random Forest is constructed. It is a set of Decision Trees. Each Decision Tree is a set of internal nodes and leaves. In the internal node, the selected feature is used to make a decision on how to divide the data set into two separate sets with similar responses within. The features for internal nodes are selected with some criterion, which for classification tasks can be Gini impurity or information gain, and for regression is variance reduction. We can measure how each feature decreases the impurity of the split (the feature with the highest decrease is selected for the internal node). For each feature, we can collect how on average it decreases the impurity. The average over all trees in the forest is the measure of the feature importance. This method is available in the scikit-learn implementation of the Random Forest (for both classifier and regressor). It is worth mentioning that with this method we should look at the relative values of the computed importances. The biggest advantage of this method is the speed of computation: all needed values are computed during the Random Forest training. The drawback of the method is a tendency to prefer (select as important) numerical features and categorical features with high cardinality. What is more, in the case of correlated features it can select one of the features and neglect the importance of the second one (which can lead to wrong conclusions).
Mean Decrease Accuracy — a method of computing the feature importance on permuted out-of-bag (OOB) samples based on the mean decrease in accuracy. This method is not implemented in the scikit-learn package. The permutation-based importance described below in this post is very similar to it.
I will show how to compute feature importance for the Random Forest with the scikit-learn package and the Boston dataset (house price regression task).
```python
# Let's load the packages
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
import shap
from matplotlib import pyplot as plt

plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})
```
Load the data set and split for training and testing.
```python
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
```
Fit the Random Forest Regressor with 100 Decision Trees:
```python
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)
```
To get the feature importances from the Random Forest model, use the feature_importances_ attribute:
```python
rf.feature_importances_
# array([0.04054781, 0.00149293, 0.00576977, 0.00071805, 0.02944643,
#        0.25261155, 0.01969354, 0.05781783, 0.0050257 , 0.01615872,
#        0.01066154, 0.01185997, 0.54819617])
```
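To see where these numbers come from, here is a minimal sketch that recomputes the mean-decrease-impurity importance by hand from the fitted trees. It relies only on the public tree_ attributes of scikit-learn estimators; the helper name mdi_importances is mine:

```python
# A minimal sketch: recompute mean decrease impurity by hand.
# The helper name mdi_importances is made up for this post.
def mdi_importances(forest, n_features):
    per_tree = []
    for est in forest.estimators_:
        tree = est.tree_
        imp = np.zeros(n_features)
        n = tree.weighted_n_node_samples
        for node in range(tree.node_count):
            left, right = tree.children_left[node], tree.children_right[node]
            if left == -1:  # -1 marks a leaf: no split, no impurity decrease
                continue
            # impurity decrease of this split, weighted by the samples reaching it
            decrease = (n[node] * tree.impurity[node]
                        - n[left] * tree.impurity[left]
                        - n[right] * tree.impurity[right])
            imp[tree.feature[node]] += decrease
        if imp.sum() > 0:
            per_tree.append(imp / imp.sum())  # normalize each tree to sum to 1
    return np.mean(per_tree, axis=0)

manual = mdi_importances(rf, X.shape[1])
print(np.allclose(manual, rf.feature_importances_))  # expected to be True
```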
Let’s plot the importances (a chart will be easier to interpret than raw values).
```python
plt.barh(boston.feature_names, rf.feature_importances_)
```

To have an even better chart, let’s sort the features, and plot again:
```python
sorted_idx = rf.feature_importances_.argsort()
plt.barh(boston.feature_names[sorted_idx], rf.feature_importances_[sorted_idx])
plt.xlabel("Random Forest Feature Importance")
```
Permutation-based Importance
The permutation-based importance can be used to overcome the drawbacks of the default feature importance computed with mean impurity decrease. It is implemented in scikit-learn as the permutation_importance method. As arguments it requires a trained model (it can be any model compatible with the scikit-learn API) and validation (test) data. The method randomly shuffles each feature and computes the change in the model's performance. The features which impact the performance the most are the most important ones.
Permutation importance can be easily computed:
```python
perm_importance = permutation_importance(rf, X_test, y_test)
```

To plot the importance:

```python
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(boston.feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
```
The permutation-based importance is computationally expensive. The permutation-based method can also have problems with highly-correlated features: it can report them as unimportant, because permuting one of the correlated features barely changes the model's performance while the other one is still available.
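To make this effect visible, a small hypothetical experiment can be run on the Boston data (the RM_copy column below is invented just for this illustration):

```python
# Hypothetical illustration: duplicate the RM feature to create perfect correlation.
X_dup = X.copy()
X_dup["RM_copy"] = X_dup["RM"]
rf_dup = RandomForestRegressor(n_estimators=100, random_state=12)
rf_dup.fit(X_dup, y)
# The importance RM had alone is now split between RM and RM_copy, and
# shuffling only one of the two columns barely hurts the model, so
# permutation importance can make both look less important than RM really is.
print(dict(zip(X_dup.columns, rf_dup.feature_importances_.round(3))))
```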
Compute Importance from SHAP Values
The SHAP interpretation can be used (it is model-agnostic) to compute the feature importances from the Random Forest. It uses Shapley values from game theory to estimate how each feature contributes to the prediction. It can be easily installed (pip install shap) and used with the scikit-learn Random Forest:
```python
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
```
To plot the feature importance as a horizontal bar plot we need to use the summary_plot method:
```python
shap.summary_plot(shap_values, X_test, plot_type="bar")
```

The feature importance can be plotted with more details, showing the feature values:

```python
shap.summary_plot(shap_values, X_test)
```

Computing feature importances with SHAP can be computationally expensive. However, it can provide more information, like decision plots or dependence plots.
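For example, a dependence plot for a single feature can be drawn with one call (RM is just an illustrative pick from the Boston feature set):

```python
# Dependence plot for one feature; RM is chosen here only as an example.
shap.dependence_plot("RM", shap_values, X_test)
```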
Summary
Three ways to compute the feature importance for the scikit-learn Random Forest were presented:
- built-in feature importance
- permutation-based importance
- importance computed with SHAP values
In my opinion, it is always good to check all methods and compare the results. I’m using permutation and SHAP based methods in MLJAR’s AutoML open-source package mljar-supervised. I'm using them because they are model-agnostic and work well with algorithms not from scikit-learn: XGBoost, Neural Networks (keras+tensorflow), LightGBM, CatBoost.
Important Notes
- The more accurate the model is, the more trustworthy the computed importances are.
- The computed importances describe how important features are for the machine learning model. They are an approximation of how important features are in the data.
The mljar-supervised is an open-source Automated Machine Learning (AutoML) Python package that works with tabular data. It is designed to save time for a data scientist. It abstracts the common way to preprocess the data, construct the machine learning models, and perform hyper-parameter tuning to find the best model. It is not a black box, as you can see exactly how the ML pipeline is constructed (with a detailed Markdown report for each ML model).
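As a minimal usage sketch (assuming the package is installed with pip install mljar-supervised; the "Explain" mode is the one that generates the reports with feature importances):

```python
# A minimal usage sketch of mljar-supervised; parameters are illustrative.
from supervised.automl import AutoML

automl = AutoML(mode="Explain")  # produces Markdown reports with importances
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)
```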
The example report generated with the mljar-supervised AutoML package. Originally published at https://mljar.com on June 29, 2020.
Original article: https://towardsdatascience.com/the-3-ways-to-compute-feature-importance-in-the-random-forest-96c86b49e6d4