How to Perform Feature Selection for Regression Problems
1. Introduction
What is feature selection?
Feature selection is the procedure of selecting a subset (some out of all available) of the input variables that are most relevant to the target variable. The target variable here refers to the variable that we wish to predict.
For this article, we will assume that we only have numerical input variables and a numerical target for regression predictive modeling. Under this assumption, we can easily estimate the relationship between each input variable and the target variable by computing a metric such as a correlation value.
2. The main numerical feature selection methods
The two most widely used feature selection techniques for numerical input data and a numerical target variable are the following:
- Correlation (Pearson, Spearman)
- Mutual information (MI, normalized MI)
Correlation is a measure of how two variables change together. The most widely used correlation measure is Pearson's correlation, which assumes a Gaussian distribution of each variable and detects linear relationships between numerical variables.
This is done in two steps:

1. The correlation between each regressor and the target is computed, that is, E[(X[:, i] - mean(X[:, i])) * (y - mean_y)] / (std(X[:, i]) * std(y)).
2. It is converted to an F score and then to a p-value.
Mutual information originates from the field of information theory. The idea is to apply information gain (typically used in the construction of decision trees) to perform feature selection. Mutual information is calculated between two variables and measures the reduction in uncertainty about one variable given a known value of the other.
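In symbols, for two variables X and Y, the mutual information can be written as

I(X; Y) = H(X) - H(X | Y)

where H(X) is the entropy of X and H(X | Y) is the conditional entropy of X given a known value of Y. The larger the reduction in uncertainty, the stronger the dependence between the two variables.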
3. The dataset
We will use the Boston house-prices dataset. This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. The dataset consists of the following variables:

- CRIM: per-capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- LSTAT: percentage of lower-status population
- MEDV: median value of owner-occupied homes in $1000s (the target variable)
4. Python Code & Working Example
Let's load the dataset and split it into training (70%) and test (30%) sets.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_regression
import matplotlib.pyplot as plt

# load the data
X, y = load_boston(return_X_y=True)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
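A caveat for readers on recent versions of scikit-learn: load_boston was deprecated in version 1.0 and removed in 1.2. A sketch of an equivalent loading step, following the workaround suggested in the scikit-learn documentation and assuming the CMU StatLib mirror is still reachable:

import numpy as np
import pandas as pd

# the Boston data is stored as two physical rows per record at this URL
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 input features
y = raw_df.values[1::2, 2]  # MEDV, the target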
We will use the well-known scikit-learn machine learning library.
Case 1: Feature selection using the Correlation metric
For the correlation statistic, we will use the f_regression() function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (those with the largest scores) via the SelectKBest class.
# feature selection
f_selector = SelectKBest(score_func=f_regression, k='all')

# learn relationship from training data
f_selector.fit(X_train, y_train)

# transform train input data
X_train_fs = f_selector.transform(X_train)

# transform test input data
X_test_fs = f_selector.transform(X_test)

# plot the scores for the features
plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
plt.xlabel("feature index")
plt.ylabel("F-value (transformed from the correlation values)")
plt.show()
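Here k='all' keeps every feature so that all the scores can be plotted. To actually shrink the feature set, pass a number instead; a minimal sketch, where k=4 is an arbitrary choice for illustration:

# keep only the 4 highest-scoring features (4 is an arbitrary example value)
fs = SelectKBest(score_func=f_regression, k=4)
X_train_top = fs.fit_transform(X_train, y_train)
X_test_top = fs.transform(X_test)
print(fs.get_support(indices=True))  # column indices of the selected features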
Reminder: for the correlation statistic case, the correlation is converted to an F score and then to a p-value.
The plot above shows that features 6 and 13 are more important than the other features. The y-axis represents the F-values that were estimated from the correlation values.
Case 2: Feature selection using the Mutual Information metric
The scikit-learn machine learning library provides an implementation of mutual information for feature selection with numerical input and output variables via the mutual_info_regression() function.
# feature selection
f_selector = SelectKBest(score_func=mutual_info_regression, k='all')

# learn relationship from training data
f_selector.fit(X_train, y_train)

# transform train input data
X_train_fs = f_selector.transform(X_train)

# transform test input data
X_test_fs = f_selector.transform(X_test)

# plot the scores for the features
plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
plt.xlabel("feature index")
plt.ylabel("Estimated MI value")
plt.show()

Feature importance plot
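Note that mutual_info_regression estimates MI with a k-nearest-neighbors method, so the scores depend on its n_neighbors and random_state arguments. To pass these through SelectKBest, one option is functools.partial; a short sketch, where the values 5 and 0 are arbitrary choices:

from functools import partial

# a score function with fixed MI-estimation settings (values chosen arbitrarily)
mi_score = partial(mutual_info_regression, n_neighbors=5, random_state=0)
mi_selector = SelectKBest(score_func=mi_score, k='all')
mi_selector.fit(X_train, y_train)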
The y-axis represents the estimated mutual information between each feature and the target variable. Compared to the correlation feature selection method, we can clearly see many more features scored as relevant. This may be because of statistical noise that might exist in the dataset.
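Another likely reason for the difference: mutual information can detect non-linear dependence that Pearson correlation misses entirely. A minimal, hypothetical sketch on synthetic data (not the Boston dataset):

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, (500, 1))
y = x[:, 0] ** 2  # purely quadratic relationship: Pearson correlation is ~0

print(np.corrcoef(x[:, 0], y)[0, 1])    # close to 0, correlation sees nothing
print(mutual_info_regression(x, y)[0])  # clearly positive MI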
5. Conclusion
In this article I have presented two ways to perform feature selection. Feature selection is the procedure of selecting a subset (some out of all available) of the input variables that are most relevant to the target variable, i.e., the variable that we wish to predict.
Using either the Correlation metric or the Mutual Information metric, we can easily estimate the relationship between each input variable and the target variable.
Correlation vs. Mutual Information: compared to the correlation feature selection method, many more features are scored as relevant by mutual information. This may be due to statistical noise in the dataset, or to mutual information picking up non-linear relationships that correlation cannot detect.
Stay tuned & support this effort
If you liked this article and found it useful, follow me to see all my new posts.
Questions? Post them as a comment and I will reply as soon as possible.
Get in touch with me
LinkedIn: https://www.linkedin.com/in/serafeim-loukas/
ResearchGate: https://www.researchgate.net/profile/Serafeim_Loukas
EPFL profile: https://people.epfl.ch/serafeim.loukas
Stack Overflow: https://stackoverflow.com/users/5025009/seralouk
Translated from: https://towardsdatascience.com/how-to-perform-feature-selection-for-regression-problems-c928e527bbfa