How to Perform Feature Selection for Regression Problems
1. Introduction
What is feature selection?
Feature selection is the procedure of selecting a subset (some out of all available) of the input variables that are most relevant to the target variable. The target variable here refers to the variable that we wish to predict.
For this article, we will assume that we only have numerical input variables and a numerical target for regression predictive modeling. Under this assumption, we can easily estimate the relationship between each input variable and the target variable by computing a metric such as a correlation value.
2. The main numerical feature selection methods
The two most widely used feature selection techniques for numerical input data and a numerical target variable are the following:
- Correlation (Pearson, Spearman)
- Mutual information (MI, normalized MI)
Correlation is a measure of how two variables change together. The most widely used correlation measure is Pearson's correlation, which assumes a Gaussian distribution of each variable and detects linear relationships between numerical variables.
This is done in two steps:

1. The correlation between each regressor and the target is computed, that is, E[(X[:, i] - mean(X[:, i])) * (y - mean_y)] / (std(X[:, i]) * std(y)).
2. It is converted to an F score and then to a p-value.
Mutual information originates from the field of information theory. The idea is to apply information gain (typically used in the construction of decision trees) to perform feature selection. Mutual information is calculated between two variables and measures the reduction in uncertainty about one variable given a known value of the other.
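In symbols, for two variables X and Y, the mutual information can be written as

I(X; Y) = H(X) - H(X | Y)

where H(X) is the entropy of X and H(X | Y) is the conditional entropy of X given a known value of Y. The larger the reduction in uncertainty, the stronger the dependence between the two variables.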
3. The dataset
We will use the Boston house-prices dataset. This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. The dataset consists of the following variables:

- CRIM: per-capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- LSTAT: percentage of lower-status population
- MEDV: median value of owner-occupied homes in $1000s (the target variable)
4. Python Code & Working Example
Let's load the dataset and split it into training (70%) and test (30%) sets.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_regression
import matplotlib.pyplot as plt

# load the data
X, y = load_boston(return_X_y=True)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
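A caveat for readers on recent versions of scikit-learn: load_boston was deprecated in version 1.0 and removed in 1.2. A sketch of an equivalent loading step, following the workaround suggested in the scikit-learn documentation and assuming the CMU StatLib mirror is still reachable:

import numpy as np
import pandas as pd

# the Boston data is stored as two physical rows per record at this URL
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 input features
y = raw_df.values[1::2, 2]  # MEDV, the target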
We will use the well-known scikit-learn machine learning library.
Case 1: Feature selection using the Correlation metric
For the correlation statistic, we will use the f_regression() function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (those with the largest scores) via the SelectKBest class.
# feature selection
f_selector = SelectKBest(score_func=f_regression, k='all')

# learn relationship from training data
f_selector.fit(X_train, y_train)

# transform train input data
X_train_fs = f_selector.transform(X_train)

# transform test input data
X_test_fs = f_selector.transform(X_test)

# plot the scores for the features
plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
plt.xlabel("feature index")
plt.ylabel("F-value (transformed from the correlation values)")
plt.show()
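Here k='all' keeps every feature so that all the scores can be plotted. To actually shrink the feature set, pass a number instead; a minimal sketch, where k=4 is an arbitrary choice for illustration:

# keep only the 4 highest-scoring features (4 is an arbitrary example value)
fs = SelectKBest(score_func=f_regression, k=4)
X_train_top = fs.fit_transform(X_train, y_train)
X_test_top = fs.transform(X_test)
print(fs.get_support(indices=True))  # column indices of the selected features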
Reminder: for the correlation statistic case, the correlation is converted to an F score and then to a p-value.
The plot above shows that features 6 and 13 are more important than the other features. The y-axis represents the F-values that were estimated from the correlation values.
Case 2: Feature selection using the Mutual Information metric
The scikit-learn machine learning library provides an implementation of mutual information for feature selection with numerical input and output variables via the mutual_info_regression() function.
# feature selection
f_selector = SelectKBest(score_func=mutual_info_regression, k='all')

# learn relationship from training data
f_selector.fit(X_train, y_train)

# transform train input data
X_train_fs = f_selector.transform(X_train)

# transform test input data
X_test_fs = f_selector.transform(X_test)

# plot the scores for the features
plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
plt.xlabel("feature index")
plt.ylabel("Estimated MI value")
plt.show()

Feature importance plot
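Note that mutual_info_regression estimates MI with a k-nearest-neighbors method, so the scores depend on its n_neighbors and random_state arguments. To pass these through SelectKBest, one option is functools.partial; a short sketch, where the values 5 and 0 are arbitrary choices:

from functools import partial

# a score function with fixed MI-estimation settings (values chosen arbitrarily)
mi_score = partial(mutual_info_regression, n_neighbors=5, random_state=0)
mi_selector = SelectKBest(score_func=mi_score, k='all')
mi_selector.fit(X_train, y_train)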
The y-axis represents the estimated mutual information between each feature and the target variable. Compared to the correlation feature selection method, we can clearly see many more features scored as relevant. This may be because of statistical noise that might exist in the dataset.
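Another likely reason for the difference: mutual information can detect non-linear dependence that Pearson correlation misses entirely. A minimal, hypothetical sketch on synthetic data (not the Boston dataset):

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, (500, 1))
y = x[:, 0] ** 2  # purely quadratic relationship: Pearson correlation is ~0

print(np.corrcoef(x[:, 0], y)[0, 1])    # close to 0, correlation sees nothing
print(mutual_info_regression(x, y)[0])  # clearly positive MI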
5. Conclusion
In this article I have presented two ways to perform feature selection. Feature selection is the procedure of selecting a subset (some out of all available) of the input variables that are most relevant to the target variable, i.e., the variable that we wish to predict.
Using either the Correlation metric or the Mutual Information metric, we can easily estimate the relationship between each input variable and the target variable.
Correlation vs. Mutual Information: compared to the correlation feature selection method, many more features are scored as relevant by mutual information. This may be due to statistical noise in the dataset, or to mutual information picking up non-linear relationships that correlation cannot detect.
Stay tuned & support this effort
If you liked this article and found it useful, follow me to see all my new posts.
Questions? Post them as a comment and I will reply as soon as possible.
Get in touch with me
LinkedIn: https://www.linkedin.com/in/serafeim-loukas/
ResearchGate: https://www.researchgate.net/profile/Serafeim_Loukas
EPFL profile: https://people.epfl.ch/serafeim.loukas
Stack Overflow: https://stackoverflow.com/users/5025009/seralouk
Translated from: https://towardsdatascience.com/how-to-perform-feature-selection-for-regression-problems-c928e527bbfa