Feature Selection Techniques for Machine Learning in Python
Introduction
Feature selection is the process of selecting reliable features from a large pool of candidate features. A good understanding of feature selection/ranking can be a great asset for a data scientist or machine learning practitioner. It is an important step before model training, because too many or redundant features negatively impact the learning and accuracy of the model. From the collection of wanted/unwanted features, it is important to keep only those that contribute positively towards predicting the target and to remove the rest.
Methods
1. Filter Methods
This feature selection method uses a statistical approach that assigns a score to every feature. The features are then sorted according to their score and can be kept or removed from the data.
Filter methods are very fast but might fall short in terms of accuracy when compared with the other methods.
過濾器方法非常快,但與其他方法相比,準確性可能不足。
Some examples of filter methods are as follows:
a. Information Gain
Intuition: IG calculates the importance of each feature by measuring the reduction in entropy of the target when the feature is known versus when it is absent.
Algorithm:
IG(S, a) = H(S) - H(S | a)
Where IG(S, a) is the information gain for the dataset S with respect to the variable a, H(S) is the entropy of the dataset before any change (described above), and H(S | a) is the conditional entropy of the dataset given the variable a.
Example code:
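A minimal sketch using scikit-learn's mutual_info_classif, which estimates the mutual information (i.e., the information gain) between each feature and the target; the iris dataset, the variable names and the choice of keeping the top 2 features are purely illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Estimate the information gain (mutual information) of each feature
# with respect to the target.
ig_scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)

# Rank the features by their score and keep, for example, the top 2.
print(ig_scores.sort_values(ascending=False))
top_features = ig_scores.sort_values(ascending=False).head(2).index.tolist()
print(top_features)
```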
b. mRMR (Minimal Redundancy and Maximal Relevance)
Intuition: It selects features based on their relevance to the target variable as well as their redundancy with the other features.
Algorithm: It uses the mutual information (MI) of two random variables.
For discrete/categorical variables, the mutual information I of two variables x and y is defined from their joint probability distribution p(x, y) and the respective marginal probabilities p(x) and p(y):

I(x; y) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ]

This method uses the MI between a feature and the class as the relevance of the feature for the class, and the MI between features as the redundancy of each feature.
For every candidate feature X, S is the set of already selected features and Y is the target variable.
Hence, the resulting mRMR score takes both redundancy and relevance into account for each feature. If we take the difference of the two factors above, we get MID (Mutual Information Difference), and if we take their ratio, we get MIQ (Mutual Information Quotient).
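In a commonly used formulation, for a candidate feature X the two scores can be written as:

MID(X) = I(X; Y) - (1/|S|) Σ_{W ∈ S} I(X; W)
MIQ(X) = I(X; Y) / [ (1/|S|) Σ_{W ∈ S} I(X; W) ]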
Across several example data sets, MIQ has been observed to work better than MID for most of them, as the divisive combination of relevance and redundancy appears to favour features with the least redundancy.
Example code:
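Dedicated packages exist for mRMR, but the idea can be sketched with scikit-learn's mutual information estimators. The greedy MIQ selection below, along with the data set and variable names, is only an illustration of the approach, not a reference implementation.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

def mrmr_miq(X, y, k):
    # Relevance: MI between each feature and the target.
    relevance = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
    selected = [relevance.idxmax()]          # start with the most relevant feature
    remaining = [c for c in X.columns if c not in selected]
    while len(selected) < k and remaining:
        scores = {}
        for f in remaining:
            # Redundancy: mean MI between the candidate and the selected features.
            redundancy = np.mean([
                mutual_info_regression(X[[f]], X[s], random_state=0)[0]
                for s in selected
            ])
            scores[f] = relevance[f] / (redundancy + 1e-12)   # MIQ score
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

print(mrmr_miq(X, y, k=3))
```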
c. Chi square
Intuition:
It measures the dependence between a feature and the target, and selects the best k features according to their chi-square score, calculated using the following chi-square test.
Algorithm:
χ²_c = Σ (O_i - E_i)² / E_i

Where:
c = degrees of freedom
O = observed value(s)
E = expected value(s)
Let's consider a scenario where we need to determine the relationship between the independent categorical features (predictors) and the dependent categorical feature (target or label). In feature selection, we aim to select the features that are highly dependent on the target.
The higher the chi-square value, the more dependent the target variable is on the feature, and the feature can be selected for model training.
Example code:
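A minimal sketch using scikit-learn's SelectKBest with the chi2 score function; note that chi2 requires non-negative feature values (counts, frequencies or booleans). The data set and k are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)  # all values are non-negative
y = data.target

# Keep the k features with the highest chi-square scores.
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

scores = pd.Series(selector.scores_, index=X.columns)
print(scores.sort_values(ascending=False))
print("Selected:", list(X.columns[selector.get_support()]))
```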
d. ANOVA
Intuition:
We perform an ANOVA F-test between each feature and the target to check whether the feature values observed for the different target classes come from the same population.
Algorithm:
The F-statistic is the ratio variance_between / variance_within, which is compared against the critical value of the F-distribution (looked up in a statistical table). The library returns a score (the F-statistic) and a p-value; p < 0.05 means that, with more than 95% confidence, the class means differ, i.e., the feature and the target are related. We select the top k related features according to the score returned by ANOVA.
Example code:
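A minimal sketch using scikit-learn's f_classif, which performs the ANOVA F-test and returns both the F-score and the p-value for each feature; the data set and k are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# f_classif computes the ANOVA F-statistic and p-value for each feature.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

results = pd.DataFrame(
    {"F-score": selector.scores_, "p-value": selector.pvalues_},
    index=X.columns,
)
print(results.sort_values("F-score", ascending=False))
print("Selected:", list(X.columns[selector.get_support()]))
```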
2. Wrapper Methods
In wrapper methods, we select a subset of features from the data and train a model on them. We then add or remove a feature and train the model again; the difference in score between the two conditions decides the importance of that feature, i.e., whether its presence increases or decreases the score.
Wrapper methods perform very well in terms of accuracy but fall short in speed when compared to other methods.
Some examples of wrapper methods are as follows:
a. Forward Selection: This is an iterative method in which we start with zero features and, in each iteration, keep adding the feature that best improves the model, until adding a new variable no longer improves the performance of the model.
b. Backward Elimination: In this method, we start with all the features and remove them one by one whenever their absence increases the score of the model. We continue until no improvement is observed from removing any feature.
c. Recursive Feature Elimination: This is a greedy optimization algorithm that aims to find the best-performing feature subset. It repeatedly creates models and sets aside the best- or worst-performing feature at each iteration, then constructs the next model with the remaining features until all the features are exhausted. Finally, it ranks the features based on the order of their elimination.
Example code:
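A sketch of the three wrapper methods using scikit-learn: SequentialFeatureSelector (available from scikit-learn 0.24) covers forward selection and backward elimination, and RFE covers recursive feature elimination. The logistic-regression estimator, the data set and the number of features to keep are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

estimator = LogisticRegression(max_iter=1000)

# a. Forward selection: start from zero features, add the best one at each step.
forward = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                    direction="forward").fit(X, y)
print("Forward: ", list(X.columns[forward.get_support()]))

# b. Backward elimination: start from all features, drop the worst one at each step.
backward = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                     direction="backward").fit(X, y)
print("Backward:", list(X.columns[backward.get_support()]))

# c. Recursive feature elimination: repeatedly fit the model, remove the weakest
#    feature, and rank the features by the order of their elimination.
rfe = RFE(estimator, n_features_to_select=2).fit(X, y)
print("RFE ranking:", dict(zip(X.columns, rfe.ranking_)))
```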
3. Embedded Methods
Embedded methods are implemented by algorithms that can learn which features contribute most to the accuracy of the model during the creation of the model itself.
They combine the qualities of both filter and wrapper methods.
The most commonly used embedded feature selection methods are regularization methods. Regularization methods are also called penalization methods, as they introduce a penalty into the objective function that decreases the number of features used and hence the complexity of the model.
Examples of regularization algorithms are LASSO, Elastic Net and Ridge Regression.
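A minimal sketch of an embedded method using LASSO: the L1 penalty drives the coefficients of unhelpful features towards zero, and SelectFromModel keeps only the features whose coefficients survive. The cross-validated LassoCV estimator and the diabetes data set are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Fit a cross-validated LASSO inside SelectFromModel; features whose
# coefficients are driven to (near) zero by the L1 penalty are dropped.
selector = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X, y)

coefficients = pd.Series(selector.estimator_.coef_, index=X.columns)
print(coefficients)
print("Selected:", list(X.columns[selector.get_support()]))
```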
Translated from: https://medium.com/@mayurtuteja97/feature-selection-techniques-for-machine-learning-in-python-455dadcd3869