Help! How Do I Feature-Select?
Oftentimes we're not sure how to choose our features. This is just a small guide to help with that choice. (Disclaimer: for now, I'll talk about binary classification.)
Many times, when we're super excited to predict with a fancy machine-learning algorithm and almost ready to apply our model to analyze and classify the test dataset, we don't exactly know which features to pick. Often the number of features can range from tens to thousands, and it's not clear how to pick relevant features or how many we should select. Sometimes it's not a bad idea to combine features together, which is known as feature engineering. A common example you've probably heard of in machine learning is principal component analysis (PCA), where the data matrix X is factorized via its singular value decomposition (SVD) as X = UΣVᵀ, where Σ is a diagonal matrix of singular values, and the number of singular values you keep determines how many principal components you get. You can think of principal components as a way to reduce the dimensionality of your dataset. The awesome thing about PCA is that the new engineered features, or "principal components", are linear combinations of the original features. And that's great! We love linear combinations, because they only involve addition and scalar multiplication, and they're not too hard to interpret. For example, if you did PCA on a house-price regression dataset and kept only 2 principal components, the first component, PC1, could be c1*(number of bedrooms) + c2*(square footage), and PC2 could be something similar.
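To make that concrete, here is a minimal sketch of PCA in scikit-learn on a made-up toy table with two hypothetical columns, bedrooms and square footage; the rows of components_ hold exactly those c1, c2 coefficients:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# made-up toy data: each row is a house, columns are (bedrooms, sq. ft.)
X = np.array([[2, 900], [3, 1500], [3, 1100], [4, 2000], [5, 2400]], dtype=float)
# scale first so square footage doesn't dominate purely by magnitude
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# each row of components_ is one principal component's coefficients (c1, c2)
# over the (scaled) original features; explained_variance_ratio_ says how much
# variance each component captures
print(pca.components_)
print(pca.explained_variance_ratio_)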
The limitation of principal components is that the new features you make are *only* linear combinations of some of the old ones. That means you can't take advantage of non-linear combinations of features. This is something neural networks are awesome at; they can create TONS of non-linear combinations/functions of features. But they have an even bigger problem: interpretability of the new features. The engineered features are basically hidden inside the weight-matrix multiplications between different layers of the network (which is just a composition of non-linear functions). And neural networks, with that extra non-linearity, can often be brittle and break under adversarial attacks, such as few-pixel attacks on convolutional neural networks, or tricking a network into misclassifying a panda and a black square as a vulture; weird, nonsense stuff like that.
So, what to do about features? Well, if the ways we can engineer new features are kind of limited, we could always just select a subset of the features we already have! But you need to be careful. There are many ways to do this, and not all of them are robust and consistent. For example, take random forests. It's true that after you train the classifier, Python will report the relevant features through the feature_importances_ attribute of a random forest. But let's think for a second: random forests work by training a bunch of decision trees, each one on a random subset of the training data. So if you keep re-fitting the RF model, you might get different feature importances each time, and that is neither robust nor consistent. Wouldn't it be confusing, as a data scientist or ML engineer, to see a different set of relevant features pop up each time? You clearly didn't change the dataset! So why should you trust different sets of "important" features? The problem is that the "important" features you're picking depend on the random-forest model itself, and even if RFs have high accuracy, it makes more sense to choose features based on the dataset alone rather than running the data through a heavy-duty model first.
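Here's a small sketch of that inconsistency on a made-up dataset: the same forest, fit twice with different seeds, may rank the features differently.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# made-up binary-classification data with 10 features, only 3 of them informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)
# same data, same model family; only the random seed changes between fits
for seed in (1, 2):
    rf = RandomForestClassifier(n_estimators=50, random_state=seed)
    rf.fit(X, y)
    # feature indices ranked from most to least "important" for this fit
    print(np.argsort(rf.feature_importances_)[::-1])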
The key to selecting features that are consistent, not confusing, and robust might be this: select features independently of your model. The relevant features you select should be relevant whether you use a neural network, an RF, logistic regression, or any other supervised learning model. This way, you don't have to worry about the predictive power of your machine-learning model while you're also trying to pick features, which can be unreliable.
So, how do you pick features independently of your model? Scikit-learn has a few options. One of them, my favorite, is called mutual information. It's an important concept from probability theory. Basically, it quantifies the dependence between a feature variable and the label variable, relative to the assumption that they're independent. An easier way of saying that: it measures how much your class labels depend on a specific feature.
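As a rough sketch of how you'd compute it (again on made-up data), scikit-learn gives you one score per feature, where 0 means the label looks independent of that feature and larger values mean stronger dependence:
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
# made-up data: 5 features, binary label, only 2 features actually informative
X, y = make_classification(n_samples=500, n_features=5, n_informative=2, random_state=0)
# one mutual-information estimate per feature; no predictive model involved
mi_scores = mutual_info_classif(X, y, random_state=0)
print(mi_scores)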
So, for example, say you're predicting whether someone has a tumor by looking at a bunch of feature columns in your dataset, like geometric area, location, color hue, etc. If you're trying to choose the features relevant to your prediction, you can use mutual information to talk about how much the class label depends on the geometric area, location, and color hue of the tumor. And this is a measurement obtained directly from the data; it never involves a predictive model in the first place.
You can also use scikit-learn's chi2, or "chi-squared", scorer to determine feature importance. What this does is run a chi-squared test between each feature and the label to determine which features are related to the label and which ones are independent of it. You can think of this method as testing a "null hypothesis" H0: are the features independent of the classification label? To do this, you calculate a chi-squared statistic from the data table, get a p-value, and determine which features are independent and which aren't. You then throw away the independent features (why? because, according to your test, they're independent of the label, so they give no information) and keep the dependent ones.
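A quick sketch with scikit-learn's built-in breast-cancer dataset (chosen because chi2 expects non-negative feature values, and those measurements are all non-negative):
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
# binary tumor dataset; every feature is a non-negative measurement
data = load_breast_cancer()
X, y = data.data, data.target
# keep the 5 features with the largest chi-squared statistic,
# i.e. the ones least plausibly independent of the label
selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, y)
print([name for name, keep in zip(data.feature_names, selector.get_support()) if keep])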
This test is actually based on principles similar to the mutual-information calculation above. However, chi2 does make the important assumption that features taking continuous values (say, 5.3, pi, sqrt(2), that sort of thing) are normally distributed. Usually this isn't a problem for big training sets, but for small training sets the assumption might be violated, so mutual information might be more reliable in those cases.
The basic point is this: the mutual-information and chi-squared ways of feature selecting are robust to the choice of predictive model. Your predictive model might be wildly inaccurate, but the data you've collected sits in a static table that never changes, so scoring your features without the model is more consistent.
Other ways of feature selecting include Recursive Feature Elimination (RFE), which uses a pre-fixed model (say, logistic/linear regression, or a random forest), fits it, ranks the features by the model's coefficients or importances, throws out the weakest ones, and repeats until only the desired number of features remains. (Technically, random forests expose this ranking in scikit-learn through the feature_importances_ attribute, but I won't be getting into that here.) However, RFE does take a lot of time, because it has to refit the model at every elimination round, and if you also cross-validate to decide how many features to keep, the number of fits multiplies again.
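Here's what that looks like in scikit-learn, as a sketch on made-up data; notice that the selection is tied to whichever estimator you plug in:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)
# repeatedly fit the estimator and drop the weakest feature until 3 remain;
# swap in a different estimator and you may well get a different subset
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the features RFE kept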
Another big reason I have against RFE and similar techniques is that they are fundamentally model-dependent feature-selection techniques. If your model is inaccurate, or overfits heavily, or does both and isn't that interpretable to the user, then the features you selected weren't actually chosen by you, but by the model. So the feature importances might not be an accurate representation of which features are actually predictive based on the dataset alone.
So what can we take away from all this? Well, in the end, feature selection is extremely important if you don't know how to interpret the features you'd engineer with, say, principal component analysis. However, when you do feature selection, it's just as important to pay attention to how you're selecting your features, and to the computational cost. Is your method taking too much compute time? Does your feature selection depend on first fitting a particular model? Ideally, you'd want to feature-select regardless of which model you use, so in your Jupyter Notebook you'd want a cell for feature selection before the model, something like this:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
# "mutual_info_classif" is the mutual-information way of selecting
# the K most dependent features based on the class label
K = 3
selector = SelectKBest(mutual_info_classif, k=K)
# new_df is the DataFrame of your data, with the class label in the last column
X = new_df.iloc[:, :-1]
y = new_df.iloc[:, -1]
X_reduced = selector.fit_transform(X, y)
features_selected = selector.get_support()  # boolean mask over the original columns
First, I did the feature selection (above).
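If you want to see which columns survived (assuming new_df is a pandas DataFrame, as above), the boolean mask from get_support() maps straight back onto the column names:
# names of the K columns the selector kept
print(X.columns[features_selected])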
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# hold out 30% of the reduced data for testing
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, train_size=0.7)
# use logistic regression as a model
logreg = LogisticRegression(C=0.1, max_iter=1000, solver='lbfgs')
logreg.fit(X_train, y_train)
And then I trained the model (above)! :)
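From here you could sanity-check the held-out split, for example:
# mean accuracy on the 30% held-out split
print(logreg.score(X_test, y_test))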
Translated from: https://towardsdatascience.com/help-how-do-i-feature-select-eaf37e58fdaf