Column | A Jupyter-Based Feature Engineering Handbook: Data Preprocessing (Part 1)
Authors: Yingxiang Chen & Zihan Yang
Editor: 紅色石頭
The importance of feature engineering in machine learning is self-evident: proper feature engineering can significantly improve the performance of machine learning models. We have compiled a systematic feature engineering tutorial on GitHub for reference and study.
Project address:
https://github.com/YC-Coder-Chen/feature-engineering-handbook
This article covers the data preprocessing part: how to handle static continuous variables with scikit-learn, static categorical variables with Category Encoders, and common time-series variables with Featuretools.
Table of Contents
We will introduce data preprocessing for feature engineering in three parts:
Static continuous variables
Static categorical variables
Time-series variables
This article covers Section 1.1, data preprocessing for static continuous variables, explained in detail below with sklearn in Jupyter.
1.1 Static Continuous Variables
1.1.1 Discretization
Discretizing continuous variables can make a model more robust. For example, when predicting customer purchase behavior, a customer with 30 existing purchases may behave very similarly to one with 32 purchases. Sometimes excessive precision in a feature is just noise, which is why LightGBM uses a histogram-based algorithm to guard against overfitting. There are two common ways to discretize continuous variables.
1.1.1.1 Binarization
Binarize the numerical feature.
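Below is a minimal sketch using sklearn.preprocessing.Binarizer; the synthetic values and the threshold of 1.0 are illustrative assumptions rather than values from the handbook's dataset.

import numpy as np
from sklearn.preprocessing import Binarizer

# a small synthetic feature column (illustrative values only)
train_set = np.array([1.5, 2.7, 0.3, 4.1, 0.9]).reshape(-1, 1)

# values > threshold become 1, values <= threshold become 0
binarizer = Binarizer(threshold=1.0)
binarizer.fit(train_set) # Binarizer is stateless, but fit keeps the usual sklearn API
result = binarizer.transform(train_set).reshape(-1)
# expected: array([1., 1., 0., 1., 0.])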
1.1.1.2 Binning
Bin the numerical feature.
Uniform binning:
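A minimal sketch with KBinsDiscretizer(strategy='uniform') is shown below; the synthetic values and n_bins=3 are illustrative assumptions.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# a small synthetic feature column (illustrative values only)
train_set = np.array([1.0, 2.0, 3.0, 8.0, 10.0]).reshape(-1, 1)

# 'uniform' strategy: all bins have the same width
model = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
model.fit(train_set)
result = model.transform(train_set).reshape(-1)
# bin edges are [1, 4, 7, 10], so the expected result is array([0., 0., 0., 2., 2.])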
Quantile binning:
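A minimal sketch with KBinsDiscretizer(strategy='quantile') is shown below; the synthetic values and n_bins=3 are illustrative assumptions.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# a small synthetic feature column (illustrative values only)
train_set = np.array([1.0, 2.0, 3.0, 8.0, 10.0, 15.0]).reshape(-1, 1)

# 'quantile' strategy: each bin holds roughly the same number of samples
model = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
model.fit(train_set)
result = model.transform(train_set).reshape(-1)
# each bin receives about one third of the samples,
# so the expected result is array([0., 0., 1., 1., 2., 2.])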
1.1.2 Scaling
Features on different scales are hard to compare, especially in linear models such as linear regression and logistic regression. In k-means clustering or KNN models based on Euclidean distance, feature scaling is required, otherwise the distance measurements are meaningless. Scaling also speeds up convergence for any algorithm that uses gradient descent.
Some commonly used scaling methods are introduced below.
Note: skewness affects PCA models, so it is better to use a power transformation to remove skewness first.
1.1.2.1 Standard Scaling (Z-score standardization)
Formula:
X' = (X - μ) / σ
where X is the variable (feature), μ is the mean of X, and σ is the standard deviation of X. This method is very sensitive to outliers, because outliers affect both μ and σ.
from sklearn.preprocessing import StandardScaler

# in order to mimic the operation in real-world, we shall fit the StandardScaler
# on the trainset and transform the testset
# (X is the numeric feature matrix loaded earlier in the handbook)
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]

model = StandardScaler()
model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 2.34539745,  2.33286782,  1.78324852,  0.93339178, -0.0125957 ,
#         0.08774668, -0.11109548, -0.39490751, -0.94221041, -0.09419626])
# result is the same as ((X[0:10,0] - X[10:,0].mean())/X[10:,0].std())

# visualize the distribution after the scaling
# fit and transform the entire first feature
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution

model = StandardScaler()
model.fit(X[:,0].reshape(-1,1))
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but scales change
fig.tight_layout()

1.1.2.2 MinMaxScaler (scale to a value range)
Suppose we want to scale the feature to the value range (a, b).
Formula:
X' = (X - Min) / (Max - Min) * (b - a) + a
where Min is the minimum of X and Max is the maximum of X. This method is also very sensitive to outliers, because outliers affect both Min and Max.
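A minimal sketch with MinMaxScaler is shown below, following the same fit-on-train, transform-on-test pattern as the other examples in this article; the synthetic values and the (0, 1) target range are illustrative assumptions.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# small synthetic train/test columns (illustrative values only)
train_set = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
test_set = np.array([0.5, 2.5, 6.0]).reshape(-1, 1)

# scale to the range (a, b) = (0, 1)
model = MinMaxScaler(feature_range=(0, 1))
model.fit(train_set) # Min and Max are learned from the train set only
result = model.transform(test_set).reshape(-1)
# X' = (X - Min) / (Max - Min) with Min = 1, Max = 5
# expected: array([-0.125, 0.375, 1.25])
# values outside the train range can fall outside (0, 1), since MinMaxScaler does not clip by default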
1.1.2.3 RobustScaler (outlier-robust scaling)
Scale the feature using statistics that are robust to outliers (quantiles). Suppose we want to scale the feature with the quantile range (a, b).
Formula:
X' = (X - median(X)) / (Q_b(X) - Q_a(X))
where Q_a(X) and Q_b(X) are the a-th and b-th quantiles of X. This method is more robust to outliers.
import numpy as np
from sklearn.preprocessing import RobustScaler

# in order to mimic the operation in real-world, we shall fit the RobustScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]

model = RobustScaler(with_centering = True, with_scaling = True,
                     quantile_range = (25.0, 75.0))
# with_centering = True => recenter the feature by set X' = X - X.median()
# with_scaling = True => rescale the feature by the quantile set by user
# set the quantile to the (25%, 75%)
model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 2.19755974,  2.18664281,  1.7077657 ,  0.96729508,  0.14306683,
#         0.23049401,  0.05724508, -0.19003715, -0.66689601,  0.07196918])
# result is the same as (X[0:10,0] - np.quantile(X[10:,0], 0.5))/(np.quantile(X[10:,0],0.75)-np.quantile(X[10:,0], 0.25))

# visualize the distribution after the scaling
# fit and transform the entire first feature
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution

model = RobustScaler(with_centering = True, with_scaling = True,
                     quantile_range = (25.0, 75.0))
model.fit(X[:,0].reshape(-1,1))
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but scales change
fig.tight_layout()

1.1.2.4 Power Transformation (non-linear transformation)
All the scaling methods above preserve the shape of the original distribution. However, normality is an important assumption of many statistical models, so we can use a power transformation to map the original distribution toward a normal distribution.
Box-Cox transformation:
The Box-Cox transformation only applies to positive values and assumes the following form:
X' = (X^λ - 1) / λ,   if λ ≠ 0
X' = ln(X),           if λ = 0
All values of λ are considered, and the optimal value that stabilizes the variance and minimizes skewness is selected via maximum likelihood estimation.
from sklearn.preprocessing import PowerTransformer

# in order to mimic the operation in real-world, we shall fit the PowerTransformer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]

model = PowerTransformer(method='box-cox', standardize=True)
# apply box-cox transformation
model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 1.91669292,  1.91009687,  1.60235867,  1.0363095 ,  0.19831579,
#         0.30244247,  0.09143411, -0.24694006, -1.08558469,  0.11011933])

# visualize the distribution after the scaling
# fit and transform the entire first feature
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution

model = PowerTransformer(method='box-cox', standardize=True)
model.fit(X[:,0].reshape(-1,1))
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution now becomes normal
fig.tight_layout()

Yeo-Johnson transformation:
The Yeo-Johnson transformation applies to both positive and negative values and assumes the following form:
X' = ((X + 1)^λ - 1) / λ,                  if λ ≠ 0, X ≥ 0
X' = ln(X + 1),                            if λ = 0, X ≥ 0
X' = -(((-X + 1)^(2 - λ)) - 1) / (2 - λ),  if λ ≠ 2, X < 0
X' = -ln(-X + 1),                          if λ = 2, X < 0
All values of λ are considered, and the optimal value that stabilizes the variance and minimizes skewness is selected via maximum likelihood estimation.
from sklearn.preprocessing import PowerTransformer

# in order to mimic the operation in real-world, we shall fit the PowerTransformer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]

model = PowerTransformer(method='yeo-johnson', standardize=True)
# apply yeo-johnson transformation
model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 1.90367888,  1.89747091,  1.604735  ,  1.05166306,  0.20617221,
#         0.31245176,  0.09685566, -0.25011726, -1.10512438,  0.11598074])

# visualize the distribution after the scaling
# fit and transform the entire first feature
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution

model = PowerTransformer(method='yeo-johnson', standardize=True)
model.fit(X[:,0].reshape(-1,1))
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution now becomes normal
fig.tight_layout()

1.1.3 Normalization
All the scaling methods above operate column by column, whereas normalization works on each row: it tries to "scale" each sample so that it has unit norm. Because normalization operates row by row, it distorts the relationships between features and is therefore not commonly used. However, normalization is quite useful in text classification and clustering contexts.
Suppose X[i][j] denotes the value of feature j in sample i.
L1 normalization formula:
X[i][j]' = X[i][j] / ( |X[i][1]| + |X[i][2]| + ... + |X[i][n]| )
L2 normalization formula:
X[i][j]' = X[i][j] / sqrt( X[i][1]^2 + X[i][2]^2 + ... + X[i][n]^2 )
L1 normalization:
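A minimal sketch with Normalizer(norm='l1') is shown below; the two synthetic samples are illustrative assumptions.

import numpy as np
from sklearn.preprocessing import Normalizer

# two synthetic samples with two features each (illustrative values only)
train_set = np.array([[1.0, 3.0],
                      [2.0, 2.0]])

# L1 normalization: each row is divided by the sum of absolute values in that row
model = Normalizer(norm='l1')
result = model.fit_transform(train_set)
# both rows have an L1 norm of 4, so the expected result is
# array([[0.25, 0.75],
#        [0.5 , 0.5 ]])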
L2 normalization:
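A minimal sketch with Normalizer(norm='l2') is shown below; the two synthetic samples are illustrative assumptions.

import numpy as np
from sklearn.preprocessing import Normalizer

# two synthetic samples with two features each (illustrative values only)
train_set = np.array([[3.0, 4.0],
                      [1.0, 1.0]])

# L2 normalization: each row is divided by its Euclidean norm, giving unit-length rows
model = Normalizer(norm='l2')
result = model.fit_transform(train_set)
# row norms are 5 and sqrt(2), so the expected result is
# array([[0.6       , 0.8       ],
#        [0.70710678, 0.70710678]])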
1.1.4 Missing Value Imputation
In practice, datasets often contain missing values. Such sparse datasets are incompatible with most scikit-learn models, which assume that all features are numerical with no missing values, so we need to impute the missing values before applying scikit-learn models.
However, some newer models, such as XGBoost, LightGBM and CatBoost (implemented in other packages), natively support missing values, so we no longer need to fill in missing values when applying these models.
1.1.4.1 Univariate Feature Imputation
Suppose the i-th column contains missing values; then we impute them with a constant or with a statistic (mean, median or mode) of the i-th column.
import numpy as np
from sklearn.impute import SimpleImputer

test_set = X[0:10,0].copy() # no missing values
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

# manually create some missing values
test_set[3] = np.nan
test_set[6] = np.nan
# now test_set becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12, 2.0804, 3.6912])

# create the train samples
# in real-world, we should fit the imputer on the train set and transform the test set
train_set = X[10:,0].copy()
train_set[3] = np.nan
train_set[6] = np.nan

imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # use mean
# we can set the strategy to 'mean', 'median', 'most_frequent', 'constant'
imputer.fit(train_set.reshape(-1,1))
result = imputer.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([8.3252    , 8.3014    , 7.2574    , 3.87023658, 3.8462    ,
#        4.0368    , 3.87023658, 3.12      , 2.0804    , 3.6912    ])
# all missing values are imputed with 3.87023658
# 3.87023658 = np.nanmean(train_set)
# which is the mean of the trainset ignoring missing values

1.1.4.2 Multivariate Feature Imputation
Multivariate feature imputation uses information from the entire dataset to estimate and impute missing values. In scikit-learn it is implemented in a round-robin iterative fashion.
At each step, one feature column is designated as the output y and the other feature columns are treated as the inputs X. A regressor is fit on (X, y) for the rows where y is known, and is then used to predict the missing values of y. This is done for each feature in turn, and the whole cycle is repeated for max_iter imputation rounds.
Using a linear model (BayesianRidge as an example):
from sklearn.experimental import enable_iterative_imputer # have to import this to enable IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

test_set = X[0:10,:].copy() # no missing values, select all features
# the first column is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12, 2.0804, 3.6912])

# create the train samples
# in real-world, we should fit the imputer on the train set and transform the test set
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan

impute_estimator = BayesianRidge()
imputer = IterativeImputer(max_iter = 10, random_state = 0,
                           estimator = impute_estimator)

imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252    , 8.3014    , 7.2574    , 4.6237195 , 3.8462    ,
#        4.0368    , 4.00258149, 3.12      , 2.0804    , 3.6912    ])

Using a tree-based model (ExtraTrees as an example):
from sklearn.experimental import enable_iterative_imputer # have to import this to enable IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

test_set = X[0:10,:].copy() # no missing values, select all features
# the first column is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12, 2.0804, 3.6912])

# create the train samples
# in real-world, we should fit the imputer on the train set and transform the test set
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan

impute_estimator = ExtraTreesRegressor(n_estimators=10, random_state=0)
# parameters can be tuned in CV through sklearn pipeline
imputer = IterativeImputer(max_iter = 10, random_state = 0,
                           estimator = impute_estimator)

imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252 , 8.3014 , 7.2574 , 4.63813, 3.8462 , 4.0368 , 3.24721,
#        3.12   , 2.0804 , 3.6912 ])

Using K-nearest neighbors (KNN):
from sklearn.experimental import enable_iterative_imputer # have to import this to enable IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor

test_set = X[0:10,:].copy() # no missing values, select all features
# the first column is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12, 2.0804, 3.6912])

# create the train samples
# in real-world, we should fit the imputer on the train set and transform the test set
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan

impute_estimator = KNeighborsRegressor(n_neighbors=10, p = 1)
# set p=1 to use manhattan distance
# use manhattan distance to reduce effect from outliers
# parameters can be tuned in CV through sklearn pipeline
imputer = IterativeImputer(max_iter = 10, random_state = 0,
                           estimator = impute_estimator)

imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252, 8.3014, 7.2574, 3.6978, 3.8462, 4.0368, 4.052 , 3.12  ,
#        2.0804, 3.6912])

1.1.4.3 Marking Imputed Values
Sometimes the fact that a value is missing can itself be informative. scikit-learn therefore also provides a way to convert a dataset with missing values into the corresponding binary matrix that indicates where missing values occur in the dataset.
from sklearn.impute import MissingIndicator

# illustrate this function on trainset only
# since the process is independent in train set and test set
train_set = X[10:,:].copy() # select all features
train_set[3,0] = np.nan # manually create some missing values
train_set[6,0] = np.nan
train_set[3,1] = np.nan

indicator = MissingIndicator(missing_values=np.nan, features='all')
# show the results on all the features
result = indicator.fit_transform(train_set)
# result has the same shape as train_set
# contains only True & False, True corresponds with missing value

result[:,0].sum() # should return 2, the first column has two missing values
result[:,1].sum(); # should return 1, the second column has one missing value

1.1.5 Feature Transformation
1.1.5.1 Polynomial Transformation
Sometimes we want to introduce non-linear features into the model to increase its complexity. For simple linear models this greatly increases the model's expressive power. But for more complex models, such as tree-based ML models, non-linear relationships are already captured by the non-parametric tree structure, so this feature transformation may not help tree-based ML models much.
For example, if we set the degree to 3, the generated features for two inputs X1 and X2 take the form:
(1, X1, X2, X1^2, X1*X2, X2^2, X1^3, X1^2*X2, X1*X2^2, X2^3)
from sklearn.preprocessing import PolynomialFeatures

# illustrate this function on one synthesized sample
train_set = np.array([2,3]).reshape(1,-1) # shape (1,2)
# return array([[2, 3]])

poly = PolynomialFeatures(degree = 3, interaction_only = False)
# the highest degree is set to 3, and we want more than just interaction terms

result = poly.fit_transform(train_set) # has shape (1, 10)
# array([[ 1.,  2.,  3.,  4.,  6.,  9.,  8., 12., 18., 27.]])

1.1.5.2 Custom Transformation
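A custom transformation wraps any user-defined function as a transformer so that it can be used like any other preprocessing step (for example inside a sklearn Pipeline). Below is a minimal sketch using sklearn.preprocessing.FunctionTransformer; the choice of np.log1p and the synthetic values are illustrative assumptions, not necessarily what the handbook uses.

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# a small synthetic feature column (illustrative values only)
train_set = np.array([1.0, 10.0, 100.0]).reshape(-1, 1)

# wrap a user-defined function (here np.log1p) as a sklearn transformer
transformer = FunctionTransformer(np.log1p, validate=True)
result = transformer.fit_transform(train_set).reshape(-1)
# expected: array([0.69314718, 2.39789527, 4.61512052])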
That concludes the data preprocessing for static continuous variables. Readers are encouraged to work through the code themselves in Jupyter.