[Algorithm Competition Study] Used-Car Transaction Price Prediction - Task 3: Feature Engineering

Published: 2023/12/15

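Part 3: Feature Engineering Goals

Tip: This is the Task 3 (feature engineering) part of the beginner's introduction to data mining. It walks through common feature-engineering and analysis methods. Follow-up discussion is welcome.

Competition: Introduction to Data Mining for Beginners - Used-Car Transaction Price Prediction

URL: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX

3.1 Feature engineering goals

Analyze the features further and process the data accordingly.
Complete the feature-engineering analysis, summarize the data with charts or text, and check in.

3.2 Overview

Common feature-engineering steps include: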

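Outlier handling:

Remove outliers identified with a box plot (or the 3-sigma rule);
Box-Cox transformation (for skewed distributions);
Long-tail truncation;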
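
As a minimal sketch of two of the options above, the snippet below flags values with the 3-sigma rule and caps the tails at chosen quantiles instead of dropping rows. The helper names `three_sigma_mask` and `clip_tails`, the quantile levels and the toy series are illustrative assumptions, not part of the original tutorial.

```python
import pandas as pd


def three_sigma_mask(s: pd.Series, n_sigma: float = 3.0) -> pd.Series:
    """Return True where a value lies more than n_sigma standard deviations from the mean."""
    return (s - s.mean()).abs() > n_sigma * s.std()


def clip_tails(s: pd.Series, lower_q: float = 0.01, upper_q: float = 0.99) -> pd.Series:
    """Cap values outside the given quantiles (long-tail truncation); no rows are removed."""
    return s.clip(lower=s.quantile(lower_q), upper=s.quantile(upper_q))


# Toy data (assumption): 99 ordinary values plus one extreme value.
toy = pd.Series(list(range(1, 100)) + [10_000], dtype=float)

mask = three_sigma_mask(toy)   # boolean mask of suspected outliers
clipped = clip_tails(toy)      # same length as toy, with capped tails

print("flagged by 3-sigma:", int(mask.sum()))
print("max before / after clipping:", toy.max(), clipped.max())
```

Flagging keeps the decision about deletion separate from detection, and clipping can be applied to a test set as well, because it never removes rows.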

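Feature normalization/standardization:

Standardization (rescale to zero mean and unit variance);
Normalization (rescale to the [0, 1] interval);
For power-law distributions, the formula $\log\left(\frac{1+x}{1+\mathrm{median}}\right)$ can be used.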
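
The three rescaling options above can be sketched with NumPy as follows. The function names and the toy array are assumptions made for illustration; the last function implements the log((1 + x) / (1 + median)) formula and assumes non-negative input.

```python
import numpy as np


def standardize(x):
    """Z-score: subtract the mean and divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def min_max(x):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())


def log_median_scale(x):
    """log((1 + x) / (1 + median)) for non-negative, power-law-like data."""
    x = np.asarray(x, dtype=float)
    return np.log((1.0 + x) / (1.0 + np.median(x)))


# Toy heavy-tailed values (assumption).
values = np.array([0.0, 1.0, 3.0, 7.0, 1023.0])

z = standardize(values)
mm = min_max(values)
lm = log_median_scale(values)

print(z)
print(mm)
print(lm)  # the median element maps to 0
```

In a real pipeline the mean, standard deviation, minimum, maximum and median would normally be computed on the training data and reused for the test data.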

Data binning:

Equal-frequency binning;
Equal-width binning;
Best-KS binning (similar to binary splitting with the Gini index);
Chi-square binning;
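
To make the difference between the first two binning schemes concrete, here is a small sketch using `pandas.cut` (equal width) and `pandas.qcut` (equal frequency). The toy values and the number of bins are assumptions chosen so that a single large value dominates the range.

```python
import pandas as pd

# Toy data (assumption): nine small values and one large one.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: the range [1, 100] is split into 3 intervals of equal length,
# so almost everything lands in the first bin.
equal_width = pd.cut(values, bins=3, labels=False)

# Equal-frequency: bin edges are quantiles, so each bin holds about the same number of rows.
equal_freq = pd.qcut(values, q=5, labels=False)

print(pd.DataFrame({"value": values, "equal_width": equal_width, "equal_freq": equal_freq}))
```

With skewed data, equal-width bins can be nearly empty, whereas equal-frequency bins stay balanced at the cost of uneven interval lengths.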

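Missing-value handling:

Leave as is (for tree models such as XGBoost);
Delete (when too much data is missing);
Impute, using the mean/median/mode, model-based prediction, multiple imputation, compressed-sensing completion, matrix completion, etc.;
Bin, with missing values in a bin of their own;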
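
A minimal sketch of two of these options: median imputation with a missing-value indicator, and binning with a dedicated bin for missing values. The column name `power` mirrors the dataset, but the toy values, the bin edges and the `-1` code for the missing bin are assumptions.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (assumption).
df = pd.DataFrame({"power": [60.0, np.nan, 163.0, 193.0, np.nan, 68.0]})

# Option 1: keep a flag for "was missing", then fill with the median.
df["power_missing"] = df["power"].isna().astype(int)
median = df["power"].median()          # ideally computed on the training set only
df["power_filled"] = df["power"].fillna(median)

# Option 2: bin the column and give missing values their own bin (coded as -1 here).
bins = pd.cut(df["power"], bins=[0, 100, 200, 300], labels=False)
df["power_bin"] = bins.fillna(-1).astype(int)

print(df)
```

Keeping the indicator column lets a linear model learn a separate effect for missingness, which plain imputation would hide.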

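Feature construction:

Statistical features, including counts, sums, ratios, standard deviations, etc.;
Time features, including relative and absolute time, holidays, weekends, etc.;
Geographic information, using binning, distribution encoding and similar methods;
Nonlinear transformations, including log, square, square root, etc.;
Feature combinations and feature crosses;
Beyond that, it is a matter of judgment: different people will see different things.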
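
The sketch below illustrates a few of the constructions listed above on a toy frame: calendar features from a date column, a per-group count, and a ratio to the group mean. The column names follow the dataset (`brand`, `creatDate`, `price`), while the rows and the new feature names are made up for illustration.

```python
import pandas as pd

# Toy rows (assumption); dates use the dataset's YYYYMMDD format.
df = pd.DataFrame({
    "brand": [1, 1, 2, 2, 2],
    "creatDate": ["20160404", "20160409", "20160312", "20160315", "20160320"],
    "price": [1850, 3600, 6222, 2400, 5200],
})

# Time features: month, day of week (Monday = 0) and a weekend flag.
date = pd.to_datetime(df["creatDate"], format="%Y%m%d", errors="coerce")
df["creat_month"] = date.dt.month
df["creat_weekday"] = date.dt.dayofweek
df["creat_is_weekend"] = (date.dt.dayofweek >= 5).astype(int)

# Statistical features: rows per brand and price relative to the brand mean.
df["brand_count"] = df.groupby("brand")["price"].transform("count")
df["price_to_brand_mean"] = df["price"] / df.groupby("brand")["price"].transform("mean")

print(df)
```

Statistics that involve the target, such as the price ratio here, should be computed from training rows only, otherwise information about the target leaks into the features.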

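Feature selection:

Filter: select features first, then train the learner. Common methods include Relief, variance thresholding, correlation coefficients, the chi-square test and mutual information;
Wrapper: use the performance of the final learner directly as the evaluation criterion for a feature subset. A common method is LVW (Las Vegas Wrapper);
Embedded: combines the filter and wrapper approaches, with feature selection happening automatically while the learner trains. A common example is Lasso regression;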
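
As one possible illustration of the filter approach, the sketch below first drops near-constant columns (variance thresholding) and then keeps columns whose absolute rank correlation with the target exceeds a cutoff. The synthetic data, the column names and the 0.3 cutoff are assumptions. Rank correlation is computed as the Pearson correlation of ranks, which is equivalent to Spearman correlation.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): one informative column, one noise column, one constant column.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
X = pd.DataFrame({
    "informative": signal,
    "noise": rng.normal(size=n),
    "constant": np.ones(n),
})
y = pd.Series(3.0 * signal + rng.normal(scale=0.1, size=n), name="target")

# Step 1: variance thresholding removes columns that carry no information.
variances = X.var()
kept = variances[variances > 1e-8].index

# Step 2: score the remaining columns by absolute rank (Spearman) correlation with the target.
scores = X[kept].rank().corrwith(y.rank()).abs().sort_values(ascending=False)
selected = scores[scores > 0.3].index.tolist()

print(scores)
print("selected:", selected)
```

Filter methods score each feature independently of any model, so they are cheap but cannot see interactions between features.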

Dimensionality reduction:

PCA / LDA / ICA;
Feature selection is also a form of dimensionality reduction.
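
For readers who have not seen PCA written out, here is a minimal sketch based on the singular value decomposition of the centered data. The helper name `pca` and the synthetic data are assumptions. In practice a library implementation would normally be used instead.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its top principal components using the SVD of the centered data."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    projected = X_centered @ Vt[:n_components].T
    return projected, explained_ratio[:n_components]


# Synthetic data (assumption): the second column is almost a multiple of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([
    base,
    2.0 * base + rng.normal(scale=0.01, size=(100, 1)),
    rng.normal(scale=0.1, size=(100, 1)),
])

Z, ratio = pca(X, n_components=1)
print(Z.shape, ratio)
```

Because two of the three columns are nearly collinear, a single component captures almost all of the variance.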

3.3 Code examples

3.3.0 Loading the data

    import pandas as pd
    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt
    import seaborn as sns
    from operator import itemgetter

    %matplotlib inline

    train = pd.read_csv('train.csv', sep=' ')
    test = pd.read_csv('testA.csv', sep=' ')
    print(train.shape)
    print(test.shape)

Output:

    (150000, 30)
    (50000, 30)

    train.head()

Output (pandas elides the middle columns as `...`; the extracted rows carry only two values for the three columns bodyType, fuelType and gearbox, so these are shown under one heading):

         name   regDate  model  brand  bodyType/fuelType/gearbox  power  kilometer  notRepairedDamage  ...
    0     736  20040402   30.0      6  1.0  0.0                      60       12.5                0.0  ...
    1    2262  20030301   40.0      1  2.0  0.0                       0       15.0                  -  ...
    2   14874  20040403  115.0     15  1.0  0.0                     163       12.5                0.0  ...
    3   71865  19960908  109.0     10  0.0  1.0                     193       15.0                0.0  ...
    4  111080  20120103  110.0      5  1.0  0.0                      68        5.0                0.0  ...

       ...       v_5       v_6       v_7       v_8       v_9      v_10      v_11      v_12      v_13      v_14
    0  ...  0.235676  0.101988  0.129549  0.022816  0.097462 -2.881803  2.804097 -2.420821  0.795292  0.914762
    1  ...  0.264777  0.121004  0.135731  0.026597  0.020582 -4.900482  2.096338 -1.030483 -1.722674  0.245522
    2  ...  0.251410  0.114912  0.165147  0.062173  0.027075 -4.846749  1.803559  1.565330 -0.832687 -0.229963
    3  ...  0.274293  0.110300  0.121964  0.033395  0.000000 -4.509599  1.285940 -0.501868 -2.438353 -0.478699
    4  ...  0.228036  0.073205  0.091880  0.078819  0.121534 -1.896240  0.910783  0.931110  2.834518  1.923482

    5 rows × 30 columns

    train.columns

Output:

    Index(['name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox',
           'power', 'kilometer', 'notRepairedDamage', 'regionCode', 'seller',
           'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4',
           'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13',
           'v_14'],
          dtype='object')

    test.columns

Output:

    Index(['name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox',
           'power', 'kilometer', 'notRepairedDamage', 'regionCode', 'seller',
           'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4',
           'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13',
           'v_14'],
          dtype='object')

3.3.1 Removing outliers

    # A reusable outlier-handling helper.
    def outliers_proc(data, col_name, scale=3):
        """
        Clean outliers; by default uses the box-plot rule with scale=3.
        :param data: a pandas DataFrame (assumed to have the default RangeIndex)
        :param col_name: column name
        :param scale: multiple of the IQR used as the whisker length
        :return: a copy of data with the outlier rows dropped
        """

        def box_plot_outliers(data_ser, box_scale):
            """
            Flag outliers with the box-plot rule.
            :param data_ser: a pandas Series
            :param box_scale: multiple of the IQR
            :return: (rule_low, rule_up) boolean masks and (val_low, val_up) bounds
            """
            iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
            val_low = data_ser.quantile(0.25) - iqr
            val_up = data_ser.quantile(0.75) + iqr
            rule_low = (data_ser < val_low)
            rule_up = (data_ser > val_up)
            return (rule_low, rule_up), (val_low, val_up)

        data_n = data.copy()
        data_series = data_n[col_name]
        rule, value = box_plot_outliers(data_series, box_scale=scale)
        # Positions of all rows below the lower bound or above the upper bound.
        index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
        print("Delete number is: {}".format(len(index)))
        data_n = data_n.drop(index)
        data_n.reset_index(drop=True, inplace=True)
        # shape[0] is the number of rows that remain.
        print("Now row number is: {}".format(data_n.shape[0]))
        index_low = np.arange(data_series.shape[0])[rule[0]]
        outliers = data_series.iloc[index_low]
        print("Description of data less than the lower bound is:")
        print(pd.Series(outliers).describe())
        index_up = np.arange(data_series.shape[0])[rule[1]]
        outliers = data_series.iloc[index_up]
        print("Description of data larger than the upper bound is:")
        print(pd.Series(outliers).describe())

        # Box plots before (left) and after (right) removing the outliers.
        fig, ax = plt.subplots(1, 2, figsize=(10, 7))
        sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0])
        sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1])
        return data_n

    # We can remove some outliers, using power as the example.
    # Whether to remove them is a judgment call.
    # Note that rows in test must not be deleted; that would only be fooling ourselves.
    train = outliers_proc(train, 'power', scale=3)

Output:

    Delete number is: 963
    Now row number is: 149037
    Description of data less than the lower bound is:
    count    0.0
    mean     NaN
    std      NaN
    min      NaN
    25%      NaN
    50%      NaN
    75%      NaN
    max      NaN
    Name: power, dtype: float64
    Description of data larger than the upper bound is:
    count      963.000000
    mean       846.836968
    std       1929.418081
    min        376.000000
    25%        400.000000
    50%        436.000000
    75%        514.000000
    max      19312.000000
    Name: power, dtype: float64

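3.3.2 Feature construction

    # Put the training and test sets together to make feature construction easier.
    train['train'] = 1
    test['train'] = 0
    data = pd.concat([train, test], ignore_index=True, sort=False)

    # Usage time: data['creatDate'] - data['regDate'] reflects how long the car has been in use.
    # In general, the longer the usage time, the lower the price.
    # Some dates in the data are malformed, so we need errors='coerce'.
    data['used_time'] = (pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce') -
                         pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')).dt.days

    # Check the missing values: about 15k samples have a problematic date.
    # We could delete them or leave them in place. Deleting is not recommended here,
    # because the affected rows make up too large a share of the samples (about 7.5%).
    # We leave them for now: tree models such as XGBoost can handle missing values themselves.
    data['used_time'].isnull().sum()

Output:

    15072

    # Extract city information from the postal code. The data is German, so we follow the
    # German postal-code scheme, which amounts to adding prior knowledge.
    data['city'] = data['regionCode'].apply(lambda x: str(x)[:-3])

    # Compute sales statistics per brand; statistics of other features can be built the same way.
    # The statistics must be computed on the train data only.
    train_gb = train.groupby("brand")
    all_info = {}
    for kind, kind_data in train_gb:
        info = {}
        kind_data = kind_data[kind_data['price'] > 0]
        info['brand_amount'] = len(kind_data)
        info['brand_price_max'] = kind_data.price.max()
        info['brand_price_median'] = kind_data.price.median()
        info['brand_price_min'] = kind_data.price.min()
        info['brand_price_sum'] = kind_data.price.sum()
        info['brand_price_std'] = kind_data.price.std()
        # The +1 in the denominator avoids division by zero for an empty group.
        info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
        all_info[kind] = info
    brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
    data = data.merge(brand_fe, how='left', on='brand')

    # Bin the data, using power as the example.
    # Note that pd.cut returns NaN for missing values and for values outside (0, 300],
    # such as power = 0 in the output below.
    # Why bin the data? There are many reasons:
    # 1. After discretization, inner products of sparse vectors are faster to compute,
    #    and the results are easy to store and extend;
    # 2. Discretized features are more robust to outliers: with "age > 30 is 1, otherwise 0",
    #    an age of 200 will not disturb the model much;
    # 3. LR is a generalized linear model with limited expressive power. After discretization
    #    each bin gets its own weight, which introduces nonlinearity and improves the fit;
    # 4. Discretized features can be crossed, turning M+N variables into M*N variables,
    #    which introduces further nonlinearity and expressive power;
    # 5. The model is more stable after discretization: with age ranges, for example,
    #    a user does not change bins just by getting one year older.
    # There are other reasons as well. LightGBM, in improving on XGBoost, added data binning,
    # which strengthened the model's generalization.
    bins = [i * 10 for i in range(31)]
    data['power_bin'] = pd.cut(data['power'], bins, labels=False)
    data[['power_bin', 'power']].head()

Output:

       power_bin  power
    0        5.0     60
    1        NaN      0
    2       16.0    163
    3       19.0    193
    4        6.0     68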
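
The per-brand loop above can also be written with `groupby(...).agg(...)`. The following is a sketch on toy frames, not the tutorial's own code: `train_toy` and `test_toy` are made-up stand-ins, and only some of the brand statistics are reproduced. It also shows what happens to a brand that never appears in the training data.

```python
import pandas as pd

# Toy stand-ins for train and test (assumption).
train_toy = pd.DataFrame({"brand": [1, 1, 2, 2, 2], "price": [1000, 3000, 500, 700, 900]})
test_toy = pd.DataFrame({"brand": [1, 2, 3]})

# Statistics come from training rows only, using named aggregation.
brand_stats = (
    train_toy[train_toy["price"] > 0]
    .groupby("brand")["price"]
    .agg(
        brand_amount="count",
        brand_price_max="max",
        brand_price_median="median",
        brand_price_min="min",
        brand_price_sum="sum",
        brand_price_std="std",
    )
    .reset_index()
)

# A left merge keeps every test row; brands unseen in train get NaN statistics.
test_fe = test_toy.merge(brand_stats, how="left", on="brand")
print(test_fe)
```

Brand 3 does not occur in `train_toy`, so its statistics are NaN after the merge and need the same missing-value treatment as any other feature.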

    # Once the raw columns have been used, they can be dropped.
    data = data.drop(['creatDate', 'regDate', 'regionCode'], axis=1)
    print(data.shape)
    data.columns

Output:

    (199037, 38)
    Index(['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'power',
           'kilometer', 'notRepairedDamage', 'seller', 'offerType', 'price', 'v_0',
           'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10',
           'v_11', 'v_12', 'v_13', 'v_14', 'train', 'used_time', 'city',
           'brand_amount', 'brand_price_average', 'brand_price_max',
           'brand_price_median', 'brand_price_min', 'brand_price_std',
           'brand_price_sum', 'power_bin'],
          dtype='object')

    # The data is already usable by tree models, so we export it.
    data.to_csv('data_for_tree.csv', index=0)

    # We can build a second feature set for models such as LR and NN.
    # The two sets are built separately because different models place different requirements on the data.
    # First look at the distribution:
    data['power'].plot.hist()

[Figure: histogram of data['power']]

    # Outliers have already been removed from train, yet the distribution still looks this odd
    # because of the power outliers in test.
    # So it would have been better not to delete the power outliers in train;
    # long-tail truncation can be used instead.
    train['power'].plot.hist()

[Figure: histogram of train['power']]
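
A minimal sketch of the long-tail truncation suggested above: the upper bound is derived from the training data with the same box-plot rule (scale 3) and then applied to both sets with `clip`, so no rows are deleted. The toy frames and the `power_clipped` column name are assumptions.

```python
import pandas as pd

# Toy stand-ins for train and test (assumption), each with one extreme power value.
train_toy = pd.DataFrame({"power": [60, 75, 90, 110, 150, 19312]})
test_toy = pd.DataFrame({"power": [68, 101, 17000]})

# Upper bound from the training data only: Q3 + 3 * IQR.
q1, q3 = train_toy["power"].quantile([0.25, 0.75])
upper = q3 + 3 * (q3 - q1)

# Cap instead of delete, and apply the same bound to train and test.
train_toy["power_clipped"] = train_toy["power"].clip(upper=upper)
test_toy["power_clipped"] = test_toy["power"].clip(upper=upper)

print("upper bound:", upper)
print(test_toy)
```

Because the bound is fixed from the training data, train and test end up with the same value range, which is what the log and min-max steps that follow rely on.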

    # Take the log, then apply min-max normalization.
    from sklearn import preprocessing
    min_max_scaler = preprocessing.MinMaxScaler()  # created here, but the scaling below is done by hand
    data['power'] = np.log(data['power'] + 1)
    data['power'] = ((data['power'] - np.min(data['power'])) / (np.max(data['power']) - np.min(data['power'])))
    data['power'].plot.hist()

[Figure: histogram of the log-transformed, normalized power]

    # kilometer looks fairly regular; it has probably been binned already.
    data['kilometer'].plot.hist()

[Figure: histogram of data['kilometer']]

    # So we can normalize it directly.
    data['kilometer'] = ((data['kilometer'] - np.min(data['kilometer'])) /
                         (np.max(data['kilometer']) - np.min(data['kilometer'])))
    data['kilometer'].plot.hist()

[Figure: histogram of the normalized kilometer]

    # Then there are the statistical features we just constructed:
    # 'brand_amount', 'brand_price_average', 'brand_price_max',
    # 'brand_price_median', 'brand_price_min', 'brand_price_std',
    # 'brand_price_sum'
    # We do not analyze them one by one here and apply the transformation directly.
    def max_min(x):
        return (x - np.min(x)) / (np.max(x) - np.min(x))

    for col in ['brand_amount', 'brand_price_average', 'brand_price_max',
                'brand_price_median', 'brand_price_min', 'brand_price_std',
                'brand_price_sum']:
        data[col] = max_min(data[col])

    # One-hot encode the categorical features.
    data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType', 'fuelType',
                                         'gearbox', 'notRepairedDamage', 'power_bin'])
    print(data.shape)
    data.columns

Output:

    (199037, 369)
    Index(['name', 'power', 'kilometer', 'seller', 'offerType', 'price', 'v_0',
           'v_1', 'v_2', 'v_3',
           ...
           'power_bin_20.0', 'power_bin_21.0', 'power_bin_22.0', 'power_bin_23.0',
           'power_bin_24.0', 'power_bin_25.0', 'power_bin_26.0', 'power_bin_27.0',
           'power_bin_28.0', 'power_bin_29.0'],
          dtype='object', length=369)

    # This dataset can be used by LR.
    data.to_csv('data_for_lr.csv', index=0)

3.3.3 Feature selection

1) Filter

    # Correlation analysis
    print(data['power'].corr(data['price'], method='spearman'))
    print(data['kilometer'].corr(data['price'], method='spearman'))
    print(data['brand_amount'].corr(data['price'], method='spearman'))
    print(data['brand_price_average'].corr(data['price'], method='spearman'))
    print(data['brand_price_max'].corr(data['price'], method='spearman'))
    print(data['brand_price_median'].corr(data['price'], method='spearman'))

Output:

    0.5737373458520139
    -0.4093147076627742
    0.0579639618400197
    0.38587089498185884
    0.26142364388130207
    0.3891431767902722

    # Or simply look at a plot.
    data_numeric = data[['power', 'kilometer', 'brand_amount', 'brand_price_average',
                         'brand_price_max', 'brand_price_median']]
    correlation = data_numeric.corr()

    f, ax = plt.subplots(figsize=(7, 7))
    plt.title('Correlation of Numeric Features with Price', y=1, size=16)
    sns.heatmap(correlation, square=True, vmax=0.8)

[Figure: correlation heatmap of the numeric features]

2) Wrapper

    !pip install mlxtend

    # A large k_features is very slow to run without a server, so the run was interrupted early.
    from mlxtend.feature_selection import SequentialFeatureSelector as SFS
    from sklearn.linear_model import LinearRegression
    sfs = SFS(LinearRegression(),
              k_features=10,
              forward=True,
              floating=False,
              scoring='r2',
              cv=0)
    x = data.drop(['price'], axis=1)
    x = x.fillna(0)
    y = data['price']
    sfs.fit(x, y)
    sfs.k_feature_names_

Output:

    STOPPING EARLY DUE TO KEYBOARD INTERRUPT...
    ('powerPS_ten',
     'city',
     'brand_price_std',
     'vehicleType_andere',
     'model_145',
     'model_601',
     'fuelType_andere',
     'notRepairedDamage_ja')

    # Plot the result to see the marginal gain from each added feature.
    from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
    import matplotlib.pyplot as plt
    fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
    plt.grid()
    plt.show()

Output:

    /Users/chenze/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py:140: RuntimeWarning: Degrees of freedom <= 0 for slice
      keepdims=keepdims)
    /Users/chenze/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py:132: RuntimeWarning: invalid value encountered in double_scalars
      ret = ret.dtype.type(ret / rcount)

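3) Embedded

    # Covered in the next chapter: Lasso regression and decision trees can perform embedded feature selection.
    # In most cases, embedded methods are what is used for feature selection.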
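
Until then, here is a minimal sketch of embedded selection with Lasso regression from scikit-learn. The synthetic data, the `alpha` value and the variable names are assumptions. The L1 penalty drives the coefficients of uninformative features to exactly zero, so the features with non-zero coefficients are the selected ones.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): only the first two of five features influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Lasso is scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# alpha controls the strength of the L1 penalty; larger values zero out more coefficients.
model = Lasso(alpha=0.1).fit(X_scaled, y)

selected = np.flatnonzero(model.coef_ != 0)
print("coefficients:", model.coef_)
print("selected feature indices:", selected)
```

The value of `alpha` would normally be chosen by cross-validation rather than fixed by hand as it is here.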

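3.4 Lessons learned

Feature engineering is the most crucial part of a competition. In traditional competitions especially, everyone's models tend to be similar and the gains from hyperparameter tuning are very limited, whereas the quality of the feature engineering often decides the final ranking and score.

The main purpose of feature engineering is to turn the data into features that better represent the underlying problem, and thereby improve machine-learning performance. For example, outlier handling removes noise, and filling in missing values can inject prior knowledge.

Feature construction is also part of feature engineering; its purpose is to make the data more expressive.

In some competitions the features are anonymized, so we do not know how they relate to one another. In that case we can only work on the features themselves, for example by computing feature statistics with binning, groupby, agg and similar operations. We can also apply further transformations such as log and exp, or combine several features with arithmetic operations (like the usage time computed above) or polynomial combinations, and then select among the results. Anonymity limits much of what can be done with the features, although using an NN to extract features sometimes works surprisingly well.

For non-anonymized features whose meaning is known, especially in industrial competitions, more meaningful features are built from signal processing, frequency-domain extraction, kurtosis, skewness and so on. This is feature construction grounded in domain background. The same holds in recommender systems: click-through-rate statistics of various kinds, statistics per time period, statistics combined with user attributes, and so on. Building such features usually requires a deep analysis of the underlying business logic or physical principles in order to find the "magic" features.

Feature engineering is also tied to the model, which is why we bin and normalize features for LR and NN models. The effect of a feature transformation and the importance of a feature usually have to be verified through the model.

Overall, feature engineering is easy to get started with but very hard to master.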
