A Hands-On Guide to Feature Engineering in Python
Introduction
In this guide, I will walk through how to use data manipulation to extract features manually.
Manual feature engineering can be exhausting and requires plenty of time, experience, and domain knowledge to develop the right features. There are many automatic feature engineering tools available, such as FeatureTools and AutoFeat. However, manual feature engineering is essential for understanding those advanced tools, and it helps in building a robust and generic model. I will use the home-credit-default-risk dataset available on the Kaggle platform, and only two tables from the main folder: bureau and bureau_balance. According to the dataset description on the competition page, the tables are the following:
bureau.csv
- This table includes all clients’ previous credits from other financial institutions that reported to the Credit Bureau.
bureau_balance.csv
- Monthly balances of earlier loans in the Credit Bureau.
- This table has one row for each month of the history of every previous loan reported to the Credit Bureau.
Topics covered in this tutorial
1. Reading and Munging the data
I will start by importing some important libraries that would help in understanding the data.
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np
# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress warnings from pandas
import warnings
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')
I will start by analyzing bureau.csv:
# Read in bureau
bureau = pd.read_csv('../input/home-credit-default-risk/bureau.csv')
bureau.head()
This table has 1,716,428 observations and 17 features.
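The observation count above and the dtype listing below can be reproduced with a quick check (this snippet is added here for convenience and is not part of the original code):
print(bureau.shape)   # (1716428, 17)
bureau.dtypes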
SK_ID_CURR int64
SK_ID_BUREAU int64
CREDIT_ACTIVE object
CREDIT_CURRENCY object
DAYS_CREDIT int64
CREDIT_DAY_OVERDUE int64
DAYS_CREDIT_ENDDATE float64
DAYS_ENDDATE_FACT float64
AMT_CREDIT_MAX_OVERDUE float64
CNT_CREDIT_PROLONG int64
AMT_CREDIT_SUM float64
AMT_CREDIT_SUM_DEBT float64
AMT_CREDIT_SUM_LIMIT float64
AMT_CREDIT_SUM_OVERDUE float64
CREDIT_TYPE object
DAYS_CREDIT_UPDATE int64
AMT_ANNUITY float64
dtype: object
We need to get the number of previous loans per client id, which is SK_ID_CURR. We can get that using the pandas groupby and count() aggregation functions, then store the result in a new dataframe after renaming SK_ID_BUREAU to previous_loan_count for readability.
# group by client id and count the number of previous loans
prev_loan_count = bureau.groupby('SK_ID_CURR', as_index = False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'previous_loan_count'})
The new prev_loan_count dataframe has only 305,811 observations. Now, I will merge prev_loan_count into the train dataset on the client id SK_ID_CURR, then fill the missing values with 0. Finally, I check that the new column has been added using the dtypes attribute.
# join with the training dataframe
# read train.csv
pd.set_option('display.max_columns', None)
train = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
train = train.merge(prev_loan_count, on = 'SK_ID_CURR', how = 'left')
# fill the missing values with 0
train['previous_loan_count'] = train['previous_loan_count'].fillna(0)
train['previous_loan_count'].dtypes
# dtype('float64')
It is already there!
2. Investigate correlation
The next step is to explore the Pearson correlation coefficient (r-value) between the attributes and the target as a rough measure of feature importance. It is not a true measure of importance for the new variables; however, it provides a reference for whether a variable will be helpful to the model or not.
A higher correlation with respect to the dependent variable means that a change in that variable is associated with a larger change in the dependent variable. So, in the next step, I will look at the variables with the highest absolute r-value relative to the dependent variable.
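As a quick aside, the r-value referred to here is the Pearson correlation coefficient, r = cov(X, Y) / (std(X) * std(Y)), which is what the pandas corr method computes by default. A minimal sketch with made-up numbers (reusing the pandas and numpy imports from above):
x = pd.Series([0, 1, 0, 1, 1])            # toy binary target
y = pd.Series([3.2, 5.1, 2.9, 4.8, 5.0])  # toy feature
# Both lines compute the same Pearson r
print(x.corr(y))
print(np.corrcoef(x, y)[0, 1])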
A kernel density estimate (KDE) plot is a good way to visualize the relationship between an independent variable and the dependent variable.
# Plots the distribution of a variable colored by value of the dependent variable
def kde_target(var_name, df):
    # Calculate the correlation coefficient between the new variable and the target
    corr = df['TARGET'].corr(df[var_name])
    # Calculate medians for repaid vs not repaid
    avg_repaid = df.loc[df['TARGET'] == 0, var_name].median()
    avg_not_repaid = df.loc[df['TARGET'] == 1, var_name].median()
    plt.figure(figsize = (12, 6))
    # Plot the distribution for target == 0 and target == 1
    sns.kdeplot(df.loc[df['TARGET'] == 0, var_name], label = 'TARGET == 0')
    sns.kdeplot(df.loc[df['TARGET'] == 1, var_name], label = 'TARGET == 1')
    # label the plot
    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)
    plt.legend()
    # print out the correlation
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))
    # Print out average values
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid = %0.4f' % avg_repaid)
Then check the distribution of previous_loan_count against the TARGET:
kde_target('previous_loan_count', train)
[KDE plot for previous_loan_count]
It is hard to see any significant correlation between the TARGET and previous_loan_count; no meaningful relationship can be detected from the diagram. So, more variables need to be investigated using aggregation functions.
3. Aggregate numeric columns
I will take the numeric columns grouped by client id and apply the min, max, sum, mean, and count statistics to get summary statistics for each numeric feature.
# Group by the client id, calculate aggregation statistics
bureau_agg = bureau.drop(columns = ['SK_ID_BUREAU']).groupby('SK_ID_CURR', as_index = False).agg(['count', 'mean', 'min', 'max', 'sum']).reset_index()
Create a new name for each column for readability's sake, then merge with the train dataset.
# List of column names
columns = ['SK_ID_CURR']
# Iterate through the variables names
for var in bureau_agg.columns.levels[0]:
    # Skip the id name
    if var != 'SK_ID_CURR':
        # Iterate through the stat names
        for stat in bureau_agg.columns.levels[1][:-1]:
            # Make a new column name for the variable and stat
            columns.append('bureau_%s_%s' % (var, stat))
# Assign the list of columns names as the dataframe column names
bureau_agg.columns = columns
# merge with the train dataset
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
Get the correlation with the TARGET variable, then sort the correlations using the sort_values() pandas function.
# Calculate correlation between the variables and the dependent variable
# Sort the correlations in descending order
new_corrs = train.drop(columns=['TARGET']).corrwith(train['TARGET']).sort_values(ascending=False)
# show the 15 strongest correlations with the TARGET variable
new_corrs[:15]
Now check the KDE plot for the newly created variables:
kde_target('bureau_DAYS_CREDIT_mean', train)
[KDE plot and correlation between bureau_DAYS_CREDIT_mean and the TARGET]
As illustrated, the correlation is again very weak and could be just noise. Note that a larger negative DAYS_CREDIT value means the previous loan was opened further in the past relative to the current loan application.
4. Get stats for the bureau_balance
bureau_balance = pd.read_csv('../input/home-credit-default-risk/bureau_balance.csv')
bureau_balance.head()
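The numeric aggregation step for bureau_balance is not shown in this excerpt, but the merging code further below calls a helper named agg_numeric. The sketch below is a reconstruction of such a helper, written only to be consistent with how it is called later (the original article's implementation may differ); it simply generalizes the groupby/agg pattern from section 3:
def agg_numeric(df, group_var, col_name):
    # keep the grouping key plus the numeric columns only
    numeric_df = df.select_dtypes('number').copy()
    if group_var not in numeric_df.columns:
        numeric_df[group_var] = df[group_var]
    agg = numeric_df.groupby(group_var).agg(['count', 'mean', 'min', 'max', 'sum'])
    # flatten the two-level column index into readable names
    agg.columns = ['%s_%s_%s' % (col_name, var, stat) for var, stat in agg.columns]
    return agg.reset_index()

# summary statistics of the monthly balances, grouped by the previous loan id
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', col_name = 'bureau_balance')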
5. Investigating the categorical variables
The following function iterates over the dataframe, picks the categorical columns, and creates dummy variables for them.
def process_categorical(df, group_var, col_name):
    """Computes counts and normalized counts for each observation
    of `group_var` of each unique category in every categorical variable

    Parameters
    --------
    df : dataframe
        The dataframe to calculate the value counts for.
    group_var : string
        The variable by which to group the dataframe. For each unique
        value of this variable, the final dataframe will have one row
    col_name : string
        Variable added to the front of column names to keep track of columns

    Return
    --------
    categorical : dataframe
        A dataframe with counts and normalized counts of each unique category
        in every categorical variable, with one row for every unique value
        of the `group_var`.
    """
    # pick the categorical columns and one-hot encode them
    categorical = pd.get_dummies(df.select_dtypes('O'))
    # put an id for each column
    categorical[group_var] = df[group_var]
    # aggregate the group_var (sum of a dummy = count, mean = normalized count)
    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])
    columns_name = []
    # iterate over the columns in level 0
    for var in categorical.columns.levels[0]:
        # iterate through level 1 for stats
        for stat in ['count', 'count_norm']:
            # make new column name
            columns_name.append('%s_%s_%s' % (col_name, var, stat))
    categorical.columns = columns_name
    return categorical
This function returns sum and mean statistics for each dummy-encoded categorical column.
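To make the naming concrete, here is a toy illustration with made-up data (not from the dataset): for a client with three bureau records, two of them Active, the Active dummy sums to 2 (the count) and averages to roughly 0.67 (the normalized count).
toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 1],
                    'CREDIT_ACTIVE': ['Active', 'Active', 'Closed']})
process_categorical(toy, group_var = 'SK_ID_CURR', col_name = 'bureau')
# bureau_CREDIT_ACTIVE_Active_count = 2, bureau_CREDIT_ACTIVE_Active_count_norm ~ 0.67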
bureau_counts = process_categorical(bureau, group_var = 'SK_ID_CURR', col_name = 'bureau')
Do the same for bureau_balance:
bureau_balance_counts = process_categorical(df = bureau_balance, group_var = 'SK_ID_BUREAU', col_name = 'bureau_balance')
Now we have the calculations on each loan, and we need to aggregate them for each client. I will merge all the previous dataframes together, then aggregate the statistics again, grouped by SK_ID_CURR.
# dataframe grouped by the loan
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, left_on = 'SK_ID_BUREAU', how = 'outer')
# Merge to include the SK_ID_CURR
bureau_by_loan = bureau[['SK_ID_BUREAU', 'SK_ID_CURR']].merge(bureau_by_loan, on = 'SK_ID_BUREAU', how = 'left')
# Aggregate the stats for each client
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', col_name = 'client')
6. Insert the computed features into the train dataset
original_features = list(train.columns)
print('Original Number of Features: ', len(original_features))
The output: Original Number of Features: 122
# Merge with the value counts of bureau
train = train.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')
# Merge with the stats of bureau
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
# Merge with the monthly information grouped by client
train = train.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
new_features = list(train.columns)
print('Number of features using previous loans from other institutions data: ', len(new_features))
The output is: Number of features using previous loans from other institutions data: 333
7. Check the missing data
It is very important to check missing data in the training set after merging the new features.
# Function to calculate missing values by column
def missing_percent(df):
    """Computes the count and percentage of missing values in each column

    Parameters
    --------
    df : dataframe
        The dataframe to calculate the missing values for.

    Return
    --------
    mis_columns : dataframe
        A dataframe with the missing-value information.
    """
    # Total missing values
    mis_val = df.isnull().sum()
    # Percentage of missing values
    mis_percent = 100 * df.isnull().sum() / len(df)
    # Make a table with the results
    mis_table = pd.concat([mis_val, mis_percent], axis=1)
    # Rename the columns
    mis_columns = mis_table.rename(
        columns = {0 : 'Missing Values', 1 : 'Percent of Total Values'})
    # Sort the table by percentage of missing descending
    mis_columns = mis_columns[
        mis_columns.iloc[:,1] != 0].sort_values(
        'Percent of Total Values', ascending=False).round(2)
    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_columns.shape[0]) +
          " columns that have missing values.")
    # Return the dataframe with missing information
    return mis_columns

train_missing = missing_percent(train)
train_missing.head()
There are quite a number of columns with plenty of missing data. I am going to drop any column that has more than 90% missing data.
missing_vars_train = train_missing.loc[train_missing['Percent of Total Values'] > 90, 'Percent of Total Values']
len(missing_vars_train)
# 0
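No column crosses the 90% threshold here, so nothing actually gets dropped. If some columns did exceed it, the drop step would look like the following sketch (not part of the original code):
cols_to_drop = list(missing_vars_train.index)
train = train.drop(columns = cols_to_drop)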
I will do the same for the test data:
# Read in the test dataframe
test = pd.read_csv('../input/home-credit-default-risk/application_test.csv')
# Merge with the value counts of bureau
test = test.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')
# Merge with the stats of bureau
test = test.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
# Merge with the value counts of bureau balance
test = test.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
Then, I will align the train and test datasets so that they keep only their common columns, and check their shapes.
# create a train target label
train_label = train['TARGET']
# align both dataframes, this will remove the TARGET column
train, test = train.align(test, join='inner', axis = 1)
train['TARGET'] = train_label
print('Training Data Shape: ', train.shape)
print('Testing Data Shape: ', test.shape)
# Training Data Shape: (307511, 333)
# Testing Data Shape: (48744, 332)
Let’s check the missing percent on the test set.
test_missing = missing_percent(test)
test_missing.head()
8. Correlations
I will check the correlation between the TARGET variable and the newly created features.
# calculate correlations for the whole dataframe
corr_train = train.corr()
# Sort the resulting values in descending order
corr_train = corr_train.sort_values('TARGET', ascending = False)
# show the ten most positive correlations with the TARGET variable
pd.DataFrame(corr_train['TARGET'].head(10))
As observed from the sample above, the most correlated variables are the ones that were engineered earlier. However, correlation doesn't mean causation, which is why we need to assess those correlations and pick the variables that have a real influence on the TARGET. To do so, I will stick with the KDE plot.
kde_target('bureau_DAYS_CREDIT_mean', train)
[KDE plot for bureau_DAYS_CREDIT_mean]
The plot suggests that applicants with a greater number of monthly records per loan tend to repay the new loan. Let's look at the bureau_CREDIT_ACTIVE_Active_count_norm variable to see if this holds.
kde_target('bureau_CREDIT_ACTIVE_Active_count_norm', train)
[KDE plot for bureau_CREDIT_ACTIVE_Active_count_norm]
The correlation here is very weak, and we can't draw any significant conclusion from it.
9. Collinearity
I will set a threshold of 0.8 to find highly collinear variables and remove one variable from each pair whose correlation exceeds it.
# Set the threshold
threshold = 0.8
# Empty dictionary to hold correlated variables
above_threshold_vars = {}
# For each column, record the variables that are above the threshold
for col in corr_train:
    above_threshold_vars[col] = list(corr_train.index[corr_train[col] > threshold])
# Track columns to remove and columns already examined
cols_to_remove = []
cols_seen = []
cols_to_remove_pair = []
# Iterate through columns and correlated columns
for key, value in above_threshold_vars.items():
    # Keep track of columns already examined
    cols_seen.append(key)
    for x in value:
        if x == key:
            pass
        else:
            # Only want to remove one in a pair
            if x not in cols_seen:
                cols_to_remove.append(x)
                cols_to_remove_pair.append(key)
cols_to_remove = list(set(cols_to_remove))
print('Number of columns to remove: ', len(cols_to_remove))
The output is: Number of columns to remove: 134
Then, we can remove those columns from the dataset as a preparation step for model building.
train_corrs_removed = train.drop(columns = cols_to_remove)
test_corrs_removed = test.drop(columns = cols_to_remove)
print('Training Corrs Removed Shape: ', train_corrs_removed.shape)
print('Testing Corrs Removed Shape: ', test_corrs_removed.shape)
# Training Corrs Removed Shape: (307511, 199)
# Testing Corrs Removed Shape: (48744, 198)
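At this point the reduced dataframes are ready for the modeling stage; they can be written out for later use (the file names below are just placeholders):
train_corrs_removed.to_csv('train_bureau_corrs_removed.csv', index = False)
test_corrs_removed.to_csv('test_bureau_corrs_removed.csv', index = False)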
Summary
The purpose of this tutorial was to introduce you to several concepts that may seem confusing at first: manual aggregation-based feature engineering, assessing new features with correlations and KDE plots, handling missing data, and removing collinear features.
Translated from: https://towardsdatascience.com/hands-on-guide-to-feature-engineering-de793efc785