[Feature Engineering] Feature Binning
Script overview:
  1) A complete automated feature-evaluation script
  2) Covering data preprocessing, feature binning, and feature-importance evaluation
Author: 研習社-正陽
I. Import tools and set paths
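The import cell for this section is not reproduced in the post. A minimal sketch of what the later functions assume — pandas/numpy for all data handling, `tqdm` for the progress bars in `get_feature_result` (the no-op fallback is my addition, not part of the original):

```python
# Imports assumed by the rest of the script; tqdm is only used for progress
# bars, so fall back to a no-op wrapper when it is not installed.
import warnings

import numpy as np
import pandas as pd

try:
    from tqdm import tqdm
except ImportError:
    tqdm = lambda iterable: iterable  # no-op fallback

warnings.filterwarnings('ignore')  # silence pandas assignment warnings
```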
II. Data Preprocessing
1. Custom missing-value handlers
1.1 Missing-value computation
Compute the fraction of missing data per feature.
1.2 Dropping by feature (column)
- If a column has severe missingness, first check the field itself: is it a deliberate business-level design, or a data-collection fault?
- If neither applies, drop columns whose missing ratio exceeds a chosen threshold.
- Common thresholds are above 90%, or above 40%–50%; whether to keep a column also depends on whether it carries a clear business meaning.
1.3 Dropping by sample (row)
- If there is no data-collection problem and a single sample is missing most of its fields, that sample can be treated as invalid and should be dropped.
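The three missing-value helpers that `data_processing` calls — `missing_cal`, `missing_delete_var`, `missing_delete_user` — are referenced but not listed in the post. A minimal sketch matching those call sites; interpreting the row-wise threshold as the maximum allowed number of missing fields per sample is an assumption:

```python
import pandas as pd

def missing_cal(df):
    """Per-column missing ratio, as a DataFrame with columns ['col', 'missing_pct']."""
    missing_pct = df.isnull().mean()
    return pd.DataFrame({'col': missing_pct.index, 'missing_pct': missing_pct.values})

def missing_delete_var(df, threshold=0.8):
    """Drop columns whose missing ratio exceeds `threshold`."""
    miss_df = missing_cal(df)
    drop_cols = list(miss_df[miss_df['missing_pct'] > threshold]['col'])
    return df.drop(drop_cols, axis=1)

def missing_delete_user(df, threshold):
    """Drop rows with more than `threshold` missing fields (assumed semantics)."""
    miss_rows = df.isnull().sum(axis=1)
    return df[miss_rows <= threshold].reset_index(drop=True)
```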
2. Custom constant-variable handler
- For heavily same-valued columns with no special business meaning, drop the column when a single value's share exceeds a threshold.
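`const_delete` is likewise referenced by `data_processing` but not listed. A sketch that drops a column when its most frequent value accounts for more than the given share of non-null rows:

```python
import pandas as pd

def const_delete(df, col_list, threshold=0.9):
    """Drop near-constant columns: those where one value exceeds `threshold`
    of the non-null rows."""
    drop_cols = []
    for col in col_list:
        vc = df[col].value_counts(normalize=True, dropna=True)
        if len(vc) > 0 and vc.iloc[0] > threshold:
            drop_cols.append(col)
    return df.drop(drop_cols, axis=1)
```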
3. Custom data_processing function that runs the full preprocessing pipeline:
1. Load the data
2. Drop heavily missing columns and rows (custom functions)
3. Drop constant variables
  1) near-constant columns (custom function)
  2) zero-variance columns
4. Fill the remaining missing values
  1) categorical features (custom function)
  2) numerical features (custom function)
```python
def data_processing(df, target):
    """
    df: wide table containing the label and the features
    return:
    df: the cleaned data set
    """
    # drop heavily missing columns
    df = missing_delete_var(df, threshold=0.8)
    # drop heavily missing rows
    df = missing_delete_user(df, threshold=int(df.shape[1] * 0.8))
    col_list = [x for x in df.columns if x != target]
    # drop near-constant columns
    df = const_delete(df, col_list, threshold=0.9)
    desc = df.describe().T
    # drop zero-variance features
    std_0_col = list(desc[desc['std'] == 0].index)
    if len(std_0_col) > 0:
        df = df.drop(std_0_col, axis=1)
    df.reset_index(drop=True, inplace=True)
    # compute and fill the remaining missing values
    miss_df = missing_cal(df)
    cate_col = list(df.select_dtypes(include=['O']).columns)
    num_col = [x for x in list(df.select_dtypes(include=['int64', 'float64']).columns) if x != 'label']
    # categorical features, split by missing rate
    cate_miss_col1 = [x for x in list(miss_df[miss_df.missing_pct > 0.05]['col']) if x in cate_col]
    cate_miss_col2 = [x for x in list(miss_df[miss_df.missing_pct <= 0.05]['col']) if x in cate_col]
    # numerical features, split by missing rate
    num_miss_col1 = [x for x in list(miss_df[miss_df.missing_pct > 0.05]['col']) if x in num_col]
    num_miss_col2 = [x for x in list(miss_df[miss_df.missing_pct <= 0.05]['col']) if x in num_col]
    for col in cate_miss_col1:
        df[col] = df[col].fillna('未知')  # high missing rate: explicit "unknown" level
    for col in cate_miss_col2:
        df[col] = df[col].fillna(df[col].mode()[0])  # low missing rate: mode
    for col in num_miss_col1:
        df[col] = df[col].fillna(-999)  # high missing rate: sentinel, binned separately later
    for col in num_miss_col2:
        df[col] = df[col].fillna(df[col].median())  # low missing rate: median
    return df, miss_df
```

III. Feature Binning
Binning logic:
1. Categorical features
  1) 5 or fewer categories: bin directly by category (binning_cate)
  2) more than 5 categories: reduce the cardinality first, then bin on the reduced categories
2. Numerical features
  1) discrete numerical features (values vary over a small range):
     If the feature has 5 or fewer distinct values, bin directly by distinct value (binning_cate).
     If it has more than 5, define custom bins from business meaning or the data distribution (binning_self).
  2) continuous numerical features (values vary over a wide range):
     Use chi-square binning or custom bins (binning_num, binning_self).
     Note: chi-square binning can raise errors on some features; switch those to manual custom bins.
3. Features with missing values
  1) missing rate below 5%: fill the missing values first, then bin (binning_num)
  2) missing rate above 5%: treat missing as its own category when binning (binning_sparse_col)
4. Sparse features
  Put the sparse value (usually 0) into its own bin, and apply chi-square or custom binning to the rest (binning_sparse_col).
1. Custom metric-evaluation function
- KS, precision, TPR, FPR
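The metric function itself is not shown in the post. A hypothetical reconstruction of `cal_ks` matching the call `ks, precision, tpr, fpr = cal_ks(df, col, target)` used by the binning functions; it assumes lower feature values indicate higher risk and reports precision/TPR/FPR at the KS-optimal cutoff, which is one plausible reading:

```python
import numpy as np
import pandas as pd

def cal_ks(df, col, target):
    """Rank samples by feature value, compute KS, and report precision / TPR /
    FPR measured at the cutoff that attains the KS (assumed semantics)."""
    tmp = df[[col, target]].sort_values(col)
    total_bad = tmp[target].sum()
    total_good = tmp[target].count() - total_bad
    cum_bad = tmp[target].cumsum() / total_bad                                  # TPR curve
    cum_good = (np.arange(1, len(tmp) + 1) - tmp[target].cumsum()) / total_good # FPR curve
    ks_series = (cum_bad - cum_good).abs()
    best = ks_series.values.argmax()           # position of the KS-optimal cutoff
    ks = ks_series.iloc[best]
    tpr = cum_bad.iloc[best]
    fpr = cum_good.iloc[best]
    n_flagged = best + 1                       # samples flagged as bad at that cutoff
    precision = tmp[target].iloc[:n_flagged].sum() / n_flagged
    return ks, precision, tpr, fpr
```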
2. Custom chi-square binning function
2.1 Variable split points
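`split_data` and `assign_group`, which ChiMerge uses to cap a variable at 100 distinct values, are not listed in the post. A sketch under the assumption that split points are evenly spaced order statistics and that out-of-range values clamp to the nearest split point:

```python
import pandas as pd

def split_data(df, col, split_num):
    """Take up to `split_num - 1` evenly spaced order statistics of `col` as
    candidate split values (deduplicated and sorted)."""
    values = df[col].sort_values().reset_index(drop=True)
    n = df.shape[0] // split_num
    split_index = [i * n for i in range(1, split_num)]
    return sorted(set(values.iloc[i] for i in split_index))

def assign_group(x, split_value):
    """Map x to the smallest split point >= x, clamping at both ends
    (clamping to the largest split point is a simplifying assumption)."""
    if x <= split_value[0]:
        return split_value[0]
    if x > split_value[-1]:
        return split_value[-1]
    for i in range(len(split_value) - 1):
        if split_value[i] < x <= split_value[i + 1]:
            return split_value[i + 1]
```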
2.2 Per-bin bad rate
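`bin_bad_rate` is called by ChiMerge both with and without `grantRateIndicator`; its definition is not in the post. A sketch consistent with those two return shapes (a dict of group-to-bad-rate, a regrouped DataFrame with `total`/`bad`/`bad_rate`, and optionally the overall bad rate):

```python
import pandas as pd

def bin_bad_rate(df, col, target, grantRateIndicator=0):
    """Per-group bad rate. Returns (dict_bad, regroup) or, when
    grantRateIndicator is set, (dict_bad, regroup, all_bad_rate)."""
    total = df.groupby(col)[target].count()
    bad = df.groupby(col)[target].sum()
    regroup = pd.DataFrame({'total': total, 'bad': bad}).reset_index()
    regroup['bad_rate'] = regroup['bad'] / regroup['total']
    dict_bad = dict(zip(regroup[col], regroup['bad_rate']))
    if grantRateIndicator == 0:
        return (dict_bad, regroup)
    all_bad_rate = df[target].sum() / df[target].count()  # overall bad rate
    return (dict_bad, regroup, all_bad_rate)
```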
2.3 Chi-square value
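`cal_chi2` is also referenced but not listed. A sketch that compares each group's observed bad/good counts against the counts expected under the overall bad rate — the standard chi-square statistic ChiMerge minimizes when merging adjacent intervals:

```python
import pandas as pd

def cal_chi2(regroup, all_bad_rate):
    """Chi-square of observed vs expected bad/good counts, with expectations
    taken from the overall bad rate. `regroup` needs 'total' and 'bad' columns."""
    df2 = regroup.copy()
    df2['expected_bad'] = df2['total'] * all_bad_rate
    df2['expected_good'] = df2['total'] * (1 - all_bad_rate)
    df2['good'] = df2['total'] - df2['bad']
    chi_bad = ((df2['bad'] - df2['expected_bad']) ** 2 / df2['expected_bad']).sum()
    chi_good = ((df2['good'] - df2['expected_good']) ** 2 / df2['expected_good']).sum()
    return chi_bad + chi_good
```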
2.4 Chi-square binning (the core routine)
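The ChiMerge routine below further relies on an `assign_bin` helper that maps a value to its bin given the cutoff points; its definition is not in the post. Integer bin labels are an assumption, but they satisfy the `max`/`min`/`index` comparisons ChiMerge performs on bin labels:

```python
def assign_bin(x, cutoffpoints):
    """Map x to an integer bin index given sorted cutoff points
    (right-closed intervals; integer labels are an assumption)."""
    n_cut = len(cutoffpoints)
    if x <= cutoffpoints[0]:
        return 0
    if x > cutoffpoints[-1]:
        return n_cut
    for i in range(n_cut - 1):
        if cutoffpoints[i] < x <= cutoffpoints[i + 1]:
            return i + 1
```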
```python
def ChiMerge(df, col, target, max_bin=5, min_binpct=0):
    col_unique = sorted(list(set(df[col])))  # unique values of the variable, sorted
    n = len(col_unique)                      # number of unique values
    df2 = df.copy()
    if n > 100:
        # more than 100 unique values: map x onto at most 100 split values
        split_col = split_data(df2, col, 100)
        df2['col_map'] = df2[col].map(lambda x: assign_group(x, split_col))
    else:
        df2['col_map'] = df2[col]  # 100 or fewer unique values: no mapping needed
    # per-group bad rates, regrouped counts, and the overall bad rate
    (dict_bad, regroup, all_bad_rate) = bin_bad_rate(df2, 'col_map', target, grantRateIndicator=1)
    col_map_unique = sorted(list(set(df2['col_map'])))  # deduplicated, sorted mapped values
    group_interval = [[i] for i in col_map_unique]      # start with one interval per value
    # bottom-up merging: repeat until at most max_bin intervals remain
    while len(group_interval) > max_bin:
        chi_list = []
        for i in range(len(group_interval) - 1):
            temp_group = group_interval[i] + group_interval[i + 1]  # candidate merged interval, e.g. [1, 3]
            chi_df = regroup[regroup['col_map'].isin(temp_group)]
            chi_value = cal_chi2(chi_df, all_bad_rate)  # chi-square of each pair of adjacent intervals
            chi_list.append(chi_value)
        best_combined = chi_list.index(min(chi_list))  # pair with the smallest chi-square
        # merge that pair of intervals
        group_interval[best_combined] = group_interval[best_combined] + group_interval[best_combined + 1]
        # delete the right-hand interval that was merged in
        del group_interval[best_combined + 1]
    # sort the values inside each interval
    group_interval = [sorted(i) for i in group_interval]
    # the cutoff point of each interval is its maximum value
    cutoffpoints = [max(i) for i in group_interval[:-1]]

    # check whether any bin contains only good or only bad samples
    df2['col_map_bin'] = df2['col_map'].apply(lambda x: assign_bin(x, cutoffpoints))
    (dict_bad, regroup) = bin_bad_rate(df2, 'col_map_bin', target)
    [min_bad_rate, max_bad_rate] = [min(dict_bad.values()), max(dict_bad.values())]
    # bad rate 0 means a bin has only good samples; bad rate 1 means only bad samples
    while min_bad_rate == 0 or max_bad_rate == 1:
        # bins whose bad rate is 0 or 1
        bad01_index = regroup[regroup['bad_rate'].isin([0, 1])].col_map_bin.tolist()
        bad01_bin = bad01_index[0]
        if bad01_bin == max(regroup.col_map_bin):
            cutoffpoints = cutoffpoints[:-1]  # last bin: drop the largest cutoff
        elif bad01_bin == min(regroup.col_map_bin):
            cutoffpoints = cutoffpoints[1:]   # first bin: drop the smallest cutoff
        else:
            bad01_bin_index = list(regroup.col_map_bin).index(bad01_bin)  # position of bad01_bin
            prev_bin = list(regroup.col_map_bin)[bad01_bin_index - 1]     # left neighbour
            df3 = df2[df2.col_map_bin.isin([prev_bin, bad01_bin])]
            (dict_bad, regroup1) = bin_bad_rate(df3, 'col_map_bin', target)
            chi1 = cal_chi2(regroup1, all_bad_rate)  # chi-square with the left neighbour
            later_bin = list(regroup.col_map_bin)[bad01_bin_index + 1]    # right neighbour
            df4 = df2[df2.col_map_bin.isin([later_bin, bad01_bin])]
            (dict_bad, regroup2) = bin_bad_rate(df4, 'col_map_bin', target)
            chi2 = cal_chi2(regroup2, all_bad_rate)  # chi-square with the right neighbour
            if chi1 < chi2:
                del cutoffpoints[bad01_bin_index - 1]  # merge with the left neighbour
            else:
                del cutoffpoints[bad01_bin_index]      # merge with the right neighbour
        # re-map col_map to bins and recompute the extreme bad rates;
        # the loop stops once no bin has a bad rate of 0 or 1
        df2['col_map_bin'] = df2['col_map'].apply(lambda x: assign_bin(x, cutoffpoints))
        (dict_bad, regroup) = bin_bad_rate(df2, 'col_map_bin', target)
        [min_bad_rate, max_bad_rate] = [min(dict_bad.values()), max(dict_bad.values())]

    # check the minimum bin share after binning
    if min_binpct > 0:
        group_values = df2['col_map'].apply(lambda x: assign_bin(x, cutoffpoints))
        df2['col_map_bin'] = group_values
        group_df = group_values.value_counts().to_frame('count')
        group_df['bin_pct'] = group_df['count'] / df2.shape[0]  # share of samples per bin
        min_pct = group_df.bin_pct.min()  # smallest bin share
        # merge while the smallest share is below min_binpct and more than 2 cutoffs remain
        while min_pct < min_binpct and len(cutoffpoints) > 2:
            # same merging logic as the only-good/only-bad check above
            min_pct_index = group_df[group_df.bin_pct == min_pct].index.tolist()
            min_pct_bin = min_pct_index[0]
            if min_pct_bin == max(group_df.index):
                cutoffpoints = cutoffpoints[:-1]
            elif min_pct_bin == min(group_df.index):
                cutoffpoints = cutoffpoints[1:]
            else:
                minpct_bin_index = list(group_df.index).index(min_pct_bin)
                prev_pct_bin = list(group_df.index)[minpct_bin_index - 1]
                df5 = df2[df2['col_map_bin'].isin([min_pct_bin, prev_pct_bin])]
                (dict_bad, regroup3) = bin_bad_rate(df5, 'col_map_bin', target)
                chi3 = cal_chi2(regroup3, all_bad_rate)
                later_pct_bin = list(group_df.index)[minpct_bin_index + 1]
                df6 = df2[df2['col_map_bin'].isin([min_pct_bin, later_pct_bin])]
                (dict_bad, regroup4) = bin_bad_rate(df6, 'col_map_bin', target)
                chi4 = cal_chi2(regroup4, all_bad_rate)
                if chi3 < chi4:
                    del cutoffpoints[minpct_bin_index - 1]
                else:
                    del cutoffpoints[minpct_bin_index]
            # recompute the bin shares with the updated cutoffs so the loop
            # condition is refreshed on the next pass
            group_values = df2['col_map'].apply(lambda x: assign_bin(x, cutoffpoints))
            df2['col_map_bin'] = group_values
            group_df = group_values.value_counts().to_frame('count')
            group_df['bin_pct'] = group_df['count'] / df2.shape[0]
            min_pct = group_df.bin_pct.min()
    return cutoffpoints
```

3. Custom variable-binning functions
3.1 Categorical features
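The `binning_cate` code this subsection refers to is not included in the post. A minimal sketch consistent with the other binning functions — one bin per category, with counts, bad rate, WOE, and IV; the full version would presumably also append the `cal_ks` metric columns the way `binning_self` does:

```python
import numpy as np
import pandas as pd

def binning_cate(df, col, target):
    """One bin per category: sample counts, bad rate, WOE and IV per bin
    (sketch; the original also attaches the cal_ks metrics)."""
    total = df[target].count()
    bad = df[target].sum()
    good = total - bad
    d1 = df.groupby(col)
    d2 = pd.DataFrame()
    d2['樣本數(shù)'] = d1[target].count()
    d2['黑樣本數(shù)'] = d1[target].sum()
    d2['白樣本數(shù)'] = d2['樣本數(shù)'] - d2['黑樣本數(shù)']
    d2['逾期用戶占比'] = d2['黑樣本數(shù)'] / d2['樣本數(shù)']
    d2['badattr'] = d2['黑樣本數(shù)'] / bad
    d2['goodattr'] = d2['白樣本數(shù)'] / good
    d2['WOE'] = np.log(d2['badattr'] / d2['goodattr'])
    d2['IV'] = ((d2['badattr'] - d2['goodattr']) * d2['WOE']).sum()
    bin_df = d2.drop(['badattr', 'goodattr'], axis=1).reset_index()
    bin_df.rename(columns={col: '分箱結(jié)果'}, inplace=True)
    bin_df['特征名'] = col
    return bin_df
```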
3.2 Numerical features
3.2.1 Discrete numerical features
```python
def binning_self(df, col, target, cut=None, right_border=True):
    """
    df: data set
    col: input feature
    target: name of the good/bad label column
    cut: user-defined list of bin boundaries
    right_border: right-closed (True) or right-open (False) intervals
    return:
    bin_df: evaluation result for the feature
    """
    total = df[target].count()
    bad = df[target].sum()
    good = total - bad
    bucket = pd.cut(df[col], cut, right=right_border)
    d1 = df.groupby(bucket)
    d2 = pd.DataFrame()
    d2['樣本數(shù)'] = d1[target].count()
    d2['黑樣本數(shù)'] = d1[target].sum()
    d2['白樣本數(shù)'] = d2['樣本數(shù)'] - d2['黑樣本數(shù)']
    d2['逾期用戶占比'] = d2['黑樣本數(shù)'] / d2['樣本數(shù)']
    d2['badattr'] = d2['黑樣本數(shù)'] / bad
    d2['goodattr'] = d2['白樣本數(shù)'] / good
    d2['WOE'] = np.log(d2['badattr'] / d2['goodattr'])
    d2['bin_iv'] = (d2['badattr'] - d2['goodattr']) * d2['WOE']
    d2['IV'] = d2['bin_iv'].sum()
    bin_df = d2.reset_index()
    bin_df.drop(['badattr', 'goodattr', 'bin_iv'], axis=1, inplace=True)
    bin_df.rename(columns={col: '分箱結(jié)果'}, inplace=True)
    bin_df['特征名'] = col
    bin_df = pd.concat([bin_df['特征名'], bin_df.iloc[:, :-1]], axis=1)
    ks, precision, tpr, fpr = cal_ks(df, col, target)
    bin_df['準確率'] = precision
    bin_df['召回率'] = tpr
    bin_df['打擾率'] = fpr
    bin_df['KS'] = ks
    return bin_df
```

3.2.2 Continuous numerical features
```python
def binning_num(df, target, col, max_bin=None, min_binpct=None):
    """
    df: data set
    col: input feature
    target: name of the good/bad label column
    max_bin: maximum number of bins
    min_binpct: minimum share of the total sample per bin
    return:
    bin_df: evaluation result for the feature
    """
    total = df[target].count()
    bad = df[target].sum()
    good = total - bad
    inf = float('inf')
    ninf = float('-inf')
    # chi-square binning gives the interior cutoffs; pad with +/- infinity
    cut = ChiMerge(df, col, target, max_bin=max_bin, min_binpct=min_binpct)
    cut.insert(0, ninf)
    cut.append(inf)
    bucket = pd.cut(df[col], cut)
    d1 = df.groupby(bucket)
    d2 = pd.DataFrame()
    d2['樣本數(shù)'] = d1[target].count()
    d2['黑樣本數(shù)'] = d1[target].sum()
    d2['白樣本數(shù)'] = d2['樣本數(shù)'] - d2['黑樣本數(shù)']
    d2['逾期用戶占比'] = d2['黑樣本數(shù)'] / d2['樣本數(shù)']
    d2['badattr'] = d2['黑樣本數(shù)'] / bad
    d2['goodattr'] = d2['白樣本數(shù)'] / good
    d2['WOE'] = np.log(d2['badattr'] / d2['goodattr'])
    d2['bin_iv'] = (d2['badattr'] - d2['goodattr']) * d2['WOE']
    d2['IV'] = d2['bin_iv'].sum()
    bin_df = d2.reset_index()
    bin_df.drop(['badattr', 'goodattr', 'bin_iv'], axis=1, inplace=True)
    bin_df.rename(columns={col: '分箱結(jié)果'}, inplace=True)
    bin_df['特征名'] = col
    bin_df = pd.concat([bin_df['特征名'], bin_df.iloc[:, :-1]], axis=1)
    ks, precision, tpr, fpr = cal_ks(df, col, target)
    bin_df['準確率'] = precision
    bin_df['召回率'] = tpr
    bin_df['打擾率'] = fpr
    bin_df['KS'] = ks
    return bin_df
```

3.3 Sparse-feature binning
```python
def binning_sparse_col(df, target, col, max_bin=None, min_binpct=None, sparse_value=None):
    """
    df: data set
    col: input feature
    target: name of the good/bad label column
    max_bin: maximum number of bins
    min_binpct: minimum share of the total sample per bin
    sparse_value: the value that gets its own bin
    return:
    bin_df: evaluation result for the feature
    """
    total = df[target].count()
    bad = df[target].sum()
    good = total - bad
    # put the sparse value (0 or the missing sentinel) into its own bin
    temp1 = df[df[col] == sparse_value]
    temp2 = df[~(df[col] == sparse_value)]
    bucket_sparse = pd.cut(temp1[col], [float('-inf'), sparse_value])
    group1 = temp1.groupby(bucket_sparse)
    bin_df1 = pd.DataFrame()
    bin_df1['樣本數(shù)'] = group1[target].count()
    bin_df1['黑樣本數(shù)'] = group1[target].sum()
    bin_df1['白樣本數(shù)'] = bin_df1['樣本數(shù)'] - bin_df1['黑樣本數(shù)']
    bin_df1['逾期用戶占比'] = bin_df1['黑樣本數(shù)'] / bin_df1['樣本數(shù)']
    bin_df1['badattr'] = bin_df1['黑樣本數(shù)'] / bad
    bin_df1['goodattr'] = bin_df1['白樣本數(shù)'] / good
    bin_df1['WOE'] = np.log(bin_df1['badattr'] / bin_df1['goodattr'])
    bin_df1['bin_iv'] = (bin_df1['badattr'] - bin_df1['goodattr']) * bin_df1['WOE']
    bin_df1 = bin_df1.reset_index()
    # chi-square binning on the remaining values
    cut = ChiMerge(temp2, col, target, max_bin=max_bin, min_binpct=min_binpct)
    cut.insert(0, sparse_value)
    cut.append(float('inf'))
    bucket = pd.cut(temp2[col], cut)
    group2 = temp2.groupby(bucket)
    bin_df2 = pd.DataFrame()
    bin_df2['樣本數(shù)'] = group2[target].count()
    bin_df2['黑樣本數(shù)'] = group2[target].sum()
    bin_df2['白樣本數(shù)'] = bin_df2['樣本數(shù)'] - bin_df2['黑樣本數(shù)']
    bin_df2['逾期用戶占比'] = bin_df2['黑樣本數(shù)'] / bin_df2['樣本數(shù)']
    bin_df2['badattr'] = bin_df2['黑樣本數(shù)'] / bad
    bin_df2['goodattr'] = bin_df2['白樣本數(shù)'] / good
    bin_df2['WOE'] = np.log(bin_df2['badattr'] / bin_df2['goodattr'])
    bin_df2['bin_iv'] = (bin_df2['badattr'] - bin_df2['goodattr']) * bin_df2['WOE']
    bin_df2 = bin_df2.reset_index()
    # merge the sparse bin with the chi-square bins
    bin_df = pd.concat([bin_df1, bin_df2], axis=0)
    bin_df['IV'] = bin_df['bin_iv'].sum().round(3)
    bin_df.drop(['badattr', 'goodattr', 'bin_iv'], axis=1, inplace=True)
    bin_df.rename(columns={col: '分箱結(jié)果'}, inplace=True)
    bin_df['特征名'] = col
    bin_df = pd.concat([bin_df['特征名'], bin_df.iloc[:, :-1]], axis=1)
    ks, precision, tpr, fpr = cal_ks(df, col, target)
    bin_df['準確率'] = precision
    bin_df['召回率'] = tpr
    bin_df['打擾率'] = fpr
    bin_df['KS'] = ks
    return bin_df
```

IV. Custom get_feature_result function that runs the full feature-evaluation pipeline:
1. Data preprocessing, via the data_processing function
2. Variable binning
  1) categorical variables
  2) numerical variables
  3) variables on which chi-square binning raised an error
3. Collect the binning results in feature_result together with their evaluation metrics
  order_col = ['特征名', '分箱結(jié)果', '樣本數(shù)', '黑樣本數(shù)', '白樣本數(shù)', '逾期用戶占比', 'WOE', 'IV', '準確率', '召回率', '打擾率', 'KS']
```python
def get_feature_result(df, target):
    """
    df: wide table containing the features and the label
    target: name of the good/bad label column
    return:
    feature_result: evaluation result for every feature
    """
    if target not in df.columns:
        print('Join the good/bad label (column name "label") onto the feature file, then rerun!')
    else:
        print('data cleaning started')
        df, miss_df = data_processing(df, target)
        print('data cleaning finished')
        cate_col = list(df.select_dtypes(include=['O']).columns)
        num_col = [x for x in list(df.select_dtypes(include=['int64', 'float64']).columns) if x != 'label']
        # categorical variable binning
        bin_cate_list = []
        for col in cate_col:
            bin_cate = binning_cate(df, col, target)
            bin_cate['rank'] = list(range(1, bin_cate.shape[0] + 1, 1))
            bin_cate_list.append(bin_cate)
        # numerical feature binning, split by missing rate
        num_col1 = [x for x in list(miss_df[miss_df.missing_pct > 0.05]['col']) if x in num_col]
        num_col2 = [x for x in list(miss_df[miss_df.missing_pct <= 0.05]['col']) if x in num_col]
        print('feature binning started')
        bin_num_list1 = []
        err_col1 = []
        for col in tqdm(num_col1):
            try:
                # high missing rate: the -999 sentinel gets its own bin
                bin_df1 = binning_sparse_col(df, 'label', col, min_binpct=0.05, max_bin=4, sparse_value=-999)
                bin_df1['rank'] = list(range(1, bin_df1.shape[0] + 1, 1))
                bin_num_list1.append(bin_df1)
            except (IndexError, ZeroDivisionError):
                err_col1.append(col)
                continue
        bin_num_list2 = []
        err_col2 = []
        for col in tqdm(num_col2):
            try:
                bin_df2 = binning_num(df, 'label', col, min_binpct=0.05, max_bin=5)
                bin_df2['rank'] = list(range(1, bin_df2.shape[0] + 1, 1))
                bin_num_list2.append(bin_df2)
            except (IndexError, ZeroDivisionError):
                err_col2.append(col)
                continue
        # features where chi-square binning raised an error: fall back to quartile bins
        err_col = err_col1 + err_col2
        bin_num_list3 = []
        if len(err_col) > 0:
            for col in tqdm(err_col):
                ninf = float('-inf')
                inf = float('inf')
                q_25 = df[col].quantile(0.25)
                q_50 = df[col].quantile(0.5)
                q_75 = df[col].quantile(0.75)
                cut = list(sorted(set([ninf, q_25, q_50, q_75, inf])))
                bin_df3 = binning_self(df, col, target, cut=cut, right_border=True)
                bin_df3['rank'] = list(range(1, bin_df3.shape[0] + 1, 1))
                bin_num_list3.append(bin_df3)
        print('feature binning finished')
        bin_all_list = bin_num_list1 + bin_num_list2 + bin_num_list3 + bin_cate_list
        feature_result = pd.concat(bin_all_list, axis=0)
        feature_result = feature_result.sort_values(['IV', 'rank'], ascending=[False, True])
        feature_result = feature_result.drop(['rank'], axis=1)
        order_col = ['特征名', '分箱結(jié)果', '樣本數(shù)', '黑樣本數(shù)', '白樣本數(shù)', '逾期用戶占比',
                     'WOE', 'IV', '準確率', '召回率', '打擾率', 'KS']
        feature_result = feature_result[order_col]
        return feature_result
```

V. Load the data and run the function for automated feature evaluation