
A Detailed Guide to Chi-Square Binning and Its Applications

Published: 2025/3/21


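I have recently been studying the scorecard modeling workflow. Feature processing there relies on binning, a basic and widely used technique, so this article gives a detailed introduction to one binning method: chi-square binning.
Binning means discretizing continuous data. For example, the variable age can be binned into 0-18, 18-30, 30-45 and 45-60. This is a common operation when building a scorecard. First consider a question: why bin at all? Could we model directly on the raw age variable? We could. The point is that a scorecard needs strong business interpretability, and whether binning is needed depends on the modeling algorithm. If you use machine learning algorithms such as xgb or lgb, the model is no longer interpretable in that sense anyway, so skipping binning is fine.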
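As a minimal sketch of the age example above, the binning can be done with `pandas.cut`. The sample ages are made up for illustration, and treating each bin as closed on the left (so 18 falls into 18-30) is an assumption, since the article does not say which side of a boundary is inclusive.

```python
import pandas as pd

# Hypothetical ages, made up for illustration.
ages = pd.Series([5, 17, 18, 25, 30, 44, 59])

# Bin edges follow the article's example: 0-18, 18-30, 30-45, 45-60.
# right=False makes each bin closed on the left, e.g. [18, 30).
edges = [0, 18, 30, 45, 60]
labels = ["0-18", "18-30", "30-45", "45-60"]
age_bin = pd.cut(ages, bins=edges, labels=labels, right=False)

print(age_bin.tolist())
```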
The main benefits of binning are:

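Binned features are more robust to anomalous data. For example, if age contains an outlier of 300, binning will likely place it in the >80 bin, whereas feeding the raw value into the model could distort it badly.

After discretization, each bin of a variable gets its own weight, which introduces non-linearity into a logistic regression model and improves its expressive power and fit.

Discretization also simplifies the logistic regression model and reduces the risk of overfitting.

Missing values can be fed into the model as a category of their own.

Inner products of sparse vectors are fast to compute, and the results are easy to store and easy to extend.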
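The outlier and missing-value points above can be illustrated with a short sketch. The ages, the bin edges above 60 and the label "missing" are assumptions made for the example, not values from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one outlier (300) and one missing value.
ages = pd.Series([22, 35, 300, np.nan, 67])

# Assumed edges; the last bin is open-ended so any extreme value lands in ">80".
edges = [0, 18, 30, 45, 60, 80, np.inf]
labels = ["0-18", "18-30", "30-45", "45-60", "60-80", ">80"]
age_bin = pd.cut(ages, bins=edges, labels=labels, right=False)

# Treat missing values as their own category instead of dropping them.
age_bin = age_bin.cat.add_categories(["missing"]).fillna("missing")

# One-hot encode: each bin becomes a separate column, so a linear model
# such as logistic regression learns one weight per bin.
dummies = pd.get_dummies(age_bin, prefix="age")

print(age_bin.tolist())
print(dummies.columns.tolist())
```

The outlier 300 ends up in the same bin as any other age above 80, so it can no longer dominate a fitted coefficient.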

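Now for chi-square binning itself. It requires understanding the chi-square test first, because chi-square binning is a binning method built on the chi-square test; specifically, it uses the test of independence.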

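Chi-square test

The chi-square test is a method for analyzing the frequencies of categorical data. It has two main applications: the goodness-of-fit test and the test of independence (contingency analysis).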

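Goodness-of-fit test
The goodness-of-fit test examines a single categorical variable: from the overall distribution, compute the expected frequency of each category, compare it with the observed frequency, and judge whether the two differ significantly. For example, with the Titanic data we can check whether survival is related to sex, which can be read as asking whether an X is associated with Y.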
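A goodness-of-fit test can be sketched with `scipy.stats.chisquare`. The observed counts and the assumed population shares below are invented for illustration only.

```python
from scipy.stats import chisquare

# Hypothetical observed counts for one categorical variable with three
# categories (made up for illustration).
observed = [50, 30, 20]
# Expected counts if the population shares were 40% / 40% / 20%.
# The observed and expected totals must match (both are 100 here).
expected = [40, 40, 20]

# Degrees of freedom are (number of categories - 1) = 2 by default.
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)
```

A small p-value would suggest the observed frequencies do not follow the assumed distribution.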
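Test of independence
The test of independence works on two feature variables: it analyzes whether two categorical variables are independent or associated. For example, whether the quality of a raw material depends on its place of origin; this can be read as asking whether one X is independent of another X.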
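For the origin versus quality example, a test of independence might look like the sketch below, using `scipy.stats.chi2_contingency`. The contingency table is hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table, made up for illustration.
# Rows: origin A / origin B; columns: quality grade 1 / grade 2.
table = np.array([[30, 20],
                  [20, 30]])

# correction=False disables the Yates continuity correction that SciPy
# applies by default to 2 x 2 tables, so the result is the plain
# Pearson chi-square statistic.
stat, p_value, dof, expected = chi2_contingency(table, correction=False)
print(stat, p_value, dof)
print(expected)
```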

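Steps of the chi-square test

The chi-square test is a hypothesis test and follows the same procedure as other common hypothesis tests.

State the hypothesis, for example that the two variables are independent
Compute the expected frequencies from the observed frequencies of the categories
Use the chi-square formula to compute the chi-square statistic from the observed and expected frequencies
Using the degrees of freedom and a significance level fixed in advance, look up the critical value in a chi-square distribution table and compare it with the statistic from the previous step
Conclude whether to reject the null hypothesis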
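The steps above can be sketched by hand with NumPy, using `scipy.stats.chi2.ppf` in place of a printed distribution table. The 2 x 3 table and the significance level of 0.05 are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import chi2

# Null hypothesis: the row variable and the column variable are independent.
# Hypothetical observed counts (made up for illustration).
observed = np.array([[10, 20, 30],
                     [20, 20, 20]], dtype=float)

# Expected counts under independence: row total * column total / grand total.
row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected = row_tot * col_tot / observed.sum()

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
stat = ((observed - expected) ** 2 / expected).sum()

# Critical value for the chosen significance level and degrees of freedom.
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
alpha = 0.05
critical = chi2.ppf(1 - alpha, dof)

# Reject the null hypothesis only if the statistic exceeds the critical value.
reject = stat > critical
print(stat, critical, reject)
```

With these numbers the statistic stays below the critical value, so independence is not rejected.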

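Chi-square binning in scorecards

Below, the age variable is used as an example of how to apply chi-square binning during scorecard modeling. A concrete example comes first, then the theory.

First, sort age in ascending order and put each distinct age value in its own bin. Count the defaults and non-defaults in each bin. Then merge as follows: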

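If there are four bins 1, 2, 3, 4, pair up adjacent bins, giving three pairs: 12, 23, 34. Then compute the chi-square value of each of the three pairs.

Find the smallest of these chi-square values and merge the two bins in that pair. For example, if 23 has the smallest chi-square value, merge 2 and 3, so after this round the bins are 1, 23, 4.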
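One round of the merge described above could be sketched as follows. The helper `pair_chi2` and the default / non-default counts are hypothetical and only serve to reproduce the 1, 23, 4 outcome of the example.

```python
import numpy as np


def pair_chi2(a, b):
    """Pearson chi-square of the 2 x k table formed by two adjacent bins."""
    obs = np.array([a, b], dtype=float)
    row = obs.sum(axis=1, keepdims=True)
    col = obs.sum(axis=0, keepdims=True)
    exp = row * col / obs.sum()
    # Cells with a zero expected count are skipped to avoid dividing by zero.
    mask = exp > 0
    return float((((obs - exp) ** 2)[mask] / exp[mask]).sum())


# Hypothetical [default, non-default] counts for bins 1, 2, 3, 4.
bins = [[5, 45], [10, 40], [11, 39], [30, 20]]

# Chi-square value of each adjacent pair: 12, 23, 34.
pair_values = [pair_chi2(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]

# Merge the pair with the smallest chi-square value.
i_min = int(np.argmin(pair_values))
merged = [x + y for x, y in zip(bins[i_min], bins[i_min + 1])]
bins = bins[:i_min] + [merged] + bins[i_min + 2:]

print(pair_values)
print(bins)
```

Bins 2 and 3 have almost the same default rate, so their pair has by far the smallest chi-square value and is the one merged.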

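The rationale behind the binning: if two adjacent intervals have very similar class distributions, they can be merged; otherwise they should stay separate. A low chi-square value indicates similar class distributions.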

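The core claim, that a smaller chi-square value means more similar distributions, can be checked with a simple derivation. Its conclusion: if the two bins to be merged have exactly the same class distribution, the chi-square value of the pair is 0.

The statistic behind chi-square binning is the Pearson chi-square of the table formed by two adjacent bins and the target classes: the sum, over every cell, of the squared difference between the observed and expected counts divided by the expected count.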
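Written out, the statistic and the zero case look as follows. This is a sketch of the standard Pearson formula applied to two adjacent bins; the symbols A_ij, R_i, C_j, N, E_ij and p_j are notation introduced here for illustration, not taken from the original article.

```latex
% Two adjacent bins i = 1, 2 and target classes j = 1, ..., k
% A_{ij}: observed count of class j in bin i
R_i = \sum_{j=1}^{k} A_{ij}, \qquad
C_j = \sum_{i=1}^{2} A_{ij}, \qquad
N = \sum_{i=1}^{2} R_i, \qquad
E_{ij} = \frac{R_i C_j}{N}

\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}

% Identical class distributions: A_{1j}/R_1 = A_{2j}/R_2 = p_j for every j
C_j = p_j R_1 + p_j R_2 = p_j N
\;\Longrightarrow\;
E_{ij} = \frac{R_i \, p_j N}{N} = R_i p_j = A_{ij}
\;\Longrightarrow\;
\chi^2 = 0
```

The further the two class distributions drift apart, the larger the gaps between A_ij and E_ij and therefore the larger the statistic.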

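The steps above are only what is computed in each round. Without a stopping condition, the algorithm keeps merging until only one bin is left. In practice we set stopping conditions:

A chi-square threshold for stopping
A limit on the number of bins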
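A full merge loop with both stopping conditions might look like the sketch below. The function names, the sample counts, and the choice to stop as soon as either condition is met are assumptions; other implementations combine the two rules differently.

```python
import numpy as np
from scipy.stats import chi2


def pair_chi2(a, b):
    """Pearson chi-square of the 2 x k table formed by two adjacent bins."""
    obs = np.array([a, b], dtype=float)
    row = obs.sum(axis=1, keepdims=True)
    col = obs.sum(axis=0, keepdims=True)
    exp = row * col / obs.sum()
    mask = exp > 0  # skip cells with zero expected count
    return float((((obs - exp) ** 2)[mask] / exp[mask]).sum())


def chi_merge(counts, max_bins=5, confidence=0.95):
    """Merge adjacent bins until one of the two stopping conditions holds."""
    bins = [list(c) for c in counts]
    # Degrees of freedom: number of target classes minus one.
    dof = len(bins[0]) - 1
    threshold = chi2.ppf(confidence, dof)
    # Stopping condition 2: the number of bins is within the limit.
    while len(bins) > max_bins:
        values = [pair_chi2(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i_min = int(np.argmin(values))
        # Stopping condition 1: even the most similar pair differs significantly.
        if values[i_min] >= threshold:
            break
        merged = [x + y for x, y in zip(bins[i_min], bins[i_min + 1])]
        bins[i_min:i_min + 2] = [merged]
    return bins


# Hypothetical [default, non-default] counts per initial bin.
counts = [[5, 45], [10, 40], [11, 39], [30, 20]]
print(chi_merge(counts, max_bins=3))
print(chi_merge(counts, max_bins=1))
```

In the second call the bin limit alone would force everything into one bin, but the threshold stops the loop at two bins because the remaining pair is significantly different.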

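Empirically, the stopping threshold is the chi-square critical value at a confidence level of 0.90, 0.95 or 0.99 for the chosen degrees of freedom (4, for example), and the maximum number of bins is usually set to 5. The degrees of freedom in chi-square binning are the number of classes of the categorical variable minus one; with a binary default / non-default target that formula gives 1.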
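The thresholds mentioned above can be computed with `scipy.stats.chi2.ppf` instead of being read from a table. The sketch below lists them for 4 degrees of freedom, as in the text, and for 1 degree of freedom, the binary-target case; the variable names are illustrative.

```python
from scipy.stats import chi2

confidences = [0.90, 0.95, 0.99]

# Critical value for each confidence level, keyed by degrees of freedom.
thresholds = {
    dof: [round(float(chi2.ppf(c, dof)), 3) for c in confidences]
    for dof in (1, 4)
}

for dof, values in thresholds.items():
    print(dof, dict(zip(confidences, values)))
```

`chi2.ppf(0.95, 4)` and `chi2.isf(0.05, 4)` return the same number; the first is parameterized by confidence level, the second by significance level.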

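Below is an implementation of chi-square binning. Reading it carefully helps both your coding skills and your understanding of chi-square binning. Be sure to read it in one sitting, because you will forget it as soon as you finish.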

## Hand-written chi-square optimal binning
import copy

import numpy as np
import pandas as pd
from scipy.stats import chi2


def get_chi2(X, col):
    '''Compute a chi-square statistic for every distinct value of column col.'''
    # Overall default rate, used as the expected rate for every value
    pos_cnt = X['Defaulter'].sum()
    all_cnt = X['Defaulter'].count()
    expected_ratio = float(pos_cnt) / all_cnt
    # Sort the distinct values of the variable in ascending order
    df = X[[col, 'Defaulter']]
    df = df.dropna()
    col_value = list(set(df[col]))
    col_value.sort()
    # Chi-square statistic of each interval (one interval per distinct value).
    # Note: only the default (positive) count is compared with its expected
    # count; this is not the pairwise statistic of two adjacent bins.
    chi_list = []
    pos_list = []
    expected_pos_list = []
    for value in col_value:
        df_pos_cnt = df.loc[df[col] == value, 'Defaulter'].sum()
        df_all_cnt = df.loc[df[col] == value, 'Defaulter'].count()
        expected_pos_cnt = df_all_cnt * expected_ratio
        chi_square = (df_pos_cnt - expected_pos_cnt)**2 / expected_pos_cnt
        chi_list.append(chi_square)
        pos_list.append(df_pos_cnt)
        expected_pos_list.append(expected_pos_cnt)
    # Collect the results in a DataFrame
    chi_result = pd.DataFrame({col: col_value, 'chi_square': chi_list,
                               'pos_cnt': pos_list,
                               'expected_pos_cnt': expected_pos_list})
    return chi_result


def merge_chiSquare(chi_result, index, mergeIndex, a='expected_pos_cnt',
                    b='pos_cnt', c='chi_square'):
    '''Merge row index into row mergeIndex and recompute the chi-square value.
    mergeIndex is the row that remains after the merge.'''
    chi_result.loc[mergeIndex, a] = chi_result.loc[mergeIndex, a] + chi_result.loc[index, a]
    chi_result.loc[mergeIndex, b] = chi_result.loc[mergeIndex, b] + chi_result.loc[index, b]
    ## Chi-square value of the merged interval, from the summed counts
    chi_result.loc[mergeIndex, c] = (chi_result.loc[mergeIndex, b] - chi_result.loc[mergeIndex, a])**2 / chi_result.loc[mergeIndex, a]
    chi_result = chi_result.drop([index])
    ## Reset the index so that row labels stay consecutive
    chi_result = chi_result.reset_index(drop=True)
    return chi_result


def chiMerge(chi_result, maxInterval=5):
    '''Merge intervals until at most maxInterval remain.'''
    group_cnt = len(chi_result)
    # While there are more intervals than allowed, keep merging
    while(group_cnt > maxInterval):
        ## Locate the interval with the smallest chi-square value
        min_index = chi_result[chi_result['chi_square'] == chi_result['chi_square'].min()].index.tolist()[0]
        # First interval: merge with the next one
        if min_index == 0:
            chi_result = merge_chiSquare(chi_result, min_index+1, min_index)
        # Last interval: merge with the previous one
        elif min_index == group_cnt-1:
            chi_result = merge_chiSquare(chi_result, min_index-1, min_index)
        # Middle interval: merge with the neighbour whose chi-square value is smaller
        else:
            if chi_result.loc[min_index-1, 'chi_square'] > chi_result.loc[min_index+1, 'chi_square']:
                chi_result = merge_chiSquare(chi_result, min_index, min_index+1)
            else:
                chi_result = merge_chiSquare(chi_result, min_index-1, min_index)
        group_cnt = len(chi_result)
    return chi_result


def cal_chisqure_threshold(dfree=4, cf=0.1):
    '''Return the chi-square threshold for the given degrees of freedom and significance level.'''
    percents = [0.95, 0.90, 0.5, 0.1, 0.05, 0.025, 0.01, 0.005]
    ## Threshold table: rows are degrees of freedom 1..29, columns are significance levels
    df = pd.DataFrame(np.array([chi2.isf(percents, df=i) for i in range(1, 30)]))
    df.columns = percents
    df.index = df.index+1
    return df.loc[dfree, cf]


def chiMerge_chisqure(chi_result, dfree=4, cf=0.1, maxInterval=5):
    threshold = cal_chisqure_threshold(dfree, cf)
    min_chiSquare = chi_result['chi_square'].min()
    group_cnt = len(chi_result)
    # Keep merging while the smallest chi-square value is below the threshold
    # and the number of intervals still exceeds maxInterval
    while(min_chiSquare < threshold and group_cnt > maxInterval):
        min_index = chi_result[chi_result['chi_square'] == chi_result['chi_square'].min()].index.tolist()[0]
        # First interval: merge with the next one
        if min_index == 0:
            chi_result = merge_chiSquare(chi_result, min_index+1, min_index)
        # Last interval: merge with the previous one
        elif min_index == group_cnt-1:
            chi_result = merge_chiSquare(chi_result, min_index-1, min_index)
        # Middle interval: merge with the neighbour whose chi-square value is smaller
        else:
            if chi_result.loc[min_index-1, 'chi_square'] > chi_result.loc[min_index+1, 'chi_square']:
                chi_result = merge_chiSquare(chi_result, min_index, min_index+1)
            else:
                chi_result = merge_chiSquare(chi_result, min_index-1, min_index)
        min_chiSquare = chi_result['chi_square'].min()
        group_cnt = len(chi_result)
    return chi_result


# `train` and `train_X` are the author's DataFrames; they are not defined in this article.
chi_train_X = copy.deepcopy(train_X)
## Apply chi-square binning to every column, using the threshold for the given degrees of freedom
chi_result_all = dict()
for col in chi_train_X.columns:
    print("start get " + col + " chi2 result")
    chi2_result = get_chi2(train, col)
    chi2_merge = chiMerge_chisqure(chi2_result, dfree=4, cf=0.05, maxInterval=5)
    chi_result_all[col] = chi2_merge

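Author: Labryant
Original WeChat official account: 風控獵人
About: Strategy analyst at a startup, driven and working to improve. Nothing is settled yet; any of us could be the dark horse.
Reposting: please credit the source when reposting. Thank you!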
