Handling Class Imbalance in Machine Learning: Undersampling

Published: 2024/8/26 · 生活家

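Class imbalance refers to classification tasks in which the number of training examples differs greatly between classes.

Three approaches are commonly used: 1. undersampling, 2. oversampling, and 3. threshold moving.

In the project I have been working on over the past few days, fewer than 4% of the samples have a positive target and the dataset is large enough, so I used undersampling:

Undersampling removes some negative examples so that the numbers of positive and negative examples are close, and then trains on the reduced data. The basic algorithm is as follows: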

from sklearn.utils import shuffle

def undersampling(train, desired_apriori):
    # Get the indices per target value
    idx_0 = train[train.target == 0].index
    idx_1 = train[train.target == 1].index
    # Get the original number of records per target value
    nb_0 = len(idx_0)
    nb_1 = len(idx_1)
    # Calculate the undersampling rate and the resulting number of records with target=0
    undersampling_rate = ((1 - desired_apriori) * nb_1) / (nb_0 * desired_apriori)
    undersampled_nb_0 = int(undersampling_rate * nb_0)
    print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
    print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))
    # Randomly select records with target=0 to reach the desired a priori
    undersampled_idx = shuffle(idx_0, n_samples=undersampled_nb_0)
    # Combine the sampled target=0 indices with all target=1 indices
    idx_list = list(undersampled_idx) + list(idx_1)
    # Return the undersampled data frame with a fresh index
    train = train.loc[idx_list].reset_index(drop=True)

    return train

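Because this was written for a specific project, the class being undersampled is the negative class (target == 0); to use it elsewhere you will need to make some changes.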
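To make the undersampling-rate formula concrete, here is a small worked example in plain Python. The helper name `undersampling_plan` and the record counts are hypothetical, chosen only to resemble a dataset with about 4% positives; the arithmetic mirrors the formula used in `undersampling`.

```python
def undersampling_plan(nb_0, nb_1, desired_apriori):
    """Return (rate, kept negatives, achieved positive fraction).

    Hypothetical helper: it repeats the arithmetic of the formula above
    without touching any data.
    """
    # Fraction of target=0 records to keep so that positives make up
    # `desired_apriori` of the resampled data
    rate = ((1 - desired_apriori) * nb_1) / (nb_0 * desired_apriori)
    kept_0 = int(rate * nb_0)
    # Positive fraction actually obtained after truncating to an integer
    achieved = nb_1 / (nb_1 + kept_0)
    return rate, kept_0, achieved


# Assumed counts: 96,000 negatives and 4,000 positives (4% positive),
# with a desired positive prior of 10%.
rate, kept_0, achieved = undersampling_plan(nb_0=96_000, nb_1=4_000,
                                            desired_apriori=0.10)
print('rate={:.3f}, kept negatives={}, positive share={:.3f}'.format(
    rate, kept_0, achieved))
```

Note that a rate above 1 means the data already has a higher positive share than requested, in which case there are not enough negatives to sample without replacement.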

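If undersampling discards negative examples at random, some important information may be lost. To address this, Zhou Zhihua's lab proposed the undersampling algorithm EasyEnsemble: using an ensemble learning mechanism, the negative examples are divided into several sets for different learners to use. Each learner sees undersampled data, but globally no important information is lost. In fact, this method only requires a small modification to the basic undersampling method: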

def easyensemble(df, desired_apriori, n_subsets=10):
    train_resample = []
    for _ in range(n_subsets):
        sel_train = undersampling(df, desired_apriori)
        train_resample.append(sel_train)
    return train_resample
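The resampled subsets only become an ensemble once a learner is trained on each one and the predictions are combined. The sketch below shows one way to do that. It is self-contained, so it draws its own balanced subsets instead of calling the functions above; the helper names, the synthetic data, and the choice of `LogisticRegression` as the base learner are all assumptions made for illustration (the original paper trains an AdaBoost ensemble on each subset).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle


def balanced_subsets(df, n_subsets=5, seed=0):
    """Yield subsets with all positives plus an equal-sized draw of negatives."""
    pos = df[df.target == 1]
    neg = df[df.target == 0]
    for i in range(n_subsets):
        # A different random_state per subset gives a different draw of negatives
        neg_sample = shuffle(neg, n_samples=len(pos), random_state=seed + i)
        yield pd.concat([pos, neg_sample], ignore_index=True)


def fit_ensemble(df, feature_cols, n_subsets=5):
    """Train one base learner per balanced subset."""
    return [LogisticRegression().fit(s[feature_cols], s.target)
            for s in balanced_subsets(df, n_subsets)]


def predict_proba_ensemble(models, X):
    """Average the positive-class probabilities of all base learners."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)


# Synthetic data with roughly 4% positives (illustrative only)
rng = np.random.default_rng(0)
n = 5000
y = (rng.random(n) < 0.04).astype(int)
X = rng.normal(size=(n, 2)) + y[:, None] * 1.5
df = pd.DataFrame(X, columns=['f1', 'f2'])
df['target'] = y

models = fit_ensemble(df, ['f1', 'f2'], n_subsets=5)
scores = predict_proba_ensemble(models, df[['f1', 'f2']])
print(len(models), scores.shape)
```

Each learner sees every positive example but only a small slice of the negatives, while the ensemble as a whole draws on many more negatives than a single undersampled model would.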

For the details, see the algorithm description in the original paper, Exploratory Undersampling for Class-Imbalance Learning.

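PS: When running cross-validation (CV) on class-imbalanced data, keep in mind that the target distribution is highly skewed. If you use the sklearn functions for cross-validation, it is recommended to use stratified sampling as implemented in StratifiedKFold and StratifiedShuffleSplit, which ensures that the relative class frequencies are approximately preserved in each training and validation fold.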
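As a minimal sketch of the stratified splitting described above, the example below runs `StratifiedKFold` on a synthetic label vector with 4% positives and records the positive rate in each training and validation fold. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 40 positives out of 1000 samples (4% positive)
rng = np.random.default_rng(0)
y = np.zeros(1000, dtype=int)
y[:40] = 1
X = rng.normal(size=(1000, 3))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = []
for train_idx, valid_idx in skf.split(X, y):
    # Positive rate in the training part and the validation part of this fold
    train_rate = y[train_idx].mean()
    valid_rate = y[valid_idx].mean()
    fold_rates.append((train_rate, valid_rate))
    print('train positive rate={:.3f}, validation positive rate={:.3f}'.format(
        train_rate, valid_rate))
```

If undersampling is combined with cross-validation, a common choice is to resample only the training indices of each fold, so that the validation fold keeps the original class distribution.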

Reference:

Zhou Zhihua. 《机器学习》 (Machine Learning).
https://www.kaggle.com/bertcarremans/data-preparation-exploration
http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.ensemble.BalanceCascade.html
