Class Imbalance: Undersampling

阿新 • Published: 2018-05-22

Class imbalance refers to classification tasks in which the number of training examples differs greatly between classes.

There are three common ways to handle it: 1. undersampling, 2. oversampling, and 3. threshold moving.
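Of the three, threshold moving is the only one that leaves the training data untouched. A minimal sketch, assuming a classifier that outputs a probability y for the positive class: instead of predicting positive when y > 0.5, predict positive when y / (1 - y) > m_pos / m_neg, which is the same as y > m_pos / (m_pos + m_neg). The function names below are hypothetical and used only for illustration.

```python
def moved_threshold(n_pos, n_neg):
    """Probability cutoff equivalent to the rule y / (1 - y) > n_pos / n_neg."""
    return n_pos / (n_pos + n_neg)


def predict_positive(prob, n_pos, n_neg):
    """Return True when the predicted probability clears the moved threshold."""
    return prob > moved_threshold(n_pos, n_neg)


# With 4 positives for every 96 negatives, the cutoff drops from 0.5 to about 0.04.
print(moved_threshold(4, 96))
print(predict_positive(0.10, 4, 96), predict_positive(0.03, 4, 96))
```

The cutoff equals the share of positives in the training set, so a rarer positive class gives a lower bar for a positive prediction.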

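In the project I have been working on for the past few days, fewer than 4% of the records have a positive target, and the dataset is large enough, so I used undersampling: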

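Undersampling removes some negative examples so that the numbers of positive and negative examples are close, and then learning proceeds on the reduced set. The basic algorithm is as follows: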

from sklearn.utils import shuffle

def undersampling(train, desired_apriori):
    # Get the indices per target value
    idx_0 = train[train.target == 0].index
    idx_1 = train[train.target == 1].index
    # Get original number of records per target value
    nb_0 = len(idx_0)
    nb_1 = len(idx_1)
    # Calculate the undersampling rate and resulting number of records with target=0
    undersampling_rate = ((1-desired_apriori)*nb_1)/(nb_0*desired_apriori)
    undersampled_nb_0 = int(undersampling_rate*nb_0)
    print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
    print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))
    # Randomly select records with target=0 to get at the desired a priori
    undersampled_idx = shuffle(idx_0, n_samples=undersampled_nb_0)
    # Construct list with remaining indices
    idx_list = list(undersampled_idx) + list(idx_1)
    # Return undersampled data frame
    train = train.loc[idx_list].reset_index(drop=True)

    return train

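Because this was written for a specific project, it undersamples the negative examples (target == 0); using it elsewhere requires some changes.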
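One way to make those changes is sketched below. This is a rough generalization, not the project's code: the function name `undersample_majority` and the parameters `target_col`, `minority_label`, and `random_state` are assumptions introduced here, and it uses pandas' `DataFrame.sample` in place of `shuffle`. It also uses `round` rather than `int` so that floating-point error cannot drop a row.

```python
import pandas as pd


def undersample_majority(df, desired_apriori, target_col="target",
                         minority_label=1, random_state=None):
    """Randomly drop majority rows until the minority share is about desired_apriori."""
    minority = df[df[target_col] == minority_label]
    majority = df[df[target_col] != minority_label]
    # Majority rows to keep so that minority / (minority + kept) == desired_apriori.
    n_keep = round(len(minority) * (1 - desired_apriori) / desired_apriori)
    if n_keep > len(majority):
        raise ValueError("desired_apriori is below the current minority share")
    kept = majority.sample(n=n_keep, random_state=random_state)
    return pd.concat([kept, minority]).reset_index(drop=True)


# Toy frame: 9,600 negatives and 400 positives (4% positive).
toy = pd.DataFrame({"target": [0] * 9600 + [1] * 400})
balanced = undersample_majority(toy, desired_apriori=0.10, random_state=0)
print(balanced["target"].value_counts().to_dict())
```

On the toy frame, a desired prior of 0.10 keeps all 400 positives and 400 * 0.9 / 0.1 = 3,600 of the negatives.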

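If undersampling discards negative examples at random, some important information may be lost. To address this, Zhi-Hua Zhou's lab proposed the undersampling algorithm EasyEnsemble: using an ensemble learning mechanism, the negative examples are divided into several subsets, one for each learner. Each learner therefore sees undersampled data, but taken together the learners do not lose important information. This method needs only a small change to the basic undersampling method: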

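def easyensemble(df, desired_apriori, n_subsets=10):
    # Build n_subsets training sets; each call to undersampling() draws a
    # fresh random subset of the negatives and keeps every positive.
    train_resample = []
    for _ in range(n_subsets):
        sel_train = undersampling(df, desired_apriori)
        train_resample.append(sel_train)
    # One data frame per learner in the ensemble
    return train_resample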
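The function above only produces the resampled training sets; the ensemble step is to fit one learner per set and combine their outputs. The following is a self-contained sketch of that step on synthetic data. The helper names (`balanced_subsets`, `fit_ensemble`, `predict_proba_ensemble`), the feature columns, and the data are all hypothetical. Training AdaBoost on each subset follows the paper, while averaging the predicted probabilities is a simplification of the paper's combination rule.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier


def balanced_subsets(df, n_subsets, target_col="target", seed=0):
    """Yield n_subsets frames: all positives plus an equal-size random draw of negatives."""
    pos = df[df[target_col] == 1]
    neg = df[df[target_col] == 0]
    for i in range(n_subsets):
        # A different seed per subset so each learner sees different negatives.
        neg_sample = neg.sample(n=len(pos), random_state=seed + i)
        yield pd.concat([pos, neg_sample])


def fit_ensemble(df, feature_cols, n_subsets=10, target_col="target"):
    """Fit one AdaBoost model per balanced subset."""
    models = []
    for subset in balanced_subsets(df, n_subsets, target_col):
        model = AdaBoostClassifier(n_estimators=10, random_state=0)
        model.fit(subset[feature_cols], subset[target_col])
        models.append(model)
    return models


def predict_proba_ensemble(models, X):
    """Average the positive-class probability over all models (a simplification)."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)


# Synthetic data with roughly 4% positives, for illustration only.
rng = np.random.default_rng(0)
n = 5000
y = (rng.random(n) < 0.04).astype(int)
X = rng.normal(size=(n, 2)) + y[:, None] * 2.0  # positives are shifted
data = pd.DataFrame({"f1": X[:, 0], "f2": X[:, 1], "target": y})

models = fit_ensemble(data, ["f1", "f2"], n_subsets=5)
scores = predict_proba_ensemble(models, data[["f1", "f2"]])
print(len(models), scores.shape)
```

Because every subset keeps all positives and draws its negatives independently, most negatives are seen by at least one learner once the number of subsets is large enough.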

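For more detail, the figure below shows the algorithm description from the original paper, Exploratory Undersampling for Class-Imbalance Learning:
[Figure: EasyEnsemble algorithm description from the paper]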

References:

Zhi-Hua Zhou. Machine Learning (《機器學習》).
https://www.kaggle.com/bertcarremans/data-preparation-exploration
http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.ensemble.BalanceCascade.html#imblearn.ensemble.BalanceCascade
