Undersampling for Class Imbalance
阿新 • Published: 2018-05-22
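Class imbalance refers to classification tasks in which the number of training examples differs greatly between classes.
There are three common remedies: 1. undersampling, 2. oversampling, 3. threshold moving.
In the project I have been working on these past few days, fewer than 4% of the targets are positive and there is plenty of data, so I went with undersampling:
Undersampling removes some negative examples so that the numbers of positive and negative examples are close, and then trains on the result. The basic algorithm is as follows: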
from sklearn.utils import shuffle

def undersampling(train, desired_apriori):
    # Get the indices per target value
    idx_0 = train[train.target == 0].index
    idx_1 = train[train.target == 1].index
    # Get original number of records per target value
    nb_0 = len(idx_0)
    nb_1 = len(idx_1)
    # Calculate the undersampling rate (fraction of target=0 records to keep)
    # and the resulting number of records with target=0
    undersampling_rate = ((1 - desired_apriori) * nb_1) / (nb_0 * desired_apriori)
    undersampled_nb_0 = int(undersampling_rate * nb_0)
    print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
    print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))
    # Randomly select records with target=0 to reach the desired a priori
    # (sampling is without replacement, so the rate must be at most 1)
    undersampled_idx = shuffle(idx_0, n_samples=undersampled_nb_0)
    # Construct list with remaining indices
    idx_list = list(undersampled_idx) + list(idx_1)
    # Return the undersampled data frame
    train = train.loc[idx_list].reset_index(drop=True)
    return train
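The formula for undersampling_rate can be recovered from the definition of the desired a priori. Below, p stands for desired_apriori, r for undersampling_rate, and n_0, n_1 for nb_0 and nb_1; the symbols are only shorthand introduced here for the derivation.

```latex
% p   : desired share of target=1 records after undersampling
% r   : fraction of target=0 records that are kept
% n_0 : original number of target=0 records
% n_1 : original number of target=1 records
\begin{align*}
p &= \frac{n_1}{n_1 + r\,n_0} \\
p\,(n_1 + r\,n_0) &= n_1 \\
r &= \frac{(1 - p)\,n_1}{p\,n_0}
\end{align*}
```

As an illustration with made-up numbers: with n_0 = 960,000, n_1 = 40,000 (4% positives) and p = 0.10, the rate is r = (0.9 × 40,000) / (0.1 × 960,000) = 0.375, so 360,000 negatives are kept and the positives make up 40,000 / 400,000 = 10% of the result.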
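Because this was written for a specific project, the code undersamples the negative examples (target == 0) and assumes a column named target; it needs some adjustment before being reused elsewhere.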
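One possible adjustment, as a minimal sketch: detect which class is the majority and take the target column name as a parameter. The function name undersample_majority and its parameters target_col and random_state are hypothetical, and the sketch assumes a binary target. It uses pandas' DataFrame.sample instead of sklearn's shuffle and rounds the kept count rather than truncating it.

```python
import pandas as pd


def undersample_majority(df, target_col, desired_apriori, random_state=None):
    """Randomly drop majority-class rows until the minority class makes up
    roughly `desired_apriori` of the result (hypothetical helper, binary target)."""
    counts = df[target_col].value_counts()  # sorted by count, descending
    if len(counts) != 2:
        raise ValueError("this sketch assumes exactly two classes")
    majority_label, minority_label = counts.index[0], counts.index[-1]
    nb_maj, nb_min = int(counts.iloc[0]), int(counts.iloc[-1])

    # Same formula as in undersampling(), with the minority in place of target=1
    n_keep = int(round(nb_min * (1 - desired_apriori) / desired_apriori))
    if n_keep > nb_maj:
        raise ValueError("desired_apriori is below the current minority share")

    kept_majority = df[df[target_col] == majority_label].sample(
        n=n_keep, random_state=random_state
    )
    minority = df[df[target_col] == minority_label]
    # Concatenate, then shuffle the rows so the classes are interleaved
    result = pd.concat([kept_majority, minority])
    return result.sample(frac=1, random_state=random_state).reset_index(drop=True)
```

Passing a fixed random_state makes the draw reproducible; leaving it as None gives a different subset on every call.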
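If undersampling discards negative examples at random, some important information may be lost. To address this, Zhou Zhihua's lab proposed the undersampling algorithm EasyEnsemble: using an ensemble learning mechanism, the negative examples are divided into several sets for different learners to use, so that each individual learner sees an undersampled dataset while, taken as a whole, no important information is lost. In practice this only takes a small change to the basic undersampling method: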
def easyensemble(df, desired_apriori, n_subsets=10):
    train_resample = []
    for _ in range(n_subsets):
        # Each call draws a fresh random subset of the negatives
        sel_train = undersampling(df, desired_apriori)
        train_resample.append(sel_train)
    # One undersampled data frame per learner
    return train_resample
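The function above only produces the resampled data frames; the learners still have to be trained on them and combined. The following is a rough sketch of that step, not the paper's exact procedure. As far as I understand, the paper trains AdaBoost on each subset, so AdaBoostClassifier is used here, but averaging the predicted probabilities is a simplification of my own. The names fit_on_subsets and predict_proba_ensemble are hypothetical, and every column other than the target is assumed to be a numeric feature.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier


def fit_on_subsets(subsets, target_col="target"):
    """Train one AdaBoost learner per undersampled data frame."""
    models = []
    for i, subset in enumerate(subsets):
        X = subset.drop(columns=[target_col])
        y = subset[target_col]
        model = AdaBoostClassifier(n_estimators=10, random_state=i)
        models.append(model.fit(X, y))
    return models


def predict_proba_ensemble(models, X):
    """Average the probability of class 1 over all learners.

    Assumes the target takes the values 0 and 1, so column 1 of
    predict_proba corresponds to target == 1.
    """
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```

With the functions from this post, usage would look like models = fit_on_subsets(easyensemble(train, 0.5)), followed by predict_proba_ensemble(models, test_features), where test_features is a data frame with the same feature columns.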
For a closer look, the figure below is the algorithm description from the original paper, Exploratory Undersampling for Class-Imbalance Learning:
Reference:
- Zhou Zhihua, 《機器學習》 (Machine Learning)
- https://www.kaggle.com/bertcarremans/data-preparation-exploration
- http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.ensemble.BalanceCascade.html#imblearn.ensemble.BalanceCascade