[Python third-party packages] scikit-learn: the difference between KFold and StratifiedKFold
阿新 • Published: 2018-11-01
class sklearn.model_selection.StratifiedKFold(n_splits=3, shuffle=False, random_state=None)
Stratified K-Folds cross-validator. Provides train/test indices to split data in train/test sets. This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
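In other words, StratifiedKFold splits the data into folds according to the relative proportion of each class label.
Let's look at the code: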
import sklearn.model_selection as skmodsel

# 100 samples; labels are thirty 0s, thirty 1s, thirty 2s and ten 3s
a = [[i] for i in range(100)]
b = [0] * 30 + [1] * 30 + [2] * 30 + [3] * 10

K_strafold = skmodsel.StratifiedKFold(n_splits=10)
K_folds = skmodsel.KFold(n_splits=10)

# KFold: count the labels in the training set of each fold
for train_indices, test_indices in K_folds.split(a):
    d = {}
    for i in train_indices:
        d[b[i]] = d.setdefault(b[i], 0) + 1
    print(d)
print('------------------')

# StratifiedKFold: count the labels in the training set of each fold
for train_indices, test_indices in K_strafold.split(a, b):
    d = {}
    for i in train_indices:
        d[b[i]] = d.setdefault(b[i], 0) + 1
    print(d)
print('------------------')

# Label counts over the whole data set, for comparison
d = {}
for i in range(100):
    d[b[i]] = d.setdefault(b[i], 0) + 1
print(d)
The output is as follows:
{0: 20, 1: 30, 2: 30, 3: 10}
{0: 20, 1: 30, 2: 30, 3: 10}
{0: 20, 1: 30, 2: 30, 3: 10}
{0: 30, 1: 20, 2: 30, 3: 10}
{0: 30, 1: 20, 2: 30, 3: 10}
{0: 30, 1: 20, 2: 30, 3: 10}
{0: 30, 1: 30, 2: 20, 3: 10}
{0: 30, 1: 30, 2: 20, 3: 10}
{0: 30, 1: 30, 2: 20, 3: 10}
{0: 30, 1: 30, 2: 30}
------------------
{0: 27, 1: 27, 2: 27, 3: 9}
{0: 27, 1: 27, 2: 27, 3: 9}
{0: 27, 1: 27, 2: 27, 3: 9}
{0: 27, 1: 27, 2: 27, 3: 9}
{0: 27, 1: 27, 2: 27, 3: 9}
{0: 27, 1: 27, 2: 27, 3: 9}
{0: 27, 1: 27, 2: 27, 3: 9}
{0: 27, 1: 27, 2: 27, 3: 9}
{0: 27, 1: 27, 2: 27, 3: 9}
{0: 27, 1: 27, 2: 27, 3: 9}
------------------
{0: 30, 1: 30, 2: 30, 3: 10}
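We set ten folds, which means a is divided into ten parts and each part is used once as the test set.
The result is clear: the labels consist of 30 each of 0, 1 and 2, plus 10 of label 3, so their ratio is 3:3:3:1.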
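With plain KFold, each fold consists of ten consecutive samples, which here all share the same label. Because shuffle defaults to False, the split simply follows the order of b: indices 0-9 form the first fold, 10-19 the second, and so on.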
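The consecutive-block behaviour can be checked directly by looking at the test indices instead of the training labels. The following is a minimal sketch on the same 100 samples; the names k_folds and test_folds are introduced here only for illustration.

```python
import sklearn.model_selection as skmodsel

a = [[i] for i in range(100)]
k_folds = skmodsel.KFold(n_splits=10)  # shuffle is False by default

# Collect the test indices of every fold as plain Python lists
test_folds = [test_indices.tolist() for _, test_indices in k_folds.split(a)]

# Print the first and last test index of each fold
for fold_no, test_indices in enumerate(test_folds, start=1):
    print(fold_no, test_indices[0], test_indices[-1])
```

Each fold should report a contiguous range of ten indices, which is why every test fold here contains only one label.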
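With StratifiedKFold, by contrast, samples are drawn according to the class proportions: every training set keeps the 3:3:3:1 ratio, and so does every test set.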
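The listing above only prints the training-set counts, so the claim about the test sets is worth verifying separately. Here is a small sketch, assuming the same data as before; k_strafold and test_counts are illustrative names, and Counter is used only as a convenient way to tally labels.

```python
from collections import Counter

import sklearn.model_selection as skmodsel

a = [[i] for i in range(100)]
b = [0] * 30 + [1] * 30 + [2] * 30 + [3] * 10

k_strafold = skmodsel.StratifiedKFold(n_splits=10)

# Tally the labels that land in the test set of each fold
test_counts = []
for _, test_indices in k_strafold.split(a, b):
    counts = Counter(b[i] for i in test_indices)
    test_counts.append(dict(sorted(counts.items())))

for counts in test_counts:
    print(counts)
```

Since every class size is a multiple of ten, each test fold should hold three samples each of labels 0, 1 and 2 and one sample of label 3, matching the 3:3:3:1 ratio of the full data set.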