sklearn.model_selection Part 1: Splitter Classes

1. GroupKFold(_BaseKFold)

Main parameters:

n_splits : int, default=3

GroupKFold.split(X[, y, groups]) calls the following method:

def _iter_test_indices(self, X, y, groups):
    if groups is None:
        raise ValueError("The 'groups' parameter should not be None.")
    groups = check_array(groups, ensure_2d=False, dtype=None)

    unique_groups, groups = np.unique(groups, return_inverse=True)  # unique_groups[groups] reconstructs the original groups array
    n_groups = len(unique_groups)

    if self.n_splits > n_groups:
        raise ValueError("Cannot have number of splits n_splits=%d greater"
                         " than the number of groups: %d."
                         % (self.n_splits, n_groups))

    # Weight groups by their number of occurrences
    n_samples_per_group = np.bincount(groups)  # number of samples in each group

    # Distribute the most frequent groups first
    indices = np.argsort(n_samples_per_group)[::-1]  # group indices ordered from the largest group to the smallest
    n_samples_per_group = n_samples_per_group[indices]  # group sizes in descending order; position k now refers to group indices[k]

    # Total weight of each fold
    n_samples_per_fold = np.zeros(self.n_splits)

    # Mapping from group index to fold index
    group_to_fold = np.zeros(len(unique_groups))

    # Distribute samples by adding the largest weight to the lightest fold
    # "largest weight" is the size of the largest group not yet assigned; "lightest fold" is the fold that currently holds the fewest samples
    for group_index, weight in enumerate(n_samples_per_group):
        lightest_fold = np.argmin(n_samples_per_fold)
        n_samples_per_fold[lightest_fold] += weight
        group_to_fold[indices[group_index]] = lightest_fold  # a group goes into exactly one fold (a fold may hold several groups), which is why n_groups must be at least n_splits

    indices = group_to_fold[groups]

    for f in range(self.n_splits):
        yield np.where(indices == f)[0]  # test indices of fold f

Summary

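In the version shown here, GroupKFold has no randomness parameter: once the groups of the samples are fixed, the samples that make up each fold are fixed as well. The algorithm, in words: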

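  1. Sort the groups by the number of samples they contain, largest first.
  2. Iterate over the groups once, in that order.
  3. Put all samples of the current group into the fold that currently holds the fewest samples.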

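When using it, make sure the number of groups is at least the number of folds (the code only raises when n_splits is greater than the number of groups). Samples from the same group always end up in the same fold.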
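
The behaviour described above can be checked with a small sketch. The toy data, the group labels and the variable names are assumptions made for illustration; only GroupKFold and its split method come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data (hypothetical): 8 samples in 4 groups of sizes 3, 2, 2 and 1.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array(["a", "a", "a", "b", "b", "c", "c", "d"])

gkf = GroupKFold(n_splits=2)
folds = list(gkf.split(X, y, groups))

for train_idx, test_idx in folds:
    # Every group sits entirely on one side of the split.
    print("train groups:", sorted(set(groups[train_idx])),
          "test groups:", sorted(set(groups[test_idx])))
```

With these group sizes the greedy assignment balances the two folds at four samples each: the largest group goes to one fold, the two groups of size 2 go to the other, and the singleton fills up the first.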

2. GroupShuffleSplit(ShuffleSplit)

Main parameters:

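n_splits : int (default 5) Number of times the data is split into a training set and a test set; this is a number of repetitions, not a number of folds.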

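train_size/test_size : proportion or absolute number for the training or test set. Because ShuffleSplit is applied to the unique groups, these sizes count groups rather than individual samples.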

random_state

GroupShuffleSplit.split(X[, y, groups]) relies on the following method:

def _iter_indices(self, X, y, groups):
    if groups is None:
        raise ValueError("The 'groups' parameter should not be None.")
    groups = check_array(groups, ensure_2d=False, dtype=None)
    classes, group_indices = np.unique(groups, return_inverse=True)
    for group_train, group_test in super(
            GroupShuffleSplit, self)._iter_indices(X=classes):  # run ShuffleSplit over the unique groups; group_train holds the indices of the groups used for training, group_test those used for testing
        # these are the indices of classes in the partition
        # invert them into data indices

        # np.in1d returns a boolean array marking which elements of group_indices are in group_train
        # np.flatnonzero() returns the indices of the nonzero (True) entries, i.e. the sample indices selected for train or test
        train = np.flatnonzero(np.in1d(group_indices, group_train))
        test = np.flatnonzero(np.in1d(group_indices, group_test))

        yield train, test

This in turn calls the method of the same name on ShuffleSplit:

# Method of ShuffleSplit
def _iter_indices(self, X, y=None, groups=None):
    n_samples = _num_samples(X)  # number of samples in X
    n_train, n_test = _validate_shuffle_split(n_samples,  # validate the arguments and return the train and test set sizes
                                              self.test_size,
                                              self.train_size)
    rng = check_random_state(self.random_state)  # validate random_state and return a np.random.RandomState instance
    for i in range(self.n_splits):
        # random partition
        permutation = rng.permutation(n_samples)  # random permutation of the sample indices
        ind_test = permutation[:n_test]
        ind_train = permutation[n_test:(n_test + n_train)]
        yield ind_train, ind_test  # train and test indices for this split (a repetition, not a fold)

Summary

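GroupShuffleSplit does take a random_state parameter; its randomness comes from its parent class ShuffleSplit. The core of the algorithm is to run ShuffleSplit over the group labels, which is why it inherits from that class. All samples of the groups drawn for train form the training set, and all samples of the groups drawn for test form the test set. As with GroupKFold, the data of one group appears either only in train or only in test, never in both.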
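
A minimal sketch of this behaviour, assuming toy data with four groups (the arrays and variable names are made up for illustration). With test_size=0.25 and four groups, one group should be held out in each repetition.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data (hypothetical): 10 samples spread over 4 groups.
X = np.zeros((10, 1))
groups = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])

# n_splits is the number of repetitions; test_size is a proportion of groups.
gss = GroupShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
splits = list(gss.split(X, groups=groups))

for train_idx, test_idx in splits:
    train_groups = set(groups[train_idx].tolist())
    test_groups = set(groups[test_idx].tolist())
    print("train groups:", sorted(train_groups), "test groups:", sorted(test_groups))
```

Unlike a K-fold scheme, nothing forces every group to be used as test data: the same group may be drawn for testing in several repetitions.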

3. KFold(_BaseKFold)

Note

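The shuffle parameter defaults to False, whereas we usually want it set to True. random_state is only used when shuffle is True. In that case, if random_state is left unset, the result differs from run to run; only by passing the same random_state value each time do you get the same shuffle. In short, shuffle decides whether randomness is introduced, and random_state merely makes that randomness reproducible.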
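
The interplay of shuffle and random_state can be seen in a short sketch. The helper collect_test_folds and the toy array are hypothetical names introduced here; the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)

def collect_test_folds(cv):
    """Return the test indices of every fold as plain lists."""
    return [test_idx.tolist() for _, test_idx in cv.split(X)]

# Without shuffling, the folds are consecutive blocks of indices.
plain = collect_test_folds(KFold(n_splits=5))

# With shuffling and a fixed seed, two independent splitters agree.
seeded_a = collect_test_folds(KFold(n_splits=5, shuffle=True, random_state=42))
seeded_b = collect_test_folds(KFold(n_splits=5, shuffle=True, random_state=42))

print(plain)
print(seeded_a == seeded_b)
```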

4. LeaveOneGroupOut(BaseCrossValidator)

    def _iter_test_masks(self, X, y, groups):
        if groups is None:
            raise ValueError("The 'groups' parameter should not be None.")
        # We make a copy of groups to avoid side-effects during iteration
        groups = check_array(groups, copy=True, ensure_2d=False, dtype=None)
        unique_groups = np.unique(groups)
        if len(unique_groups) <= 1:
            raise ValueError(
                "The groups parameter contains fewer than 2 unique groups "
                "(%s). LeaveOneGroupOut expects at least 2." % unique_groups)
        for i in unique_groups:
            yield groups == i  # boolean mask selecting the test samples (all samples of group i)

Summary

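At first I was not clear about leave-one-out. It is simply the extreme case of cross-validation: when the number of folds equals the number of samples in the dataset, you get leave-one-out. Once that is understood, the source above is clear and concise. In one sentence, the algorithm performs leave-one-out at the level of groups.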
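
As a small illustration, assuming six toy samples in three groups (data and names are made up), each split holds out exactly one group:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy data (hypothetical): 6 samples, 3 groups of 2 samples each.
X = np.zeros((6, 1))
groups = np.array([1, 1, 2, 2, 3, 3])

logo = LeaveOneGroupOut()
held_out = []
for train_idx, test_idx in logo.split(X, groups=groups):
    # The test set is one whole group; the rest is the training set.
    held_out.append(set(groups[test_idx].tolist()))

print(held_out)
print(logo.get_n_splits(groups=groups))
```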

5. LeavePGroupsOut(BaseCrossValidator)

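Once the previous class is understood, this one is simple: P groups are held out as the test set, and every combination of P groups is held out in turn.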
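
A brief sketch, assuming four toy groups and P = 2 (the data is made up for illustration). The number of splits should then equal the number of ways to choose 2 groups out of 4.

```python
import math
import numpy as np
from sklearn.model_selection import LeavePGroupsOut

# Toy data (hypothetical): 8 samples, 4 groups of 2 samples each.
X = np.zeros((8, 1))
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

lpgo = LeavePGroupsOut(n_groups=2)
test_group_sets = [
    set(groups[test_idx].tolist())
    for _, test_idx in lpgo.split(X, groups=groups)
]

print(len(test_group_sets), math.comb(4, 2))
```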

6. LeaveOneOut(BaseCrossValidator)

7. LeavePOut(BaseCrossValidator)

8. PredefinedSplit(BaseCrossValidator)

def __init__(self, test_fold):
    self.test_fold = np.array(test_fold, dtype=int)  # the np.int alias was removed in recent NumPy releases; the builtin int is equivalent
    self.test_fold = column_or_1d(self.test_fold)  # flatten an array of shape (n_samples, 1) to shape (n_samples,)
    self.unique_folds = np.unique(self.test_fold)
    self.unique_folds = self.unique_folds[self.unique_folds != -1]  # samples marked -1 always go to train, so the number of distinct values other than -1 is the number of splits


def split(self, X=None, y=None, groups=None):
    ind = np.arange(len(self.test_fold))
    for test_index in self._iter_test_masks():
        train_index = ind[np.logical_not(test_index)]
        test_index = ind[test_index]
        yield train_index, test_index


def _iter_test_masks(self):
    """Generates boolean masks corresponding to test sets."""
    for f in self.unique_folds:
        test_index = np.where(self.test_fold == f)[0]
        test_mask = np.zeros(len(self.test_fold), dtype=bool)  # the np.bool alias was removed in recent NumPy releases
        test_mask[test_index] = True
        yield test_mask

The source above is clear and straightforward; here is an example:

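PredefinedSplit takes a single parameter, test_fold, whose size must equal the size of the dataset. An element of test_fold equal to -1 means that the corresponding sample always goes into the training set; samples whose test_fold elements share the same value go into the same test set. For example, test_fold = [1, 1, 1, -1, -1, -1, 2, 2, 2, 2] produces two splits. In the first split, samples 4 to 10 (counting from 1) are used for training and samples 1 to 3 for testing; in the second split, samples 1 to 6 are used for training and samples 7 to 10 for testing.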
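
The example can be run directly as a quick check. Note that the indices printed by the sketch below are 0-based, as NumPy indices always are; the variable names are chosen here for illustration.

```python
from sklearn.model_selection import PredefinedSplit

test_fold = [1, 1, 1, -1, -1, -1, 2, 2, 2, 2]
ps = PredefinedSplit(test_fold)

# split() needs no data: the folds are fully determined by test_fold.
splits = [(train_idx.tolist(), test_idx.tolist())
          for train_idx, test_idx in ps.split()]

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

The three samples marked -1 (0-based indices 3, 4 and 5) appear in the training set of both splits and never in a test set.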

9. RepeatedKFold(_RepeatedSplits)

Repeats KFold several times with a different randomization each time; internally, shuffle is set to True for every KFold run.
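
A short sketch, assuming six toy samples, 3 folds and 2 repetitions (all values chosen for illustration): the splitter yields n_splits * n_repeats splits, and within each repetition the test sets together cover every sample exactly once.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(6).reshape(-1, 1)

rkf = RepeatedKFold(n_splits=3, n_repeats=2, random_state=0)
splits = list(rkf.split(X))
print(len(splits))

# Group the splits by repetition and collect the test indices of each one.
covered_per_repeat = []
for r in range(2):
    test_indices = np.concatenate([test for _, test in splits[3 * r:3 * r + 3]])
    covered_per_repeat.append(sorted(test_indices.tolist()))
print(covered_per_repeat)
```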

10. RepeatedStratifiedKFold(_RepeatedSplits)

11. ShuffleSplit(BaseShuffleSplit)

12. StratifiedKFold(_BaseKFold)

Note

In split(X, y, [groups]), y is a required argument, not an optional one.
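
The labels are required because the folds are built to preserve the class proportions of y. A minimal sketch, assuming a toy label vector with a 2:1 class ratio (data and names are made up):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): 8 samples of class 0 and 4 of class 1.
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((12, 1))

skf = StratifiedKFold(n_splits=4)
class_counts = [np.bincount(y[test_idx]).tolist()
                for _, test_idx in skf.split(X, y)]

# Each test fold should keep the 2:1 ratio of the full label vector.
print(class_counts)
```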

13. StratifiedShuffleSplit(BaseShuffleSplit)

14. TimeSeriesSplit(_BaseKFold)

def __init__(self, n_splits=3, max_train_size=None):
    super(TimeSeriesSplit, self).__init__(n_splits,
                                          shuffle=False,
                                          random_state=None)
    self.max_train_size = max_train_size  # maximum number of samples in the training set

def split(self, X, y=None, groups=None):
    X, y, groups = indexable(X, y, groups)
    n_samples = _num_samples(X)
    n_splits = self.n_splits
    n_folds = n_splits + 1
    if n_folds > n_samples:  # the number of folds cannot exceed the number of samples, i.e. n_splits cannot exceed n_samples - 1
        raise ValueError(
            ("Cannot have number of folds ={0} greater"
             " than the number of samples: {1}.").format(n_folds,
                                                         n_samples))
    indices = np.arange(n_samples)
    test_size = (n_samples // n_folds)
    test_starts = range(test_size + n_samples % n_folds,
                        n_samples, test_size)
    for test_start in test_starts:
        if self.max_train_size and self.max_train_size < test_start:
            yield (indices[test_start - self.max_train_size:test_start],
                   indices[test_start:test_start + test_size])
        else:
            yield (indices[:test_start],
                   indices[test_start:test_start + test_size])
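
The index arithmetic above can be traced on a small input. The sketch below assumes six toy samples and n_splits=3, so n_folds is 4, test_size is 6 // 4 = 1 and the test blocks start at indices 3, 4 and 5; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)

# Expanding window: the training set grows with every split.
expanding = [(tr.tolist(), te.tolist())
             for tr, te in TimeSeriesSplit(n_splits=3).split(X)]

# Sliding window: max_train_size keeps only the most recent training samples.
sliding = [(tr.tolist(), te.tolist())
           for tr, te in TimeSeriesSplit(n_splits=3, max_train_size=2).split(X)]

print(expanding)
print(sliding)
```

In both variants the training indices always precede the test indices, which is the point of this splitter for time-ordered data.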

Usage summary

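  • KFold classes: n_splits is both the number of folds (which fixes the sizes of train and test) and the number of repetitions.
  • ShuffleSplit classes: n_splits is the number of repetitions; the extra parameters train_size/test_size are needed to fix the sizes of train and test.
  • LeaveOneOut classes: LeaveOneOut takes no parameters; LeavePOut requires the parameter p.

# template
splitter = Splitter(...)
# enumerate yields (i, (train, test)), so the index pair must be unpacked in parentheses
for i, (trn_idx, test_idx) in enumerate(splitter.split(X, y)):
    ...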
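
A concrete, runnable version of this loop, assuming toy data and two arbitrarily chosen splitters, shows the different meaning of n_splits for the KFold and ShuffleSplit families. The dictionary of splitters and the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

splitters = {
    # n_splits=5 fixes both the fold size (10 / 5 = 2) and the number of iterations.
    "KFold": KFold(n_splits=5, shuffle=True, random_state=0),
    # n_splits=3 is only the number of iterations; test_size fixes the test set size.
    "ShuffleSplit": ShuffleSplit(n_splits=3, test_size=3, random_state=0),
}

test_sizes = {}
for name, splitter in splitters.items():
    sizes = []
    for i, (trn_idx, test_idx) in enumerate(splitter.split(X, y)):
        sizes.append(len(test_idx))
    test_sizes[name] = sizes

print(test_sizes)
```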
