Notes on Cross-validation

1. The train_test_split function

The train_test_split function quickly splits a dataset into a training set and a test set. Taking the iris dataset as an example, we use the SVM algorithm for classification:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
iris = datasets.load_iris()
x = iris.data
y = iris.target

Now use the train_test_split function to split the data into a training set and a test set:

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
X_train.shape, y_train.shape
out:((90, 4), (90,))
X_test.shape, y_test.shape
out:((60, 4), (60,))
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
out:0.9666666666666667

As score shows, the trained classifier reaches an accuracy of 0.9666666666666667 on the test set.
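One thing worth noting is that a single random split can end up with skewed class proportions. A minimal sketch, assuming the x and y from above, using the real stratify parameter of train_test_split to keep the class ratio identical in both sets:

# stratify=y preserves the class proportions of y in both splits,
# which matters when classes are imbalanced
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0, stratify=y)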

2. The cross_val_score function

The following demonstrates K-fold cross-validation:

from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores                                              
out:array([0.96..., 1.  ..., 0.96..., 0.96..., 1.        ])
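It is common to summarize the fold scores by their mean and spread; a minimal sketch using the scores array returned above:

# report the mean score and a rough +/- 2 std interval
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))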

3. The KFold function

from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
for train, test in kf.split(x):
    print(test)

This splits the x above evenly into 5 folds; each round, 1 fold serves as the validation set and the remaining folds form the training set. With n_splits=5 the data is divided into 5 folds in sample order, and the validation folds come up in order as well: round 1 uses fold 1 as the validation set and folds 2-5 for training, round 2 uses fold 2 as the validation set and folds 1, 3, 4, 5 for training, and so on. A full evaluation loop built on these indices is sketched below.
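A minimal sketch of such a loop, assuming the iris x, y and the linear SVC from section 1 (shuffle=True and random_state are real KFold parameters, used here to break the sample ordering before folding):

from sklearn.model_selection import KFold
from sklearn import svm

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # shuffle first, then fold
fold_scores = []
for train_idx, test_idx in kf.split(x):
    clf = svm.SVC(kernel='linear', C=1)
    clf.fit(x[train_idx], y[train_idx])                       # train on 4 folds
    fold_scores.append(clf.score(x[test_idx], y[test_idx]))   # validate on 1 fold
print(fold_scores)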

4. The LeaveOneOut function (leave-one-out validation)

>>> from sklearn.model_selection import LeaveOneOut
>>> X = [1, 2, 3, 4]
>>> loo = LeaveOneOut()
>>> for train, test in loo.split(X):
...     print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]

That is, each round holds out exactly one sample as the validation set, and the held-out samples appear in index order.
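Any of these splitters can also be passed directly as the cv argument of cross_val_score; a minimal sketch with leave-one-out on the iris x, y from section 1 (note this fits the model once per sample, which gets expensive on large datasets):

from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn import svm

clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, x, y, cv=LeaveOneOut())  # one fit per sample
print(scores.mean())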

5. The LeavePOut function (leave-P-out validation)

As the name suggests, P samples are held out as the validation set, and the function enumerates every possible combination of P samples exactly once.

>>> from sklearn.model_selection import LeavePOut

>>> X = np.ones(4)
>>> lpo = LeavePOut(p=2)
>>> for train, test in lpo.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]

X has 4 samples, so the indices are [0, 1, 2, 3]; with leave-2-out, every 2-index combination appears exactly once as the validation set, in order.
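The number of splits grows combinatorially as C(n, p), so leave-P-out quickly becomes expensive (leave-2-out on the 150 iris samples already yields 11175 splits). A minimal check using the real get_n_splits method:

print(lpo.get_n_splits(X))  # C(4, 2) = 6 splits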

6. The ShuffleSplit function

The ShuffleSplit function first shuffles the data and then splits it into a training set and a validation set:

>>> from sklearn.model_selection import ShuffleSplit
>>> X = np.arange(10)
>>> ss = ShuffleSplit(n_splits=5, test_size=0.25,
...     random_state=0)
>>> for train_index, test_index in ss.split(X):
...     print("%s %s" % (train_index, test_index))
[9 1 6 7 3 0 5] [2 8 4]
[2 9 8 0 6 7 4] [3 5 1]
[4 5 1 0 6 9 7] [2 3 8]
[2 7 5 8 0 3 4] [6 1 9]
[4 1 0 6 8 9 3] [5 2 7]
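ShuffleSplit also works as the cv argument of cross_val_score; a minimal sketch, assuming the iris x, y from section 1:

from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn import svm

cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
scores = cross_val_score(svm.SVC(kernel='linear', C=1), x, y, cv=cv)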

7. The StratifiedKFold function

In classification problems where the positive and negative classes are imbalanced, splitting with a plain KFold easily produces folds whose class proportions are skewed. StratifiedKFold solves this: while splitting the data, it also preserves the positive/negative class ratio in each fold.

>>> from sklearn.model_selection import StratifiedKFold

>>> X = np.ones(10)
>>> y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
>>> skf = StratifiedKFold(n_splits=3)
>>> for train, test in skf.split(X, y):
...     print("%s %s" % (train, test))
[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]

As you can see, the positive class is the majority in this example; by passing both X and y into split, StratifiedKFold can account for the class imbalance.
As with ShuffleSplit, there is also a StratifiedShuffleSplit function.
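A minimal sketch of StratifiedShuffleSplit on the same X, y as above (the class and its parameters are real scikit-learn API; the exact indices depend on random_state, so no output is shown):

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
for train, test in sss.split(X, y):
    # each test fold keeps roughly the 4:6 ratio of classes 0 and 1
    print(train, test)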

8. The TimeSeriesSplit function

A splitting function for time-series data. I have not fully worked out the details yet, so let's look at an example first:

>>> from sklearn.model_selection import TimeSeriesSplit

>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit(n_splits=3)
>>> print(tscv)  
TimeSeriesSplit(max_train_size=None, n_splits=3)
>>> for train, test in tscv.split(X):
...     print("%s %s" % (train, test))
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]
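The pattern in the output: each training set contains only samples that come before its test fold, and successive training sets grow by absorbing the previous test fold, so the model is never trained on future data. Like the other splitters, it can be passed as cv; a minimal sketch on the X, y above, using a plain LinearRegression purely for illustration (mean-absolute-error scoring is chosen because per-fold R^2 is undefined on single-sample test folds):

from sklearn.model_selection import cross_val_score, TimeSeriesSplit
from sklearn.linear_model import LinearRegression

tscv = TimeSeriesSplit(n_splits=3)
scores = cross_val_score(LinearRegression(), X, y, cv=tscv,
                         scoring='neg_mean_absolute_error')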