statistical learning -- Model selection
Model selection
https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#score-and-cross-validated-scores
Model selection involves two parts:
(1) choosing the type of estimator, i.e. the model itself;
(2) choosing the model's parameters.
choosing estimators and their parameters
Score, and cross-validated scores
A single model can be evaluated through its score method: the larger the score, the better the fit.
The evaluation depends on the data it is computed on; to get a better measure of prediction accuracy, the KFold method can be used.
As we have seen, every estimator exposes a score method that can judge the quality of the fit (or the prediction) on new data. Bigger is better.

>>> from sklearn import datasets, svm
>>> X_digits, y_digits = datasets.load_digits(return_X_y=True)
>>> svc = svm.SVC(C=1, kernel='linear')
>>> svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.98
To get a better measure of prediction accuracy (which we can use as a proxy for goodness of fit of the model), we can successively split the data in folds that we use for training and testing:
>>> import numpy as np
>>> X_folds = np.array_split(X_digits, 3)
>>> y_folds = np.array_split(y_digits, 3)
>>> scores = list()
>>> for k in range(3):
...     # We use 'list' to copy, in order to 'pop' later on
...     X_train = list(X_folds)
...     X_test = X_train.pop(k)
...     X_train = np.concatenate(X_train)
...     y_train = list(y_folds)
...     y_test = y_train.pop(k)
...     y_train = np.concatenate(y_train)
...     scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
>>> print(scores)
[0.934..., 0.956..., 0.939...]
This is called a KFold cross-validation.
Cross-validation generators
https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#cross-validation-generators
In fact, sklearn provides cross-validation utilities.
The KFold utility splits the data into K equal parts and generates the corresponding training and validation sets.
Scikit-learn has a collection of classes which can be used to generate lists of train/test indices for popular cross-validation strategies.
They expose a split method which accepts the input dataset to be split and yields the train/test set indices for each iteration of the chosen cross-validation strategy. This example shows an example usage of the split method.
>>> from sklearn.model_selection import KFold, cross_val_score
>>> X = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"]
>>> k_fold = KFold(n_splits=5)
>>> for train_indices, test_indices in k_fold.split(X):
...     print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5 6 7 8 9] | test: [0 1]
Train: [0 1 4 5 6 7 8 9] | test: [2 3]
Train: [0 1 2 3 6 7 8 9] | test: [4 5]
Train: [0 1 2 3 4 5 8 9] | test: [6 7]
Train: [0 1 2 3 4 5 6 7] | test: [8 9]
A list comprehension can then be used to compute the training and validation score for each fold.
The cross-validation can then be performed easily:
>>> [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
...  for train, test in k_fold.split(X_digits)]
[0.963..., 0.922..., 0.963..., 0.963..., 0.930...]
In fact, this work can be done directly with the cross_val_score helper.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score
The cross-validation score can be directly calculated using the cross_val_score helper. Given an estimator, the cross-validation object and the input dataset, cross_val_score splits the data repeatedly into a training and a testing set, trains the estimator using the training set and computes the scores based on the testing set for each iteration of cross-validation. By default the estimator's score method is used to compute the individual scores. Refer to the metrics module to learn more about the available scoring methods.
>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
array([0.96388889, 0.92222222, 0.9637883 , 0.9637883 , 0.93036212])
n_jobs controls how many processes run in parallel to carry out the cross-validation; -1 means use as many CPUs as possible.
The scoring argument can also be used to specify which type of score to compute.
n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer. Alternatively, the scoring argument can be provided to specify an alternative scoring method.

>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold,
...                 scoring='precision_macro')
array([0.96578289, 0.92708922, 0.96681476, 0.96362897, 0.93192644])
Cross-validation example
https://scikit-learn.org/stable/auto_examples/exercises/plot_cv_diabetes.html#sphx-glr-auto-examples-exercises-plot-cv-diabetes-py
This example performs hyperparameter selection analysis for the Lasso model.
As the alpha parameter increases, the cross-validation score first rises and then falls, so an intermediate value should be chosen for this parameter.
Cross-validation on diabetes Dataset Exercise
A tutorial exercise which uses cross-validation with linear models.
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:150]
y = y[:150]

lasso = Lasso(random_state=0, max_iter=10000)
alphas = np.logspace(-4, -0.5, 30)

tuned_parameters = [{'alpha': alphas}]
n_folds = 5

clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=False)
clf.fit(X, y)
scores = clf.cv_results_['mean_test_score']
scores_std = clf.cv_results_['std_test_score']

plt.figure().set_size_inches(8, 6)
plt.semilogx(alphas, scores)

# plot error lines showing +/- std. errors of the scores
std_error = scores_std / np.sqrt(n_folds)

plt.semilogx(alphas, scores + std_error, 'b--')
plt.semilogx(alphas, scores - std_error, 'b--')

# alpha=0.2 controls the translucency of the fill color
plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)

plt.ylabel('CV score +/- std error')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')
plt.xlim([alphas[0], alphas[-1]])

# #############################################################################
# Bonus: how much can you trust the selection of alpha?

# To answer this question we use the LassoCV object that sets its alpha
# parameter automatically from the data by internal cross-validation (i.e. it
# performs cross-validation on the training data it receives).
# We use external cross-validation to see how much the automatically obtained
# alphas differ across different cross-validation folds.
lasso_cv = LassoCV(alphas=alphas, random_state=0, max_iter=10000)
k_fold = KFold(3)

print("Answer to the bonus question:",
      "how much can you trust the selection of alpha?")
print()
print("Alpha parameters maximising the generalization score on different")
print("subsets of the data:")
for k, (train, test) in enumerate(k_fold.split(X, y)):
    lasso_cv.fit(X[train], y[train])
    print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
          format(k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))
print()
print("Answer: Not very much since we obtained different alphas for different")
print("subsets of the data and moreover, the scores for these alphas differ")
print("quite substantially.")

plt.show()
output
Answer to the bonus question: how much can you trust the selection of alpha?

Alpha parameters maximising the generalization score on different
subsets of the data:
[fold 0] alpha: 0.05968, score: 0.54209
[fold 1] alpha: 0.04520, score: 0.15523
[fold 2] alpha: 0.07880, score: 0.45193

Answer: Not very much since we obtained different alphas for different
subsets of the data and moreover, the scores for these alphas differ
quite substantially.
Grid-search
https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#grid-search
For hyperparameter selection with cross-validation, sklearn provides a tool that automates the process: it fits the estimator for every point on a hyperparameter grid and keeps the parameters that maximize the cross-validation score.
GridSearchCV performs this grid search, and the fitted object itself behaves like an estimator, representing the finally selected model.
scikit-learn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score. This object takes an estimator during the construction and exposes an estimator API:
>>> from sklearn.model_selection import GridSearchCV, cross_val_score
>>> Cs = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
...                    n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.925...
>>> clf.best_estimator_.C
0.0077...
>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...
Cross-validated estimators
https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#cross-validated-estimators
Cross-validation for setting a hyperparameter can be done more efficiently on an algorithm-by-algorithm basis.
This is why some estimators provide a built-in cross-validated variant, e.g. LassoCV.
Cross-validation to set a parameter can be done more efficiently on an algorithm-by-algorithm basis. This is why, for certain estimators, scikit-learn exposes Cross-validation: evaluating estimator performance estimators that set their parameter automatically by cross-validation:
>>> from sklearn import linear_model, datasets
>>> lasso = linear_model.LassoCV()
>>> X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)
>>> lasso.fit(X_diabetes, y_diabetes)
LassoCV()
>>> # The estimator chose automatically its lambda:
>>> lasso.alpha_
0.00375...
These estimators are called similarly to their counterparts, with ‘CV’ appended to their name.
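For instance (a minimal sketch, not taken from the tutorial; the alpha grid is an arbitrary illustrative choice), RidgeCV is the cross-validated counterpart of Ridge:

import numpy as np
from sklearn import datasets, linear_model

X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)

# RidgeCV chooses the regularization strength alpha by internal cross-validation.
ridge = linear_model.RidgeCV(alphas=np.logspace(-4, 2, 20))
ridge.fit(X_diabetes, y_diabetes)
print(ridge.alpha_)  # the alpha selected by cross-validation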
Uses of N-fold cross-validation
https://zhuanlan.zhihu.com/p/113623623
Use 1: model selection
The most important use of cross-validation is model selection, which can also be called hyperparameter selection.
In this case the dataset needs to be divided into three parts: a training set, a validation set and a test set, with the training/validation split done in an N-fold fashion. Many people confuse the validation set with the test set; in this setting the two must be clearly distinguished:
- the validation set is used during training to check how well the model is learning, and thus to determine suitable hyperparameters;
- the test set is used after training is finished to measure the model's generalization ability.
Concretely, the candidate models (hyperparameter settings) are first compared on the training and validation sets, and the model (hyperparameters) with the smallest average error is selected. Once a suitable model (hyperparameters) is chosen, the training and validation sets can be merged and the model retrained on the combined data to obtain the final model, whose generalization ability is then measured on the test set; a sketch of this procedure is given below.
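A minimal sketch of this procedure, assuming GridSearchCV on the digits data used earlier (the parameter grid, split size and fold count are illustrative choices, not prescribed by the article):

import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_digits(return_X_y=True)

# Hold out a test set that is only touched at the very end.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# N-fold cross-validation on the training data plays the role of the
# validation set: each candidate C is scored on held-out folds.
param_grid = {'C': np.logspace(-6, -1, 10)}  # illustrative grid
search = GridSearchCV(svm.SVC(kernel='linear'), param_grid, cv=5)
search.fit(X_train, y_train)  # refit=True (default) retrains the best model on all of X_train

# Only now is the test set used, to estimate generalization ability.
print(search.best_params_, search.score(X_test, y_test))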
Representative explanations of this use of cross-validation include Hung-yi Lee's "Machine Learning" course at NTU and Fei-Fei Li's "CS231N" computer vision course.
NTU, Hung-yi Lee, "Machine Learning" course, Lec 2: "Where does the error come from"
The other use of cross-validation arises when the model is already fixed, i.e. there are no candidate models to choose among, and cross-validation is used only to evaluate the model's performance.
In this case the dataset is divided into two parts, a training set and a test set, with the split done in an N-fold fashion. There is no real validation set here, so personally I feel this method would be more aptly called "cross-testing"...
Compared with the traditional way of evaluating a model (a single fixed split into training and test sets), the advantage of cross-validation is that it avoids problems caused by an unreasonable split of the dataset, e.g. the model overfitting the training set not because of the model itself but because the split was poor. This happens easily when training on small datasets, so cross-validation is especially advantageous for evaluating models on small datasets (see the sketch below).
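A minimal sketch of this second use, assuming a fixed SVC on the digits data and 5 folds (both are illustrative choices):

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_digits(return_X_y=True)

# The model and its hyperparameters are fixed in advance; cross-validation
# is used only to estimate how well this one model performs.
svc = svm.SVC(C=1, kernel='linear')
scores = cross_val_score(svc, X, y, cv=5)
print(scores.mean(), scores.std())  # average score across folds and its spread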
A representative explanation of this type of cross-validation is Zhou Zhihua's "Machine Learning".
Zhou Zhihua, "Machine Learning"