A question about choosing K in K-fold cross validation

阿新 • Published: 2019-01-25

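This post compares how different choices of K in K-fold cross validation affect parameter selection (the chosen model parameters and the CV estimate of the generalization error) and the actual generalization error. The more general question: in a practical model selection problem, how many folds are appropriate?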

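Background on cross validation:

CV is a statistical method for assessing the performance of model hypotheses. The basic idea is to partition the original data in some way, using one part as the training set and the other as the validation set. Each hypothesis is trained on the training set and evaluated on the validation set, and the hypothesis with the best validation performance is chosen as the model for the problem.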

Common CV methods:

1. Hold-out method

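The simplest validation method: randomly split the training data into two parts (a 70/30 split is typical). It is not CV in the strict sense, because nothing is crossed over. The accuracy measured on the validation set therefore depends heavily on how the data happened to be split, so the result is random and not very convincing. (Could this randomness be removed by averaging over many random splits? To be verified.)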
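The open question above (averaging over many splits) can be tried on a toy problem. The sketch below is only an illustration: the data, the cubic `polyfit` stand-in model, and all variable names are assumptions, not part of the experiment in this post.

```matlab
% Repeated hold-out on a toy problem (all names here are illustrative).
rng(0);                                  % fix the seed so the run is repeatable
n = 200;
x = pi - 2*pi*rand(n, 1);                % uniform in [-pi, pi]
y = sin(x) + 0.1*randn(n, 1);            % noisy target

nRepeats = 50;                           % number of random 70/30 splits
nTrain   = round(0.7 * n);
rmse     = zeros(nRepeats, 1);

for r = 1:nRepeats
    idx      = randperm(n);              % a fresh random split each time
    trainIdx = idx(1:nTrain);
    valIdx   = idx(nTrain+1:end);

    p    = polyfit(x(trainIdx), y(trainIdx), 3);   % stand-in model: cubic fit
    pred = polyval(p, x(valIdx));
    rmse(r) = sqrt(mean((y(valIdx) - pred).^2));
end

fprintf('single split RMSE  : %.4f\n', rmse(1));
fprintf('mean over %d splits: %.4f (std %.4f)\n', nRepeats, mean(rmse), std(rmse));
```

The spread `std(rmse)` shows how much a single split can vary. Averaging reduces that variation, but the splits overlap, so the individual estimates are not independent.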

2. K-fold CV

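The data are split into k folds; each fold serves once as the validation set while the other k-1 folds are used for training, and the validation errors are averaged. In general k >= 2. Empirically k = 5 is enough (a trade-off between computation and accuracy); results with k = 5 are roughly similar to those with k = 10 or more.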
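A minimal sketch of how the fold assignment can be built without any toolbox. The round-robin assignment after a shuffle is one common choice, not necessarily what the code in this post does; `foldId`, `perm` and the sizes are illustrative names.

```matlab
rng(0);
n = 1000;                                % number of samples
K = 5;                                   % number of folds

perm   = randperm(n);                    % shuffle the sample indices once
foldId = zeros(n, 1);
foldId(perm) = mod(0:n-1, K) + 1;        % assign folds 1..K round-robin

for k = 1:K
    valIdx   = find(foldId == k);        % fold k is held out
    trainIdx = find(foldId ~= k);        % the remaining K-1 folds are used for training
    fprintf('fold %d: %d training, %d validation samples\n', ...
            k, numel(trainIdx), numel(valIdx));
end
```

With n = 1000 and K = 5, every fold ends up with 800 training and 200 validation samples.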

3. Leave-one-out CV（LOO-CV）

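The extreme case of K-fold CV, with k set to the number of samples.

Advantages: (1) The results are reliable.

(2) The experiment can be reproduced, since no random split is involved.

Disadvantage: the computational cost is very high, which makes it hard to use in practice unless it is parallelized.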

Experiment: regression using least squares with a Gaussian kernel.

Code:

%% training set

% the number of the training samples

trainSize=1000;
% the dimension of the training samples
trainDim=1;

% the gaussian noise (u=0, sigma= 0.4472(variance equals to 0.2))
% epsilon=normrnd(0, 0.4472,trainSize,trainDim);

% the gaussian noise (u=0, sigma= 0.3162(variance equals to 0.1)) 

% epsilon=normrnd(0, 0.3162,trainSize,trainDim);

epsilon=normrnd(0, 0.1,trainSize,trainDim);
% the noisy training samples
% the uniform distribution in [-a,a]
% R = a - 2*a*rand(m,n)
%  x_train=1-2.*rand(trainSize,trainDim); %[-1,1]
x_train=pi-(2*pi).*rand(trainSize,trainDim); %[-pi,pi]
% sinc target function
y_train=sinc(x_train)+epsilon;
%y_train=sinc(x_train);

%% test set

% the number of the test samples

testSize=1000;
% the dimension of the test samples
testDim=1;

% the test samples
x_test=pi-2*pi.*rand(testSize,testDim);
y_test=sinc(x_test);

%% ================Cross Validation=======================

[mse,bestk,bestg] = RLScgForRegress(x_train,y_train);

%% ================ Normal Equations ================

fprintf('Solving with normal equations...\n');

D_train=generateDictonary(x_train,x_train,bestg);

D_test=generateDictonary(x_test,x_train,bestg);

% Map D_train onto Gaussian high-dim features and normalize
[D_train, mu, sigma] = featureNormalize(D_train);

% Map D_test and normalize (using mu and sigma)
D_test = bsxfun(@minus, D_test, mu);
D_test = bsxfun(@rdivide, D_test, sigma);

% Calculate the parameters from the normal equation
ntheta = normalEqn(D_train,y_train,bestk);

%     % Display normal equation's result
%     fprintf('Theta computed from the normal equations: \n');
%     fprintf(' %f \n', ntheta);
%     fprintf('\n');


trainError=sqrt(sum((y_train-D_train*ntheta).^2)/size(y_train,1));
testError=sqrt(sum((y_test-D_test*ntheta).^2)/size(y_test,1));


%  Plot fit over the data
figure;
plot(x_test, y_test, 'rx', 'MarkerSize', 10, 'LineWidth', 1.5);
xlabel('x test');
ylabel('y test');
hold on;
grid on;
plot(x_test,D_test*ntheta, 'b.', 'LineWidth', 2);
hold off;
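The helpers `generateDictonary`, `normalEqn`, `featureNormalize` and `RLScgForRegress` are not shown in this post. The sketch below is a guess at what the first two might compute, written as anonymous functions with hypothetical names (`sqDist`, `gaussDict`, `ridgeSolve`). It assumes that `bestg` is the Gaussian kernel width and `bestk` the regularization strength, which is a reading of the call sites and is not confirmed by the source. The feature normalization step is left out.

```matlab
% Hypothetical stand-ins; the author's real helpers are not shown in the post.

% Squared Euclidean distances between rows of X (n-by-d) and centres C (m-by-d).
sqDist = @(X, C) max(bsxfun(@plus, sum(X.^2, 2), sum(C.^2, 2)') - 2*(X*C'), 0);

% Gaussian dictionary: entry (i,j) is exp(-||x_i - c_j||^2 / (2*sigma^2)).
gaussDict = @(X, C, sigma) exp(-sqDist(X, C) / (2*sigma^2));

% Regularized least squares: solve (D'*D + lambda*I) * theta = D'*y.
ridgeSolve = @(D, y, lambda) (D'*D + lambda*eye(size(D, 2))) \ (D'*y);

% Tiny usage example on sinc data.
rng(0);
x_tr = pi - 2*pi*rand(200, 1);
y_tr = sin(pi*x_tr)./(pi*x_tr) + 0.1*randn(200, 1);   % normalized sinc, no toolbox needed
x_te = linspace(-pi, pi, 50)';

theta  = ridgeSolve(gaussDict(x_tr, x_tr, 0.3), y_tr, 0.1);
y_pred = gaussDict(x_te, x_tr, 0.3) * theta;           % predictions at the test points
```

Using the training points themselves as kernel centres mirrors the calls `generateDictonary(x_train,x_train,bestg)` and `generateDictonary(x_test,x_train,bestg)` above.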

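For this problem, which K is appropriate?

1. First fix the range of candidate model parameters (i.e. the hypothesis space), and make sure that range contains the optimal hypothesis. K-fold CV can select parameters that are optimal in some sense, but the experiments suggest a trend: the hypothesis space needed differs with K. The larger K is, the wider the parameter range has to be so that the hypothesis space still contains the best-performing hypothesis.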

E.g.: lambda_vec = [0.001 0.003 0.01 0.03 0.1 0.3 1 1.3 1.6 1.9 2.3 3 6 10 20 40 70 100 150 200 250 300];

sigma_vec = [0.03 0.1 0.3 1 1.3 1.6];
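A grid search over two such vectors with K-fold CV might look like the sketch below. It is not the author's `RLScgForRegress`, whose source is not shown. The shortened `lambda_vec`, the smaller sample size, the pooled RMSE and the inline `gaussDict` helper are all assumptions made to keep the example small and fast.

```matlab
rng(0);
n = 300;                                           % smaller than the post's 1000 to keep it fast
x = pi - 2*pi*rand(n, 1);
y = sin(pi*x)./(pi*x) + 0.1*randn(n, 1);           % normalized sinc plus noise

lambda_vec = [0.001 0.003 0.01 0.03 0.1 0.3 1 3 10];   % shortened grid (assumption)
sigma_vec  = [0.03 0.1 0.3 1 1.3 1.6];
K = 5;

gaussDict = @(X, C, s) exp(-bsxfun(@minus, X, C').^2 / (2*s^2));   % 1-D inputs only

perm   = randperm(n);
foldId = zeros(n, 1);
foldId(perm) = mod(0:n-1, K) + 1;                  % fold labels 1..K

cvRmse = zeros(numel(lambda_vec), numel(sigma_vec));
for a = 1:numel(lambda_vec)
    for b = 1:numel(sigma_vec)
        sqErr = 0;
        for k = 1:K
            tr = (foldId ~= k);                    % training mask
            va = (foldId == k);                    % validation mask
            D     = gaussDict(x(tr), x(tr), sigma_vec(b));
            theta = (D'*D + lambda_vec(a)*eye(size(D, 2))) \ (D'*y(tr));
            pred  = gaussDict(x(va), x(tr), sigma_vec(b)) * theta;
            sqErr = sqErr + sum((y(va) - pred).^2);
        end
        cvRmse(a, b) = sqrt(sqErr / n);            % pooled RMSE over all held-out points
    end
end

[bestRmse, linIdx] = min(cvRmse(:));
[ia, ib] = ind2sub(size(cvRmse), linIdx);
fprintf('K=%d: best CV RMSE %.4f at lambda=%g, sigma=%g\n', ...
        K, bestRmse, lambda_vec(ia), sigma_vec(ib));
```

If the minimum lands on the first or last entry of either vector, the grid probably does not bracket the optimum and should be widened, which is the concern raised in point 1.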

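The performance of the corresponding hypotheses under these two parameter ranges:

2. Evaluation metrics for the CV under different K:

(1) The generalization error estimated by CV (measured as RMSE)

(2) The actual generalization error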
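One common way to write these two quantities is given below. The notation is introduced here for illustration: $V_k$ is the k-th validation fold, $\hat f^{(-k)}$ is the model trained without fold k, $\hat f$ is the final model trained on all n training samples, and m is the number of test samples. The post does not state exactly how its CV code averages over folds, so the pooled form of the first formula is an assumption. The second formula matches the `testError` line in the code.

```latex
\widehat{\mathrm{RMSE}}_{\mathrm{CV}}(K)
  = \sqrt{\frac{1}{n}\sum_{k=1}^{K}\sum_{i\in V_k}
      \bigl(y_i-\hat f^{(-k)}(x_i)\bigr)^{2}},
\qquad
\mathrm{RMSE}_{\mathrm{test}}
  = \sqrt{\frac{1}{m}\sum_{j=1}^{m}
      \bigl(y^{\mathrm{test}}_{j}-\hat f(x^{\mathrm{test}}_{j})\bigr)^{2}}
```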

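K = 2

Best estimated generalization error 0.1832, actual generalization error 0.1769

K = 5

Best estimated generalization error 0.2011, actual generalization error 0.2571

K = 10

Best estimated generalization error 0.2020, actual generalization error 0.2740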

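Doesn't this trend, with the generalization error rising as K grows, contradict theory?

In theory, a larger K leaves more samples for training in each fold (a fraction (K-1)/K of the data), so the evaluation should be more reliable. In other words, both generalization errors should trend downward.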
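The training-set sizes behind that argument can be worked out directly for the three values of K tried here. This is plain arithmetic on `trainSize = 1000`; the variable name `nTrainPerFold` is illustrative.

```matlab
n = 1000;                                % trainSize in the experiment
for K = [2 5 10]
    nTrainPerFold = n * (K - 1) / K;     % samples each CV model is trained on
    fprintf('K = %2d: each model trains on %d of %d samples\n', K, nTrainPerFold, n);
end
```

This gives 500, 800 and 900 training samples for K = 2, 5 and 10. It only counts training-set size. The numbers reported above come from a single random split per K, so repeating the comparison over several seeds would help show whether the observed trend is systematic.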
