
Cross-Validation in MATLAB

Reposted from: http://www.xuebuyuan.com/1409669.html

crossvalind: cross-validation

Generate cross-validation indices

Syntax

Indices = crossvalind('Kfold', N, K)   % K-fold cross-validation
[Train, Test] = crossvalind('HoldOut', N, P)
[Train, Test] = crossvalind('LeaveMOut', N, M)   % leave-M-out cross-validation; M defaults to 1, i.e. leave-one-out
[Train, Test] = crossvalind('Resubstitution', N, [P,Q])
[...] = crossvalind(Method, Group, ...)
[...] = crossvalind(Method, Group, ..., 'Classes', C)
[...] = crossvalind(Method, Group, ..., 'Min', MinValue)

Description

Indices = crossvalind('Kfold', N, K) returns randomly generated indices for a K-fold cross-validation of N observations. Indices contains equal (or approximately equal) proportions of the integers 1 through K that define a partition of the N observations into K disjoint subsets. Repeated calls return different randomly generated partitions. K defaults to 5 when omitted. In K-fold cross-validation, K-1 folds are used for training and the last fold is used for evaluation. This process is repeated K times, leaving one different fold for evaluation each time.
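The K-fold partition described above can be sketched in base MATLAB without the toolbox. This is only an illustration under assumptions: the variable names (foldIds, indices) are made up here, and the approach (label the observations 1..K in near-equal blocks, then shuffle) is a stand-in that mimics what crossvalind('Kfold', N, K) returns, not the toolbox call itself.

```matlab
% Hypothetical stand-in for crossvalind('Kfold', N, K); base MATLAB only.
N = 10;                          % number of observations
K = 5;                           % number of folds
foldIds = ceil(K * (1:N) / N);   % 1 1 2 2 3 3 4 4 5 5: near-equal fold sizes
indices = zeros(N, 1);
indices(randperm(N)) = foldIds;  % scatter the fold labels at random

for k = 1:K
    test  = (indices == k);      % fold k is held out for evaluation
    train = ~test;               % the other K-1 folds are used for training
    fprintf('fold %d: %d train, %d test\n', k, sum(train), sum(test));
end
```

Every observation lands in exactly one fold, so the K evaluation sets are disjoint and together cover all N observations.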

[Train, Test] = crossvalind('HoldOut', N, P) returns logical index vectors for cross-validation of N observations by randomly selecting P*N (approximately) observations to hold out for the evaluation set. P must be a scalar between 0 and 1. P defaults to 0.5 when omitted, corresponding to holding 50% out. Using holdout cross-validation within a loop is similar to K-fold cross-validation one time outside the loop, except that non-disjointed subsets are assigned to each evaluation.
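A holdout split of this kind can be sketched with randperm alone. The names below (nTest, perm) are illustrative assumptions, and the code is a simplified stand-in for crossvalind('HoldOut', N, P) for the ungrouped case only.

```matlab
% Hypothetical stand-in for crossvalind('HoldOut', N, P); base MATLAB only.
N = 20;                         % number of observations
P = 0.25;                       % fraction held out for evaluation
nTest = floor(P * N);           % 5 observations go to the evaluation set

perm  = randperm(N);            % random ordering of 1..N
test  = false(N, 1);
test(perm(1:nTest)) = true;     % mark the held-out observations
train = ~test;                  % everything else is used for training

fprintf('%d train, %d test\n', sum(train), sum(test));
```

Running this inside a loop draws a fresh random split on each pass, so evaluation sets from different passes can share observations, which is the non-disjoint behaviour the description mentions.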

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

LeaveMOut

[Train, Test] = crossvalind('LeaveMOut', N, M), where M is an integer, returns logical index vectors for cross-validation of N observations by randomly selecting M of the observations to hold out for the evaluation set. M defaults to 1 when omitted. Using 'LeaveMOut' cross-validation within a loop does not guarantee disjointed evaluation sets. To guarantee disjointed evaluation sets, use 'Kfold' instead.

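In other words, M of the N observations are chosen at random as the evaluation set and the rest form the training set. With the default M = 1 this is leave-one-out cross-validation.
Because each call draws independently, calling LeaveMOut in a loop can hold out the same observation more than once; use K-fold when the evaluation sets must not overlap.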
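To make the contrast concrete, an exhaustive leave-one-out pass can be sketched by giving every observation its own fold (K-fold with K = N), so that no observation is evaluated twice. The data here are synthetic and chosen only for illustration; they are an assumption, not part of the original post.

```matlab
% Synthetic data (an assumption for illustration; not the carbig data set).
x = (1:20)';
y = 0.5 * x.^2 - 3 * x + 2;       % exactly quadratic, no noise
N = numel(x);

indices = randperm(N)';           % K = N folds: each observation is its own fold
sse = 0;
for k = 1:N
    test  = (indices == k);       % exactly one observation is held out
    train = ~test;
    p     = polyfit(x(train), y(train), 2);   % fit on the other N-1 points
    yhat  = polyval(p, x(test));              % predict the held-out point
    sse   = sse + sum((yhat - y(test)).^2);
end
CVerr = sse / N;                  % close to 0 because the data are exactly quadratic
```

Each of the N observations is held out exactly once, which a loop of random 'LeaveMOut' draws cannot promise.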

Approximate a leave-one-out prediction error estimate.

load carbig
x = Displacement; y = Acceleration;
% x is the engine displacement; y is the time taken to accelerate from 0 to 60 mph
N = length(x);
% N is the length of x, here 406
sse = 0;
for i = 1:100
    [train,test] = crossvalind('LeaveMOut',N,1);
    yhat = polyval(polyfit(x(train),y(train),2),x(test));
    sse = sse + sum((yhat - y(test)).^2);
end
CVerr = sse / 100
% In one run sse = 353.10, so CVerr = sse/100 = 3.5310. The held-out observations are random, so the value differs between runs.
CVerr =

    4.9750

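polyfit(x(train),y(train),2) fits a second-degree polynomial, with x as the horizontal coordinate and y as the vertical coordinate.

polyfit returns a row vector of polynomial coefficients ordered from the highest power to the lowest; here the powers are 2, 1, 0.

y = polyval(p,x) returns the value of a degree-n polynomial at x. The input p is a vector of length n+1 whose elements are the coefficients in descending powers: y = p(1)*x^n + p(2)*x^(n-1) + ... + p(n)*x + p(n+1). x can be a matrix or a vector; in either case polyval evaluates the polynomial p at every element of x.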
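As a small check of the coefficient ordering, the following sketch fits a quadratic to points that lie exactly on a known parabola and then evaluates it at a new point. The data values are made up for illustration.

```matlab
x = [0 1 2 3];
y = 2*x.^2 - x + 5;        % points on y = 2x^2 - x + 5, i.e. 5 6 11 20

p = polyfit(x, y, 2);      % approximately [2 -1 5]: powers 2, 1, 0
v = polyval(p, 4);         % approximately 2*16 - 4 + 5 = 33

disp(p);
disp(v);
```

The recovered coefficients match the parabola up to floating-point rounding, highest power first.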

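The loop accumulates the sum of squared errors (predicted value minus observed value, squared).

Cross-validation is run 100 times, so dividing by the total count of 100 gives the mean squared error of a single run.

The smaller a model's mean squared error, the better the fit.

carbig.mat contains statistics for cars from several countries, 406 cars in total.

The variables are:
Acceleration: time taken to accelerate from 0 to 60 mph
Cylinders:    number of cylinders
Displacement: engine displacement
Horsepower:   engine horsepower
MPG:          miles travelled per gallon of fuel
Model:        model name
Model_Year:   model year
Origin:       where the car was made
Weight:       weight of the car
Some of these are numeric variables and some are character variables.
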
>> x=1:10

x =

     1     2     3     4     5     6     7     8     9    10

>> y=sin(x)

y =

  Columns 1 through 8

    0.8415    0.9093    0.1411   -0.7568   -0.9589   -0.2794    0.6570    0.9894

  Columns 9 through 10

    0.4121   -0.5440

 
>> [train,test]=crossvalind('LeaveMOut',10,1);
>> train

train =

     1
     1
     1
     1
     0
     1
     1
     1
     1
     1

>> test

test =

     0
     0
     0
     0
     1
     0
     0
     0
     0
     0

>> [train,test]=crossvalind('LeaveMOut',10,2);
>> train

train =

     1
     0
     1
     1
     1
     1
     1
     1
     1
     0

>> test

test =

     0
     1
     0
     0
     0
     0
     0
     0
     0
     1
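The logical vectors in the session above act as masks on the data. A minimal sketch, reusing x and y from the session and hard-coding the second train/test pair shown (observations 2 and 10 held out) so that the result is reproducible:

```matlab
x = 1:10;
y = sin(x);

test = false(10, 1);
test([2 10]) = true;       % same mask as the second call in the session
train = ~test;

xTest  = x(test);          % the two held-out inputs: 2 and 10
yTrain = y(train);         % the eight training responses
```

Indexing with the mask picks out exactly the observations whose entry is 1, which is how x(train) and x(test) are used in the carbig example.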

----------------------------------------------------------------------------

[Train, Test] = crossvalind('Resubstitution', N, [P,Q]) returns logical index vectors of indices for cross-validation of N observations by randomly selecting P*N observations for the evaluation set and Q*N observations for training. Sets are selected in order to minimize the number of observations that are used in both sets. P and Q are scalars between 0 and 1. Q=1-P corresponds to holding out (100*P)%, while P=Q=1 corresponds to full resubstitution. [P,Q] defaults to [1,1] when omitted.
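One way to keep the overlap between the two sets as small as possible is to take the training set from the front of a random permutation and the evaluation set from the back. The sketch below assumes that approach for the ungrouped case; the variable names are illustrative.

```matlab
% Hypothetical stand-in for crossvalind('Resubstitution', N, [P,Q]).
N = 20;
P = 0.75;                                % fraction used for evaluation
Q = 0.75;                                % fraction used for training
nTest  = floor(N * P);                   % 15
nTrain = floor(N * Q);                   % 15

perm  = randperm(N);
train = false(N, 1);
test  = false(N, 1);
train(perm(1:nTrain)) = true;            % training set from the front
test(perm(end-nTest+1:end)) = true;      % evaluation set from the back

overlap = sum(train & test);             % 15 + 15 - 20 = 10, the minimum possible
```

With P + Q <= 1 the two sets come out disjoint, and with P = Q = 1 both sets contain every observation (full resubstitution).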

[...] = crossvalind(Method, Group, ...) takes the group structure of the data into account. Group is a grouping vector that defines the class for each observation. Group can be a numeric vector, a string array, or a cell array of strings. The partition of the groups depends on the type of cross-validation: For K-fold, each group is divided into K subsets, approximately equal in size. For all others, approximately equal numbers of observations from each group are selected for the evaluation set. In both cases the training set contains at least one observation from each group.
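A grouped (stratified) K-fold assignment can be sketched by repeating the fold labelling inside each group. The toy grouping vector below is an assumption for illustration, and unique is used here in place of the toolbox's grp2idx.

```matlab
% Toy grouping vector (assumption): 3 classes with 6 observations each.
group = [1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3]';
K = 3;

indices = zeros(numel(group), 1);
[~, ~, g] = unique(group);            % g maps each observation to a class number
for c = 1:max(g)
    h  = find(g == c);                % positions of this class
    n  = numel(h);
    q  = ceil(K * (1:n) / n);         % 1 1 2 2 3 3: fold labels within the class
    pq = randperm(K);                 % shuffle which fold gets which block
    indices(h(randperm(n))) = pq(q);  % assign the labels at random within the class
end

counts = accumarray([g indices], 1);  % classes-by-folds table of counts
disp(counts);
```

Each row of counts is a class and each column a fold; here every entry is 2, so every fold sees every class in equal proportion.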

[...] = crossvalind(Method, Group, ..., 'Classes', C) restricts the observations to only those values specified in C. C can be a numeric vector, a string array, or a cell array of strings, but it is of the same form as Group. If one output argument is specified, it contains the value 0 for observations belonging to excluded classes. If two output arguments are specified, both will contain the logical value false for observations belonging to excluded classes.
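The effect of restricting the classes can be sketched with ismember: observations from excluded classes stay false in both outputs. This simplified sketch uses a made-up label list and a plain 50% holdout that is not stratified by class, so it illustrates only the exclusion behaviour.

```matlab
% Made-up labels for illustration.
groups  = {'Cancer'; 'Benign'; 'Control'; 'Cancer'; ...
           'Benign'; 'Control'; 'Cancer'; 'Control'};
classes = {'Control', 'Cancer'};       % 'Benign' is excluded

keep = ismember(groups, classes);      % true for observations in the kept classes
idx  = find(keep);
idx  = idx(randperm(numel(idx)));      % shuffle the kept observations
nTest = floor(0.5 * numel(idx));       % hold out half of the 6 kept observations

N = numel(groups);
test  = false(N, 1);
train = false(N, 1);
test(idx(1:nTest))      = true;
train(idx(nTest+1:end)) = true;        % 'Benign' rows stay false in both vectors
```

Because the outputs keep the original length N, they can still be used to index the full data set directly.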

[...] = crossvalind(Method, Group, ..., 'Min', MinValue) sets the minimum number of observations that each group has in the training set. Min defaults to 1. Setting a large value for Min can help to balance the training groups, but adds partial resubstitution when there are not enough observations. You cannot set Min when using K-fold cross-validation.
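The partial resubstitution mentioned above can be seen with a little arithmetic on a single small group. The sketch below mirrors the holdout sizing used in the function source listed further down in this post (evaluation size floor(nS*P), training size max(nS - nSE, Min)); the numbers are chosen only for illustration.

```matlab
nS = 4;                          % observations in this group
P  = 0.75;                       % holdout fraction
MG = 2;                          % the 'Min' value: at least 2 training observations

nSE = floor(nS * P);             % 3 observations for evaluation
nST = max(nS - nSE, MG);         % max(1, 2) = 2 observations for training

perm  = randperm(nS);
train = false(nS, 1);
test  = false(nS, 1);
train(perm(1:nST)) = true;           % training set from the front
test(perm(end-nSE+1:end)) = true;    % evaluation set from the back

shared = sum(train & test);      % 2 + 3 - 4 = 1 observation used in both sets
```

With only 4 observations, asking for 3 evaluation and at least 2 training observations forces one observation into both sets.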

Examples

Create a 10-fold cross-validation to compute classification error.

load fisheriris 
indices = crossvalind('Kfold',species,10);
cp = classperf(species);
for i = 1:10
    test = (indices == i); train = ~test;
    class = classify(meas(test,:),meas(train,:),species(train,:));
    classperf(cp,class,test)
end
cp.ErrorRate

Divide cancer data 60/40 without using the 'Benign' observations. Assume groups are the true labels of the observations.

labels = {'Cancer','Benign','Control'};
groups = labels(ceil(rand(100,1)*3));
[train,test] = crossvalind('holdout',groups,0.6,'classes',...
                           {'Control','Cancer'});
sum(test) % Total groups allocated for testing

ans =

    35

sum(train) % Total groups allocated for training

ans =

    26

Function source

function [tInd,eInd] = crossvalind(method,N,varargin)
%CROSSVALIND generates cross-validation indices
% each time.
%
% [TRAIN,TEST] = CROSSVALIND('HoldOut',N,P) returns logical index vectors
%
% [TRAIN,TEST] = CROSSVALIND('LeaveMOut',N,M), where M is an integer,
% returns logical index vectors for cross-validation of N observations by
% randomly selecting M of the observations to hold out for the evaluation
% set. M defaults to 1 when omitted. Using LeaveMOut cross-validation
% within a loop does not guarantee disjointed evaluation sets. Use K-fold
% instead.
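% That is, M of the N observations are chosen at random as the evaluation set and the rest form the training set.
% When omitted, M defaults to 1, i.e. leave-one-out cross-validation.
% LeaveMOut inside a loop does not guarantee disjoint evaluation sets; use K-fold instead.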
% [TRAIN,TEST] = CROSSVALIND('Resubstitution',N,[P,Q]) returns logical
% index vectors of indices for cross-validation of N observations by
% randomly selecting P*N observations for the evaluation set and Q*N
% observations for training. Sets are selected in order to minimize the
% number of observations that are used in both sets. P and Q are scalars
% between 0 and 1. Q=1-P corresponds to holding out (100*P)%, while P=Q=1
% corresponds to full resubstitution. [P,Q] defaults to [1,1] when omitted.
%
% [...] = CROSSVALIND(METHOD,GROUP,...) takes the group structure of the
% data into account. GROUP is a grouping vector that defines the class for
% each observation. GROUP can be a numeric vector, a string array, or a
% cell array of strings. The partition of the groups depends on the type
% of cross-validation: For K-fold, each group is divided into K subsets,
% approximately equal in size. For all others, approximately equal
% numbers of observations from each group are selected for the evaluation
% set. In both cases the training set will contain at least one
% observation from each group.
%
% [...] = CROSSVALIND(METHOD,GROUP,...,'CLASSES',C) restricts the
% observations to only those values specified in C. C can be a numeric
% vector, a string array, or a cell array of strings, but it is of the
% same form as GROUP. If one output argument is specified, it will
% contain the value 0 for observations belonging to excluded classes. If
% two output arguments are specified, both will contain the logical value
% false for observations belonging to excluded classes.
%
% [...] = CROSSVALIND(METHOD,GROUP,...,'MIN',MIN) sets the minimum number
% of observations that each group has in the training set. MIN defaults
% to 1. Setting a large value for MIN can help to balance the training
% groups, but adds partial resubstitution when there are not enough
% observations. You cannot set MIN when using K-fold cross-validation.
%
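% Examples:
%
% % Create a 10-fold cross-validation to compute classification error.
% Shuffle the samples and split them evenly into K parts; in turn, train on K-1 parts and validate on the remaining part, computing the sum of squared prediction errors.
% Finally, average the K sums of squared prediction errors and use the result as the criterion for choosing the best model structure. Here K = 10.
% In particular, K = N gives leave-one-out cross-validation.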
%
% load fisheriris
% indices = crossvalind('Kfold',species,10);
% cp = classperf(species);
% for i = 1:10
%     test = (indices == i); train = ~test;
%     class = classify(meas(test,:),meas(train,:),species(train,:));
%     classperf(cp,class,test)
% end
% cp.ErrorRate
%
% % Approximate a leave-one-out prediction error estimate.
% load carbig
% x = Displacement; y = Acceleration;
% N = length(x);
% sse = 0;
% for i = 1:100
%     [train,test] = crossvalind('LeaveMOut',N,1);
%     yhat = polyval(polyfit(x(train),y(train),2),x(test));
%     sse = sse + sum((yhat - y(test)).^2);
% end
% CVerr = sse / 100
%
% % Divide cancer data 60/40 without using the 'Benign' observations.
% % Assume groups are the true labels of the observations.
% labels = {'Cancer','Benign','Control'};
% groups = labels(ceil(rand(100,1)*3));
% [train,test] = crossvalind('holdout',groups,0.6,'classes',...
% {'Control','Cancer'});
% sum(test) % Total groups allocated for testing
% sum(train) % Total groups allocated for training
%
% See also CLASSPERF, CLASSIFY, GRP2IDX, KNNCLASSIFY, SVMCLASSIFY.

% References:
% [1] Hastie, T. Tibshirani, R, and Friedman, J. (2001) The Elements of
% Statistical Learning, Springer, pp. 214-216.
% [2] Theodoridis, S. and Koutroumbas, K. (1999) Pattern Recognition,
% Academic Press, pp. 341-342.

% Copyright 2003-2008 The MathWorks, Inc.
% $Revision: 1.1.10.5 $ $Date: 2008/06/16 16:32:40 $

% set defaults
classesProvided = false;
MG = 1; % default for minimum number of observations for every training group
P = 0.5; % default value for holdout
K = 5; % default value for Kfold
M = 1; % default value for leave-M-out
Q = [1 1];% default value for resubstitution

% get and validate the method (first input)
if ischar(method) && size(method,1)==1
    validMethods = {'holdout','kfold','resubstitution','leavemout'};
    method = strmatch(lower(method),validMethods);
    if isempty(method)
        error('Bioinfo:crossvalind:NotValidMethod',...
            'Not a valid method.')
    end
    method = validMethods{method};
else
    error('Bioinfo:crossvalind:NotValidTypeForMethod',...
        'Valid methods are ''KFold'', ''HoldOut'', ''LeaveMOut'', or ''Resubstitution''.')
end

if nargout>1 && isequal(method,'kfold')
    error('Bioinfo:crossvalind:TooManyOutputArgumentsForKfold',...
        'To many output arguments for Kfold cross-validation.')
end

% take P,K,Q, or M if provided by the third input (first varargin) and
% validate it
if numel(varargin) && isnumeric(varargin{1})
    S = varargin{1};
    varargin(1)=[];
    switch method
        case 'holdout'
            if numel(S)==1 && S>0 && S<1
                P = S;
            else
                error('Bioinfo:crossvalind:InvalidThirdInputP',...
                    'For hold-out cross-validation, the third input must be a scalar between 0 and 1.');
            end
        case 'kfold'
            if numel(S)==1 && S>=1
                K = round(S);
            else
                error('Bioinfo:crossvalind:InvalidThirdInputK',...
                    'For Kfold cross-validation, the third input must be a positive integer.');
            end
        case 'leavemout'
            if numel(S)==1 && S>=1
                M = round(S);
            else
                error('Bioinfo:crossvalind:InvalidThirdInputM',...
                    'For leave-M-out cross-validation, the third input must be a positive integer.');
            end
        case 'resubstitution'
            if numel(S)==2 && all(S>0) && all(S<=1)
                Q = S(:);
            else
                error('Bioinfo:crossvalind:InvalidThirdInputQ',...
                    'For resubstitution cross-validation, the third input must be a 2x1 vector with values between 0 and 1.');
            end
    end %switch
end

% read optional paired input arguments in
if numel(varargin)
    if rem(numel(varargin),2)
        error('Bioinfo:crossvalind:IncorrectNumberOfArguments',...
            'Incorrect number of arguments to %s.',mfilename);
    end
    okargs = {'classes','min'};
    for j=1:2:numel(varargin)
        pname = varargin{j};
        pval = varargin{j+1};
        k = find(strncmpi(pname, okargs,length(pname)));
        if isempty(k)
            error('Bioinfo:crossvalind:UnknownParameterName',...
                'Unknown parameter name: %s.',pname);
        elseif length(k)>1
            error('Bioinfo:crossvalind:AmbiguousParameterName',...
                'Ambiguous parameter name: %s.',pname);
        else
            switch(k)
                case 1 % classes
                    classesProvided = true;
                    classes = pval;
                case 2 % min
                    MG = round(pval(1));
                    if MG<0
                        error('Bioinfo:crossvalind:NotValidMIN',...
                            'MIN must be a positive scalar.')
                    end
            end
        end
    end
end

if isscalar(N) && isnumeric(N)
    if N<1 || N~=floor(N)
        error('Bioinfo:crossvalind:NNotPositiveInteger',...
            'The number of observations must be a positive integer.');
    end
    group = ones(N,1);
else
    [group, groupNames] = grp2idx(N); % at this point group is numeric only
    N = numel(group);
end

if classesProvided
    orgN = N;
    % change classes to same type as groups
    [dummy,classes]=grp2idx(classes);
    validGroups = intersect(classes,groupNames);
    if isempty(validGroups)
        error('bioinfo:crossvalind:EmptyValidGroups',...
            'Could not find any valid group. Are CLASSES the same type as GROUP ?')
    end
    selectedGroups = ismember(groupNames(group),validGroups);
    group = grp2idx(group(selectedGroups)); % group idxs are reduced to only the sel groups
    N = numel(group); % the new size of the reduced vector
end

nS = accumarray(group(:),1);
if min(nS)<MG
    error('Bioinfo:crossvalind:MissingObservations',...
        'All the groups must have at least least MIN obeservation(s).')
end

switch method
    case {'leavemout','holdout','resubstitution'}
        switch method
            case 'leavemout'
                % number of samples for holdout in every group
                nSE = repmat(M,numel(nS),1);
                % at least there is MG sample(s) for training in every group
                nST = max(nS-nSE,MG);
            case 'holdout'
                % computes the number of samples for holdout in every group
                nSE = floor(nS*P);
                % at least there is MG sample(s) for training in every group
                nST = max(nS-nSE,MG);
            case 'resubstitution'
                % computes the number of samples for training and evaluation
                nSE = floor(nS*Q(1));
                nST = floor(nS*Q(2));
                % at least there is MG sample(s) for training in every group
                nST = max(nST,MG);
        end
        % Initializing the outputs
        tInd = false(N,1);
        eInd = false(N,1);
        % for every group select randomly the samples for both sets
        for g = 1:numel(nS)
            h = find(group==g);
            randInd = randperm(nS(g));
            tInd(h(randInd(1:nST(g))))=true;
            eInd(h(randInd(end-nSE(g)+1:end)))=true;
        end
    case 'kfold'
        tInd = zeros(N,1);
        for g = 1:numel(nS)
            h = find(group==g);
            % compute fold id's for every observation in the group
            q = ceil(K*(1:nS(g))/nS(g));
            % and permute them to try to balance among all groups
            pq = randperm(K);
            % randomly assign the id's to the observations of this group
            randInd = randperm(nS(g));
            tInd(h(randInd))=pq(q);
        end
end

if classesProvided
    if isequal(method,'kfold')
        temp = zeros(orgN,1);
        temp(selectedGroups) = tInd;
        tInd = temp;
    else
        temp = false(orgN,1);
        temp(selectedGroups) = tInd;
        tInd = temp;
        temp = false(orgN,1);
        temp(selectedGroups) = eInd;
        eInd = temp;
    end
end