
CS229 6.5 Neurons Networks Implements of Sparse Autoencoder

This is a hands-on exercise for the sparse autoencoder. The task is roughly as follows: from a set of natural images, randomly crop 10,000 small 8*8 patches, then train a sparse autoencoder and inspect the features learned by its hidden layer. The network has 3 layers: an input layer with 64 units, a hidden layer with 25 units, and (of course) an output layer with 64 units.
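For reference, the objective that sparseAutoencoderCost (implemented below) computes combines a reconstruction term, a weight-decay term, and a KL-divergence sparsity penalty; this summary of the UFLDL formulation is my own addition, not part of the original post:

\[
J_{\mathrm{sparse}}(W,b) \;=\; \frac{1}{m}\sum_{i=1}^{m}\tfrac{1}{2}\,\bigl\lVert h_{W,b}(x^{(i)}) - x^{(i)} \bigr\rVert^2
\;+\; \frac{\lambda}{2}\Bigl(\textstyle\sum_{i,j}\bigl(W^{(1)}_{ij}\bigr)^2 + \sum_{i,j}\bigl(W^{(2)}_{ij}\bigr)^2\Bigr)
\;+\; \beta \sum_{j} \mathrm{KL}\bigl(\rho \,\Vert\, \hat\rho_j\bigr)
\]
\[
\mathrm{KL}(\rho \,\Vert\, \hat\rho_j) \;=\; \rho \log\frac{\rho}{\hat\rho_j} + (1-\rho)\log\frac{1-\rho}{1-\hat\rho_j},
\qquad
\hat\rho_j = \frac{1}{m}\sum_{i=1}^{m} a^{(2)}_j\bigl(x^{(i)}\bigr)
\]

where rho = sparsityParam, lambda and beta are the weights set in STEP 0, and the sum over j runs over the 25 hidden units.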

The main script is laid out step by step (STEP 0 through STEP 5); the implementation of every function it calls is listed below.

%%======================================================================
%% STEP 0: Here we provide the relevant parameters values that will
% allow your sparse autoencoder to get good filters; you do not need to
% change the parameters below.

visibleSize = 8*8;     % number of input units
hiddenSize = 25;       % number of hidden units
sparsityParam = 0.01;  % desired average activation of the hidden units.
                       % (This was denoted by the Greek alphabet rho,
                       % which looks like a lower-case "p",
                       % in the lecture notes).
lambda = 0.0001;       % weight decay parameter
beta = 3;              % weight of sparsity penalty term

%%======================================================================
%% STEP 1: Implement sampleIMAGES
%
% After implementing sampleIMAGES, the display_network command should
% display a random sample of 200 patches from the dataset
patches = sampleIMAGES;
display_network(patches(:,randi(size(patches,2),200,1)),8);

% Obtain random parameters theta
theta = initializeParameters(hiddenSize, visibleSize);

%%======================================================================
%% STEP 2: Implement sparseAutoencoderCost
%
% You can implement all of the components (squared error cost, weight decay term,
% sparsity penalty) in the cost function at once, but it may be easier to do
% it step-by-step and run gradient checking (see STEP 3) after each step. We
% suggest implementing the sparseAutoencoderCost function using the following steps:
%
% (a) Implement forward propagation in your neural network, and implement the
%     squared error term of the cost function. Implement backpropagation to
%     compute the derivatives. Then (using lambda=beta=0), run Gradient Checking
%     to verify that the calculations corresponding to the squared error cost
%     term are correct.
%
% (b) Add in the weight decay term (in both the cost function and the derivative
%     calculations), then re-run Gradient Checking to verify correctness.
%
% (c) Add in the sparsity penalty term, then re-run Gradient Checking to
%     verify correctness.
%
% Feel free to change the training settings when debugging your
% code. (For example, reducing the training set size or
% number of hidden units may make your code run faster; and setting beta
% and/or lambda to zero may be helpful for debugging.) However, in your
% final submission of the visualized weights, please use parameters we
% gave in Step 0 above.

[cost, grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                     lambda, sparsityParam, beta, patches);

%%======================================================================
%% STEP 3: Gradient Checking
%
% Hint: If you are debugging your code, performing gradient checking on smaller models
% and smaller training sets (e.g., using only 10 training examples and 1-2 hidden
% units) may speed things up.

% First, lets make sure your numerical gradient computation is correct for a
% simple function. After you have implemented computeNumericalGradient.m,
% run the following:
checkNumericalGradient();

% Now we can use it to check your cost function and derivative calculations
% for the sparse autoencoder.
numgrad = computeNumericalGradient( @(x) sparseAutoencoderCost(x, visibleSize, ...
                                    hiddenSize, lambda, sparsityParam, beta, patches), theta);

% Use this to visually compare the gradients side by side
disp([numgrad grad]);

% Compare numerically computed gradients with the ones obtained from backpropagation
diff = norm(numgrad-grad)/norm(numgrad+grad);
disp(diff); % Should be small. In our implementation, these values are
            % usually less than 1e-9.
% When you got this working, Congratulations!!!

%%======================================================================
%% STEP 4: After verifying that your implementation of
% sparseAutoencoderCost is correct, You can start training your sparse
% autoencoder with minFunc (L-BFGS).

% Randomly initialize the parameters
theta = initializeParameters(hiddenSize, visibleSize);

% Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % sparseAutoencoderCost.m satisfies this.
options.maxIter = 400;    % Maximum number of iterations of L-BFGS to run
options.display = 'on';

[opttheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, visibleSize, hiddenSize, ...
                                 lambda, sparsityParam, beta, patches), theta, options);

%%======================================================================
%% STEP 5: Visualization

W1 = reshape(opttheta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
display_network(W1', 12);

print -djpeg weights.jpg % save the visualization to a file

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% helpers for STEP 1 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% three functions: (sampleIMAGES) (normalizeData) (initializeParameters)
function patches = sampleIMAGES()
load IMAGES;    % load the 10 original 512*512 images

patchsize = 8;  % patch size
numpatches = 10000;

% Initialize the output to zeros; the matrix is 64*10000, one patch per column.
patches = zeros(patchsize*patchsize, numpatches);

% IMAGES is a 3-D array containing the 10 images; IMAGES(:,:,6) is the 512x512
% 2-D array holding the 6th image.
% The command "imagesc(IMAGES(:,:,6)), colormap gray;" visualizes the 6th image.
% (These images appear to have been preprocessed with whitening.)
% IMAGES(21:30,21:30,1) is the (21,21)-to-(30,30) patch sampled from the 1st image.

% Randomly pick 1000 patches from each image, 10000 patches in total
for imageNum = 1:10
    [rowNum colNum] = size(IMAGES(:,:,imageNum));
    % pick 1000 patches from this image
    for patchNum = 1:1000
        % pick the top-left corner of the patch
        xPos = randi([1, rowNum-patchsize+1]);
        yPos = randi([1, colNum-patchsize+1]);
        % store the patch as one column of the output matrix
        patches(:,(imageNum-1)*1000+patchNum) = ...
            reshape(IMAGES(xPos:xPos+7,yPos:yPos+7,imageNum),64,1);
    end
end
% Since the autoencoder's activation function is the sigmoid, its outputs are
% restricted to [0,1]. To make h_{W,b}(x) = x achievable with x as input, the
% inputs must also lie in [0,1], so the patches need to be normalized.
patches = normalizeData(patches);
end

% Normalization helper: truncate to 3 standard deviations (the 3-sigma rule), then rescale
function patches = normalizeData(patches)
% Subtract the mean
patches = bsxfun(@minus, patches, mean(patches));
% s = std(X): for a vector X this returns the standard deviation (note the
% denominator is n-1, not n), i.e. the square root of an unbiased estimate of
% the variance of independent, identically distributed samples.
% For a matrix X it returns a row vector with the standard deviation of each column.
pstd = 3 * std(patches(:));
patches = max(min(patches, pstd), -pstd) / pstd;
% Rescale from [-1,1] to [0.1,0.9]
patches = (patches + 1) * 0.4 + 0.1;
end

% Parameter initialization
function theta = initializeParameters(hiddenSize, visibleSize)
% Initialize parameters randomly based on layer sizes.
% we'll choose weights uniformly from the interval [-r, r]
r = sqrt(6) / sqrt(hiddenSize+visibleSize+1);
% rand(a,b) produces an a-by-b matrix of uniform random numbers in [0.0, 1.0]
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
% rand(a,b)*2*r lies in (0, 2r), so rand(a,b)*2*r - r lies in (-r, r)
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;
b1 = zeros(hiddenSize, 1);  % bias terms feeding into the hidden units
b2 = zeros(visibleSize, 1); % bias terms feeding into the output layer
% Unroll all parameters into a single vector
theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];
end
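As a quick sanity check (my own addition, not part of the original exercise code), after normalizeData every entry of patches should lie in [0.1, 0.9] and the matrix should be 64x10000; a few lines along these lines catch a broken sampling step early:

% Hypothetical sanity check -- not part of the original assignment code.
patches = sampleIMAGES();
assert(isequal(size(patches), [64 10000]), 'patches should be 64x10000');
assert(min(patches(:)) >= 0.1 - 1e-12 && max(patches(:)) <= 0.9 + 1e-12, ...
       'normalizeData should map values into [0.1, 0.9]');
fprintf('patch value range: [%.4f, %.4f]\n', min(patches(:)), max(patches(:)));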

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% helper for STEP 2 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% returns the value and gradient of the sparse autoencoder cost %%%%%%%%%
function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                             lambda, sparsityParam, beta, data)
% visibleSize:   number of input units
% hiddenSize:    number of hidden units
% lambda:        weight decay parameter
% sparsityParam: the desired average activation rho of the hidden units
% beta:          weight of the sparsity penalty term
% data:          64x10000 matrix of training data; data(:,i) is the i-th training example.

% The parameters arrive packed into a single vector because L-BFGS optimizes
% over vectors; first convert the long vector back into the per-layer weight
% matrices and bias vectors.
% Elements 1 -> hiddenSize*visibleSize of theta hold W1
% (hiddenSize*visibleSize entries), reshaped back into a matrix.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);

% The remaining parameters follow in the same order
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% Gradient matrices matching each parameter
cost = 0;
W1grad = zeros(size(W1));
W2grad = zeros(size(W2));
b1grad = zeros(size(b1));
b2grad = zeros(size(b2));

Jcost = 0;   % reconstruction (squared error) term
Jweight = 0; % weight decay term
Jsparse = 0; % sparsity penalty term
[n m] = size(data); % m is the number of examples, n the number of features

% Forward pass: compute the linear combination and activation of every unit.
% W1 is hiddenSize*visibleSize, data is visibleSize*m,
% repmat(b1,1,m) replicates the vector b1 into a hiddenSize*m matrix.
% Using z^(l+1) = W^(l)*a^(l) + b^(l):
% z2 holds the hidden-layer inputs for all 10000 examples; it is
% hiddenSize*m, one column per example.
z2 = W1*data + repmat(b1,1,m); % input to the hidden layer
a2 = sigmoid(z2);              % hidden-layer activations
z3 = W2*a2 + repmat(b2,1,m);   % input to the output layer
a3 = sigmoid(z3);              % output-layer activations

% Reconstruction error
% This is J(W,b): the outer sum runs over all examples, the inner sum over
% all output components.
Jcost = (0.5/m)*sum(sum((a3-data).^2));
% Weight decay term (the weight lambda is applied later)
Jweight = (1/2)*(sum(sum(W1.^2))+sum(sum(W2.^2)));
% Sparsity penalty: sum(matrix,2) sums along rows, i.e. accumulates the
% hidden activations over all examples before averaging.
% rho is a hiddenSize*1 vector.

rho = (1/m).*sum(a2,2); % average activation rho_hat of each hidden unit
% Sparsity loss
Jsparse = sum(sparsityParam.*log(sparsityParam./rho)+(1-sparsityParam).*log((1-sparsityParam)./(1-rho)));
% Total cost: reconstruction + weight decay + sparsity
cost = Jcost + lambda*Jweight + beta*Jsparse;
% Error term delta3 of the output layer (l = 3); in an autoencoder the target
% output equals the input, h(W,b) = x.
delta3 = -(data-a3).*sigmoidInv(z3);
% Because of the sparsity penalty, the hidden-layer error gets an extra term;
% sterm is a hiddenSize*1 vector.
sterm = beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho));
% W2 is 64*25, delta3 is 64*10000 (one column per example x^(i)), W2' is the transpose.
% repmat(sterm,1,m) replicates sterm into an m-column matrix, every column equal to sterm.
% delta2 is hiddenSize*10000.
delta2 = (W2'*delta3+repmat(sterm,1,m)).*sigmoidInv(z2);

% W1 gradient
% data' is 10000*64, so delta2*data' is 25*64
W1grad = W1grad+delta2*data';
W1grad = (1/m)*W1grad+lambda*W1;

% W2 gradient
W2grad = W2grad+delta3*a2';
W2grad = (1/m).*W2grad+lambda*W2;

% b1 gradient
b1grad = b1grad+sum(delta2,2);
b1grad = (1/m)*b1grad; % the gradient of b is a vector, so accumulate across examples (columns)

% b2 gradient
b2grad = b2grad+sum(delta3,2);
b2grad = (1/m)*b2grad;
% Roll the gradients back into a single vector
grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
end
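The delta3/delta2 lines above implement the backpropagation equations with the extra sparsity term; written out for reference (my own transcription of the UFLDL notes, not part of the original post):

\[
\delta^{(3)} = -\bigl(x - a^{(3)}\bigr) \odot f'\bigl(z^{(3)}\bigr),
\qquad
\delta^{(2)} = \Bigl( \bigl(W^{(2)}\bigr)^{\mathsf T}\,\delta^{(3)}
 + \beta\Bigl(-\tfrac{\rho}{\hat\rho} + \tfrac{1-\rho}{1-\hat\rho}\Bigr) \Bigr) \odot f'\bigl(z^{(2)}\bigr)
\]
\[
\nabla_{W^{(l)}} J = \frac{1}{m}\,\delta^{(l+1)}\,\bigl(a^{(l)}\bigr)^{\mathsf T} + \lambda W^{(l)},
\qquad
\nabla_{b^{(l)}} J = \frac{1}{m}\sum_{i=1}^{m} \delta^{(l+1),(i)}
\]

Here f is the sigmoid, f'(z) = f(z)(1 - f(z)) (sigmoidInv below), a^(1) = data, and the sparsity term is broadcast across all m columns via repmat.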

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients. This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)).

function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

% Derivative of the sigmoid function
function sigmInv = sigmoidInv(x)
    sigmInv = sigmoid(x).*(1-sigmoid(x));
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% helpers for STEP 3 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
% three functions: (checkNumericalGradient) (simpleQuadraticFunction) (computeNumericalGradient)
function [] = checkNumericalGradient()
x = [4; 10];
% The value and analytical gradient of the simple test function
[value, grad] = simpleQuadraticFunction(x);
% Numerically evaluate the gradient of the simple function at the point x.
% ("@simpleQuadraticFunction" denotes a pointer to a function.)
numgrad = computeNumericalGradient(@simpleQuadraticFunction, x);
% disp() simply prints its argument
disp([numgrad grad]);
fprintf('The above two columns you get should be very similar.\n(Left-Your Numerical Gradient, Right-Analytical Gradient)\n\n');
% norm(X) equals sqrt(sum(X.^2)); if the implementation is correct and
% EPSILON = 0.0001, the difference should be about 2.1452e-12.
diff = norm(numgrad-grad)/norm(numgrad+grad);
disp(diff);
fprintf('Norm of the difference between numerical and analytical gradient (should be < 1e-9)\n\n');
end

% A simple function used to verify that computeNumericalGradient is implemented correctly
function [value,grad] = simpleQuadraticFunction(x)
% this function accepts a 2D vector as input.
% Its outputs are:
%   value: h(x1, x2) = x1^2 + 3*x1*x2
%   grad: A 2x1 vector that gives the partial derivatives of h with respect to x1 and x2
% Note that when we pass @simpleQuadraticFunction(x) to computeNumericalGradients, we're assuming
% that computeNumericalGradients will use only the first returned value of this function.
value = x(1)^2 + 3*x(1)*x(2);
grad = zeros(2, 1);
grad(1) = 2*x(1) + 3*x(2);
grad(2) = 3*x(1);
end

% Numerical gradient used for gradient checking
function numgrad = computeNumericalGradient(J, theta)
% theta: the parameters, either a vector or a scalar
% J: a function returning a real value; calling y = J(theta) returns the function value at theta

% numgrad is initialized to zeros, with the same dimensions as theta
numgrad = zeros(size(theta));
EPSILON = 1e-4;
% theta is a column vector; size(theta,1) gives the number of rows
n = size(theta,1);
% n-by-n identity matrix
E = eye(n);
for i = 1:n
    % E(i,:) is the i-th row, E(:,i) the i-th column;
    % since E is the identity, only the i-th component of delta equals EPSILON
    delta = E(:,i)*EPSILON;
    % i-th component of the numerical gradient
    numgrad(i) = (J(theta+delta)-J(theta-delta))/(EPSILON*2.0);
end
%% ---------------------------------------------------------------
end
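For reference, the loop in computeNumericalGradient approximates each gradient component with a two-sided (central) difference:

\[
\frac{\partial J(\theta)}{\partial \theta_i} \;\approx\; \frac{J(\theta + \epsilon\, e_i) - J(\theta - \epsilon\, e_i)}{2\epsilon},
\qquad \epsilon = 10^{-4},
\]

where e_i is the i-th standard basis vector (the i-th column of the identity matrix E).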

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% helper for STEP 5 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%% visualization function %%%%%%%%%%%%%%%%%%%%%%%%
function [h, array] = display_network(A, opt_normalize, opt_graycolor, cols, opt_colmajor)
% This function visualizes filters in matrix A. Each column of A is a
% filter. We will reshape each column into a square image and visualize it
% on a cell of the visualization panel.
% All other parameters are optional; usually you do not need to worry
% about them.
% opt_normalize: whether we need to normalize the filter so that all of
%   them can have similar contrast. Default value is true.
% opt_graycolor: whether we use gray as the heat map. Default is true.
% cols: how many columns are there in the display. Default value is the
%   squareroot of the number of columns in A.
% opt_colmajor: you can switch convention to row major for A. In that
%   case, each row of A is a filter. Default value is false.
warning off all

if ~exist('opt_normalize', 'var') || isempty(opt_normalize)
    opt_normalize = true;
end

if ~exist('opt_graycolor', 'var') || isempty(opt_graycolor)
    opt_graycolor = true;
end

if ~exist('opt_colmajor', 'var') || isempty(opt_colmajor)
    opt_colmajor = false;
end

% rescale
A = A - mean(A(:));

if opt_graycolor, colormap(gray); end

% compute rows, cols
[L M] = size(A);
sz = sqrt(L);
buf = 1;
if ~exist('cols', 'var')
    if floor(sqrt(M))^2 ~= M
        n = ceil(sqrt(M));
        while mod(M, n)~=0 && n<1.2*sqrt(M), n = n+1; end
        m = ceil(M/n);
    else
        n = sqrt(M);
        m = n;
    end
else
    n = cols;
    m = ceil(M/n);
end

array = -ones(buf+m*(sz+buf), buf+n*(sz+buf));

if ~opt_graycolor
    array = 0.1.*array;
end

if ~opt_colmajor
    k = 1;
    for i = 1:m
        for j = 1:n
            if k>M,
                continue;
            end
            clim = max(abs(A(:,k)));
            if opt_normalize
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz)) = reshape(A(:,k),sz,sz)/clim;
            else
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz)) = reshape(A(:,k),sz,sz)/max(abs(A(:)));
            end
            k = k+1;
        end
    end
else
    k = 1;
    for j = 1:n
        for i = 1:m
            if k>M,
                continue;
            end
            clim = max(abs(A(:,k)));
            if opt_normalize
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz)) = reshape(A(:,k),sz,sz)/clim;
            else
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz)) = reshape(A(:,k),sz,sz);
            end
            k = k+1;
        end
    end
end

if opt_graycolor
    h = imagesc(array, 'EraseMode', 'none', [-1 1]);
else
    h = imagesc(array, 'EraseMode', 'none', [-1 1]);
end
axis image off

drawnow;

warning on all