CS229 Machine Learning Assignment Code: Problem Set 2
阿新 • Published: 2018-07-20
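Spam filtering (Naive Bayes classifier with the multinomial event model)
Derivation
For the derivation, see: https://www.cnblogs.com/qpswwww/p/9308786.html
Note that a small trick is used here for numerical stability: the posterior is rewritten in terms of a sum of logs, so the product of many small probabilities never underflows. Here \(n\) is the number of words in the email and \(x_i\) is its \(i\)-th word.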
\[p(y=1|x)=\frac {(\prod_{i=1}^n\phi_{x_i|y=1})\phi_{y}}{(\prod_{i=1}^n\phi_{x_i|y=1})\phi_y+(\prod_{i=1}^n\phi_{x_i|y=0})(1-\phi_y)}\]
\[=\frac {1}{1+\frac{(\prod_{i=1}^n\phi_{x_i|y=0})(1-\phi_y)}{(\prod_{i=1}^n\phi_{x_i|y=1})\phi_{y}}}\]
\[=\frac {1}{1+\exp(\log\frac{(\prod_{i=1}^n\phi_{x_i|y=0})(1-\phi_y)}{(\prod_{i=1}^n\phi_{x_i|y=1})\phi_{y}})}\]
\[=\frac {1}{1+\exp(\sum_{i=1}^n \log \phi_{x_i|y=0}+\log(1-\phi_y)-\sum_{i=1}^n \log \phi_{x_i|y=1}-\log \phi_{y})}\]
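The last line of the derivation can be sketched in a few lines of Python (used here instead of the post's MATLAB only so the example is easy to run; the function name `posterior_spam` and its arguments are hypothetical, not part of the assignment code). It assumes the document is given as per-token counts, so each token's log-probability is weighted by its count. One detail differs from MATLAB: Python's `math.exp` raises `OverflowError` for large arguments instead of returning `Inf`, so the sketch branches on the sign of the exponent.

```python
import math

def posterior_spam(counts, phi_spam, phi_ham, phi_y):
    """Return p(y=1 | x), evaluated in log space.

    counts[k]   -- number of times token k occurs in the document
    phi_spam[k] -- p(token k | y=1)
    phi_ham[k]  -- p(token k | y=0)
    phi_y       -- prior p(y=1)
    """
    # Sums of logs replace the products of probabilities in the derivation.
    log_a = sum(c * math.log(p) for c, p in zip(counts, phi_spam) if c)
    log_b = sum(c * math.log(p) for c, p in zip(counts, phi_ham) if c)
    z = log_b + math.log(1 - phi_y) - log_a - math.log(phi_y)
    # p = 1 / (1 + exp(z)); use the equivalent form exp(-z) / (1 + exp(-z))
    # when z >= 0 so that math.exp never receives a large positive argument.
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

# A long document: the direct product 0.01 ** 500 underflows to 0.0,
# while the log-space form still yields a usable posterior.
p = posterior_spam([500, 300], [0.01, 0.001], [0.001, 0.01], 0.5)
print(p)
```

Computing the two class likelihoods directly for this example would give 0/0, which is the failure the rewritten formula avoids.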
Code
nb_train.m
[spmatrix, tokenlist, trainCategory] = readMatrix('MATRIX.TRAIN');

trainMatrix = full(spmatrix);
numTrainDocs = size(trainMatrix, 1);
numTokens = size(trainMatrix, 2);

% trainMatrix is now a (numTrainDocs x numTokens) matrix.
% Each row represents a unique document (email).
% The j-th column of the row $i$ represents the number of times the j-th
% token appeared in email $i$.

% tokenlist is a long string containing the list of all tokens (words).
% These tokens are easily known by position in the file TOKENS_LIST

% trainCategory is a (1 x numTrainDocs) vector containing the true
% classifications for the documents just read in. The i-th entry gives the
% correct class for the i-th email (which corresponds to the i-th row in
% the document word matrix).

% Spam documents are indicated as class 1, and non-spam as class 0.
% Note that for the SVM, you would want to convert these to +1 and -1.

% YOUR CODE HERE
n = size(trainMatrix, 2);        % vocabulary size
m = length(trainCategory);       % number of training documents
phi_y = sum(trainCategory) / m;  % prior p(y=1)
phi_y1 = zeros(n, 1);
phi_y0 = zeros(n, 1);

% Accumulate, for each class, how often every token occurs.
for i = 1:m
    if (trainCategory(i) == 1)
        for j = 1:n
            phi_y1(j) = phi_y1(j) + trainMatrix(i, j);
        end
    else
        for j = 1:n
            phi_y0(j) = phi_y0(j) + trainMatrix(i, j);
        end
    end
end

% Add-one smoothing.
% NOTE: sum1 and sum0 are recomputed for each token i, so they equal that
% token's own count in the class, not the total number of words in the
% class. The standard Laplace-smoothed estimate uses the class's total word
% count in the denominator; the test error reported below comes from the
% code exactly as written here.
for i = 1:n
    sum1 = 0;
    sum0 = 0;
    for j = 1:m
        if (trainCategory(j) == 1)
            sum1 = sum1 + trainMatrix(j, i);
        else
            sum0 = sum0 + trainMatrix(j, i);
        end
    end
    phi_y1(i) = (phi_y1(i) + 1) / (sum1 + n);
    phi_y0(i) = (phi_y0(i) + 1) / (sum0 + n);
end
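For comparison, the Laplace-smoothed estimate for the multinomial event model divides each token's count (plus one) by the total number of words in that class plus the vocabulary size. A minimal Python sketch of that estimate follows; it is not the post's MATLAB, the function name `train_multinomial_nb` and the list-of-lists input are assumptions made for illustration, and it has not been run on the assignment data.

```python
def train_multinomial_nb(doc_counts, labels):
    """Estimate (phi_y, phi_spam, phi_ham) with add-one smoothing.

    doc_counts[i][k] -- count of token k in document i
    labels[i]        -- 1 for spam, 0 for non-spam
    """
    n_tokens = len(doc_counts[0])
    token_counts = {0: [0] * n_tokens, 1: [0] * n_tokens}
    for row, y in zip(doc_counts, labels):
        for k, c in enumerate(row):
            token_counts[y][k] += c

    phi = {}
    for y in (0, 1):
        # Total number of words seen in class y (shared by every token).
        total_words = sum(token_counts[y])
        phi[y] = [(c + 1) / (total_words + n_tokens) for c in token_counts[y]]

    phi_y = sum(labels) / len(labels)  # prior p(y=1)
    return phi_y, phi[1], phi[0]

# Tiny made-up corpus: three documents over a three-token vocabulary.
phi_y, phi_spam, phi_ham = train_multinomial_nb(
    [[2, 0, 1], [0, 3, 0], [1, 1, 0]], [1, 0, 1])
print(phi_y, phi_spam, phi_ham)
```

With a shared denominator, each class's token probabilities sum to one, which is a quick sanity check on any implementation.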
nb_test.m
[spmatrix, tokenlist, category] = readMatrix('MATRIX.TEST');

testMatrix = full(spmatrix);
numTestDocs = size(testMatrix, 1);
numTokens = size(testMatrix, 2);

% Assume nb_train.m has just been executed, and all the parameters computed/needed
% by your classifier are in memory through that execution. You can also assume
% that the columns in the test set are arranged in exactly the same way as for the
% training set (i.e., the j-th column represents the same token in the test data
% matrix as in the original training data matrix).

% Write code below to classify each document in the test set (ie, each row
% in the current document word matrix) as 1 for SPAM and 0 for NON-SPAM.

% Construct the (numTestDocs x 1) vector 'output' such that the i-th entry
% of this vector is the predicted class (1/0) for the i-th email (i-th row
% in testMatrix) in the test set.
output = zeros(numTestDocs, 1);

%---------------
% YOUR CODE HERE
n = size(testMatrix, 2);
m = size(testMatrix, 1);
for t = 1:m
    % Log-likelihood of document t under each class (sum of logs, not a product).
    log_a = 0;   % class 1 (spam)
    log_b = 0;   % class 0 (non-spam)
    for i = 1:n
        if (testMatrix(t, i) == 0)
            continue;
        end
        log_a = log_a + testMatrix(t, i) * log(phi_y1(i));
        log_b = log_b + testMatrix(t, i) * log(phi_y0(i));
    end
    % Posterior p(y=1|x) from the last line of the derivation.
    p = 1 / (1 + exp(log_b + log(1 - phi_y) - log_a - log(phi_y)));
    if (p >= 0.5)
        output(t) = 1;
    else
        output(t) = 0;
    end
end
%---------------

% Compute the error on the test set
y = full(category);
y = y(:);
error = sum(y ~= output) / numTestDocs;

%Print out the classification error on the test set
fprintf(1, 'Test error: %1.4f\n', error);
Program output
Test error: 0.0525