Andrew Ng Machine Learning Exercise 8: Anomaly Detection and Recommender Systems

1 Anomaly detection

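Implement an anomaly detection algorithm to detect anomalous behavior in servers.
The features are each server's throughput (mb/s) and response latency (ms).
Features were collected from m=307 running servers, {x(1),...,x(m)}.
Most of these examples are features of normal servers.

You will use a Gaussian model to detect anomalous examples in the dataset.
Start with a 2D dataset so that the algorithm's behavior can be visualized.
On that dataset you will fit a Gaussian distribution and find the values with low probability, which identifies the anomalous examples.
After that, you will apply the anomaly detection algorithm to a larger dataset with many dimensions.

First, visualize the data:
[Figure: scatter plot of the example dataset, latency (ms) against throughput (mb/s)]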

%% ================== Part 1: Load Example Dataset  =================== 

%  We start this exercise by using a small dataset that is easy to
%  visualize.
%
%  Our example case consists of 2 network server statistics across
%  several machines: the latency and throughput of each machine.
%  This exercise will help us find possibly faulty (or very fast) machines.
%

fprintf('Visualizing example dataset for outlier detection.\n\n');

%  The following command loads the dataset. You should now have the
%  variables X, Xval, yval in your environment
load('ex8data1.mat');

%  Visualize the example dataset
plot(X(:, 1), X(:, 2), 'bx');
axis([0 30 0 30]);
xlabel('Latency (ms)');
ylabel('Throughput (mb/s)');

fprintf('Program paused. Press enter to continue.\n');
pause

1.1 Gaussian distribution

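To perform anomaly detection, you first need to fit a model to the distribution of the data.

Given a training set {x(1),...,x(m)}, where x(i)∈R^n,
you need to estimate a Gaussian distribution for each feature x_i.
For each feature i = 1,...,n, you need to compute the parameters μ_i and σ_i².

In general, if a variable x follows a Gaussian distribution, x~N(μ, σ²), its probability density function is:
$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where μ is the mean and σ² is the variance.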
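
As a quick sanity check of the density formula, here is a minimal Octave sketch. The name gaussianDensity is hypothetical and not part of the exercise files; it simply evaluates the univariate Gaussian density elementwise.

```octave
% Univariate Gaussian density (hypothetical helper, not from the exercise files).
% Works elementwise when x is a vector.
gaussianDensity = @(x, mu, sigma2) ...
    exp(-(x - mu) .^ 2 ./ (2 * sigma2)) ./ sqrt(2 * pi * sigma2);

pPeak = gaussianDensity(0, 0, 1);   % density at the mean of N(0, 1)
pTail = gaussianDensity(3, 0, 1);   % density three standard deviations away

fprintf('density at the mean: %f\n', pPeak);
fprintf('density at 3 sigma:  %f\n', pTail);
```

The point far from the mean gets a much smaller density, which is exactly the property the anomaly detector relies on.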

1.2 Estimating parameters for a Gaussian

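Compute μ_i and σ_i² for each feature using the following formulas:
$$\mu_i = \frac{1}{m}\sum_{j=1}^{m} x_i^{(j)}, \qquad \sigma_i^2 = \frac{1}{m}\sum_{j=1}^{m}\left(x_i^{(j)} - \mu_i\right)^2$$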

%% ================== Part 2: Estimate the dataset statistics ===================
%  For this exercise, we assume a Gaussian distribution for the dataset.
%
%  We first estimate the parameters of our assumed Gaussian distribution, 
%  then compute the probabilities for each of the points and then visualize 
%  both the overall distribution and where each of the points falls in 
%  terms of that distribution.
%
fprintf('Visualizing Gaussian fit.\n\n');

%  Estimate mu and sigma2
[mu sigma2] = estimateGaussian(X);

%  Returns the density of the multivariate normal at each data point (row) 
%  of X
p = multivariateGaussian(X, mu, sigma2);

%  Visualize the fit
visualizeFit(X,  mu, sigma2);
xlabel('Latency (ms)');
ylabel('Throughput (mb/s)');

fprintf('Program paused. Press enter to continue.\n');
pause;
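
The call to multivariateGaussian above receives sigma2 as a vector. Assuming the course-provided function treats that vector as the diagonal of the covariance matrix (its source is not shown in this post), the density of each example is just the product of the per-feature Gaussian densities. A rough sketch with made-up toy values:

```octave
% Hypothetical toy inputs: two examples, two features.
X = [1 2; 3 4];
mu = [2; 3];
sigma2 = [1; 4];

m = size(X, 1);
Mu = repmat(mu', m, 1);       % repeat the means for every row
S2 = repmat(sigma2', m, 1);   % repeat the variances for every row

% Per-feature Gaussian densities, then the product across features.
perFeature = exp(-((X - Mu) .^ 2) ./ (2 * S2)) ./ sqrt(2 * pi * S2);
p = prod(perFeature, 2);      % m x 1 vector, one density per example
```

Both toy examples sit one unit away from the mean in each feature, so they end up with the same density.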

estimateGaussian.m

function [mu sigma2] = estimateGaussian(X)
%ESTIMATEGAUSSIAN This function estimates the parameters of a 
%Gaussian distribution using the data in X
%   [mu sigma2] = estimateGaussian(X), 
%   The input X is the dataset with each n-dimensional data point in one row
%   The output is an n-dimensional vector mu, the mean of the data set
%   and the variances sigma^2, an n x 1 vector
% 

% Useful variables
[m, n] = size(X);

% You should return these values correctly
mu = zeros(n, 1);
sigma2 = zeros(n, 1);
% ====================== YOUR CODE HERE ======================
% Instructions: Compute the mean of the data and the variances
%               In particular, mu(i) should contain the mean of
%               the data for the i-th feature and sigma2(i)
%               should contain variance of the i-th feature.
%
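mu = (sum(X, 1) / m)';                                 % n x 1 vector of feature means
sigma2 = (sum((X - repmat(mu', m, 1)) .^ 2, 1) / m)';  % n x 1 vector of variances (1/m normalization)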
% =============================================================
end
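
To check the estimates by hand, the following sketch runs the same computation on a tiny made-up matrix and compares it with the built-in mean and var. The toy data is an assumption for illustration only. Note that var needs the option 1 to normalize by m instead of m-1.

```octave
% Toy data (made up): 3 examples, 2 features.
X = [1 2; 3 4; 5 6];
m = size(X, 1);

% Same formulas as in estimateGaussian: 1/m normalization, n x 1 results.
mu = (sum(X, 1) / m)';
sigma2 = (sum((X - repmat(mu', m, 1)) .^ 2, 1) / m)';

% Cross-check against the built-ins.
muBuiltin = mean(X, 1)';
sigma2Builtin = var(X, 1, 1)';   % second argument 1: normalize by m

disp([mu muBuiltin]);
disp([sigma2 sigma2Builtin]);
```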

[Figure: contours of the fitted Gaussian distribution over the example dataset]

1.3 Selecting the threshold, ε

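Now that we have the Gaussian parameters, we can investigate which examples have a high probability under this distribution and which have a very low probability. Examples with low probability are more likely to be anomalies.

One way to decide which examples are anomalies is to select a threshold based on a cross-validation set.

In this part, you implement an algorithm that selects the threshold ε using the F1 score on the cross-validation set.

The cross-validation set is {(x_cv(1), y_cv(1)),...,(x_cv(m_cv), y_cv(m_cv))}.
The label y=1 marks an anomalous example and y=0 marks a normal example.

For each cross-validation example, compute p(x_cv(i)).

All of the p(x_cv(1)),...,p(x_cv(m_cv)) and y_cv(1),...,y_cv(m_cv) are passed to selectThreshold.m as vectors to compute the threshold ε. The function should also return the F1 score obtained with that ε.

The F1 score is computed from the precision (prec) and the recall (rec):
$$F_1 = \frac{2 \cdot prec \cdot rec}{prec + rec}, \qquad prec = \frac{tp}{tp + fp}, \qquad rec = \frac{tp}{tp + fn}$$

tp is the number of true positives: the label says anomaly and the algorithm correctly classified the example as an anomaly.
fp is the number of false positives: the label says normal and the algorithm incorrectly classified the example as an anomaly.
fn is the number of false negatives: the label says anomaly and the algorithm incorrectly classified the example as normal.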
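
The definitions above can be worked through on a small made-up example. The label and prediction vectors below are assumptions chosen so the counts are easy to verify by hand.

```octave
% Made-up ground truth and predictions (1 = anomaly, 0 = normal).
yval        = [1; 0; 1; 0; 0; 1];
predictions = [1; 1; 0; 0; 0; 1];

tp = sum((predictions == 1) & (yval == 1));   % examples 1 and 6
fp = sum((predictions == 1) & (yval == 0));   % example 2
fn = sum((predictions == 0) & (yval == 1));   % example 3

prec = tp / (tp + fp);
rec  = tp / (tp + fn);
F1   = 2 * prec * rec / (prec + rec);

fprintf('tp=%d fp=%d fn=%d prec=%f rec=%f F1=%f\n', tp, fp, fn, prec, rec, F1);
```

Here tp=2, fp=1 and fn=1, so precision and recall are both 2/3, and F1 is 2/3 as well.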

%% ================== Part 3: Find Outliers ===================
%  Now you will find a good epsilon threshold using a cross-validation set
%  probabilities given the estimated Gaussian distribution
% 

pval = multivariateGaussian(Xval, mu, sigma2);

[epsilon F1] = selectThreshold(yval, pval);
fprintf('Best epsilon found using cross-validation: %e\n', epsilon);
fprintf('Best F1 on Cross Validation Set:  %f\n', F1);
fprintf('   (you should see a value epsilon of about 8.99e-05)\n');
fprintf('   (you should see a Best F1 value of  0.875000)\n\n');

%  Find the outliers in the training set and plot them
outliers = find(p < epsilon);

%  Draw a red circle around those outliers
hold on
plot(X(outliers, 1), X(outliers, 2), 'ro', 'LineWidth', 2, 'MarkerSize', 10);
hold off

fprintf('Program paused. Press enter to continue.\n');
pause;

selectThreshold.m

function [bestEpsilon bestF1] = selectThreshold(yval, pval)
%SELECTTHRESHOLD Find the best threshold (epsilon) to use for selecting
%outliers
%   [bestEpsilon bestF1] = SELECTTHRESHOLD(yval, pval) finds the best
%   threshold to use for selecting outliers based on the results from a
%   validation set (pval) and the ground truth (yval).
%

bestEpsilon = 0;
bestF1 = 0;
F1 = 0;

stepsize = (max(pval) - min(pval)) / 1000; % step size between candidate thresholds
for epsilon = min(pval):stepsize:max(pval)

    % ====================== YOUR CODE HERE ======================
    % Instructions: Compute the F1 score of choosing epsilon as the
    %               threshold and place the value in F1. The code at the
    %               end of the loop will compare the F1 score for this
    %               choice of epsilon and set it to be the best epsilon if
    %               it is better than the current choice of epsilon.
    %               
    % Note: You can use predictions = (pval < epsilon) to get a binary vector
    %       of 0's and 1's of the outlier predictions

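    predictions = (pval < epsilon); % binary vector: 1 where the density is below the threshold (predicted anomaly)
    fp = sum((predictions == 1) & (yval == 0)); % false positives: predicted anomaly, label says normal
    fn = sum((predictions == 0) & (yval == 1)); % false negatives: predicted normal, label says anomaly
    tp = sum((predictions == 1) & (yval == 1)); % true positives: predicted anomaly, label says anomaly

    prec = tp / (tp + fp); % precision (0/0 gives NaN when nothing is flagged; NaN > bestF1 is false, so that step is skipped)
    rec = tp / (tp + fn); % recall

    F1 = 2 * prec * rec / (prec + rec); % F1 score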
    % =============================================================

    if F1 > bestF1
       bestF1 = F1;
       bestEpsilon = epsilon;
    end
end

end
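
For a feel of how the threshold scan behaves, here is a self-contained sketch on made-up validation data. It uses the algebraically equivalent form F1 = 2*tp / (2*tp + fp + fn), which is my substitution rather than the exercise's formulation; it avoids the 0/0 case when no example is flagged, as long as the labels contain at least one anomaly.

```octave
% Made-up validation densities and labels: the two lowest densities are anomalies.
pval = [0.90; 0.80; 0.01; 0.70; 0.02; 0.85];
yval = [0; 0; 1; 0; 1; 0];

bestEpsilon = 0;
bestF1 = 0;
stepsize = (max(pval) - min(pval)) / 1000;
for epsilon = min(pval):stepsize:max(pval)
    predictions = (pval < epsilon);
    tp = sum((predictions == 1) & (yval == 1));
    fp = sum((predictions == 1) & (yval == 0));
    fn = sum((predictions == 0) & (yval == 1));

    % Equivalent to 2 * prec * rec / (prec + rec), without the intermediate divisions.
    F1 = 2 * tp / (2 * tp + fp + fn);

    if F1 > bestF1
        bestF1 = F1;
        bestEpsilon = epsilon;
    end
end

fprintf('best epsilon: %f, best F1: %f\n', bestEpsilon, bestF1);
```

Because the comparison is strict (F1 > bestF1), the scan keeps the first, smallest threshold that reaches the best score: here, the first candidate above 0.02 that separates both anomalies from the normal examples.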

[Figure: the detected anomalies circled in red on the scatter plot]

1.4 High dimensional dataset

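Apply the anomaly detection algorithm implemented above to a more realistic and much harder dataset.
Each example has 11 features, which capture more properties of the servers.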

%% ================== Part 4: Multidimensional Outliers ===================
%  We will now use the code from the previous part and apply it to a 
%  harder problem in which more features describe each datapoint and only 
%  some features indicate whether a point is an outlier.
%

%  Loads the second dataset. You should now have the
%  variables X, Xval, yval in your environment
load('ex8data2.mat');

%  Apply the same steps to the larger dataset
[mu sigma2] = estimateGaussian(X);

%  Training set 
p = multivariateGaussian(X, mu, sigma2);

%  Cross-validation set
pval = multivariateGaussian(Xval, mu, sigma2);

%  Find the best threshold
[epsilon F1] = selectThreshold(yval, pval);

fprintf('Best epsilon found using cross-validation: %e\n', epsilon);
fprintf('Best F1 on Cross Validation Set:  %f\n', F1);
fprintf('   (you should see a value epsilon of about 1.38e-18)\n');
fprintf('   (you should see a Best F1 value of 0.615385)\n');
fprintf('# Outliers found: %d\n\n', sum(p < epsilon));

Best epsilon found using cross-validation: 1.377229e-18
Best F1 on Cross Validation Set:  0.615385
   (you should see a value epsilon of about 1.38e-18)
   (you should see a Best F1 value of 0.615385)
# Outliers found: 117

2 Recommender Systems

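In this part, you will implement the collaborative filtering learning algorithm and apply it to a dataset of movie ratings.

Ratings range from 1 to 5.

There are n_u=943 users and n_m=1682 movies.

In the next part of the exercise, you will implement the function cofiCostFunc.m, which computes the collaborative filtering objective function and gradient. You then use fmincg.m to learn the collaborative filtering parameters.

2.1 Movie ratings dataset

Load the variables Y and R from ex8_movies.mat.

The matrix Y (num_movies × num_users) stores the ratings y(i,j), from 1 to 5.

The matrix R is a binary indicator matrix: R(i,j)=1 means user j has rated movie i, and R(i,j)=0 means they have not.

The goal of collaborative filtering is to predict ratings for movies that users have not yet rated, i.e. the entries with R(i,j)=0. This lets us recommend to each user the movies with the highest predicted ratings.

Throughout this part of the exercise, you will work with the two matrices X and Theta:
$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(n_m)})^T \end{bmatrix}, \qquad \Theta = \begin{bmatrix} (\theta^{(1)})^T \\ (\theta^{(2)})^T \\ \vdots \\ (\theta^{(n_u)})^T \end{bmatrix}$$

The i-th row of X corresponds to the feature vector x(i) of the i-th movie.
The j-th row of Theta corresponds to the parameter vector θ(j) of the j-th user.

Both x(i) and θ(j) are n-dimensional vectors.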

In this exercise the number of features is n=100, so X is an n_m × 100 matrix and Theta is an n_u × 100 matrix.
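
To make the roles of X, Theta and R concrete, here is a toy sketch with 3 movies, 2 users and 2 features. All values are made up, and the prediction rule pred(i,j) = θ(j)' * x(i) is taken from the course material rather than from the text above. Parameters are assumed to be already learned.

```octave
% Made-up learned parameters: rows of X are movies, rows of Theta are users.
X     = [1 0; 0 1; 0.5 0.5];   % 3 movies x 2 features
Theta = [5 1; 1 4];            % 2 users  x 2 features

% R(i, j) = 1 if user j has already rated movie i.
R = [1 0; 0 0; 0 1];

% Predicted rating of every movie by every user: pred(i, j) = Theta(j, :) * X(i, :)'.
pred = X * Theta';

% Only recommend movies the user has not rated yet.
pred(R == 1) = -Inf;
[bestRating, bestMovie] = max(pred, [], 1);

for j = 1:size(Theta, 1)
    fprintf('user %d: recommend movie %d (predicted rating %.1f)\n', ...
            j, bestMovie(j), bestRating(j));
end
```

Masking the rated entries with -Inf is just one simple way to restrict the search to the positions where R(i,j)=0.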
