A Summary of Anomaly Detection Algorithms in sklearn
阿新 • Published: 2018-12-29
Adapted from http://scikit-learn.org/stable/modules/outlier_detection.html#novelty-and-outlier-detection
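I. Overview
Two kinds of anomaly detection
novelty detection
The training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.
outlier detection
The training data contains outliers, and we need to fit the central mode of the training data while ignoring the deviant observations.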
Machine learning (unsupervised learning)
Fit: estimator.fit(X_train)
Predict: estimator.predict(X_test); inliers are labeled 1 and outliers are labeled -1
II. Novelty detection
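The fit/predict convention can be illustrated with a minimal sketch. The detector chosen here (EllipticEnvelope), the synthetic data, and the two test points are assumptions made for illustration only; any of the detectors discussed later follows the same pattern.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)          # illustrative training data: a standard 2-D Gaussian
X_test = np.array([[0.0, 0.0],       # a point at the center of the training data
                   [8.0, 8.0]])      # a point far away from it

# Learn the shape of the training data, then label new observations
estimator = EllipticEnvelope(random_state=0)
estimator.fit(X_train)
labels = estimator.predict(X_test)   # 1 = inlier, -1 = outlier
print(labels)
```

The central point should come back as 1 and the distant point as -1, which is the labeling convention shared by the estimators in this post.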
The modeling code is as follows:
import numpy as np
from sklearn import svm

xx, yy = np.meshgrid(np.linspace(-5, 5, 500), np.linspace(-5, 5, 500))
# Generate training data
X = 0.3 * np.random.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Generate some regular novel observations
X = 0.3 * np.random.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
# Fit the model
clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
# Count errors: regular points flagged as -1, abnormal points accepted as 1
n_error_train = y_pred_train[y_pred_train == -1].size
n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outliers = y_pred_outliers[y_pred_outliers == 1].size
III. Outlier detection
The covariance.EmpiricalCovariance algorithm
An example showing covariance estimation with Mahalanobis distances on Gaussian-distributed data.
The modeling code is as follows:
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

n_samples = 125
n_outliers = 25
n_features = 2
# Generate data
gen_cov = np.eye(n_features)
gen_cov[0, 0] = 2.
X = np.dot(np.random.randn(n_samples, n_features), gen_cov)
# Add some outliers
outliers_cov = np.eye(n_features)
outliers_cov[np.arange(1, n_features), np.arange(1, n_features)] = 7.
X[-n_outliers:] = np.dot(np.random.randn(n_outliers, n_features), outliers_cov)
# Fit a Minimum Covariance Determinant (MCD) robust estimator to the data
robust_cov = MinCovDet().fit(X)
# Fit the empirical estimator on the full data set, for comparison with the true parameters
emp_cov = EmpiricalCovariance().fit(X)
# Compute the squared Mahalanobis distances of the given observations
Y = emp_cov.mahalanobis(X)
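To make the squared Mahalanobis distance concrete, the sketch below recomputes it by hand from the fitted location and precision matrix and compares it with the value returned by mahalanobis. It also contrasts the empirical and robust (MCD) estimators on the contaminated rows. The data is a simplified stand-in for the sample above, and the fixed seeds are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.RandomState(0)
# Illustrative data: 125 Gaussian points, the last 25 replaced by contamination
X = rng.randn(125, 2) * np.array([2.0, 1.0])
X[-25:] = rng.randn(25, 2) * np.array([1.0, 7.0])

emp_cov = EmpiricalCovariance().fit(X)
robust_cov = MinCovDet(random_state=0).fit(X)

# Squared Mahalanobis distance by hand: (x - mu)^T Sigma^{-1} (x - mu)
centered = X - emp_cov.location_
manual = np.einsum("ij,jk,ik->i", centered, emp_cov.get_precision(), centered)
d_emp = emp_cov.mahalanobis(X)

# The robust estimator is less influenced by the contaminated rows,
# so it tends to assign them larger distances than the empirical one
d_robust = robust_cov.mahalanobis(X)
print("hand-computed matches library:", np.allclose(manual, d_emp))
print("mean d^2 of contaminated rows, empirical:", d_emp[-25:].mean())
print("mean d^2 of contaminated rows, robust:   ", d_robust[-25:].mean())
```

Because the empirical covariance is inflated by the contamination itself, the contaminated rows look less extreme under it. This is the motivation for fitting MinCovDet alongside EmpiricalCovariance.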
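The ensemble.IsolationForest algorithm
One efficient way to perform outlier detection on high-dimensional datasets is to use random forests: IsolationForest isolates observations with an ensemble of randomly built trees.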
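A minimal sketch of IsolationForest on its own is given here. The synthetic data, the contamination value of 0.1, and the seed are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_inliers = 0.3 * rng.randn(200, 2)                      # dense cluster around the origin
X_outliers = rng.uniform(low=-6, high=6, size=(20, 2))   # scattered points
X = np.r_[X_inliers, X_outliers]

forest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
forest.fit(X)

y_pred = forest.predict(X)              # 1 = inlier, -1 = outlier
scores = forest.decision_function(X)    # lower score = more anomalous

print("points flagged as outliers:", int((y_pred == -1).sum()))
print("mean score of cluster points:  ", scores[:200].mean())
print("mean score of scattered points:", scores[200:].mean())
```

The contamination parameter sets the share of training points that end up on the outlier side of the threshold, so it should reflect the proportion of anomalies expected in the data.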
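The neighbors.LocalOutlierFactor (LOF) algorithm
Another efficient way to perform outlier detection on moderately high-dimensional datasets is to use the Local Outlier Factor (LOF) algorithm.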
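A minimal sketch of LOF used for outlier detection follows. The synthetic data, n_neighbors=20, and contamination=0.1 are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_inliers = 0.3 * rng.randn(100, 2)                      # dense cluster around the origin
X_outliers = rng.uniform(low=-4, high=4, size=(10, 2))   # scattered points
X = np.r_[X_inliers, X_outliers]

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred = lof.fit_predict(X)               # labels for the training data itself: 1 or -1
scores = lof.negative_outlier_factor_     # near -1 = inlier, far below -1 = outlier

print("points flagged as outliers:", int((y_pred == -1).sum()))
print("mean score of cluster points:  ", scores[:100].mean())
print("mean score of scattered points:", scores[100:].mean())
```

In its default mode LOF labels only the data it was fitted on, which is why fit_predict and negative_outlier_factor_ are used here instead of the separate fit, predict, and decision_function calls used with the other detectors.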
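A comparison that models the same data with four anomaly detection methods:
sklearn.svm (support vector machine)
sklearn.covariance.EllipticEnvelope (covariance estimation for Gaussian-distributed data)
sklearn.ensemble.IsolationForest (random forest)
sklearn.neighbors.LocalOutlierFactor (LOF)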
import numpy as np
from scipy import stats
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Random number generator
rng = np.random.RandomState(42)
# Example settings
n_samples = 200
outliers_fraction = 0.25
clusters_separation = [0, 1, 2]
# Define the four outlier detection tools to be compared
classifiers = {
    "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
                                     kernel="rbf", gamma=0.1),
    "Robust covariance": EllipticEnvelope(contamination=outliers_fraction),
    "Isolation Forest": IsolationForest(max_samples=n_samples,
                                        contamination=outliers_fraction,
                                        random_state=rng),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=35,
                                               contamination=outliers_fraction),
}
# Compare the given classifiers under the given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 100), np.linspace(-7, 7, 100))
grid = np.c_[xx.ravel(), yy.ravel()]
n_inliers = int((1. - outliers_fraction) * n_samples)
n_outliers = int(outliers_fraction * n_samples)
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = -1

# Fit the problem with varying cluster separation
for offset in clusters_separation:
    np.random.seed(42)
    # Data generation
    X1 = 0.3 * np.random.randn(n_inliers // 2, 2) - offset
    X2 = 0.3 * np.random.randn(n_inliers // 2, 2) + offset
    X = np.r_[X1, X2]
    # Add outliers
    X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]

    # Fit the model
    for clf_name, clf in classifiers.items():
        # Fit the data and tag outliers
        if clf_name == "Local Outlier Factor":
            y_pred = clf.fit_predict(X)
            scores_pred = clf.negative_outlier_factor_
        else:
            clf.fit(X)
            scores_pred = clf.decision_function(X)
            y_pred = clf.predict(X)
        threshold = stats.scoreatpercentile(scores_pred,
                                            100 * outliers_fraction)
        n_errors = (y_pred != ground_truth).sum()
        print(scores_pred)

        # Score every point of the grid
        if clf_name == "Local Outlier Factor":
            # LOF scores unseen points only in novelty mode (novelty=True,
            # scikit-learn 0.20 and later), so a novelty-mode copy is fitted for the grid
            lof_grid = LocalOutlierFactor(n_neighbors=35,
                                          contamination=outliers_fraction,
                                          novelty=True).fit(X)
            Z = lof_grid.decision_function(grid)
        else:
            Z = clf.decision_function(grid)
        Z = Z.reshape(xx.shape)
        print(Z)