
ROC Curve Evaluation and Anomaly Filtering

1. For details, see https://www.cnblogs.com/mdevelopment/p/9456486.html

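Review of the ROC curve:

The ROC curve shows how well an anomaly detection system (ADS) separates normal points from anomalous ones. It plots the true positive rate (TPR, i.e. recall) as a function of the false positive rate (FPR).

The larger the area under the curve (AUC), the closer the curve lies to the horizontal line TPR = 1, and the better the ADS performs.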
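To make the definition concrete, the following is a minimal sketch that builds the ROC points by hand and integrates them with the trapezoidal rule. The helper names `roc_points` and `trapezoid_auc` and the toy data are illustrative assumptions, not part of CellPAD or scikit-learn.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    # Visit the points from the highest to the lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a curve point only after the last sample of a group of tied scores.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            tpr.append(tp / positives)
            fpr.append(fp / negatives)
    return fpr, tpr


def trapezoid_auc(x, y):
    """Area under the piecewise-linear curve through the points (x, y)."""
    return sum((x[k] - x[k - 1]) * (y[k] + y[k - 1]) / 2 for k in range(1, len(x)))


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # hypothetical anomaly scores
labels = [1, 1, 0, 1, 0, 0]              # 1 = anomaly, 0 = normal
fpr, tpr = roc_points(scores, labels)
roc_auc = trapezoid_auc(fpr, tpr)
print(roc_auc)  # 8/9: eight of the nine (anomaly, normal) pairs are ranked correctly
```

A perfect detector would rank every anomaly above every normal point and reach an AUC of 1; a random one would hover around 0.5.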

from sklearn import metrics  # usage: metrics.<metric_function_name>(parameters)


def evaluate(scores, labels):
    """
    It returns the auc and prauc scores.
    :param scores: list<float> | the anomaly scores predicted by CellPAD.
    :param labels: list<float> | the true labels.
    :return: the auc, prauc.
    """
    # Compute the coordinates of the ROC curve (FPR on the x-axis, TPR on the y-axis).
    # TPR = TP / (TP + FN) = recall (true positive rate, sensitivity)
    # FPR = FP / (FP + TN) (false positive rate, 1 - specificity)
    fpr, tpr, thresholds = metrics.roc_curve(labels, scores, pos_label=1)

    # Compute the points of the precision-recall curve.
    precision, recall, thresholds = metrics.precision_recall_curve(labels, scores, pos_label=1)

    # metrics.auc(x, y) integrates y over x with the trapezoidal rule
    # (older scikit-learn versions also accepted reorder=False).
    # Area under the ROC curve; a larger AUC means better performance.
    auc = metrics.auc(fpr, tpr)
    # Area under the precision-recall curve.
    prauc = metrics.auc(recall, precision)
    return auc, prauc
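The TPR, FPR, and precision formulas can be checked by hand at a single threshold. This is a small sketch on toy data; `confusion_counts` and the threshold 0.5 are hypothetical and only serve the illustration.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when points with score >= threshold are flagged."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif not flagged and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # hypothetical anomaly scores
labels = [1, 1, 0, 1, 0, 0]              # 1 = anomaly, 0 = normal
tp, fp, tn, fn = confusion_counts(scores, labels, threshold=0.5)

recall = tp / (tp + fn)               # TPR = TP / (TP + FN)
false_positive_rate = fp / (fp + tn)  # FPR = FP / (FP + TN)
precision = tp / (tp + fp)
print(tp, fp, tn, fn)                 # 2 1 2 1
```

Each threshold yields one (FPR, TPR) point on the ROC curve and one (recall, precision) point on the precision-recall curve; sweeping the threshold traces out both curves.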

2. detect_anomaly() computes the drop ratio of each point by comparing the predicted value with the actual value. It then runs filter_anomaly() to flag the anomalies according to the parameter "rule".

def detect_anomaly(self, predicted_series, practical_series):
    """
    It calculates the drop ratio of each point by comparing the predicted value and practical value.
    Then it runs filter_anomaly() function to filter out the anomalies by the parameter "rule".
    :param predicted_series: the predicted values of a KPI series
    :param practical_series: the practical values of a KPI series
    :return: drop_ratios, drop_labels and drop_scores
    """
    drop_ratios = []
    for i in range(len(practical_series)):
        # dp = (actual - predicted) / (predicted + 1e-7)
        # The 1e-7 term (10 to the power -7) keeps the denominator away from zero
        # when the predicted value is 0.
        dp = (practical_series[i] - predicted_series[i]) / (predicted_series[i] + 1e-7)
        drop_ratios.append(dp)

    # A negative ratio means the actual value fell below the prediction (a drop):
    # its magnitude becomes the score. Non-negative ratios get a score of 0.
    drop_scores = []
    for r in drop_ratios:
        if r < 0:
            drop_scores.append(-r)
        else:
            drop_scores.append(0.0)

    drop_labels = self.filter_anomaly(drop_ratios)
    return drop_ratios, drop_labels, drop_scores
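A short worked example of the drop ratio and drop score follows. The standalone helpers `drop_ratio` and `drop_score` and the sample values are assumptions made for illustration; they mirror the arithmetic in the method but are not CellPAD functions.

```python
def drop_ratio(practical, predicted, eps=1e-7):
    """Relative deviation of the actual value from the prediction."""
    return (practical - predicted) / (predicted + eps)


def drop_score(ratio):
    """Magnitude of a drop; rises above the prediction score 0."""
    return -ratio if ratio < 0 else 0.0


predicted = [100.0, 100.0, 100.0]  # hypothetical predicted KPI values
practical = [100.0, 60.0, 120.0]   # hypothetical actual KPI values

ratios = [drop_ratio(a, p) for a, p in zip(practical, predicted)]
scores = [drop_score(r) for r in ratios]
# ratios is approximately [0.0, -0.4, 0.2]
# scores is approximately [0.0, 0.4, 0.0]
```

Only the second point, where the actual value is 40% below the prediction, receives a non-zero score. The third point deviates upward, so it is not treated as a drop.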

3. The filter_anomaly() function, called from step 2

import numpy as np  # at module level


def filter_anomaly(self, drop_ratios):
    """
    It calculates the threshold for different approach(rule) and then calls filter_by_threshold().
    - gauss: threshold = mean - self.sigma * std
    - threshold: the given threshold variable
    - proportion: threshold = sort_scores[threshold_index]
    :param drop_ratios: list<float> | a measure of predicted drop anomaly degree
    :return: list<bool> | the drop labels
    """
    if self.rule == 'gauss':
        mean = np.mean(drop_ratios)
        # np.std returns the population standard deviation (ddof=0), not the variance.
        std = np.std(drop_ratios)
        # threshold = mean - sigma * standard deviation
        threshold = mean - self.sigma * std
        drop_labels = self.filter_by_threshold(drop_ratios, threshold)
        return drop_labels

    if self.rule == "threshold":
        threshold = self.threshold
        drop_labels = self.filter_by_threshold(drop_ratios, threshold)
        return drop_labels

    if self.rule == "proportion":
        # Sort in ascending order, so the largest drops come first.
        sort_scores = sorted(np.array(drop_ratios))
        # Clamp the index so that proportion = 1.0 does not run past the end.
        threshold_index = min(int(len(drop_ratios) * self.proportion), len(drop_ratios) - 1)
        threshold = sort_scores[threshold_index]
        drop_labels = self.filter_by_threshold(drop_ratios, threshold)
        return drop_labels

    # Fail loudly instead of silently returning None for an unknown rule.
    raise ValueError("unknown rule: %r" % self.rule)
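The three rules can be compared side by side on a small series. This is a sketch using only the standard library; the function `pick_threshold`, its default parameter values, and the sample ratios are hypothetical, and `statistics.pstdev` stands in for `np.std` (both compute the population standard deviation).

```python
import statistics


def pick_threshold(drop_ratios, rule, sigma=2.0, threshold=-0.5, proportion=0.1):
    """Return the cut-off below which a drop ratio counts as an anomaly."""
    if rule == "gauss":
        return statistics.mean(drop_ratios) - sigma * statistics.pstdev(drop_ratios)
    if rule == "threshold":
        return threshold
    if rule == "proportion":
        ordered = sorted(drop_ratios)  # ascending: largest drops first
        index = min(int(len(ordered) * proportion), len(ordered) - 1)
        return ordered[index]
    raise ValueError("unknown rule: %r" % rule)


# Hypothetical drop ratios: small fluctuations plus one sharp drop at the end.
drop_ratios = [0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 0.0, -0.9]

labels_by_rule = {}
for rule in ("gauss", "threshold", "proportion"):
    cutoff = pick_threshold(drop_ratios, rule)
    labels_by_rule[rule] = [r < cutoff for r in drop_ratios]
```

With these values all three rules flag only the last point. Note that the outlier itself inflates the standard deviation: with `sigma=3.0` the gauss cut-off would fall below -0.9 and nothing would be flagged, so the choice of sigma matters on short series.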

4. The filter_by_threshold() function, called from step 3

def filter_by_threshold(self, drop_scores, threshold):
    """
    It judges whether a point is an anomaly by comparing its drop score and the threshold.
    :param drop_scores: list<float> | a measure of predicted drop anomaly degree.
    :param threshold: float | the threshold to filter out anomalies.
    :return: list<bool> | a list of labels where a point with a "true" label is an anomaly.
    """
    # A point is an anomaly when its value is strictly below the threshold;
    # a value equal to the threshold is not flagged.
    drop_labels = []
    for r in drop_scores:
        if r < threshold:
            drop_labels.append(True)
        else:
            drop_labels.append(False)
    return drop_labels