
Study Notes on sklearn.metrics: Classification metrics

There are many metrics for classification problems; this post covers only a few commonly used ones.

1. Accuracy score

Let \(y\) denote the true labels and \(\hat{y}\) the predicted labels. The accuracy score is defined as:
\(accuracy(y,\hat{y}) = \dfrac 1 m \displaystyle\sum_{i=1}^m 1(\hat{y}_i = y_i)\), where \(1(\cdot)\) is the indicator function. For example:

>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2

Parameter note:
with normalize=False, the function returns the number of correctly classified samples instead of the fraction.
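The formula above is just the mean of element-wise matches between the two label arrays, which can be verified by hand (a quick sanity check, not part of the original example):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = [0, 1, 2, 3]
y_pred = [0, 2, 1, 3]

# accuracy = (1/m) * sum of indicator(y_i == y_hat_i), i.e. the mean of matches
manual = np.mean(np.array(y_true) == np.array(y_pred))
print(manual)                                 # 0.5
print(accuracy_score(y_true, y_pred))         # 0.5
```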

2. Confusion matrix

For example:

# multiclass
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])

For binary problems, you can also call ravel() on the matrix to unpack TN, FP, FN, TP directly; this is exactly the order you get by reading the confusion matrix row by row.

>>> tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
>>> (tn, fp, fn, tp)
(0, 2, 1, 1)
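The row-by-row unpacking order can be confirmed by counting (true, predicted) pairs directly (a sketch added here for verification, not part of the original post):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]

# row i = true class i, column j = predicted class j;
# ravel() flattens row by row, so for binary labels the order is TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

pairs = list(zip(y_true, y_pred))
assert tn == pairs.count((0, 0))  # true 0, predicted 0
assert fp == pairs.count((0, 1))  # true 0, predicted 1
assert fn == pairs.count((1, 0))  # true 1, predicted 0
assert tp == pairs.count((1, 1))  # true 1, predicted 1
```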

3. Classification report

classification_report produces a text report listing precision, recall, F1-score, and support for each class. For example:

>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 0]
>>> y_pred = [0, 0, 2, 1, 0]
>>> target_names = ['class 0', 'class 1', 'class 2']
>>> print(classification_report(y_true, y_pred, target_names=target_names))
              precision    recall  f1-score   support

     class 0       0.67      1.00      0.80         2
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.50      0.67         2

   micro avg       0.60      0.60      0.60         5
   macro avg       0.56      0.50      0.49         5
weighted avg       0.67      0.60      0.59         5
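If you need the numbers programmatically rather than as printed text, classification_report also accepts output_dict=True (available in recent sklearn versions), which returns a nested dict keyed by class name:

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]

# output_dict=True returns the report as a nested dict instead of a string
report = classification_report(y_true, y_pred,
                               target_names=['class 0', 'class 1', 'class 2'],
                               output_dict=True)
print(report['class 0']['precision'])  # 2/3, as in the printed report
```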

4. Precision, recall and F-measures

These concepts need no further introduction, so let's go straight to the code:

>>> from sklearn.metrics import confusion_matrix
>>> from sklearn.metrics import precision_score, recall_score, f1_score
>>> y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
>>> y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
>>> confusion_matrix(y_true, y_pred)
array([[4, 3],
       [1, 3]], dtype=int64)
>>> precision_score(y_true, y_pred)
0.5
>>> recall_score(y_true, y_pred)
0.75
>>> f1_score(y_true, y_pred)
0.6
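These three functions default to binary classification. For multiclass problems they take an average parameter ('macro', 'weighted', 'micro', …) that controls how per-class scores are combined; a short sketch using the same data as the classification report section:

```python
from sklearn.metrics import precision_score, f1_score

y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]

# 'macro': unweighted mean of the per-class scores
macro_p = precision_score(y_true, y_pred, average='macro')
# 'weighted': per-class scores weighted by class support
weighted_f1 = f1_score(y_true, y_pred, average='weighted')
print(macro_p)      # matches the "macro avg" precision in the report (~0.56)
print(weighted_f1)  # matches the "weighted avg" f1-score in the report (~0.59)
```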

5. Log loss

Log loss is also known as logistic regression loss or cross-entropy loss. For the binary case, let \(y\) be the true label and \(p\) the predicted probability \(P(y=1)\); the log loss for a single sample is:
\(L_{\log}(y,p) = -y\log p - (1-y)\log(1-p)\). For example:

>>> from sklearn.metrics import log_loss
>>> y_true = [0, 0, 1, 1]
>>> y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
>>> log_loss(y_true, y_pred)    
0.1738...

Code note: y_pred above gives each sample's predicted probabilities for class 0 and class 1. You can also pass only the positive-class probabilities, y_pred = [0.1, 0.2, 0.7, 0.99], and get the same loss.
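The reported value can be reproduced by averaging the per-sample formula over the dataset (a verification sketch, not part of the original example):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.2, 0.7, 0.99])  # P(y=1) for each sample

# mean of the per-sample loss: -y*log(p) - (1-y)*log(1-p)
manual = np.mean(-y_true * np.log(p) - (1 - y_true) * np.log(1 - p))
print(round(manual, 4))  # 0.1738, matching log_loss above
```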

6.Receiver operating characteristic (ROC)

The ROC curve plots FPR on the horizontal axis against TPR on the vertical axis. For example:

>>> import numpy as np
>>> from sklearn.metrics import roc_curve
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
>>> fpr
array([0. , 0. , 0.5, 0.5, 1. ])
>>> tpr
array([0. , 0.5, 0.5, 1. , 1. ])
>>> thresholds
array([1.8 , 0.8 , 0.4 , 0.35, 0.1 ])

Here the parameter pos_label=2 designates label 2 as the positive class. roc_curve returns three arrays, fpr, tpr, and thresholds; to draw the ROC curve, simply call plt.plot(fpr, tpr).
The roc_auc_score function computes the AUC, i.e. the area under the ROC curve:

>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
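Since the AUC is the area under the (fpr, tpr) curve, the value 0.75 above can be recovered from the roc_curve output with the trapezoidal rule (a sketch tying the two functions together, not part of the original post):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, _ = roc_curve(y_true, y_scores)
# trapezoidal rule over the piecewise-linear ROC curve
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
print(auc)  # 0.75, same as roc_auc_score
```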