Machine Learning Notes: ROC Curves
1. Introduction to Performance Metrics
Evaluating a learner's generalization ability requires performance metrics. Different metrics often lead to different evaluation results, which means that the quality of a model is relative: what makes a model good depends not only on the algorithm and the data, but also on the requirements of the task.
The metrics most commonly used in classification tasks are accuracy and error rate. For a dataset $D$ with $m$ samples, a classifier $f$, and the indicator function $\mathbb{I}(\cdot)$:
$$\mathrm{acc}(f;D) = \frac{1}{m}\sum_{i=1}^{m}\mathbb{I}\big(f(x_i) = y_i\big), \qquad E(f;D) = \frac{1}{m}\sum_{i=1}^{m}\mathbb{I}\big(f(x_i) \neq y_i\big) = 1 - \mathrm{acc}(f;D)$$
These two metrics are widely used, but they are poorly suited to imbalanced datasets, where a trivial majority-class predictor can still score a high accuracy.
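As a quick illustration of this failure mode, here is a minimal sketch (the toy labels are invented for this example) showing that accuracy can look excellent on an imbalanced dataset while recall is zero:
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
# 100 samples: 99 negative, 1 positive
y_true = np.array([0] * 99 + [1])
# A degenerate classifier that always predicts the majority (negative) class
y_pred = np.zeros(100, dtype=int)
print(accuracy_score(y_true, y_pred))  # 0.99 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- misses every positive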
First, a few definitions:
- TN (True Negative): case was negative and predicted negative
- TP (True Positive): case was positive and predicted positive
- FN (False Negative): case was positive but predicted negative
- FP (False Positive): case was negative but predicted positive
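These four counts can be read directly off sklearn's confusion matrix; a minimal sketch with made-up labels:
from sklearn.metrics import confusion_matrix
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2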
Precision:
$$P = \frac{TP}{TP + FP}$$
Recall (also called the true positive rate):
$$R = \frac{TP}{TP + FN}$$
Precision is the proportion of the samples that the classifier asserts to be positive that actually are positive; the higher the precision, the fewer the false positives. Recall is the proportion of the actual positive samples that the classifier predicts correctly. The two form a pair of conflicting metrics, and they can be combined into a single metric, the F1 score:
$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$
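A minimal sketch computing all three metrics with sklearn, reusing the toy labels from the confusion-matrix example above:
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 2/3
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 2/3
print(f1_score(y_true, y_pred))         # 2PR/(P+R)  = 2/3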
The ROC (receiver operating characteristic) curve displays the relationship between a classifier's true positive rate and false positive rate. ROC curves help compare the relative performance of different classifiers. The area under the ROC curve is the AUC (area under curve); the larger the area, the better the classification performance, and an ideal classifier has AUC = 1.
The PR (precision-recall) curve displays the relationship between precision and recall, as sketched below.
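A minimal sketch drawing a PR curve from invented scores; sklearn's precision_recall_curve computes the precision/recall pairs over all thresholds:
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
# Toy ground truth and classifier scores, invented for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.savefig("pr.png")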
2. Brief Introduction to Some Python sklearn Functions
2.1 sklearn Data Splitting Methods
K-fold cross-validation:
KFold, GroupKFold, StratifiedKFold
- Split the full dataset S into k disjoint subsets. If S contains m training examples, each subset has m/k of them; call the subsets {S1, S2, ..., Sk};
- In each round, take one subset as the test set and use the other k-1 as the training set;
- Train a model on those k-1 training subsets;
- Evaluate the model on the held-out test subset to obtain a classification accuracy;
- Average the k classification accuracies; the mean serves as the estimated true accuracy of the model (or hypothesis function);
This method makes full use of every sample, but it is computationally costly: the model must be trained k times and tested k times. A minimal usage sketch follows.
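The sketch below runs StratifiedKFold by hand, using sklearn's own LogisticRegression on the iris data as a stand-in learner (any estimator with fit/score would do):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5)  # k = 5 disjoint, class-balanced folds
scores = []
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy on the held-out fold
print(np.mean(scores))  # mean accuracy over the k folds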
2.2 Introduction to the sklearn roc_curve Parameters
sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)
y_true: array, shape = [n_samples]
True binary labels in {0, 1} or {-1, 1}. If the labels are not binary, pos_label should be given explicitly;
y_score: array, shape = [n_samples]
Target scores: either probability estimates of the positive class, or non-thresholded decision values (as returned by "decision_function" on some classifiers);
pos_label: int or str
The label treated as positive; all other labels are treated as negative;
sample_weight: optional sample weights;
drop_intermediate: boolean, optional (default=True)
Whether to drop suboptimal thresholds that would not appear on the plotted ROC curve; this helps produce lighter ROC curves;
Returns:
fpr: array, shape = [>2], increasing false positive rates
tpr: array, shape = [>2], increasing true positive rates
thresholds: array, shape = [n_thresholds], decreasing thresholds used to compute fpr and tpr
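A minimal call sketch, reusing the invented labels and scores from the PR example above:
from sklearn.metrics import roc_curve, auc
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)
print(fpr)            # increasing false positive rates
print(tpr)            # increasing true positive rates
print(auc(fpr, tpr))  # area under this ROC curve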
3. Example
"""
=============================================================
Receiver Operating Characteristic (ROC) with cross validation
=============================================================
Example of Receiver Operating Characteristic (ROC) metric to evaluate
classifier output quality using cross-validation.
ROC curves typically feature true positive rate on the Y axis, and false
positive rate on the X axis. This means that the top left corner of the plot is
the "ideal" point - a false positive rate of zero, and a true positive rate of
one. This is not very realistic, but it does mean that a larger area under the
curve (AUC) is usually better.
The "steepness" of ROC curves is also important, since it is ideal to maximize
the true positive rate while minimizing the false positive rate.
This example shows the ROC response of different datasets, created from K-fold
cross-validation. Taking all of these curves, it is possible to calculate the
mean area under curve, and see the variance of the curve when the
training set is split into different subsets. This roughly shows how the
classifier output is affected by changes in the training data, and how
different the splits generated by K-fold cross-validation are from one another.
.. note::
See also :func:`sklearn.metrics.roc_auc_score`,
:func:`sklearn.model_selection.cross_val_score`,
:ref:`sphx_glr_auto_examples_model_selection_plot_roc.py`,
"""
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
# #############################################################################
# Data IO and generation
# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
X, y = X[y != 2], y[y != 2]
n_samples, n_features = X.shape
# Add noisy features
random_state = np.random.RandomState(0)
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
# #############################################################################
# Classification and ROC analysis
# Run classifier with cross-validation and plot ROC curves
cv = StratifiedKFold(n_splits=6)  # data splitting
classifier = svm.SVC(kernel='linear', probability=True,
                     random_state=random_state)
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 0
for train, test in cv.split(X, y):
    probas_ = classifier.fit(X[train], y[train]).predict_proba(X[test])
    # Compute ROC curve and area under the curve for this fold
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))  # interpolate TPR onto the common FPR grid
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
             label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i += 1
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
         label='Chance', alpha=.8)
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)
std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label=r'$\pm$ 1 std. dev.')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.savefig("roc.png")
Resulting figure: the per-fold ROC curves, the mean ROC curve with its AUC, and the ±1 standard deviation band.