An Analysis of sklearn's roc_curve() Function

阿新 • Published: 2019-01-31

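While using sklearn's roc_curve() function, I noticed that the result was not quite what I expected. In theory, the thresholds should cover every value in y_score (the model's predicted scores), but roc_curve() returned only a subset of the thresholds. The source code explains why.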

Initial data:

y_true = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
y_score = [0.31689620142873609, 0.32367439192936548, 0.42600526758001989, 0.38769987193780364, 0.3667541015524296, 0.39760831479768338, 0.42017521636505745, 0.41936155918127238, 0.33803961944475219, 0.33998332945141224]

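Compute the false positive rate, the true positive rate, and the corresponding thresholds with sklearn's roc_curve function, passing drop_intermediate=False so that no points are dropped: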

from sklearn.metrics import roc_curve
fpr_skl, tpr_skl, thresholds_skl = roc_curve(y_true, y_score, drop_intermediate=False)

The resulting values are:

fpr_skl
[ 0.          0.14285714  0.14285714  0.14285714  0.28571429  0.42857143
  0.57142857  0.71428571  0.85714286  1.        ]

tpr_skl
[ 0.33333333  0.33333333  0.66666667  1.          1.          1.
  1.          1.          1.          1.        ]

thresholds_skl
[ 0.42600527  0.42017522  0.41936156  0.39760831  0.38769987  0.3667541
  0.33998333  0.33803962  0.32367439  0.3168962 ]

The roc_curve() function

Let's walk through the roc_curve() code to see how these three values are computed. It is essentially the standard AUC calculation.

First comes the _binary_clf_curve() function:

    fps, tps, thresholds = _binary_clf_curve(
        y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)

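fps and tps are the FP and TP counts from the confusion matrix, one pair per threshold. thresholds is y_score sorted in descending order (it looks different from the output above only because fewer decimal places are printed there; the values are the same). In this example the values are: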

fps = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
tps = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]
thresholds = [0.42600526758001989, 0.42017521636505745, 0.41936155918127238, 0.39760831479768338, 0.38769987193780364, 0.3667541015524296, 0.33998332945141224, 0.33803961944475219, 0.32367439192936548, 0.31689620142873609]

For clarity, here is a more direct way to compute fps and tps:

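for threshold in thresholds:
    # Predict 1 when the score is >= the threshold, otherwise 0
    y_prob = [1 if i >= threshold else 0 for i in y_score]
    # Whether each prediction is correct
    result = [i == j for i, j in zip(y_true, y_prob)]
    # Whether each sample is predicted positive
    positive = [i == 1 for i in y_prob]

    tp = [i and j for i, j in zip(result, positive)]  # predicted positive and correct
    fp = [(not i) and j for i, j in zip(result, positive)]  # predicted positive but wrong

    # Print FP first, then TP, matching the order of fps and tps
    print(fp.count(True), tp.count(True))

# Output (fp tp)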
0 1
1 1
1 2
1 3
2 3
3 3
4 3
5 3
6 3
7 3
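
The loop above rescans every sample for each threshold. The same counts can be obtained in one pass with a sort followed by a cumulative sum, which is roughly the approach _binary_clf_curve() takes. The following is only a minimal sketch under assumptions: the name binary_clf_curve_sketch is made up for illustration, sample weights and pos_label handling are ignored, and labels are assumed to be 0/1.

```python
import numpy as np

def binary_clf_curve_sketch(y_true, y_score):
    """Hypothetical helper: cumulative FP/TP counts at each distinct score."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)

    # Sort samples by score, highest first.
    order = np.argsort(y_score, kind="mergesort")[::-1]
    y_true, y_score = y_true[order], y_score[order]

    # Keep one index per distinct score: the last sample of each tie group.
    distinct = np.where(np.diff(y_score))[0]
    idxs = np.r_[distinct, y_true.size - 1]

    tps = np.cumsum(y_true)[idxs]  # positives with score >= threshold
    fps = 1 + idxs - tps           # everything else seen so far is a negative
    return fps, tps, y_score[idxs]

y_true = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
y_score = [0.31689620142873609, 0.32367439192936548, 0.42600526758001989,
           0.38769987193780364, 0.3667541015524296, 0.39760831479768338,
           0.42017521636505745, 0.41936155918127238, 0.33803961944475219,
           0.33998332945141224]

fps, tps, thresholds = binary_clf_curve_sketch(y_true, y_score)
print(fps.tolist())  # [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
print(tps.tolist())  # [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]
```

Because the samples are sorted by descending score, the number of samples seen so far at position i is i + 1, so the false positives are simply that count minus the true positives.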

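From fps and tps we can compute the corresponding fpr and tpr. Index -1 corresponds to the lowest threshold, where every sample is classified as positive, so fps[-1] is the total number of negative samples and tps[-1] is the total number of positive samples. A simplified version of the corresponding source code: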

fpr = [i/fps[-1] for i in fps] # fps / fps[-1]
tpr = [i/tps[-1] for i in tps] # tps / tps[-1]
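
As a quick check of the normalization, the sketch below redoes it with NumPy arrays and then integrates the curve with the trapezoid rule to get the AUC. Prepending the point (0, 0) is my own step so that the curve starts at the origin; as far as I know, newer sklearn versions add such a point themselves, but treat that as an assumption.

```python
import numpy as np

fps = np.array([0, 1, 1, 1, 2, 3, 4, 5, 6, 7])
tps = np.array([1, 1, 2, 3, 3, 3, 3, 3, 3, 3])

# Normalize the counts by the totals (the last element of each array).
fpr = fps / fps[-1]
tpr = tps / tps[-1]

# Assumption: start the curve at (0, 0) before integrating.
x = np.r_[0.0, fpr]
y = np.r_[0.0, tpr]

# Trapezoid rule written out by hand: sum of width * average height.
auc = float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))

print(np.round(tpr, 4))
print(round(auc, 4))  # 0.9048, i.e. 19/21
```

The value 19/21 agrees with counting pairs directly: of the 3 x 7 = 21 positive/negative pairs, 19 have the positive sample scored higher.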

The drop_intermediate parameter

roc_curve() has a drop_intermediate parameter. The relevant source code is:

if drop_intermediate and len(fps) > 2:
    optimal_idxs = np.where(np.r_[True,
                                  np.logical_or(np.diff(fps, 2),
                                                np.diff(tps, 2)),
                                  True])[0]
    fps = fps[optimal_idxs]
    tps = tps[optimal_idxs]
    thresholds = thresholds[optimal_idxs]

In this example, the intermediate values are:

# Second-order differences
np.diff(fps, 2)
[-1  0  1  0  0  0  0  0]
np.diff(tps, 2)
[ 1  0 -1  0  0  0  0  0]

# Logical OR of the two
np.logical_or(np.diff(fps, 2), np.diff(tps, 2))
[ True, False,  True, False, False, False, False, False]

# Add a True at the head and at the tail
np.r_[True, np.logical_or(np.diff(fps, 2), np.diff(tps, 2)), True]
[ True,  True, False,  True, False, False, False, False, False,  True]

# Indices of the True entries
np.where(np.r_[True, np.logical_or(np.diff(fps, 2), np.diff(tps, 2)), True])[0]
[0, 1, 3, 9]
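
Putting those steps together, the sketch below applies the same filter to the example arrays and shows which points and thresholds survive. The helper name corner_indices is made up for illustration; the expression inside it is the one from the source snippet.

```python
import numpy as np

fps = np.array([0, 1, 1, 1, 2, 3, 4, 5, 6, 7])
tps = np.array([1, 1, 2, 3, 3, 3, 3, 3, 3, 3])
thresholds = np.array([0.42600526758001989, 0.42017521636505745,
                       0.41936155918127238, 0.39760831479768338,
                       0.38769987193780364, 0.3667541015524296,
                       0.33998332945141224, 0.33803961944475219,
                       0.32367439192936548, 0.31689620142873609])

def corner_indices(fps, tps):
    """Hypothetical helper: indices kept when drop_intermediate is on."""
    keep = np.r_[True,
                 np.logical_or(np.diff(fps, 2), np.diff(tps, 2)),
                 True]
    return np.where(keep)[0]

idxs = corner_indices(fps, tps)
print(idxs.tolist())                  # [0, 1, 3, 9]
print(fps[idxs].tolist())             # [0, 1, 1, 7]
print(tps[idxs].tolist())             # [1, 1, 3, 3]
print(np.round(thresholds[idxs], 4))  # the four thresholds that remain
```

Only four of the ten thresholds remain, which is why the default call to roc_curve() returns fewer thresholds than there are distinct scores.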

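optimal_idxs is in effect the set of corner points of the ROC curve, and for plotting only the corner points are needed. If you think of fps and tps as the displacement of someone walking across the plot, the first difference is the "velocity" and the second difference is the "acceleration". Where both accelerations are zero, the walker keeps moving in a straight line, so that point lies on the segment between its neighbors and can be dropped.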

The "ROC curve" (in raw counts rather than rates) looks like this:

import matplotlib.pyplot as plt

fps = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
tps = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]

plt.plot(
    fps,
    tps,
    'b')
plt.xlim([-1, 8])
plt.ylim([-1, 8])
plt.ylabel('tps')
plt.xlabel('fps')
plt.show()

[Figure: tps plotted against fps for the example data]

So the drop_intermediate parameter only trims redundant points from the result of the ROC computation; it does not change the plotted ROC curve.
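
That claim can be checked on the example data. The sketch below is my own verification, not part of sklearn: for every dropped point it tests whether the point lies on the straight line through the two nearest kept points, using a zero cross product as the collinearity test.

```python
import numpy as np

fps = np.array([0, 1, 1, 1, 2, 3, 4, 5, 6, 7])
tps = np.array([1, 1, 2, 3, 3, 3, 3, 3, 3, 3])
pts = np.column_stack([fps, tps])

# Same filter as the source snippet: keep the end points and every corner.
keep = np.where(np.r_[True,
                      np.logical_or(np.diff(fps, 2), np.diff(tps, 2)),
                      True])[0]

dropped_on_segment = []
for i in range(len(pts)):
    j = np.searchsorted(keep, i)  # position of the first kept index >= i
    if keep[j] == i:
        continue                  # this point is kept, nothing to check
    a, b, p = pts[keep[j - 1]], pts[keep[j]], pts[i]
    # A zero cross product means p lies on the line through a and b.
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    dropped_on_segment.append(bool(cross == 0))

print(keep.tolist())            # [0, 1, 3, 9]
print(all(dropped_on_segment))  # True
```

Since fps and tps never decrease, a dropped point that is collinear with its kept neighbors also lies between them, so joining the kept points with straight lines reproduces the full curve.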
