The 6 Core Metrics of Risk-Control Models in One Article (with Code)

阿新 • Published: 2020-11-05

Python financial risk-control scorecard modeling and data analysis micro-course (videos recorded by the blogger): http://dwz.date/b9vv

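After developing a credit-score model, we usually need to compute the following metrics:
● Discrimination metrics: ○ AUC ○ KS ○ GINI
● Stability metric: ○ PSI
● Score distribution: ○ Share of total users ○ Bad-user rate

The rest of this article is split into five parts that show how to compute these metrics in Python.
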
Contents
Part 1. Generating samples
Part 2. Computing AUC, KS, and GINI
Part 3. PSI
Part 4. Score distribution
Part 5. Complete project code
Acknowledgements

01
Generating samples

Since this article is only a demonstration, we use fake samples generated in code.

import numpy as np
import pandas as pd

n_sample = 1000

df_score = pd.DataFrame({
    'user_id': [u for u in range(n_sample)],
    'label':np.random.randint(2, size=n_sample),
    'score': 900*np.random.random(size=n_sample),
    'term': 20201+np.random.randint(5, size=n_sample)
})
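
Note that the samples above draw label and score independently, so the discrimination metrics computed later will sit near chance level. As a minimal sketch of an alternative, the block below ties the bad probability to the score through a logistic curve. The seed, the curve's midpoint (450) and scale (100), and the name df_score_corr are all assumptions made for illustration, not part of the original setup.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed: an assumption, for reproducibility
n_sample = 1000

score = 900 * rng.random(n_sample)
# Hypothetical link: the probability of being bad falls as the score rises
p_bad = 1 / (1 + np.exp((score - 450) / 100))

df_score_corr = pd.DataFrame({
    'user_id': np.arange(n_sample),
    'label': rng.binomial(1, p_bad),                    # 1 = bad user
    'score': score,
    'term': 20201 + rng.integers(5, size=n_sample),     # terms 20201..20205
})
```

With data like this, low scores are mostly bad users and high scores mostly good ones, so AUC and KS move well away from their chance values.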

Next, we count the total number of users, the number of bad users, and the bad-user rate for each term:

df_score.groupby('term').agg(total=('label', 'count'), 
                             bad=('label', 'sum'), 
                             bad_rate=('label', 'mean'))

02
Discrimination metrics

The helper functions for computing the discrimination metrics KS, AUC, and GINI are as follows:

from sklearn.metrics import roc_auc_score, roc_curve

def get_auc(ytrue, yprob):
    auc = roc_auc_score(ytrue, yprob)
    if auc < 0.5:
        auc = 1 - auc
    return auc

def get_ks(ytrue, yprob):
    fpr, tpr, thr = roc_curve(ytrue, yprob)
    ks = max(abs(tpr - fpr))
    return ks

def get_gini(ytrue, yprob):
    auc = get_auc(ytrue, yprob)
    gini = 2 * auc - 1
    return gini

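Here we make a small change to sklearn's AUC calculation: if AUC < 0.5, the function returns 1 - AUC, so the direction of the score (whether a higher score means higher or lower risk) is ignored.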
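
To make the relationship between the three metrics concrete, here is a small worked example on four hand-made samples. The arrays y_true and y_score are hypothetical toy data; the calls are the same sklearn functions the helpers use.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical toy data: 1 = bad user; sklearn treats a higher score
# as stronger evidence for the positive class (label 1)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)        # 3 of 4 (bad, good) pairs are ordered correctly
fpr, tpr, _ = roc_curve(y_true, y_score)
ks = np.max(np.abs(tpr - fpr))              # largest gap between the TPR and FPR curves
gini = 2 * auc - 1

# Flipping the score direction mirrors the AUC around 0.5,
# which is the case get_auc folds back with 1 - auc
auc_reversed = roc_auc_score(y_true, -y_score)

print(auc, ks, gini, auc_reversed)  # 0.75 0.5 0.5 0.25
```

A credit score where a higher value means lower risk behaves like the reversed case, which is why the fold is convenient when label 1 marks bad users.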

Then we compute the discrimination metrics for each term:

df_metrics = pd.DataFrame({
    'auc': df_score.groupby('term').apply(lambda x: get_auc(x['label'], x['score'])),
    'ks': df_score.groupby('term').apply(lambda x: get_ks(x['label'], x['score'])),
    'gini': df_score.groupby('term').apply(lambda x: get_gini(x['label'], x['score']))
})
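
The three separate groupby passes above could also be folded into one. The sketch below is one way to do that, assuming a helper (here called term_metrics, a hypothetical name) that returns all three metrics as a Series; the demo frame df_demo and its seed exist only to keep the block self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)  # seed chosen only for this sketch
n_sample = 1000
df_demo = pd.DataFrame({
    'label': rng.integers(2, size=n_sample),
    'score': 900 * rng.random(n_sample),
    'term': 20201 + rng.integers(5, size=n_sample),
})

def term_metrics(g):
    """Hypothetical helper: AUC, KS and GINI for the rows of one term."""
    auc = roc_auc_score(g['label'], g['score'])
    auc = max(auc, 1 - auc)                      # same direction-agnostic fold as get_auc
    fpr, tpr, _ = roc_curve(g['label'], g['score'])
    ks = np.max(np.abs(tpr - fpr))
    return pd.Series({'auc': auc, 'ks': ks, 'gini': 2 * auc - 1})

# Selecting only the two needed columns keeps the grouping key
# out of the frame passed to the helper
df_metrics_one_pass = df_demo.groupby('term')[['label', 'score']].apply(term_metrics)
```

The result is a table with one row per term and the columns auc, ks, and gini, and each ROC curve is computed once per term instead of three times.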

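03
PSI

This is done in two steps:

Bin the randomly generated credit scores into fixed score intervals;
Compute PSI over those bins: use pivot_table to lay the data out by term and compute each bin's share of users within each term.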

df_score['score_bin'] = pd.cut(df_score['score'], [0, 500, 700, 800, 900])

df_total = pd.pivot_table(df_score, 
                          values='user_id', 
                          index='score_bin', 
                          columns=['term'], 
                          aggfunc="count", 
                          margins=True)
df_ratio = df_total.div(df_total.iloc[-1, :], axis=1)

Compute the PSI from the user proportions and attach it to the table as a new row:

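eps = np.finfo(np.float32).eps  # small constant to avoid log(0) and division by zero
lst_psi = list()
for idx in range(1, len(df_ratio.columns)-1):
    # all score bins (every row except 'All') for the previous and the current term
    last, cur = df_ratio.iloc[:-1, idx-1]+eps, df_ratio.iloc[:-1, idx]+eps
    psi = sum((cur-last) * np.log(cur / last))
    lst_psi.append(psi)
# DataFrame.append was removed in pandas 2.0, so use pd.concat to add the 'psi' row;
# the first term has no previous term and 'All' is not a term, hence the NaNs
psi_row = pd.Series([np.nan]+lst_psi+[np.nan],
                    index=df_ratio.columns,
                    name='psi')
df_psi = pd.concat([df_ratio, psi_row.to_frame().T])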
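
The quantity computed in the loop is PSI = Σ_i (cur_i − last_i) · ln(cur_i / last_i), summed over the score bins i. As a minimal standalone sketch, the same formula can be wrapped in a function and checked on hand-made proportions. The name calc_psi and the example distributions are assumptions made for illustration.

```python
import numpy as np

def calc_psi(last, cur, eps=np.finfo(np.float32).eps):
    """Hypothetical helper: PSI between two binned distributions (proportions per bin)."""
    last = np.asarray(last, dtype=float) + eps   # eps guards against empty bins
    cur = np.asarray(cur, dtype=float) + eps
    return float(np.sum((cur - last) * np.log(cur / last)))

# Identical distributions give a PSI of zero
psi_same = calc_psi([0.5, 0.2, 0.2, 0.1], [0.5, 0.2, 0.2, 0.1])

# Moving 10% of users from the first bin to the last bin gives a positive PSI:
# (0.4 - 0.5) * ln(0.4 / 0.5) + (0.2 - 0.1) * ln(0.2 / 0.1) ≈ 0.0916
psi_shift = calc_psi([0.5, 0.2, 0.2, 0.1], [0.4, 0.2, 0.2, 0.2])
```

Every term in the sum is non-negative, because (cur − last) and ln(cur / last) always share the same sign, so PSI only grows as the two distributions drift apart.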

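04
Score distribution

Next we compute the distribution of total users and the distribution of the bad-user rate. The user distribution was in fact already computed during the PSI step above; it is the df_ratio table: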

df_total = pd.pivot_table(df_score, 
                          values='user_id', 
                          index='score_bin', 
                          columns=['term'], 
                          aggfunc="count", 
                          margins=True)
df_ratio = df_total.div(df_total.iloc[-1, :], axis=1)

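So, following the same pattern, we filter down to the bad users and repeat the pivot; dividing the result by the totals gives the bad-user rate.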

df_bad = pd.pivot_table(df_score[df_score['label']==1], 
                        values='user_id', 
                        index='score_bin', 
                        columns=['term'], 
                        aggfunc="count", 
                        margins=True)
df_bad_rate = df_bad/df_total
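
Because label is 0/1, the mean of label inside a bin is already that bin's bad-user rate. As a possible alternative to dividing two pivot tables, a single pivot with aggfunc='mean' yields the same rates, and a bin with users but no bad users shows 0.0 (with the two-table approach such a cell can come out as missing, depending on how pandas fills unobserved cells). The toy frame df_toy and the name df_bad_rate_direct below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 1 = bad user
df_toy = pd.DataFrame({
    'term':  [20201, 20201, 20201, 20201, 20202, 20202, 20202, 20202],
    'score': [300,   450,   650,   850,   200,   600,   650,   880],
    'label': [1,     1,     0,     0,     1,     0,     1,     0],
})
# Same fixed bins as in the article; string labels keep the index simple for this sketch
df_toy['score_bin'] = pd.cut(df_toy['score'], [0, 500, 700, 800, 900]).astype(str)

# Mean of a 0/1 label = bad-user rate per (score_bin, term); margins add the 'All' totals
df_bad_rate_direct = pd.pivot_table(df_toy,
                                    values='label',
                                    index='score_bin',
                                    columns='term',
                                    aggfunc='mean',
                                    margins=True)
```

In this toy table, the (500, 700] bin in term 20202 holds one bad user out of two, so its rate is 0.5, and the 'All' column for that bin is 1/3 across both terms.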

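We can then plot the distribution of total users as a stacked bar chart and the bad-user rate as a line chart, using a seaborn color palette with pandas plotting.
4.1. Distribution of total users: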

import matplotlib.pyplot as plt  # needed for plt.legend below
import seaborn as sns

colormap = sns.diverging_palette(130, 20, as_cmap=True)
df_ratio.drop('All').T.plot(kind='bar', stacked=True, colormap=colormap)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

4.2. Distribution of the bad-user rate:

colormap = sns.diverging_palette(130, 20, as_cmap=True)
df_bad_rate.drop('All').T.plot(kind='line', colormap=colormap)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

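05
Complete Python code

At this point, the article has shown how to compute all six core metrics for credit risk-control models. The complete Python code is attached below for FALers to study.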

import numpy as np
import pandas as pd

n_sample = 1000

df_score = pd.DataFrame({
    'user_id': [u for u in range(n_sample)],
    'label':np.random.randint(2, size=n_sample),
    'score': 900*np.random.random(size=n_sample),
    'term': 20201+np.random.randint(5, size=n_sample)
})


df_score.groupby('term').agg(total=('label', 'count'), 
                             bad=('label', 'sum'), 
                             bad_rate=('label', 'mean'))

#KS,GINI,AUC

from sklearn.metrics import roc_auc_score, roc_curve

def get_auc(ytrue, yprob):
    auc = roc_auc_score(ytrue, yprob)
    if auc < 0.5:
        auc = 1 - auc
    return auc

def get_ks(ytrue, yprob):
    fpr, tpr, thr = roc_curve(ytrue, yprob)
    ks = max(abs(tpr - fpr))
    return ks

def get_gini(ytrue, yprob):
    auc = get_auc(ytrue, yprob)
    gini = 2 * auc - 1
    return gini

df_metrics = pd.DataFrame({
    'auc': df_score.groupby('term').apply(lambda x: get_auc(x['label'], x['score'])),
    'ks': df_score.groupby('term').apply(lambda x: get_ks(x['label'], x['score'])),
    'gini': df_score.groupby('term').apply(lambda x: get_gini(x['label'], x['score']))
})


#PSI

df_score['score_bin'] = pd.cut(df_score['score'], [0, 500, 700, 800, 900])

df_total = pd.pivot_table(df_score, 
                          values='user_id', 
                          index='score_bin', 
                          columns=['term'], 
                          aggfunc="count", 
                          margins=True)
df_ratio = df_total.div(df_total.iloc[-1, :], axis=1)

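eps = np.finfo(np.float32).eps  # small constant to avoid log(0) and division by zero
lst_psi = list()
for idx in range(1, len(df_ratio.columns)-1):
    # all score bins (every row except 'All') for the previous and the current term
    last, cur = df_ratio.iloc[:-1, idx-1]+eps, df_ratio.iloc[:-1, idx]+eps
    psi = sum((cur-last) * np.log(cur / last))
    lst_psi.append(psi)
# DataFrame.append was removed in pandas 2.0, so use pd.concat to add the 'psi' row
psi_row = pd.Series([np.nan]+lst_psi+[np.nan],
                    index=df_ratio.columns,
                    name='psi')
df_psi = pd.concat([df_ratio, psi_row.to_frame().T])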


# Share of total users and bad-user rate

df_total = pd.pivot_table(df_score, 
                          values='user_id', 
                          index='score_bin', 
                          columns=['term'], 
                          aggfunc="count", 
                          margins=True)
df_ratio = df_total.div(df_total.iloc[-1, :], axis=1)

df_bad = pd.pivot_table(df_score[df_score['label']==1], 
                        values='user_id', 
                        index='score_bin', 
                        columns=['term'], 
                        aggfunc="count", 
                        margins=True)
df_bad_rate = df_bad/df_total

# Plotting

import matplotlib.pyplot as plt  # needed for plt.legend below
import seaborn as sns

colormap = sns.diverging_palette(130, 20, as_cmap=True)
df_ratio.drop('All').T.plot(kind='bar', stacked=True, colormap=colormap)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

colormap = sns.diverging_palette(130, 20, as_cmap=True)
df_bad_rate.drop('All').T.plot(kind='line', colormap=colormap)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

Reposted from: https://mp.weixin.qq.com/s/9o-u7zRM0mv-3wXPTv8oAQ
