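Financial Risk Control Loan Default Prediction: EDA (Task 1)

阿新 • Published: 2020-09-18

Inspecting the columns of train and test

train has 800,000 rows and 47 columns; testa has 200,000 rows and 48 columns.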

>>>print(train.shape)
>>>print(testa.shape)
(800000, 47)
(200000, 48)

testa contains the columns n2.2 and n2.3, which do not exist in train.

The target column is isDefault, which exists only in train.

>>>print('testa have no column : ', set(train.columns).difference(set(testa.columns)))
>>>print('train have no column : ', set(testa.columns).difference(set(train.columns)))
testa have no column :  {'isDefault'}
train have no column :  {'n2.2', 'n2.3'}

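Checking missing and duplicate values

1. In train, 22 columns contain missing values. In testa, 21 columns contain missing values (see the two tables below).

2. train: every anonymous feature other than n3 has missing values, each with more than 30,000. n11 has the most (69,752). The feature with the second most is employmentLength (years of employment) with 46,799. The remaining features have missing counts in the hundreds or in single digits.

3. testa: every anonymous feature other than n3 has missing values, and n11 has the most (17,575). The feature with the second most is employmentLength (years of employment) with 11,742. The remaining non-anonymous features have missing counts in the hundreds or tens.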

train column	Missing count
n11	69752
employmentLength	46799
n8	40271
n14	40270
n5	40270
n0	40270
n1	40270
n2	40270
n13	40270
n2.1	40270
n6	40270
n7	40270
n9	40270
n12	40270
n4	33239
n10	33239
revolUtil	531
pubRecBankruptcies	405
dti	239
title	1
postCode	1
employmentTitle	1

testa column	Missing count
n11	17575
employmentLength	11742
n13	10111
n0	10111
n1	10111
n2	10111
n2.1	10111
n2.2	10111
n2.3	10111
n14	10111
n5	10111
n6	10111
n7	10111
n8	10111
n9	10111
n12	10111
n10	8394
n4	8394
revolUtil	127
pubRecBankruptcies	116
dti	61
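The post does not show how the two tables above were produced. One common way to get such a summary is sketched below, on a tiny made-up frame rather than the real data; the helper name `missing_summary` and the `demo` frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.Series:
    """Return the missing-value count of each column that has any, largest first."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Tiny made-up frame standing in for train / testa (hypothetical values)
demo = pd.DataFrame({
    "n11": [np.nan, np.nan, 1.0, np.nan],
    "employmentLength": ["2 years", None, "5 years", None],
    "dti": [10.5, 3.2, np.nan, 8.8],
    "loanAmnt": [1000.0, 2000.0, 3000.0, 4000.0],
})

summary = missing_summary(demo)
print(summary)
print("columns with missing values:", len(summary))
```

Calling `missing_summary(train)` and `missing_summary(testa)` should give one row per affected column, which also makes it easy to check the column counts quoted in item 1.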

4. Duplicate values
Neither train nor testa contains duplicate rows.

>>>print(train.duplicated().sum())
>>>print(testa.duplicated().sum())
0
0

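Checking the data types of the dataset

The data types are checked here so that continuous and categorical variables can be separated later.

The columns fall into two groups: numeric dtypes and the object dtype.

object columns: grade, subGrade, employmentLength, issueDate, earliesCreditLine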

--------------input---------------------
num_types = []
object_types = []
# Split the column names by dtype: object columns vs. everything else (numeric)
for column, dtype in train.dtypes.items():
    if dtype == 'object':
        object_types.append(column)
    else:
        num_types.append(column)
print('num_types:',num_types)
print('-'*24)
print('object_types:',object_types)
---------------output-----------------------------
num_types: ['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
------------------------
object_types: ['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
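As an alternative to looping over `dtypes`, pandas offers `DataFrame.select_dtypes`. The sketch below uses a small made-up frame (the `demo` data and the variable names `num_cols` and `other_cols` are assumptions for illustration) and splits on "numeric vs. everything else", which for this dataset should correspond to the numeric/object split shown above.

```python
import pandas as pd

# Hypothetical miniature of the loan data: two numeric and two text columns
demo = pd.DataFrame({
    "loanAmnt": [1000.0, 2000.0],
    "term": [3, 5],
    "grade": ["A", "B"],
    "issueDate": ["2014-07-01", "2012-08-01"],
})

# Numeric columns (ints and floats) vs. all remaining columns
num_cols = demo.select_dtypes(include="number").columns.tolist()
other_cols = demo.select_dtypes(exclude="number").columns.tolist()

print("numeric:", num_cols)
print("non-numeric:", other_cols)
```

Note that a numeric dtype does not make a column continuous: several of the numeric columns listed above (for example `homeOwnership` or `purpose`) look like encoded categories, so a second pass by number of unique values is usually still needed.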

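Correlation between the anonymous features and the label isDefault

The heatmap of correlation coefficients shows that the anonymous features are generally only weakly correlated with the label, but some anonymous features are strongly correlated with each other. This should be kept in mind during later data processing.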

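import matplotlib.pyplot as plt
import seaborn as sns

# Collect the anonymous features (columns whose names start with 'n')
nList = []
for column in train.columns:
    if column.startswith('n'):
        nList.append(column)
# Append the label isDefault (only once, so the column is not duplicated)
nList.append('isDefault')

plt.subplots(figsize=(20, 20))
# vmax=1 pins the top of the colour scale to the maximum possible correlation
sns.heatmap(train[nList].corr(), annot=True, square=True, cmap='Blues', vmax=1)
plt.show()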
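Reading strongly correlated pairs off a 20x20 heatmap is error-prone, so it can help to list them explicitly. The following is a minimal sketch on synthetic data; the helper name `strong_pairs`, the 0.8 threshold, and the `demo` columns are assumptions for illustration, not values from the original analysis.

```python
import numpy as np
import pandas as pd


def strong_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic stand-in for the anonymous features
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "n_a": a,
    "n_b": a * 2 + rng.normal(scale=0.01, size=200),  # nearly a rescaled copy of n_a
    "n_c": rng.normal(size=200),                       # independent noise
})

pairs = strong_pairs(demo)
print(pairs)
```

On the real data this could be called as `strong_pairs(train[nList])`; the pairs it returns are candidates for dropping or combining during feature engineering.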

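Correlation between the non-anonymous numeric features and the label isDefault

The computation shows that the feature most strongly correlated with the label is interestRate (loan interest rate). This is not what I expected: I had assumed the loan amount would be the most strongly correlated, which was probably just a mistaken intuition.

In addition, some features are fairly strongly correlated with each other, for example loanAmnt (loan amount) and installment (installment amount), or totalAcc (total number of credit lines currently in the borrower's file) and openAcc (number of open credit lines). These also need attention in later feature processing.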

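num_data = train.drop(columns=object_types)
# nList already contains isDefault, so the label is dropped here and added back below
drop_nList_data = num_data.drop(columns=nList)
drop_nList_data['isDefault'] = train.isDefault

# Correlation of each non-anonymous numeric feature with the label, in descending order
drop_nList_data.corr().isDefault.sort_values(ascending=False)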
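A descending sort on the signed coefficient puts strong negative correlations at the bottom of the list, where they are easy to overlook. If the goal is "strongest relationship with the label", ranking by absolute value may be more appropriate. A small sketch on synthetic data follows; the column names and noise levels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic binary label plus three made-up features
rng = np.random.default_rng(1)
label = rng.integers(0, 2, size=500)
demo = pd.DataFrame({
    "isDefault": label,
    "pos_feature": label + rng.normal(scale=0.5, size=500),   # positively related
    "neg_feature": -label + rng.normal(scale=0.3, size=500),  # negatively related
    "noise": rng.normal(size=500),                             # unrelated
})

# Correlation with the label, excluding the label's correlation with itself
corr_with_label = demo.corr()["isDefault"].drop("isDefault")

# Rank by strength regardless of sign
ranked = corr_with_label.abs().sort_values(ascending=False)
print(ranked)
```

Here the negatively related feature ranks first even though its signed coefficient is the smallest of the three.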

Ratio of label 0 to label 1

The two classes are imbalanced: the ratio of 0 to 1 is close to 8:2.

>>>train.groupby(['isDefault']).count().id/train.shape[0]
isDefault
0    0.800488
1    0.199513
Name: id, dtype: float64

(train.groupby(['isDefault']).count().id/train.shape[0]).plot(kind='bar')
plt.title('label rate')
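The same proportions can also be obtained without `groupby`, using `Series.value_counts(normalize=True)`. The sketch below runs on a ten-row made-up label column (an assumption for illustration) chosen to mirror the roughly 8:2 split.

```python
import pandas as pd

# Hypothetical label column: eight non-defaults and two defaults
demo = pd.DataFrame({"isDefault": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})

# Share of each class, indexed by label value
ratio = demo["isDefault"].value_counts(normalize=True).sort_index()
print(ratio)

# Majority-to-minority ratio, a quick single-number view of the imbalance
imbalance = ratio.loc[0] / ratio.loc[1]
print("0:1 ratio =", round(imbalance, 2))
```

With an imbalance of this size, accuracy alone is a weak metric (always predicting 0 would already score about 80%), which is one reason to look at the label ratio before modelling.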
