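Credit Scorecard (Python)

Importing the data
Handling missing values and outliers
Feature visualization
Feature selection
Model training
Model evaluation
Converting model output to scores
Computing each user's total score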

I. Importing the data

# Import modules
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rc("font", family="SimHei", size=12)  # use a CJK-capable font so Chinese labels render

# Load the data
train = pd.read_csv('F:\\python\\Give-me-some-credit-master\\data\\cs-training.csv')

A quick look at the data

# Quick overview: dtypes and non-null counts
train.info()

'''
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
Unnamed: 0                              150000 non-null int64
SeriousDlqin2yrs                        150000 non-null int64
RevolvingUtilizationOfUnsecuredLines    150000 non-null float64
age                                     150000 non-null int64
NumberOfTime30-59DaysPastDueNotWorse    150000 non-null int64
DebtRatio                               150000 non-null float64
MonthlyIncome                           120269 non-null float64
NumberOfOpenCreditLinesAndLoans         150000 non-null int64
NumberOfTimes90DaysLate                 150000 non-null int64
NumberRealEstateLoansOrLines            150000 non-null int64
NumberOfTime60-89DaysPastDueNotWorse    150000 non-null int64
NumberOfDependents                      146076 non-null float64
dtypes: float64(4), int64(8)
memory usage: 13.7 MB

 
'''

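First and last three rows

# First three and last three rows
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
pd.concat([train.head(3), train.tail(3)])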

Checking the shape

# shape
train.shape  # (150000, 12): the unnamed id column still counts as a column here

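Rename the English columns to Chinese names to make them easier to read

states={'Unnamed: 0':'id',
        'SeriousDlqin2yrs':'好壞客戶',                        # good/bad customer (target)
        'RevolvingUtilizationOfUnsecuredLines':'可用額度比值',  # revolving utilization ratio
        'age':'年齡',                                         # age
        'NumberOfTime30-59DaysPastDueNotWorse':'逾期30-59天筆數',  # times 30-59 days past due
        'DebtRatio':'負債率',                                 # debt ratio
        'MonthlyIncome':'月收入',                             # monthly income
        'NumberOfOpenCreditLinesAndLoans':'信貸數量',          # open credit lines and loans
        'NumberOfTimes90DaysLate':'逾期90天筆數',              # times 90+ days late
        'NumberRealEstateLoansOrLines':'固定資產貸款量',        # real-estate loans or lines
        'NumberOfTime60-89DaysPastDueNotWorse':'逾期60-89天筆數',  # times 60-89 days past due
        'NumberOfDependents':'家屬數量'}                       # number of dependents
train.rename(columns=states,inplace=True)

# Use id as the index
train=train.set_index('id',drop=True)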
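
As a small illustration of what the index step does to the column count, here is a sketch on a made-up three-row frame (the values are invented, not taken from the dataset). `set_index` moves `id` out of the columns and into the index, which is why a frame that `info()` reports with 12 columns shows 11 columns afterwards.

```python
import pandas as pd

# Toy frame standing in for the raw CSV (values are invented for illustration).
toy = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "SeriousDlqin2yrs": [0, 1, 0],
    "age": [45, 40, 38],
})
toy = toy.rename(columns={"Unnamed: 0": "id"})

shape_before = toy.shape              # id still counts as a column
toy = toy.set_index("id", drop=True)
shape_after = toy.shape               # id is now the index, one column fewer

print(shape_before, shape_after)
```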

Descriptive statistics

# Descriptive statistics
train.describe()

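II. Handling missing values and outliers

1. Handling missing values

Inspecting missing values

# Missing count per column
train.isnull().sum()
# Missing share per column
train.isnull().sum()/len(train)
# Visualize the missing counts
missing=train.isnull().sum()
missing[missing>0].sort_values().plot.bar()  # keep only columns with missing values, sorted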

This shows that:

Monthly income (月收入) has 29731 missing values, a share of 0.198207.

Number of dependents (家屬數量) has 3924 missing values, a share of 0.026160.
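
The counts and shares can also be collected into one table. The following is a minimal sketch on an invented five-row frame; `summary` is a name introduced here for illustration. It relies on the fact that the mean of a boolean mask is the fraction of `True` values, so `isnull().mean()` gives the missing share directly.

```python
import numpy as np
import pandas as pd

# Invented toy data: two columns with gaps, one complete column.
toy = pd.DataFrame({
    "月收入": [5000.0, np.nan, 3000.0, np.nan, 8000.0],
    "家屬數量": [0.0, 1.0, np.nan, 2.0, 0.0],
    "年齡": [45, 40, 38, 30, 49],
})

summary = pd.DataFrame({
    "missing": toy.isnull().sum(),   # number of missing cells per column
    "ratio": toy.isnull().mean(),    # fraction of missing cells per column
})

# Keep only columns that actually have gaps, largest first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```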

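First make a copy so the original data stays intact, then handle the missing values.

# Keep the original data untouched
train_cp=train.copy()

# Fill missing monthly income with the column mean
train_cp.fillna({'月收入':train_cp['月收入'].mean()},inplace=True)
train_cp.isnull().sum()

# Drop the rows where the number of dependents is missing
train_cp=train_cp.dropna()
train_cp.shape   #(146076, 11)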
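
To make the two-step cleanup concrete, here is a sketch on invented data. It follows the same order as above (impute income, then drop the rows still missing a value). The median is computed only for comparison: it is a hypothetical alternative that this post does not use, shown because the mean is pulled up by a single very large income.

```python
import numpy as np
import pandas as pd

# Invented toy data with one extreme income.
toy = pd.DataFrame({
    "月收入": [3000.0, 4000.0, np.nan, 5000.0, 100000.0],
    "家屬數量": [0.0, np.nan, 1.0, 2.0, 0.0],
})

mean_fill = toy["月收入"].mean()      # pulled up by the 100000 value
median_fill = toy["月收入"].median()  # hypothetical alternative, not used in this post

filled = toy.fillna({"月收入": mean_fill})  # step 1: impute income only
cleaned = filled.dropna()                   # step 2: drop rows still missing 家屬數量

print(mean_fill, median_fill, cleaned.shape)
```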

2. Handling outliers

Inspecting outliers

# Inspect outliers
# Draw one box plot per column
for col in train_cp.columns:
    plt.boxplot(train_cp[col])
    plt.title(col)
    plt.show()
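
By default a matplotlib box plot draws as individual points everything farther than 1.5 times the interquartile range from the box. The sketch below counts those points for one column; `iqr_outlier_count` is a hypothetical helper and the ages are invented. It is only a screening aid under that assumption, and the thresholds actually applied in this post come from domain reasoning instead.

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series, whis: float = 1.5) -> int:
    """Count values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((s < lower) | (s > upper)).sum())

# Invented ages: 0 and 109 sit far outside the bulk of the values.
ages = pd.Series([0, 30, 32, 35, 40, 45, 48, 50, 109], name="年齡")
n_out = iqr_outlier_count(ages)
print(n_out)
```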

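A revolving utilization ratio (可用額度比值) greater than 1 is abnormal.

An age of 0 is also abnormal; in fact, anything under 18 could be treated as abnormal. The 30-59 days past due count (逾期30-59天筆數) has one extreme outlier.

Outlier handling removes records that are illogical or extreme: the utilization ratio should be below 1, an age of 0 is invalid, and past-due counts above 80 are extreme outliers. The code below also drops rows with 50 or more real-estate loans (固定資產貸款量) or a debt ratio (負債率) of 5000 or more. Filter these rows out and keep the rest.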

train_cp=train_cp[train_cp['可用額度比值']<1]
train_cp=train_cp[train_cp['年齡']>0]
train_cp=train_cp[train_cp['逾期30-59天筆數']<80]
train_cp=train_cp[train_cp['逾期60-89天筆數']<80]
train_cp=train_cp[train_cp['逾期90天筆數']<80]
train_cp=train_cp[train_cp['固定資產貸款量']<50]
train_cp=train_cp[train_cp['負債率']<5000]
train_cp.shape  #(141180, 11)
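
Chaining filters hides how many rows each rule removes. As a sketch on invented data, the rules can be kept as named boolean masks, counted one by one, and then combined. The thresholds are copied from the filters above; `rules`, `dropped` and `keep` are names introduced here for illustration.

```python
import pandas as pd

# Invented toy rows; each of rows 1, 2 and 3 breaks exactly one rule.
toy = pd.DataFrame({
    "可用額度比值": [0.2, 1.5, 0.7, 0.1, 0.9],
    "年齡": [45, 30, 0, 52, 61],
    "逾期30-59天筆數": [0, 1, 0, 98, 2],
})

# One named rule per column, True where the row is acceptable.
rules = {
    "可用額度比值": toy["可用額度比值"] < 1,
    "年齡": toy["年齡"] > 0,
    "逾期30-59天筆數": toy["逾期30-59天筆數"] < 80,
}

# Rows failing each rule, counted independently of the other rules.
dropped = {name: int((~mask).sum()) for name, mask in rules.items()}

# Keep only the rows that pass every rule.
keep = pd.concat(rules, axis=1).all(axis=1)
filtered = toy[keep]

print(dropped, filtered.shape)
```

The result is the same as applying the filters one after another, but the per-rule counts make it easier to see which threshold costs the most data.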

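III. Feature visualization

1. Univariate visualization

Good and bad customers

# Good and bad customers: counts, shares, and a bar chart
train_cp.info()
train_cp['好壞客戶'].value_counts()
train_cp['好壞客戶'].value_counts()/len(train_cp)
train_cp['好壞客戶'].value_counts().plot.bar()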

'''
0    132787
1      8393
Name: 好壞客戶, dtype: int64

0    0.940551
1    0.059449
Name: 好壞客戶, dtype: float64

'''

The target y is heavily imbalanced: only about 6% of the customers are bad.
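
To put numbers on the imbalance, the sketch below rebuilds the label column from the two counts printed above and derives three figures from it. The variable names are introduced here for illustration. The last figure is the accuracy of a rule that always predicts "good", which suggests that plain accuracy says little about a model on this data.

```python
import numpy as np
import pandas as pd

# Rebuild the 0/1 label column from the counts printed above.
y = pd.Series(np.repeat([0, 1], [132787, 8393]), name="好壞客戶")

share = y.value_counts(normalize=True)            # fraction of each class
bad_rate = y.mean()                               # labels are 0/1, so the mean is the bad rate
good_per_bad = (y == 0).sum() / (y == 1).sum()    # good customers per bad customer
baseline_accuracy = share.max()                   # always predicting the majority class

print(round(bad_rate, 6), round(good_per_bad, 1), round(baseline_accuracy, 6))
```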

Utilization ratio and debt ratio

# Utilization ratio and debt ratio
train_cp['可用額度比值'].plot.hist()
train_cp['負債率'].plot.hist()

# Debt ratios above 1 dominate the scale, so plot only values <= 1
a=train_cp['負債率']
a[a<=1].plot.hist()

Past-due counts: 30-59 days, 90 days, and 60-89 days

# Past-due counts side by side: 30-59 days, 90 days, 60-89 days
for i,col in enumerate(['逾期30-59天筆數','逾期90天筆數','逾期60-89天筆數']):
    plt.subplot(1,3,i+1)
    train_cp[col].value_counts().plot.bar()
    plt.title(col)

# Or draw each one on its own
train_cp['逾期30-59天筆數'].value_counts().plot.bar()
train_cp['逾期90天筆數'].value_counts().plot.bar()
train_cp['逾期60-89天筆數'].value_counts().plot.bar()

Age: roughly normally distributed

# Age
train_cp['年齡'].plot.hist()

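Monthly income:

# Monthly income
train_cp['月收入'].plot.hist()
# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True is the replacement
sns.histplot(train_cp['月收入'], kde=True, stat="density")
# Extreme outliers dominate the scale, so plot only incomes up to 50,000
a=train_cp['月收入']
a[a<=50000].plot.hist()

# The range up to 50,000 is still sparse at the top, so cap at 20,000 instead
a=train_cp['月收入']
a[a<=20000].plot.hist()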

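Number of credit lines and loans:

# Number of credit lines and loans
train_cp['信貸數量'].value_counts().plot.bar()
# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True is the replacement
sns.histplot(train_cp['信貸數量'], kde=True, stat="density")

Number of real-estate loans:

# Number of real-estate loans
train_cp['固定資產貸款量'].value_counts().plot.bar()
sns.histplot(train_cp['固定資產貸款量'], kde=True, stat="density")

Number of dependents:

# Number of dependents
train_cp['家屬數量'].value_counts().plot.bar()
sns.histplot(train_cp['家屬數量'], kde=True, stat="density")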

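2. Visualizing each variable against y

# Each variable against y
# Utilization ratio, debt ratio, age and monthly income are continuous, so they need binning first
# Utilization ratio: five equal-width bins
train_cp['可用額度比值_cut']=pd.cut(train_cp['可用額度比值'],5)
pd.crosstab(train_cp['可用額度比值_cut'],train_cp['好壞客戶']).plot(kind="bar")
a=pd.crosstab(train_cp['可用額度比值_cut'],train_cp['好壞客戶'])
a['壞使用者佔比']=a[1]/(a[0]+a[1])  # share of bad customers in each bin
a['壞使用者佔比'].plot()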
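
The per-bin bad rate can be reached by two routes, sketched below on invented data with two bins instead of five so the numbers are easy to check by hand. The first route mirrors the crosstab code above. The second uses the fact that the mean of a 0/1 label inside a bin is the same ratio. The column name `bin` and the variable names are introduced here for illustration.

```python
import pandas as pd

# Invented toy data: low utilization is mostly good, high utilization mostly bad.
toy = pd.DataFrame({
    "可用額度比值": [0.05, 0.10, 0.30, 0.45, 0.55, 0.70, 0.85, 0.95],
    "好壞客戶": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Two equal-width bins over the observed range (the code above uses 5).
toy["bin"] = pd.cut(toy["可用額度比值"], 2)

# Route 1: crosstab, then bad / (good + bad).
counts = pd.crosstab(toy["bin"], toy["好壞客戶"])
rate_crosstab = counts[1] / (counts[0] + counts[1])

# Route 2: the mean of the 0/1 label within each bin.
rate_groupby = toy.groupby("bin", observed=False)["好壞客戶"].mean()

print(rate_crosstab.tolist(), rate_groupby.tolist())
```

A bad rate that rises steadily from bin to bin, as in this toy case, is the kind of pattern the line plot above is meant to reveal.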
