【17th Datawhale cohort | Beginner's Guide to Financial Risk Control - Loan Default Prediction】Task02 check-in: Exploratory Data Analysis 【pandas_profiling raised an error while generating the data report; the fix will be covered in a separate post】
阿新 • Published: 2020-09-19
Beginner's Guide to Financial Risk Control - Loan Default Prediction, Task02: Exploratory Data Analysis
Goals of Task02:
- Get familiar with the dataset as a whole (outliers, missing values, etc.) and judge whether it is ready for subsequent machine-learning or deep-learning modeling.
- Understand the relationships among variables, and between each variable and the prediction target.
- Prepare for feature engineering.
Preparing the data
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
file_path = 'E:\\阿里雲開發者-天池比賽\\02_零基礎入門金融風控_貸款違約預測\\'
train_file_path = file_path + 'train.csv'
testA_file_path = file_path + 'testA.csv'
# Use '-' instead of ':' in the time part, since ':' is invalid in Windows file names.
now = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
output_path = 'E:\\PycharmProjects\\TianChiProject\\00_山楓葉紛飛\\competitions\\002_financial_risk\\profiling\\'
data_train = pd.read_csv(train_file_path)
data_test_a = pd.read_csv(testA_file_path)
# print('Train Data shape (rows * cols):', data_train.shape)
# print('TestA Data shape (rows * cols):', data_test_a.shape)
print('Quick observations:\n'
      'target column: isDefault\n'
      "testA has two extra columns compared with train: 'n2.2' 'n2.3'")
Output
Quick observations:
target column: isDefault
testA has two extra columns compared with train: 'n2.2' 'n2.3'
2.3.0 Use the nrows parameter to read only the first N rows of a file
# a. Read only a sample of the rows
# data_train_sample = pd.read_csv(testA_file_path, nrows=5)

# b. Chunked reading
# Set the chunksize parameter to control the size of each iteration
# chunker = pd.read_csv(testA_file_path, chunksize=5000)
# for item in chunker:
#     print(type(item))  # <class 'pandas.core.frame.DataFrame'>
#     print(len(item))   # 5000
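The commented-out snippet above can be demonstrated end to end without the competition files. The sketch below uses a small in-memory CSV (the `id`/`loanAmnt` columns and values are hypothetical) to show what `nrows` and `chunksize` actually return:

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for testA.csv (hypothetical data).
csv_text = "id,loanAmnt\n" + "\n".join(f"{i},{1000 + i}" for i in range(12))

# nrows: read only the first N data rows into a single DataFrame.
sample = pd.read_csv(io.StringIO(csv_text), nrows=5)
print(sample.shape)  # (5, 2)

# chunksize: iterate over the file in fixed-size DataFrame chunks;
# the last chunk holds whatever rows remain.
chunk_lens = [len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5)]
print(chunk_lens)  # [5, 5, 2]
```

`nrows` is handy for a quick peek at a huge file; `chunksize` keeps memory bounded when the full file would not fit in RAM.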
2.3.1 Overall view of the data
"""
a. Read the dataset and check its size and original feature dimensionality;
b. Get familiar with the data types via info;
c. Take a rough look at the basic statistics of each feature;
"""
print('data_train.shape', data_train.shape) # (800000, 47)
print('data_train.columns', data_train.columns)
print('data_test_a.shape', data_test_a.shape) # (200000, 48)
Output
data_train.shape (800000, 47)
data_train.columns Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
dtype='object')
data_test_a.shape (200000, 48)
2.3.2 Missing values and constant (single-valued) features
"""
a. Check the missing-value situation
b. Check features with a single unique value
"""
fea_null_ratio = (data_train.isnull().sum() / len(data_train)).to_dict()
fea_null_more_than_5pct = {}
have_null_cnt = 0
have_null_arr = []
for key, value in fea_null_ratio.items():
    if value > 0.05:
        fea_null_more_than_5pct[key] = value
    if value > 0:
        have_null_cnt += 1
        have_null_arr.append(key)
print('Number of columns with missing values: {}, namely {}'.format(have_null_cnt, have_null_arr))
print('Features with more than 5% missing values =', fea_null_more_than_5pct)
Number of columns with missing values: 22, namely ['employmentTitle', 'employmentLength', 'postCode', 'dti', 'pubRecBankruptcies', 'revolUtil', 'title', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
Features with more than 5% missing values = {'employmentLength': 0.05849875, 'n0': 0.0503375, 'n1': 0.0503375, 'n2': 0.0503375, 'n2.1': 0.0503375, 'n5': 0.0503375, 'n6': 0.0503375, 'n7': 0.0503375, 'n8': 0.05033875, 'n9': 0.0503375, 'n11': 0.08719, 'n12': 0.0503375, 'n13': 0.0503375, 'n14': 0.0503375}
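The missing-ratio scan above boils down to `isnull().sum() / len(df)` plus two filters. A minimal, self-contained sketch on a toy frame (column names and values hypothetical; a 30% threshold stands in for the 5% used above so the filter actually triggers):

```python
import numpy as np
import pandas as pd

# Toy frame illustrating the missing-ratio scan (hypothetical data).
df = pd.DataFrame({
    'a': [1, 2, 3, 4, np.nan],       # 20% missing
    'b': [1, np.nan, np.nan, 4, 5],  # 40% missing
    'c': [1, 2, 3, 4, 5],            # complete
})

# Per-column fraction of missing values.
null_ratio = (df.isnull().sum() / len(df)).to_dict()

# Columns with any missing values, and columns above the threshold.
have_null = [col for col, r in null_ratio.items() if r > 0]
more_than_30pct = {col: r for col, r in null_ratio.items() if r > 0.3}
print(have_null)        # ['a', 'b']
print(more_than_30pct)  # {'b': 0.4}
```

Columns above the threshold are candidates for imputation or removal; columns with only a trace of missing values can often be filled directly.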
Visualizing the NaN ratios
missing = data_train.isnull().sum() / len(data_train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
plt.show()
# Print the missing series
print('missing type', type(missing))
print('missing :\n', missing)
Output
missing type <class 'pandas.core.series.Series'>
missing :
employmentTitle 0.000001
postCode 0.000001
title 0.000001
dti 0.000299
pubRecBankruptcies 0.000506
revolUtil 0.000664
n10 0.041549
n4 0.041549
n12 0.050338
n9 0.050338
n7 0.050338
n6 0.050338
n2.1 0.050338
n13 0.050338
n2 0.050338
n1 0.050338
n0 0.050338
n5 0.050338
n14 0.050338
n8 0.050339
employmentLength 0.058499
n11 0.087190
dtype: float64
2.3.2.1 Features that take only a single value in the training and test sets
one_value_fea = []
for col in data_train.columns:
    if data_train[col].nunique() <= 1:
        one_value_fea.append(col)
print('train set one_value_fea=', one_value_fea)
one_value_fea_test = []
for col in data_test_a.columns:
    if data_test_a[col].nunique() <= 1:
        one_value_fea_test.append(col)
print('test set one_value_fea_test=', one_value_fea_test)
Output
train set one_value_fea= ['policyCode']
test set one_value_fea_test= ['policyCode']
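The constant-column check relies on `nunique()`, which by default ignores NaN. A minimal sketch on a toy frame (values hypothetical) that mirrors the `policyCode` finding above:

```python
import pandas as pd

# Toy frame: 'policyCode' is constant, like in the competition data
# (the actual values here are hypothetical).
df = pd.DataFrame({
    'policyCode': [1.0, 1.0, 1.0],
    'loanAmnt': [5000.0, 12000.0, 3000.0],
})

# nunique() counts distinct non-NaN values, so a constant column
# (or an all-NaN column) yields <= 1 and carries no information.
one_value_fea = [col for col in df.columns if df[col].nunique() <= 1]
print(one_value_fea)  # ['policyCode']
```

Such columns can be dropped before modeling, since a feature with a single value cannot help separate the classes. Pass `dropna=False` to `nunique()` if a column that is "constant except for NaN" should still count as two values.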
2.3.3 Digging deeper: inspecting the data types
"""
a. Categorical features
b. Numerical features
    discrete numerical features
    continuous numerical features
"""
print('data_train.head():\n', data_train.head())
print('data_train.tail():\n', data_train.tail())
print('data_train.info():\n', data_train.info())
print('Rough overview of the basic statistics of each feature:\n',
      data_train.describe())
# DataFrame.append was removed in pandas 2.0; pd.concat is the portable way to stack rows.
print('First and last 5 rows concatenated:\n', pd.concat([data_train.head(5), data_train.tail(5)]))
Output
data_train.head():
id loanAmnt term interestRate installment grade subGrade \
0 0 35000.0 5 19.52 917.97 E E2
1 1 18000.0 5 18.49 461.90 D D2
2 2 12000.0 5 16.99 298.17 D D3
3 3 11000.0 3 7.26 340.96 A A4
4 4 3000.0 3 12.99 101.07 C C2
employmentTitle employmentLength homeOwnership ... n5 n6 n7 \
0 320.0 2 years 2 ... 9.0 8.0 4.0
1 219843.0 5 years 0 ... NaN NaN NaN
2 31698.0 8 years 0 ... 0.0 21.0 4.0
3 46854.0 10+ years 1 ... 16.0 4.0 7.0
4 54.0 NaN 1 ... 4.0 9.0 10.0
n8 n9 n10 n11 n12 n13 n14
0 12.0 2.0 7.0 0.0 0.0 0.0 2.0
1 NaN NaN 13.0 NaN NaN NaN NaN
2 5.0 3.0 11.0 0.0 0.0 0.0 4.0
3 21.0 6.0 9.0 0.0 0.0 0.0 1.0
4 15.0 7.0 12.0 0.0 0.0 0.0 4.0
[5 rows x 47 columns]
data_train.tail():
id loanAmnt term interestRate installment grade subGrade \
799995 799995 25000.0 3 14.49 860.41 C C4
799996 799996 17000.0 3 7.90 531.94 A A4
799997 799997 6000.0 3 13.33 203.12 C C3
799998 799998 19200.0 3 6.92 592.14 A A4
799999 799999 9000.0 3 11.06 294.91 B B3
employmentTitle employmentLength homeOwnership ... n5 n6 \
799995 2659.0 7 years 1 ... 6.0 2.0
799996 29205.0 10+ years 0 ... 15.0 16.0
799997 2582.0 10+ years 1 ... 4.0 26.0
799998 151.0 10+ years 0 ... 10.0 6.0
799999 13.0 5 years 0 ... 3.0 4.0
n7 n8 n9 n10 n11 n12 n13 n14
799995 12.0 13.0 10.0 14.0 0.0 0.0 0.0 3.0
799996 2.0 19.0 2.0 7.0 0.0 0.0 0.0 0.0
799997 4.0 10.0 4.0 5.0 0.0 0.0 1.0 4.0
799998 12.0 22.0 8.0 16.0 0.0 0.0 0.0 5.0
799999 4.0 8.0 3.0 7.0 0.0 0.0 0.0 2.0
[5 rows x 47 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n2.1 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 271.6+ MB
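A natural follow-up to the `info()` output above is splitting features into numerical and categorical groups by dtype, which `select_dtypes` does in one line. A minimal sketch on a toy frame (values hypothetical) mixing the dtypes seen above:

```python
import pandas as pd

# Toy frame mixing the dtypes from info(): float64, int64, object
# (the values are hypothetical).
df = pd.DataFrame({
    'loanAmnt': [35000.0, 18000.0],  # float64 -> numerical
    'term': [5, 3],                  # int64   -> numerical
    'grade': ['E', 'D'],             # object  -> categorical
})

# object columns are the categorical candidates; everything else is numerical.
numerical_fea = list(df.select_dtypes(exclude='object').columns)
category_fea = list(df.select_dtypes(include='object').columns)
print(numerical_fea)  # ['loanAmnt', 'term']
print(category_fea)   # ['grade']
```

For this dataset that split would put `grade`, `subGrade`, `employmentLength`, `issueDate`, and `earliesCreditLine` in the object group; note that some int columns (e.g. `homeOwnership`) are really encoded categories, so the dtype split is only a first cut.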
...