Building a Loan Overdue Prediction Model, Part 5: Data Preprocessing

阿新 • Published: 2019-01-14

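Contents

I. Libraries
II. Reading the Data
III. Data Cleaning: Removing Irrelevant and Duplicate Data
IV. Data Cleaning: Type Conversion

1. Splitting by data type
2. Handling missing values
3. Handling outliers
4. Encoding categorical features
5. Processing date features
6. Combining features

V. Train/Test Split
VI. Model Building
VII. Model Evaluation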

Data download (different from the earlier posts in this series): https://pan.baidu.com/s/1G1b2QJjYkkk7LDfGorbj5Q

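Goal: the dataset is financial data (not anonymized), and the aim is to predict whether a loan customer will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.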

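Task: data type conversion and missing-value handling (try different fill strategies and compare the results), plus any other data exploration worth borrowing.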

I. Libraries

# -*- coding:utf-8 -*-
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

II. Reading the Data

file_path = "data.csv" 

data = pd.read_csv(file_path, encoding='gbk')
print(data.head())
print(data.shape)

Output

   Unnamed: 0   custid       ...        latest_query_day loans_latest_day
0           5  2791858       ...                    12.0             18.0
1          10   534047       ...                     4.0              2.0
2          12  2849787       ...                     2.0              6.0
3          13  1809708       ...                     2.0              4.0
4          14  2499829       ...                    22.0            120.0

[5 rows x 90 columns]
(4754, 90)

Problem encountered:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 0: invalid start byte

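Cause: the file is not UTF-8 encoded. The byte 0xbf can only appear as a continuation byte in UTF-8, never as the first byte of a sequence, so the default 'utf-8' decoder fails on it.
Solution: pass the encoding explicitly. Tested: encoding='gbk' and encoding='ISO-8859-1' both load the file. Note that ISO-8859-1 accepts any byte sequence without error, so Chinese text read that way may come out garbled; 'gbk' is the safer choice when the file contains Chinese.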
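The fix above can be wrapped in a small helper that tries several encodings in turn. This is only a sketch: the helper name `read_csv_with_fallback`, the candidate encoding list, and the tiny demo file are assumptions made for illustration and are not part of the original workflow.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "gbk")):
    """Try each encoding in order; return (DataFrame, encoding that worked)."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file; try the next one
    raise ValueError(f"none of {encodings} could decode {path}")


# Build a tiny GBK-encoded file that stands in for data.csv.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="gbk") as f:
        f.write("id_name,status\n张三,0\n李四,1\n")
    demo_df, used_encoding = read_csv_with_fallback(demo_path)

print(used_encoding)
print(demo_df)
```

ISO-8859-1 is deliberately left out of the candidate list, because it never raises a decode error and would hide a wrong guess.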

III. Data Cleaning: Removing Irrelevant and Duplicate Data

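## Drop columns that identify the individual
data.drop(['custid', 'trade_no', 'bank_card_no', 'id_name'], axis=1, inplace=True)

## Drop columns whose value is the same in every row
X = data.drop(labels='status', axis=1)
constant_cols = []
for col in X:
    # a single unique value means the column carries no information
    if len(X[col].unique()) == 1:
        constant_cols.append(col)
X.drop(constant_cols, axis=1, inplace=True)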
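The same constant-column check can be written without an explicit loop. The sketch below uses a made-up three-column frame; the column names `channel` and `amount` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy data: "channel" has the same value in every row.
demo = pd.DataFrame({
    "channel": ["xs", "xs", "xs"],
    "amount": [10.0, 25.0, 10.0],
    "status": [0, 1, 0],
})

features = demo.drop(columns="status")

# nunique(dropna=False) counts NaN as a value, which matches len(unique()).
n_unique = features.nunique(dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()
features = features.drop(columns=constant_cols)

print(constant_cols)
print(features.columns.tolist())
```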

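IV. Data Cleaning: Type Conversion

1. Splitting by data type

Split the columns by data type: numeric, non-numeric, and the label.
Usage: a pandas DataFrame has a select_dtypes() method that selects the columns of given dtypes.
Parameters: include lists the dtypes to keep; exclude lists the dtypes to leave out.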

X_num = X.select_dtypes(include='number').copy()
X_str = X.select_dtypes(exclude='number').copy()
y = data['status']
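To see what select_dtypes() returns, here is a minimal sketch on a made-up frame. The column names echo the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Invented values; only the mix of dtypes matters here.
demo = pd.DataFrame({
    "loans_count": [3, 7],
    "avg_price": [120.5, 80.0],
    "reg_preference_for_trad": ["tier_1_city", "tier_3_city"],
    "latest_query_time": ["2018-04-25", "2018-05-03"],
})

num_part = demo.select_dtypes(include="number")  # int and float columns
str_part = demo.select_dtypes(exclude="number")  # everything else

print(num_part.columns.tolist())
print(str_part.columns.tolist())
```

Date columns that were read as text land in the non-numeric part, which is why they are handled separately later on.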

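2. Handling missing values

Ways to detect missing values: the count of missing entries, or the missing rate.

# Use the missing rate (it shows the proportion) and sort in descending order with ascending=False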
X_num_miss = (X_num.isnull().sum() / len(X_num)).sort_values(ascending=False)
print(X_num_miss.head())
print('----------' * 5)     
X_str_miss = (X_str.isnull().sum() / len(X_str)).sort_values(ascending=False)
print(X_str_miss.head())

Output

student_feature                     0.630627
cross_consume_count_last_1_month    0.089609
latest_one_month_apply              0.063946
query_finance_count                 0.063946
latest_six_month_apply              0.063946
dtype: float64
--------------------------------------------------
latest_query_time          0.063946
loans_latest_time          0.062474
reg_preference_for_trad    0.000421
dtype: float64

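Analysis: student_feature has the highest missing rate, 63.0627% (above 50%); every other feature is below 10%.

High-missing-rate feature: candidate methods are EM imputation and multiple imputation.
==> Both methods are fairly involved, so for now the missing values are treated as a category of their own and filled with 0.
Other features: mean, median, mode…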

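## Fill student_feature with 0 (missing is treated as its own category)
# Restrict the fill to this column; filling the whole frame with 0
# would leave nothing for the mode fill below to act on.
X_num['student_feature'] = X_num['student_feature'].fillna(0)

## Fill the remaining features with the mode
X_num.fillna(X_num.mode().iloc[0, :], inplace=True)
X_str.fillna(X_str.mode().iloc[0, :], inplace=True)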
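Since the task asks for trying different fill strategies, a side-by-side comparison on one column can help. This is a minimal sketch on an invented column (`demo_feature` is a hypothetical name), not a result from the real dataset.

```python
import numpy as np
import pandas as pd

# Invented column with one missing value and one large value.
col = pd.Series([1.0, 2.0, 2.0, np.nan, 15.0], name="demo_feature")

filled = pd.DataFrame({
    "mean": col.fillna(col.mean()),
    "median": col.fillna(col.median()),
    "mode": col.fillna(col.mode().iloc[0]),
    "zero": col.fillna(0),
})

# Row 3 is the one that was missing; compare what each strategy put there.
print(filled.loc[3])
```

The mean is pulled toward the large value while the median and mode are not, which is one reason skewed financial features are often filled with the median or mode.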

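3. Handling outliers

Outliers are capped using the interquartile range (IQR) from the box plot: values more than 1.5 × IQR beyond the quartiles are clipped to that boundary.

## Outlier handling: cap values using the box-plot interquartile range (IQR)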
def iqr_outlier(x, thre = 1.5):
    x_cl = x.copy()
    q25, q75 = x.quantile(q = [0.25, 0.75])
    iqr = q75 - q25
    top = q75 + thre * iqr
    bottom = q25 - thre * iqr
    x_cl[x_cl > top] = top
    x_cl[x_cl < bottom] = bottom
    return  x_cl
X_num_cl = pd.DataFrame()
for col in X_num.columns:
    X_num_cl[col] = iqr_outlier(X_num[col])
X_num = X_num_cl
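A worked example makes the capping rule concrete. The sketch below uses five invented values and pandas' built-in clip(), which gives the same result as assigning the bounds by hand.

```python
import pandas as pd

# Invented values: 100.0 is the obvious outlier.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

q25, q75 = values.quantile(0.25), values.quantile(0.75)
iqr = q75 - q25
lower = q25 - 1.5 * iqr
upper = q75 + 1.5 * iqr

# Values outside [lower, upper] are replaced by the nearest bound.
clipped = values.clip(lower=lower, upper=upper)

print(lower, upper)
print(clipped.tolist())
```

Here the quartiles are 2.0 and 4.0, so the IQR is 2.0 and the bounds are -1.0 and 7.0; only the value 100.0 is changed, to 7.0.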

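4. Encoding categorical features

Ordinal encoding: for categories that have an inherent order.
One-hot encoding: for categories that have no order.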

X_str_oh = pd.get_dummies(X_str['reg_preference_for_trad'])
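The two encodings can be compared on a small made-up frame. The columns `city_tier` and `channel`, their values, and the order chosen in `tier_order` are all assumptions for illustration; the article itself only one-hot encodes reg_preference_for_trad.

```python
import pandas as pd

# Hypothetical columns: one ordered category, one unordered.
demo = pd.DataFrame({
    "city_tier": ["tier_1", "tier_3", "tier_2", "tier_1"],
    "channel": ["app", "web", "app", "store"],
})

# Ordinal encoding: an explicit mapping keeps the order meaningful.
tier_order = {"tier_1": 1, "tier_2": 2, "tier_3": 3}
demo["city_tier_code"] = demo["city_tier"].map(tier_order)

# One-hot encoding: one indicator column per category, no implied order.
channel_oh = pd.get_dummies(demo["channel"], prefix="channel")

print(demo["city_tier_code"].tolist())
print(channel_oh)
```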

5. Processing date features

X_date = pd.DataFrame()
X_date['latest_query_time_year'] = pd.to_datetime(X_str['latest_query_time']).dt.year
X_date['latest_query_time_month'] = pd.to_datetime(X_str['latest_query_time']).dt.month
X_date['latest_query_time_weekday'] = pd.to_datetime(X_str['latest_query_time']).dt.weekday
X_date['loans_latest_time_year'] = pd.to_datetime(X_str['loans_latest_time']).dt.year
X_date['loans_latest_time_month'] = pd.to_datetime(X_str['loans_latest_time']).dt.month
X_date['loans_latest_time_weekday'] = pd.to_datetime(X_str['loans_latest_time']).dt.weekday
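The extraction above can also be done by parsing each column once and reading the parts from the parsed result. The sketch below uses invented strings, including one unparseable entry, and assumes that turning bad values into missing ones with errors="coerce" is acceptable.

```python
import pandas as pd

# Invented strings; the last one is deliberately not a date.
raw = pd.Series(["2018-04-25", "2018-05-03", "not a date"])

# Parse once; errors="coerce" turns unparseable entries into NaT.
parsed = pd.to_datetime(raw, errors="coerce")

parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "weekday": parsed.dt.weekday,  # Monday is 0, Sunday is 6
})

print(parts)
```

Rows that failed to parse end up as NaN in every derived column, so they would still need a fill strategy before modelling.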

6. Combining features

X = pd.concat([X_num, X_str_oh, X_date], axis=1, sort=False)
print(X.shape)

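V. Train/Test Split

## Preprocessing: standardization (disabled here)
# fit() alone returns the scaler object; fit_transform() returns the scaled data
# X_std = StandardScaler().fit_transform(X)

## Split into training and test sets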
X_std_train, X_std_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2019)
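If standardization is switched on, one common pattern is to fit the scaler on the training split only and reuse it for the test split, so that no test-set statistics leak into training. The following is a sketch on random data under that assumption; the variable names are illustrative and this is not the configuration used for the results in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in data: 200 rows, 3 features, binary label.
rng = np.random.default_rng(2019)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2019
)

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_tr)
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)  # same parameters, no refitting

print(X_tr_std.mean(axis=0).round(6))
print(X_tr_std.std(axis=0).round(6))
```

Scale-sensitive models such as logistic regression and SVM tend to benefit from this step; tree-based models are largely unaffected by it.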

VI. Model Building

## Model 1: Logistic Regression
lr = LogisticRegression()
lr.fit(X_std_train, y_train)

## Model 2: Decision Tree
dtc = DecisionTreeClassifier(max_depth=8)
dtc.fit(X_std_train,y_train)

# ## Model 3: SVM
# svm = SVC(kernel='linear',probability=True)
# svm.fit(X_std_train,y_train)

## Model 4: Random Forest
rfc = RandomForestClassifier()
rfc.fit(X_std_train,y_train)

## Model 5: XGBoost
xgbc = xgb.XGBClassifier()
xgbc.fit(X_std_train,y_train)

## Model 6: LightGBM
lgbc = lgb.LGBMClassifier()
lgbc.fit(X_std_train,y_train)

VII. Model Evaluation

## Model evaluation
from sklearn.metrics import f1_score

def model_metrics(clf, X_test, y_test):
    y_test_pred = clf.predict(X_test)
    y_test_prob = clf.predict_proba(X_test)[:, 1]

    accuracy = accuracy_score(y_test, y_test_pred)
    print('The accuracy: ', accuracy)
    precision = precision_score(y_test, y_test_pred)
    print('The precision: ', precision)
    recall = recall_score(y_test, y_test_pred)
    print('The recall: ', recall)
    # F1 must come from f1_score, not recall_score
    f1 = f1_score(y_test, y_test_pred)
    print('The F1 score: ', f1)
    # Use a local name that does not shadow the imported roc_auc_score
    auc = roc_auc_score(y_test, y_test_prob)
    print('The AUC: ', auc)
    print('----------------------------------')

model_metrics(lr,X_std_test,y_test)
model_metrics(dtc,X_std_test,y_test)
model_metrics(rfc,X_std_test,y_test)
model_metrics(xgbc,X_std_test,y_test)
model_metrics(lgbc,X_std_test,y_test)
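To make the metrics easier to read, the sketch below computes them on six invented predictions and checks F1 against its definition, 2 × precision × recall / (precision + recall). The labels and probabilities are made up and have nothing to do with the models trained above.

```python
from sklearn.metrics import (
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented labels, hard predictions, and predicted probabilities.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_prob = [0.1, 0.6, 0.8, 0.7, 0.4, 0.2]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
auc = roc_auc_score(y_true, y_prob)          # needs scores, not hard labels

f1_by_hand = 2 * precision * recall / (precision + recall)

print(precision, recall, f1, auc)
```

In this toy case there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all equal 2/3. Because overdue customers are usually the minority class, recall and AUC are often more informative than accuracy for this kind of task.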