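Customer Loan Overdue Prediction [1]: Logistic Regression Model
Task
Predict whether a loan customer will become overdue. status is the response variable and takes two values: 0 means not overdue, 1 means overdue.
Code: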
# -*- coding: utf-8 -*-
"""
Created on Thu Nov 15 13:02:11 2018

@author: keepi
"""
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
pd.set_option('display.max_rows',1000)

# Load the data
data = pd.read_csv('data.csv',encoding='gb18030')
# Fill missing values with the constant 10
data = data.fillna(10)

# Feature engineering
# Alternative (disabled): map each category to an integer code
'''
n = set(data['reg_preference_for_trad'])
dic = {}
for i,j in enumerate(n):
    dic[j] = i
data['reg_preference_for_trad'] = data['reg_preference_for_trad'].map(dic)
'''
# One-hot encode the categorical column and append the new columns
x_dummy = pd.get_dummies(data['reg_preference_for_trad'])
data = pd.concat([data.drop('reg_preference_for_trad',axis=1),x_dummy],axis=1,sort=False)
data.drop('source',axis=1,inplace=True)
data.drop('bank_card_no',axis=1,inplace=True)
data.drop('latest_query_time',axis=1,inplace=True)
data.drop('loans_latest_time',axis=1,inplace=True)
data.drop('id_name',axis=1,inplace=True)

# Split into training and test sets
train,test = train_test_split(data,test_size=0.3,random_state=25)
y_train = train.loc[:,'status']
train_2 = train.drop('status',axis=1)
y_test = test.loc[:,'status']
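test_2 = test.drop('status',axis=1)

# Train the model and predict
# dual=True is only implemented for the liblinear solver, which was the
# default in 2018; newer scikit-learn defaults to lbfgs, so it is set explicitly
lr = LogisticRegression(C=190,dual=True,solver='liblinear',random_state=535)
lr.fit(train_2,y_train)
y_test_pre = lr.predict(test_2)

# Scoring
score = f1_score(y_test,y_test_pre,average='macro')
print('Validation set score',score)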
Validation set score: 0.43838
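The script above keeps two ways of encoding reg_preference_for_trad: the disabled integer mapping and the one-hot encoding that is actually used. Below is a minimal sketch of the difference on toy data. The category values 'A', 'B', 'C' are made up for illustration, and the categories are sorted here so the integer codes are reproducible (iterating over a plain set gives no guaranteed order).

```python
import pandas as pd

# Toy data; the category values 'A', 'B', 'C' are made up for illustration
df = pd.DataFrame({
    'reg_preference_for_trad': ['A', 'B', 'A', 'C'],
    'status': [0, 1, 0, 1],
})
col = 'reg_preference_for_trad'

# Option 1: integer codes (sorted so the mapping is reproducible)
mapping = {value: code for code, value in enumerate(sorted(set(df[col])))}
label_encoded = df[col].map(mapping)

# Option 2: one-hot columns, one per category, appended to the other columns
one_hot = pd.get_dummies(df[col])
encoded = pd.concat([df.drop(col, axis=1), one_hot], axis=1)

print(label_encoded.tolist())
print(encoded.columns.tolist())
```

Integer codes impose an ordering on the categories that a linear model will treat as meaningful, which is one likely reason to prefer the one-hot version for logistic regression.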
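Problems encountered
1. SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
The cause was that I modified the data in place during processing. train comes out of train_test_split as a subset of data, and dropping a column from it with inplace=True triggers the warning: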
train.drop('status',axis=1,inplace=True)
# Warning: SettingWithCopyWarning
# Changing it to the line below fixes it
train_2 = train.drop('status',axis=1)
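To make the fix concrete, here is a small sketch on a toy frame (the column name 'feature' is hypothetical). It only shows that drop without inplace returns a new DataFrame and leaves train untouched; whether the warning itself appears for the in-place version depends on the pandas version.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame; 'feature' is a hypothetical column name
data = pd.DataFrame({'feature': range(10), 'status': [0, 1] * 5})
train, test = train_test_split(data, test_size=0.3, random_state=25)

# Take the target first, then build the feature frame as a new object
y_train = train['status']
train_2 = train.drop('status', axis=1)  # train itself is not modified

print(train.columns.tolist())
print(train_2.columns.tolist())
```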
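2. Even with the random seed for the train/test split fixed, the score was different on every training run
This was because the logistic regression's own random seed was not set. The liblinear solver shuffles the data internally, so random_state has to be fixed on the model as well: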
lr = LogisticRegression(C=100,dual=True,solver='liblinear',random_state=535) # this fixes it
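A quick way to check this is to fit the same model twice with the same seed and compare the coefficients. The sketch below uses synthetic data from make_classification rather than the loan data, and the helper fit_coef is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the loan data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def fit_coef(seed):
    """Fit a dual-form liblinear logistic regression and return its coefficients."""
    model = LogisticRegression(C=100, dual=True, solver='liblinear', random_state=seed)
    model.fit(X, y)
    return model.coef_

# Two fits with the same seed should give identical coefficients
same_seed_match = np.allclose(fit_coef(535), fit_coef(535))
print(same_seed_match)
```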
3. Computing the F1 score after predicting with an SVM produced a warning:
UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
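This says that the F1 score is ill-defined because some terms are 0: precision for a label that is never predicted has a zero denominator. My trained model predicted 1 for every sample, while the test-set labels contain both 0 and 1. So why would a model trained with LinearSVC predict only one value?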
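The sketch below does not answer that question, but it reproduces the symptom on toy labels and shows a simple check for it: count the distinct predicted labels before scoring. The labels are made up, and zero_division=0 (available in scikit-learn 0.22 and later) is used so that the 0.0 for the never-predicted class is explicit instead of coming with a warning.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels: the test set has both classes, the model predicts only 1
y_true = np.array([0, 0, 0, 1])
y_pred = np.ones_like(y_true)

# Diagnostic: which labels were predicted, and how often
labels, counts = np.unique(y_pred, return_counts=True)
predicted_counts = dict(zip(labels.tolist(), counts.tolist()))

# Per-class F1: class 0 is never predicted, so its score is set to 0
per_class_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(predicted_counts)
print(per_class_f1, macro_f1)
```

Because the macro average weights both classes equally, a model that never predicts one class is capped at half of the other class's F1, which is why a single-class prediction drags the score down so sharply.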