ML - Loan User Overdue Analysis 6 - Final

Task 9 - On the same data, with a 70/30 train/test split, random seed 2018, and AUC as the evaluation metric, compare the scores of the single models and the fused model.

The full code is on GitHub.

Approach

Import the raw data, normalize the features, tune the hyperparameters, then fuse the models.

1. Importing the data

Import the data and normalize the features.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

# Load the data
data = pd.read_csv('data_all.csv')
y = data['status']
data.drop('status', axis = 1, inplace = True)
X = data

# Split into training and test sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)

# Standardize the features
std = StandardScaler()
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)

2. Performance evaluation function

from sklearn.metrics import accuracy_score, roc_auc_score

def model_metrics(clf, X_train, X_test, y_train, y_test):
    # Predictions on the training and test sets
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    y_train_proba = clf.predict_proba(X_train)[:, 1]
    y_test_proba = clf.predict_proba(X_test)[:, 1]

    # Accuracy
    print('[accuracy]', end=' ')
    print('train:', '%.4f' % accuracy_score(y_train, y_train_pred), end=' ')
    print('test:', '%.4f' % accuracy_score(y_test, y_test_pred))

    # AUC (roc_auc_score on the predicted probabilities)
    print('[auc]', end=' ')
    print('train:', '%.4f' % roc_auc_score(y_train, y_train_proba), end=' ')
    print('test:', '%.4f' % roc_auc_score(y_test, y_test_proba))

3. Model tuning

Tuning strategy: search coarsely over a wide range first, then refine within narrower intervals.

For models with many hyperparameters, such as XGBoost and LightGBM, first fix the other parameters at commonly used values, then sweep a few parameters at a time, cycling through the groups until the score stops improving (see the sketch below).
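
The loop below is a minimal sketch of this group-wise tuning idea (the helper name tune_in_groups, the parameter groups shown, and the convergence threshold are my own illustration, not code from the original post): each pass fixes the best values found so far and rescans the next group, until a full pass brings no further improvement in CV AUC.

from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBClassifier

def tune_in_groups(base_params, param_groups, X, y, max_rounds=5):
    # Cycle through the parameter groups until a full pass no longer improves the CV AUC
    best_score = 0
    for _ in range(max_rounds):
        improved = False
        for group in param_groups:
            gsearch = GridSearchCV(XGBClassifier(**base_params), param_grid=group,
                                   scoring='roc_auc', cv=5)
            gsearch.fit(X, y)
            if gsearch.best_score_ > best_score + 1e-4:
                best_score = gsearch.best_score_
                base_params.update(gsearch.best_params_)  # fix the winners before the next group
                improved = True
        if not improved:
            break
    return base_params, best_score

# Example usage (illustrative ranges):
# base = dict(learning_rate=0.1, random_state=2018)
# groups = [{'n_estimators': range(20, 200, 20)},
#           {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 12, 2)}]
# base, score = tune_in_groups(base, groups, X_train, y_train)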

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
from mlxtend.classifier import StackingClassifier

3.1 LR model

Tuning: the regularization strength C and the penalty type.

lr = LogisticRegression(solver = 'liblinear', random_state = 2018)
# param = {'C': [1e-3,0.01,0.1,1,10,100,1e3], 'penalty':['l1', 'l2']}
param = {'C': [i/100 for i in range(1,21)], 'penalty':['l1', 'l2']}

gsearch = GridSearchCV(lr, param_grid = param, scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)

print('Best parameters:', gsearch.best_params_)
print('Best CV score on the training set:', gsearch.best_score_)
print('Score on the test set:', gsearch.score(X_test, y_test))

# Note: the l1 penalty requires the liblinear solver in recent scikit-learn versions
lr = LogisticRegression(C = 0.04, penalty = 'l1', solver = 'liblinear', random_state = 2018)
lr.fit(X_train, y_train)
model_metrics(lr, X_train, X_test, y_train, y_test)

Output

[accuracy] train: 0.8016 test: 0.7884
[auc] train: 0.8080 test: 0.7831

3.2 SVM model

# Linear SVM
svm_linear = svm.SVC(kernel = 'linear', probability=True, random_state = 2018)
param = {'C':[0.01, 0.05, 0.1, 0.5, 1]}

gsearch = GridSearchCV(svm_linear, param_grid = param, scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)

print('Best parameters:', gsearch.best_params_)
print('Best CV score on the training set:', gsearch.best_score_)
print('Score on the test set:', gsearch.score(X_test, y_test))

svm_linear = svm.SVC(C = 0.01, kernel = 'linear', probability=True, random_state = 2018)
svm_linear.fit(X_train, y_train)
model_metrics(svm_linear, X_train, X_test, y_train, y_test)

Output

[accuracy] train: 0.7992 test: 0.7765
[auc] train: 0.8152 test: 0.7790

The other three kernels (svm_poly, svm_rbf, and svm_sigmoid) are tuned the same way and are not shown here; see GitHub for details. A rough sketch follows.
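
For reference, a minimal sketch of the three remaining kernels, using the final parameter values that appear later in the fusion code and in the results table (the actual tuning grids are on GitHub):

svm_poly = svm.SVC(C = 0.01, kernel = 'poly', probability=True, random_state = 2018)
svm_rbf = svm.SVC(gamma = 0.01, C = 0.01, probability=True, random_state = 2018)  # kernel='rbf' is the default
svm_sigmoid = svm.SVC(C = 0.01, kernel = 'sigmoid', probability=True, random_state = 2018)

for name, clf in [('svm_poly', svm_poly), ('svm_rbf', svm_rbf), ('svm_sigmoid', svm_sigmoid)]:
    print(name)
    clf.fit(X_train, y_train)
    model_metrics(clf, X_train, X_test, y_train, y_test)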

3.3 Decision tree model

1) First look at the results with the default parameters. (The tuned model should of course end up better than these defaults.)

dt = DecisionTreeClassifier(random_state = 2018)
dt.fit(X_train, y_train)
model_metrics(dt, X_train, X_test, y_train, y_test)

Output

[accuracy] train: 1.0000 test: 0.6854
[auc] train: 1.0000 test: 0.5956

2) The tuning process: search a wide range first, then narrow it. After one pass, go back to the first parameter group and cycle again until none of the parameters change.

param = {'max_depth':range(3,14,2), 'min_samples_split':range(100,801,200)}
#param = {'min_samples_split':range(50,1000,100), 'min_samples_leaf':range(60,101,10)}
#param = {'min_samples_split':range(100,401,10), 'min_samples_leaf':range(40,101,10)}
#param = {'max_features':range(7,20,2)}
#param = {'max_features':[18,19,20]}
gsearch = GridSearchCV(DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9, random_state = 2018),
                       param_grid = param,scoring ='roc_auc', cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

3) Final tuned model

dt = DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9, random_state = 2018)
dt.fit(X_train, y_train)
model_metrics(dt, X_train, X_test, y_train, y_test)

Output

[accuracy] train: 0.7812 test: 0.7561
[auc] train: 0.7721 test: 0.6946

3.4 XGBoost model

1) First look at the results with the default parameters.

import warnings
warnings.filterwarnings("ignore")

xgb0 = XGBClassifier(random_state =2018)
xgb0.fit(X_train, y_train)

model_metrics(xgb0, X_train, X_test, y_train, y_test)

2) Tuning process: cycle through the parameter groups below. Finally, lower the learning rate to 0.01 and re-tune n_estimators to check whether performance improves (a sketch of this last step follows the grid search below).

param_test = {'n_estimators':range(20,200,20)}
# param_test = {'n_estimators':range(40,81,10)}
# param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}
# param_test = {'max_depth':[2,3,4], 'min_child_weight':[10,11,12]}
# param_test = {'gamma':[i/10 for i in range(6)]}
# param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}
# param_test = { 'subsample':[i/100 for i in range(60,81,5)], 'colsample_bytree':[i/100 for i in range(70,91,5)]}
#param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
# param_test = {'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1]}

gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, 
                                                  min_child_weight=11, gamma=0, subsample=0.7, 
                                                  colsample_bytree=0.8, objective= 'binary:logistic', 
                                                  nthread=4,scale_pos_weight=1, random_state =2018), 
                        param_grid = param_test, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch.fit(X_train, y_train)
#gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_
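
The final step mentioned above, lowering the learning rate to 0.01 and re-tuning n_estimators with the other parameters held at the values found so far, might look like the sketch below (the n_estimators range is illustrative, not taken from the original post):

param_test = {'n_estimators': range(100, 1001, 100)}
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate=0.01, max_depth=3,
                                                 min_child_weight=11, gamma=0, subsample=0.7,
                                                 colsample_bytree=0.8, objective='binary:logistic',
                                                 nthread=4, scale_pos_weight=1, random_state=2018),
                       param_grid = param_test, scoring='roc_auc', n_jobs=4, cv=5)
gsearch.fit(X_train, y_train)
gsearch.best_params_, gsearch.best_score_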

3) Final tuned model

xgb = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, min_child_weight=11, 
                    gamma=0, subsample=0.7,colsample_bytree=0.8, objective= 'binary:logistic',
                    nthread=4,scale_pos_weight=1, random_state =2018)
xgb.fit(X_train, y_train)
model_metrics(xgb, X_train, X_test, y_train, y_test)

Output

[accuracy] train: 0.8302 test: 0.7891
[auc] train: 0.8710 test: 0.7780

3.5 LightGBM model

The procedure is the same as for XGBoost.

lgb0 = LGBMClassifier(random_state =2018)
lgb0.fit(X_train, y_train)

model_metrics(lgb0, X_train, X_test, y_train, y_test)

param_test = {'n_estimators':range(20,200,20)}
# param_test = {'n_estimators':range(30,51,10)}
# param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}
# param_test = {'max_depth':[2,3,4], 'min_child_weight':[6,7,8]}
# param_test = {'gamma':[i/10 for i in range(6)]}
# param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}
# param_test = { 'subsample':[i/100 for i in range(60,81,5)], 'colsample_bytree':[i/100 for i in range(70,91,5)]}
#param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
# param_test = {'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1]}
# Cycle through the parameter groups above, then lower the learning rate
gsearch = GridSearchCV(estimator = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3, 
                                                  min_child_weight=7, gamma=0, subsample=0.5, 
                                                  colsample_bytree=0.8, reg_alpha = 1e-5,
                                                  nthread=4,scale_pos_weight=1, random_state =2018), 
                        param_grid = param_test, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

lgb = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3, min_child_weight=7, 
                    gamma=0, subsample=0.5, colsample_bytree=0.8, reg_alpha=1e-5, 
                    nthread=4,scale_pos_weight=1,random_state =2018)
lgb.fit(X_train, y_train)
model_metrics(lgb, X_train, X_test, y_train, y_test)

Output

[accuracy] train: 0.8269 test: 0.7877
[auc] train: 0.8741 test: 0.7746

3.6 Model fusion

1. The base models need to be screened before fusion.
2. StackingClassifier parameter settings:

Suppose two base classifiers output [0.2, 0.5, 0.3] and [0.3, 0.4, 0.4] for a sample.
If average_probas=True, the classifiers' outputs are averaged: p = [0.25, 0.45, 0.35].
If average_probas=False, every classifier's outputs are kept and concatenated as new features: p = [0.2, 0.5, 0.3, 0.3, 0.4, 0.4].
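
A small illustration of the difference in plain numpy (not mlxtend itself), using the same numbers as above:

import numpy as np

# Probability outputs of two base classifiers for one sample
p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.3, 0.4, 0.4])

averaged = (p1 + p2) / 2            # average_probas=True  -> [0.25, 0.45, 0.35]
concatenated = np.hstack([p1, p2])  # average_probas=False -> [0.2, 0.5, 0.3, 0.3, 0.4, 0.4]
print(averaged, concatenated)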

Setting average_probas=True turned out to work better. In addition, the decision tree and svm_poly single models performed poorly, so both were dropped before stacking.

lr = LogisticRegression(C = 0.04, penalty = 'l1',random_state = 2018)
svm_linear =svm.SVC(C = 0.01, kernel = 'linear', probability=True,random_state = 2018)
svm_poly =  svm.SVC(C = 0.01, kernel = 'poly', probability=True,random_state = 2018)
svm_rbf =  svm.SVC(gamma = 0.01, C =0.01 , probability=True,random_state = 2018)
svm_sigmoid =  svm.SVC(C = 0.01, kernel = 'sigmoid',probability=True,random_state = 2018)
dt = DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9, 
                            random_state = 2018)
xgb = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, min_child_weight=11, 
                    gamma=0, subsample=0.7,colsample_bytree=0.8, objective= 'binary:logistic',
                    nthread=4,scale_pos_weight=1, random_state =2018)
lgb = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3, min_child_weight=7, 
                    gamma=0, subsample=0.5, colsample_bytree=0.8, reg_alpha=1e-5, 
                    nthread=4,scale_pos_weight=1,random_state =2018)

The class probabilities produced by the level-1 classifiers are used as new features for the meta-classifier:

sclf = StackingClassifier(classifiers=[lr, svm_linear, svm_rbf, xgb, lgb], 
                            meta_classifier=lr, use_probas=True,average_probas=True)
                            
sclf.fit(X_train, y_train.values)
model_metrics(sclf, X_train, X_test, y_train, y_test)

Output

[accuracy] train: 0.8161 test: 0.7821
[auc] train: 0.8556 test: 0.7861

4. Comparison of results and analysis

Model         | Parameters                                                                                                          | AUC (train / test)
LR            | C=0.04, penalty='l1'                                                                                                | 0.8080 / 0.7831
svm_linear    | C=0.01                                                                                                              | 0.8152 / 0.7790
svm_poly      | C=0.01                                                                                                              | 0.8626 / 0.7347
svm_rbf       | gamma=0.01, C=0.01                                                                                                  | 0.8522 / 0.7708
svm_sigmoid   | C=0.01                                                                                                              | 0.7660 / 0.7590
Decision tree | max_depth=9, min_samples_split=100, min_samples_leaf=90, max_features=9                                            | 0.7721 / 0.6946
XGBoost       | learning_rate=0.1, n_estimators=60, max_depth=3, min_child_weight=11, gamma=0, subsample=0.7, colsample_bytree=0.8 | 0.8710 / 0.7780
LightGBM      | learning_rate=0.1, n_estimators=50, max_depth=3, min_child_weight=7, gamma=0, subsample=0.5, colsample_bytree=0.8  | 0.8741 / 0.7746
Stacking      | -                                                                                                                   | 0.8750 / 0.7861

Analysis

Among the single models, the best test AUC is 0.7831 from the LR model; only the stacked model scores slightly higher (0.7861).
LR reaches its best result with L1 regularization, which drives some coefficients to zero, so further feature selection is likely worthwhile (see the sketch below).
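
A possible next step, sketched here as my own assumption rather than as part of the original post: use the L1-regularized LR as a selector via sklearn's SelectFromModel, then refit on the reduced feature set.

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Use the L1-penalized LR as a feature selector (features whose coefficients are shrunk to zero are dropped)
selector = SelectFromModel(
    LogisticRegression(C=0.04, penalty='l1', solver='liblinear', random_state=2018)
).fit(X_train, y_train)

X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
print('Features kept:', X_train_sel.shape[1])

# Refit any of the models above on the reduced feature set, e.g. the LR itself
lr_sel = LogisticRegression(C=0.04, penalty='l1', solver='liblinear', random_state=2018)
lr_sel.fit(X_train_sel, y_train)
model_metrics(lr_sel, X_train_sel, X_test_sel, y_train, y_test)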

5. Problems encountered

After tuning, the best_score_ value does not match the AUC obtained on the test set.

This is expected: best_score_ is the mean cross-validation AUC over the training folds, whereas the test AUC is computed on the separate 30% hold-out split, so the two are estimated on different data. Still, for a few models the gap is around 0.2, which feels too large and is worth investigating (a quick sanity check follows).
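
As a quick sanity check (my own addition, not from the original post), the mean cross-validated AUC on the training split can be compared directly with the hold-out test AUC; since the two are computed on different data, some gap is normal:

from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

# Mean 5-fold CV AUC on the training split (this is what GridSearchCV's best_score_ reports)
cv_auc = cross_val_score(lr, X_train, y_train, scoring='roc_auc', cv=5).mean()

# AUC on the 30% hold-out split
test_auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])

print('mean CV AUC: %.4f, hold-out test AUC: %.4f' % (cv_auc, test_auc))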