ML - Loan User Overdue Analysis 6 - Final
Task 9 - Using the unified dataset, a 70/30 train/test split, and random seed 2018, evaluate models by AUC and compare single models against a fused (stacked) model.
See Github for the full code.
Approach
Load the raw data, standardize the features, tune hyperparameters, then fuse the models.
1. Loading the data
Load the data and standardize the features.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
# Load the data
data = pd.read_csv('data_all.csv')
y = data['status']
data.drop('status', axis=1, inplace=True)
X = data
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)
# Standardize the features
std = StandardScaler()
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)
2. Evaluation function
from sklearn.metrics import accuracy_score, roc_auc_score

def model_metrics(clf, X_train, X_test, y_train, y_test):
    # Predictions
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    y_train_proba = clf.predict_proba(X_train)[:, 1]
    y_test_proba = clf.predict_proba(X_test)[:, 1]
    # Accuracy
    print('[accuracy]', end=' ')
    print('train: %.4f' % accuracy_score(y_train, y_train_pred), end=' ')
    print('test: %.4f' % accuracy_score(y_test, y_test_pred))
    # AUC: use roc_auc_score (or auc)
    print('[auc]', end=' ')
    print('train: %.4f' % roc_auc_score(y_train, y_train_proba), end=' ')
    print('test: %.4f' % roc_auc_score(y_test, y_test_proba))
3. Model tuning
Tuning strategy: scan a wide range coarsely first, then refine within narrower intervals.
For models with many hyperparameters, such as XGBoost and LightGBM, fix the remaining parameters at common values, scan a few parameters at a time, and cycle through the groups until the score stops improving.
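The coarse-then-fine loop described above can be sketched as follows. The grids and the use of LogisticRegression here are purely illustrative, not the parameter ranges used later in this post:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=2018)

# Pass 1: coarse scan over orders of magnitude
coarse = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': [1e-3, 1e-2, 1e-1, 1, 10, 100]},
                      scoring='roc_auc', cv=5)
coarse.fit(X, y)
best_c = coarse.best_params_['C']

# Pass 2: fine scan around the coarse optimum
fine = GridSearchCV(LogisticRegression(max_iter=1000),
                    {'C': [best_c * f for f in (0.25, 0.5, 1, 2, 4)]},
                    scoring='roc_auc', cv=5)
fine.fit(X, y)
print(fine.best_params_, round(fine.best_score_, 4))
```

Since the coarse optimum is included in the fine grid, the second pass can only match or improve the cross-validated score.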
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
from mlxtend.classifier import StackingClassifier
3.1 LR model
Tuned parameters: regularization strength C and penalty type (l1 vs l2).
lr = LogisticRegression(solver='liblinear', random_state=2018)  # liblinear supports the l1 penalty
# param = {'C': [1e-3, 0.01, 0.1, 1, 10, 100, 1e3], 'penalty': ['l1', 'l2']}
param = {'C': [i/100 for i in range(1, 21)], 'penalty': ['l1', 'l2']}
gsearch = GridSearchCV(lr, param_grid=param, scoring='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
print('best params:', gsearch.best_params_)
print('best CV score (train):', gsearch.best_score_)
print('score on the test set:', gsearch.score(X_test, y_test))
lr = LogisticRegression(C=0.04, penalty='l1', solver='liblinear', random_state=2018)
lr.fit(X_train, y_train)
model_metrics(lr, X_train, X_test, y_train, y_test)
Output
[accuracy] train: 0.8016 test: 0.7884
[auc] train: 0.8080 test: 0.7831
3.2 SVM model
# Linear SVM
svm_linear = svm.SVC(kernel='linear', probability=True, random_state=2018)
param = {'C': [0.01, 0.05, 0.1, 0.5, 1]}
gsearch = GridSearchCV(svm_linear, param_grid=param, scoring='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
print('best params:', gsearch.best_params_)
print('best CV score (train):', gsearch.best_score_)
print('score on the test set:', gsearch.score(X_test, y_test))
svm_linear = svm.SVC(C=0.01, kernel='linear', probability=True, random_state=2018)
svm_linear.fit(X_train, y_train)
model_metrics(svm_linear, X_train, X_test, y_train, y_test)
Output
[accuracy] train: 0.7992 test: 0.7765
[auc] train: 0.8152 test: 0.7790
The other three kernels (svm_poly, svm_rbf, and svm_sigmoid) are omitted here; see Github for details.
3.3 Decision tree
1) First look at the results with default parameters. (The tuned model should, of course, beat the defaults.)
dt = DecisionTreeClassifier(random_state = 2018)
dt.fit(X_train, y_train)
model_metrics(dt, X_train, X_test, y_train, y_test)
Output
[accuracy] train: 1.0000 test: 0.6854
[auc] train: 1.0000 test: 0.5956
2) The tuning procedure: scan a wide range first, then a narrow one. After finishing, loop back to the first parameter and repeat until no parameter needs to change.
param = {'max_depth':range(3,14,2), 'min_samples_split':range(100,801,200)}
#param = {'min_samples_split':range(50,1000,100), 'min_samples_leaf':range(60,101,10)}
#param = {'min_samples_split':range(100,401,10), 'min_samples_leaf':range(40,101,10)}
#param = {'max_features':range(7,20,2)}
#param = {'max_features':[18,19,20]}
gsearch = GridSearchCV(DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9, random_state = 2018),
param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
# gsearch.grid_scores_,
gsearch.best_params_, gsearch.best_score_
3) Final tuned result
dt = DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9, random_state = 2018)
dt.fit(X_train, y_train)
model_metrics(dt, X_train, X_test, y_train, y_test)
Output
[accuracy] train: 0.7812 test: 0.7561
[auc] train: 0.7721 test: 0.6946
3.4 XGBoost model
1) First look at the results with default parameters.
xgb0 = XGBClassifier(random_state =2018)
xgb0.fit(X_train, y_train)
model_metrics(xgb0, X_train, X_test, y_train, y_test)
2) Tuning procedure: cycle through the parameter grids below. Finally, lower the learning rate to 0.01 and retune n_estimators to check whether performance improves.
param_test = {'n_estimators':range(20,200,20)}
# param_test = {'n_estimators':range(40,81,10)}
# param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}
# param_test = {'max_depth':[2,3,4], 'min_child_weight':[10,11,12]}
# param_test = {'gamma':[i/10 for i in range(6)]}
# param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}
# param_test = { 'subsample':[i/100 for i in range(60,81,5)], 'colsample_bytree':[i/100 for i in range(70,91,5)]}
#param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
# param_test = {'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1]}
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3,
min_child_weight=11, gamma=0, subsample=0.7,
colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4,scale_pos_weight=1, random_state =2018),
param_grid = param_test, scoring='roc_auc', n_jobs=4, cv=5)
gsearch.fit(X_train, y_train)
#gsearch.grid_scores_,
gsearch.best_params_, gsearch.best_score_
3) Final tuned result
xgb = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, min_child_weight=11,
gamma=0, subsample=0.7,colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4,scale_pos_weight=1, random_state =2018)
xgb.fit(X_train, y_train)
model_metrics(xgb, X_train, X_test, y_train, y_test)
Output
[accuracy] train: 0.8302 test: 0.7891
[auc] train: 0.8710 test: 0.7780
3.5 LightGBM model
Same procedure as for XGBoost.
lgb0 = LGBMClassifier(random_state =2018)
lgb0.fit(X_train, y_train)
model_metrics(lgb0, X_train, X_test, y_train, y_test)
param_test = {'n_estimators':range(20,200,20)}
# param_test = {'n_estimators':range(30,51,10)}
# param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}
# param_test = {'max_depth':[2,3,4], 'min_child_weight':[6,7,8]}
# param_test = {'gamma':[i/10 for i in range(6)]}
# param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}
# param_test = { 'subsample':[i/100 for i in range(60,81,5)], 'colsample_bytree':[i/100 for i in range(70,91,5)]}
#param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
# param_test = {'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1]}
# Cycle through the scans above, then lower the learning rate
gsearch = GridSearchCV(estimator = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3,
min_child_weight=7, gamma=0, subsample=0.5,
colsample_bytree=0.8, reg_alpha = 1e-5,
nthread=4,scale_pos_weight=1, random_state =2018),
param_grid = param_test, scoring='roc_auc', n_jobs=4, cv=5)
gsearch.fit(X_train, y_train)
# gsearch.grid_scores_,
gsearch.best_params_, gsearch.best_score_
lgb = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3, min_child_weight=7,
gamma=0, subsample=0.5, colsample_bytree=0.8, reg_alpha=1e-5,
nthread=4,scale_pos_weight=1,random_state =2018)
lgb.fit(X_train, y_train)
model_metrics(lgb, X_train, X_test, y_train, y_test)
Output
[accuracy] train: 0.8269 test: 0.7877
[auc] train: 0.8741 test: 0.7746
3.6 Model fusion
1. The base models need to be screened before fusing.
2. StackingClassifier parameter settings: suppose two base classifiers output class probabilities [0.2, 0.5, 0.3] and [0.3, 0.4, 0.4] for a sample.
If average_probas=True, the classifiers' outputs are averaged: p = [0.25, 0.45, 0.35]
If average_probas=False, all outputs are kept as new features: p = [0.2, 0.5, 0.3, 0.3, 0.4, 0.4]
After trying both, average_probas=True worked better here. Also, the decision tree and svm_poly single models performed poorly, so both were dropped before stacking.
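The two average_probas settings can be checked directly on the toy probability vectors above (numbers taken from the example, not from real model outputs):

```python
import numpy as np

# Class probabilities from two level-one classifiers for one sample
p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.3, 0.4, 0.4])

# average_probas=True: average per class -> one 3-dim meta-feature
avg = (p1 + p2) / 2
print(avg)        # ≈ [0.25, 0.45, 0.35]

# average_probas=False: concatenate -> one 6-dim meta-feature
stacked = np.concatenate([p1, p2])
print(stacked)    # ≈ [0.2, 0.5, 0.3, 0.3, 0.4, 0.4]
```

Averaging shrinks the meta-feature space but discards any disagreement between base models; concatenation keeps that information at the cost of a wider meta-classifier input.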
lr = LogisticRegression(C=0.04, penalty='l1', solver='liblinear', random_state=2018)
svm_linear =svm.SVC(C = 0.01, kernel = 'linear', probability=True,random_state = 2018)
svm_poly = svm.SVC(C = 0.01, kernel = 'poly', probability=True,random_state = 2018)
svm_rbf = svm.SVC(gamma = 0.01, C =0.01 , probability=True,random_state = 2018)
svm_sigmoid = svm.SVC(C = 0.01, kernel = 'sigmoid',probability=True,random_state = 2018)
dt = DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9,
random_state = 2018)
xgb = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, min_child_weight=11,
gamma=0, subsample=0.7,colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4,scale_pos_weight=1, random_state =2018)
lgb = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3, min_child_weight=7,
gamma=0, subsample=0.5, colsample_bytree=0.8, reg_alpha=1e-5,
nthread=4,scale_pos_weight=1,random_state =2018)
The class probabilities produced by the level-one classifiers are used as new features:
sclf = StackingClassifier(classifiers=[lr, svm_linear, svm_rbf, xgb, lgb],
meta_classifier=lr, use_probas=True,average_probas=True)
sclf.fit(X_train, y_train.values)
model_metrics(sclf, X_train, X_test, y_train, y_test)
Output
[accuracy] train: 0.8161 test: 0.7821
[auc] train: 0.8556 test: 0.7861
4. Results and analysis

Model | Parameters | AUC |
---|---|---|
LR | C=0.04, penalty='l1' | train: 0.8080, test: 0.7831 |
svm_linear | C=0.01 | train: 0.8152, test: 0.7790 |
svm_poly | C=0.01 | train: 0.8626, test: 0.7347 |
svm_rbf | gamma=0.01, C=0.01 | train: 0.8522, test: 0.7708 |
svm_sigmoid | C=0.01 | train: 0.7660, test: 0.7590 |
Decision tree | max_depth=9, min_samples_split=100, min_samples_leaf=90, max_features=9 | train: 0.7721, test: 0.6946 |
XGBoost | learning_rate=0.1, n_estimators=60, max_depth=3, min_child_weight=11, gamma=0, subsample=0.7, colsample_bytree=0.8 | train: 0.8710, test: 0.7780 |
LightGBM | learning_rate=0.1, n_estimators=50, max_depth=3, min_child_weight=7, gamma=0, subsample=0.5, colsample_bytree=0.8 | train: 0.8741, test: 0.7746 |
Stacking | - | train: 0.8750, test: 0.7861 |
Analysis
Among the single models, the best test-set AUC is 0.7831, from the LR model.
Moreover, when LR achieves its best result it selects L1 regularization, which suggests that further feature selection is worth trying.
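One quick way to probe that hypothesis is to count how many coefficients the L1 penalty drives exactly to zero; a minimal sketch, with synthetic data standing in for data_all.csv:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 informative features among 50
X, y = make_classification(n_samples=1000, n_features=50,
                           n_informative=10, random_state=2018)

lr = LogisticRegression(C=0.04, penalty='l1', solver='liblinear',
                        random_state=2018)
lr.fit(X, y)

# Features with zero coefficients were effectively discarded by L1
n_zero = int(np.sum(lr.coef_[0] == 0))
print(f'{n_zero} of {lr.coef_.shape[1]} coefficients zeroed by L1')
```

A large count of zeroed coefficients would support the idea that many features carry little signal and that explicit feature selection could help the other models too.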
5. Problems encountered
After tuning, best_score_ does not match the AUC measured on the test set.
Is the reason that best_score_ uses cross-validation, whereas the final test split is different data? A few models seem to differ by about 0.2, which feels like a lot.
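A likely explanation: best_score_ is the mean AUC over the validation folds of the training split, while gsearch.score(X_test, y_test) evaluates the refit estimator on held-out data the search never saw, so the two numbers are computed on different samples and need not agree. A minimal sketch of the two quantities (synthetic data and a hypothetical grid):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=800, random_state=2018)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=2018)

gs = GridSearchCV(LogisticRegression(max_iter=1000),
                  {'C': [0.01, 0.1, 1]}, scoring='roc_auc', cv=5)
gs.fit(X_tr, y_tr)

# best_score_: mean AUC over the 5 validation folds (training data only)
cv_auc = gs.best_score_
# score(): the refit estimator evaluated on the held-out test split
test_auc = gs.score(X_te, y_te)
print(round(cv_auc, 4), round(test_auc, 4))
```

Some gap between the two is normal; a consistently large gap usually points to overfitting the training distribution or to a test split that differs from the training data.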