ML - Loan User Overdue Analysis 6 - Final
Task 9 - Using the unified dataset, a 70/30 train/test split, and random seed 2018, evaluate models by AUC and compare single models against a fused (stacked) model.
See Github for the full code.
Approach
Load the raw data, standardize the features, tune hyperparameters, then fuse the models.
1. Loading the data
Load the data and standardize the features.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
# Load the data
data = pd.read_csv('data_all.csv')
y = data['status']
data.drop('status', axis=1, inplace=True)
X = data
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)
# Standardize the features
std = StandardScaler()
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)
2. Evaluation function
from sklearn.metrics import accuracy_score, roc_auc_score

def model_metrics(clf, X_train, X_test, y_train, y_test):
    # Predictions
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    y_train_proba = clf.predict_proba(X_train)[:, 1]
    y_test_proba = clf.predict_proba(X_test)[:, 1]
    # Accuracy
    print('[accuracy]', end=' ')
    print('train: %.4f' % accuracy_score(y_train, y_train_pred), end=' ')
    print('test: %.4f' % accuracy_score(y_test, y_test_pred))
    # AUC: use roc_auc_score (or auc)
    print('[auc]', end=' ')
    print('train: %.4f' % roc_auc_score(y_train, y_train_proba), end=' ')
    print('test: %.4f' % roc_auc_score(y_test, y_test_proba))
3. Model tuning
Tuning strategy: scan a wide range coarsely first, then refine within narrower intervals.
For models with many hyperparameters, such as XGBoost and LightGBM, fix the remaining parameters at common values, scan a few parameters at a time, and cycle through the groups until the score stops improving.
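The coarse-then-fine loop described above can be sketched as follows. The grids and the use of LogisticRegression here are purely illustrative, not the parameter ranges used later in this post:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=2018)

# Pass 1: coarse scan over orders of magnitude
coarse = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': [1e-3, 1e-2, 1e-1, 1, 10, 100]},
                      scoring='roc_auc', cv=5)
coarse.fit(X, y)
best_c = coarse.best_params_['C']

# Pass 2: fine scan around the coarse optimum
fine = GridSearchCV(LogisticRegression(max_iter=1000),
                    {'C': [best_c * f for f in (0.25, 0.5, 1, 2, 4)]},
                    scoring='roc_auc', cv=5)
fine.fit(X, y)
print(fine.best_params_, round(fine.best_score_, 4))
```

Since the coarse optimum is included in the fine grid, the second pass can only match or improve the cross-validated score.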
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
from mlxtend.classifier import StackingClassifier
3.1 LR model
Tuned parameters: regularization strength C and penalty type (l1 vs l2).
lr = LogisticRegression(solver='liblinear', random_state=2018)  # liblinear supports the l1 penalty
# param = {'C': [1e-3, 0.01, 0.1, 1, 10, 100, 1e3], 'penalty': ['l1', 'l2']}
param = {'C': [i/100 for i in range(1, 21)], 'penalty': ['l1', 'l2']}
gsearch = GridSearchCV(lr, param_grid=param, scoring='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
print('best params:', gsearch.best_params_)
print('best CV score (train):', gsearch.best_score_)
print('score on the test set:', gsearch.score(X_test, y_test))
lr = LogisticRegression(C=0.04, penalty='l1', solver='liblinear', random_state=2018)
lr.fit(X_train, y_train)
model_metrics(lr, X_train, X_test, y_train, y_test)
Output
[accuracy] train: 0.8016 test: 0.7884
[auc] train: 0.8080 test: 0.7831
3.2 SVM model
# Linear SVM
svm_linear = svm.SVC(kernel='linear', probability=True, random_state=2018)
param = {'C': [0.01, 0.05, 0.1, 0.5, 1]}
gsearch = GridSearchCV(svm_linear, param_grid=param, scoring='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
print('best params:', gsearch.best_params_)
print('best CV score (train):', gsearch.best_score_)
print('score on the test set:', gsearch.score(X_test, y_test))
svm_linear = svm.SVC(C=0.01, kernel='linear', probability=True, random_state=2018)
svm_linear.fit(X_train, y_train)
model_metrics(svm_linear, X_train, X_test, y_train, y_test)
Output
[accuracy] train: 0.7992 test: 0.7765
[auc] train: 0.8152 test: 0.7790
The other three kernels (svm_poly, svm_rbf, and svm_sigmoid) are omitted here; see Github for details.
3.3 Decision tree
1) First look at the results with default parameters. (The tuned model should, of course, beat the defaults.)
dt = DecisionTreeClassifier(random_state = 2018)
dt.fit(X_train, y_train)
model_metrics(dt, X_train, X_test, y_train, y_test)
Output
[accuracy] train: 1.0000 test: 0.6854
[auc] train: 1.0000 test: 0.5956
2) The tuning procedure: scan a wide range first, then a narrow one. After finishing, loop back to the first parameter and repeat until no parameter needs to change.
param = {'max_depth':range(3,14,2), 'min_samples_split':range(100,801,200)}
#param = {'min_samples_split':range(50,1000,100), 'min_samples_leaf':range(60,101,10)}
#param = {'min_samples_split':range(100,401,10), 'min_samples_leaf':range(40,101,10)}
#param = {'max_features':range(7,20,2)}
#param = {'max_features':[18,19,20]}
gsearch = GridSearchCV(DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9, random_state = 2018),
param_grid = param,scoring ='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
# gsearch.grid_scores_,
gsearch.best_params_, gsearch.best_score_
3) Final tuned result
dt = DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9, random_state = 2018)
dt.fit(X_train, y_train)
model_metrics(dt, X_train, X_test, y_train, y_test)
Output
[accuracy] train: 0.7812 test: 0.7561
[auc] train: 0.7721 test: 0.6946
3.4 XGBoost model
1) First look at the results with default parameters.
xgb0 = XGBClassifier(random_state =2018)
xgb0.fit(X_train, y_train)
model_metrics(xgb0, X_train, X_test, y_train, y_test)
2) Tuning procedure: cycle through the parameter grids below. Finally, lower the learning rate to 0.01 and retune n_estimators to check whether performance improves.
param_test = {'n_estimators':range(20,200,20)}
# param_test = {'n_estimators':range(40,81,10)}
# param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}
# param_test = {'max_depth':[2,3,4], 'min_child_weight':[10,11,12]}
# param_test = {'gamma':[i/10 for i in range(6)]}
# param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}
# param_test = { 'subsample':[i/100 for i in range(60,81,5)], 'colsample_bytree':[i/100 for i in range(70,91,5)]}
#param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
# param_test = {'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1]}
gsearch = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3,
min_child_weight=11, gamma=0, subsample=0.7,
colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4,scale_pos_weight=1, random_state =2018),
param_grid = param_test, scoring='roc_auc', n_jobs=4, cv=5)
gsearch.fit(X_train, y_train)
#gsearch.grid_scores_,
gsearch.best_params_, gsearch.best_score_
3) Final tuned result
xgb = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, min_child_weight=11,
gamma=0, subsample=0.7,colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4,scale_pos_weight=1, random_state =2018)
xgb.fit(X_train, y_train)
model_metrics(xgb, X_train, X_test, y_train, y_test)
Output
[accuracy] train: 0.8302 test: 0.7891
[auc] train: 0.8710 test: 0.7780
3.5 LightGBM model
Same procedure as for XGBoost.
lgb0 = LGBMClassifier(random_state =2018)
lgb0.fit(X_train, y_train)
model_metrics(lgb0, X_train, X_test, y_train, y_test)
param_test = {'n_estimators':range(20,200,20)}
# param_test = {'n_estimators':range(30,51,10)}
# param_test = {'max_depth':range(3,10,2), 'min_child_weight':range(1,12,2)}
# param_test = {'max_depth':[2,3,4], 'min_child_weight':[6,7,8]}
# param_test = {'gamma':[i/10 for i in range(6)]}
# param_test = {'subsample':[i/10 for i in range(5,10)], 'colsample_bytree':[i/10 for i in range(5,10)]}
# param_test = { 'subsample':[i/100 for i in range(60,81,5)], 'colsample_bytree':[i/100 for i in range(70,91,5)]}
#param_test = {'reg_alpha':[1e-5, 1e-2, 0.1, 0, 1, 100]}
# param_test = {'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1]}
# Cycle through the scans above, then lower the learning rate
gsearch = GridSearchCV(estimator = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3,
min_child_weight=7, gamma=0, subsample=0.5,
colsample_bytree=0.8, reg_alpha = 1e-5,
nthread=4,scale_pos_weight=1, random_state =2018),
param_grid = param_test, scoring='roc_auc', n_jobs=4, cv=5)
gsearch.fit(X_train, y_train)
# gsearch.grid_scores_,
gsearch.best_params_, gsearch.best_score_
lgb = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3, min_child_weight=7,
gamma=0, subsample=0.5, colsample_bytree=0.8, reg_alpha=1e-5,
nthread=4,scale_pos_weight=1,random_state =2018)
lgb.fit(X_train, y_train)
model_metrics(lgb, X_train, X_test, y_train, y_test)
Output
[accuracy] train: 0.8269 test: 0.7877
[auc] train: 0.8741 test: 0.7746
3.6 Model fusion
1. The base models need to be screened before fusing.
2. StackingClassifier parameter settings: suppose two base classifiers output class probabilities [0.2, 0.5, 0.3] and [0.3, 0.4, 0.4] for a sample.
If average_probas=True, the classifiers' outputs are averaged: p = [0.25, 0.45, 0.35]
If average_probas=False, all outputs are kept as new features: p = [0.2, 0.5, 0.3, 0.3, 0.4, 0.4]
After trying both, average_probas=True worked better here. Also, the decision tree and svm_poly single models performed poorly, so both were dropped before stacking.
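The two average_probas settings can be checked directly on the toy probability vectors above (numbers taken from the example, not from real model outputs):

```python
import numpy as np

# Class probabilities from two level-one classifiers for one sample
p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.3, 0.4, 0.4])

# average_probas=True: average per class -> one 3-dim meta-feature
avg = (p1 + p2) / 2
print(avg)        # ≈ [0.25, 0.45, 0.35]

# average_probas=False: concatenate -> one 6-dim meta-feature
stacked = np.concatenate([p1, p2])
print(stacked)    # ≈ [0.2, 0.5, 0.3, 0.3, 0.4, 0.4]
```

Averaging shrinks the meta-feature space but discards any disagreement between base models; concatenation keeps that information at the cost of a wider meta-classifier input.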
lr = LogisticRegression(C=0.04, penalty='l1', solver='liblinear', random_state=2018)
svm_linear =svm.SVC(C = 0.01, kernel = 'linear', probability=True,random_state = 2018)
svm_poly = svm.SVC(C = 0.01, kernel = 'poly', probability=True,random_state = 2018)
svm_rbf = svm.SVC(gamma = 0.01, C =0.01 , probability=True,random_state = 2018)
svm_sigmoid = svm.SVC(C = 0.01, kernel = 'sigmoid',probability=True,random_state = 2018)
dt = DecisionTreeClassifier(max_depth=9,min_samples_split=100,min_samples_leaf=90, max_features=9,
random_state = 2018)
xgb = XGBClassifier(learning_rate =0.1, n_estimators=60, max_depth=3, min_child_weight=11,
gamma=0, subsample=0.7,colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4,scale_pos_weight=1, random_state =2018)
lgb = LGBMClassifier(learning_rate =0.1, n_estimators=50, max_depth=3, min_child_weight=7,
gamma=0, subsample=0.5, colsample_bytree=0.8, reg_alpha=1e-5,
nthread=4,scale_pos_weight=1,random_state =2018)
The class probabilities produced by the level-one classifiers are used as new features:
sclf = StackingClassifier(classifiers=[lr, svm_linear, svm_rbf, xgb, lgb],
meta_classifier=lr, use_probas=True,average_probas=True)
sclf.fit(X_train, y_train.values)
model_metrics(sclf, X_train, X_test, y_train, y_test)
Output
[accuracy] train: 0.8161 test: 0.7821
[auc] train: 0.8556 test: 0.7861
4. Results and analysis

Model | Parameters | AUC |
---|---|---|
LR | C=0.04, penalty='l1' | train: 0.8080, test: 0.7831 |
svm_linear | C=0.01 | train: 0.8152, test: 0.7790 |
svm_poly | C=0.01 | train: 0.8626, test: 0.7347 |
svm_rbf | gamma=0.01, C=0.01 | train: 0.8522, test: 0.7708 |
svm_sigmoid | C=0.01 | train: 0.7660, test: 0.7590 |
Decision tree | max_depth=9, min_samples_split=100, min_samples_leaf=90, max_features=9 | train: 0.7721, test: 0.6946 |
XGBoost | learning_rate=0.1, n_estimators=60, max_depth=3, min_child_weight=11, gamma=0, subsample=0.7, colsample_bytree=0.8 | train: 0.8710, test: 0.7780 |
LightGBM | learning_rate=0.1, n_estimators=50, max_depth=3, min_child_weight=7, gamma=0, subsample=0.5, colsample_bytree=0.8 | train: 0.8741, test: 0.7746 |
Stacking | - | train: 0.8750, test: 0.7861 |
Analysis
Among the single models, the best test-set AUC is 0.7831, from the LR model.
Moreover, when LR achieves its best result it selects L1 regularization, which suggests that further feature selection is worth trying.
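One quick way to probe that hypothesis is to count how many coefficients the L1 penalty drives exactly to zero; a minimal sketch, with synthetic data standing in for data_all.csv:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 informative features among 50
X, y = make_classification(n_samples=1000, n_features=50,
                           n_informative=10, random_state=2018)

lr = LogisticRegression(C=0.04, penalty='l1', solver='liblinear',
                        random_state=2018)
lr.fit(X, y)

# Features with zero coefficients were effectively discarded by L1
n_zero = int(np.sum(lr.coef_[0] == 0))
print(f'{n_zero} of {lr.coef_.shape[1]} coefficients zeroed by L1')
```

A large count of zeroed coefficients would support the idea that many features carry little signal and that explicit feature selection could help the other models too.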
5. Problems encountered
After tuning, best_score_ does not match the AUC measured on the test set.
Is the reason that best_score_ uses cross-validation, whereas the final test split is different data? A few models seem to differ by about 0.2, which feels like a lot.
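A likely explanation: best_score_ is the mean AUC over the validation folds of the training split, while gsearch.score(X_test, y_test) evaluates the refit estimator on held-out data the search never saw, so the two numbers are computed on different samples and need not agree. A minimal sketch of the two quantities (synthetic data and a hypothetical grid):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=800, random_state=2018)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=2018)

gs = GridSearchCV(LogisticRegression(max_iter=1000),
                  {'C': [0.01, 0.1, 1]}, scoring='roc_auc', cv=5)
gs.fit(X_tr, y_tr)

# best_score_: mean AUC over the 5 validation folds (training data only)
cv_auc = gs.best_score_
# score(): the refit estimator evaluated on the held-out test split
test_auc = gs.score(X_te, y_te)
print(round(cv_auc, 4), round(test_auc, 4))
```

Some gap between the two is normal; a consistently large gap usually points to overfitting the training distribution or to a test split that differs from the training data.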