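[One-Week Algorithm Practice] 2. Model Building: Ensemble Models
阿新 • Published: 2019-01-05
Model Building: Ensemble Models
Build four models (RF, GBDT, XGBoost, and LightGBM) and score each one with accuracy and AUC. The previous task used three simple models (LR, SVM, and DecisionTree) to predict and evaluate on the samples; see https://blog.csdn.net/wxq_1993/article/details/85703936.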
# 1. Import the required modules
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
import time
import warnings
warnings.filterwarnings('ignore')
# 2. Split the data into X and y and take a quick look at it
data_original= pd.read_csv("data_all.csv")
data_original.head(5)
data_original.describe()
#data_original.info()
statistic | low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | transd_mcc | trans_days_interval_filter | trans_days_interval | regional_mobility | ... | consfin_product_count | consfin_max_limit | consfin_avg_limit | latest_query_day | loans_latest_day | reg_preference_for_trad | latest_query_time_month | latest_query_time_weekday | loans_latest_time_month | loans_latest_time_weekday
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | ... | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.000000 | 4754.00000 | 4754.000000 | 4754.000000 |
mean | 0.021801 | 0.901332 | 1940.197728 | 14.152318 | 0.804493 | 0.365356 | 17.503155 | 29.004628 | 21.748422 | 2.678797 | ... | 5.088347 | 16418.973496 | 7507.426378 | 24.041649 | 51.984013 | 0.372949 | 4.273875 | 3.42196 | 4.542701 | 3.025873 |
std | 0.041519 | 0.144837 | 3923.971494 | 693.961441 | 0.196920 | 0.170194 | 4.474686 | 22.711659 | 16.472031 | 0.890198 | ... | 3.344794 | 13885.107357 | 5830.674623 | 36.500344 | 53.249364 | 0.687382 | 1.333778 | 1.93213 | 2.987731 | 1.895870 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.120000 | 0.033000 | 2.000000 | 0.000000 | 4.000000 | 1.000000 | ... | 0.000000 | 0.000000 | 0.000000 | -2.000000 | -2.000000 | 0.000000 | 1.000000 | 0.00000 | 1.000000 | 0.000000 |
25% | 0.010000 | 0.880000 | 0.000000 | 0.620000 | 0.670000 | 0.233000 | 15.000000 | 16.000000 | 12.000000 | 2.000000 | ... | 3.000000 | 7800.000000 | 4200.000000 | 6.000000 | 7.000000 | 0.000000 | 4.000000 | 2.00000 | 3.000000 | 2.000000 |
50% | 0.010000 | 0.960000 | 500.000000 | 0.970000 | 0.860000 | 0.350000 | 17.000000 | 23.000000 | 17.000000 | 3.000000 | ... | 4.000000 | 14400.000000 | 6750.000000 | 16.000000 | 29.000000 | 0.000000 | 4.000000 | 4.00000 | 4.000000 | 3.000000 |
75% | 0.020000 | 0.990000 | 2000.000000 | 1.600000 | 1.000000 | 0.479500 | 20.000000 | 32.000000 | 26.750000 | 3.000000 | ... | 7.000000 | 20400.000000 | 9696.250000 | 23.000000 | 86.000000 | 1.000000 | 5.000000 | 5.00000 | 5.000000 | 5.000000 |
max | 1.000000 | 1.000000 | 68000.000000 | 47596.740000 | 1.000000 | 0.941000 | 42.000000 | 285.000000 | 234.000000 | 5.000000 | ... | 20.000000 | 266400.000000 | 82800.000000 | 360.000000 | 323.000000 | 4.000000 | 12.000000 | 6.00000 | 12.000000 | 6.000000 |
8 rows × 85 columns
y=data_original['status'].copy()
X=data_original.drop(['status'],axis=1).copy()
print("the X shape is:", X.shape)
print("the y shape is:" ,y.shape)
print("the nums of label 1 in y are",len(y[y==1]))
print("the nums of label 0 in y are",len(y[y==0]))
df_ret=pd.DataFrame(columns=('Model','Accuracy','AUC','Time'))
row=0
the X shape is: (4754, 84)
the y shape is: (4754,)
the nums of label 1 in y are 1193
the nums of label 0 in y are 3561
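There are 4,754 samples in total, each with 84 features. Of the labels, 1,193 are 1 and 3,561 are 0, so the positive and negative classes are noticeably imbalanced (roughly 1:3), which should be taken into account in later processing.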
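To make the imbalance concrete, the short calculation below derives the positive rate and the accuracy that a trivial "always predict 0" classifier would reach. It uses only the two class counts printed above; the variable names are illustrative.

```python
# Class counts reported in the output above
n_pos = 1193  # samples with label 1
n_neg = 3561  # samples with label 0
n_total = n_pos + n_neg

# Share of positive samples in the whole dataset
pos_rate = n_pos / n_total

# Accuracy obtained by always predicting the majority class
majority_baseline = max(n_pos, n_neg) / n_total

print("positive rate: %.4f" % pos_rate)
print("majority-class baseline accuracy: %.4f" % majority_baseline)
```

A majority-class baseline of about 74.9% is a useful reference point when reading accuracy on this dataset, and it is one reason to report AUC alongside accuracy.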
# 3. Split the dataset 70/30 into training and test sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2018)
print('the proportion of label 1 in y_test: %.2f%%'%(len(y_test[y_test==1])/len(y_test)*100))
the proportion of label 1 in y_test: 25.16%
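The split above happens to preserve the class ratio closely, but with imbalanced labels this can be made explicit. The sketch below is not what the original post does: it passes `stratify` to `train_test_split` on synthetic stand-in data (`X_demo` and `y_demo` are made up, with the same class counts as the real labels) so that both subsets keep nearly the same positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: same class counts, random features
rng = np.random.RandomState(0)
y_demo = np.array([1] * 1193 + [0] * 3561)
X_demo = rng.normal(size=(len(y_demo), 5))

# stratify=y_demo keeps the label proportions equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2018, stratify=y_demo
)

train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print("positive rate in train: %.4f, in test: %.4f" % (train_pos_rate, test_pos_rate))
```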
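# 4. Define an evaluation function
def evaluate(y_pre,y):
    # y_pre holds the hard 0/1 labels from model.predict(), so the AUC below
    # is computed from labels rather than from predicted probabilities
    acc=accuracy_score(y,y_pre)
    auc=roc_auc_score(y,y_pre)
    return acc,auc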
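Because accuracy_score() and f1_score() were called repeatedly in the first assignment, this second assignment wraps the metric calls in a single evaluation function for convenience.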
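The evaluate() function above passes the hard labels from predict() to roc_auc_score(). AUC is usually computed from a continuous score such as the predicted probability of the positive class, and label-based AUC tends to come out lower. The following is a minimal sketch of that variant, not the code behind the reported results: `evaluate_with_scores` is a hypothetical helper, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_with_scores(model, X, y):
    """Return accuracy, AUC from probabilities, and AUC from hard labels."""
    y_label = model.predict(X)               # hard 0/1 predictions
    y_score = model.predict_proba(X)[:, 1]   # probability of class 1
    acc = accuracy_score(y, y_label)
    auc_from_scores = roc_auc_score(y, y_score)
    auc_from_labels = roc_auc_score(y, y_label)
    return acc, auc_from_scores, auc_from_labels


# Synthetic, imbalanced data used only for illustration
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

demo_model = RandomForestClassifier(n_estimators=50, random_state=0)
demo_model.fit(X_tr, y_tr)

acc, auc_scores, auc_labels = evaluate_with_scores(demo_model, X_te, y_te)
print("accuracy:", round(acc, 4))
print("AUC from probabilities:", round(auc_scores, 4))
print("AUC from hard labels:", round(auc_labels, 4))
```

Applying the same idea to the four models in this post would mean passing `model.predict_proba(X_test)[:, 1]` to roc_auc_score(); the resulting numbers would differ from those reported here.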
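One problem came up: when I copied the default parameters of the four classifiers (RF, GBDT, XGBoost, and LightGBM) straight from the official documentation, running the code raised an error complaining about Chinese characters or spaces, so I fell back to typing in just the few parameters shown below.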
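Errors like this often come from full-width punctuation or non-breaking spaces picked up when copying from a web page, although that is only a guess about what happened here. One way to avoid pasting defaults at all is to ask each estimator for them with get_params(), which every scikit-learn estimator provides. The sketch below uses only the two scikit-learn classifiers so that it runs without extra packages; XGBClassifier and LGBMClassifier follow the same estimator interface and expose get_params() as well.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Instantiate with no arguments, then read the defaults back as a dict
rf_defaults = RandomForestClassifier().get_params()
gbdt_defaults = GradientBoostingClassifier().get_params()

# Print only the parameters that this post sets explicitly
for key in ("n_estimators", "max_depth", "criterion"):
    print("RF  ", key, "=", rf_defaults[key])
for key in ("n_estimators", "max_depth", "learning_rate"):
    print("GBDT", key, "=", gbdt_defaults[key])
```

The printed defaults depend on the installed library version, so they are worth checking locally rather than copying from a page.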
# 5. Build the models for prediction
# Use RF, GBDT, XGBoost, and LightGBM; since I am not yet familiar with these models, all of them use default values
rf_model=RandomForestClassifier(n_estimators=100,max_depth=None,criterion='gini')
gbdt_model=GradientBoostingClassifier(n_estimators=100,max_depth=3,learning_rate=0.1)
xgb_model=XGBClassifier(n_estimators=100,learning_rate=0.1,max_depth=3)
lgbm_model=LGBMClassifier(n_estimators=100,learning_rate=0.1,max_depth=-1)
# 6. Train the models
models=[('RF',rf_model),('gbdt',gbdt_model),('xgb',xgb_model),('lgbm',lgbm_model)]
for name,model in models:
    print(name,'start training.....')
    # time.clock() was removed in Python 3.8; time.perf_counter() replaces it
    startTime=time.perf_counter()
    model.fit(X_train,y_train)
    y_pred=model.predict(X_test)
    endTime=time.perf_counter()
    print(name,'using time is %.4f'%(endTime-startTime))
    acc,auc=evaluate(y_pred,y_test)
    print(name,'accuracy_score:',round(acc,4),'auc_score: ',round(auc,4))
    # Collect one row per model for the summary table
    df_ret.loc[row]=[name,acc,auc,(endTime-startTime)]
    row+=1
    print('\n')
print(df_ret)
RF start training.....
RF using time is 1.3224
RF accuracy_score: 0.7849 auc_score: 0.6076
gbdt start training.....
gbdt using time is 1.3351
gbdt accuracy_score: 0.78 auc_score: 0.6376
xgb start training.....
xgb using time is 0.7749
xgb accuracy_score: 0.7856 auc_score: 0.6432
lgbm start training.....
lgbm using time is 0.7061
lgbm accuracy_score: 0.7701 auc_score: 0.631
Model Accuracy AUC Time
0 RF 0.784863 0.607558 1.322362
1 gbdt 0.779958 0.637566 1.335071
2 xgb 0.785564 0.643161 0.774934
3 lgbm 0.770147 0.631012 0.706147
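According to the results, these four ensemble models clearly outperform the three models used the first time; **XGBoost performs best and LightGBM is the fastest**. Because copying the default parameters raised an error, only three parameters were set for each model during training, and this will be improved in later training. In addition, XGBoost and GBDT come up frequently in interviews and deserve particular attention.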
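As one possible direction for the later tuning mentioned above, the sketch below runs a small grid search over two GBDT parameters with AUC as the selection metric. It is an illustration on synthetic data, not part of the original experiment; the grid values are arbitrary assumptions kept small so the search runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced data standing in for the real training set
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.75, 0.25], random_state=0
)

# Small, arbitrary grid over two of the parameters set in this post
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # select by cross-validated AUC
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_auc = search.best_score_
print("best parameters:", best_params)
print("best cross-validated AUC: %.4f" % best_auc)
```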