[One Week of Algorithm Practice] 2. Model Building: Ensemble Models

阿新 • Published: 2019-01-05

Model Building: Ensemble Models

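Build four models (RF, GBDT, XGBoost, and LightGBM) and score each one with accuracy and AUC. The previous task used three simple models, LR, SVM, and DecisionTree, to predict and evaluate the samples; see https://blog.csdn.net/wxq_1993/article/details/85703936.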

# 1. Import the required modules
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
import time
import warnings
warnings.filterwarnings('ignore')

# 2. Split X and y and take a quick look at the data
data_original = pd.read_csv("data_all.csv")
data_original.head(5)
data_original.describe()
# data_original.info()

	low_volume_percent	middle_volume_percent	take_amount_in_later_12_month_highest	trans_amount_increase_rate_lately	trans_activity_month	trans_activity_day	transd_mcc	trans_days_interval_filter	trans_days_interval	regional_mobility	...	consfin_product_count	consfin_max_limit	consfin_avg_limit	latest_query_day	loans_latest_day	reg_preference_for_trad	latest_query_time_month	latest_query_time_weekday	loans_latest_time_month	loans_latest_time_weekday
count	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	...	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.00000	4754.000000	4754.000000
mean	0.021801	0.901332	1940.197728	14.152318	0.804493	0.365356	17.503155	29.004628	21.748422	2.678797	...	5.088347	16418.973496	7507.426378	24.041649	51.984013	0.372949	4.273875	3.42196	4.542701	3.025873
std	0.041519	0.144837	3923.971494	693.961441	0.196920	0.170194	4.474686	22.711659	16.472031	0.890198	...	3.344794	13885.107357	5830.674623	36.500344	53.249364	0.687382	1.333778	1.93213	2.987731	1.895870
min	0.000000	0.000000	0.000000	0.000000	0.120000	0.033000	2.000000	0.000000	4.000000	1.000000	...	0.000000	0.000000	0.000000	-2.000000	-2.000000	0.000000	1.000000	0.00000	1.000000	0.000000
25%	0.010000	0.880000	0.000000	0.620000	0.670000	0.233000	15.000000	16.000000	12.000000	2.000000	...	3.000000	7800.000000	4200.000000	6.000000	7.000000	0.000000	4.000000	2.00000	3.000000	2.000000
50%	0.010000	0.960000	500.000000	0.970000	0.860000	0.350000	17.000000	23.000000	17.000000	3.000000	...	4.000000	14400.000000	6750.000000	16.000000	29.000000	0.000000	4.000000	4.00000	4.000000	3.000000
75%	0.020000	0.990000	2000.000000	1.600000	1.000000	0.479500	20.000000	32.000000	26.750000	3.000000	...	7.000000	20400.000000	9696.250000	23.000000	86.000000	1.000000	5.000000	5.00000	5.000000	5.000000
max	1.000000	1.000000	68000.000000	47596.740000	1.000000	0.941000	42.000000	285.000000	234.000000	5.000000	...	20.000000	266400.000000	82800.000000	360.000000	323.000000	4.000000	12.000000	6.00000	12.000000	6.000000

8 rows × 85 columns

y=data_original['status'].copy()
X=data_original.drop(['status'],axis=1).copy()
print("the X shape is:", X.shape)
print("the y shape is:", y.shape)
print("the nums of label 1 in y are",len(y[y==1]))
print("the nums of label 0 in y are",len(y[y==0]))
# Results table that the training loop fills in, one row per model
df_ret=pd.DataFrame(columns=('Model','Accuracy','AUC','Time'))
row=0

the X shape is: (4754, 84)
the y shape is: (4754,)
the nums of label 1 in y are 1193
the nums of label 0 in y are 3561

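There are 4754 samples with 84 features each; 1193 labels are 1 and 3561 are 0. The positive and negative classes are noticeably imbalanced, which later processing should take into account.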
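One common way to account for the imbalance at the splitting stage is a stratified split. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for `data_all.csv`, and the names `X_demo`, `y_demo`, and the ratio variables are assumptions made for this example, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data_all.csv with roughly 25% positive labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.75, 0.25], random_state=2018
)

# stratify=y_demo asks train_test_split to preserve the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2018, stratify=y_demo
)

ratio_all = y_demo.mean()
ratio_train = y_tr.mean()
ratio_test = y_te.mean()
print("positive ratio overall: %.4f" % ratio_all)
print("positive ratio in train: %.4f" % ratio_train)
print("positive ratio in test:  %.4f" % ratio_test)
```

With stratification the train and test positive ratios stay within a fraction of a percent of each other. Classifiers such as `RandomForestClassifier` also accept `class_weight='balanced'`, which is another option worth trying.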

# 3. Split the dataset 70/30
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2018)
print('the proportion of label 1 in y_test: %.2f%%'%(len(y_test[y_test==1])/len(y_test)*100))

the proportion of label 1 in y_test: 25.16%

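# 4. Define an evaluation function
def evaluate(y_pre,y):
    # Note: y_pre holds hard 0/1 predictions, so this AUC is computed from labels, not probabilities
    acc=accuracy_score(y,y_pre)
    auc=roc_auc_score(y,y_pre)
    return acc,auc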

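Because accuracy_score() and f1_score() were called so often in the first assignment, this second assignment wraps the metric calls in a single evaluation function for convenience.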
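`roc_auc_score` is normally given scores or probabilities. When it receives hard 0/1 predictions, the ROC curve has only one operating point, which tends to understate AUC. The following is a minimal sketch of a probability-based variant. The function name `evaluate_with_proba` and the synthetic data are assumptions for illustration, not the notebook's own code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_with_proba(model, X, y):
    """Hypothetical variant of evaluate(): accuracy from labels, AUC from probabilities."""
    y_pred = model.predict(X)
    # Column 1 of predict_proba is the estimated probability of the positive class (label 1).
    y_score = model.predict_proba(X)[:, 1]
    return accuracy_score(y, y_pred), roc_auc_score(y, y_score)


# Synthetic stand-in for the real dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
acc, auc_proba = evaluate_with_proba(clf, X_te, y_te)
auc_labels = roc_auc_score(y_te, clf.predict(X_te))  # same style as evaluate()

print("accuracy: %.4f" % acc)
print("AUC from probabilities: %.4f" % auc_proba)
print("AUC from hard labels:   %.4f" % auc_labels)
```

All four classifiers used in this post expose `predict_proba`, so the same function could be applied to each of them.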

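Here a problem came up: I copied the default parameters of the four classifiers (RF, GBDT, XGBoost, LightGBM) straight from the official documentation, and running the code raised an error complaining about Chinese characters or spaces, so I fell back to the short parameter lists below.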
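Errors like this often come from non-ASCII characters (full-width punctuation, non-breaking spaces) picked up when copying from a web page, though that is only a guess about this particular case. One way to avoid the copy-paste step entirely is to ask the estimator for its defaults with `get_params()`. The sketch below does this for the two scikit-learn classifiers. `XGBClassifier` and `LGBMClassifier` follow the scikit-learn estimator interface and offer the same method, but they are left out here to keep the example dependency-free.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Instantiate with no arguments and read the defaults back from the object itself.
rf_defaults = RandomForestClassifier().get_params()
gbdt_defaults = GradientBoostingClassifier().get_params()

for name, defaults in (("RandomForestClassifier", rf_defaults),
                       ("GradientBoostingClassifier", gbdt_defaults)):
    print(name)
    # Show only the parameters discussed in this post; learning_rate does not exist for RF.
    for key in ("n_estimators", "max_depth", "learning_rate"):
        if key in defaults:
            print("  %s = %r" % (key, defaults[key]))
```

The printed values depend on the installed scikit-learn version, which is another reason to read them from the library rather than from a web page.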

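# 5. Build the models and make predictions
# Use RF, GBDT, XGBoost, and LightGBM; since I am not yet familiar with these models, everything is left at default values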

rf_model=RandomForestClassifier(n_estimators=100,max_depth=None,criterion='gini')
gbdt_model=GradientBoostingClassifier(n_estimators=100,max_depth=3,learning_rate=0.1)
xgb_model=XGBClassifier(n_estimators=100,learning_rate=0.1,max_depth=3)
lgbm_model=LGBMClassifier(n_estimators=100,learning_rate=0.1,max_depth=-1)

# 6. Train the models
models=[('RF',rf_model),('gbdt',gbdt_model),('xgb',xgb_model),('lgbm',lgbm_model)]
for name,model in models:
    print(name,'start training.....')
    startTime=time.perf_counter()  # time.clock() was removed in Python 3.8
    model.fit(X_train,y_train)
    y_pred=model.predict(X_test)
    endTime=time.perf_counter()
    print(name,'using time is %.4f'%(endTime-startTime))
    acc,auc=evaluate(y_pred,y_test)
    print(name,'accuracy_score:',round(acc,4),'auc_score: ',round(auc,4))
    df_ret.loc[row]=[name,acc,auc,(endTime-startTime)]
    row+=1
    print('\n')
print(df_ret)

RF start training.....
RF using time is 1.3224
RF accuracy_score: 0.7849 auc_score:  0.6076


gbdt start training.....
gbdt using time is 1.3351
gbdt accuracy_score: 0.78 auc_score:  0.6376


xgb start training.....
xgb using time is 0.7749
xgb accuracy_score: 0.7856 auc_score:  0.6432


lgbm start training.....
lgbm using time is 0.7061
lgbm accuracy_score: 0.7701 auc_score:  0.631


  Model  Accuracy       AUC      Time
0    RF  0.784863  0.607558  1.322362
1  gbdt  0.779958  0.637566  1.335071
2   xgb  0.785564  0.643161  0.774934
3  lgbm  0.770147  0.631012  0.706147

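The results show that these four ensemble models do clearly better than the three models used the first time. **XGBoost performs best and LGBM is the fastest.** Because copying the default parameters raised an error, only three parameters were set for training; this will be improved in later training. Also, XGBoost and GBDT come up often in interviews, so they deserve particular attention.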
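As one possible direction for the later improvement mentioned above, the three hand-set parameters could be tuned with a small cross-validated grid search scored by AUC. This is a sketch under assumptions: the data is synthetic, the grid values are arbitrary, and only `GradientBoostingClassifier` is shown. It is not the tuning procedure used in this series.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the real training set (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, weights=[0.75, 0.25], random_state=2018
)

# Arbitrary illustrative grid over two of the parameters set by hand in step 5.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.1, random_state=2018),
    param_grid,
    scoring="roc_auc",  # AUC computed from predicted scores on each validation fold
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_auc = search.best_score_
print("best parameters:", best_params)
print("mean cross-validated AUC: %.4f" % best_auc)
```

Because XGBoost and LightGBM provide scikit-learn compatible classifiers, the same `GridSearchCV` pattern should carry over to them with their own parameter grids.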

References:

1. Ensemble models
2. XGBoost:
3. RandomForest:
4. GradientBoostingClassifier:
5. Installing xgboost:
6. https://zhuanlan.zhihu.com/p/54042675
