Computing accuracy and AUC with random forest, GBDT, XGBoost, and LightGBM
阿新 • Published: 2018-12-22
- Modules used
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score, roc_auc_score
from xgboost.sklearn import XGBClassifier
- Load the dataset
data_all = pd.read_csv('/home/infisa/wjht/project/DataWhale/data_all.csv', encoding='gbk')
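- Split into training and test sets
# Every column except the target 'status' is used as a feature
features = [x for x in data_all.columns if x not in ['status']]
X = data_all[features]
y = data_all['status']
# Hold out 30% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)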
- Build the models and compute accuracy
# Random forest
forest = RandomForestClassifier(n_estimators=100, random_state=2018)
forest.fit(X_train, y_train)
forest_y_score = forest.predict_proba(X_test)  # class probabilities, reused for the AUC
# print(forest_y_score[:, 1])
forest_score = forest.score(X_test, y_test)  # test-set accuracy
# print('forest_score:', forest_score)
# recorded output: ranfor_score:0.7820602662929222

# GBDT
Gbdt = GradientBoostingClassifier(random_state=2018)
Gbdt.fit(X_train, y_train)
# The original call was Gbdt.score(X_train, y_train), which measures training accuracy.
# Scoring on the test set keeps GBDT comparable with the other three models.
Gbdt_score = Gbdt.score(X_test, y_test)  # test-set accuracy
# print('Gbdt_score:', Gbdt_score)
# recorded output (training-set accuracy from the original call): Gbdt_score:0.8623384430417794

# XGBoost
Xgbc = XGBClassifier(random_state=2018)
Xgbc.fit(X_train, y_train)
y_xgbc_pred = Xgbc.predict(X_test)
Xgbc_score = accuracy_score(y_test, y_xgbc_pred)  # test-set accuracy
# print('Xgbc_score:', Xgbc_score)
# recorded output: Xgbc_score:0.7855641205325858

# LightGBM
gbm = lgb.LGBMClassifier(random_state=2018)
gbm.fit(X_train, y_train)
y_gbm_pred = gbm.predict(X_test)
gbm_score = accuracy_score(y_test, y_gbm_pred)  # test-set accuracy
# print('gbm_score:', gbm_score)
# recorded output: gbm_score:0.7701471618780659
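Accuracy measured on the rows a model was fitted on is usually higher than accuracy on held-out rows, so the two should not be compared across models. The following is a minimal sketch of that gap, assuming synthetic data from `make_classification` as a stand-in for `data_all.csv`; the variable names `train_acc` and `test_acc` are illustrative and do not appear in the post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption; it only stands in for the post's dataset).
# flip_y adds label noise so the model cannot be perfect on unseen rows.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=2018)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2018)

gbdt = GradientBoostingClassifier(random_state=2018).fit(X_train, y_train)

train_acc = gbdt.score(X_train, y_train)  # accuracy on rows seen during fitting
test_acc = gbdt.score(X_test, y_test)     # accuracy on held-out rows

print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Only the held-out figure says anything about how the model generalizes, which is why all four models should be scored on `X_test`.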
- Compute AUC
y_test_hot = label_binarize(y_test, classes=(0, 1))  # binarize the test labels into a single 0/1 column
Gbdt_y_score = Gbdt.decision_function(X_test)  # GBDT decision scores for the test set (scores, not loss values)
forest_fpr, forest_tpr, forest_thresholds = metrics.roc_curve(y_test_hot.ravel(), forest_y_score[:, 1].ravel())  # ROC curve points; forest_thresholds holds the thresholds
Gbdt_fpr, Gbdt_tpr, Gbdt_thresholds = metrics.roc_curve(y_test_hot.ravel(), Gbdt_y_score.ravel())  # ROC curve points; Gbdt_thresholds holds the thresholds
forest_auc = metrics.auc(forest_fpr, forest_tpr)  # random forest AUC
# print('forest_auc', forest_auc)
# recorded output: forest_auc 0.7491366989035293
Gbdt_auc = metrics.auc(Gbdt_fpr, Gbdt_tpr)  # GBDT AUC
# print('Gbdt_auc:', Gbdt_auc)
# recorded output: Gbdt_auc:0.7633094425839567
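# AUC should be computed from scores or probabilities, not hard class labels.
# The original passed the output of predict (y_xgbc_pred, y_gbm_pred) to roc_auc_score.
Xgbc_auc = roc_auc_score(y_test, Xgbc.predict_proba(X_test)[:, 1])  # XGBoost AUC
# print('Xgbc_auc:', Xgbc_auc)
# recorded output (computed from hard labels in the original): Xgbc_auc:0.6431606209508309
gbm_auc = roc_auc_score(y_test, gbm.predict_proba(X_test)[:, 1])  # LightGBM AUC
# print('gbm_auc:', gbm_auc)
# recorded output (computed from hard labels in the original): gbm_auc:0.6310118097503468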
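An AUC computed from hard class labels (the output of `predict`) generally differs from one computed from scores or probabilities, because thresholding throws away the ranking among samples on the same side of the cutoff. A small hand-made example, with made-up labels and scores that are not from the post's data, shows the effect:

```python
from sklearn.metrics import roc_auc_score

# Toy data (hypothetical): two negatives followed by two positives.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.6, 0.7, 0.9]               # predicted probability of class 1
y_label = [int(s >= 0.5) for s in y_score]   # hard labels at a 0.5 cutoff

# Every positive is scored above every negative, so the ranking is perfect.
auc_from_scores = roc_auc_score(y_true, y_score)

# After thresholding, the negative at 0.6 ties with both positives.
auc_from_labels = roc_auc_score(y_true, y_label)

print(auc_from_scores, auc_from_labels)  # 1.0 0.75
```

The same classifier therefore looks worse when only its labels are passed in, so AUC values obtained the two ways should not be put side by side.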
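- Brief analysis
Taking the recorded accuracy and AUC values for random forest, GBDT, XGBoost, and LightGBM together, GBDT shows the highest figures (score: 0.8623384430417794, AUC: 0.7633094425839567). Two caveats apply. The GBDT score was computed on the training set, so it is not comparable with the test-set accuracies of the other three models. The XGBoost and LightGBM AUC values were computed from hard class predictions rather than probabilities, which typically understates them relative to the random forest and GBDT values. Among the figures that were computed the same way, GBDT's AUC (0.7633) is slightly ahead of random forest's (0.7491), and XGBoost has the highest test accuracy (0.7856).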
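For a comparison like this to be fair, every model needs to be measured the same way. One possible shape for that is a single helper that fits a model, then reports test accuracy and probability-based AUC. This is a sketch under assumptions: the function name `evaluate` and the synthetic data are hypothetical, and only the two scikit-learn models are run here. `XGBClassifier` and `LGBMClassifier` follow the same `fit` / `predict` / `predict_proba` interface, so they could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model, then return test-set accuracy and AUC (hypothetical helper)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)               # hard labels, for accuracy
    y_prob = model.predict_proba(X_test)[:, 1]   # class-1 probability, for AUC
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),
    }


# Synthetic binary data standing in for data_all.csv (an assumption).
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=2018)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2018)

models = {
    "forest": RandomForestClassifier(n_estimators=100, random_state=2018),
    "gbdt": GradientBoostingClassifier(random_state=2018),
}

results = {name: evaluate(m, X_train, y_train, X_test, y_test)
           for name, m in models.items()}

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.4f}, auc={scores['auc']:.4f}")
```

Routing every model through one function makes it hard to score one of them on the wrong split or with the wrong kind of prediction by accident.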
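- Reflections
My understanding of these four models is still shallow. I now know random forest and GBDT reasonably well, but I have only a basic grasp of LightGBM and XGBoost, and many of their parameters are still unclear to me.
- Articles consulted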
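sklearn random forest classifier class RandomForestClassifier
LightGBM: a brief account of its principles and improvements
Python machine learning case study tutorial series: the LightGBM algorithm
Understanding what the AUC metric means
Machine learning with sklearn 0.19: ensemble learning: bagging and the random forest algorithm
Ensemble learning: a summary of the principles of the Adaboost algorithm
Sklearn-GBDT (GradientBoostingDecisonTree): gradient boosted trees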