Data Mining Algorithms and Practice (22): A LightGBM Ensemble Algorithm Case Study (Cancer Dataset)
阿新 · Published 2021-01-24
This section is a simple example of building a LightGBM model on the breast cancer dataset from sklearn's datasets module. For background on ensemble learning, see: Data Mining Algorithms and Practice (18): Ensemble Learning Algorithms (Boosting, Bagging). LightGBM is a very commonly used algorithm.
1. Import common packages
import datetime
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
%matplotlib inline
2. Load the dataset
# Load the dataset
breast = load_breast_cancer()
# Get the feature matrix and target values
X, y = breast.data, breast.target
# Get the feature names
feature_name = breast.feature_names
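As a quick sanity check (not part of the original post), the loaded dataset can be inspected before modeling. This sketch confirms the shape and the class balance of the binary target (0 = malignant, 1 = benign):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()
X, y = breast.data, breast.target

# 569 samples with 30 numeric features; the target is binary
print(X.shape)                    # (569, 30)
print(np.bincount(y))             # class counts: [212 357]
print(breast.feature_names[:3])   # first few feature names
```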
3. Data preprocessing
The data is a fairly clean toy dataset, so no elaborate preprocessing is needed.
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Convert to LightGBM's Dataset format
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
4. Modeling and parameters
# Parameter settings
boost_round = 50        # number of boosting iterations
early_stop_rounds = 10  # stop early if validation scores do not improve for this many rounds
params = {
    'boosting_type': 'gbdt',    # boosting type
    'objective': 'regression',  # objective function
    'metric': {'l2', 'auc'},    # evaluation metrics
    'num_leaves': 31,           # maximum number of leaves per tree
    'learning_rate': 0.05,      # learning rate
    'feature_fraction': 0.9,    # fraction of features sampled when building each tree
    'bagging_fraction': 0.8,    # fraction of data sampled when building each tree
    'bagging_freq': 5,          # k means perform bagging every k iterations
    'verbose': 1                # <0: fatal only, =0: errors (warnings), >0: info
}

# Model training, with early stopping enabled
results = {}
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=boost_round,
                valid_sets=(lgb_eval, lgb_train),
                valid_names=('validate', 'train'),
                early_stopping_rounds=early_stop_rounds,
                evals_result=results)
Training output:
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001093 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 4548
[LightGBM] [Info] Number of data points in the train set: 455, number of used features: 30
[LightGBM] [Info] Start training from score 0.637363
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] train's auc: 0.984943 train's l2: 0.21292 validate's auc: 0.98825 validate's l2: 0.225636
Training until validation scores don't improve for 10 rounds
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[2] train's auc: 0.990805 train's l2: 0.196278 validate's auc: 0.992855 validate's l2: 0.208124
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[3] train's auc: 0.990324 train's l2: 0.181505 validate's auc: 0.992379 validate's l2: 0.192562
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[4] train's auc: 0.990439 train's l2: 0.168012 validate's auc: 0.993966 validate's l2: 0.178022
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[5] train's auc: 0.990376 train's l2: 0.15582 validate's auc: 0.993014 validate's l2: 0.164942
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[6] train's auc: 0.990752 train's l2: 0.144636 validate's auc: 0.993649 validate's l2: 0.152745
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[7] train's auc: 0.991641 train's l2: 0.134404 validate's auc: 0.993331 validate's l2: 0.142248
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[8] train's auc: 0.992571 train's l2: 0.124721 validate's auc: 0.992379 validate's l2: 0.132609
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[9] train's auc: 0.992884 train's l2: 0.116309 validate's auc: 0.991743 validate's l2: 0.123573
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[10] train's auc: 0.992989 train's l2: 0.108757 validate's auc: 0.992696 validate's l2: 0.115307
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[11] train's auc: 0.993156 train's l2: 0.101871 validate's auc: 0.991743 validate's l2: 0.108458
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[12] train's auc: 0.99348 train's l2: 0.0954168 validate's auc: 0.99222 validate's l2: 0.101479
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[13] train's auc: 0.993396 train's l2: 0.0897573 validate's auc: 0.99222 validate's l2: 0.0956762
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[14] train's auc: 0.993605 train's l2: 0.0846034 validate's auc: 0.992855 validate's l2: 0.0898012
Early stopping, best iteration is:
[4] train's auc: 0.990439 train's l2: 0.168012 validate's auc: 0.993966 validate's l2: 0.178022
5. Model application and evaluation
# Model prediction
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
y_pred
# Model evaluation: plot metrics over iterations
lgb.plot_metric(results)
plt.show()
# Plot feature importance
lgb.plot_importance(gbm, importance_type="split")
plt.show()