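LightGBM Parameter Tuning Notes
1. Overview
XGBoost is a very popular algorithm in competitions and a go-to weapon in many contests, but in practice it is slow to train and uses a lot of memory. In January 2017 Microsoft open-sourced LightGBM on GitHub. Without lowering accuracy, it runs roughly 10 times faster and uses roughly one third of the memory. LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision trees. It can be used for ranking, classification, regression, and many other machine learning tasks. For the underlying theory and usage details, see the LightGBM Chinese documentation.
This note covers two ways of tuning LightGBM.
The tables below describe the important parameters and how to use them.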
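Learning control parameter | Meaning | Usage |
---|---|---|
max_depth | Maximum depth of a tree | When the model overfits, consider lowering max_depth first |
min_data_in_leaf | Minimum number of records a leaf may have | Default 20; raise it to deal with overfitting |
feature_fraction | For example, 0.8 means that 80% of the features are randomly selected in each iteration to build trees | Used when boosting is random forest |
bagging_fraction | Fraction of the data used in each iteration | Speeds up training and reduces overfitting |
early_stopping_round | Training stops if one metric of one validation set has not improved in the last early_stopping_round rounds | Speeds up analysis and avoids excessive iterations |
lambda | Specifies regularization | 0~1 |
min_gain_to_split | Minimum gain required to make a split | Controls which splits of the tree are useful |
max_cat_group | Finds split points on group boundaries | Use when there are many categories, since finding split points then overfits easily |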
Core parameter | Meaning | Usage |
---|---|---|
task | What the run is for | train or predict |
application | Purpose of the model | regression: regression; binary: binary classification; multiclass: multi-class classification |
boosting | Boosting algorithm to use | gbdt; rf: random forest; dart: Dropouts meet Multiple Additive Regression Trees; goss: Gradient-based One-Side Sampling |
num_boost_round | Number of boosting iterations | Usually 100+ |
learning_rate | Shrinkage rate, i.e. how much each new tree contributes to the model | Commonly 0.1, 0.001, 0.003… |
num_leaves | Maximum number of leaves in one tree | Default 31 |
device | Device used for training | cpu or gpu |
metric | Evaluation metric | mae: mean absolute error; mse: mean squared error; binary_logloss: loss for binary classification; multi_logloss: loss for multi-class classification |
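IO parameter | Meaning |
---|---|
max_bin | Maximum number of bins that feature values are bucketed into |
categorical_feature | If categorical_feature = 0,1,2, then columns 0, 1, and 2 are categorical variables |
ignore_column | Like categorical_feature, except that the listed columns are ignored entirely instead of being treated as categorical |
save_binary | When true, the dataset is saved as a binary file so that it loads faster next time |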
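To make max_bin more concrete: LightGBM trains on histograms, so each numeric feature is first bucketed into at most max_bin bins. The sketch below is a simplified equal-frequency illustration in plain Python. The helper `bin_feature` and the sample values are hypothetical, and this is not LightGBM's actual binning routine.

```python
from bisect import bisect_right

def bin_feature(values, max_bin):
    """Map each value to a bin index in [0, max_bin) using equal-frequency cut points."""
    ordered = sorted(set(values))
    n_bins = min(max_bin, len(ordered))
    # Lower boundaries of every bin except the first one.
    cuts = [ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)]
    return [bisect_right(cuts, v) for v in values]

values = [0.1, 0.4, 0.35, 0.8, 0.95, 0.2, 0.6, 0.7]
coarse = bin_feature(values, max_bin=2)  # fewer bins: faster, less detail
fine = bin_feature(values, max_bin=4)    # more bins: slower, finer split candidates
print(coarse)
print(fine)
```

A smaller max_bin leaves fewer candidate split points per feature, which is why it appears under both speed and over-fitting control.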
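Tuning
Parameter | Meaning |
---|---|
num_leaves | Should be <= 2^(max_depth); values above this can lead to overfitting |
min_data_in_leaf | A larger value avoids growing trees that are too deep, but may cause underfitting; on large datasets set it to hundreds or thousands |
max_depth | Also limits the depth of the tree |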
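The num_leaves rule follows from simple arithmetic: a binary tree of depth d has at most 2^d leaves. As a small sketch, the hypothetical helpers below compute that cap and filter a candidate grid (the ranges are the ones searched later in this note) down to the combinations that respect it.

```python
def max_useful_leaves(max_depth):
    """A binary tree of depth max_depth has at most 2**max_depth leaves."""
    return 2 ** max_depth

def within_depth_budget(num_leaves, max_depth):
    """True when num_leaves does not exceed what max_depth can hold."""
    return num_leaves <= max_useful_leaves(max_depth)

for depth in (3, 4, 5):
    print(depth, max_useful_leaves(depth))

# Keep only (max_depth, num_leaves) pairs that satisfy the rule.
candidates = [
    (d, n)
    for d in range(3, 8)
    for n in range(5, 100, 5)
    if within_depth_budget(n, d)
]
print(len(candidates))
```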
The table below lists the parameters to adjust for three goals: faster speed, better accuracy, and dealing with over-fitting.
| Faster speed | Better accuracy | Over-fitting |
|---|---|---|
| Use a smaller max_bin | Use a larger max_bin | Use a smaller max_bin |
| | Use a larger num_leaves | Use a smaller num_leaves |
| Use feature_fraction for sub-sampling | | Use feature_fraction |
| Use bagging_fraction and bagging_freq | | Set bagging_fraction and bagging_freq |
| | Use more training data | Use more training data |
| Use save_binary to speed up data loading | Use categorical features directly | Use min_data_in_leaf and min_sum_hessian_in_leaf |
| Use parallel learning | Use dart | Regularize with lambda_l1, lambda_l2, and min_gain_to_split |
| | Use a larger num_iterations with a smaller learning_rate | Use max_depth to limit tree depth |
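2. Tuning with GridSearchCV
Tuning LightGBM is similar to tuning RF, GBDT, and related models. The basic workflow is:
- Start with a relatively high learning rate, around 0.1, to speed up convergence. This is necessary to keep tuning practical.
- Tune the basic decision-tree parameters.
- Tune the regularization parameters.
- Finally, lower the learning rate to improve the final accuracy.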
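The workflow above amounts to a staged search: tune one group of parameters at a time and freeze the best values before moving to the next group. Below is a minimal sketch of that idea in plain Python. `staged_search` and `toy_score` are hypothetical, and the toy scoring function merely stands in for a cross-validated score.

```python
from itertools import product

def staged_search(base_params, stages, score_fn):
    """Tune one group of parameters at a time, freezing the best values found so far."""
    params = dict(base_params)
    best_score = score_fn(params)
    for grid in stages:
        names = list(grid)
        best_combo = None
        for combo in product(*(grid[name] for name in names)):
            trial = {**params, **dict(zip(names, combo))}
            score = score_fn(trial)
            if score > best_score:
                best_score, best_combo = score, combo
        if best_combo is not None:
            params.update(zip(names, best_combo))
    return params, best_score

# Toy stand-in for a CV score (made up; peaks at depth 4, 10 leaves, lambda_l2 = 0.1).
def toy_score(p):
    return -abs(p["max_depth"] - 4) - abs(p["num_leaves"] - 10) / 10 - abs(p["lambda_l2"] - 0.1)

stages = [
    {"max_depth": range(3, 8), "num_leaves": range(5, 100, 5)},  # tree parameters first
    {"lambda_l2": [0.0, 0.1, 0.5, 1.0]},                         # then regularization
]
best, score = staged_search({"max_depth": 6, "num_leaves": 31, "lambda_l2": 0.0}, stages, toy_score)
print(best, score)
```

A staged search is much cheaper than a joint grid, at the cost of possibly missing interactions between parameters tuned in different stages.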
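Step 1: learning rate and number of iterations
First fix the learning rate at a relatively high value, here learning_rate = 0.1. Next choose the booster type via boosting/boost/boosting_type; the default, gbdt, is the usual choice.
The number of iterations, which is also the number of residual trees, is set by n_estimators/num_iterations/num_round/num_boost_round. We can set it to a large number first and then read the best number of iterations from the cv results, as the code shows.
Before that, the other important parameters need initial values. The exact values matter little; they only make it possible to determine the remaining parameters. The initial values are given below.
The following parameters depend on the project:
'boosting_type'/'boosting': 'gbdt'
'objective': 'binary'
'metric': 'auc'
These are the initial values I chose:
'max_depth': 5 # the dataset is not very large, so a moderate value is used; anything from 4 to 10 would do
'num_leaves': 30 # LightGBM grows trees leaf-wise; the official guidance is to keep this below 2^max_depth
'subsample'/'bagging_fraction': 0.8 # row (data) subsampling
'colsample_bytree'/'feature_fraction': 0.8 # feature subsampling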
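Several of the names above are aliases of one another, which makes it easy to set the same parameter twice under different names. As a small sketch, the hypothetical helper `canonicalize` below maps the aliases mentioned in this note to one name each and raises on duplicates. The alias table is limited to the names used here and is not a complete list.

```python
# Canonical LightGBM name -> aliases mentioned in this note (not exhaustive).
ALIASES = {
    "boosting": ["boosting_type", "boost"],
    "num_iterations": ["n_estimators", "num_round", "num_boost_round"],
    "bagging_fraction": ["subsample"],
    "feature_fraction": ["colsample_bytree"],
}

def canonicalize(params):
    """Return a copy of params with known aliases replaced by canonical names."""
    lookup = {alias: name for name, aliases in ALIASES.items() for alias in aliases}
    result = {}
    for key, value in params.items():
        name = lookup.get(key, key)
        if name in result:
            raise ValueError(f"{name!r} is set more than once (via {key!r})")
        result[name] = value
    return result

clean = canonicalize(
    {"boosting_type": "gbdt", "subsample": 0.8, "colsample_bytree": 0.8, "n_estimators": 188}
)
print(clean)
```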
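Now use LightGBM's cv function to determine the number of iterations:
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

# Load the breast cancer dataset and hold out 20% as a test set
canceData = load_breast_cancer()
X = canceData.data
y = canceData.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'nthread': 4,
    'learning_rate': 0.1,
    'num_leaves': 30,
    'max_depth': 5,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    # Early stopping goes through params; the early_stopping_rounds argument
    # of lgb.cv was removed in LightGBM 4.0
    'early_stopping_round': 50,
}

data_train = lgb.Dataset(X_train, y_train)
cv_results = lgb.cv(params, data_train, num_boost_round=1000, nfold=5, stratified=False, shuffle=True, metrics='auc', seed=0)
# The result key is 'auc-mean' in older LightGBM releases and 'valid auc-mean' in 4.x
auc_key = [k for k in cv_results if k.endswith('auc-mean')][0]
print('best n_estimators:', len(cv_results[auc_key]))
print('best cv score:', pd.Series(cv_results[auc_key]).max())
The output is: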
('best n_estimators:', 188)
('best cv score:', 0.99134716298085424)
Based on these results, we take n_estimators=188.
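The length of the cv result list can be read as the best number of rounds because, with early stopping, the history is cut back to the best iteration. The plain-Python sketch below illustrates that stopping rule for a higher-is-better metric. The helper `best_round` and the score list are made up for illustration.

```python
def best_round(scores, patience):
    """Return (best_round, rounds_run) for a higher-is-better metric with early stopping."""
    best_idx = 0
    for i, score in enumerate(scores):
        if score > scores[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_idx + 1, i + 1
    return best_idx + 1, len(scores)

mean_auc = [0.90, 0.95, 0.97, 0.98, 0.979, 0.978, 0.98, 0.977, 0.976]
kept, ran = best_round(mean_auc, patience=3)
print(kept, ran)  # rounds kept in the history vs. rounds actually trained
```

With early_stopping_round set to 50 and 188 rounds kept, this reading implies training ran for roughly 50 more rounds before stopping.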
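Step 2: determine max_depth and num_leaves
These are the most important parameters for improving accuracy. Here we use the GridSearchCV() function from sklearn to search.
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

params_test1 = {'max_depth': range(3, 8, 1), 'num_leaves': range(5, 100, 5)}
gsearch1 = GridSearchCV(estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1, n_estimators=188, max_depth=6, bagging_fraction=0.8, feature_fraction=0.8),
                        param_grid=params_test1, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch1.fit(X_train, y_train)
# grid_scores_ (which printed the listing shown here) was removed in scikit-learn 0.20;
# cv_results_ holds the same per-candidate scores
gsearch1.cv_results_['mean_test_score'], gsearch1.best_params_, gsearch1.best_score_
The results are as follows (the output is long, so only part is shown):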
([mean: 0.99248, std: 0.01033, params: {'num_leaves': 5, 'max_depth': 3},
mean: 0.99227, std: 0.01013, params: {'num_leaves': 10, 'max_depth': 3},
mean: 0.99227, std: 0.01013, params: {'num_leaves': 15, 'max_depth': 3},
······
mean: 0.99331, std: 0.00775, params: {'num_leaves': 85, 'max_depth': 7},
mean: 0.99331, std: 0.00775, params: {'num_leaves': 90, 'max_depth': 7},
mean: 0.99331, std: 0.00775, params: {'num_leaves': 95, 'max_depth': 7}],
{'max_depth': 4, 'num_leaves': 10},
0.9943573667711598)
Based on the results, we take max_depth=4 and num_leaves=10.
Step 3: determine min_data_in_leaf and max_bin
params_test2 = {'max_bin': range(5, 256, 10), 'min_data_in_leaf': range(1, 102, 10)}
gsearch2 = GridSearchCV(estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1, n_estimators=188, max_depth=4, num_leaves=10, bagging_fraction=0.8, feature_fraction=0.8),
                        param_grid=params_test2, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch2.fit(X_train, y_train)
# cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20
gsearch2.cv_results_['mean_test_score'], gsearch2.best_params_, gsearch2.best_score_
The results are as follows (the output is long, so only part is shown):
([mean: 0.98715, std: 0.01044, params: {'min_data_in_leaf': 1, 'max_bin': 5},
mean: 0.98809, std: 0.01095, params: {'min_data_in_leaf': 11, 'max_bin': 5},
mean: 0.98809, std: 0.00952, params: {'min_data_in_leaf': 21, 'max_bin': 5},
······
mean: 0.99363, std: 0.00812, params: {'min_data_in_leaf': 81, 'max_bin': 255},
mean: 0.99133, std: 0.00788, params: {'min_data_in_leaf': 91, 'max_bin': 255},
mean: 0.98882, std: 0.01223, params: {'min_data_in_leaf': 101, 'max_bin': 255}],
{'max_bin': 15, 'min_data_in_leaf': 51},
0.9952978056426331)
Based on the results, we take min_data_in_leaf=51 and max_bin=15.
Step 4: determine feature_fraction, bagging_fraction, and bagging_freq
params_test3 = {'feature_fraction': [0.6, 0.7, 0.8, 0.9, 1.0],
                'bagging_fraction': [0.6, 0.7, 0.8, 0.9, 1.0],
                'bagging_freq': range(0, 81, 10)
                }
gsearch3 = GridSearchCV(estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1, n_estimators=188, max_depth=4, num_leaves=10, max_bin=15, min_data_in_leaf=51),
                        param_grid=params_test3, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch3.fit(X_train, y_train)
# cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20
gsearch3.cv_results_['mean_test_score'], gsearch3.best_params_, gsearch3.best_score_
The results are as follows (the output is long, so only part is shown):
([mean: 0.99467, std: 0.00710, params: {'bagging_freq': 0, 'bagging_fraction': 0.6, 'feature_fraction': 0.6},
mean: 0.99415, std: 0.00804, params: {'bagging_freq': 0, 'bagging_fraction': 0.6, 'feature_fraction': 0.7},
mean: 0.99530, std: 0.00722, params: {'bagging_freq': 0, 'bagging_fraction': 0.6, 'feature_fraction': 0.8},
······
mean: 0.99530, std: 0.00722, params: {'bagging_freq': 80, 'bagging_fraction': 1.0, 'feature_fraction': 0.8},
mean: 0.99383, std: 0.00731, params: {'bagging_freq': 80, 'bagging_fraction': 1.0, 'feature_fraction': 0.9},
mean: 0.99383, std: 0.00766, params: {'bagging_freq': 80, 'bagging_fraction': 1.0, 'feature_fraction': 1.0}],
{'bagging_fraction': 0.6, 'bagging_freq': 0, 'feature_fraction': 0.8},
0.9952978056426331)
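One thing to keep in mind when reading this grid: according to the LightGBM parameter documentation, bagging is only performed when bagging_freq is non-zero and bagging_fraction is below 1.0. Grid points that differ only in a disabled bagging setting therefore describe the same model, which is consistent with the identical score 0.99530 at feature_fraction 0.8 appearing at both ends of the listing. The sketch below counts how many grid points are actually distinct; `effective_config` is a hypothetical helper.

```python
from itertools import product

feature_fractions = [0.6, 0.7, 0.8, 0.9, 1.0]
bagging_fractions = [0.6, 0.7, 0.8, 0.9, 1.0]
bagging_freqs = range(0, 81, 10)

def effective_config(feature_fraction, bagging_fraction, bagging_freq):
    """Collapse all settings where bagging is disabled into one representative."""
    if bagging_freq == 0 or bagging_fraction >= 1.0:
        return (feature_fraction, 1.0, 0)
    return (feature_fraction, bagging_fraction, bagging_freq)

grid = list(product(feature_fractions, bagging_fractions, bagging_freqs))
distinct = {effective_config(*point) for point in grid}
print(len(grid), len(distinct))
```

Under that assumption, a reported best of bagging_freq=0 means that bagging was effectively switched off, so the accompanying bagging_fraction value carries no information.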
Step 5: determine lambda_l1 and lambda_l2
# Note: 1e-1 and 0.1 are the same value, so it is searched twice in each list
params_test4 = {'lambda_l1': [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
                'lambda_l2': [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
                }
gsearch4 = GridSearchCV(estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1, n_estimators=188, max_depth=4, num_leaves=10, max_bin=15, min_data_in_leaf=51, bagging_fraction=0.6, bagging_freq=0, feature_fraction=0.8),
                        param_grid=params_test4, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch4.fit(X_train, y_train)
# cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20
gsearch4.cv_results_['mean_test_score'], gsearch4.best_params_, gsearch4.best_score_
The results are as follows (the output is long, so only part is shown):
([mean: 0.99530, std: 0.00722, params: {'lambda_l1': 1e-05, 'lambda_l2': 1e-05},
mean: 0.99415, std: 0.00804, params: {'lambda_l1': 1e-05, 'lambda_l2': 0.001},
mean: 0.99331, std: 0.00826, params: {'lambda_l1': 1e-05, 'lambda_l2': 0.1},
·····
mean: 0.99049, std: 0.01047, params: {'lambda_l1': 1.0, 'lambda_l2': 0.7},
mean: 0.99049, std: 0.01013, params: {'lambda_l1': 1.0, 'lambda_l2': 0.9},
mean: 0.99070, std: 0.01071, params: {'lambda_l1': 1.0, 'lambda_l2': 1.0}],
{'lambda_l1': 1e-05, 'lambda_l2': 1e-05},
0.9952978056426331)
Step 6: determine min_split_gain
params_test5 = {'min_split_gain': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
gsearch5 = GridSearchCV(estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1, n_estimators=188, max_depth=4, num_leaves=10, max_bin=15, min_data_in_leaf=51, bagging_fraction=0.6, bagging_freq=0, feature_fraction=0.8,
                                                     lambda_l1=1e-05, lambda_l2=1e-05),
                        param_grid=params_test5, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch5.fit(X_train, y_train)
# cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20
gsearch5.cv_results_['mean_test_score'], gsearch5.best_params_, gsearch5.best_score_
The results are:
([mean: 0.99530, std: 0.00722, params: {'min_split_gain': 0.0},
mean: 0.99415, std: 0.00810, params: {'min_split_gain': 0.1},
mean: 0.99394, std: 0.00898, params: {'min_split_gain': 0.2},
mean: 0.99373, std: 0.00918, params: {'min_split_gain': 0.3},
mean: 0.99404, std: 0.00845, params: {'min_split_gain': 0.4},
mean: 0.99300, std: 0.00958, params: {'min_split_gain': 0.5},
mean: 0.99258, std: 0.00960, params: {'min_split_gain': 0.6},
mean: 0.99227, std: 0.01071, params: {'min_split_gain': 0.7},
mean: 0.99342, std: 0.00872, params: {'min_split_gain': 0.8},
mean: 0.99206, std: 0.01062, params: {'min_split_gain': 0.9},
mean: 0.99206, std: 0.01064, params: {'min_split_gain': 1.0}],
{'min_split_gain': 0.0},
0.9952978056426331)
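To put the cost of steps 2 to 6 in perspective, the sketch below counts the candidates in each grid and the number of model fits with 5-fold cross-validation, and compares that with a single joint grid over all parameters. The grids are copied from the steps above; the variable names are only illustrative.

```python
from math import prod

fractions = [0.6, 0.7, 0.8, 0.9, 1.0]
lambdas = [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]

grids = {
    "step 2": {"max_depth": range(3, 8, 1), "num_leaves": range(5, 100, 5)},
    "step 3": {"max_bin": range(5, 256, 10), "min_data_in_leaf": range(1, 102, 10)},
    "step 4": {"feature_fraction": fractions, "bagging_fraction": fractions, "bagging_freq": range(0, 81, 10)},
    "step 5": {"lambda_l1": lambdas, "lambda_l2": lambdas},
    "step 6": {"min_split_gain": [i / 10 for i in range(11)]},
}
N_FOLDS = 5

# Number of candidates per step = product of the grid dimensions.
sizes = {step: prod(len(values) for values in grid.values()) for step, grid in grids.items()}
staged_fits = N_FOLDS * sum(sizes.values())
joint_candidates = prod(sizes.values())

print(sizes)
print("fits in the staged search:", staged_fits)
print("candidates in one joint grid:", joint_candidates)
```

The staged approach needs a few thousand fits, whereas a joint grid over the same values would have billions of candidates.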
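Step 7: lower the learning rate, increase the number of iterations, and validate the model
from sklearn import metrics  # needed for accuracy_score and roc_auc_score

model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.01, n_estimators=1000, max_depth=4, num_leaves=10, max_bin=15, min_data_in_leaf=51, bagging_fraction=0.6, bagging_freq=0, feature_fraction=0.8,
                           lambda_l1=1e-05, lambda_l2=1e-05, min_split_gain=0)
model.fit(X_train, y_train)
y_pre = model.predict(X_test)
print("acc:", metrics.accuracy_score(y_test, y_pre))
# Note: the AUC here is computed from the hard 0/1 predictions, not from probabilities
print("auc:", metrics.roc_auc_score(y_test, y_pre))
The results are: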
('acc:', 0.97368421052631582)
('auc:', 0.9744363289933311)
With the default parameters, the model performs as follows:
model=lgb.LGBMClassifier()
model.fit(X_train,y_train)
y_pre=model.predict(X_test)
print("acc:",metrics.accuracy_score(y_test,y_pre))
print("auc:",metrics.roc_auc_score(y_test,y_pre))
('acc:', 0.96491228070175439)
('auc:', 0.96379803112099083)
We can see that both the accuracy and the AUC score have improved.
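One caveat about the AUC figures: they are computed from `model.predict`, which returns hard 0/1 labels, whereas AUC is normally computed from scores or probabilities. The plain-Python sketch below shows how the two can differ. The function `auc` is a simple pairwise implementation written for illustration, and the labels and probabilities are made up.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.10, 0.40, 0.55, 0.45, 0.80, 0.90]
hard = [int(p >= 0.5) for p in proba]  # what a 0.5-threshold predict() would return

auc_proba = auc(y_true, proba)
auc_hard = auc(y_true, hard)
print("AUC from probabilities:", auc_proba)
print("AUC from hard labels:  ", auc_hard)
```

With scikit-learn, the probability-based version corresponds to passing `model.predict_proba(X_test)[:, 1]` to `roc_auc_score` instead of `model.predict(X_test)`.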
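3. Tuning with LightGBM's cv function
This approach is less work: once the code is written, the search runs automatically. It does require some tuning experience, though, because choosing good parameter ranges takes skill. The code is given directly below.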
import pandas as pd
import lightgbm as lgb
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

canceData = load_breast_cancer()
X = canceData.data
y = canceData.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

### Convert the data
print('Converting data')
# free_raw_data=False keeps the raw data so the Dataset can be rebuilt when max_bin changes
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, free_raw_data=False)

### Initial parameters (cross-validation arguments not included)
print('Setting parameters')
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'nthread': 4,
    'learning_rate': 0.1,
    # Early stopping goes through params; the early_stopping_rounds and verbose_eval
    # arguments of lgb.cv were removed in LightGBM 4.0
    'early_stopping_round': 10,
}

def cv_auc(params):
    """Run 5-fold CV and return the best mean AUC and the round index where it occurs."""
    cv_results = lgb.cv(params, lgb_train, seed=1, nfold=5, metrics=['auc'])
    # The result key is 'auc-mean' in older LightGBM releases and 'valid auc-mean' in 4.x
    auc_key = [k for k in cv_results if k.endswith('auc-mean')][0]
    series = pd.Series(cv_results[auc_key])
    return series.max(), series.idxmax()

def keep_best(*names):
    """Fix a stage's parameters at their best values, or drop them if nothing improved."""
    for name in names:
        if name in best_params:
            params[name] = best_params[name]
        else:
            params.pop(name, None)

### Cross-validation (tuning)
print('Cross-validation')
max_auc = float('0')
best_params = {}

# Accuracy
print("Tuning 1: improve accuracy")
for num_leaves in range(5, 100, 5):
    for max_depth in range(3, 8, 1):
        params['num_leaves'] = num_leaves
        params['max_depth'] = max_depth
        mean_auc, boost_rounds = cv_auc(params)
        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['num_leaves'] = num_leaves
            best_params['max_depth'] = max_depth
keep_best('num_leaves', 'max_depth')

# Overfitting
print("Tuning 2: reduce overfitting")
for max_bin in range(5, 256, 10):
    for min_data_in_leaf in range(1, 102, 10):
        params['max_bin'] = max_bin
        params['min_data_in_leaf'] = min_data_in_leaf
        mean_auc, boost_rounds = cv_auc(params)
        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['max_bin'] = max_bin
            best_params['min_data_in_leaf'] = min_data_in_leaf
keep_best('max_bin', 'min_data_in_leaf')

print("Tuning 3: reduce overfitting")
for feature_fraction in [0.6, 0.7, 0.8, 0.9, 1.0]:
    for bagging_fraction in [0.6, 0.7, 0.8, 0.9, 1.0]:
        for bagging_freq in range(0, 50, 5):
            params['feature_fraction'] = feature_fraction
            params['bagging_fraction'] = bagging_fraction
            params['bagging_freq'] = bagging_freq
            mean_auc, boost_rounds = cv_auc(params)
            if mean_auc >= max_auc:
                max_auc = mean_auc
                best_params['feature_fraction'] = feature_fraction
                best_params['bagging_fraction'] = bagging_fraction
                best_params['bagging_freq'] = bagging_freq
keep_best('feature_fraction', 'bagging_fraction', 'bagging_freq')

print("Tuning 4: reduce overfitting")
for lambda_l1 in [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    for lambda_l2 in [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.4, 0.6, 0.7, 0.9, 1.0]:
        params['lambda_l1'] = lambda_l1
        params['lambda_l2'] = lambda_l2
        mean_auc, boost_rounds = cv_auc(params)
        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['lambda_l1'] = lambda_l1
            best_params['lambda_l2'] = lambda_l2
keep_best('lambda_l1', 'lambda_l2')

print("Tuning 5: reduce overfitting 2")
for min_split_gain in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    params['min_split_gain'] = min_split_gain
    mean_auc, boost_rounds = cv_auc(params)
    if mean_auc >= max_auc:
        max_auc = mean_auc
        best_params['min_split_gain'] = min_split_gain
keep_best('min_split_gain')

print(best_params)
The results are:
{'bagging_fraction': 0.7,
'bagging_freq': 30,
'feature_fraction': 0.8,
'lambda_l1': 0.1,
'lambda_l2': 0.0,
'max_bin': 255,
'max_depth': 4,
'min_data_in_leaf': 81,
'min_split_gain': 0.1,
'num_leaves': 10}
We plug the parameters found by the search into the model:
model=lgb.LGBMClassifier(boosting_type='gbdt',objective='binary',metrics='auc',learning_rate=0.01, n_estimators=1000, max_depth=4, num_leaves=10,max_bin=255,min_data_in_leaf=81,bagging_fraction=0.7,bagging_freq= 30, feature_fraction= 0.8,
lambda_l1=0.1,lambda_l2=0,min_split_gain=0.1)
model.fit(X_train,y_train)
y_pre=model.predict(X_test)
print("acc:",metrics.accuracy_score(y_test,y_pre))
print("auc:",metrics.roc_auc_score(y_test,y_pre))
The results are:
('acc:', 0.98245614035087714)
('auc:', 0.98189901556049541)
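Rather than retyping the tuned values by hand, the dictionary printed by the search can be merged with the fixed settings and unpacked into the estimator. The sketch below shows only the dictionary handling so that it runs without LightGBM installed; `final_params` is an illustrative name, and the values are copied from the result above.

```python
# Tuned values copied from the printed best_params above.
best_params = {
    'bagging_fraction': 0.7,
    'bagging_freq': 30,
    'feature_fraction': 0.8,
    'lambda_l1': 0.1,
    'lambda_l2': 0.0,
    'max_bin': 255,
    'max_depth': 4,
    'min_data_in_leaf': 81,
    'min_split_gain': 0.1,
    'num_leaves': 10,
}

# Fixed settings for the final fit: lower learning rate, more trees.
final_params = {
    **best_params,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'learning_rate': 0.01,
    'n_estimators': 1000,
}

# With LightGBM installed, the estimator could then be built as:
#     model = lgb.LGBMClassifier(**final_params)
print(final_params)
```

Unpacking the dictionary avoids transcription mistakes when a tuned value changes between runs.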