XGBoost: Principles, Code, Parameter Tuning, and Deployment Notes
For an algorithm engineer, XGBoost is more or less the opening move, and there is no shortage of tutorials online. This post covers the principles, the code, parameter tuning, and deployment, with the goal of building an intuitive, end-to-end picture of the algorithm.
Is the generated binary tree a full binary tree or a complete binary tree? Every split in XGBoost produces exactly two children, so every node has either zero or two children; but because of the max_depth limit and pruning, different branches stop at different depths, so in general the tree is neither a full binary tree (every level completely filled) nor a complete binary tree.
Parameter tuning
param = {
# step size
'eta': 0.1,
# complexity param: penalty paid per leaf (min loss reduction required to split); larger -> under fitting
# 'gamma': 0.1,
'max_depth': depth,
# pruning param: minimum sum of instance weight (hessian) needed in a child; larger -> under fitting
# 'min_child_weight': 1,
# pruning param: caps each leaf's weight update; 0 means no constraint, a small positive value (1-10) makes updates more conservative
# 'max_delta_step': 0,
# row sample ratio for each tree
'subsample': 0.8,
# column sample ratio each tree
# 'colsample_bytree': 0.8,
# column sample ratio each layer
'colsample_bylevel': 0.3,
# L2 regularization term
# 'lambda': 1,
# L1 regularization term
'alpha': 0.1,
# small data set -> exact, large data set -> approx, just choose auto
# 'tree_method': 'auto',
# 'sketch_eps': 0.03,
# for unbalanced data sets: weight of the positive class relative to the negative one, typically sum(negative) / sum(positive)
# 'scale_pos_weight': 1 / weight,
# "reg:linear" –linear regression
# "reg:logistic" –logistic regression
# "binary:logistic" –logistic regression for binary classification, output probability
# "binary:logitraw" –logistic regression for binary classification, output score before logistic transformation
# "count:poisson" –poisson regression for count data, output mean of poisson distribution, max_delta_step is set to 0.7 by default in poisson regression (used to safeguard optimization)
# "multi:softmax" –set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)
# "multi:softprob" –same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probability of each data point belonging to each class.
# "rank:pairwise" –set XGBoost to do ranking task by minimizing the pairwise loss
# "reg:gamma" –gamma regression for severity data, output mean of gamma distribution
'objective': 'binary:logistic',
# initial prediction score (global bias / prior probability), not a decision threshold
'base_score': weight / (weight + 1),
# "rmse": root mean square error
# "mae": mean absolute error
# "logloss": negative log-likelihood
# "error": Binary classification error rate. It is calculated as # (wrong cases)/# (all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
# "merror": Multiclass classification error rate. It is calculated as # (wrong cases)/# (all cases).
# "mlogloss": Multiclass logloss
# "auc": Area under the curve for ranking evaluation.
# "ndcg":Normalized Discounted Cumulative Gain
# "map":Mean average precision
# "[email protected]","[email protected]": n can be assigned as an integer to cut off the top positions in the lists for evaluation.
# "ndcg-","map-","[email protected]","[email protected]": In XGBoost, NDCG and MAP will evaluate the score of a list without any positive samples as 1. By adding "-" in the evaluation metric XGBoost will evaluate these score as 0 to be consistent under some conditions. training repeatedly
# "gamma-deviance": [residual deviance for gamma regression]
'eval_metric': 'auc',
'seed': 31,
}
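A minimal sketch of how this dict drives training with early stopping. It assumes depth and weight have been set beforehand (they appear inside param above) and that X_train, y_train, X_valid, y_valid are your own arrays; num_boost_round and early_stopping_rounds are illustrative values:

import xgboost as xgb

# assumed to exist: depth, weight, and the arrays X_train, y_train, X_valid, y_valid
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

# watch the validation AUC (the eval_metric above) and stop once it plateaus
bst = xgb.train(
    param,
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train'), (dvalid, 'valid')],
    early_stopping_rounds=50,
)
pred = bst.predict(dvalid)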
Deployment notes:
In real production work you inevitably run into the problem of putting the model online, so the workflow is broken down here:
1. Finish training the model -> save the model file -> parse the model file -> rewrite the prediction code to read the parsed file and walk the trees;
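To make step 1 concrete, here is a minimal Python sketch of the save/dump/parse loop (the production parser in this project is the Java one mentioned below; parse_dump, predict_one, and the file names are hypothetical, and the regexes assume the default binary:logistic text dump written without statistics):

import math
import re

# after training:
#   bst.save_model('model.bin')       # binary model file, reloadable from Python
#   bst.dump_model('model_dump.txt')  # text dump, e.g. "0:[f29<0.5] yes=1,no=2,missing=1"

SPLIT = re.compile(r'(\d+):\[f(\d+)<([^\]]+)\] yes=(\d+),no=(\d+),missing=(\d+)')
LEAF = re.compile(r'(\d+):leaf=([-+0-9.eE]+)')

def parse_dump(path):
    """Parse the text dump into one {node_id: node} dict per tree."""
    trees = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith('booster['):    # "booster[k]:" opens tree k
                trees.append({})
            elif (m := SPLIT.match(line)):
                nid, feat, thr, yes, no, miss = m.groups()
                trees[-1][int(nid)] = ('split', int(feat), float(thr),
                                       int(yes), int(no), int(miss))
            elif (m := LEAF.match(line)):
                trees[-1][int(m.group(1))] = ('leaf', float(m.group(2)))
    return trees

def predict_one(trees, x, base_score=0.5):
    """x maps feature index -> value; absent features take the missing branch."""
    margin = math.log(base_score / (1.0 - base_score))  # base_score, in margin space
    for nodes in trees:
        nid = 0
        while nodes[nid][0] == 'split':
            _, feat, thr, yes, no, miss = nodes[nid]
            v = x.get(feat)
            nid = miss if v is None else (yes if v < thr else no)
        margin += nodes[nid][1]
    return 1.0 / (1.0 + math.exp(-margin))  # sigmoid for binary:logistic

The Java version walks the trees the same way: load the dump once, and prediction reduces to feature comparisons, so the serving path needs no XGBoost runtime at all.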
The role of the loss function in XGBoost:
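The short version, following the standard derivation in the XGBoost paper: the loss enters training only through its first and second derivatives, $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$, so the objective at round $t$ is approximated as

$$
\mathcal{L}^{(t)} \approx \sum_i \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2,
$$

which yields the optimal leaf weight and the split gain used while growing each tree:

$$
w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad
\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma,
$$

where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ sum over the instances landing in leaf $j$. Changing 'objective' (say reg:linear vs binary:logistic) therefore only changes $g_i$ and $h_i$; the tree-growing machinery is untouched, and the $\gamma$ and $\lambda$ here are exactly the 'gamma' and 'lambda' entries of the param dict above.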
Java parsing source: the concrete code and comments have been updated on GitHub.