
Xgboost: Principles, Code, Tuning, and Deployment Notes

For an algorithm engineer, xgboost is more or less the opening move, and tutorials of every flavor are already online; this post covers principles, code, parameter tuning, and deployment, with the aim of building an intuitive, end-to-end picture of the algorithm.


Is the generated binary tree a perfect binary tree or a complete binary tree? In general, neither is guaranteed: every xgboost split creates exactly two children, so each node has either zero or two children (a full binary tree in that sense), but different branches can stop growing at different depths (no positive split gain, min_child_weight, max_depth), so the tree is not necessarily perfect or complete. The dump sketch below makes this easy to check.
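A quick way to verify this is to dump a trained tree as text and look at its node lines; below is a minimal sketch with toy random data (the parameter values are illustrative only). Every printed node is either an internal node with both a yes and a no child, or a leaf; a node with a single child never appears.

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(31)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'max_depth': 3, 'objective': 'binary:logistic'},
                dtrain, num_boost_round=1)

# Each line is either "<id>:[f<k><<thr>] yes=a,no=b,missing=c" (a split node
# with exactly two children) or "<id>:leaf=<value>".
print(bst.get_dump()[0])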

 

Parameter tuning

param = {
    # step size (learning rate)
    'eta': 0.1,
    # complexity penalty: the weight on the number of leaves; larger -> more conservative (underfitting)
    # 'gamma': 0.1,
    'max_depth': depth,
    # pruning param: minimum sum of instance hessian needed in a child; larger -> underfitting
    # 'min_child_weight': 1,
    # pruning param: constraint on each weight update; larger -> underfitting
    # 'max_delta_step': 0,
    'subsample': 0.8,
    # column sample ratio per tree
    # 'colsample_bytree': 0.8,
    # column sample ratio per level
    'colsample_bylevel': 0.3,
    # L2 regularization term
    # 'lambda': 1,
    # L1 regularization term
    'alpha': 0.1,
    # small data set -> 'exact', large data set -> 'approx'; 'auto' chooses for you
    # 'tree_method': 'auto',
    # 'sketch_eps': 0.03,
    # for unbalanced data sets: positives are weighted by this value, negatives by 1
    # 'scale_pos_weight': 1 / weight,
    # objective options:
    #   "reg:linear"      - linear regression
    #   "reg:logistic"    - logistic regression
    #   "binary:logistic" - logistic regression for binary classification, outputs probability
    #   "binary:logitraw" - logistic regression for binary classification, outputs the score before the logistic transformation
    #   "count:poisson"   - Poisson regression for count data, outputs the mean of the Poisson distribution
    #                       (max_delta_step defaults to 0.7 here, to safeguard optimization)
    #   "multi:softmax"   - multiclass classification with the softmax objective; num_class must also be set
    #   "multi:softprob"  - same as softmax, but outputs a vector of ndata * nclass that can be reshaped
    #                       to an (ndata, nclass) matrix of per-class probabilities
    #   "rank:pairwise"   - ranking task, minimizing the pairwise loss
    #   "reg:gamma"       - gamma regression for severity data, outputs the mean of the gamma distribution
    'objective': 'binary:logistic',
    # initial prediction score (global bias), not a decision threshold; set here to the positive-class prior
    'base_score': weight / (weight + 1),
    # eval_metric options:
    #   "rmse"     - root mean square error
    #   "mae"      - mean absolute error
    #   "logloss"  - negative log-likelihood
    #   "error"    - binary classification error rate, #(wrong cases) / #(all cases); predictions above 0.5
    #                are counted as positive instances, the rest as negative
    #   "merror"   - multiclass classification error rate, #(wrong cases) / #(all cases)
    #   "mlogloss" - multiclass logloss
    #   "auc"      - area under the ROC curve, for ranking evaluation
    #   "ndcg"     - normalized discounted cumulative gain
    #   "map"      - mean average precision
    #   "ndcg@n", "map@n" - n can be set to an integer to cut off the top positions in the lists for evaluation
    #   "ndcg-", "map-", "ndcg@n-", "map@n-" - by default NDCG and MAP score a list without any positive
    #                samples as 1; appending "-" makes XGBoost score such lists as 0 instead
    #   "gamma-deviance" - residual deviance for gamma regression
    'eval_metric': 'auc',
    'seed': 31,
}
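For completeness, a minimal sketch of a training call that consumes the param dict above; it assumes depth and weight were defined before param was built, and the random arrays stand in for a real data set.

import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)
dtrain = xgb.DMatrix(X[:800], label=y[:800])
dvalid = xgb.DMatrix(X[800:], label=y[800:])

bst = xgb.train(
    param,
    dtrain,
    num_boost_round=500,
    evals=[(dtrain, 'train'), (dvalid, 'valid')],
    early_stopping_rounds=30,  # stop once the validation auc stops improving
)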

 

Deployment notes:

In real production work you inevitably face the problem of putting the model online; here the workflow is broken down into steps:

1. Finish training the model -> save the model file -> parse the model file -> rewrite the serving code to read the parsed file and walk the trees for prediction (a sketch of the save/dump step follows);
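The first two artifacts in code form, as a sketch (file names are placeholders and bst is a trained Booster):

# Binary model file, reloadable from Python via xgb.Booster(model_file=...):
bst.save_model('xgb.model')
# Human-readable text dump of every tree; this is the file an offline
# (e.g. Java) parser reads:
bst.dump_model('xgb_dump.txt')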

 

The role of Xgb's loss function: at each boosting round the objective is approximated by a second-order Taylor expansion, so all the trainer needs from the loss are the per-instance first derivative g_i and second derivative h_i with respect to the current raw margin; split gains and leaf weights are then computed from the sums of g and h.
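One way to see this concretely is xgboost's custom-objective hook, which asks for exactly these two arrays. The sketch below re-implements binary logloss; logistic_obj is an illustrative name, not a library symbol.

import numpy as np
import xgboost as xgb

def logistic_obj(preds, dtrain):
    """Binary logloss: the trainer only needs g_i and h_i, the first and
    second derivatives of the loss w.r.t. the raw margin."""
    labels = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))  # sigmoid of the raw margin
    grad = p - labels                 # dL/dmargin
    hess = p * (1.0 - p)              # d2L/dmargin2
    return grad, hess

# Usage: xgb.train(param, dtrain, num_boost_round=100, obj=logistic_obj)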

 

Java parsing source: the concrete code and comments have been updated on GitHub; the Python sketch below mirrors the traversal logic such a parser has to implement.
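The GitHub code itself is not reproduced here, but this sketch shows the same idea for binary:logistic: rebuild each dumped tree as a node table, walk the trees per sample, sum the leaf margins, add the bias implied by base_score, and apply the sigmoid. The function names are illustrative.

import math
import re

# One dumped tree looks like (indentation encodes depth):
#   0:[f0<0.5] yes=1,no=2,missing=1
#       1:leaf=0.3
#       2:leaf=-0.2
SPLIT_RE = re.compile(r'(\d+):\[f(\d+)<(.+?)\] yes=(\d+),no=(\d+),missing=(\d+)')
LEAF_RE = re.compile(r'(\d+):leaf=([-+\d.eE]+)')

def parse_tree(dump_text):
    """Parse one dumped tree into a {node_id: node} table."""
    nodes = {}
    for line in dump_text.strip().splitlines():
        m = SPLIT_RE.search(line)
        if m:
            nid, feat, thr, yes, no, miss = m.groups()
            nodes[int(nid)] = {'feat': int(feat), 'thr': float(thr),
                               'yes': int(yes), 'no': int(no),
                               'missing': int(miss)}
            continue
        m = LEAF_RE.search(line)
        if m:
            nodes[int(m.group(1))] = {'leaf': float(m.group(2))}
    return nodes

def score_tree(nodes, x):
    """Walk one tree for a dense feature vector x."""
    nid = 0
    while 'leaf' not in nodes[nid]:
        node = nodes[nid]
        v = x[node['feat']]
        if v is None or (isinstance(v, float) and math.isnan(v)):
            nid = node['missing']
        elif v < node['thr']:
            nid = node['yes']
        else:
            nid = node['no']
    return nodes[nid]['leaf']

def predict_proba(trees, x, base_score=0.5):
    """binary:logistic: summed leaf margins + logit(base_score), then sigmoid."""
    margin = sum(score_tree(t, x) for t in trees)
    margin += math.log(base_score / (1.0 - base_score))
    return 1.0 / (1.0 + math.exp(-margin))

# Example with a hand-written stump: x[0] = 0.7 falls into the "no" branch,
# so the probability is sigmoid(-0.2), roughly 0.45.
tree = parse_tree("0:[f0<0.5] yes=1,no=2,missing=1\n\t1:leaf=0.3\n\t2:leaf=-0.2")
print(predict_proba([tree], [0.7]))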