XGBoost Feature Importance Calculation
XGBoost provides three ways to compute feature importance:
‘weight’ - the number of times a feature is used to split the data across all trees.
‘gain’ - the average gain of the feature when it is used in trees.
‘cover’ - the average coverage of the feature when it is used in trees.
In short:
weight is the total number of nodes, across all trees, at which the feature is used to split;
gain is the average gain obtained when the feature is used for splitting;
cover is a bit more obscure; [R-package/man/xgb.plot.tree.Rd](https://github.com/dmlc/xgboost/blob/f5659e17d5200bd7471a2e735177a81cb8d3012b/R-package/man/xgb.plot.tree.Rd) explains it in more detail: "the sum of second order gradient of training data classified to the leaf, if it is square loss, this simply corresponds to the number of instances in that branch. Deeper in the tree a node is, lower this metric will be." In other words, a node's coverage is the sum of the second-order gradients (Hessians) of the training samples routed to that node; for square loss the Hessian of every sample is constant, so coverage reduces to the number of instances in that branch. A feature's cover score is then its average coverage over all splits that use it.
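All three scores can be read directly off a trained booster. As a minimal sketch (not from the original post), the snippet below trains a small model on made-up synthetic data and prints all three scores via the Booster's `get_score` method; the data and hyperparameters are chosen purely for illustration.

```python
# Minimal sketch: reading all three importance types from a trained model.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # synthetic data, for illustration only
# Make the label depend mostly on feature 0 so the three scores differ.
y = (X[:, 0] + 0.2 * X[:, 1] > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=20, max_depth=3)
model.fit(X, y)

booster = model.get_booster()
for importance_type in ("weight", "gain", "cover"):
    # get_score returns {feature_name: score}; features never used in any
    # split are omitted. 'gain' and 'cover' are per-split averages.
    print(importance_type, booster.get_score(importance_type=importance_type))
```

Note that the scikit-learn wrapper's `feature_importances_` attribute reports a single importance vector (normalized to sum to 1 in recent XGBoost versions) based on the estimator's `importance_type` setting, whereas `get_score` returns the raw per-feature values for whichever type you request.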
Returning to the example from Li Hang's book, we use a different color for each feature and plot the figure below:
[Figure: XGBoost feature importance]
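The original figure itself is not reproduced here, but `xgboost.plot_importance` can draw comparable bar charts for each importance type. This is a hedged sketch that assumes the `model` object trained in the snippet above and a working matplotlib installation.

```python
# Sketch: side-by-side importance plots for weight, gain, and cover.
import matplotlib.pyplot as plt
import xgboost as xgb

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, importance_type in zip(axes, ("weight", "gain", "cover")):
    # plot_importance accepts either a Booster or a fitted sklearn wrapper.
    xgb.plot_importance(model, importance_type=importance_type,
                        ax=ax, title=importance_type, show_values=False)
plt.tight_layout()
plt.show()
```

Comparing the three panels side by side makes it easy to see how a feature can rank high under one metric (e.g. weight, from being split on often) yet low under another (e.g. gain, if those splits contribute little).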