LightGBM Paper Translation
Abstract: Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm, and it has quite a few effective implementations such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, their efficiency and scalability are still unsatisfactory when the feature dimension is high and the data size is large. A major reason is that, for every feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time-consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients and use only the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite an accurate estimation of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., features that rarely take nonzero values simultaneously) to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve quite a good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much). We call our new GBDT implementation with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy.
1. Introduction
Gradient Boosting Decision Tree (GBDT) [1] is a widely used machine learning algorithm, due to its efficiency, accuracy, and interpretability. GBDT achieves state-of-the-art performance in many machine learning tasks, such as multi-class classification [2], click prediction [3], and learning to rank [4]. In recent years, with the emergence of big data (in terms of both the number of features and the number of instances), GBDT is facing new challenges, especially in the tradeoff between accuracy and efficiency. Conventional implementations of GBDT need to scan all the data instances for every feature to estimate the information gain of all possible split points. Therefore, their computational complexity is proportional to both the number of features and the number of instances, which makes these implementations very time-consuming when handling big data.
To tackle this challenge, a straightforward idea is to reduce the number of data instances and the number of features. However, this turns out to be highly non-trivial. For example, it is unclear how to perform data sampling for GBDT. While there are some works that sample data according to their weights to speed up the training of boosting [5, 6, 7], they cannot be directly applied to GBDT since there are no sample weights in GBDT at all. In this paper, we propose two novel techniques towards this goal, as elaborated below.
Gradient-based One-Side Sampling (GOSS). While there is no native weight for data instances in GBDT, we notice that data instances with different gradients play different roles in the computation of information gain. In particular, according to the definition of information gain, those instances with larger gradients (i.e., under-trained instances) will contribute more to the information gain. Therefore, when down-sampling the data instances, in order to retain the accuracy of information gain estimation, we should better keep those instances with large gradients (e.g., larger than a pre-defined threshold, or among the top percentiles), and only randomly drop those instances with small gradients. We prove that such a treatment can lead to a more accurate gain estimation than uniformly random sampling, with the same target sampling rate, especially when the value of information gain has a large range.
Exclusive Feature Bundling (EFB). Usually in real applications, although there are a large number of features, the feature space is quite sparse, which provides us a possibility of designing a nearly lossless approach to reduce the number of effective features. Specifically, in a sparse feature space, many features are (almost) exclusive, i.e., they rarely take nonzero values simultaneously. Examples include the one-hot features (e.g., one-hot word representation in text mining). We can safely bundle such exclusive features. To this end, we design an efficient algorithm by reducing the optimal bundling problem to a graph coloring problem (by taking features as vertices and adding edges for every two features if they are not mutually exclusive), and solving it by a greedy algorithm with a constant approximation ratio.
We call the new GBDT algorithm with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that LightGBM can accelerate the training process by up to over 20 times while achieving almost the same accuracy.
The remainder of this paper is organized as follows. First, we review GBDT algorithms and related work in Sec. 2. Then, we introduce the details of GOSS in Sec. 3 and EFB in Sec. 4. Our experiments for LightGBM on public datasets are presented in Sec. 5. Finally, we conclude the paper in Sec. 6.
2. Preliminaries
2.1 GBDT and Its Complexity Analysis
GBDT is an ensemble model of decision trees, which are trained in sequence [1]. In each iteration, GBDT learns the decision trees by fitting the negative gradients (also known as the residual errors).
The main cost in GBDT lies in learning the decision trees, and the most time-consuming part in learning a decision tree is to find the best split points. One of the most popular algorithms to find split points is the pre-sorted algorithm [8, 9], which enumerates all possible split points on the pre-sorted feature values. This algorithm is simple and can find the optimal split points; however, it is inefficient in both training speed and memory consumption. Another popular algorithm is the histogram-based algorithm [10, 11, 12], as shown in Alg. 1. Instead of finding the split points on the sorted feature values, the histogram-based algorithm buckets continuous feature values into discrete bins and uses these bins to construct feature histograms during training. Since the histogram-based algorithm is more efficient in both memory consumption and training speed, we will develop our work on its basis.
As shown in Alg. 1, the histogram-based algorithm finds the best split points based on the feature histograms. It costs O(#data × #feature) for histogram building and O(#bin × #feature) for split point finding. Since #bin is usually much smaller than #data, histogram building will dominate the computational complexity. If we can reduce #data or #feature, we will be able to substantially speed up the training of GBDT.
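To make the cost structure above concrete, the following is a minimal Python sketch of histogram-based split finding for a single feature at one node. It is only an illustration of the idea behind Alg. 1, not its verbatim pseudocode: the function name, the pre-binned input, and the squared-gradient-sum gain used here are our own simplifying assumptions.

    import numpy as np

    def histogram_split_gain(binned_feature, gradients, n_bins):
        """Sketch: find the best bin threshold for one feature at one node.
        binned_feature: int array (#data,), feature values already bucketed into bins.
        gradients: float array (#data,), negative gradients of the loss."""
        # Histogram building: one pass over the data, O(#data) per feature.
        grad_hist = np.zeros(n_bins)
        count_hist = np.zeros(n_bins)
        for b, g in zip(binned_feature, gradients):
            grad_hist[b] += g
            count_hist[b] += 1
        total_grad, total_count = grad_hist.sum(), count_hist.sum()

        # Split point finding: one pass over the bins, O(#bin) per feature.
        best_gain, best_bin = -np.inf, None
        left_grad = left_count = 0.0
        for b in range(n_bins - 1):
            left_grad += grad_hist[b]
            left_count += count_hist[b]
            right_grad, right_count = total_grad - left_grad, total_count - left_count
            if left_count == 0 or right_count == 0:
                continue
            gain = left_grad ** 2 / left_count + right_grad ** 2 / right_count
            if gain > best_gain:
                best_gain, best_bin = gain, b
        return best_bin, best_gain

Since #bin is small, the second loop is cheap; the first loop over #data is what GOSS and EFB aim to shrink.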
2.2 Related Work
There have been quite a few implementations of GBDT in the literature, including XGBoost [13], pGBRT [14], scikit-learn [15], and gbm in R [16]. Scikit-learn and gbm in R implement the pre-sorted algorithm, and pGBRT implements the histogram-based algorithm. XGBoost supports both the pre-sorted algorithm and the histogram-based algorithm. As shown in [13], XGBoost outperforms the other tools. So, we use XGBoost as our baseline in the experiment section.
To reduce the size of the training data, a common approach is to down-sample the data instances. For example, in [5], data instances are filtered if their weights are smaller than a fixed threshold. SGB [20] uses a random subset to train the weak learners in every iteration. In [6], the sampling ratio is dynamically adjusted in the training progress. However, all these works except SGB [20] are based on AdaBoost [21], and cannot be directly applied to GBDT since there are no native weights for data instances in GBDT. Though SGB can be applied to GBDT, it usually hurts accuracy, and thus it is not a desirable choice.
Similarly, to reduce the number of features, it is natural to filter weak features [22, 23, 7, 24]. This is usually done by principal component analysis or projection pursuit. However, these approaches highly rely on the assumption that features contain significant redundancy, which might not always be true in practice (features are usually designed with their unique contributions, and removing any of them may affect the training accuracy to some degree).
The large-scale datasets used in real applications are usually quite sparse. GBDT with the pre-sorted algorithm can reduce the training cost by ignoring the features with zero values [13]. However, GBDT with the histogram-based algorithm does not have efficient sparse optimization solutions. The reason is that the histogram-based algorithm needs to retrieve feature bin values (refer to Alg. 1) for each data instance no matter whether the feature value is zero or not. It is highly preferred that GBDT with the histogram-based algorithm can effectively leverage such a sparse property.
To address the limitations of previous works, we propose two novel techniques called Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). More details will be introduced in the next sections.
3. Gradient-based One-Side Sampling (GOSS)
In this section, we propose a novel sampling method for GBDT that can achieve a good balance between reducing the number of data instances and keeping the accuracy of the learned decision trees.
3.1 Algorithm Description
In AdaBoost, the sample weight serves as a good indicator of the importance of data instances. However, in GBDT, there are no native sample weights, and thus the sampling methods proposed for AdaBoost cannot be directly applied. Fortunately, we notice that the gradient for each data instance in GBDT provides us with useful information for data sampling. That is, if an instance is associated with a small gradient, the training error for this instance is small and it is already well-trained. A straightforward idea is to discard those data instances with small gradients. However, the data distribution will be changed by doing so, which will hurt the accuracy of the learned model. To avoid this problem, we propose a new method called Gradient-based One-Side Sampling (GOSS).
GOSS keeps all the instances with large gradients and performs random sampling on the instances with small gradients. In order to compensate for the influence on the data distribution, when computing the information gain, GOSS introduces a constant multiplier for the data instances with small gradients (see Alg. 2). Specifically, GOSS firstly sorts the data instances according to the absolute value of their gradients and selects the top a × 100% instances. Then it randomly samples b × 100% instances from the rest of the data. After that, GOSS amplifies the sampled data with small gradients by a constant (1 - a)/b when calculating the information gain. By doing so, we put more focus on the under-trained instances without changing the original data distribution by much.
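A minimal Python sketch of this sampling step is given below. It follows the description above but is not the paper's Alg. 2 verbatim; the function name and the returned per-instance weights are illustrative assumptions.

    import numpy as np

    def goss_sample(gradients, a=0.2, b=0.1, rng=None):
        """Sketch of GOSS: keep the top a*100% instances by |gradient|, randomly
        sample b*100% of the rest, and amplify the sampled small-gradient
        instances by (1 - a) / b when computing the information gain."""
        rng = rng or np.random.default_rng()
        n = len(gradients)
        top_n, sample_n = int(a * n), int(b * n)

        # Sort by absolute gradient, descending.
        sorted_idx = np.argsort(-np.abs(gradients))
        top_idx = sorted_idx[:top_n]             # large-gradient instances: all kept
        rest_idx = sorted_idx[top_n:]
        sampled_idx = rng.choice(rest_idx, size=sample_n, replace=False)

        used_idx = np.concatenate([top_idx, sampled_idx])
        weights = np.ones(len(used_idx))
        weights[top_n:] = (1.0 - a) / b          # compensate the changed distribution
        return used_idx, weights

The returned weights would multiply the gradient statistics of the sampled small-gradient instances in the gain computation.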
3.2 Theoretical Analysis
GBDT uses decision trees to learn a function from the input space X^s to the gradient space G [1]. Suppose that we have a training set with n i.i.d. instances {x1, ..., xn}, where each xi is a vector with dimension s in space X^s. In each iteration of gradient boosting, the negative gradients of the loss function with respect to the output of the model are denoted as {g1, ..., gn}. The decision tree model splits each node at the most informative feature (with the largest information gain). For GBDT, the information gain is usually measured by the variance after splitting, which is defined as below.
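The formula itself did not survive extraction. As a hedged reconstruction consistent with the description above (variance gain of splitting feature j at point d over the instance set O of a node), it has the form:

    V_{j|O}(d) = \frac{1}{n_O} \left( \frac{\big(\sum_{x_i \in O:\, x_{ij} \le d} g_i\big)^2}{n^j_{l|O}(d)} + \frac{\big(\sum_{x_i \in O:\, x_{ij} > d} g_i\big)^2}{n^j_{r|O}(d)} \right)

where n_O is the number of instances on the node, and n^j_{l|O}(d) and n^j_{r|O}(d) are the numbers of instances falling to the left and right of the split point d on feature j. The feature with the largest such gain, together with its best split point, is chosen to split the node.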
4. Exclusive Feature Bundling (EFB)
In this section, we propose a novel method to effectively reduce the number of features.
High-dimensional data are usually very sparse. The sparsity of the feature space provides us a possibility of designing a nearly lossless approach to reduce the number of features. Specifically, in a sparse feature space, many features are mutually exclusive, i.e., they never take nonzero values simultaneously. We can safely bundle exclusive features into a single feature (which we call an exclusive feature bundle). By a carefully designed feature scanning algorithm, we can build the same feature histograms from the feature bundles as those from individual features. In this way, the complexity of histogram building changes from O(#data × #feature) to O(#data × #bundle), while #bundle << #feature. Then we can significantly speed up the training of GBDT without hurting the accuracy. In the following, we will show how to achieve this in detail. There are two issues to be addressed. The first one is to determine which features should be bundled together. The second is how to construct the bundle.
Theorem 4.1 The problem of partitioning features into the smallest number of exclusive bundles is NP-hard.
Proof: We will reduce the graph coloring problem [25] to our problem. Since the graph coloring problem is NP-hard, we can then deduce our conclusion.
Given any instance G = (V, E) of the graph coloring problem, we construct an instance of our problem as follows. We take each row of the incidence matrix of G as a feature, and get an instance of our problem with |V| features. It is easy to see that an exclusive bundle of features in our problem corresponds to a set of vertices with the same color, and vice versa.
For the first issue, we prove in Theorem 4.1 that it is NP-hard to find the optimal bundling strategy, which indicates that it is impossible to find an exact solution within polynomial time. In order to find a good approximation algorithm, we first reduce the optimal bundling problem to the graph coloring problem by taking features as vertices and adding edges for every two features if they are not mutually exclusive; then we use a greedy algorithm which can produce reasonably good results (with a constant approximation ratio) for graph coloring to produce the bundles. Furthermore, we notice that there are usually quite a few features that, although not 100% mutually exclusive, also rarely take nonzero values simultaneously. If our algorithm can allow a small fraction of conflicts, we can get an even smaller number of feature bundles and further improve the computational efficiency. By simple calculation, randomly polluting a small fraction of feature values will affect the training accuracy by at most O([(1 - γ)n]^(-2/3)) (see Proposition 2.1 in the supplementary materials), where γ is the maximal conflict rate in each bundle. So, if we choose a relatively small γ, we will be able to achieve a good balance between accuracy and efficiency.
Based on the above discussions, we design an algorithm for exclusive feature bundling as shown in Alg. 3. First, we construct a graph with weighted edges, whose weights correspond to the total conflicts between features. Second, we sort the features by their degrees in the graph in descending order. Finally, we check each feature in the ordered list, and either assign it to an existing bundle with a small conflict (controlled by γ), or create a new bundle. The time complexity of Alg. 3 is O(#feature^2), and it is processed only once before training. This complexity is acceptable when the number of features is not very large, but may still suffer if there are millions of features. To further improve the efficiency, we propose a more efficient ordering strategy without building the graph: ordering by the count of nonzero values, which is similar to ordering by degrees since more nonzero values usually lead to a higher probability of conflicts. Since we only alter the ordering strategies in Alg. 3, the details of the new algorithm are omitted to avoid duplication.
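The following is a minimal Python sketch of this greedy bundling step, using the simplified nonzero-count ordering mentioned above rather than building the conflict graph. It illustrates the idea but is not the paper's Alg. 3 verbatim; the function name, the dense array representation, and the conflict-count threshold parameter are our own assumptions.

    import numpy as np

    def greedy_bundle(features, max_conflict_count):
        """Sketch: greedily group features so that features within a bundle
        rarely take nonzero values on the same rows.
        features: list of 1-D arrays (one per feature, same length)."""
        # Order features by nonzero count (a proxy for degree in the conflict graph).
        order = sorted(range(len(features)),
                       key=lambda i: np.count_nonzero(features[i]),
                       reverse=True)

        bundles, bundle_nonzero = [], []   # feature indices / union of their nonzero rows
        for i in order:
            nz = set(np.flatnonzero(features[i]))
            for bundle, used_rows in zip(bundles, bundle_nonzero):
                if len(nz & used_rows) <= max_conflict_count:  # small conflict: join it
                    bundle.append(i)
                    used_rows |= nz
                    break
            else:                                              # otherwise open a new bundle
                bundles.append([i])
                bundle_nonzero.append(nz)
        return bundles

Setting max_conflict_count to roughly γ times the number of rows corresponds to the maximal conflict rate γ discussed above.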
For the second issue, we need a good way of merging the features in the same bundle in order to reduce the corresponding training complexity. The key is to ensure that the values of the original features can be identified from the feature bundles. Since the histogram-based algorithm stores discrete bins instead of continuous values of the features, we can construct a feature bundle by letting exclusive features reside in different bins. This can be done by adding offsets to the original values of the features. For example, suppose we have two features in a feature bundle. Originally, feature A takes values from [0, 10) and feature B takes values from [0, 20). We then add an offset of 10 to the values of feature B so that the refined feature takes values from [10, 30). After that, it is safe to merge features A and B, and use a feature bundle with range [0, 30] to replace the original features A and B. The detailed algorithm is shown in Alg. 4.
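A minimal Python sketch of this offset-based merge is shown below (not the paper's Alg. 4 verbatim; the names, the pre-binned integer inputs, and the use of 0 as the "feature absent" value are our own assumptions), followed by the [0, 10) / [0, 20) example from the text.

    import numpy as np

    def merge_bundle(bundle_features, bin_counts):
        """Sketch: merge exclusive features into one feature by offsetting bins.
        bundle_features: list of equal-length int arrays of bin values (0 = absent).
        bin_counts: number of bins of each feature in the bundle."""
        merged = np.zeros(len(bundle_features[0]), dtype=np.int64)
        offset = 0
        for values, n_bins in zip(bundle_features, bin_counts):
            nonzero = values != 0
            # Shift this feature's bins past those already used in the bundle,
            # so the original feature values remain identifiable.
            merged[nonzero] = values[nonzero] + offset
            offset += n_bins
        return merged

    # Feature A takes values in [0, 10), feature B in [0, 20); B is offset by 10.
    A = np.array([3, 0, 7, 0])
    B = np.array([0, 5, 0, 0])
    print(merge_bundle([A, B], [10, 20]))   # -> [ 3 15  7  0]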
The EFB algorithm can bundle many exclusive features into many fewer dense features, which can effectively avoid unnecessary computation for zero feature values. Actually, we can also optimize the basic histogram-based algorithm towards ignoring the zero feature values by using a table for each feature to record the data with nonzero values. By scanning the data in this table, the cost of histogram building for a feature will change from O(#data) to O(#non_zero_data). However, this method needs additional memory and computation cost to maintain these per-feature tables in the whole tree growth process. We implement this optimization in LightGBM as a basic function. Note that this optimization does not conflict with EFB, since we can still use it when the bundles are sparse.
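A brief sketch of this per-feature nonzero table (the names are our own, not LightGBM's internal API): histogram building then visits only the recorded rows, giving the O(#non_zero_data) cost mentioned above.

    import numpy as np

    def build_nonzero_table(binned_feature):
        # Recorded once per feature: the rows whose bin value is nonzero.
        return np.flatnonzero(binned_feature)

    def build_histogram_sparse(binned_feature, gradients, nonzero_rows, n_bins):
        # Scan only the nonzero rows; the zero bin can be recovered from node totals.
        grad_hist, count_hist = np.zeros(n_bins), np.zeros(n_bins)
        for r in nonzero_rows:
            grad_hist[binned_feature[r]] += gradients[r]
            count_hist[binned_feature[r]] += 1
        return grad_hist, count_hist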
5. Experiments
6. Conclusion
7. References
[1] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
[2] Ping Li. Robust logitboost and adaptive base class (abc) logitboost. arXiv preprint arXiv:1203.3491, 2012.
[3] Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, pages 521–530. ACM, 2007.
[4] Christopher JC Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11(23-581):81, 2010.
[5] Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407, 2000.
[6] Charles Dubout and François Fleuret. Boosting with maximum adaptive sampling. In Advances in Neural Information Processing Systems, pages 1332–1340, 2011.
[7] Ron Appel, Thomas J Fuchs, Piotr Dollár, and Pietro Perona. Quickly boosting decision trees: pruning underachieving features early. In ICML (3), pages 594–602, 2013.
[8] Manish Mehta, Rakesh Agrawal, and Jorma Rissanen. SLIQ: A fast scalable classifier for data mining. In International Conference on Extending Database Technology, pages 18–32. Springer, 1996.
[9] John Shafer, Rakesh Agrawal, and Manish Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544–555. Citeseer, 1996.
[10] Sanjay Ranka and V Singh. CLOUDS: A decision tree classifier for large datasets. In Proceedings of the 4th Knowledge Discovery and Data Mining Conference, pages 2–8, 1998.
[11] Ruoming Jin and Gagan Agrawal. Communication and memory efficient parallel decision tree construction. In Proceedings of the 2003 SIAM International Conference on Data Mining, pages 119–129. SIAM, 2003.
[12] Ping Li, Christopher JC Burges, Qiang Wu, JC Platt, D Koller, Y Singer, and S Roweis. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, volume 7, pages 845–852, 2007.
[13] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
[14] Stephen Tyree, Kilian Q Weinberger, Kunal Agrawal, and Jennifer Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th International Conference on World Wide Web, pages 387–396. ACM, 2011.
[15] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
[16] Greg Ridgeway. Generalized boosted models: A guide to the gbm package. Update, 1(1):2007, 2007.
[17] Huan Zhang, Si Si, and Cho-Jui Hsieh. Gpu-acceleration for large-scale tree boosting. arXiv preprint arXiv:1706.08359, 2017.
[18] Rory Mitchell and Eibe Frank. Accelerating the xgboost algorithm using gpu computing. PeerJ Preprints, 5:e2911v1, 2017.
[19] Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, and Tie-Yan Liu. A communication-efficient parallel algorithm for decision tree. In Advances in Neural Information Processing Systems, pages 1271–1279, 2016.
[20] Jerome H Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.
[21] Michael Collins, Robert E Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3):253–285, 2002.
[22] Ian Jolliffe. Principal component analysis. Wiley Online Library, 2002.
[23] Luis O Jimenez and David A Landgrebe. Hyperspectral data analysis and supervised feature reduction via projection pursuit. IEEE Transactions on Geoscience and Remote Sensing, 37(6):2653–2667, 1999.
[24] Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. CRC press, 2012.
[25] Tommy R Jensen and Bjarne Toft. Graph coloring problems, volume 39. John Wiley & Sons, 2011.
[26] Tao Qin and Tie-Yan Liu. Introducing LETOR 4.0 datasets. CoRR, abs/1306.2597, 2013.
[27] Allstate claim data, https://www.kaggle.com/c/ClaimPredictionChallenge.
[28] Flight delay data, https://github.com/szilard/benchm-ml#data.
[29] Hsiang-Fu Yu, Hung-Yi Lo, Hsun-Ping Hsieh, Jing-Kai Lou, Todd G McKenzie, Jung-Wei Chou, Po-Han Chung, Chia-Hua Ho, Chun-Fu Chang, Yin-Hsuan Wei, et al. Feature engineering and classifier ensemble for KDD Cup 2010. In KDD Cup, 2010.
[30] Kuan-Wei Wu, Chun-Sung Ferng, Chia-Hua Ho, An-Chun Liang, Chun-Heng Huang, Wei-Yuan Shen, Jyun-Yu Jiang, Ming-Hao Yang, Ting-Wei Lin, Ching-Pei Lee, et al. A two-stage ensemble of diverse models for advertisement ranking in KDD Cup 2012. In KDD Cup, 2012.
[31] Libsvm binary classification data, https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
[32] Haijian Shi. Best-first decision tree learning. PhD thesis, The University of Waikato, 2007.