Similarities and Differences Between Random Forest and GBDT
When I first looked at RF and GBDT, I assumed they were two very similar algorithms, since both are ensemble methods. But after studying them closely, I found they are fundamentally different.
Below is a summary of the basic differences.
Random Forest:
bagging (short for Bootstrap aggregating)
Recall that the key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. One can show that on average, each bagged tree makes use of around two-thirds of the observations.
The key to bagging is to repeatedly fit trees to subsets of the observations drawn by bootstrap sampling, then average the results. Each bagged tree uses roughly two-thirds of the sample set.
This is what gives us out-of-bag (OOB) estimation.
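As a quick check on the two-thirds claim and on OOB estimation, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (the sizes and seeds below are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 10_000

# A bootstrap sample draws n observations with replacement; the expected
# fraction of distinct observations covered is 1 - (1 - 1/n)^n ~ 1 - 1/e ~ 0.632.
idx = rng.integers(0, n, size=n)
print(len(np.unique(idx)) / n)  # roughly 0.63, i.e. about two-thirds

# The ~1/3 of samples each tree never sees form a free validation set:
# the out-of-bag (OOB) estimate.
X, y = make_classification(n_samples=n, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)  # accuracy estimated on out-of-bag samples only
```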
Training: bootstrap the samples.
But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.
When building each decision tree, at every split, m predictors are randomly chosen as candidates from the full set of p features; typically m = sqrt(p).
For example, for the Heart data we choose m = 4 out of the p = 13 predictors.
Using a small value of m in building a random forest will typically be helpful when we have a large number of correlated predictors.
When the feature set contains many correlated predictors, choosing a smaller m helps.
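To make the role of m concrete, here is a sketch using scikit-learn's max_features parameter, which plays the role of m (the dataset and the particular values tried are just illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# This dataset has p = 30 features; max_features sets m, the number of
# candidate predictors considered at each split.
X, y = load_breast_cancer(return_X_y=True)

# m = sqrt(p), a small fraction of p, and all p (which is pure bagging)
for m in ["sqrt", 0.1, None]:
    rf = RandomForestClassifier(n_estimators=300, max_features=m, random_state=0)
    print(m, round(cross_val_score(rf, X, y, cv=5).mean(), 4))
```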
random forests will not overfit if we increase B, so in practice we use a value of B sufficiently large for the error rate to have settled down.
Random forests do not overfit as the number of trees B grows; once B is large enough, the error rate simply settles down.
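To watch the error rate settle as B grows, the forest can be grown incrementally; a sketch using scikit-learn's warm_start option (the schedule of B values is an arbitrary choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=25, random_state=0)

# warm_start=True keeps the trees already grown and adds new ones, so we
# can trace the OOB error as B increases: it settles rather than rises.
rf = RandomForestClassifier(oob_score=True, warm_start=True, random_state=0)
for b in [25, 50, 100, 200, 400]:
    rf.set_params(n_estimators=b)
    rf.fit(X, y)
    print(b, round(1 - rf.oob_score_, 4))  # OOB error rate
```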
------------------------------------------------------------------------------------------------------------
GBDT
Boosting (a set of weak learners combine to create a single strong learner)
Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original dataset.
Boosting does not do bootstrap sampling (RF's signature technique). Instead, each tree is fit on a modified version of the original dataset, where the "modified version" is the residuals left over from the previous round of training.
In general, statistical learning approaches that learn slowly tend to perform well.
In general, learners that learn slowly tend to perform better (this seems to hint at something...).
except that the trees are grown sequentially: each tree is grown using information from previously grown trees.
Each tree in GBDT is grown sequentially (completely unlike RF, whose trees can be grown in parallel); each tree is grown using information left by the previously grown trees.
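To make "fit the residuals, sequentially, learning slowly" concrete, here is a minimal from-scratch sketch of least-squares boosting with stumps (a toy regression problem; the learning rate and tree count are illustrative assumptions, not values from the book):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=500)

learning_rate, n_trees = 0.1, 100   # small learning rate = "learn slowly"
prediction = np.zeros_like(y)
trees = []
for _ in range(n_trees):
    residual = y - prediction                  # the "modified version" of the data
    tree = DecisionTreeRegressor(max_depth=1)  # a stump, grown after the previous trees
    tree.fit(X, residual)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print(np.mean((y - prediction) ** 2))  # training MSE shrinks as trees accumulate
```

Note that each tree can only be fit after the previous trees' predictions exist, which is exactly why GBDT cannot be trained in parallel the way RF can.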
The number of trees B. Unlike bagging and random forests, boosting can overfit if B is too large,
In GBDT, adding too many trees will overfit (unlike RF).
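One way to see this overfitting, assuming scikit-learn (staged_predict walks through the ensemble one tree at a time; the aggressive learning rate here is chosen only to make the effect show up quickly):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.5,
                                  max_depth=3, random_state=0)
gbdt.fit(X_tr, y_tr)

# Test accuracy typically peaks and then degrades as B keeps growing,
# i.e. boosting can overfit in B, unlike a random forest.
for b, y_pred in enumerate(gbdt.staged_predict(X_te), start=1):
    if b % 200 == 0:
        print(b, round(accuracy_score(y_te, y_pred), 4))
```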
The number d of splits in each tree, which controls the complexity of the boosted ensemble. Often d = 1 works well,
During tree growing, a depth of 1 per tree (a decision stump) often works well.
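A quick depth comparison, again as a scikit-learn sketch (scikit-learn's max_depth counts tree depth rather than the number of splits as in the book, but max_depth=1 is exactly a stump):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# d controls the interaction order of the boosted model; stumps (d = 1)
# are often competitive with deeper trees.
for d in [1, 2, 4]:
    gbdt = GradientBoostingClassifier(max_depth=d, n_estimators=200, random_state=0)
    print(d, round(cross_val_score(gbdt, X, y, cv=5).mean(), 4))
```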
Having seen their differences, don't you agree that they are completely different?
One more figure: on the same dataset, the boosting curves mainly compare tree depth, while RF's main parameter is m... this makes their differences even clearer.
Main text and figures are from An Introduction to Statistical Learning with Applications in R.