bootstrap && bagging && decision trees && random forests
I read an article that introduces these concepts and jotted down a few notes here. Original link:
https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/
1. Bootstrap Method
The bootstrap is a powerful statistical method for estimating a quantity from a data sample. This is easiest to understand if the quantity is a descriptive statistic such as a mean or a standard deviation.
In other words, the bootstrap is a statistical method for better estimating certain properties of a dataset, such as the mean or the variance; when some of the data are noisy or erroneous, it improves the accuracy of the estimate.
Concretely: create many sub-samples of the dataset (drawn with replacement), compute the statistic of interest (say, the mean) on each sub-sample separately, and finally average the results, as in the sketch below.
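A minimal sketch of the procedure in plain Python; the sample data and the number of resamples are made up for illustration:

```python
import random

def bootstrap_mean(data, n_resamples=100):
    """Estimate the mean by averaging the means of many bootstrap sub-samples."""
    estimates = []
    for _ in range(n_resamples):
        # Draw a sub-sample of the same size, with replacement
        sub = [random.choice(data) for _ in data]
        estimates.append(sum(sub) / len(sub))
    # Average the per-sub-sample estimates
    return sum(estimates) / len(estimates)

data = [2.3, 1.9, 3.1, 2.8, 2.2, 3.5, 1.7, 2.9]
print(bootstrap_mean(data))
```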
2. Bootstrap Aggregation (Bagging)
Bagging is an ensemble method. An ensemble method is a technique that combines the predictions from multiple machine learning models, and the combined result is better than any single prediction.
Bootstrap Aggregation is a general procedure that can be used to reduce the variance for those algorithms that have high variance. An algorithm that has high variance is the decision tree, like classification and regression trees (CART).
Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.
As you can see, bagging is really the bootstrap procedure applied to a high-variance algorithm in order to reduce that variance; this is where terms like "5-bagged decision trees" come from.
The procedure itself is simple and much like the bootstrap: draw sub-samples of the dataset (with replacement), train a decision tree on each sub-sample, and finally combine the trees' predictions. For example:
Let’s assume we have a sample dataset of 1000 instances (x) and we are using the CART algorithm. Bagging of the CART algorithm would work as follows.
1. Create many (e.g. 100) random sub-samples of our dataset with replacement.
2. Train a CART model on each sample.
3. Given a new dataset, calculate the average prediction from each model.
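A sketch of those three steps in Python, with scikit-learn's DecisionTreeRegressor standing in for CART; the synthetic data and the choice of 100 sub-samples just mirror the example above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))           # 1000 instances (x)
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 1000)   # noisy target

# 1. Create many (100) random sub-samples with replacement
# 2. Train a CART model on each sub-sample
models = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap indices
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# 3. Given new data, average the predictions from all models
X_new = np.array([[2.5], [7.0]])
print(np.mean([m.predict(X_new) for m in models], axis=0))
```

scikit-learn also packages exactly this recipe as BaggingRegressor / BaggingClassifier.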
3. Random Forest
The motivation for random forests is this: even when decision trees are trained on different sub-samples, each tree is greedy and searches for the optimal split, so the resulting trees end up highly correlated with one another, which hurts the final ensemble.
Random forests fix this by limiting, at each split point, the number of features a tree is allowed to choose from, which makes the trees more random (and less correlated). Good defaults for that number m (a short scikit-learn sketch follows below):
- For classification a good default is: m = sqrt(p)
- For regression a good default is: m = p/3
Here p is the number of input variables, i.e. the number of features.
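A minimal sketch of how these defaults look in scikit-learn, where the per-split feature limit is the max_features parameter; the toy dataset is made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data with p = 9 features, so m = sqrt(9) = 3 features per split
X, y = make_classification(n_samples=500, n_features=9, random_state=0)

# max_features="sqrt" limits each split to sqrt(p) candidate features
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(X, y)
print(clf.score(X, y))

# For regression, RandomForestRegressor(max_features=1/3) gives m = p/3
# (a float is interpreted as a fraction of the p features)
```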