
Random Forest Algorithm Demo (Python / Spark)


Key Parameters

The two most important parameters, which often need to be tuned to improve the algorithm's performance, are numTrees and maxDepth.

  • numTrees (the number of decision trees): Increasing the number of trees decreases the variance of the predictions, which yields higher accuracy at test time. Training time grows roughly linearly with numTrees.
  • maxDepth: the maximum possible depth of each decision tree in the forest; this parameter was also discussed in the decision tree guide. A deeper tree makes the model more expressive, but it also takes longer to train and is more prone to overfitting. Note, however, that random forests and single decision trees place different demands on this parameter. Because a random forest reduces prediction variance by voting over or averaging the predictions of many decision trees, it is less prone to overfitting than a single decision tree, so a random forest can use a larger maxDepth than a single decision tree model.

    Some literature even suggests growing each decision tree in the forest to its maximum extent without any pruning. Either way, it is still advisable to experiment with the maxDepth parameter to see whether prediction quality can be improved; a minimal tuning sketch follows this list.
    There are two further parameters, subsamplingRate and featureSubsetStrategy, that generally do not need tuning. They can, however, be adjusted to speed up training, though note that doing so may hurt the model's predictive performance (if you do need to tune them, read the English excerpt below carefully).
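As a concrete starting point, here is a minimal grid-search sketch over the two key parameters. It assumes the trainingData and testData RDDs prepared as in the full example further below; the grid values are arbitrary illustrations, not recommendations:

# Try a few (numTrees, maxDepth) combinations and compare test error.
for numTrees in [10, 50, 100]:
    for maxDepth in [4, 8, 12]:
        model = RandomForest.trainClassifier(
            trainingData, numClasses=2, categoricalFeaturesInfo={},
            numTrees=numTrees, maxDepth=maxDepth)
        predictions = model.predict(testData.map(lambda x: x.features))
        labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
        testErr = (labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count()
                   / float(testData.count()))
        print('numTrees=%d maxDepth=%d testErr=%f' % (numTrees, maxDepth, testErr))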

We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide.
The first two parameters we mention are the most important, and tuning them can often improve performance:
(1)numTrees: Number of trees in the forest.
Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy.
Training time increases roughly linearly in the number of trees.
(2)maxDepth: Maximum depth of each tree in the forest.
Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest).
The next two parameters generally do not require tuning. However, they can be tuned to speed up training.
(3)subsamplingRate: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
(4)featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.
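In the RDD-based API, featureSubsetStrategy accepts "auto", "all", "sqrt", "log2" and "onethird" ("auto" picks "all" for a single tree and "sqrt" for a forest). A minimal sketch, reusing the trainingData RDD from the full example below:

# "sqrt" considers sqrt(numFeatures) candidate features at each split,
# the usual choice for classification forests.
model = RandomForest.trainClassifier(
    trainingData, numClasses=2, categoricalFeaturesInfo={},
    numTrees=100, featureSubsetStrategy="sqrt")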

"""
Random Forest Classification Example.
"""
from __future__ import print_function

from pyspark import SparkContext
# $example on$
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
# $example off$

if __name__ == "__main__":
    sc = SparkContext(appName="PythonRandomForestClassificationExample")
    # $example on$
    # Load and parse the data file into an RDD of LabeledPoint.
    data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
    # Split the data into training and test sets (30% held out for testing).
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Train a RandomForest model.
    # Empty categoricalFeaturesInfo indicates all features are continuous.
    # Note: Use larger numTrees in practice.
    # Setting featureSubsetStrategy="auto" lets the algorithm choose.
    model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         numTrees=3,
                                         featureSubsetStrategy="auto",
                                         impurity='gini', maxDepth=4, maxBins=32)

    # Evaluate the model on test instances and compute the test error.
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    testErr = (labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count()
               / float(testData.count()))
    print('Test Error = ' + str(testErr))
    print('Learned classification forest model:')
    print(model.toDebugString())

    # Save and load the model.
    model.save(sc, "target/tmp/myRandomForestClassificationModel")
    sameModel = RandomForestModel.load(sc,
                                       "target/tmp/myRandomForestClassificationModel")
    # $example off$
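The reloaded model can score a single feature vector as well as an RDD of vectors. A minimal sketch continuing from the variables above (the printed values depend on your random split):

    # Score one held-out example with the reloaded model; predict() accepts
    # a single feature vector or an RDD of vectors.
    first = testData.first()
    print('label = %s, prediction = %s'
          % (first.label, sameModel.predict(first.features)))

    sc.stop()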

What the learned model looks like:

TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 511 <= 0.0)
     If (feature 434 <= 0.0)
      Predict: 0.0
     Else (feature 434 > 0.0)
      Predict: 1.0
    Else (feature 511 > 0.0)
     Predict: 0.0
  Tree 1:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
  Tree 2:
    If (feature 302 <= 0.0)
     If (feature 461 <= 0.0)
      If (feature 208 <= 107.0)
       Predict: 1.0
      Else (feature 208 > 107.0)
       Predict: 0.0
     Else (feature 461 > 0.0)
      Predict: 1.0
    Else (feature 302 > 0.0)
     Predict: 0.0
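For intuition, a classification forest predicts by majority vote over its trees. Below is a minimal sketch (not Spark API) that hand-translates the three printed trees into plain Python, where f maps a feature index to its value:

# Hand-translated versions of the printed trees (illustration only).
def tree0(f):
    if f[511] <= 0.0:
        return 0.0 if f[434] <= 0.0 else 1.0
    return 0.0

def tree1(f):
    return 0.0 if f[490] <= 31.0 else 1.0

def tree2(f):
    if f[302] <= 0.0:
        if f[461] <= 0.0:
            return 1.0 if f[208] <= 107.0 else 0.0
        return 1.0
    return 0.0

def forest_predict(f):
    votes = [tree0(f), tree1(f), tree2(f)]
    return max(set(votes), key=votes.count)  # majority vote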
