
Random Forest Algorithm Demo (Python / Spark)


Key Parameters

The two most important parameters, which often need to be tuned to improve the algorithm's performance, are numTrees and maxDepth.

  • numTrees (the number of decision trees): Increasing the number of trees decreases the variance of the predictions, which yields higher accuracy at test time. Training time grows roughly linearly with numTrees.
  • maxDepth: the maximum possible depth of each decision tree in the forest; this parameter was also discussed in the decision tree guide. A deeper tree makes the model more expressive, but it also takes longer to train and is more prone to overfitting. Note, however, that random forests and single decision trees place different demands on this parameter. Because a random forest reduces prediction variance by voting over or averaging the predictions of many decision trees, it is less prone to overfitting than a single decision tree, so a random forest can use a larger maxDepth than a single decision tree model.

    Some literature even suggests growing each decision tree in the forest to its maximum extent without any pruning. Either way, it is still advisable to experiment with the maxDepth parameter to see whether prediction quality can be improved; a minimal tuning sketch follows this list.
    There are two further parameters, subsamplingRate and featureSubsetStrategy, that generally do not need tuning. They can, however, be adjusted to speed up training, though note that doing so may hurt the model's predictive performance (if you do need to tune them, read the English excerpt below carefully).
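As a concrete starting point, here is a minimal grid-search sketch over the two key parameters. It assumes the trainingData and testData RDDs prepared as in the full example further below; the grid values are arbitrary illustrations, not recommendations:

# Try a few (numTrees, maxDepth) combinations and compare test error.
for numTrees in [10, 50, 100]:
    for maxDepth in [4, 8, 12]:
        model = RandomForest.trainClassifier(
            trainingData, numClasses=2, categoricalFeaturesInfo={},
            numTrees=numTrees, maxDepth=maxDepth)
        predictions = model.predict(testData.map(lambda x: x.features))
        labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
        testErr = (labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count()
                   / float(testData.count()))
        print('numTrees=%d maxDepth=%d testErr=%f' % (numTrees, maxDepth, testErr))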

We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide.
The first two parameters we mention are the most important, and tuning them can often improve performance:
(1)numTrees: Number of trees in the forest.
Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy.
Training time increases roughly linearly in the number of trees.
(2)maxDepth: Maximum depth of each tree in the forest.
Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest).
The next two parameters generally do not require tuning. However, they can be tuned to speed up training.
(3)subsamplingRate: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
(4)featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.
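In the RDD-based API, featureSubsetStrategy accepts "auto", "all", "sqrt", "log2" and "onethird" ("auto" picks "all" for a single tree and "sqrt" for a forest). A minimal sketch, reusing the trainingData RDD from the full example below:

# "sqrt" considers sqrt(numFeatures) candidate features at each split,
# the usual choice for classification forests.
model = RandomForest.trainClassifier(
    trainingData, numClasses=2, categoricalFeaturesInfo={},
    numTrees=100, featureSubsetStrategy="sqrt")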

"""
Random Forest Classification Example.
"""
from __future__ import print_function

from pyspark import SparkContext
# $example on$
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
# $example off$

if __name__ == "__main__":
    sc = SparkContext(appName="PythonRandomForestClassificationExample")
    # $example on$
    # Load and parse the data file into an RDD of LabeledPoint.
    data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
    # Split the data into training and test sets (30% held out for testing).
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Train a RandomForest model.
    # Empty categoricalFeaturesInfo indicates all features are continuous.
    # Note: Use larger numTrees in practice.
    # Setting featureSubsetStrategy="auto" lets the algorithm choose.
    model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         numTrees=3,
                                         featureSubsetStrategy="auto",
                                         impurity='gini', maxDepth=4, maxBins=32)

    # Evaluate the model on test instances and compute the test error.
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    testErr = (labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count()
               / float(testData.count()))
    print('Test Error = ' + str(testErr))
    print('Learned classification forest model:')
    print(model.toDebugString())

    # Save and load the model.
    model.save(sc, "target/tmp/myRandomForestClassificationModel")
    sameModel = RandomForestModel.load(sc,
                                       "target/tmp/myRandomForestClassificationModel")
    # $example off$
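The reloaded model can score a single feature vector as well as an RDD of vectors. A minimal sketch continuing from the variables above (the printed values depend on your random split):

    # Score one held-out example with the reloaded model; predict() accepts
    # a single feature vector or an RDD of vectors.
    first = testData.first()
    print('label = %s, prediction = %s'
          % (first.label, sameModel.predict(first.features)))

    sc.stop()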

What the learned model looks like:

TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 511 <= 0.0)
     If (feature 434 <= 0.0)
      Predict: 0.0
     Else (feature 434 > 0.0)
      Predict: 1.0
    Else (feature 511 > 0.0)
     Predict: 0.0
  Tree 1:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
  Tree 2:
    If (feature 302 <= 0.0)
     If (feature 461 <= 0.0)
      If (feature 208 <= 107.0)
       Predict: 1.0
      Else (feature 208 > 107.0)
       Predict: 0.0
     Else (feature 461 > 0.0)
      Predict: 1.0
    Else (feature 302 > 0.0)
     Predict: 0.0
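For intuition, a classification forest predicts by majority vote over its trees. Below is a minimal sketch (not Spark API) that hand-translates the three printed trees into plain Python, where f maps a feature index to its value:

# Hand-translated versions of the printed trees (illustration only).
def tree0(f):
    if f[511] <= 0.0:
        return 0.0 if f[434] <= 0.0 else 1.0
    return 0.0

def tree1(f):
    return 0.0 if f[490] <= 31.0 else 1.0

def tree2(f):
    if f[302] <= 0.0:
        if f[461] <= 0.0:
            return 1.0 if f[208] <= 107.0 else 0.0
        return 1.0
    return 0.0

def forest_predict(f):
    votes = [tree0(f), tree1(f), tree2(f)]
    return max(set(votes), key=votes.count)  # majority vote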
