Use Random Forest: Testing 179 Classifiers on 121 Datasets
If you don’t know what algorithm to use on your problem, try a few.
Alternatively, you could just try Random Forest and maybe a Gaussian SVM.
In a recent study, these two algorithms were shown to be the most effective out of nearly 200 algorithms evaluated across more than 100 datasets.
In this post we will review this study and consider some implications for testing algorithms on our own applied machine learning problems.
Do We Need Hundreds of Classifiers?
In the paper, the authors evaluate 179 classifiers arising from 17 families across 121 standard datasets from the UCI Machine Learning Repository.
As a taste, here is a list of the families of algorithms investigated and the number of algorithms in each family.
- Discriminant analysis (DA): 20 classifiers
- Bayesian (BY) approaches: 6 classifiers
- Neural networks (NNET): 21 classifiers
- Support vector machines (SVM): 10 classifiers
- Decision trees (DT): 14 classifiers
- Rule-based methods (RL): 12 classifiers
- Boosting (BST): 20 classifiers
- Bagging (BAG): 24 classifiers
- Stacking (STC): 2 classifiers
- Random Forests (RF): 8 classifiers
- Other ensembles (OEN): 11 classifiers
- Generalized Linear Models (GLM): 5 classifiers
- Nearest neighbor methods (NN): 5 classifiers
- Partial least squares and principal component regression (PLSR): 6 classifiers
- Logistic and multinomial regression (LMR): 3 classifiers
- Multivariate adaptive regression splines (MARS): 2 classifiers
- Other Methods (OM): 10 classifiers
This is a huge study.
Some algorithms were tuned before contributing their final score, and all algorithms were evaluated using 4-fold cross-validation.
Cutting to the chase, they found that Random Forest (specifically, parallel random forest implemented in R) and the Gaussian Support Vector Machine (specifically, the implementation from libSVM) performed the best overall.
From the abstract of the paper:
The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets.
In a discussion on HackerNews about the paper, Ben Hamner from Kaggle makes a corroborating comment on the strong performance of bagged decision trees:
This is consistent with our experience running hundreds of Kaggle competitions: for most classification problems, some variation on ensembles of decision trees (random forests, gradient boosted machines, etc.) performs the best.
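The paper ran its experiments with R implementations (the parallel random forest accessed via caret, and the Gaussian SVM from libSVM). As a rough sketch of the same head-to-head comparison, not the study's actual setup, here is what it might look like with scikit-learn stand-ins and a placeholder dataset:

```python
# A minimal sketch of the kind of comparison the study ran, using
# scikit-learn stand-ins for the R implementations in the paper
# (RandomForestClassifier for parRF, SVC with an RBF kernel for the
# libSVM Gaussian SVM). The load_breast_cancer dataset is just a
# convenient placeholder, not one of the 121 datasets from the study.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

models = {
    "random forest": RandomForestClassifier(n_estimators=500, random_state=1),
    # The SVM gets standardized inputs, mirroring the preprocessing
    # applied to the datasets in the study.
    "gaussian svm": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
}

for name, model in models.items():
    # The paper scored each classifier with 4-fold cross-validation.
    scores = cross_val_score(model, X, y, cv=4, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Both models are scored under the same 4-fold cross-validation; only the SVM is wrapped in a scaling pipeline, much as the study's datasets were standardized.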
Be Very Careful Preparing Data
Some algorithms only work with categorical data and others require numerical data. A few can handle whatever you throw at them. The datasets in the UCI Machine Learning Repository are generally standardized, but not enough to be used in their raw state for a study like this.
In this commentary, the author points out that the categorical attributes in the datasets were systematically transformed into numerical values, but in a way that likely hindered some of the algorithms being tested.
The Gaussian SVM likely performed well in part because of this transformation of categorical attributes into numerical values and the standardization applied to the datasets.
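To make the encoding point concrete, here is a small sketch (with made-up toy data, not one of the study's datasets) contrasting naive integer coding of a categorical attribute with one-hot coding, plus the kind of standardization that favors scale-sensitive methods like the SVM:

```python
# A sketch of why the encoding choice matters. A single categorical
# attribute is coded two ways: as arbitrary integers (which imposes a
# fake ordering that distance- and kernel-based methods will pick up)
# and as one-hot indicator columns. The toy colour values are made up.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

df = pd.DataFrame({"colour": ["red", "green", "blue", "green", "red"]})

# Naive integer coding (categories sorted alphabetically): blue=0,
# green=1, red=2 -- implies red is "further" from blue than green is,
# which is meaningless.
as_integers = OrdinalEncoder().fit_transform(df[["colour"]])

# One-hot coding: one indicator column per category, no fake ordering.
as_one_hot = OneHotEncoder().fit_transform(df[["colour"]]).toarray()

# Standardization (zero mean, unit variance per column), as applied to
# the study's datasets, favours scale-sensitive methods like the SVM.
standardized = StandardScaler().fit_transform(as_integers)

print(as_integers.ravel())   # [2. 1. 0. 1. 2.]
print(as_one_hot)
print(standardized.ravel())
```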
Nevertheless, I commend the courage the authors had in taking on this challenge, and these problems may be addressed by those willing to take on follow-up studies.
The authors also note the OpenML project, which looks like a citizen-science effort to take on the same challenge.
Why Do Studies Like This?
It is easy to snipe at this study with arguments along the lines of the No Free Lunch Theorem (NFLT): that the performance of all algorithms is equivalent when averaged over all possible problems.
I dislike this argument. The NFLT only applies when you have no prior knowledge: when you don't know what problem you are working on or which algorithms you are trying. These conditions are not practical.
In the paper, the authors list four goals for the project:
- To select the globally best classifier for the selected data set collection
- To rank each classifier and family according to its accuracy
- To determine, for each classifier, its probability of achieving the best accuracy, and the difference between its accuracy and the best one
- To evaluate the classifier behavior varying the data set properties (complexity, number of patterns, number of classes and number of inputs)
The authors of the study acknowledge that practical problems we want to solve are a subset of all possible problems and that the number of effective algorithms is not infinite but manageable.
The paper is a statement that we may indeed have something to say about the capability of the most used (or most widely implemented) algorithms on a suite of known (if small) problems, much like the STATLOG project from the mid-1990s.
In Practice: Choose a Middle Ground
You cannot know which algorithm (or algorithm configuration) will perform well or even best on your problem before you get started.
You must try multiple algorithms and double down your efforts on those few that demonstrate their ability to pick out the structure in the problem.
In the context of this study, spot checking is a middle ground between going with your favorite algorithm on one hand and testing all known algorithms on the other hand.
- Pick your favorite algorithm. Fast, but limited to whatever your favorite algorithm or library happens to be.
- Spot check a dozen algorithms. A balanced approach that allows better-performing algorithms to rise to the top for you to focus on.
- Test all known/implemented algorithms. A time-consuming, exhaustive approach that can sometimes deliver surprising results.
Where you land on this spectrum depends on the time and resources you have at your disposal. And remember that trialling algorithms on a problem is but one step in the process of working through it.
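As a rough illustration of the middle option, here is a minimal spot-check sketch. The candidate list, parameters and dataset are placeholders rather than recommendations; the point is simply to score a handful of diverse algorithms under the same cross-validation and see which rise to the top:

```python
# A minimal spot-check sketch: score a handful of standard classifiers
# with the same cross-validation and rank them. Scale-sensitive models
# are wrapped in a scaling pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "NB": GaussianNB(),
    "CART": DecisionTreeClassifier(random_state=1),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "RF": RandomForestClassifier(n_estimators=300, random_state=1),
    "GBM": GradientBoostingClassifier(random_state=1),
}

results = {}
for name, model in candidates.items():
    results[name] = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()

# Rank the candidates; the best few are the ones worth tuning further.
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```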
Testing all algorithms requires a robust test harness. This cannot be overstated.
When I have attempted this in the past, I find that most algorithms pick out most of the structure in the problem. The result is a bunched distribution of scores with a fat head and a long tail, and the differences within the fat head are often very minor.
It is this minor difference that you would like to be meaningful. Hence the need to invest a lot of upfront time in designing a robust test harness (cross-validation, a good number of folds, perhaps a separate validation dataset) without data leakage (data scaling/transforms performed within cross-validation folds, etc.).
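As a small sketch of the leakage point, assuming a scale-sensitive model and a placeholder dataset, compare scaling the whole dataset up front with fitting the scaler inside each fold:

```python
# A sketch of the data-leakage point: the scaler must be fit inside
# each cross-validation fold, not on the full dataset beforehand.
# The model and dataset are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Leaky: the scaler sees the held-out folds' statistics before scoring.
X_scaled = StandardScaler().fit_transform(X)
leaky = cross_val_score(SVC(), X_scaled, y, cv=10).mean()

# Leak-free: the scaler is refit on the training portion of every fold.
pipeline = make_pipeline(StandardScaler(), SVC())
clean = cross_val_score(pipeline, X, y, cv=10).mean()

print(f"leaky estimate: {leaky:.3f}, leak-free estimate: {clean:.3f}")
```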
I take this for granted on applied problems now. I don’t even care which algorithms rise up. I focus my efforts on data preparation and on ensembling the results of a diverse set of good enough models.
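One simple way to ensemble a diverse set of good-enough models is soft voting; here is a hedged sketch (the member models and dataset are placeholders, not a prescription):

```python
# A sketch of combining a few diverse, good-enough models rather than
# chasing the single best one; a soft-voting ensemble averages their
# predicted probabilities.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=300, random_state=1)),
        # probability=True so the SVM can contribute to soft voting
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
    ],
    voting="soft",
)

print(cross_val_score(ensemble, X, y, cv=10).mean())
```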
Next Step
Where do you fall on this spectrum when working on a machine learning problem?
Do you stick with a favorite algorithm or a favorite set of algorithms? Do you spot check, or do you try to be exhaustive and test everything that your favorite libraries have to offer?