Use Random Forest: Testing 179 Classifiers on 121 Datasets
If you don’t know what algorithm to use on your problem, try a few.
Alternatively, you could just try Random Forest and maybe a Gaussian SVM.
In a recent study, these two algorithms were shown to be the most effective out of nearly 200 algorithms evaluated across more than 100 datasets.
In this post we will review this study and consider some implications for testing algorithms on our own applied machine learning problems.
Do We Need Hundreds of Classifiers?
In the paper, the authors evaluate 179 classifiers arising from 17 families across 121 standard datasets from the UCI Machine Learning Repository.
As a taste, here is a list of the families of algorithms investigated and the number of algorithms in each family.
- Discriminant analysis (DA): 20 classifiers
- Bayesian (BY) approaches: 6 classifiers
- Neural networks (NNET): 21 classifiers
- Support vector machines (SVM): 10 classifiers
- Decision trees (DT): 14 classifiers
- Rule-based methods (RL): 12 classifiers
- Boosting (BST): 20 classifiers
- Bagging (BAG): 24 classifiers
- Stacking (STC): 2 classifiers
- Random Forests (RF): 8 classifiers
- Other ensembles (OEN): 11 classifiers
- Generalized Linear Models (GLM): 5 classifiers
- Nearest neighbor methods (NN): 5 classifiers
- Partial least squares and principal component regression (PLSR): 6 classifiers
- Logistic and multinomial regression (LMR): 3 classifiers
- Multivariate adaptive regression splines (MARS): 2 classifiers
- Other Methods (OM): 10 classifiers
This is a huge study.
Some algorithms were tuned before contributing their final score, and all algorithms were evaluated using 4-fold cross-validation.
Cutting to the chase, they found that Random Forest (specifically, parallel random forest implemented in R) and the Gaussian Support Vector Machine (specifically, the implementation from libSVM) performed the best overall.
From the abstract of the paper:
The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets.
In a discussion on HackerNews about the paper, Ben Hamner from Kaggle makes a corroborating comment on the strong performance of bagged decision trees:
This is consistent with our experience running hundreds of Kaggle competitions: for most classification problems, some variation on ensembles of decision trees (random forests, gradient boosted machines, etc.) performs the best.
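The paper ran its experiments with R implementations (the parallel random forest accessed via caret, and the Gaussian SVM from libSVM). As a rough sketch of the same head-to-head comparison, not the study's actual setup, here is what it might look like with scikit-learn stand-ins and a placeholder dataset:

```python
# A minimal sketch of the kind of comparison the study ran, using
# scikit-learn stand-ins for the R implementations in the paper
# (RandomForestClassifier for parRF, SVC with an RBF kernel for the
# libSVM Gaussian SVM). The load_breast_cancer dataset is just a
# convenient placeholder, not one of the 121 datasets from the study.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

models = {
    "random forest": RandomForestClassifier(n_estimators=500, random_state=1),
    # The SVM gets standardized inputs, mirroring the preprocessing
    # applied to the datasets in the study.
    "gaussian svm": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
}

for name, model in models.items():
    # The paper scored each classifier with 4-fold cross-validation.
    scores = cross_val_score(model, X, y, cv=4, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Both models are scored under the same 4-fold cross-validation; only the SVM is wrapped in a scaling pipeline, much as the study's datasets were standardized.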
Be Very Careful Preparing Data
Some algorithms only work with categorical data and others require numerical data. A few can handle whatever you throw at them. The datasets in the UCI Machine Learning Repository are generally standardized, but not enough to be used in their raw state for a study like this.
In this commentary, the author points out that the categorical attributes in the datasets were systematically transformed into numerical values, but in a way that likely hindered some of the algorithms being tested.
The Gaussian SVM likely performed well in part because of this transformation of categorical attributes into numerical values and the standardization applied to the datasets.
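To make the encoding point concrete, here is a small sketch (with made-up toy data, not one of the study's datasets) contrasting naive integer coding of a categorical attribute with one-hot coding, plus the kind of standardization that favors scale-sensitive methods like the SVM:

```python
# A sketch of why the encoding choice matters. A single categorical
# attribute is coded two ways: as arbitrary integers (which imposes a
# fake ordering that distance- and kernel-based methods will pick up)
# and as one-hot indicator columns. The toy colour values are made up.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

df = pd.DataFrame({"colour": ["red", "green", "blue", "green", "red"]})

# Naive integer coding (categories sorted alphabetically): blue=0,
# green=1, red=2 -- implies red is "further" from blue than green is,
# which is meaningless.
as_integers = OrdinalEncoder().fit_transform(df[["colour"]])

# One-hot coding: one indicator column per category, no fake ordering.
as_one_hot = OneHotEncoder().fit_transform(df[["colour"]]).toarray()

# Standardization (zero mean, unit variance per column), as applied to
# the study's datasets, favours scale-sensitive methods like the SVM.
standardized = StandardScaler().fit_transform(as_integers)

print(as_integers.ravel())   # [2. 1. 0. 1. 2.]
print(as_one_hot)
print(standardized.ravel())
```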
Nevertheless, I commend the courage the authors had in taking on this challenge, and these problems may be addressed by those willing to take on follow-up studies.
The authors also note the OpenML project, which looks like a citizen-science effort to take on the same challenge.
Why Do Studies Like This?
It is easy to snipe at this study with arguments along the lines of the No Free Lunch Theorem (NFLT): that the performance of all algorithms is equivalent when averaged over all possible problems.
I dislike this argument. The NFLT only applies when you have no prior knowledge: when you don't know what problem you are working on or which algorithms you are trying. These conditions are not practical.
In the paper, the authors list four goals for the project:
- To select the globally best classifier for the selected data set collection
- To rank each classifier and family according to its accuracy
- To determine, for each classifier, its probability of achieving the best accuracy, and the difference between its accuracy and the best one
- To evaluate the classifier behavior varying the data set properties (complexity, number of patterns, number of classes and number of inputs)
The authors of the study acknowledge that practical problems we want to solve are a subset of all possible problems and that the number of effective algorithms is not infinite but manageable.
The paper is a statement that we may indeed have something to say about the capability of the most used (or most widely implemented) algorithms on a suite of known (if small) problems, much like the STATLOG project from the mid-1990s.
In Practice: Choose a Middle Ground
You cannot know which algorithm (or algorithm configuration) will perform well or even best on your problem before you get started.
You must try multiple algorithms and double down your efforts on those few that demonstrate their ability to pick out the structure in the problem.
In the context of this study, spot checking is a middle ground between going with your favorite algorithm on one hand and testing all known algorithms on the other hand.
- Pick your favorite algorithm. Fast, but limited to whatever your favorite algorithm or library happens to be.
- Spot check a dozen algorithms. A balanced approach that allows better-performing algorithms to rise to the top for you to focus on.
- Test all known/implemented algorithms. A time-consuming, exhaustive approach that can sometimes deliver surprising results.
Where you land on this spectrum depends on the time and resources you have at your disposal. And remember that trialling algorithms on a problem is but one step in the process of working through it.
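As a rough illustration of the middle option, here is a minimal spot-check sketch. The candidate list, parameters and dataset are placeholders rather than recommendations; the point is simply to score a handful of diverse algorithms under the same cross-validation and see which rise to the top:

```python
# A minimal spot-check sketch: score a handful of standard classifiers
# with the same cross-validation and rank them. Scale-sensitive models
# are wrapped in a scaling pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "NB": GaussianNB(),
    "CART": DecisionTreeClassifier(random_state=1),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "RF": RandomForestClassifier(n_estimators=300, random_state=1),
    "GBM": GradientBoostingClassifier(random_state=1),
}

results = {}
for name, model in candidates.items():
    results[name] = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()

# Rank the candidates; the best few are the ones worth tuning further.
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```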
Testing all algorithms requires a robust test harness. This cannot be overstated.
When I have attempted this in the past, I find that most algorithms pick out most of the structure in the problem. The result is a bunched distribution of scores with a fat head and a long tail, and the differences within the fat head are often very minor.
It is this minor difference that you would like to be meaningful. Hence the need to invest a lot of upfront time in designing a robust test harness (cross-validation, a good number of folds, perhaps a separate validation dataset) without data leakage (data scaling/transforms performed within cross-validation folds, etc.).
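As a small sketch of the leakage point, assuming a scale-sensitive model and a placeholder dataset, compare scaling the whole dataset up front with fitting the scaler inside each fold:

```python
# A sketch of the data-leakage point: the scaler must be fit inside
# each cross-validation fold, not on the full dataset beforehand.
# The model and dataset are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Leaky: the scaler sees the held-out folds' statistics before scoring.
X_scaled = StandardScaler().fit_transform(X)
leaky = cross_val_score(SVC(), X_scaled, y, cv=10).mean()

# Leak-free: the scaler is refit on the training portion of every fold.
pipeline = make_pipeline(StandardScaler(), SVC())
clean = cross_val_score(pipeline, X, y, cv=10).mean()

print(f"leaky estimate: {leaky:.3f}, leak-free estimate: {clean:.3f}")
```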
I take this for granted on applied problems now. I don’t even care which algorithms rise up. I focus my efforts on data preparation and on ensembling the results of a diverse set of good enough models.
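One simple way to ensemble a diverse set of good-enough models is soft voting; here is a hedged sketch (the member models and dataset are placeholders, not a prescription):

```python
# A sketch of combining a few diverse, good-enough models rather than
# chasing the single best one; a soft-voting ensemble averages their
# predicted probabilities.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=300, random_state=1)),
        # probability=True so the SVM can contribute to soft voting
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
    ],
    voting="soft",
)

print(cross_val_score(ensemble, X, y, cv=10).mean())
```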
Next Step
Where do you fall on this spectrum when working on a machine learning problem?
Do you stick with a favorite algorithm or a favorite set of algorithms? Do you spot check, or do you try to be exhaustive and test everything that your favorite libraries have to offer?