1. 程式人生 > >Tuning Machine Learning Models Using the Caret R Package

Tuning Machine Learning Models Using the Caret R Package

Machine learning algorithms are parameterized so that they can be best adapted for a given problem. A difficulty is that configuring an algorithm for a given problem can be a project in and of itself.

Like selecting ‘the best’ algorithm for a problem you cannot know before hand which algorithm parameters will be best for a problem. The best thing to do is to investigate empirically with controlled experiments.

The caret R package was designed to make finding optimal parameters for an algorithm very easy. It provides a grid search method for searching parameters, combined with various methods for estimating the performance of a given model.

In this post you will discover 5 recipes that you can use to tune machine learning algorithms to find optimal parameters for your problems using the caret R package.

Need more Help with R for Machine Learning?

Take my free 14-day email course and discover how to use R on your project (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Model Tuning

The caret R package provides a grid search where it or you can specify the parameters to try on your problem. It will trial all combinations and locate the one combination that gives the best results.

The examples in this post will demonstrate how you can use the caret R package to tune a machine learning algorithm.

The Learning Vector Quantization (LVQ) will be used in all examples because of its simplicity. It is like k-nearest neighbors, except the database of samples is smaller and adapted based on training data. It has two parameters to tune, the number of instances (codebooks) in the model called the size, and the number of instances to check when making predictions called k.

Each example will also use the iris flowers dataset, that comes with R. This classification dataset provides 150 observations for three species of iris flower and their petal and sepal measurements in centimeters.

Each example also assumes that we are interested in the classification accuracy as the metric we are optimizing, although this can be changed. Also, each example estimates the performance of a given model (size and k parameter combination) using repeated n-fold cross validation, with 10 folds and 3 repeats. This too can be changed if you like.

Grid Search: Automatic Grid

There are two ways to tune an algorithm in the Caret R package, the first is by allowing the system to do it automatically. This can be done by setting the tuneLength to indicate the number of different values to try for each algorithm parameter.

This only supports integer and categorical algorithm parameters, and it makes a crude guess as to what values to try, but it can get you up and running very quickly.

The following recipe demonstrates the automatic grid search of the size and k attributes of LVQ with 5 (tuneLength=5) values of each (25 total models).

Automatic parameter tuning in R R
123456789101112 # ensure results are repeatableset.seed(7)# load the librarylibrary(caret)# load the datasetdata(iris)# prepare training schemecontrol<-trainControl(method="repeatedcv",number=10,repeats=3)# train the modelmodel<-train(Species~.,data=iris,method="lvq",trControl=control,tuneLength=5)# summarize the modelprint(model)

The final values used for the model were size = 10 and k = 1.

Grid Search: Manual Grid

The second way to search algorithm parameters is to specify a tune grid manually. In the grid, each algorithm parameter can be specified as a vector of possible values. These vectors combine to define all the possible combinations to try.

The recipe below demonstrates the search of a manual tune grid with 4 values for the size parameter and 5 values for the k parameter (20 combinations).

Grid search with the caret r package R
1234567891011121314 # ensure results are repeatableset.seed(7)# load the librarylibrary(caret)# load the datasetdata(iris)# prepare training schemecontrol<-trainControl(method="repeatedcv",number=10,repeats=3)# design the parameter tuning gridgrid<-expand.grid(size=c(5,10,20,50),k=c(1,2,3,4,5))# train the modelmodel<-train(Species~.,data=iris,method="lvq",trControl=control,tuneGrid=grid)# summarize the modelprint(model)

The final values used for the model were size = 50 and k = 5.

Data Pre-Processing

The dataset can be preprocessed as part of the parameter tuning. It is important to do this within the sample used to evaluate each model, to ensure that the results account for all the variability in the test. If the data set say normalized or standardized before the tuning process, it would have access to additional knowledge (bias) and not give an as accurate estimate of performance on unseen data.

The attributes in the iris dataset are all in the same units and generally the same scale, so normalization and standardization are not really necessary. Nevertheless, the example below demonstrates tuning the size and k parameters of LVQ while normalizing the dataset with preProcess=”scale”.

Grid search with preprocessing with the caret r package R
123456789101112 # ensure results are repeatableset.seed(7)# load the librarylibrary(caret)# load the datasetdata(iris)# prepare training schemecontrol<-trainControl(method="repeatedcv",number=10,repeats=3)# train the modelmodel<-train(Species~.,data=iris,method="lvq",preProcess="scale",trControl=control,tuneLength=5)# summarize the modelprint(model)

The final values used for the model were size = 8 and k = 6.

Parallel Processing

The caret package supports parallel processing in order to decrease the compute time for a given experiment. It is supported automatically as long as it is configured. In this example we load the doMC package and set the number of cores to 4, making available 4 worker threads to caret when tuning the model. This is used for the loops for the repeats of cross validation for each parameter combination.

Grid search with parallel processing with the caret r package R
123456789101112131415 # ensure results are repeatableset.seed(7)# configure multicorelibrary(doMC)registerDoMC(cores=4)# load the librarylibrary(caret)# load the datasetdata(iris)# prepare training schemecontrol<-trainControl(method="repeatedcv",number=10,repeats=3)# train the modelmodel<-train(Species~.,data=iris,method="lvq",trControl=control,tuneLength=5)# summarize the modelprint(model)

The results are the same as the first example, only completed faster.

Visualization of Performance

It can be useful to graph the performance of different algorithm parameter combinations to look for trends and the sensitivity of the model. Caret supports graphing the model directly which will compare the accuracy of different algorithm combinations.

In the recipe below, a larger manual grid of algorithm parameters are defined and the results are graphed. The graph shows the size on the x axis and model accuracy on the y axis. Two lines are drawn, one for each k value. The graph shows the general trends in the increase in performance with size and that the larger value of k is probably preferred.

Grid search with visualization with the caret r package R
12345678910111213141516 # ensure results are repeatableset.seed(7)# load the librarylibrary(caret)# load the datasetdata(iris)# prepare training schemecontrol<-trainControl(method="repeatedcv",number=10,repeats=3)# design the parameter tuning gridgrid<-expand.grid(size=c(5,10,15,20,25,30,35,40,45,50),k=c(3,5))# train the modelmodel<-train(Species~.,data=iris,method="lvq",trControl=control,tuneGrid=grid)# summarize the modelprint(model)# plot the effect of parameters on accuracyplot(model)

The final values used for the model were size = 35 and k = 5.

Grid Search using the Caret R Package

Grid Search using the Caret R Package
Shows size and k versus model accuracy for LVQ

Summary

In this post you have discovered the support in the caret R package for tuning algorithm parameters by using a grid search.

You have seen 5 recipes using the caret R package for tuning the size and k parameters for the LVQ algorithm.

Each recipe in this post is standalone and ready for you to copy-and-paste into your own project and adapt to your problem.


Frustrated With Your Progress In R Machine Learning?

Master Machine Learning With R

Develop Your Own Models in Minutes

…with just a few lines of R code

Covers self-study tutorials and end-to-end projects like:
Loading data, visualization, build models, tuning, and much more…

Finally Bring Machine Learning To
Your Own Projects

Skip the Academics. Just Results.


相關推薦

Tuning Machine Learning Models Using the Caret R Package

Tweet Share Share Google Plus Machine learning algorithms are parameterized so that they can be

Compare Models And Select The Best Using The Caret R Package

Tweet Share Share Google Plus The Caret R package allows you to easily construct many different

Data Visualization with the Caret R package

Tweet Share Share Google Plus The caret package in R is designed to streamline the process of ap

Feature Selection with the Caret R Package

Tweet Share Share Google Plus Selecting the right features in your data can mean the difference

How To Estimate Model Accuracy in R Using The Caret Package

Tweet Share Share Google Plus When you are building a predictive model, you need a way to evalua

How AI & Machine Learning Are Redefining The War For Talent

These and many other fascinating insights are from Gartner's recent research note, Cool Vendors in Human Capital Management for Talent Acquisition (PDF, 13

Learn How to Code and Deploy Machine Learning Models on Spark Structured Streaming

This post is a token of appreciation for the amazing open source community of Data Science, to which I owe a lot of what I have learned. For last few month

How businesses can build fairness into their machine learning models

Enterprise-wide deployments of AI are constrained by the requirements of scaling any new system or technology: transparency, security, and the application'

How Facebook Uses Bayesian Optimization to Conduct Better Experiments in Machine Learning Models

How Facebook Uses Bayesian Optimization to Conduct Better Experiments in Machine Learning ModelsHyperparameter optimization is a key aspect of the lifecycl

Regression Versus Classification Machine Learning: What's the Difference?

The difference between regression machine learning algorithms and classification machine learning algorithms sometimes confuse most data scientists, which

Utilising machine learning, AI and the cloud to deliver a more personalised customer experience

Paul Armstrong, Principal Solutions Architect, Amazon Web Services: "To deliver a truly personalised travel experience, the first thing airports and airlin

Why Data Normalization is necessary for Machine Learning models

Why Data Normalization is necessary for Machine Learning modelsNormalization is a technique often applied as part of data preparation for machine learning.

Create Custom Machine Learning Models With Google Cloud ML Wimoxez

Google Cloud Machine Learning (ML) motors can be really just controlled services that make it possible for programmers and info boffins to successfully con

Understanding Data and Machine Learning Models with Visualizations (Part 1)

Examining Feature-PC CorrelationOn the non-interactive side, the tool also generates heatmaps with additional information about the principal components. T

Help improve lives through Machine Learning by joining the AWS DeepLens Challenge!

Today, we’re unveiling a fresh approach to the AWS DeepLens Challenge. We are bringing you four challenges to choose from–sustainability, games, h

Why it's hard to design fair machine learning models

In this episode of the Data Show, I spoke with Sharad Goel, assistant professor at Stanford, and his student Sam Corbett-Davies. They recently wrote a surv

Training Machine Learning Models in Pharma and Biotech Manufacturing with Bigfinite Amazon Web Services

Creating and training machine learning models has become less time consuming and more cost efficient thanks to technology advancements like open source sof

Practical Machine Learning Books for the Holidays: A Quick Look at the New Offerings from O'Reilly

Tweet Share Share Google Plus O’Reilly books have a reputation for being practical, hands on and

Begin Machine Learning By Finding The Landmarks

Tweet Share Share Google Plus Where do you begin in machine learning? Is actually breaking groun