Save And Finalize Your Machine Learning Model in R

Finding an accurate machine learning model is not the end of the project.

In this post you will discover how to finalize your machine learning model in R including: making predictions on unseen data, re-building the model from scratch and saving your model for later use.

Let’s get started.

Finalize Your Machine Learning Model in R.
Photo by Christian Schnettelker, some rights reserved.

Finalize Your Machine Learning Model

Once you have an accurate model on your test harness you are nearly done. But not yet.

There are still a number of tasks to do to finalize your model. The whole idea of creating an accurate model for your dataset was to make predictions on unseen data.

There are three tasks you may be concerned with:

Making new predictions on unseen data.
Creating a standalone model using all training data.
Saving your model to file for later loading and making predictions on new data.

Once you have finalized your model you are ready to make use of it. You could use the R model directly. You could also discover the key internal representation found by the learning algorithm (like the coefficients in a linear model) and use them in a new implementation of the prediction algorithm on another platform.

In the next section, you will look at how you can finalize your machine learning model in R.

Finalize Predictive Model in R

Caret is an excellent tool for finding good, or even the best, machine learning algorithms and algorithm parameters for your problem.

But what do you do after you have discovered a model that is accurate enough to use?

Once you have found a good model in R, you have three main concerns:

Making new predictions using your tuned caret model.
Creating a standalone model using the entire training dataset.
Saving/Loading a standalone model to file.

This section will step you through how to achieve each of these tasks in R.

1. Make Predictions On New Data

You can make new predictions with a model you have tuned in caret by calling predict() on the fit, which dispatches to caret's predict.train() function.

In the recipe below, the dataset is split into a validation dataset and a training dataset. The validation dataset could just as easily be a new dataset stored in a separate file and loaded as a data frame.

A good model of the data is found using LDA. We can see that caret provides access to the best model from a training run in the finalModel variable.

We can use that model to make predictions by calling predict() with the fit from train(), which will automatically use the final model. We must specify the data on which to make predictions via the newdata argument.

# load libraries
library(caret)
library(mlbench)
# load dataset
data(PimaIndiansDiabetes)
# create 80%/20% for training and validation datasets
set.seed(9)
validation_index <- createDataPartition(PimaIndiansDiabetes$diabetes, p=0.80, list=FALSE)
validation <- PimaIndiansDiabetes[-validation_index,]
training <- PimaIndiansDiabetes[validation_index,]
# train a model and summarize model
set.seed(9)
control <- trainControl(method="cv", number=10)
fit.lda <- train(diabetes~., data=training, method="lda", metric="Accuracy", trControl=control)
print(fit.lda)
print(fit.lda$finalModel)
# estimate skill on validation dataset
set.seed(9)
predictions <- predict(fit.lda, newdata=validation)
confusionMatrix(predictions, validation$diabetes)

Running the example, we can see that the cross-validated accuracy estimated on the training dataset was 76.91%. Using the finalModel in the fit, we can see that the accuracy on the hold-out validation dataset was 77.78%, very similar to our estimate.

Resampling results

  Accuracy   Kappa    Accuracy SD  Kappa SD 
  0.7691169  0.45993  0.06210884   0.1537133

...

Confusion Matrix and Statistics

          Reference
Prediction neg pos
       neg  85  19
       pos  15  34
                                          
               Accuracy : 0.7778          
                 95% CI : (0.7036, 0.8409)
    No Information Rate : 0.6536          
    P-Value [Acc > NIR] : 0.000586        
                                          
                  Kappa : 0.5004          
 Mcnemar's Test P-Value : 0.606905        
                                          
            Sensitivity : 0.8500          
            Specificity : 0.6415          
         Pos Pred Value : 0.8173          
         Neg Pred Value : 0.6939          
             Prevalence : 0.6536          
         Detection Rate : 0.5556          
   Detection Prevalence : 0.6797          
      Balanced Accuracy : 0.7458          
                                          
       'Positive' Class : neg
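
As noted earlier, the new data could just as easily come from a separate file. The sketch below illustrates that case under a few assumptions: the CSV file is a temporary file created here only for illustration, the model is fit with a lighter 5-fold cross-validation to keep the run short, and the file contains the same eight predictor columns as the training data. It also requests class probabilities via type="prob", which caret supports for LDA.

```r
# load libraries
library(caret)
library(mlbench)
# load dataset and fit an LDA model (5-fold CV chosen only to keep this sketch fast)
data(PimaIndiansDiabetes)
set.seed(9)
control <- trainControl(method="cv", number=5)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", metric="Accuracy", trControl=control)
# simulate unlabeled new data arriving as a CSV file (hypothetical file, predictors only)
new_file <- tempfile(fileext=".csv")
write.csv(PimaIndiansDiabetes[1:5, 1:8], new_file, row.names=FALSE)
# load the new data as a data frame
new_data <- read.csv(new_file)
# predict class labels for the new rows
new_classes <- predict(fit.lda, newdata=new_data)
print(new_classes)
# predict class probabilities for the new rows (one column per class)
new_probs <- predict(fit.lda, newdata=new_data, type="prob")
print(new_probs)
```

The column names in the new file must match the predictor names used during training, otherwise predict() cannot build the inputs for the model.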

2. Create A Standalone Model

In this example, we have tuned a random forest with 3 different values for mtry and ntree set to 2000. By printing the fit and the finalModel, we can see that the most accurate value for mtry was 2.

Now that we know a good algorithm (random forest) and a good configuration (mtry=2, ntree=2000), we can create the final model directly using all of the training data. We can look up the “rf” random forest implementation used by caret in the Caret List of Models and note that it uses the randomForest package and, in turn, the randomForest() function.

The example creates a new model directly and uses it to make predictions on new data, in this case simulated by the validation dataset.

# load libraries
library(caret)
library(mlbench)
library(randomForest)
# load dataset
data(Sonar)
set.seed(7)
# create 80%/20% for training and validation datasets
validation_index <- createDataPartition(Sonar$Class, p=0.80, list=FALSE)
validation <- Sonar[-validation_index,]
training <- Sonar[validation_index,]
# train a model and summarize model
set.seed(7)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
fit.rf <- train(Class~., data=training, method="rf", metric="Accuracy", trControl=control, ntree=2000)
print(fit.rf)
print(fit.rf$finalModel)
# create standalone model using all training data
set.seed(7)
finalModel <- randomForest(Class~., training, mtry=2, ntree=2000)
# make predictions on "new data" using the final model
final_predictions <- predict(finalModel, validation[,1:60])
confusionMatrix(final_predictions, validation$Class)

We can see that the estimated accuracy of the optimal configuration was 85.07%. The final standalone model, trained on the entire training dataset, achieved an accuracy of 82.93% on the validation dataset.

Random Forest 

167 samples
 60 predictor
  2 classes: 'M', 'R' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 151, 150, 150, 150, 151, 150, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD 
   2    0.8507353  0.6968343  0.07745360   0.1579125
  31    0.8064951  0.6085348  0.09373438   0.1904946
  60    0.7927696  0.5813335  0.08768147   0.1780100

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 2. 

...

Call:
 randomForest(x = x, y = y, ntree = 2000, mtry = param$mtry) 
               Type of random forest: classification
                     Number of trees: 2000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 14.37%
Confusion matrix:
   M  R class.error
M 83  6  0.06741573
R 18 60  0.23076923

...

Confusion Matrix and Statistics

          Reference
Prediction  M  R
         M 20  5
         R  2 14
                                          
               Accuracy : 0.8293          
                 95% CI : (0.6794, 0.9285)
    No Information Rate : 0.5366          
    P-Value [Acc > NIR] : 8.511e-05       
                                          
                  Kappa : 0.653           
 Mcnemar's Test P-Value : 0.4497          
                                          
            Sensitivity : 0.9091          
            Specificity : 0.7368          
         Pos Pred Value : 0.8000          
         Neg Pred Value : 0.8750          
             Prevalence : 0.5366          
         Detection Rate : 0.4878          
   Detection Prevalence : 0.6098          
      Balanced Accuracy : 0.8230          
                                          
       'Positive' Class : M
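
Rather than copying the selected mtry value by hand, it may be convenient to read it from the caret fit. The sketch below is one way to do that, assuming the bestTune element of the train object, which holds the selected tuning parameters as a data frame. It uses a lighter 5-fold cross-validation and ntree=200 only to keep the run short; these are not the settings from the recipe above.

```r
# load libraries
library(caret)
library(mlbench)
library(randomForest)
# load dataset and create the same 80%/20% split
data(Sonar)
set.seed(7)
validation_index <- createDataPartition(Sonar$Class, p=0.80, list=FALSE)
training <- Sonar[validation_index,]
# tune mtry with caret (reduced settings, chosen only to keep this sketch fast)
set.seed(7)
control <- trainControl(method="cv", number=5)
fit.rf <- train(Class~., data=training, method="rf", metric="Accuracy", trControl=control, ntree=200)
# read the selected value of mtry from the fit instead of hard-coding it
best_mtry <- fit.rf$bestTune$mtry
print(best_mtry)
# create the standalone model with the selected configuration
set.seed(7)
final_model <- randomForest(Class~., training, mtry=best_mtry, ntree=200)
print(final_model$mtry)
```

This keeps the standalone model in step with the tuning run if the data or the tuning grid changes later.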

Some simpler models, like linear models, can output their coefficients. This is useful because you can implement the simple prediction procedure in your language of choice and use the coefficients to get the same accuracy. This gets more difficult as the complexity of the representation increases.
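
As a rough illustration of this idea, the sketch below fits a linear regression with base R on the built-in mtcars dataset (not one of the datasets used in this post), extracts the coefficients, and re-implements the prediction by hand. The helper predict_from_coefs() is hypothetical and written only for this example; the same arithmetic could be ported to any other language.

```r
# fit a simple linear model on a dataset that ships with base R
fit <- lm(mpg ~ wt + hp, data=mtcars)
# the learned representation is a named vector of coefficients
coefs <- coef(fit)
print(coefs)
# hypothetical helper: re-implement the prediction using only the coefficients
predict_from_coefs <- function(coefs, wt, hp) {
  coefs[["(Intercept)"]] + coefs[["wt"]] * wt + coefs[["hp"]] * hp
}
# predictions from the hand-written procedure
manual <- predict_from_coefs(coefs, mtcars$wt, mtcars$hp)
# predictions from the fitted R model, for comparison
builtin <- unname(predict(fit, newdata=mtcars))
# the two sets of predictions should agree up to floating point error
print(all.equal(manual, builtin))
```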

3. Save and Load Your Model

You can save your best models to a file so that you can load them up later and make predictions.

In this example we split the Sonar dataset into a training dataset and a validation dataset. We take our validation dataset as new data to test our final model. We train the final model using the training dataset and our optimal parameters, then save it to a file called final_model.rds in the local working directory.

The model is serialized. It can be loaded at a later time by calling readRDS() and assigning the object that is loaded (in this case a random forest fit) to a variable name. The loaded random forest is then used to make predictions on new data, in this case the validation dataset.

# load libraries
library(caret)
library(mlbench)
library(randomForest)
# load dataset
data(Sonar)
set.seed(7)
# create 80%/20% for training and validation datasets
validation_index <- createDataPartition(Sonar$Class, p=0.80, list=FALSE)
validation <- Sonar[-validation_index,]
training <- Sonar[validation_index,]
# create final standalone model using all training data
set.seed(7)
final_model <- randomForest(Class~., training, mtry=2, ntree=2000)
# save the model to disk
saveRDS(final_model, "./final_model.rds")

# later...

# load the model
super_model <- readRDS("./final_model.rds")
print(super_model)
# make predictions on "new data" using the final model
final_predictions <- predict(super_model, validation[,1:60])
confusionMatrix(final_predictions, validation$Class)

We can see that the accuracy on the validation dataset was 82.93%.

Confusion Matrix and Statistics

          Reference
Prediction  M  R
         M 20  5
         R  2 14
                                          
               Accuracy : 0.8293          
                 95% CI : (0.6794, 0.9285)
    No Information Rate : 0.5366          
    P-Value [Acc > NIR] : 8.511e-05       
                                          
                  Kappa : 0.653           
 Mcnemar's Test P-Value : 0.4497          
                                          
            Sensitivity : 0.9091          
            Specificity : 0.7368          
         Pos Pred Value : 0.8000          
         Neg Pred Value : 0.8750          
             Prevalence : 0.5366          
         Detection Rate : 0.4878          
   Detection Prevalence : 0.6098          
      Balanced Accuracy : 0.8230          
                                          
       'Positive' Class : M
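
A quick sanity check after saving is to confirm that the reloaded model predicts exactly as the original did. The sketch below shows one way to do this. To stay runnable with base R alone it uses a small lm() fit on the built-in mtcars dataset as a stand-in for the random forest, and it writes to a temporary file rather than final_model.rds; both choices are assumptions made for this example.

```r
# fit a small stand-in model (base R only, used here in place of the random forest)
fit <- lm(mpg ~ wt + hp, data=mtcars)
# save the model to a temporary file
model_path <- tempfile(fileext=".rds")
saveRDS(fit, model_path)
# later... load the model back into a new variable
restored <- readRDS(model_path)
# predictions from the original and the reloaded model
before <- predict(fit, newdata=mtcars)
after <- predict(restored, newdata=mtcars)
# the reloaded model should reproduce the original predictions exactly
print(identical(before, after))
# clean up the temporary file
unlink(model_path)
```

When loading a model in a fresh R session, remember to load the package that created it (for example randomForest) before calling predict(), so that the right prediction method is available.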
