Feature Selection with the Caret R Package
Selecting the right features in your data can mean the difference between mediocre performance with long training times and great performance with short training times.
The caret R package provides tools to automatically report on the relevance and importance of attributes in your data and even select the most important features for you.
In this post you will discover the feature selection tools in the Caret R package with standalone recipes in R.
After reading this post you will know:
- How to remove redundant features from your dataset.
- How to rank features in your dataset by their importance.
- How to select features from your dataset using the Recursive Feature Elimination method.
Let’s get started.
Remove Redundant Features
Data can contain attributes that are highly correlated with each other. Many methods perform better if highly correlated attributes are removed.
The Caret R package provides the findCorrelation function, which analyzes a correlation matrix of your data’s attributes and reports the attributes that can be removed.
The following example loads the Pima Indians Diabetes dataset, which contains a number of biological attributes from medical reports. A correlation matrix is created from these attributes and the highly correlated attributes are identified. In this case, the age attribute is removed because it correlates highly with the pregnant attribute.
Generally, you want to remove attributes with an absolute correlation of 0.75 or higher. The example below uses a lower cutoff of 0.5.
Identify highly correlated features in the Caret R package:

    # ensure the results are repeatable
    set.seed(7)
    # load the libraries
    library(mlbench)
    library(caret)
    # load the data
    data(PimaIndiansDiabetes)
    # calculate the correlation matrix of the 8 input attributes
    correlationMatrix <- cor(PimaIndiansDiabetes[,1:8])
    # summarize the correlation matrix
    print(correlationMatrix)
    # find attributes that are highly correlated
    # (a cutoff of 0.5 is used here; 0.75 or higher is the usual guideline)
    highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)
    # print indexes of highly correlated attributes
    print(highlyCorrelated)
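The example above only prints the indexes of the flagged attributes. A minimal sketch of actually dropping them follows. The names predictors_df and reduced are illustrative and not part of the original recipe. The length check is there because negative indexing with an empty vector would drop every column instead of none.

```r
# load the libraries and the data
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)

# keep only the 8 input attributes (illustrative name)
predictors_df <- PimaIndiansDiabetes[, 1:8]
correlationMatrix <- cor(predictors_df)

# same cutoff as the recipe above
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.5)

# drop the flagged columns, guarding against the case where nothing is flagged
if (length(highlyCorrelated) > 0) {
  reduced <- predictors_df[, -highlyCorrelated, drop = FALSE]
} else {
  reduced <- predictors_df
}

# names of the attributes that remain
print(names(reduced))
```

The reduced data frame can then be passed to a model in place of the original input columns.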
Rank Features By Importance
The importance of features can be estimated from data by building a model. Some methods, like decision trees, have a built-in mechanism to report on variable importance. For other algorithms, the importance can be estimated using a ROC curve analysis conducted for each attribute.
The example below loads the Pima Indians Diabetes dataset and constructs a Learning Vector Quantization (LVQ) model. The varImp function is then used to estimate the variable importance, which is printed and plotted. It shows that the glucose, mass and age attributes are the top 3 most important attributes in the dataset and that the insulin attribute is the least important.
Rank features by importance using the Caret R package:

    # ensure results are repeatable
    set.seed(7)
    # load the libraries
    library(mlbench)
    library(caret)
    # load the dataset
    data(PimaIndiansDiabetes)
    # prepare training scheme: 10-fold cross-validation repeated 3 times
    control <- trainControl(method="repeatedcv", number=10, repeats=3)
    # train the model
    model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq", preProcess="scale", trControl=control)
    # estimate variable importance
    importance <- varImp(model, scale=FALSE)
    # summarize importance
    print(importance)
    # plot importance
    plot(importance)
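The per-attribute ROC analysis mentioned above can also be run directly, without training a model first. The sketch below assumes caret's filterVarImp function, which for a classification outcome returns one column of scores per class. With a two-class outcome such as this one, both columns are expected to hold the same area-under-the-curve value, so only the first is used for sorting. The names rocImp and ranked are illustrative.

```r
# load the libraries and the data
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)

# input attributes and class labels
x <- PimaIndiansDiabetes[, 1:8]
y <- PimaIndiansDiabetes$diabetes

# ROC-based importance of each attribute, computed one attribute at a time
rocImp <- filterVarImp(x, y)

# sort attributes from most to least important using the first class column
ranked <- rocImp[order(rocImp[, 1], decreasing = TRUE), , drop = FALSE]
print(ranked)
```

Because no model is fitted, this ranking is quick to compute, but it scores each attribute in isolation and ignores interactions between attributes.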
Feature Selection
Automatic feature selection methods can be used to build many models with different subsets of a dataset and identify those attributes that are and are not required to build an accurate model.
A popular automatic method for feature selection provided by the caret R package is called Recursive Feature Elimination or RFE.
The example below demonstrates the RFE method on the Pima Indians Diabetes dataset. A Random Forest algorithm is used on each iteration to evaluate the model. The algorithm is configured to explore all possible subset sizes, from 1 to 8 attributes. All 8 attributes are selected in this example, although in the plot showing the accuracy of the different attribute subset sizes, we can see that just 4 attributes give almost comparable results.
Automatically select features using the Caret R package:

    # ensure the results are repeatable
    set.seed(7)
    # load the libraries
    library(mlbench)
    library(caret)
    # load the data
    data(PimaIndiansDiabetes)
    # define the control using a random forest selection function
    control <- rfeControl(functions=rfFuncs, method="cv", number=10)
    # run the RFE algorithm
    results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control)
    # summarize the results
    print(results)
    # list the chosen features
    predictors(results)
    # plot the results
    plot(results, type=c("g", "o"))
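When a smaller subset performs almost as well as the best one, you may prefer it. The sketch below reruns the same RFE configuration and then looks for the smallest subset size whose accuracy is within a chosen tolerance of the best. It assumes caret's pickSizeTolerance function, and the 2 percent tolerance is an arbitrary value picked for illustration. The rfFuncs helper also requires the randomForest package to be installed. The names perf and smallSize are illustrative.

```r
# ensure the results are repeatable
set.seed(7)
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)

# same RFE configuration as the recipe above
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
results <- rfe(PimaIndiansDiabetes[, 1:8], PimaIndiansDiabetes[, 9],
               sizes = 1:8, rfeControl = control)

# resampled accuracy for each subset size
perf <- results$results[, c("Variables", "Accuracy")]
print(perf)

# smallest subset size within 2 percent of the best accuracy (assumed tolerance)
smallSize <- pickSizeTolerance(results$results, metric = "Accuracy",
                               tol = 2, maximize = TRUE)
print(smallSize)
```

The tolerance is a trade-off: a larger value favours fewer attributes at the cost of some accuracy.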
Summary
In this post you discovered 3 feature selection methods provided by the caret R package. Specifically, you learned how to search for and remove redundant features, rank features by importance, and automatically select a subset of the most predictive features.
Three standalone recipes in R were provided that you can copy-and-paste into your own project and adapt for your specific problems.