Data Visualization with the Caret R package
The caret package in R is designed to streamline the process of applied machine learning.
A key part of solving data problems is understanding the data that you have available. You can do this very quickly by summarizing the attributes with data visualizations.
There are a lot of packages and functions for summarizing data in R and it can feel overwhelming. For the purposes of applied machine learning, the caret package provides a few key tools that can give you a quick summary of your data.
In this post you will discover the data visualization tools available in the caret R package.
Caret Package
The caret package is primarily used for streamlining model training, estimating model performance and tuning. It also has a number of convenient data visualization tools that can quickly give you an idea of the data you are working with.
In this post we are going to look at the following 3 data visualizations:
- Scatterplot Matrix: For comparing the distribution of real-valued attributes in pair-wise plots.
- Density Plots: For comparing the probability density function of attributes.
- Box and Whisker Plots: For summarizing and comparing the spread of attributes.
Each example is standalone so that you can copy and paste it into your own project and adapt it to your needs. All examples make use of the iris flowers dataset, which comes with R. This classification dataset provides 150 observations for three species of iris flower and their petal and sepal measurements in centimeters.
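Before plotting, it can help to confirm the shape of the dataset described above. The following is a minimal sketch using only base R; the variable name `n_per_class` is chosen here for illustration.

```r
# load the data (ships with base R in the datasets package)
data(iris)
# number of rows (observations) and columns (4 measurements + class)
dim(iris)
# column names and types; Species is a factor with three levels
str(iris)
# number of observations for each species
n_per_class <- table(iris$Species)
print(n_per_class)
```

The four measurement columns are `iris[,1:4]` and the class column is `iris[,5]`, which is how the plotting examples in this post refer to them.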
Scatterplot Matrix
A scatterplot matrix shows a grid of scatterplots where each attribute is plotted against all other attributes. It can be read by column or row, and each plot appears twice, allowing you to consider the spatial relationships from two perspectives.
An improvement over plain scatterplots is to include class information. This is commonly done by coloring the dots in each scatterplot by their class value.
The example below shows a scatterplot matrix for the iris dataset, with pair-wise scatter plots for all four attributes, and dots in the scatterplots colored by the class attribute.
```r
# load the library
library(caret)
# load the data
data(iris)
# pair-wise plots of all 4 attributes, dots colored by class
featurePlot(x=iris[,1:4], y=iris[,5], plot="pairs", auto.key=list(columns=3))
```
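If caret is not installed, a similar class-colored scatterplot matrix can be sketched with the base R `pairs()` function. This is an alternative to the caret call, not what caret does internally, and the color choices and the names `class_colors` and `point_colors` are assumptions made for illustration.

```r
# load the data
data(iris)
# hypothetical color choice: one color per class value
class_colors <- c(setosa = "red", versicolor = "green3", virginica = "blue")
# look up the color for each observation from its class value
point_colors <- class_colors[as.character(iris$Species)]
# pair-wise plots of all 4 attributes, dots colored by class
pairs(iris[, 1:4], col = point_colors, pch = 19)
```

Unlike the caret version, this sketch does not draw a legend, so the mapping from color to class has to be read from the code.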
Density Plots
Density estimation plots (density plots for short) summarize the distribution of the data. Like a histogram, they summarize the relationship between the attribute values and the number of observations, but rather than a frequency, the relationship is shown as a continuous probability density function (PDF). The height of the curve indicates how likely observations are to fall near a given value, and the area under the curve over an interval is the probability of an observation falling in that interval.
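To make the idea of a density concrete, the sketch below uses the base R `density()` function on one attribute of one class. It approximates the area under the estimated curve, which should be close to 1, and shows that the curve height itself can be larger than 1. The attribute and class chosen, and the names `setosa_width`, `area` and `peak`, are assumptions for illustration.

```r
# load the data
data(iris)
# petal width for one class only
setosa_width <- iris$Petal.Width[iris$Species == "setosa"]
# kernel density estimate on an evenly spaced grid (default settings)
d <- density(setosa_width)
# approximate the area under the curve with a simple Riemann sum
step <- diff(d$x[1:2])
area <- sum(d$y) * step
# the highest point of the estimated density curve
peak <- max(d$y)
print(area)
print(peak)
```

Because the petal widths of this class are packed into a narrow range, the curve has to be tall for the total area to remain about 1.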
The density plots can be further improved by separating the observations for each attribute by their class value. This can be useful for understanding the single-attribute relationship with the class values and for highlighting useful structure, such as linear separability of attribute values into classes.
The example below shows density plots for the iris dataset, showing PDFs for how each attribute relates to each class value.
```r
# load the library
library(caret)
# load the data
data(iris)
# density plots for each attribute by class value
featurePlot(x=iris[,1:4], y=iris[,5], plot="density",
            scales=list(x=list(relation="free"), y=list(relation="free")),
            auto.key=list(columns=3))
```
Box and Whisker Plots
Box and whisker plots (or box plots for short) summarize the distribution of a given attribute by showing a box that spans the 25th to 75th percentiles and a marker inside the box for the 50th percentile (median). In the lattice-based plots that caret draws, this marker is a dot by default. The height of the box is called the interquartile range (IQR). The whiskers extend to the most extreme observations that lie within 1.5 times the IQR of the box edges, which indicates the expected range of the data. Any observation beyond the whiskers is treated as a potential outlier and marked with a dot.
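The numbers behind a box plot can be computed directly. The sketch below uses the base R functions `fivenum()` and `boxplot.stats()` on a single attribute to find the box edges, the 1.5 times IQR limits, the whisker ends and the outliers. The choice of attribute and the names `lower_fence`, `upper_fence`, `whiskers` and `outliers` are assumptions for illustration.

```r
# load the data
data(iris)
x <- iris$Sepal.Width
# minimum, lower box edge, median, upper box edge, maximum
five <- fivenum(x)
# height of the box (the interquartile range)
box_height <- five[4] - five[2]
# limits beyond which an observation is flagged as an outlier
lower_fence <- five[2] - 1.5 * box_height
upper_fence <- five[4] + 1.5 * box_height
# whisker ends and outliers as computed by base R
stats <- boxplot.stats(x, coef = 1.5)
whiskers <- stats$stats[c(1, 5)]
outliers <- stats$out
print(whiskers)
print(outliers)
```

Note that the whiskers end at actual observations inside the limits, so they are usually shorter than the full 1.5 times IQR distance.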
Again, each attribute can be summarized separately for each observed class value, giving you an idea of how attribute values and class values relate, much like the density plots.
The example below shows box and whisker plots for the iris data set, showing a separate box for each class value for a given attribute.
```r
# load the library
library(caret)
# load the data
data(iris)
# box and whisker plots for each attribute by class value
featurePlot(x=iris[,1:4], y=iris[,5], plot="box",
            scales=list(x=list(relation="free"), y=list(relation="free")),
            auto.key=list(columns=3))
```
Summary
In this post you discovered three quick data visualizations using the caret R package that can help you to understand your classification dataset.
Each example is standalone, ready for you to copy-and-paste into your own project and adapt for your problem.