Data Visualization with the Caret R package

The caret package in R is designed to streamline the process of applied machine learning.

A key part of solving data problems is understanding the data that you have available. You can do this very quickly by summarizing the attributes with data visualizations.

There are a lot of packages and functions for summarizing data in R and it can feel overwhelming. For the purposes of applied machine learning, the caret package provides a few key tools that can give you a quick summary of your data.

In this post you will discover the data visualization tools available in the caret R package.

Caret Package

The caret package is primarily used for streamlining model training, estimating model performance and tuning. It also has a number of convenient data visualization tools that can quickly give you an idea of the data you are working with.

In this post we are going to look at the following 3 data visualizations:

  • Scatterplot Matrix: For comparing the distribution of real-valued attributes in pair-wise plots.
  • Density Plots: For comparing the probability density function of attributes.
  • Box and Whisker Plots: For summarizing and comparing the spread of attributes.

Each example is standalone so that you can copy and paste it into your own project and adapt it to your needs. All examples make use of the iris flowers dataset that comes with R. This classification dataset provides 150 observations for three species of iris flower and their petal and sepal measurements in centimeters.
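
Before plotting, it can help to confirm the shape of the dataset described above. The snippet below is a minimal sketch using only base R functions (no caret required) to check the number of observations, the attribute types and the class balance.

```r
# load the data (iris ships with R in the datasets package)
data(iris)
# number of rows (observations) and columns (attributes)
dim(iris)
# attribute names and types
str(iris)
# number of observations per class
class_counts <- table(iris$Species)
print(class_counts)
# min, quartiles, median, mean and max for each measurement
summary(iris[, 1:4])
```

The first four columns hold the measurements and the fifth column holds the class, which is why the examples in this post pass iris[,1:4] as x and iris[,5] as y.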

Scatterplot Matrix

A scatterplot matrix shows a grid of scatterplots where each attribute is plotted against all other attributes. It can be read by column or row, and each plot appears twice, allowing you to consider the spatial relationships from two perspectives.

An improvement over plain scatterplots is to include class information. This is commonly done by coloring the dots in each scatterplot by their class value.

The example below shows a scatterplot matrix for the iris dataset, with pair-wise scatter plots for all four attributes, and dots in the scatterplots colored by the class attribute.

    # load the library
    library(caret)
    # load the data
    data(iris)
    # pair-wise plots of all 4 attributes, dots colored by class
    featurePlot(x=iris[,1:4], y=iris[,5], plot="pairs",
                auto.key=list(columns=3))

Scatterplot Matrix of the Iris dataset using the Caret R package
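
For comparison, a similar class-colored scatterplot matrix can be sketched with the pairs() function from base R. This is not part of caret, and the variable name class_colors and the three color names are arbitrary choices made for illustration.

```r
# load the data
data(iris)
# one color per class (an arbitrary palette chosen for this sketch)
class_colors <- c("steelblue", "darkorange", "forestgreen")
# map each observation's class to its color via the factor's integer codes
dot_colors <- class_colors[as.integer(iris$Species)]
# pair-wise plots of all 4 attributes, dots colored by class
pairs(iris[, 1:4], col = dot_colors, pch = 19)
```

Unlike the caret version, this sketch does not draw a legend, so the mapping from color to species has to be read from the order of levels(iris$Species).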


Density Plots

Density estimation plots (density plots for short) summarize the distribution of the data. Like a histogram, the relationship between the attribute values and number of observations is summarized, but rather than a frequency, the relationship is summarized as a continuous probability density function (PDF). The height of the curve at a value is a density, not a probability: the probability that an observation falls within a range of values is the area under the curve over that range.
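
The difference between a density and a probability can be checked numerically. The snippet below is a rough sketch using the density() function from base R with its default settings. The choice of attribute (Petal.Length) and of the interval from 4 to 5 cm is arbitrary, and the areas are approximated with a simple sum over the evaluation grid.

```r
# load the data
data(iris)
# kernel density estimate for one attribute (default Gaussian kernel)
d <- density(iris$Petal.Length)
# the estimate is evaluated on an equally spaced grid of x values
step <- diff(d$x)[1]
# the total area under the curve should be close to 1
area <- sum(d$y) * step
print(area)
# the probability of an interval is the area under the curve over it
in_range <- d$x >= 4 & d$x <= 5
p_4_to_5 <- sum(d$y[in_range]) * step
print(p_4_to_5)
```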

The density plots can be further improved by separating the values of each attribute by the class of the observation. This can be useful to understand the single-attribute relationship with the class values and to highlight useful structure such as linear separability of attribute values into classes.

The example below shows density plots for the iris dataset, showing PDFs for how each attribute relates to each class value.

    # load the library
    library(caret)
    # load the data
    data(iris)
    # density plots for each attribute by class value
    featurePlot(x=iris[,1:4], y=iris[,5], plot="density",
                scales=list(x=list(relation="free"),
                            y=list(relation="free")),
                auto.key=list(columns=3))

Density Plot of the iris dataset using the Caret R package


Box and Whisker Plots

Box and whisker plots (or box plots for short) summarize the distribution of a given attribute by showing a box that spans the 25th to the 75th percentile and a line in the box for the 50th percentile (median); some variants also add a dot for the mean. The height of the box is called the interquartile range (IQR). Each whisker extends to the most extreme observation that lies within 1.5 times the IQR of the box edge, which indicates the expected range of the data. Any observation beyond the whiskers is treated as a possible outlier and marked with a dot.
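
The quantities behind a box plot can be computed by hand. The snippet below is a minimal sketch in base R for a single attribute (Sepal.Width, an arbitrary choice). It assumes the default quantile() definition of the quartiles; plotting functions may compute the box edges slightly differently, so values can differ a little from what a given plot draws.

```r
# load the data
data(iris)
x <- iris$Sepal.Width
# box edges (25th and 75th percentiles) and the median (50th percentile)
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
# interquartile range: the height of the box
iqr <- IQR(x)
# fences sit 1.5 * IQR beyond the box edges
lower_fence <- unname(q[1] - 1.5 * iqr)
upper_fence <- unname(q[3] + 1.5 * iqr)
# whiskers end at the most extreme observations inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])
# observations outside the fences are drawn as individual dots
outliers <- x[x < lower_fence | x > upper_fence]
print(q)
print(whiskers)
print(outliers)
```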

Again, each attribute can be summarized in terms of their observed class value, giving you an idea of how attribute values and class values relate, much like the density plots.
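
The same per-class view can be produced numerically. As a small sketch in base R, aggregate() with a formula computes the median of every measurement within each species; the variable name medians is only illustrative.

```r
# load the data
data(iris)
# median of each measurement, computed separately for each class
medians <- aggregate(. ~ Species, data = iris, FUN = median)
print(medians)
```

Each row of the result corresponds to the median marker of one box per attribute in a per-class box plot.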

The example below shows box and whisker plots for the iris data set, showing a separate box for each class value for a given attribute.

    # load the library
    library(caret)
    # load the data
    data(iris)
    # box and whisker plots for each attribute by class value
    featurePlot(x=iris[,1:4], y=iris[,5], plot="box",
                scales=list(x=list(relation="free"),
                            y=list(relation="free")),
                auto.key=list(columns=3))

Box plots of the iris dataset using the Caret R package


Summary

In this post you discovered three quick data visualizations using the caret R package that can help you to understand your classification dataset.

Each example is standalone, ready for you to copy-and-paste into your own project and adapt for your problem.

