Rescaling Data for Machine Learning in Python with Scikit

阿新 • Published: 2019-01-12

Your data must be prepared before you can build models. The data preparation process can involve three steps: data selection, data preprocessing and data transformation.

In this post you will discover two simple data transformation methods you can apply to your data in Python using scikit-learn.

[Image: Data Rescaling. Photo by Quinn Dombrowski, some rights reserved.]

Data Rescaling

Your preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms and sales volume.

Many machine learning methods expect or are more effective if the data attributes have the same scale. Two popular data scaling methods are normalization and standardization.

Data Normalization

Normalization refers to rescaling real-valued numeric attributes into the range 0 to 1.

It is useful to scale the input attributes for a model that relies on the magnitude of values, such as distance measures used in k-nearest neighbors and in the preparation of coefficients in regression.
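
To make the point about distance measures concrete, here is a minimal sketch. The three records and their values are made up for illustration, and the euclidean helper is a hypothetical name, not part of scikit-learn. It compares Euclidean distances before and after each attribute is rescaled to the range 0 to 1.

```python
import numpy as np

# Hypothetical records: [income in dollars, weight in kilograms]
a = np.array([50000.0, 60.0])
b = np.array([50500.0, 95.0])
c = np.array([52000.0, 61.0])


def euclidean(u, v):
    """Euclidean distance between two 1-D arrays (illustrative helper)."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


# On the raw data the dollar attribute dominates the distance,
# so the large weight difference between a and b barely registers.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Rescale each attribute (column) to 0..1 using the min and max of these records.
data = np.vstack([a, b, c])
mins = data.min(axis=0)
maxs = data.max(axis=0)
scaled = (data - mins) / (maxs - mins)

# After rescaling, both attributes contribute on a comparable footing.
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print("raw:   ", raw_ab, raw_ac)
print("scaled:", scaled_ab, scaled_ac)
```

On the raw values, b looks much closer to a than c does, purely because of the dollar column. After rescaling, the 35 kg gap between a and b counts as much as the 2000 dollar gap between a and c.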

The example below demonstrates data normalization of the Iris flowers dataset.

# Normalize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
# load the iris dataset
iris = load_iris()
print(iris.data.shape)
# separate the data from the target attributes
X = iris.data
y = iris.target
# normalize the data (by default each sample, i.e. each row, is rescaled to unit L2 norm)
normalized_X = preprocessing.normalize(X)

For more information see the normalize function in the API documentation.
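
One caveat: preprocessing.normalize works per sample (row), scaling each row to unit norm. If the goal is to rescale each attribute (column) into the range 0 to 1, as defined at the top of this section, one option is scikit-learn's MinMaxScaler. The sketch below is an alternative to the example above, not a replacement for it, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data

# Row-wise: each sample is rescaled so that its L2 norm is 1.
row_normalized_X = preprocessing.normalize(X)
row_norms = np.linalg.norm(row_normalized_X, axis=1)

# Column-wise: each attribute is rescaled to the range 0..1
# (the default feature_range of MinMaxScaler is (0, 1)).
scaler = MinMaxScaler()
minmax_X = scaler.fit_transform(X)

print(row_norms[:5])          # norms of the first five rows
print(minmax_X.min(axis=0))   # per-attribute minimum after scaling
print(minmax_X.max(axis=0))   # per-attribute maximum after scaling
```

Because the scaler object stores the minimum and maximum it saw in fit, the same transformation can later be applied to new data with scaler.transform.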

Data Standardization

Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance).

It is useful to standardize attributes for a model that relies on the distribution of attributes such as Gaussian processes.

The example below demonstrates data standardization of the Iris flowers dataset.

# Standardize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
# load the Iris dataset
iris = load_iris()
print(iris.data.shape)
# separate the data and target attributes
X = iris.data
y = iris.target
# standardize the data attributes
standardized_X = preprocessing.scale(X)

For more information see the scale function in the API documentation.
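
As a quick check of the definition above, the following sketch computes the standardization by hand, subtracting each attribute's mean and dividing by its standard deviation, and compares the result with preprocessing.scale. The name manual_X is illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn import preprocessing

X = load_iris().data

# Library version: centers each attribute and scales it to unit variance.
standardized_X = preprocessing.scale(X)

# Hand-computed version: z = (x - mean) / std, per attribute (column).
# NumPy's std uses the population formula (ddof=0) by default.
manual_X = (X - X.mean(axis=0)) / X.std(axis=0)

print(standardized_X.mean(axis=0))  # close to 0 for every attribute
print(standardized_X.std(axis=0))   # close to 1 for every attribute
print(np.allclose(standardized_X, manual_X))
```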

Tip: Which Method To Use

It is hard to know whether rescaling your data will improve the performance of your algorithms before you apply them. It often can, but not always.

A good tip is to create rescaled copies of your dataset and race them against each other using your test harness and a handful of algorithms you want to spot check. This can quickly highlight the benefits (or lack thereof) of rescaling your data with given models, and which rescaling method may be worthy of further investigation.
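
A minimal sketch of such a race is shown below. It assumes k-nearest neighbors as the algorithm to spot check and 5-fold cross-validation as the test harness; both are choices made for illustration. Instead of creating rescaled copies of the dataset up front, it wraps each scaler in a scikit-learn pipeline, so the scaler is fit only on the training portion of each fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)

# One candidate per rescaling option, all using the same algorithm.
candidates = {
    "raw": make_pipeline(KNeighborsClassifier()),
    "min-max": make_pipeline(MinMaxScaler(), KNeighborsClassifier()),
    "standardized": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Evaluate every candidate with the same 5-fold cross-validation.
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.3f}")
```

The same loop can be repeated with other algorithms in place of KNeighborsClassifier to see which combinations deserve further investigation.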

Summary

Data rescaling is an important part of data preparation before applying machine learning algorithms.

In this post you discovered where data rescaling fits into the process of applied machine learning, and two methods, normalization and standardization, that you can use to rescale your data in Python using the scikit-learn library.
