Introduction to Random Number Generators for Machine Learning in Python
Randomness is a big part of machine learning.
Randomness is used as a tool or a feature in preparing data and in learning algorithms that map input data to output data in order to make predictions.
In order to understand the need for statistical methods in machine learning, you must understand the source of randomness in machine learning. The source of randomness in machine learning is a mathematical trick called a pseudorandom number generator.
In this tutorial, you will discover pseudorandom number generators, and learn when to control randomness and when to control for it in machine learning.
After completing this tutorial, you will know:
- The sources of randomness in applied machine learning with a focus on algorithms.
- What a pseudorandom number generator is and how to use one in Python.
- When to control the sequence of random numbers and when to control-for randomness.
Let’s get started.
Tutorial Overview
This tutorial is divided into 5 parts; they are:
- Randomness in Machine Learning
- Pseudorandom Number Generators
- When to Seed the Random Number Generator
- How to Control for Randomness
- Common Questions
Randomness in Machine Learning
There are many sources of randomness in applied machine learning.
Randomness is used as a tool to help the learning algorithms be more robust and ultimately result in better predictions and more accurate models.
Let’s look at a few sources of randomness.
Randomness in Data
There is a random element to the sample of data that we have collected from the domain that we will use to train and evaluate the model.
The data may have mistakes or errors.
More deeply, the data contains noise that can obscure the crystal-clear relationship between the inputs and the outputs.
Randomness in Evaluation
We do not have access to all the observations from the domain.
We work with only a small sample of the data. Therefore, we harness randomness when evaluating a model, such as using k-fold cross-validation to fit and evaluate the model on different subsets of the available dataset.
We do this to see how the model works on average rather than on a specific set of data.
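As a rough illustration of how randomness enters k-fold cross-validation, the sketch below shuffles row indices with a seeded generator and deals them into k folds. The function name k_fold_indices and its parameters are made up for this example, and it uses only the standard library; in practice you would more likely rely on a library implementation.

```python
import random

def k_fold_indices(n_samples, k, seed):
    """Return k (train, test) index pairs after a seeded shuffle (illustrative helper)."""
    rng = random.Random(seed)              # private generator, leaves the global one alone
    indices = list(range(n_samples))
    rng.shuffle(indices)                   # the random element of the procedure
    folds = [indices[i::k] for i in range(k)]   # deal indices into k roughly equal folds
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

splits = k_fold_indices(n_samples=10, k=5, seed=1)
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

Each row lands in exactly one test fold, and the model would be fit and scored once per fold so that the scores can be averaged.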
Randomness in Algorithms
Machine learning algorithms use randomness when learning from a sample of data.
This is a feature: the randomness allows the algorithm to achieve a better-performing mapping of the data than if randomness were not used. It helps the algorithm avoid overfitting the small training set and generalize to the broader problem.
Algorithms that use randomness are often called stochastic algorithms rather than random algorithms. This is because, although randomness is used, the resulting model is limited to a narrower range of outcomes; the randomness is constrained.
Some clear examples of randomness used in machine learning algorithms include:
- The shuffling of training data prior to each training epoch in stochastic gradient descent.
- The random subset of input features chosen for split points in a random forest algorithm.
- The random initial weights in an artificial neural network.
We can see that there are both sources of randomness that we must control-for, such as noise in the data, and sources of randomness that we have some control over, such as algorithm evaluation and the algorithms themselves.
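Two of the examples listed above, random initial weights and shuffling before each epoch, can be sketched with the standard library alone. This is a toy illustration, not how any particular library implements them; the weight range and the six-row dataset are arbitrary assumptions.

```python
import random

random.seed(1)  # fixed only so that the sketch is repeatable

# random initial weights, as for a small neural network layer (range is an assumption)
weights = [random.uniform(-0.5, 0.5) for _ in range(3)]
print("initial weights:", weights)

# reshuffle the training rows before each epoch, as in stochastic gradient descent
rows = list(range(6))  # stand-in for the indices of six training rows
orders = []
for epoch in range(3):
    order = rows[:]            # copy so the original order is preserved
    random.shuffle(order)      # new random order for this epoch
    orders.append(order)
    print("epoch", epoch, "order:", order)
```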
Next, let’s look at the source of randomness that we use in our algorithms and programs.
Pseudorandom Number Generators
The source of randomness that we inject into our programs and algorithms is a mathematical trick called a pseudorandom number generator.
A random number generator is a system that generates random numbers from a true source of randomness, often something physical, such as a Geiger counter, where the measurements are turned into random numbers. There are even books of random numbers generated from a physical source that you can purchase.
We do not need true randomness in machine learning. Instead we can use pseudorandomness. Pseudorandomness is a sample of numbers that look close to random, but were generated using a deterministic process.
Shuffling data and initializing coefficients with random values use pseudorandom number generators. These little programs are often a function that you can call that will return a random number. Called again, they will return a new random number. Wrapper functions are often also available and allow you to get your randomness as an integer, floating point, within a specific distribution, within a specific range, and so on.
The numbers are generated in a sequence. The sequence is deterministic and is seeded with an initial number. If you do not explicitly seed the pseudorandom number generator, a seed is chosen for you. In Python, for example, the generator is seeded from a randomness source provided by the operating system when one is available, and otherwise from the current system time.
The value of the seed does not matter. Choose anything you wish. What does matter is that the same seeding of the process will result in the same sequence of random numbers.
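To see why the same seed gives the same sequence, it can help to look at a very simple generator. The sketch below is a linear congruential generator, which is not the algorithm Python uses; the class name LCG is invented for this example and the constants are the well-known Numerical Recipes parameters.

```python
class LCG:
    """A tiny linear congruential generator, for illustration only."""

    def __init__(self, seed):
        self.state = seed % 2**32

    def next_float(self):
        # state_{n+1} = (a * state_n + c) mod m
        self.state = (1664525 * self.state + 1013904223) % 2**32
        return self.state / 2**32   # scale into [0, 1)

gen = LCG(seed=7)
first = [gen.next_float() for _ in range(3)]

gen = LCG(seed=7)      # same seed, so the sequence restarts
second = [gen.next_float() for _ in range(3)]

gen = LCG(seed=8)      # different seed, different sequence
third = [gen.next_float() for _ in range(3)]

print(first == second)  # True
print(first == third)   # False
```

Each number is computed from the previous state, so the seed fully determines everything that follows.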
Let’s make this concrete with some examples.
Pseudorandom Number Generator in Python
The Python standard library provides a module called random that offers a suite of functions for generating random numbers.
Python uses a popular and robust pseudorandom number generator called the Mersenne Twister.
The pseudorandom number generator can be seeded by calling the random.seed() function. Random floating point values between 0 and 1 can be generated by calling the random.random() function.
The example below seeds the pseudorandom number generator, generates some random numbers, then re-seeds to demonstrate that the same sequence of numbers is generated.
```python
# demonstrates the python pseudorandom number generator
from random import seed
from random import random
# seed the generator
seed(7)
for _ in range(5):
    print(random())
# seed the generator to get the same sequence
print('Reseeded')
seed(7)
for _ in range(5):
    print(random())
```
Running the example prints five random floating point values, then the same five floating point values after the pseudorandom number generator was reseeded.
```
0.32383276483316237
0.15084917392450192
0.6509344730398537
0.07243628666754276
0.5358820043066892
Reseeded
0.32383276483316237
0.15084917392450192
0.6509344730398537
0.07243628666754276
0.5358820043066892
```
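The wrapper functions mentioned earlier are part of the same random module. The short sketch below draws an integer in a range, a float in a range, a Gaussian value, and a random choice from a list; the specific ranges and the list are arbitrary choices for illustration.

```python
import random

random.seed(7)

integer = random.randint(1, 10)          # integer in [1, 10], both ends included
in_range = random.uniform(-1.0, 1.0)     # float between -1.0 and 1.0
gaussian = random.gauss(0.0, 1.0)        # Gaussian draw with mean 0 and std dev 1
picked = random.choice(['a', 'b', 'c'])  # one item chosen at random

print(integer, in_range, gaussian, picked)
```

All four draws come from the same underlying generator, so a single call to random.seed() controls them all.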
Pseudorandom Number Generator in NumPy
In machine learning, you are likely using libraries such as scikit-learn and Keras.
These libraries make use of NumPy under the covers, a library that makes working with vectors and matrices of numbers very efficient.
NumPy also has its own implementation of a pseudorandom number generator and convenience wrapper functions.
The legacy NumPy functions used in this tutorial, numpy.random.seed() and numpy.random.rand(), also use the Mersenne Twister pseudorandom number generator (the newer numpy.random.default_rng() interface uses a different generator, PCG64, by default). Importantly, seeding the Python pseudorandom number generator does not impact the NumPy pseudorandom number generator. It must be seeded and used separately.
The example below seeds the pseudorandom number generator, generates an array of five random floating point values, seeds the generator again, and demonstrates that the same sequence of random numbers is generated.
```python
# demonstrates the numpy pseudorandom number generator
from numpy.random import seed
from numpy.random import rand
# seed the generator
seed(7)
print(rand(5))
# seed the generator to get the same sequence
print('Reseeded')
seed(7)
print(rand(5))
```
Running the example prints the first batch of numbers and the identical second batch of numbers after the generator was reseeded.
```
[0.07630829 0.77991879 0.43840923 0.72346518 0.97798951]
Reseeded
[0.07630829 0.77991879 0.43840923 0.72346518 0.97798951]
```
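A quick way to check that the two generators are independent is to re-seed only the Python generator and see whether NumPy repeats itself. The sketch below does this, and also shows the newer default_rng() interface, where each generator object carries its own seed. It assumes NumPy 1.17 or later is installed.

```python
import random
import numpy as np

np.random.seed(7)
first = np.random.rand(3)

# Re-seeding only the standard library generator leaves NumPy's stream
# untouched, so NumPy simply continues its sequence.
random.seed(7)
second = np.random.rand(3)

# Re-seeding NumPy itself restarts its sequence.
np.random.seed(7)
third = np.random.rand(3)

print(np.array_equal(first, second))  # False
print(np.array_equal(first, third))   # True

# Newer interface: separate generator objects, each with its own seed.
rng_a = np.random.default_rng(7)
rng_b = np.random.default_rng(7)
same = np.array_equal(rng_a.random(3), rng_b.random(3))
print(same)                           # True
```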
Now that we know how controlled randomness is generated, let’s look at where we can use it effectively.
When to Seed the Random Number Generator
There are times during a predictive modeling project when you should consider seeding the random number generator.
Let’s look at two cases:
- Data Preparation. Data preparation may use randomness, such as a shuffle of the data or selection of values. Data preparation must be consistent so that the data is always prepared in the same way during fitting, evaluation, and when making predictions with the final model.
- Data Splits. The splits of the data such as for a train/test split or k-fold cross-validation must be made consistently. This is to ensure that each algorithm is trained and evaluated in the same way on the same subsamples of data.
You may wish to seed the pseudorandom number generator once before each task or once before performing the batch of tasks. It generally does not matter which.
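One way to keep a data split consistent is to pass the seed into the splitting step itself. The sketch below is a minimal standard-library version; the function name train_test_split_indices and its parameters are hypothetical, chosen for this example rather than taken from any library.

```python
import random

def train_test_split_indices(n_samples, test_fraction, seed):
    """Return (train, test) index lists from a seeded shuffle (illustrative helper)."""
    rng = random.Random(seed)          # local generator, independent of random.seed()
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# Two calls with the same seed give every algorithm the same rows.
train_a, test_a = train_test_split_indices(10, 0.3, seed=1)
train_b, test_b = train_test_split_indices(10, 0.3, seed=1)

print(test_a == test_b)    # True
print(train_a == train_b)  # True
```

Using a local random.Random instance means the split stays the same even if other code draws random numbers in between.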
Sometimes you may want an algorithm to behave consistently, perhaps because it is trained on exactly the same data each time. This may happen if the algorithm is used in a production environment. It may also happen if you are demonstrating an algorithm in a tutorial environment.
In that case, it may make sense to initialize the seed prior to fitting the algorithm.
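The sketch below shows the idea on a toy stochastic learner: a one-weight linear model fit with stochastic gradient descent, where both the initial weight and the row order depend on the seed. The function fit_line_sgd, the learning rate, and the four data points are all invented for illustration.

```python
import random

def fit_line_sgd(data, seed, epochs=50, lr=0.01):
    """Fit y = w * x with stochastic gradient descent (toy example)."""
    rng = random.Random(seed)
    w = rng.uniform(-1, 1)              # random initial weight
    rows = list(data)
    for _ in range(epochs):
        rng.shuffle(rows)               # random row order each epoch
        for x, y in rows:
            w -= lr * 2 * (w * x - y) * x   # gradient step on squared error
    return w

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

w_a = fit_line_sgd(data, seed=1)
w_b = fit_line_sgd(data, seed=1)   # same seed: identical result
w_c = fit_line_sgd(data, seed=2)   # different seed: usually close, not guaranteed identical

print(w_a == w_b)  # True
print(w_a, w_c)
```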
How to Control for Randomness
A stochastic machine learning algorithm will learn slightly differently each time it is run on the same data.
This will result in a model with slightly different performance each time it is trained.
As mentioned, we can fit the model using the same sequence of random numbers each time. When evaluating a model, this is a bad practice as it hides the inherent uncertainty within the model.
A better approach is to evaluate the algorithm in such a way that the reported performance includes the measured uncertainty in the performance of the algorithm.
We can do that by repeating the evaluation of the algorithm multiple times with different sequences of random numbers. The pseudorandom number generator could be seeded once at the beginning of the evaluation or it could be seeded with a different seed at the beginning of each evaluation.
There are two aspects of uncertainty to consider here:
- Data Uncertainty: Evaluating an algorithm on multiple splits of the data will give insight into how the algorithm's performance varies with changes to the train and test data.
- Algorithm Uncertainty: Evaluating an algorithm multiple times on the same splits of data will give insight into how the performance varies due to the randomness in the algorithm alone.
In general, I would recommend reporting on both of these sources of uncertainty combined. This is where the algorithm is fit on different splits of the data each evaluation run and has a new sequence of randomness. The evaluation procedure can seed the random number generator once at the beginning and the process can be repeated perhaps 30 or more times to give a population of performance scores that can be summarized.
This will give a fair description of model performance taking into account variance both in the training data and in the learning algorithm itself.
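The procedure described above can be sketched as follows. The function evaluate_once is a hypothetical stand-in: a real version would draw a new split, fit the model, and return its score, whereas here the score is simply simulated as 0.80 plus Gaussian noise so that the example runs on its own.

```python
import random
from statistics import mean, stdev

def evaluate_once(rng):
    """Hypothetical stand-in for one fit-and-evaluate run on a fresh split."""
    return 0.80 + rng.gauss(0.0, 0.02)   # simulated score, not a real model

rng = random.Random(1)                   # seed once, at the start of the evaluation
scores = [evaluate_once(rng) for _ in range(30)]

print("mean score: %.3f" % mean(scores))
print("std deviation: %.3f" % stdev(scores))
```

Because the generator is seeded only once, each of the 30 runs sees a different part of the random sequence, yet the whole experiment can still be reproduced.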
Common Questions
Can I predict random numbers?
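For practical purposes, you should not expect to learn the sequence of random numbers with a machine learning model, even a deep neural network. Note, however, that the Mersenne Twister is not cryptographically secure, so it should not be relied on where unpredictability is a security requirement.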
Will real random numbers lead to better results?
As far as I have read, using real randomness does not help in general, unless you are working with simulations of physical processes.
What about the final model?
The final model is the chosen algorithm and configuration trained on all available training data that you can use to make predictions. The performance of this model will fall within the variance of the evaluated model.
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
- Confirm that seeding the Python pseudorandom number generator does not impact the NumPy pseudorandom number generator.
- Develop examples of generating integers between a range and Gaussian random numbers.
- Locate the equation for and implement a very simple pseudorandom number generator.
If you explore any of these extensions, I’d love to know.
Summary
In this tutorial, you discovered the role of randomness in applied machine learning and how to control and harness it.
Specifically, you learned:
- Machine learning has sources of randomness such as in the sample of data and in the algorithms themselves.
- Randomness is injected into programs and algorithms using pseudorandom number generators.
- There are times when the randomness requires careful control, and times when the randomness needs to be controlled-for.