A Gentle Introduction to Exploding Gradients in Neural Networks
Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training.
This makes your model unstable and unable to learn from your training data.
In this post, you will discover the problem of exploding gradients with deep artificial neural networks.
After completing this post, you will know:
- What exploding gradients are and the problems they cause during training.
- How to know whether you may have exploding gradients with your network model.
- How you can fix the exploding gradient problem with your network.
Let’s get started.
- Update Oct/2018: Removed mention of ReLU as a solution.
What Are Exploding Gradients?
An error gradient is the direction and magnitude calculated during the training of a neural network and used to update the network weights in the right direction and by the right amount.
In deep networks or recurrent neural networks, error gradients can accumulate during an update and result in very large gradients. These in turn result in large updates to the network weights, and in turn, an unstable network. At an extreme, the values of weights can become so large as to overflow and result in NaN values.
The explosion occurs through exponential growth: during backpropagation, gradients are repeatedly multiplied through the network layers, and when the values involved are larger than 1.0, the product grows exponentially with depth.
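This exponential growth can be illustrated with a toy calculation (a sketch, not a real network): repeatedly scaling a gradient by a per-layer factor greater than 1.0 blows it up, while a factor less than 1.0 shrinks it toward zero (the related vanishing gradient problem).

```python
def backprop_growth(factor, n_layers):
    """Toy model: the gradient magnitude after backpropagating
    through n_layers layers that each scale it by `factor`."""
    grad = 1.0
    for _ in range(n_layers):
        grad *= factor
    return grad

# A per-layer factor just above 1.0 explodes over 50 layers...
print(backprop_growth(1.5, 50))   # ~6.4e8
# ...while a factor just below 1.0 vanishes.
print(backprop_growth(0.5, 50))   # ~8.9e-16
```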
What Is the Problem with Exploding Gradients?
In deep multilayer Perceptron networks, exploding gradients can result in an unstable network that at best cannot learn from the training data and at worst results in NaN weight values that can no longer be updated.
… exploding gradients can make learning unstable.
In recurrent neural networks, exploding gradients can result in a network that is unstable and unable to learn from training data at all, and at best a network that cannot learn over long input sequences of data.
… the exploding gradients problem refers to the large increase in the norm of the gradient during training. Such events are due to the explosion of the long term components
How Do You Know if You Have Exploding Gradients?
There are some subtle signs that you may be suffering from exploding gradients during the training of your network, such as:
- The model is unable to get traction on your training data (e.g. poor loss).
- The model is unstable, resulting in large changes in loss from update to update.
- The model loss goes to NaN during training.
If you have these types of problems, you can dig deeper to see if you have a problem with exploding gradients.
There are some less subtle signs that you can use to confirm that you have exploding gradients.
- The model weights quickly become very large during training.
- The model weights go to NaN values during training.
- The error gradient values are consistently above 1.0 for each node and layer during training.
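The last of these checks can be automated by monitoring the norm of the gradient for each layer during training. A minimal framework-agnostic sketch (the helper name and the threshold of 10.0 are illustrative choices, not from any particular library):

```python
import numpy as np

def gradient_health(grads, norm_threshold=10.0):
    """Given a dict of layer name -> gradient array, report the L2 norm
    of each gradient and flag layers whose norm exceeds the threshold
    or is no longer finite (inf/NaN)."""
    report = {}
    for name, g in grads.items():
        norm = float(np.linalg.norm(g))
        report[name] = {
            "norm": norm,
            "exploding": norm > norm_threshold or not np.isfinite(norm),
        }
    return report

# Example: one healthy layer and one with very large gradient values.
grads = {
    "dense_1": np.full((10, 10), 0.01),   # norm = 0.1
    "dense_2": np.full((10, 10), 100.0),  # norm = 1000.0
}
for name, info in gradient_health(grads).items():
    print(name, info)
```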
How to Fix Exploding Gradients?
There are many approaches to addressing exploding gradients; this section lists some best practice approaches that you can use.
1. Re-Design the Network Model
In deep neural networks, exploding gradients may be addressed by redesigning the network to have fewer layers.
There may also be some benefit in using a smaller batch size while training the network.
In recurrent neural networks, updating across fewer prior time steps during training, called truncated Backpropagation through time, may reduce the exploding gradient problem.
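Truncated backpropagation through time can be pictured as splitting a long input sequence into short windows and backpropagating only within each window. A minimal sketch of the splitting step (the window length k=20 is an arbitrary example value):

```python
def truncate_sequence(sequence, k):
    """Split a long sequence into consecutive windows of at most k steps;
    gradients are then backpropagated only within each window, limiting
    how many times they can be multiplied together."""
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]

long_sequence = list(range(100))      # 100 time steps
windows = truncate_sequence(long_sequence, k=20)
print(len(windows), len(windows[0]))  # 5 windows of 20 steps each
```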
2. Use Long Short-Term Memory Networks
In recurrent neural networks, exploding gradients can occur given the inherent instability of training this type of network, e.g. via Backpropagation through time, which essentially transforms the recurrent network into a very deep multilayer Perceptron network.
Exploding gradients can be reduced by using Long Short-Term Memory (LSTM) memory units and perhaps related gated neuron structures.
Adopting LSTM memory units is a new best practice for recurrent neural networks for sequence prediction.
3. Use Gradient Clipping
Exploding gradients can still occur in very deep Multilayer Perceptron networks with a large batch size and LSTMs with very long input sequence lengths.
If exploding gradients are still occurring, you can check for and limit the size of gradients during the training of your network.
This is called gradient clipping.
Dealing with the exploding gradients has a simple but very effective solution: clipping gradients if their norm exceeds a given threshold.
Specifically, the values of the error gradient are checked against a threshold value and, if they exceed it, clipped or set to that threshold value.
To some extent, the exploding gradient problem can be mitigated by gradient clipping (thresholding the values of the gradients before performing a gradient descent step).
In the Keras deep learning library, you can use gradient clipping by setting the clipnorm or clipvalue arguments on your optimizer before training.
Good default values are clipnorm=1.0 and clipvalue=0.5.
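What these two arguments do can be sketched in plain numpy (this is an illustrative re-implementation of the behavior, not the Keras source): clipnorm rescales the whole gradient when its L2 norm exceeds the threshold, preserving its direction, while clipvalue clamps each element independently.

```python
import numpy as np

def clip_by_norm(grad, clipnorm):
    """Rescale the gradient if its L2 norm exceeds clipnorm
    (the behavior of the Keras optimizer argument clipnorm)."""
    norm = np.linalg.norm(grad)
    if norm > clipnorm:
        grad = grad * (clipnorm / norm)
    return grad

def clip_by_value(grad, clipvalue):
    """Clamp each gradient element to [-clipvalue, clipvalue]
    (the behavior of the Keras optimizer argument clipvalue)."""
    return np.clip(grad, -clipvalue, clipvalue)

grad = np.array([3.0, 4.0])        # L2 norm = 5.0
print(clip_by_norm(grad, 1.0))     # [0.6 0.8]: direction kept, norm now 1.0
print(clip_by_value(grad, 0.5))    # [0.5 0.5]: each element clamped
```

In Keras itself you would simply pass the argument to the optimizer, e.g. `SGD(clipnorm=1.0)` or `SGD(clipvalue=0.5)`.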
4. Use Weight Regularization
Another approach, if exploding gradients are still occurring, is to check the size of the network weights and apply a penalty to the network's loss function for large weight values.
This is called weight regularization and often an L1 (absolute weights) or an L2 (squared weights) penalty can be used.
Using an L1 or L2 penalty on the recurrent weights can help with exploding gradients
In the Keras deep learning library, you can use weight regularization by setting the kernel_regularizer argument on your layer and using an L1 or L2 regularizer.
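The penalty itself is simple to state in code. A minimal numpy sketch of how an L1 or L2 term is added to the loss (the function name and the example coefficient 0.01 are illustrative):

```python
import numpy as np

def regularized_loss(base_loss, weights, l1=0.0, l2=0.0):
    """Add an L1 (sum of absolute weights) and/or L2 (sum of squared
    weights) penalty to the loss, discouraging large weight values."""
    penalty = l1 * np.sum(np.abs(weights)) + l2 * np.sum(weights ** 2)
    return base_loss + penalty

w = np.array([0.5, -2.0, 3.0])
print(regularized_loss(1.0, w, l2=0.01))  # 1.0 + 0.01 * 13.25 = 1.1325
print(regularized_loss(1.0, w, l1=0.01))  # 1.0 + 0.01 * 5.5  = 1.055
```

In Keras this corresponds to, e.g., `Dense(..., kernel_regularizer=regularizers.l2(0.01))`.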
Summary
In this post, you discovered the problem of exploding gradients when training deep neural network models.
Specifically, you learned:
- What exploding gradients are and the problems they cause during training.
- How to know whether you may have exploding gradients with your network model.
- How you can fix the exploding gradient problem with your network.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.