DL-1: Tips for Training Deep Neural Networks
Different approaches for different problems; e.g., dropout is for getting good results on testing data.
Choosing proper loss
- Square Error
- Cross Entropy
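A minimal Keras-style sketch of picking between these two losses at compile time; the network below (784 → 500 → 10) is an assumed toy classifier, used only to show the `loss` argument.

```python
from keras.models import Sequential
from keras.layers import Dense

# Assumed toy classifier for a 10-class problem (e.g. 784-dimensional MNIST-like input).
model = Sequential()
model.add(Dense(500, activation='sigmoid', input_dim=784))
model.add(Dense(10, activation='softmax'))

# Square error -> loss='mse'; cross entropy (with softmax output) -> 'categorical_crossentropy'.
# For classification, cross entropy usually gives larger gradients far from the target,
# so it is easier to train.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
```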
Mini-batch
We do not really minimize total loss!
batch_size: the number of training samples processed in each mini-batch;
nb_epoch: the number of times the whole training set is passed through.
The total number of training samples stays the same.
Mini-batch is faster, though this is not always true with parallel computing.
Mini-batch has better performance!
Shuffle the training examples for each epoch. This is the default of Keras.
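A hedged sketch of how batch_size, nb_epoch and shuffling map onto `model.fit`; `x_train`/`y_train` and the `model` from the loss example above are assumed, and `nb_epoch` is the older Keras 1.x name for what newer versions call `epochs`.

```python
# batch_size: number of training samples per parameter update (one mini-batch).
# epochs (nb_epoch in old Keras): number of passes over the whole training set.
# shuffle=True (the Keras default) reshuffles the training examples every epoch.
model.fit(x_train, y_train,
          batch_size=100,
          epochs=20,
          shuffle=True)
```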
New activation function
Q: Vanishing Gradient Problem
- Layers near the input: smaller gradients, learn very slowly, remain almost random.
- Layers near the output: larger gradients, learn very fast, already converge (on top of the almost-random lower layers).
2006: RBM → 2015: ReLU
ReLU: Rectified Linear Unit
1. Fast to compute
2. Biological reason
3. Infinite sigmoid with different biases
4. Handles the vanishing gradient problem
ReLU yields a thinner linear network, which does not have smaller gradients in the earlier layers.
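A minimal NumPy sketch of ReLU and its gradient, to make concrete that active units pass gradients through with slope 1 instead of shrinking them (the function names are illustrative).

```python
import numpy as np

def relu(z):
    # ReLU: output is z when z > 0, otherwise 0.
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative is exactly 1 for active units (z > 0) and 0 otherwise,
    # so gradients are not scaled down layer after layer as with sigmoid.
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
```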
Adaptive Learning Rate
Set the learning rate η carefully.
- If learning rate is too large, total loss may not decrease after each update.
- If learning rate is too small, training would be too slow.
Solution:
- Popular & Simple Idea: Reduce the learning rate by some factor every few epochs.
- At the beginning, use larger learning rate
- After several epochs, reduce the learning rate. E.g. 1/t decay:
η_t = η / √(t+1) (see the sketch after this list)
- Learning rate cannot be one-size-fits-all.
- Giving different parameters different learning rates
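A small sketch of the 1/t decay schedule referenced above; the base learning rate of 0.1 is an illustrative assumption.

```python
import math

eta_0 = 0.1  # initial learning rate (illustrative value)

def decayed_lr(t):
    # 1/t decay: eta_t = eta_0 / sqrt(t + 1)
    return eta_0 / math.sqrt(t + 1)

for t in range(5):
    print(t, round(decayed_lr(t), 4))  # 0.1, 0.0707, 0.0577, 0.05, 0.0447
```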
Adagrad:
Divide the learning rate by the root of the summation of the squares of the previous derivatives.
Observation:
1. Learning rate is smaller and smaller for all parameters.
2. Smaller derivatives, larger learning rate, and vice versa.
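A minimal sketch of the standard Adagrad update for one parameter, illustrating both observations above; the toy objective and the step count are assumptions, not part of the notes.

```python
import numpy as np

def adagrad(grad, w0, eta=1.0, steps=200, eps=1e-8):
    """Adagrad: scale the learning rate of a parameter by the root of the
    summation of the squares of its previous derivatives."""
    w = w0
    accum = 0.0
    for _ in range(steps):
        g = grad(w)
        accum += g ** 2                        # accumulate squared past derivatives
        w -= eta / (np.sqrt(accum) + eps) * g  # effective learning rate keeps shrinking
    return w

# Toy objective f(w) = (w - 3)^2 with gradient 2(w - 3); w moves toward the minimum at 3.
print(adagrad(lambda w: 2.0 * (w - 3.0), w0=0.0))
```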
- Adagrad [John Duchi, JMLR’11]
- RMSprop
https://www.youtube.com/watch?v=O3sxAc4hxZU
- Adadelta [Matthew D. Zeiler, arXiv’12]
- “No more pesky learning rates” [Tom Schaul, arXiv’12]
- AdaSecant [Caglar Gulcehre, arXiv’14]
- Adam [Diederik P. Kingma, ICLR’15]
- Nadam
http://cs229.stanford.edu/proj2015/054_report.pdf
Momentum
Overfitting
Learning target is defined by the training data.
Training data and testing data can be different.
The parameters achieving the learning target do not necessarily have good results on the testing data.
Panacea for Overfitting
- Have more training data
- Create more training data
Early Stopping
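As a concrete illustration, a hedged Keras sketch of early stopping against a validation set; the `patience` value and the 10% validation split are assumptions (it reuses the `model`, `x_train`, `y_train` names from the earlier sketches).

```python
from keras.callbacks import EarlyStopping

# Stop training once the validation loss has not improved for 3 consecutive epochs.
early_stop = EarlyStopping(monitor='val_loss', patience=3)

model.fit(x_train, y_train,
          epochs=100,
          batch_size=100,
          validation_split=0.1,   # hold out 10% of the training data for validation
          callbacks=[early_stop])
```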
Regularization
Weight decay is one kind of regularization.
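A hedged sketch of weight decay expressed as an L2 regularizer on a layer's weights (Keras 2 API); the 0.01 coefficient and the layer size are illustrative assumptions.

```python
from keras.layers import Dense
from keras.regularizers import l2

# L2 regularization adds 0.01 * sum(w^2) to the loss, which keeps the weights small
# (i.e., decays them toward zero) during training.
regularized_layer = Dense(500, activation='relu', kernel_regularizer=l2(0.01))
```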
Dropout
Training
- Each time before updating the parameters, each neuron has a p% chance to drop out; the structure of the network is thereby changed.
- Use the new, thinner network for training.
- For each mini-batch, we resample the dropped-out neurons.
Testing
- **No dropout**: if the dropout rate at training is p%, multiply all the weights by (1 − p)%.
- Assume that the dropout rate is 50%.
If a weight w = 1 after training, set w = 0.5 for testing.
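A minimal NumPy sketch of both phases described above: random masking at training time (resampled per mini-batch) and rescaling the weights by (1 − p) at test time; the shapes and names are illustrative.

```python
import numpy as np

p = 0.5                        # dropout rate
w = np.random.randn(256, 10)   # weights of one (illustrative) layer
a = np.random.randn(32, 256)   # activations for one mini-batch

# Training: each neuron is dropped with probability p; resample the mask every mini-batch.
mask = (np.random.rand(*a.shape) > p).astype(float)
train_out = (a * mask) @ w

# Testing: no dropout; instead multiply the trained weights by (1 - p).
test_out = a @ (w * (1 - p))
```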
Dropout - Intuitive Reason
- When working in a team, if everyone expects their partner to do the work, nothing gets done in the end.
- However, if you know your partner may drop out, you will do the work better yourself.
- When testing, no one actually drops out, so good results are eventually obtained.
Dropout is a kind of ensemble
Network Structure
CNN is a very good example!