A review of gradient descent optimization methods
Suppose we want to optimize a parameterized objective function \(J(\theta)\), where \(\theta \in \mathbb{R}^d\); for example, \(\theta\) could be the parameters of a neural network.
More specifically, we want to minimize \(J(\theta; \mathcal{D})\) on a dataset \(\mathcal{D}\), where each point in \(\mathcal{D}\) is a pair \((x_i, y_i)\).
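For concreteness, here is a minimal sketch of such a gradient computation, assuming a linear model with squared loss; the function name `compute_gradient` (used in the pseudocode below) and its exact signature are illustrative, not from any particular library.
```python
import numpy as np

def compute_gradient(J, theta, X, y):
    # Gradient of a mean-squared-error objective J(theta; X, y) for a linear model,
    # i.e. J(theta) = 0.5 * mean((X @ theta - y) ** 2).
    # The first argument is kept only to mirror the pseudocode below; the gradient
    # is written out in closed form here instead of differentiating J numerically.
    X, y = np.atleast_2d(X), np.atleast_1d(y)   # works for a single example or a batch
    residuals = X @ theta - y
    return X.T @ residuals / len(y)
```
Depending on what is passed in the loops below, the data arguments might be the whole dataset, a single example, or a mini-batch.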
There are different ways to apply gradient descent.
Let \(\eta\) be the learning rate.
- Vanilla batch update
\(\theta \gets \theta - \eta \nabla J(\theta; \mathcal{D})\)
Note that \(\nabla J(\theta; \mathcal{D})\) is the gradient computed over the whole dataset \(\mathcal{D}\).
```python
for i in range(n_epochs):
    # one update per epoch, using the gradient over the entire dataset D
    gradient = compute_gradient(J, theta, D)
    theta = theta - eta * gradient
    eta = eta * 0.95  # decay the learning rate
```
When \(\mathcal{D}\) is large, this approach becomes infeasible: the gradient over the whole dataset must be recomputed for every single update.
- Stochastic Gradient Descent
Stochastic Gradient Descent, on the other hand, updates the parameters example by example.
\(\theta \gets \theta - \eta \nabla J(\theta; x_i, y_i)\)
```python
for n in range(n_epochs):
    for x_i, y_i in D:
        gradient = compute_gradient(J, theta, x_i, y_i)
        theta = theta - eta * gradient
    eta = eta * 0.95  # decay the learning rate once per epoch
```
- Mini-batch Stochastic Gradient Descent
Updating \(\theta\) example by example can lead to high-variance updates; the alternative is to update \(\theta\) on mini-batches \(M\) where \(|M| \ll |\mathcal{D}|\):
\(\theta \gets \theta - \eta \nabla J(\theta; M)\)
```python
for n in range(n_epochs):
    for M in get_minibatches(D, batch_size):  # iterate over mini-batches of D
        gradient = compute_gradient(J, theta, M)
        theta = theta - eta * gradient
    eta = eta * 0.95  # decay the learning rate once per epoch
```
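As a rough sketch of how the batching helper above could work (`get_minibatches` is an assumed name, not a library function): reshuffle the dataset on every epoch and cut it into chunks of `batch_size` pairs. How each chunk \(M\) is then fed to `compute_gradient` (e.g. stacked into arrays first) depends on how that function expects its data.
```python
import numpy as np

def get_minibatches(D, batch_size):
    # D is assumed to be a list of (x_i, y_i) pairs, as in the loops above.
    indices = np.random.permutation(len(D))   # reshuffle once per call, i.e. per epoch
    for start in range(0, len(D), batch_size):
        # yield the next batch_size examples as a list of (x_i, y_i) pairs
        yield [D[i] for i in indices[start:start + batch_size]]
```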
Question: why does decaying the learning rate lead to convergence?
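A standard way to think about it, following the classical stochastic-approximation (Robbins-Monro) analysis: with a fixed step size, the noise in the stochastic gradient keeps \(\theta\) bouncing around a minimum rather than settling into it, while a decaying step size damps that noise. Convergence is typically guaranteed when the schedule satisfies
\[\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty,\]
that is, the learning rate decays, but not too fast. (The geometric decay \(\eta \gets 0.95\,\eta\) in the snippets above is only an illustrative schedule; it actually shrinks faster than the first condition allows.)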