Optimization

Notation

  • $\alpha$ - learning rate
  • $J(\theta) = J \big( f^*(\mathbf{X}; \theta), \mathbf{y} \big)$ - cost of using $f^*$ to approximate the real function $f(x) = y$ for observed input-output pairs $\mathbf{X}$ and $\mathbf{y}$
  • $\theta^{(t)}$ - approximate value of $\theta$ after $t$ iterations

First-order

Overview

First-order optimization techniques only take into account the first-order derivative (the Jacobian).

These kinds of methods are usually slower than second-order techniques, but possess the advantage of being much more feasible for high-dimensional data.

Consider some $m \times n$ weight matrix $W$. Taking the derivative of some function $f(\mathbf{x}) = W \mathbf{x}$ gives us a Jacobian matrix of $m \times n$ dimensions, i.e. $m \cdot n$ derivatives. If we want the Hessian, we then have to take the partial derivatives of all combinations of entries in the Jacobian, giving us a total of $(m \cdot n)^2$ partial derivatives to compute. See the issue?
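
For example, with $m = n = 1000$ the Jacobian already has $10^6$ entries, while the Hessian has $(10^6)^2 = 10^{12}$, which is far too many to compute or store for models of any realistic size.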

Gradient Descent

Motivation

  • Finding minimum of some arbitrary function without having to compute the second-order derivative
  • Preferably an iterative method, allowing incremental improvements

Definition

\begin{equation*}
  \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta F(\theta^{(t)})
\end{equation*}

where

  • $F(\theta)$ is some arbitrary function
  • $\alpha$ is the learning-rate

Typically in machine learning we want to minimize the cost of some function $f$ over some parameters $\theta$, which we might denote $J(\theta)$ (in my opinion, more precisely $J(f(\mathbf{X}; \theta), \mathbf{y})$). Usually we don't have access to the entire function $f$, but only to samples of input-output pairs. Therefore we cannot compute the full gradient of $J(\theta)$ and need to approximate it somehow. And so Batch Gradient Descent arises!
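
Concretely, if we write $L$ for a per-example loss (notation assumed here, not used elsewhere in these notes), the cost we can actually evaluate is the average over the $N$ observed pairs:

\begin{equation*}
  J(\theta) = \frac{1}{N} \sum_{i=1}^{N} L \big( f(\mathbf{x}^{(i)}; \theta), y^{(i)} \big)
\end{equation*}

Batch Gradient Descent simply uses all $N$ terms for every update.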

Batch Gradient Descent

Motivation

  • Perform gradient descent without having to know the entire function we're approximating

Definition

\begin{equation*}
  \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta J(\theta; \mathbf{X}, \mathbf{y})
\end{equation*}

where the gradient now is the average over all observations $(\mathbf{x}^{(i)}, y^{(i)})$.
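
As a concrete illustration, here's a minimal NumPy sketch of Batch Gradient Descent on a toy least-squares problem (the quadratic cost, the generated data, and names like grad_J are purely illustrative, not taken from the definition above):

  import numpy as np

  # Toy data: N noisy observations of a linear relationship
  rng = np.random.default_rng(0)
  N, d = 100, 5
  X = rng.normal(size=(N, d))
  y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

  def grad_J(theta, X, y):
      # Gradient of the mean squared error J(theta) = 1/N * sum_i (x_i . theta - y_i)^2,
      # averaged over *all* N observations
      return 2.0 / len(y) * X.T @ (X @ theta - y)

  alpha = 0.1
  theta = np.zeros(d)
  for t in range(100):
      theta = theta - alpha * grad_J(theta, X, y)  # every single update touches the full dataset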

Issues

  • Expensive to compute average gradient across entire set of observations
  • Convergence tends to be slow, as observations where our prediction is way off might be "hidden" away in the average

Stochastic Gradient Descent

Motivation

  • Computing the full approximation to the gradient in Batch Gradient Descent is expensive
  • We also want to address the issue of mistakes being potentially "hidden" away in the average

Definition

\begin{equation*}
  \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta J(\theta; \mathbf{x}^{(i)}, y^{(i)})
\end{equation*}

where the gradient is now computed from a single observation.
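
A corresponding sketch of the per-observation update, again on an assumed toy least-squares cost; each epoch shuffles the data so the observations are visited in random order:

  import numpy as np

  rng = np.random.default_rng(0)
  N, d = 100, 5
  X = rng.normal(size=(N, d))
  y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

  alpha = 0.01
  theta = np.zeros(d)
  for epoch in range(20):
      for i in rng.permutation(N):                # visit the observations in random order
          g = 2.0 * (X[i] @ theta - y[i]) * X[i]  # gradient from the single pair (x_i, y_i)
          theta = theta - alpha * g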

Issues

  • Usually quite unstable due to high variance
    • Can be mitigated by decreasing the learning rate as we go, but then the updates eventually become vanishingly small

Mini-batch Gradient Descent

These days, when people refer to Stochastic Gradient Descent (SGD) they usually mean Mini-batch Gradient Descent. It's just generally much better.

Motivation

  • Reduce instability (variance) of Stochastic Gradient Descent
  • Improve the slow & expensive Batch Gradient Descent
  • Make use of highly optimized matrix-operations

Definition

\begin{equation*}
  \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta J(\theta; \mathbf{x}^{(i:i+k)}, y^{(i:i+k)})
\end{equation*}

where $k$ is the size of the mini-batch.
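
A sketch of the mini-batch version, where each update is a single (highly optimized) matrix operation over $k$ observations (same assumed toy problem as before):

  import numpy as np

  rng = np.random.default_rng(0)
  N, d, k = 100, 5, 16
  X = rng.normal(size=(N, d))
  y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

  alpha = 0.05
  theta = np.zeros(d)
  for epoch in range(20):
      perm = rng.permutation(N)
      for start in range(0, N, k):
          idx = perm[start:start + k]  # the mini-batch x^(i:i+k), y^(i:i+k)
          g = 2.0 / len(idx) * X[idx].T @ (X[idx] @ theta - y[idx])  # gradient averaged over the batch
          theta = theta - alpha * g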

Issues

  • Choosing proper learning-rate $\alpha$ is hard
  • For non-convex functions, SGD only guarantees convergence to a local minimum, and can easily get "trapped" in such local minima
    • In fact, saddle points are argued to be a larger problem, as the gradient is zero in all dimensions, making it hard for SGD to "escape" / "get over the edge and keep rolling"

Momentum SGD

Motivation

  • If we keep seeing that it is a "good thing" to update in some direction, then it would make sense to perform larger updates in this direction

Definition

\begin{equation*}
\begin{split}
v_t &= \gamma v_{t-1} + \alpha \nabla_\theta J(\theta) \\
\theta^{(t+1)} &= \theta^{(t)} - v_t \\
\end{split}
\end{equation*}

I really suggest having a look at Why Momentum Really Works by Gabriel Goh. It provides some theoretical background on why this simple addition works so well, in combination with a bunch of visualizations to play with.
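
In code, the change from plain SGD is a single extra state variable; a minimal sketch, with a placeholder quadratic cost and an illustrative $\gamma = 0.9$ (neither taken from the definition above):

  import numpy as np

  def grad_J(theta):
      # Placeholder gradient of the toy cost J(theta) = 0.5 * ||theta - 3||^2
      return theta - 3.0

  alpha, gamma = 0.1, 0.9
  theta = np.zeros(5)
  v = np.zeros_like(theta)
  for t in range(100):
      v = gamma * v + alpha * grad_J(theta)  # accumulate a velocity across updates
      theta = theta - v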

Issues

  • We have a tendency to "overshoot" since we now have this "velocity" component, leading to oscillations around the point of convergence instead of reaching it and staying there. This is especially an issue for non-convex functions, as we might actually overshoot the global minimum.

Nesterov Momentum SGD

Motivation

  • Better control the overshooting of Momentum SGD

Definition

\begin{equation*}
\begin{split}
v_t &= \gamma v_{t-1} + \alpha \nabla_\theta J(\theta - \gamma v_{t-1}) \\
\theta^{(t+1)} &= \theta^{(t)} - v_t \\
\end{split}
\end{equation*}

Here we evaluate the gradient not at the current point $\theta^{(t)}$, but at $\theta^{(t)} - \gamma v_{t-1}$, i.e. at an estimate of where the momentum term is about to take us. In this way we get a certain amount of control over the step we're taking by "looking ahead" and re-adjusting our update.

If moving to $\theta - \gamma v_{t-1}$ is indeed a "bad" update, then $\nabla_\theta J(\theta - \gamma v_{t-1})$ will point back towards $\theta^{(t)}$, correcting some of the "bad" update.
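
The corresponding sketch differs from the momentum one only in where the gradient is evaluated (same assumed placeholder cost as before):

  import numpy as np

  def grad_J(theta):
      # Placeholder gradient of the toy cost J(theta) = 0.5 * ||theta - 3||^2
      return theta - 3.0

  alpha, gamma = 0.1, 0.9
  theta = np.zeros(5)
  v = np.zeros_like(theta)
  for t in range(100):
      v = gamma * v + alpha * grad_J(theta - gamma * v)  # gradient at the "looked-ahead" point
      theta = theta - v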

Sometimes people swap the $+$ sign in the update, like in the thesis I'm about to suggest that you have a look at:

Have a look at Ilya Sutskever's thesis in section 7.2 where you get a more mathematical description of why this is a good idea.

Adagrad

Motivation

  • We would like to adapt our updates to each individual parameter to perform larger or smaller updates depending on the importance of the parameter
    • Say you barely see this one parameter in your updates, but then suddenly you look at an observation where this parameter has a significant impact. In this case it would make sense to perform a greater-than-normal update for this parameter.
  • We still don't like the fact that all the previous methods make us manually tune the learning rate $\alpha$

Definition

\begin{equation*}
\begin{split}
g_i^{(t)} &= \nabla_{\theta_i} J(\theta^{(t)}) \\
\theta_i^{(t+1)} &= \theta_i^{(t)} - \alpha \cdot g_i^{(t)}
\end{split}
\end{equation*}

where $\theta_i$ denotes the $i$-th component of $\theta$.

Or better:

\begin{equation*}
\begin{split}
g_i^{(t)} &= \nabla_{\theta_i} J(\theta^{(t)}) \\
\theta_i^{(t+1)} &= \theta_i^{(t)} - \frac{\alpha}{\sqrt{G_{ii}^{(t)} + \epsilon}} \cdot g_i^{(t)}
\end{split}
\end{equation*}    

where

  • $G^{(t)}$ is a diagonal matrix where each diagonal element $(i, i)$ is the sum of the squares of the gradients w.r.t. $\theta_i$ up to $t$
  • $\epsilon$ is a smoothing term that avoids division by zero (usually on the order of $10^{-8}$)

Finally, to make use of efficient matrix operations:

\begin{equation*}
 \theta^{(t+1)} = \theta^{(t)} - \frac{\alpha}{\sqrt{G^{(t)} + \epsilon}} \odot g^{(t)}
\end{equation*}
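
In code, only the diagonal of $G^{(t)}$ is ever needed, so it can be kept as a vector of per-parameter sums of squared gradients; a minimal sketch with an assumed placeholder cost:

  import numpy as np

  def grad_J(theta):
      # Placeholder gradient of the toy cost J(theta) = 0.5 * ||theta - 3||^2
      return theta - 3.0

  alpha, eps = 0.5, 1e-8
  theta = np.zeros(5)
  G = np.zeros_like(theta)  # diagonal of G: running sum of squared gradients, one per parameter
  for t in range(100):
      g = grad_J(theta)
      G += g ** 2
      theta = theta - alpha / np.sqrt(G + eps) * g  # per-parameter effective learning rate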

Issues

  • Accumulation of the squared gradients in the denominator means that eventually the denominator grows so large that the effective learning rate goes to zero

RMSProp

Motivation

  • Fix the issue of the too aggressive and monotonically decreasing learning rate $\alpha$ of Adagrad

Definition
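
The usual formulation, commonly attributed to Hinton's lecture slides, replaces Adagrad's running sum with an exponentially decaying average of the squared gradients, with the decay $\gamma$ typically around 0.9:

\begin{equation*}
\begin{split}
E[g^2]^{(t)} &= \gamma E[g^2]^{(t-1)} + (1 - \gamma) \big( g^{(t)} \big)^2 \\
\theta^{(t+1)} &= \theta^{(t)} - \frac{\alpha}{\sqrt{E[g^2]^{(t)} + \epsilon}} \odot g^{(t)}
\end{split}
\end{equation*}

Since the average forgets old gradients, the effective learning rate no longer decreases monotonically towards zero.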

Issues

Adam (usually the best)
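
A brief sketch of the update from Kingma & Ba's "Adam: A Method for Stochastic Optimization": it combines a momentum-like estimate of the first moment of the gradient with RMSProp-like scaling by the second moment, both bias-corrected because they are initialized at zero:

\begin{equation*}
\begin{split}
m^{(t)} &= \beta_1 m^{(t-1)} + (1 - \beta_1) g^{(t)} \\
v^{(t)} &= \beta_2 v^{(t-1)} + (1 - \beta_2) \big( g^{(t)} \big)^2 \\
\hat{m}^{(t)} &= \frac{m^{(t)}}{1 - \beta_1^t}, \quad \hat{v}^{(t)} = \frac{v^{(t)}}{1 - \beta_2^t} \\
\theta^{(t+1)} &= \theta^{(t)} - \frac{\alpha}{\sqrt{\hat{v}^{(t)}} + \epsilon} \odot \hat{m}^{(t)}
\end{split}
\end{equation*}

with the typical defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$.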

Second-order

Resources