Optimization

Notation

  • $\alpha$ - learning rate
  • $J(\theta) = J \big( f^*(\mathbf{X}; \theta), \mathbf{y} \big)$ - cost of using $f^*$ to approximate the real function $f(x) = y$ for observed input-output pairs $\mathbf{X}$ and $\mathbf{y}$
  • $\theta^{(t)}$ - approximate value of $\theta$ after $t$ iterations

First-order

Overview

First-order optimization techniques only take into account the first-order derivative (the Jacobian).

These kinds of methods are usually slower than second-order techniques, but possess the advantage of being much more feasible for high-dimensional data.

Consider some $m \times n$ weight matrix $W$. Taking the derivative of some function $f(\mathbf{x}) = W \mathbf{x}$ gives us a Jacobian matrix of $m \times n$ dimensions, i.e. $m \cdot n$ derivatives. If we want the Hessian, we then have to take the partial derivatives of all combinations of entries in the Jacobian, giving us a total of $(m \cdot n)^2$ partial derivatives to compute. See the issue?
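
For example, with $m = n = 1000$ the Jacobian already has $10^6$ entries, while the Hessian has $(10^6)^2 = 10^{12}$, which is far too many to compute or store for models of any realistic size.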

Gradient Descent

Motivation

  • Finding minimum of some arbitrary function without having to compute the second-order derivative
  • Preferably an iterative method, allowing incremental improvements

Definition

\begin{equation*}
  \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta F(\theta^{(t)})
\end{equation*}

where

  • $F(\theta)$ is some arbitrary function
  • $\alpha$ is the learning-rate

Typically in machine learning we want to minimize the cost of some function $f$ over some parameters $\theta$, which we might denote $J(\theta)$ (in my opinion, more precisely $J(f(\mathbf{X}; \theta), \mathbf{y})$). Usually we don't have access to the entire function $f$, but only to samples of input-output pairs. Therefore we cannot compute the full gradient of $J(\theta)$ and need to approximate it somehow. And so Batch Gradient Descent arises!
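
Concretely, if we write $L$ for a per-example loss (notation assumed here, not used elsewhere in these notes), the cost we can actually evaluate is the average over the $N$ observed pairs:

\begin{equation*}
  J(\theta) = \frac{1}{N} \sum_{i=1}^{N} L \big( f(\mathbf{x}^{(i)}; \theta), y^{(i)} \big)
\end{equation*}

Batch Gradient Descent simply uses all $N$ terms for every update.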

Batch Gradient Descent

Motivation

  • Perform gradient descent without having to know the entire function we're approximating

Definition

\begin{equation*}
  \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta J(\theta; \mathbf{X}, \mathbf{y})
\end{equation*}

where the gradient now is the average over all observations $(\mathbf{x}^{(i)}, y^{(i)})$.
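
As a concrete illustration, here's a minimal NumPy sketch of Batch Gradient Descent on a toy least-squares problem (the quadratic cost, the generated data, and names like grad_J are purely illustrative, not taken from the definition above):

  import numpy as np

  # Toy data: N noisy observations of a linear relationship
  rng = np.random.default_rng(0)
  N, d = 100, 5
  X = rng.normal(size=(N, d))
  y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

  def grad_J(theta, X, y):
      # Gradient of the mean squared error J(theta) = 1/N * sum_i (x_i . theta - y_i)^2,
      # averaged over *all* N observations
      return 2.0 / len(y) * X.T @ (X @ theta - y)

  alpha = 0.1
  theta = np.zeros(d)
  for t in range(100):
      theta = theta - alpha * grad_J(theta, X, y)  # every single update touches the full dataset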

Issues

  • Expensive to compute average gradient across entire set of observations
  • Convergence tends to be slow, as observations where our prediction is way off might be "hidden" away in the average

Stochastic Gradient Descent

Motivation

  • Computing the full approximation to the gradient in Batch Gradient Descent is expensive
  • We also want to address the issue of mistakes being potentially "hidden" away in the average

Definition

\begin{equation*}
  \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta J(\theta; \mathbf{x}^{(i)}, y^{(i)})
\end{equation*}

where the gradient is now computed from a single observation.
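
A corresponding sketch of the per-observation update, again on an assumed toy least-squares cost; each epoch shuffles the data so the observations are visited in random order:

  import numpy as np

  rng = np.random.default_rng(0)
  N, d = 100, 5
  X = rng.normal(size=(N, d))
  y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

  alpha = 0.01
  theta = np.zeros(d)
  for epoch in range(20):
      for i in rng.permutation(N):                # visit the observations in random order
          g = 2.0 * (X[i] @ theta - y[i]) * X[i]  # gradient from the single pair (x_i, y_i)
          theta = theta - alpha * g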

Issues

  • Usually quite unstable due to high variance
    • Can be mitigated by decreasing the learning rate as we go, but then the updates eventually become vanishingly small

Mini-batch Gradient Descent

These days, when people refer to Stochastic Gradient Descent (SGD) they usually mean Mini-batch Gradient Descent. It's just generally much better.

Motivation

  • Reduce instability (variance) of Stochastic Gradient Descent
  • Improve the slow & expensive Batch Gradient Descent
  • Make use of highly optimized matrix-operations

Definition

\begin{equation*}
  \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta J(\theta; \mathbf{x}^{(i:i+k)}, y^{(i:i+k)})
\end{equation*}

where $k$ is the size of the mini-batch.
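
A sketch of the mini-batch version, where each update is a single (highly optimized) matrix operation over $k$ observations (same assumed toy problem as before):

  import numpy as np

  rng = np.random.default_rng(0)
  N, d, k = 100, 5, 16
  X = rng.normal(size=(N, d))
  y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

  alpha = 0.05
  theta = np.zeros(d)
  for epoch in range(20):
      perm = rng.permutation(N)
      for start in range(0, N, k):
          idx = perm[start:start + k]  # the mini-batch x^(i:i+k), y^(i:i+k)
          g = 2.0 / len(idx) * X[idx].T @ (X[idx] @ theta - y[idx])  # gradient averaged over the batch
          theta = theta - alpha * g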

Issues

  • Choosing proper learning-rate $\alpha$ is hard
  • For non-convex functions, SGD only guarantees convergence to a local minimum, and can easily get "trapped" in such local minima
    • In fact, saddle points are argued to be a larger problem, as the gradient is zero in all dimensions, making it hard for SGD to "escape" / "get over the edge and keep rolling"

Momentum SGD

Motivation

  • If we keep seeing that it is a "good thing" to update in some direction, then it would make sense to perform larger updates in this direction

Definition

\begin{equation*}
\begin{split}
v_t &= \gamma v_{t-1} + \alpha \nabla_\theta J(\theta) \\
\theta^{(t+1)} &= \theta^{(t)} - v_t \\
\end{split}
\end{equation*}

I really suggest having a look at Why Momentum Really Works by Gabriel Goh. It provides some theoretical background on why this simple addition works so well, in combination with a bunch of visualizations to play with.
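
In code, the change from plain SGD is a single extra state variable; a minimal sketch, with a placeholder quadratic cost and an illustrative $\gamma = 0.9$ (neither taken from the definition above):

  import numpy as np

  def grad_J(theta):
      # Placeholder gradient of the toy cost J(theta) = 0.5 * ||theta - 3||^2
      return theta - 3.0

  alpha, gamma = 0.1, 0.9
  theta = np.zeros(5)
  v = np.zeros_like(theta)
  for t in range(100):
      v = gamma * v + alpha * grad_J(theta)  # accumulate a velocity across updates
      theta = theta - v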

Issues

  • We have a tendency to "overshoot" since we now have this "velocity" component, leading to oscillations around the point of convergence instead of reaching it and staying there. This is especially an issue for non-convex functions, as we might actually overshoot the global minimum.

Nesterov Momentum SGD

Motivation

  • Better control the overshooting of Momentum SGD

Definition

\begin{equation*}
\begin{split}
v_t &= \gamma v_{t-1} + \alpha \nabla_\theta J(\theta - \gamma v_{t-1}) \\
\theta^{(t+1)} &= \theta^{(t)} - v_t \\
\end{split}
\end{equation*}

Here we evaluate the gradient not at the current point $\theta^{(t)}$, but at $\theta^{(t)} - \gamma v_{t-1}$, i.e. at an estimate of where the momentum term is about to take us. In this way we get a certain amount of control over the step we're taking by "looking ahead" and re-adjusting our update.

If moving to $\theta - \gamma v_{t-1}$ is indeed a "bad" update, then $\nabla_\theta J(\theta - \gamma v_{t-1})$ will point back towards $\theta^{(t)}$, correcting some of the "bad" update.
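
The corresponding sketch differs from the momentum one only in where the gradient is evaluated (same assumed placeholder cost as before):

  import numpy as np

  def grad_J(theta):
      # Placeholder gradient of the toy cost J(theta) = 0.5 * ||theta - 3||^2
      return theta - 3.0

  alpha, gamma = 0.1, 0.9
  theta = np.zeros(5)
  v = np.zeros_like(theta)
  for t in range(100):
      v = gamma * v + alpha * grad_J(theta - gamma * v)  # gradient at the "looked-ahead" point
      theta = theta - v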

Sometimes people swap the $+$ sign in the update, like in the thesis I'm about to suggest that you have a look at:

Have a look at Ilya Sutskever's thesis in section 7.2 where you get a more mathematical description of why this is a good idea.

Adagrad

Motivation

  • We would like to adapt our updates to each individual parameter to perform larger or smaller updates depending on the importance of the parameter
    • Say you barely see this one parameter in your updates, but then suddenly you look at an observation where this parameter has a significant impact. In this case it would make sense to perform a greater-than-normal update for this parameter.
  • We still don't like the fact that all the previous methods make us manually tune the learning rate $\alpha$

Definition

\begin{equation*}
\begin{split}
g_i^{(t)} &= \nabla_{\theta_i} J(\theta^{(t)}) \\
\theta_i^{(t+1)} &= \theta_i^{(t)} - \alpha \cdot g_i^{(t)}
\end{split}
\end{equation*}

where $\theta_i$ denotes the $i$-th component of $\theta$.

Or better:

\begin{equation*}
\begin{split}
g_i^{(t)} &= \nabla_{\theta_i} J(\theta^{(t)}) \\
\theta_i^{(t+1)} &= \theta_i^{(t)} - \frac{\alpha}{\sqrt{G_{ii}^{(t)} + \epsilon}} \cdot g_i^{(t)}
\end{split}
\end{equation*}    

where

  • $G^{(t)}$ is a diagonal matrix where each diagonal element $(i, i)$ is the sum of the squares of the gradients w.r.t. $\theta_i$ up to $t$
  • $\epsilon$ is a smoothing term that avoids division by zero (usually on the order of $10^{-8}$)

Finally, to make use of efficient matrix operations:

\begin{equation*}
 \theta^{(t+1)} = \theta^{(t)} - \frac{\alpha}{\sqrt{G^{(t)} + \epsilon}} \odot g^{(t)}
\end{equation*}
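
In code, only the diagonal of $G^{(t)}$ is ever needed, so it can be kept as a vector of per-parameter sums of squared gradients; a minimal sketch with an assumed placeholder cost:

  import numpy as np

  def grad_J(theta):
      # Placeholder gradient of the toy cost J(theta) = 0.5 * ||theta - 3||^2
      return theta - 3.0

  alpha, eps = 0.5, 1e-8
  theta = np.zeros(5)
  G = np.zeros_like(theta)  # diagonal of G: running sum of squared gradients, one per parameter
  for t in range(100):
      g = grad_J(theta)
      G += g ** 2
      theta = theta - alpha / np.sqrt(G + eps) * g  # per-parameter effective learning rate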

Issues

  • Accumulation of the squared gradients in the denominator means that eventually the denominator grows so large that the effective learning rate goes to zero

RMSProp

Motivation

  • Fix the issue of the too aggressive and monotonically decreasing learning rate $\alpha$ of Adagrad

Definition
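
The usual formulation, commonly attributed to Hinton's lecture slides, replaces Adagrad's running sum with an exponentially decaying average of the squared gradients, with the decay $\gamma$ typically around 0.9:

\begin{equation*}
\begin{split}
E[g^2]^{(t)} &= \gamma E[g^2]^{(t-1)} + (1 - \gamma) \big( g^{(t)} \big)^2 \\
\theta^{(t+1)} &= \theta^{(t)} - \frac{\alpha}{\sqrt{E[g^2]^{(t)} + \epsilon}} \odot g^{(t)}
\end{split}
\end{equation*}

Since the average forgets old gradients, the effective learning rate no longer decreases monotonically towards zero.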

Issues

Adam (usually the best)
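
A brief sketch of the update from Kingma & Ba's "Adam: A Method for Stochastic Optimization": it combines a momentum-like estimate of the first moment of the gradient with RMSProp-like scaling by the second moment, both bias-corrected because they are initialized at zero:

\begin{equation*}
\begin{split}
m^{(t)} &= \beta_1 m^{(t-1)} + (1 - \beta_1) g^{(t)} \\
v^{(t)} &= \beta_2 v^{(t-1)} + (1 - \beta_2) \big( g^{(t)} \big)^2 \\
\hat{m}^{(t)} &= \frac{m^{(t)}}{1 - \beta_1^t}, \quad \hat{v}^{(t)} = \frac{v^{(t)}}{1 - \beta_2^t} \\
\theta^{(t+1)} &= \theta^{(t)} - \frac{\alpha}{\sqrt{\hat{v}^{(t)}} + \epsilon} \odot \hat{m}^{(t)}
\end{split}
\end{equation*}

with the typical defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$.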

Second-order

Resources