Table of Contents


  • optimization_6e9d20113f7ef024a71fc3edcc10e77417d06393.png - learning rate
  • optimization_70ff16dc4cd1acb193754893c5cbae1c2058afec.png - cost of using optimization_e52a598b1b89688e78c8f8933a8bfef6600de168.png to approximate the real function optimization_0da3fde0d26624695d373b9b10150c4c59e6ca98.png for observed input-output pairs optimization_553e3281e8d268e42b95a8a4dd22e70563a7024d.png and optimization_29c26837fff0673a98d6523a80b727d38797f426.png
  • optimization_79c3b7b16eb76353ca554826e728263f5f4423d2.png - approx. value for optimization_e7a86163f39fa21d4a2ed66946369cdeb900ef42.png after optimization_eb5d809ed7c492fae7d4927a6fc9a5e22f9b3831.png iterations



First-order optimization techniques only take into account the the first-order derivative (the Jacobian).

These kind of methods are of course usually slower than the second-order techniques, but posess the advantage of being much more feasible for high-dimensional data.

Consider some optimization_1395dc909262a8364fc2adb6e58fbfe880a195c7.png weight matrix optimization_fe7108f38f2db55db9cb7eee3f63077437a2e510.png. Taking the derivative of some function optimization_9598054a9a53972466cb50fba00263934cae32e9.png gives us a Jacobian matrix of optimization_1395dc909262a8364fc2adb6e58fbfe880a195c7.png dimensions, i.e. optimization_d99bc5aadc4a6b818701f268f69979dfdec45716.png derivatives. If we want the Hessian, we then have to take the partial derivatives of all the combinations of the Jacobian, givig us a total of optimization_f73a9b85741a9341709bf7064bca901d41c6b4af.png partial derivatives to compute. See the issue?

Gradient Descent


  • Finding minimum of some arbitrary function without having to compute the second-order derivative
  • Preferable iterative method, allowing incremental improvements




  • optimization_fe0acba9bc14a206ed774b785f77e4c44d344f84.png is some arbitrary function
  • optimization_6e9d20113f7ef024a71fc3edcc10e77417d06393.png is the learning-rate

Typically in machine learning we want to minimize the cost of some function optimization_cdd1cc131da6040eca078917132a377727053c44.png over some parameters optimization_e7a86163f39fa21d4a2ed66946369cdeb900ef42.png, which we might denote optimization_ad3c237ee412f09f1e5081d7970e9439e052e813.png (in my opinion, more precisely optimization_4650e1dcb6901631621feb0fd56c2325e5b75b9f.png). Usually we don't have access to the entire function optimization_cdd1cc131da6040eca078917132a377727053c44.png but samples of input-output pairs. Therefore we cannot compute the full gradient of optimization_ad3c237ee412f09f1e5081d7970e9439e052e813.png and need to somehow approximate the gradient. And so Batch Gradient Descent arises!

Batch Gradient Descent


  • Perform gradient descent without having to know the entire function we're approximating



where the gradient now is the average over all observations optimization_84dd76992025e3512b8c3494aacce7a70366cdfc.png


  • Expensive to compute average gradient across entire set of observations
  • Convergence tends to be slow as some data where our prediction is way off might be "hidden" away in the average

Stochastic Gradient Descent


  • Computing the full approximation to the gradient in Batch Gradient Descent is expensive
  • Also want to address the issue of potentially "hidden" away mistakes we're making



where now the gradient is compute from a single observation.


  • Usually quite unstable due to high variance
    • Can be mitigated by decreasing the learning rate as we go, but this has the disadvantage of vanishing gradient

Mini-batch Gradient Descent

These days, when people refer to Stochastic Gradient Descent (SGD) they most of the time mean Mini-batch Gradient Descent. It's just generally much better.


  • Reduce instability (variance) of Stochastic Gradient Descent
  • Improve the slow & expensive Batch Gradient Descent
  • Make use of highly optimized matrix-operations



where optimization_094b02afce734f4ce51933d0093ef3d2da9f8123.png is the size of the mini-batch.


  • Choosing proper learning-rate optimization_6e9d20113f7ef024a71fc3edcc10e77417d06393.png is hard
  • SGD only guarantees convergence local minimas for non-convex functions, and can easily be "trapped" in these local minimas
    • In fact, saddle-points are argued to be a larger problem, as the gradient is zero in all dimensions, making it hard for SGD to "escape" / "get over the edge and keep rolling"

Momentum SGD


  • If we keep seeing that is a "good thing" to update in some direction, then it would make sense to perform larger updates in this direction



I really suggest having a look at Why Momentum Really Works by Gabriel Goh. It provides some theoretical background to why this simple addition works so well, in combinations with a bunch of visualizations to play with.


  • We have tendency to "overshoot" since we now have this "velocity" component, leading to oscillations around the point of convergence instead of reaching it and staying there. This is especially an issue for non-convex functions as we might actually overshoot the global minima.

Nestorov Momentum SGD


  • Better control the overshooting of Momentum SGD



Here we approximate the next value of optimization_e7a86163f39fa21d4a2ed66946369cdeb900ef42.png (optimization_759003d5e20ef09c13873a732f4c7f507b55c83c.png) by evaluating the gradient at the current point, optimization_5ab921d90a545ff2e81e0972c9e1a4df1b5f7154.png, minus the momentum-term we'll be adding, optimization_dd4dce6c0f1c4732e0709306cbe5f745d904b98a.png, to obtain optimization_759003d5e20ef09c13873a732f4c7f507b55c83c.png. In this way we get a certain control of the step we're taking by "looking ahead" and re-adjusting our update.

If the update optimization_69d246f1be1d9b2fdbb1e9d0b31c72349d04e3d5.png is indeed a "bad" update, then optimization_c840feaaba28a3d52f90985a1d6fc9fc7c6f559f.png will point towards optimization_5ab921d90a545ff2e81e0972c9e1a4df1b5f7154.png again, correcting some of the "bad" update.

Sometimes people swaps the + sign in the update, like in the paper I'm about to suggest that you have a look at:

Have a look at Ilya Sutskever's thesis in section 7.2 where you get a more mathematical description of why this is a good idea.



  • We would like to adapt our updates to each individual parameter to perform larger or smaller updates depending on the importance of the parameter
    • Say you barely see this one parameter in your updates, but then suddenly you look at an observation where this parameter has a significant impact. In this case it would make sense to perform a greater-than-normal update for this parameter.
  • We still don't like the fact that all previous methods makes us have to manually tune the learning rate optimization_6e9d20113f7ef024a71fc3edcc10e77417d06393.png



where optimization_3fef34decfc720f701fe0648862afaa133b8d9b8.png denotes the ith component of optimization_e7a86163f39fa21d4a2ed66946369cdeb900ef42.png.

Or better:



  • optimization_5585d9eb483afed6a073857c402873e7535725ad.png is a diagonal matrix where each diagonal element optimization_ffb6bbadc92f4ae45e1c4a34caf9e35f7b3307e3.png is the sum of the squares of the gradients w.r.t. optimization_3fef34decfc720f701fe0648862afaa133b8d9b8.png up to optimization_eb5d809ed7c492fae7d4927a6fc9a5e22f9b3831.png
  • optimization_f774fb2b9717d65742fb7c3f1bb4e9ba2df26910.png is a smoothing term that avoids zero-division (usually on the order of optimization_8a2cfef594f510be6ae54380aef3e0bb58e2d67d.png)

Finally to make use of efficient matrix-operations:



  • Accumulation of the squared gradient in the denominator means that eventually the denominator will so large that the learning-rate goes to zero



  • Fix the issue of the too aggressive and monotonically decreasing learning rate optimization_6e9d20113f7ef024a71fc3edcc10e77417d06393.png of Adagrad



Adam (usually the best)