"Notes on: Jung, A. (2017): A fixed-point of view on gradient methods for big data"

Overview

  • Detailed analysis of methods for minimizing convex objective functions
  • Gradient methods using inexact or noisy gradients, e.g. SGD, can be studied conveniently using inexact fixed-point iterations
  • The paper demonstrates that the fixed-point approach allows the derivation of accelerated variants of basic gradient methods

Notation

  • $x_{l}$ denotes the $l$-th entry in the vector $\mathbf{x}$
  • $\mathbf{Q}^{H}$ denotes the Hermitian transpose of the square matrix $\mathbf{Q}$
  • $\mathbf{Q} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{H}$ for a positive semi-definite matrix $\mathbf{Q}$ and $\mathbf{U}$ whose columns are the orthonormal eigenvectors $\mathbf{u}^{(1)}, \dots, \mathbf{u}^{(n)}$, where the diagonal matrix $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_{1}, \dots, \lambda_{n})$ contains the eigenvalues ordered as $\lambda_{1} \leq \lambda_{2} \leq \dots \leq \lambda_{n}$
  • $\|\mathbf{Q}\|$ denotes the spectral norm of the matrix $\mathbf{Q}$
  • $\mathcal{S}^{(n)}_{L,U}$ denotes the set of convex functions $f$ satisfying $L \mathbf{I} \preceq \nabla^{2} f(\mathbf{x}) \preceq U \mathbf{I}$ for every $\mathbf{x} \in \mathbb{R}^{n}$

Convex functions

If all second order partial derivatives of the function $f$ exist and are continuous, then $f$ is convex if and only if

$$\nabla^{2} f(\mathbf{x}) \succeq \mathbf{0} \quad \text{for every } \mathbf{x} \in \mathbb{R}^{n}.$$
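
As a quick illustration of this criterion, here is a small numerical check (my own sketch, not from the paper) that estimates the Hessian by finite differences and verifies that its smallest eigenvalue is non-negative for a convex test function:

```python
import numpy as np

def hessian(f, x, eps=1e-4):
    """Finite-difference Hessian of f at x (illustrative helper, not from the paper)."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i], np.eye(n)[j]
            H[i, j] = (f(x + eps * e_i + eps * e_j) - f(x + eps * e_i - eps * e_j)
                       - f(x - eps * e_i + eps * e_j) + f(x - eps * e_i - eps * e_j)) / (4 * eps**2)
    return H

# Convex test function: f(x) = log(1 + exp(w^T x)); its Hessian must be PSD everywhere.
w = np.array([1.0, -2.0])
f = lambda x: np.log1p(np.exp(w @ x))

for _ in range(5):
    x = np.random.randn(2)
    lam_min = np.linalg.eigvalsh(hessian(f, x)).min()
    print(lam_min >= -1e-6)  # True: smallest Hessian eigenvalue is (numerically) non-negative
```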

The difficulty of finding the minimum of some function $f \in \mathcal{S}^{(n)}_{L,U}$ using gradient methods is essentially governed by the condition number

$$\kappa := U / L \geq 1$$

of the function class $\mathcal{S}^{(n)}_{L,U}$. Thus, the absolute values of the bounds $L$ and $U$ are not crucial, only their ratio $U / L$ is.

The paper studies gradient methods for minimizing quadratic functions of the form

$$f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^{H} \mathbf{Q} \mathbf{x} - \mathbf{q}^{H} \mathbf{x}$$

with some vector $\mathbf{q} \in \mathbb{R}^{n}$.

The gradient and the Hessian of this form can be obtained as

$$\nabla f(\mathbf{x}) = \mathbf{Q}\mathbf{x} - \mathbf{q}, \qquad \nabla^{2} f(\mathbf{x}) = \mathbf{Q}.$$
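
To make this concrete, here is a small sketch (my own example data, not from the paper) of such a quadratic together with its gradient and Hessian:

```python
import numpy as np

# Quadratic objective f(x) = 0.5 * x^T Q x - q^T x with a PSD matrix Q (illustrative data).
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # symmetric positive definite
q = np.array([1.0, -1.0])

f    = lambda x: 0.5 * x @ Q @ x - q @ x
grad = lambda x: Q @ x - q           # gradient: Q x - q
hess = lambda x: Q                   # Hessian is constant: Q

# The minimizer solves grad(x*) = 0, i.e. Q x* = q.
x_star = np.linalg.solve(Q, q)
print(np.allclose(grad(x_star), 0))  # True
```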

Most results on gradient methods for minimizing quadratic functions of this form also apply when their scope is expanded from quadratic functions to the larger set $\mathcal{S}^{(n)}_{L,U}$.

This is not a surprise, since any function $f \in \mathcal{S}^{(n)}_{L,U}$ can be approximated locally around some point $\bar{\mathbf{x}}$ by a quadratic function obtained from its Taylor series.

In particular we have

$$f(\mathbf{x}) \approx \tilde{f}(\mathbf{x}) := f(\bar{\mathbf{x}}) + (\mathbf{x} - \bar{\mathbf{x}})^{H} \nabla f(\bar{\mathbf{x}}) + \frac{1}{2} (\mathbf{x} - \bar{\mathbf{x}})^{H} \nabla^{2} f(\bar{\mathbf{x}}) (\mathbf{x} - \bar{\mathbf{x}}),$$

where the first two terms, $f(\bar{\mathbf{x}}) + (\mathbf{x} - \bar{\mathbf{x}})^{H} \nabla f(\bar{\mathbf{x}})$, are just the linear expansion about the point $\bar{\mathbf{x}}$.

The approximation error can then be bounded as

$$\bigl| f(\mathbf{x}) - \tilde{f}(\mathbf{x}) \bigr| \leq \frac{U - L}{2} \, \| \mathbf{x} - \bar{\mathbf{x}} \|^{2}.$$

Thus we can ensure an arbitrarily small approximation error $\varepsilon$ by considering $f$ only over a neighbourhood $\mathcal{B}(\bar{\mathbf{x}}, r)$ with sufficiently small radius $r$.
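
A quick numerical sanity check of this local quadratic approximation (an illustrative sketch with a test function of my choosing): the worst-case error over a ball around the expansion point shrinks as the radius shrinks.

```python
import numpy as np

# Smooth convex test function with exact gradient and Hessian.
f    = lambda x: np.log1p(np.exp(x[0])) + 0.5 * x[1]**2
grad = lambda x: np.array([1.0 / (1.0 + np.exp(-x[0])), x[1]])
hess = lambda x: np.diag([np.exp(x[0]) / (1.0 + np.exp(x[0]))**2, 1.0])

x_bar = np.array([0.3, -0.7])                      # expansion point
f_tilde = lambda x: (f(x_bar) + (x - x_bar) @ grad(x_bar)
                     + 0.5 * (x - x_bar) @ hess(x_bar) @ (x - x_bar))

# Worst observed approximation error over a sphere of radius r around x_bar.
for r in [1.0, 0.1, 0.01]:
    dirs = np.random.randn(200, 2)
    pts = x_bar + r * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    print(r, max(abs(f(p) - f_tilde(p)) for p in pts))   # error shrinks with r
```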

Gradient descent

Here they show that gradient descent can be obtained naturally as fixed-point iterations involving the gradient operator $\nabla f$.

They prove the following: for a convex function $f \in \mathcal{S}^{(n)}_{L,U}$ with the unique minimizer $\bar{\mathbf{x}}$, i.e. $f(\bar{\mathbf{x}}) = \min_{\mathbf{x} \in \mathbb{R}^{n}} f(\mathbf{x})$, we can construct the operator $\mathcal{T}: \mathbb{R}^{n} \to \mathbb{R}^{n}$, $\mathcal{T}\mathbf{x} := \mathbf{x} - \alpha \nabla f(\mathbf{x})$, with a step size $\alpha > 0$ such that

$$\mathcal{T}\mathbf{x} = \mathbf{x} \;\Longleftrightarrow\; \nabla f(\mathbf{x}) = \mathbf{0} \;\Longleftrightarrow\; \mathbf{x} = \bar{\mathbf{x}}.$$

Then, starting from an arbitrary initial guess $\mathbf{x}^{(0)}$, the iterates $\mathbf{x}^{(k)} := \mathcal{T}\mathbf{x}^{(k-1)}$ satisfy

$$\| \mathbf{x}^{(k)} - \bar{\mathbf{x}} \| \leq q^{k} \, \| \mathbf{x}^{(0)} - \bar{\mathbf{x}} \|$$

with a contraction factor $q < 1$, so we can prove convergence to the optimum $\bar{\mathbf{x}}$.

They find the smallest possible contraction factor

$$q^{*} = \min_{\alpha} \max \{ |1 - \alpha L|, \, |1 - \alpha U| \} = \frac{U - L}{U + L} = \frac{\kappa - 1}{\kappa + 1},$$

which is achieved for the step size $\alpha^{*} = \frac{2}{L + U}$.
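
A small sketch of GD as a fixed-point iteration on the quadratic example above, assuming the standard optimal step size $2/(L+U)$ and contraction factor $(U-L)/(U+L)$ (my reconstruction of these constants, following the usual analysis):

```python
import numpy as np

# Quadratic example from above: f(x) = 0.5 x^T Q x - q^T x.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
q = np.array([1.0, -1.0])
grad = lambda x: Q @ x - q
x_bar = np.linalg.solve(Q, q)            # unique minimizer = fixed point of T

# Curvature bounds L, U are the extreme eigenvalues of Q; kappa = U / L.
L, U = np.linalg.eigvalsh(Q)[[0, -1]]
alpha = 2.0 / (L + U)                    # step size minimizing the contraction factor
rate = (U - L) / (U + L)                 # contraction factor (kappa - 1) / (kappa + 1)

T = lambda x: x - alpha * grad(x)        # gradient-step operator T x = x - alpha * grad f(x)

x0 = np.zeros(2)
x = x0
for k in range(1, 21):
    x = T(x)                             # fixed-point iteration x^(k) = T x^(k-1)
    bound = rate**k * np.linalg.norm(x0 - x_bar)
    assert np.linalg.norm(x - x_bar) <= bound + 1e-12   # linear convergence at rate `rate`
print(np.linalg.norm(x - x_bar))
```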

First order methods

Here they note that the usual way of approaching this computational problem is to consider a computational model with a first order oracle, which works as follows:

  • Given a query point $\mathbf{x}^{(k)}$, a first order oracle returns the gradient $\nabla f(\mathbf{x}^{(k)})$ of the objective function at this particular point (see the sketch after this list)
  • The FOM (first order method) then constructs the new iterate $\mathbf{x}^{(k+1)}$ s.t. eventually $f(\mathbf{x}^{(k)}) \to f(\bar{\mathbf{x}})$, or equivalently

$$\lim_{k \to \infty} \mathbf{x}^{(k)} = \bar{\mathbf{x}}$$
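
A minimal sketch of this computational model (the names and interface are mine, not the paper's): the method only interacts with the objective through gradient queries.

```python
from typing import Callable
import numpy as np

# A "first order oracle" is just a callable x -> grad f(x); the FOM never sees f itself.
Oracle = Callable[[np.ndarray], np.ndarray]

def run_fom(oracle: Oracle, x0: np.ndarray, step: float, num_iters: int) -> np.ndarray:
    """Generic FOM skeleton: each iterate only uses the gradients returned so far."""
    x = x0
    for _ in range(num_iters):
        g = oracle(x)        # one oracle query per iteration
        x = x - step * g     # plain GD as the simplest FOM instance
    return x

# Example: oracle for the quadratic f(x) = 0.5 x^T Q x - q^T x used earlier.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
q = np.array([1.0, -1.0])
oracle = lambda x: Q @ x - q
print(run_fom(oracle, np.zeros(2), step=2.0 / np.trace(Q), num_iters=50))
```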

Lower bounds on number of iterations

This is an interesting part of the paper. Here they note that in the appendix they prove the following theorem

Consider a particular FOM which, for a given convex function $f \in \mathcal{S}^{(n)}_{L,U}$, generates iterates $\mathbf{x}^{(k)}$ satisfying Eq. (29) of the paper. For fixed $k$ there is a sequence of functions $f_{n} \in \mathcal{S}^{(n)}_{L,U}$ (indexed by the dimension $n$) such that

$$\| \mathbf{x}^{(k)} - \bar{\mathbf{x}} \|^{2} \geq c_{n} \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{2k} \| \mathbf{x}^{(0)} - \bar{\mathbf{x}} \|^{2}$$

with a sequence $c_{n}$ such that $\lim_{n \to \infty} c_{n} = 1$.

Then they note that there is a considerable gap between the upper bound on the error achieved by GD after $k$ iterations and the lower bound that applies to any FOM run for the same number of iterations.
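
To get a feel for the size of this gap, here is a small computation assuming the standard per-iteration contraction factors $(\kappa-1)/(\kappa+1)$ for GD and $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ for the FOM lower bound (my reconstruction of these rates):

```python
import numpy as np

# Iterations needed to shrink the error by 1e-6 under each contraction factor.
iters = lambda rate: int(np.ceil(np.log(1e-6) / np.log(rate)))

for kappa in [10, 100, 1000]:
    gd_rate  = (kappa - 1) / (kappa + 1)                       # GD upper bound
    fom_rate = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)     # lower bound for any FOM
    print(kappa, iters(gd_rate), iters(fom_rate))              # gap grows with kappa
```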

Accelerating Gradient Descent

Here they go on to describe an operator with the same fixed-point convergence, but using information obtained not only from the previous iterate $\mathbf{x}^{(k-1)}$, as in GD, but also from the one before it, $\mathbf{x}^{(k-2)}$. They prove that this operator achieves an error bound very close to the optimal lower bound for any FOM, which is substantially better than GD.
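
This two-step structure is essentially a momentum / heavy-ball update. Below is a minimal sketch on the quadratic example from before; the coefficient choices are the standard heavy-ball tuning for curvature bounds $L$ and $U$, which is my assumption rather than the paper's exact operator.

```python
import numpy as np

# Same quadratic example as before.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
q = np.array([1.0, -1.0])
grad = lambda x: Q @ x - q
x_bar = np.linalg.solve(Q, q)

L, U = np.linalg.eigvalsh(Q)[[0, -1]]
# Standard heavy-ball tuning for curvature in [L, U] (assumed here, not taken from the paper):
alpha = 4.0 / (np.sqrt(L) + np.sqrt(U)) ** 2
beta  = ((np.sqrt(U) - np.sqrt(L)) / (np.sqrt(U) + np.sqrt(L))) ** 2

x_prev = np.zeros(2)
x = np.zeros(2)
for k in range(1, 31):
    # The update uses the two most recent iterates, unlike plain GD.
    x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    if k % 10 == 0:
        print(k, np.linalg.norm(x - x_bar))   # error decays faster than plain GD
```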

If I remember correctly, gradient descent methods such as Adam make use of information from past iterations (running averages of the gradients and their squares) as a rough estimate of the curvature of the objective function, which seems somewhat related to this!