# "Notes on: Jung, A. (2017): A fixed-point of view on gradient methods for big data"

## Table of Contents

## Overview

- Detailed analysis of methods for minimizing convex objective functions
- Gradient methods using inexact or noisy gradients, e.g. SGD, can be studied conveniently using
**inexact fixed point iterations** - Paper demonstrate that fixed-point approach allows derivation of accelerations for basic gradient methods

## Notation

- denotes the l-th entry in the vector
- denotes the hermitian transpose of the square matrix
- for a
*positive semi-definite*matrix and whose columns are the orthonormal eigenvectors , where the diagonal matrix contains the eigenvalues ordered as - denotes the spectral norm of the matrix
- denotes the
- denotes the set of convex functions satisfying for every

## Convex functions

If all second order partial derivatives of the function exist and are continuous, then is **convex** if and only if

Difficulty of finding the minimum of some function using gradient methods is essentially governed by the **condition number**

of the function class . Thus, the aboslute values of the bounds and are not crucial, only their ratio is.

Gradient methods for minimizing quadratic functions of the form

wit some vector .

The gradient and the Hessian of this form can be obtained by

Most results on gradient methods for minimizing quadratic functions of this form apply also when expanding their scope from quadratic functions to the larger set .

This is not a surprise since any function can be approximated locally around some point by a quadratic function which is obtained by the Taylor series.

In particular we have,

where with some , i.e. just some linear expansion about the point .

The approximation error can then be bound as

Thus we can ensure arbitrarily small approx. error by considering only over a neighbourhood with sufficiently small radius .

## Gradient descent

Here they show that gradient descent can be obatined naturally as fixed-point iterations involving the gradient operator .

They prove that if we consider a **convex** function with the unique minimizer , i.e. . We can construct the operator with a step size such that

Then starting from an arbitrary initial guess , the iterates satisfy

we can prove convergence to the optimum .

They find the smallest possible contraction factor

## First order methods

Here they note that the usual way of approaching the computational problem of all this is to consider a computational model with a **first order oracle** , which works as follows:

- Given a query point , a
**first order oracle**returns the gradient of the objective function at this particular point - The FOM (first order method) then constructs the new iterate s.t. eventually , or equivalently

## Lower bounds on number of iterations

This is an interesting part of the paper. Here they note that in the appendix they prove the following theorem

Consider a particular FOM, which for a given convex function generates iterates satisfying eqn:eqn-29. For fixed there is a sequence of functions (indexed by dimension ) such that

with a sequence such that .

Then they note that **there is a considerable gap between the upper bound on the error achieved by GD after iterations and the lower bound which applies ot any FOM which is run for the same number of iterations**.

## Accelerating Gradient Descent

Here they go on to describe an operator with the same fixed-point convergence, but using information obtained from not only the previous iterate, , as GD, but also . This operator they prove can obtained a lower-bound very close to the optimal lower bound for any FOM, which is substantially better than GD.

If I remember correctly, gradient descent methods such as Adam do make use of the iterate to provide an estimate for the second order derivative of the objective function, which seems somewhat related to this!