# Optimization

## Notation

• Support of a vector is denoted by

A vector is referred to as s-sparse if

• denotes the number of zero-entries in
• denote A's singular values
• Frobenius norm of

• Nuclear norm of

• Spectral norm of

• Random variables are denoted with upper-case letters

## Definitions

### Algebraic Riccati equation

The (algebraic) Riccati equation is problem of finding an unknown matrix in

• Continuous case:

• Discrete case:

where , , , and are known real coefficient matrices.

## Lagrange multipliers

### TODO Modern formulation via differentiable manifolds

#### Notation

• smooth manifold
• is a differentiable function we wish to maximize
• is a local chart

#### Stuff

By extension of Fermat's Theorem the local maxima are only points where the exterior derivative .

In particular, the extrema are a subset of the critical points of the function .

This is just a mapping from to

## Convex optimization

### Visualization

#### Convexity

• Interpolation in codomain > interpolation in domain
using Plots, LaTeXStrings, Printf

f(x) = (x - 1) * (x - 2)
x₀, x₁ = 0., 6.

anim = @animate for t = 0:0.01:1
z = (1 - t) * x₀ + t * x₁

p = plot(-2:0.05:7, f, label="f(x)", legend=:topleft)
# title!("Convexity: interpolation in codomain > interpolation in domain")
vline!([z], label="", linestyle=:dash)
scatter!([x₀, x₁], [f(x₀), f(x₁)], label="")
plot!([x₀, x₁], f.([x₀, x₁]),
label = L"(1 - t) f(x_0) + tf(x_1)",
annotations = [(x₀ - 0.2, f(x₀), Plots.text(L"f(x_0)", :right, 10)), (x₁ + 0.2, f(x₁), Plots.text(L"f(x_1)", :left, 10))])
scatter!([z], [f(z)],
label="",
annotations = [(z + 0.1, f(z), Plots.text(latexstring(@sprintf("f(%.2f x_0 + %.2f x_1)\\quad", (1 - t), t)), :left, 10))])
scatter!([z], [(1 - t) * f(x₀) + t * f(x₁)],
label = "",
annotations = [(z - 0.1, (1 - t) * f(x₀) + t * f(x₁), Plots.text(latexstring(@sprintf("%.2f f(x_0) + %.2f f(x_1)\\quad", (1 - t), t)), :right, 10))])
end

gif(anim, "test.gif")


### Definitions

A convex combination of a set of vectors in an arbitrary real space is a vector

where:

A convex set is a subset of an affince space that is closed under convex combinations.

More specifically, in a Euclidean space, a convex region is a region where, for every pair of points within the region, every point on the straight line segment that joins the pair of points is also within the region, i.e. we say the set is a convex set if and only if

The boundary of a convex set is always a convex curve, i.e. curve in the Euclidean plane which lies completely on one side of it's tangents.

The intersection of all convex sets containing a given subset of Euclidean space is called the convex hull of , i.e. the smallest convex set containing the subset .

It's like a "convex closure" of .

A continuously differentiable function is considered convex if

In more generality, a continuous function is convex if for all ,

where the subscript of emphasizes that this vector can depend on . Alternatively, if for all ,

i.e. linear interpolation in input-space is "worse" than linear interpolation in output-space.

A continuously differentiable function is considered β-strongly convex

Similarily to a convex function, we don't require to be differentiable. In more generality, a contiuous function is β-strongly convex if for all and for all we have nesterov2009primal

Contrasting with the definition of a convex function, the only difference is the additional term .

#### Quasi-convex and -concave

A quasiconvex function is a function such that the preimage is a convex set.

To be precise, a function defined on a convex set is quasiconvex if

for all and .

In words, is quasiconvex then it is always true that a point directly between two other points does not give a higher value of the function than both of the other points do.

A function is quasiconcave if we take the above definition and replace convex with concave and with .

### Simplex

A simplex is a generalization of the notion of a triangle or a tetrahedron to arbitrary dimensions.

Specifically, a k-simplex is a k-dimensional polytope which is the convex hull of its vertices.

More formally, suppose the points are affinely independent, which means are linearly independent. Then, the simplex determined by them is the set of points

#### Stuff

We say is a subgradient of at if

where replaces the gradient in the definition of a convex function.

Observe that the line is a global lower bound of .

We say a function is subdifferentiable at if there exists at least one subgradient at .

The set of all subgradients at is called the subdifferential: .

Some remarks:

• is convex and differentiable .
• Any point , there can be 0, 1, or infinitely many subgradients.
• is not convex.

If , then is a global minimizer of .

This is easily seen from definition of subgradient, since if , then , i.e. global minimizer.

In general there might exist multiple global minimizers.

1. Suppose is convex, and we start optimizing at
2. Repeat
1. Step in a negative subgradient direction:

where is the step-size and .

The following theorem tells us that the subgradient gets us closer to the minimizer.

Suppose is convex.

• Let for
• Let be any point for which

Then for small enough ,

i.e. it's a contraction mapping and it has a unique fixed point at

That is, the negative subgradient step gets us closer to minimizer.

Suppose is convex.

• Let for
• Let be any point for which

Then

where we've used the definition a subgradient to get the inequality above.

Now consider .

• It's a convex quadration (facing upwards)
• Has zeros at

• Therefore, due intermedite value theorem and convexity (facing upwards), we know the term above is negative if

Hence,

We then observe that if is Lipschitz, we have

with , then the subgradient descent using a decreasing step-size is a contraction mapping with a fixed point at ! Hence,

If we're using a fixed step-size, we cannot guarantee the term considered in the proof of subgradient descent decreasing the objective function is negative (since at some point we get too close to , and so the fixed we're using is not in the required interval).

That is why we use a decreasing step-size in the theorem above.

Using a fixed step-size, we still have the following bound

### Cutting-plane method

• Used in linear integer programming

Consider an integer linear programming problem in standard form as

#### Gomory's cut

A Gomory's cut can be described geometrically as follows:

1. Drop requirement for and solve resulting linear program
• Solution will be a vertex of the convex polytope consisting of all feasible points
2. If this vertex is not an integer point
1. Find a hyperplane with the vertex on one side and all feasible integer points on the other
2. Add this as an additional linear constraint to exclude the vertex found
3. Solve resulting linear program
4. Goto Step 2.0
3. Fin

One of my first thoughts regarding this was "But how the heck can you separate this vertex from all feasible points with a , since lin. prog. can't handle strict inequalities?!".

Solution: it says separating the vertex from all feasible INTEGER points, duh. This is definitively doable if the vertex is non-integer! Great:) Meza happy.

## Non-convex optimization

### Convex Relaxation

• Relax problem such that we end up with a convex problem
• E.g. relaxation of sparse linear regression, you end up with Lasso regression
• It's really regression, but as you can read on that link, it can be viewed as a approximation to the non-convex "norm", (it's not really a norm)
• Can sometimes talk about the relaxation gap, which tells how different the optimal convex solution is from the non-convex
• Relaxation-approach can often be very hard to scale

## Under uncertainty

### Stochastic programming

#### Stochastic Linear Programming

A two-stage stochastic linear program is usually formulated as

The first stage of the problem is the minimization above, which is performed prior to the realization of the uncertain parameter .

The second stage is finding the variables .

• Class of dynamic optimization problems
• Related to Kalman filter:
• Recursive formulations of linear-quadratic control problems and Kalman filtering both involve matrix Riccati equations
• Classical formulations of linear control and linear filtering problems make use of similar matrix decompositions
• "Linear" part refeers refers to the "motion" of the state
• "Quadratic" refers to the preferences

TODO

## Topics in Convex optimization

### Lecture 1

#### Notation

• Black-box model: given input , a "oracle" returns
• Complexity of the algorithm is the number of queries made to the black-box oracle
• Consider Lipschitz functions wrt. infinity-norm, i.e.

• Goal: find s.t.

#### Stuff

There is an algorithm with complexity that can achieve the "goal" for all .

Grid search! Let be the grid points, where

is then the number of grid points.

Query at the grid points . Find grid point that achives the minimum of on .

Let be the closest grid point to

For any algo that acheives the "goal" for all , there exists an s.t. on , the algo will require at least .

• is such that

#### Resources

• [BROKEN LINK: No match for fuzzy expression: *GP%20&%20Kernel%20stuff]

#### Setup

Suppose we want to solve the system of linear equations

where

• is symmetric and positive-definitive

As it turns out, since is symmetric and positive-definite we can define an inner product wrt. :

We say two vectors are conjugate if they are orthogonal wrt. inner product wrt. , i.e. if

#### Method (direct)

Let

be a set of mutually orthogonal vectors (wrt. ), thus forms a basis of .

We can therefore express in this basis:

Let's consider left-multiplication by :

LHS can be written

where we've let denote the norm induced by the inner-product wrt. . Therefore:

Great! These solutions are exact, so why the heck did someone put the term "gradient" into the name?!

Because you might be satisfied with an approximation to the method if this would mean you don't have to find all of these "conjugate" / orthogonal (wrt. ) vectors! This motivatives writing it as a iterative method, which we can terminate when we get some near the true solution.

#### Iterative

Observe that is actually the unique minimizer of the quadratic

since

The fact that is a unique solution can then be seen by the fact that it's hessian

is symmetric positive-definite.

This implies uniqueness of solution only if we can show that

Since is positive-definite, we have

means that all . <= NOT TRUE.

Suppose s.t. , which also implies by symmetry of . Then if and for , we have

If we let , it is at least apparent that the diagonals of has to be non-negative. What about when ? Well, we then need

Since are arbitrary, we require this to hold for all s.t. .

Great; that's all we need! Since for a symmetric real matrix we can do the eigenvalue decomposition of , i.e.

where is a diagonal matrix with eigenvalues of as its entries, and is an orthonormal matrix. Thus

where we've let . We can equivalently consider the case where and for . This then proves that the eigenvalues of a symmetric positive-definite matrix are all positive.

### Random optimization

#### Idea

where

• is a random vector distributed, say, uniformly over the unit sphere

This converges under the assumption that .

## Mirrored descent

Suppose we want to minimize regret

where

• is a random variable indicating which index we chose as time
• indicating the loss of 0-1 loss of choosing the corresponding index

One approach is to solving this problem is to

1. Define a discrete distribution over the indices , i.e. .
2. Use gradient descent to find the optimal weights

TODO: watch first lecture to ensure that you understand the regret-bound for gradient descent

Doing the above naively, leads to the following bound

which unfortunately scales with dimensionality in .

Mirrored descent takes advantage of the following idea:

• If we suppose there's one true optimal arm, say , then the resulting solution should be "sparse" in the sense that it should only have one non-zero entry, i.e. and for .
• Equivalently, will be a vertex in the simplex
• Naive gradient descent does not make use of this information → want to incorporate this somehow

There's an alternative approach: Franke-Wolfe is a convex-optimization algorithm which aims to point the gradient steps towards verticies.

This works very well for static problems, but in this formulation the function we're evaluating is allowed to change wrt. , and apparently this is no good for Franke-Wolfe.