# Linear Algebra

## Definitions

### Ill-conditioned matrix

A matrix $A$ is ill-conditioned if small changes in its entries can produce large changes in the solutions to $A\mathbf{x} = \mathbf{b}$. If small changes in the entries of $A$ produce small changes in the solutions to $A\mathbf{x} = \mathbf{b}$, then $A$ is called well-conditioned.

### Singular values

For any $m \times n$ matrix $A$, the $n \times n$ matrix $A^T A$ is symmetric and thus orthogonally diagonalizable by the Spectral Theorem.

The eigenvalues of $A^T A$ are real and non-negative. Therefore,

$$0 \le \|A\mathbf{v}\|^2 = (A\mathbf{v})^T (A\mathbf{v}) = \mathbf{v}^T A^T A \mathbf{v} = \lambda \mathbf{v}^T \mathbf{v} = \lambda$$

where $\mathbf{v}$ is a unit eigenvector of $A^T A$ with eigenvalue $\lambda$. Hence, we can take the square root of all these eigenvalues, and define what we call singular values:

If $A$ is an $m \times n$ matrix, the singular values of $A$ are the square roots of the eigenvalues of $A^T A$ and are denoted $\sigma_1, \sigma_2, \dots, \sigma_n$.

By convention we order the singular values s.t. $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n$.
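
As a quick numerical sanity check, the following sketch (using NumPy; the matrix `A` is an arbitrary example, not from the text) recovers the singular values as square roots of the eigenvalues of $A^T A$ and compares them with `np.linalg.svd`:

```python
import numpy as np

# Arbitrary example matrix, chosen only for illustration
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Eigenvalues of the symmetric matrix A^T A (eigvalsh is for symmetric input)
eigvals = np.linalg.eigvalsh(A.T @ A)

# Singular values: square roots of those eigenvalues, in descending order
sigma = np.sqrt(np.sort(eigvals)[::-1])

# They agree with NumPy's SVD routine
assert np.allclose(sigma, np.linalg.svd(A, compute_uv=False))
```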

### Cholesky decomposition

The Cholesky decomposition or Cholesky factorization is a decomposition of a Hermitian positive-definite matrix $A$ into the product of a lower triangular matrix and its conjugate transpose:

$$A = L L^*$$

where $L$ is lower-triangular.

Every Hermitian positive-definite matrix has a unique Cholesky decomposition (with the convention that the diagonal entries of $L$ are positive).

Without that convention, a real Cholesky factor is unique up to the signs of its columns.

Suppose $L$ is a Cholesky factor of some matrix $A$. The matrix components are then given by

$$A_{ij} = \sum_k L_{ik} L_{jk}$$

So by scaling the $k$-th column of $L$ by $-1$ we get

$$(-L_{ik})(-L_{jk}) = L_{ik} L_{jk}$$

so $A$ is unchanged, i.e. the sign doesn't matter :)

Sidenote: summing all the way to $n$ is redundant, because $L$ is lower-triangular and so every term with index $k$ larger than $\min(i, j)$ vanishes.
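
A small NumPy sketch of the sign-flip argument (the matrix is an arbitrary positive-definite example):

```python
import numpy as np

# Arbitrary symmetric positive-definite example matrix
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

L = np.linalg.cholesky(A)       # lower-triangular, A = L @ L.T
assert np.allclose(L @ L.T, A)

# Flipping the sign of any column of L leaves A = L L^T unchanged,
# since A_ij = sum_k L_ik L_jk and (-L_ik)(-L_jk) = L_ik L_jk
L_flipped = L.copy()
L_flipped[:, 1] *= -1
assert np.allclose(L_flipped @ L_flipped.T, A)
```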

## Theorems

### Spectral Theorem

Let $A$ be an $n \times n$ real matrix. Then $A$ is symmetric iff it is orthogonally diagonalizable.

### Singular Value Decomposition

Let $A$ be an $m \times n$ matrix with singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$ and $\sigma_{r+1} = \sigma_{r+2} = \dots = \sigma_n = 0$. Then there exists:

• an $m \times m$ orthogonal matrix $U$
• an $n \times n$ orthogonal matrix $V$
• an $m \times n$ "diagonal" matrix $\Sigma = \begin{bmatrix} D & O \\ O & O \end{bmatrix}$, where $D = \mathrm{diag}(\sigma_1, \dots, \sigma_r)$

such that:

$$A = U \Sigma V^T$$

See page 615 of "Linear Algebra: A Modern Introduction" :)

#### Constructing V

To construct the orthogonal matrix $V$, we first find an orthonormal basis $\{\mathbf{v}_1, \dots, \mathbf{v}_n\}$ for $\mathbb{R}^n$ consisting of eigenvectors of the symmetric $n \times n$ matrix $A^T A$. Then

$$V = \begin{bmatrix} \mathbf{v}_1 & \mathbf{v}_2 & \cdots & \mathbf{v}_n \end{bmatrix}$$

is an orthogonal matrix.

#### Constructing U

For the orthogonal matrix $U$, we first note that $\{A\mathbf{v}_1, \dots, A\mathbf{v}_r\}$ is an orthogonal set of vectors in $\mathbb{R}^m$ (we're NOT saying they form a basis for $\mathbb{R}^m$!). To see this, suppose that $\mathbf{v}_j$ is an eigenvector of $A^T A$ corresponding to the eigenvalue $\lambda_j$. Then, for $i \ne j$, we have

$$(A\mathbf{v}_i) \cdot (A\mathbf{v}_j) = \mathbf{v}_i^T A^T A \mathbf{v}_j = \lambda_j (\mathbf{v}_i \cdot \mathbf{v}_j) = 0$$

since the eigenvectors are orthogonal.

Since $\|A\mathbf{v}_i\| = \sigma_i$, we can normalize $A\mathbf{v}_1, \dots, A\mathbf{v}_r$ by

$$\mathbf{u}_i = \frac{1}{\sigma_i} A\mathbf{v}_i, \quad i = 1, \dots, r$$

This guarantees that $\{\mathbf{u}_1, \dots, \mathbf{u}_r\}$ is an orthonormal set in $\mathbb{R}^m$, but if $r < m$ it's not a basis. In this case, we extend it to an orthonormal basis $\{\mathbf{u}_1, \dots, \mathbf{u}_m\}$ for $\mathbb{R}^m$ by some technique, e.g. the Gram-Schmidt Process, giving us

$$U = \begin{bmatrix} \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_m \end{bmatrix}$$
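
The construction of $V$ and $U$ above can be sketched in NumPy as follows (the $3 \times 2$ matrix is an arbitrary full-column-rank example, so $r = n$):

```python
import numpy as np

# Arbitrary 3x2 example with full column rank (r = n = 2)
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

# Step 1 (V): orthonormal eigenvectors of A^T A, sorted by descending eigenvalue
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
sigma = np.sqrt(lam)

# Step 2 (U): u_i = (1 / sigma_i) * A v_i for each nonzero singular value
U_r = (A @ V) / sigma            # broadcasting divides column i by sigma_i

# {u_1, ..., u_r} is orthonormal (here r = 2 < m = 3, so not a basis of R^3)
assert np.allclose(U_r.T @ U_r, np.eye(2))

# The first r columns already suffice to rebuild A
assert np.allclose(U_r @ np.diag(sigma) @ V.T, A)
```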

#### Outer Product Form of the SVD

Let $A$ be an $m \times n$ matrix with singular values $\sigma_1 \ge \dots \ge \sigma_r > 0$ and $\sigma_{r+1} = \dots = \sigma_n = 0$, then

$$A = \sigma_1 \mathbf{u}_1 \mathbf{v}_1^T + \sigma_2 \mathbf{u}_2 \mathbf{v}_2^T + \dots + \sigma_r \mathbf{u}_r \mathbf{v}_r^T$$
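
A minimal NumPy check of the outer product form (arbitrary example matrix):

```python
import numpy as np

# A equals the sum of r rank-one matrices sigma_i * u_i v_i^T
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-12))       # rank = number of nonzero singular values

A_sum = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))
assert np.allclose(A_sum, A)
```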

#### Geometric insight into the effect of matrix transformations

Let $A$ be an $m \times n$ matrix with rank $r$. Then the image of the unit sphere in $\mathbb{R}^n$ under the matrix transformation that maps $\mathbf{x}$ to $A\mathbf{x}$ is

1. the surface of an ellipsoid in $\mathbb{R}^m$ if $r = n$
2. a solid ellipsoid in $\mathbb{R}^m$ if $r < n$

We have $A = U \Sigma V^T$ with $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$.

Let $\|\mathbf{x}\| = 1$ (i.e. unit vector = surface of unit sphere) in $\mathbb{R}^n$. $V$ is orthogonal, thus $V^T$ is orthogonal, and then $\mathbf{y} = V^T \mathbf{x}$ is a unit vector too.

From the Outer Product Form we have:

$$A\mathbf{x} = \sigma_1 \mathbf{u}_1 \mathbf{v}_1^T \mathbf{x} + \dots + \sigma_r \mathbf{u}_r \mathbf{v}_r^T \mathbf{x} = \sigma_1 y_1 \mathbf{u}_1 + \dots + \sigma_r y_r \mathbf{u}_r$$

where we are letting $y_i$ denote the scalar $\mathbf{v}_i^T \mathbf{x} = \mathbf{v}_i \cdot \mathbf{x}$.

(1) If $r = n$, then we must have $n \le m$ and

$$A\mathbf{x} = \sigma_1 y_1 \mathbf{u}_1 + \dots + \sigma_n y_n \mathbf{u}_n$$

Since $V^T$ is orthogonal, we have $\|\mathbf{y}\| = \|\mathbf{x}\| = 1$, thus

$$\frac{(\sigma_1 y_1)^2}{\sigma_1^2} + \dots + \frac{(\sigma_n y_n)^2}{\sigma_n^2} = y_1^2 + \dots + y_n^2 = 1$$

which shows that the vectors $A\mathbf{x}$ form the surface of an ellipsoid in $\mathbb{R}^m$!

(2) If $r < n$, swapping equality with inequality (but performing the same steps):

$$\frac{(\sigma_1 y_1)^2}{\sigma_1^2} + \dots + \frac{(\sigma_r y_r)^2}{\sigma_r^2} = y_1^2 + \dots + y_r^2 \le 1$$

where the inequality corresponds to a solid ellipsoid in $\mathbb{R}^m$!

Furthermore, we can view each of the matrix transformations of the SVD separately:

1. $V^T$ is an orthogonal matrix, thus it maps the unit sphere onto itself (but in a different basis)
2. The $m \times n$ matrix $\Sigma$ first collapses $n - r$ dimensions of the unit sphere, leaving a unit sphere of $r$ dimensions, and then "distorts" it into an ellipsoid
3. $U$ is orthogonal, aligning the axes of this ellipsoid with the orthonormal basis vectors $\mathbf{u}_1, \dots, \mathbf{u}_r$ in $\mathbb{R}^m$
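
The ellipse equation from case (1) can be verified numerically; in this sketch the $2 \times 2$ matrix is an arbitrary full-rank example:

```python
import numpy as np

# Image of the unit circle under a rank-2 matrix: writing Ax in the
# u-basis, the coordinates satisfy the ellipse equation from case (1)
A = np.array([[2.0, 1.0],
              [0.0, 1.0]])       # arbitrary full-rank example, r = n = 2

U, s, Vt = np.linalg.svd(A)

theta = np.linspace(0.0, 2.0 * np.pi, 100)
X = np.stack([np.cos(theta), np.sin(theta)])   # unit vectors as columns
C = U.T @ (A @ X)                              # coordinates of Ax in the u-basis

# (c_1 / sigma_1)^2 + (c_2 / sigma_2)^2 = 1 for every point of the image
assert np.allclose(np.sum((C / s[:, None]) ** 2, axis=0), 1.0)
```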

### Least squares solution for dependent column-vectors

The least squares problem $A\mathbf{x} = \mathbf{b}$ has a unique least squares solution $\bar{\mathbf{x}}$ of minimal length that is given by:

$$\bar{\mathbf{x}} = A^+ \mathbf{b}$$

where $A^+$ is the pseudoinverse of $A$.

See page 627 of "Linear Algebra: A Modern Introduction" :)
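
A NumPy sketch of this theorem, using an arbitrary rank-deficient matrix with dependent columns (`np.linalg.pinv` computes the pseudoinverse):

```python
import numpy as np

# Dependent columns (column 3 = column 1 + column 2), so least squares
# solutions are not unique; the pseudoinverse picks the minimal-length one
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 0.0])

x_min = np.linalg.pinv(A) @ b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]   # lstsq also returns it
assert np.allclose(x_min, x_lstsq)

# Adding a null-space vector gives another least squares solution ...
n = np.array([1.0, 1.0, -1.0])                   # A @ n = 0
assert np.allclose(A @ n, 0.0)
assert np.allclose(A @ (x_min + n), A @ x_min)

# ... but any such solution is longer than x_min
assert np.linalg.norm(x_min) < np.linalg.norm(x_min + n)
```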

### Determinant

#### Trace is the sum of the eigenvalues

Let $A$ be an $n \times n$ matrix, then

$$\operatorname{tr}(A) = \sum_{i=1}^n \lambda_i$$

We then notice that

$$\det(A - \lambda I) = (-1)^n \lambda^n + (-1)^{n-1} (a_{11} + a_{22} + \dots + a_{nn}) \lambda^{n-1} + \dots$$

which can be seen by expanding the determinant along the first row:

$$\det(A - \lambda I) = (a_{11} - \lambda) \det(M_{11}) - a_{12} \det(M_{12}) + \dots$$

Here we see that the only "sub-determinant" which contains a factor of $\lambda^{n-1}$ is going to be the very first one, which is multiplied by $(a_{11} - \lambda)$, since each of the other "sub-determinants" will not have a coefficient containing $\lambda$, and will contain neither $(a_{11} - \lambda)$ nor $(a_{jj} - \lambda)$ for its column $j$, hence at most contain a factor of $\lambda^{n-2}$.

Further, since

$$\det(A - \lambda I) = (\lambda_1 - \lambda)(\lambda_2 - \lambda) \cdots (\lambda_n - \lambda)$$

we observe that the coefficient of $\lambda^{n-1}$ is $(-1)^{n-1} (\lambda_1 + \dots + \lambda_n)$, which we also know to be the $(-1)^{n-1} (a_{11} + \dots + a_{nn})$ from above, i.e.

$$\sum_{i=1}^n \lambda_i = a_{11} + \dots + a_{nn} = \operatorname{tr}(A)$$

as wanted.

#### Determinant is the product of the eigenvalues

Let $A$ be an $n \times n$ matrix, then

$$\det(A) = \prod_{i=1}^n \lambda_i$$

To see this, note that

$$\det(A - \lambda I) = 0$$

defines the characteristic equation, which has the eigenvalues as its roots, therefore

$$\det(A - \lambda I) = (\lambda_1 - \lambda)(\lambda_2 - \lambda) \cdots (\lambda_n - \lambda)$$

Since $\lambda$ is just a variable, we let $\lambda = 0$ and get

$$\det(A) = \lambda_1 \lambda_2 \cdots \lambda_n$$

as wanted.
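
Both identities are easy to check numerically (arbitrary symmetric example, so the eigenvalues are real):

```python
import numpy as np

# Arbitrary symmetric example matrix
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

eig = np.linalg.eigvalsh(A)
assert np.isclose(np.trace(A), eig.sum())        # tr(A) = sum of eigenvalues
assert np.isclose(np.linalg.det(A), eig.prod())  # det(A) = product of eigenvalues
```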

### Cramer's Rule

Consider the system of $n$ linear equations in $n$ unknowns, written in matrix form as

$$A\mathbf{x} = \mathbf{b}$$

where $A$ has nonzero determinant. Then

$$x_i = \frac{\det(A_i)}{\det(A)}$$

where $A_i$ is the matrix formed by replacing the $i$-th column of $A$ by the column vector $\mathbf{b}$.
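
A direct, illustrative implementation of Cramer's rule (not how you'd solve systems in practice, since it computes $n + 1$ determinants):

```python
import numpy as np

def cramer_solve(A, b):
    """Solve Ax = b by Cramer's rule: x_i = det(A_i) / det(A)."""
    det_A = np.linalg.det(A)
    x = np.empty(len(b))
    for i in range(len(b)):
        A_i = A.copy()
        A_i[:, i] = b            # replace the i-th column of A by b
        x[i] = np.linalg.det(A_i) / det_A
    return x

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
assert np.allclose(cramer_solve(A, b), np.linalg.solve(A, b))
```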

## The reason why we can think of OLS as projection

The proof of the following theorem is really important in my opinion, as it provides an "intuitive" way of viewing least-squares approximations.

The Best Approximation Theorem says the following:

If $W$ is a finite-dimensional subspace of an inner product space $V$ and if $\mathbf{v}$ is a vector in $V$, then $\operatorname{proj}_W(\mathbf{v})$ is the best approximation to $\mathbf{v}$ in $W$.

Let $\mathbf{w}$ be a vector in $W$ different from $\operatorname{proj}_W(\mathbf{v})$. Then $\operatorname{proj}_W(\mathbf{v}) - \mathbf{w}$ is also in $W$, so $\mathbf{v} - \operatorname{proj}_W(\mathbf{v})$ is orthogonal to $\operatorname{proj}_W(\mathbf{v}) - \mathbf{w}$.

Pythagoras' Theorem now implies that

$$\|\mathbf{v} - \mathbf{w}\|^2 = \|\mathbf{v} - \operatorname{proj}_W(\mathbf{v})\|^2 + \|\operatorname{proj}_W(\mathbf{v}) - \mathbf{w}\|^2$$

But $\|\operatorname{proj}_W(\mathbf{v}) - \mathbf{w}\|^2 > 0$, so

$$\|\mathbf{v} - \operatorname{proj}_W(\mathbf{v})\|^2 < \|\mathbf{v} - \mathbf{w}\|^2$$

i.e. $\|\mathbf{v} - \operatorname{proj}_W(\mathbf{v})\| < \|\mathbf{v} - \mathbf{w}\|$, and thus $\operatorname{proj}_W(\mathbf{v})$ is the best approximation.
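
The proof above is exactly why OLS fitted values are a projection; a NumPy sketch with an arbitrary small design matrix:

```python
import numpy as np

# OLS fitted values are the projection of b onto col(A)
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])

x_hat = np.linalg.lstsq(A, b, rcond=None)[0]
b_hat = A @ x_hat

# Projection matrix onto col(A): P = A (A^T A)^{-1} A^T
P = A @ np.linalg.inv(A.T @ A) @ A.T
assert np.allclose(P @ b, b_hat)

# The residual b - b_hat is orthogonal to col(A), as the proof requires
assert np.allclose(A.T @ (b - b_hat), 0.0)
```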

## Distance and Approximation

### Applications

#### Function-approximation

Given a continuous function $f$ on an interval $[a, b]$ and a subspace $W$ of $\mathscr{C}[a, b]$, find the function "closest" to $f$ in $W$.

• $\mathscr{C}[a, b]$ denotes the space of all continuous functions defined on the domain $[a, b]$
• $W$ is a subspace of $\mathscr{C}[a, b]$

The problem is analogous to least squares fitting of data points, except we have infinitely many data points - namely, the points in the graph of the function $f$.

What should "approximate" mean in this context? Best Approximation Theorem has the answer!

The given function $f$ lives in the vector space $\mathscr{C}[a, b]$ of continuous functions on the interval $[a, b]$. This is an inner product space, with inner product:

$$\langle f, g \rangle = \int_a^b f(x) g(x) \, dx$$

If $W$ is a finite-dimensional subspace of $\mathscr{C}[a, b]$, then the best approximation to $f$ in $W$ is given by the projection $g = \operatorname{proj}_W(f)$ of $f$ onto $W$.

Furthermore, if $\{g_1, \dots, g_k\}$ is an orthogonal basis for $W$, then

$$\operatorname{proj}_W(f) = \frac{\langle f, g_1 \rangle}{\langle g_1, g_1 \rangle} g_1 + \dots + \frac{\langle f, g_k \rangle}{\langle g_k, g_k \rangle} g_k$$

The error from this approximation is just

$$\|f - g\| = \sqrt{\int_a^b \big(f(x) - g(x)\big)^2 \, dx}$$

and is often called the root mean square error. For functions we can think of this as a measure of the area between the graphs.
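
A sketch of the projection formula in action, with the inner-product integrals approximated by Riemann sums: we project $f(x) = x^2$ onto $W = \operatorname{span}\{1, x\}$ on $[0, 1]$ using the orthogonal basis $\{1, x - 1/2\}$ (Gram-Schmidt applied to $\{1, x\}$), for which the best approximation is known to be $x - 1/6$:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200_001)

def inner(u, v):
    # ~ integral of u*v over [0, 1]; the mean works since the interval has length 1
    return np.mean(u * v)

f = x**2
g1 = np.ones_like(x)
g2 = x - 0.5                      # orthogonal to g1 on [0, 1]

proj = inner(f, g1) / inner(g1, g1) * g1 + inner(f, g2) / inner(g2, g2) * g2

# The best L2 linear approximation of x^2 on [0, 1] is x - 1/6
assert np.allclose(proj, x - 1.0 / 6.0, atol=1e-4)
```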

#### Fourier approximation

A function of the form

$$p(x) = a_0 + a_1 \cos x + \dots + a_n \cos nx + b_1 \sin x + \dots + b_n \sin nx$$

is called a trigonometric polynomial.

Restricting our attention to the space of continuous functions on the interval $[-\pi, \pi]$, i.e. $\mathscr{C}[-\pi, \pi]$, with the inner product

$$\langle f, g \rangle = \int_{-\pi}^{\pi} f(x) g(x) \, dx$$

and since the trigonometric polynomials are linear combinations of the set

$$B = \{1, \cos x, \dots, \cos nx, \sin x, \dots, \sin nx\}$$

the best approximation to a function $f$ in $\mathscr{C}[-\pi, \pi]$ by trigonometric polynomials of order $n$ is therefore $\operatorname{proj}_W(f)$, where $W = \operatorname{span}(B)$. It turns out that $B$ is an orthogonal set, hence a basis for $W$.

The approximation $\operatorname{proj}_W(f)$ is called the n-th order Fourier approximation to $f$ on $[-\pi, \pi]$, with the coefficients known as the Fourier coefficients of $f$.
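
A numerical sketch computing Fourier coefficients of $f(x) = x$ via the projection formula, with the integrals approximated by Riemann sums (the closed forms $a_k = 0$, $b_k = 2(-1)^{k+1}/k$ for this $f$ are standard):

```python
import numpy as np

# a_k = <f, cos kx> / <cos kx, cos kx>,  b_k = <f, sin kx> / <sin kx, sin kx>
x = np.linspace(-np.pi, np.pi, 400_001)
dx = x[1] - x[0]
f = x

def inner(u, v):
    return np.sum(u * v) * dx     # crude Riemann approximation of the integral

for k in range(1, 4):
    a_k = inner(f, np.cos(k * x)) / inner(np.cos(k * x), np.cos(k * x))
    b_k = inner(f, np.sin(k * x)) / inner(np.sin(k * x), np.sin(k * x))
    assert abs(a_k) < 1e-6                               # f is odd
    assert abs(b_k - 2.0 * (-1.0) ** (k + 1) / k) < 1e-3
```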

## Rayleigh principle

A lot of the stuff here I got from these lecture notes.

### Notation

• $R(M, x)$ denotes the Rayleigh quotient of some Hermitian / real symmetric matrix $M$
• $\Lambda$ is the diagonal matrix with entries being the eigenvalues of $M$
• Eigenvalues are ordered as $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$
• $Q$ is the orthogonal matrix with eigenvectors of $M$ as columns
• $M = Q \Lambda Q^T$, thus $\Lambda = Q^T M Q$

### Definition

For a given complex Hermitian matrix $M$ and non-zero vector $x$, the Rayleigh quotient is defined as:

$$R(M, x) = \frac{x^* M x}{x^* x}$$

In the case where $M$ and $x$ are real, and $M$ is symmetric, then

$$R(M, x) = \frac{x^T M x}{x^T x}$$

Further, in general,

$$R(M, B; x) = \frac{x^* M x}{x^* B x}$$

is the generalized Rayleigh quotient, where $B$ is a positive-definite Hermitian matrix.

The generalized Rayleigh quotient can be reduced to $R(D, C^* x)$ through the transformation

$$D = C^{-1} M (C^*)^{-1}$$

where $C C^*$ is the Cholesky decomposition of the Hermitian positive-definite matrix $B$.
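
A NumPy sketch of this reduction in the real symmetric case (arbitrary example matrices):

```python
import numpy as np

# Reduce R(M, B; x) = (x^T M x) / (x^T B x) to an ordinary Rayleigh quotient
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.array([[4.0, 1.0],
              [1.0, 2.0]])         # symmetric positive-definite

C = np.linalg.cholesky(B)           # B = C C^T
C_inv = np.linalg.inv(C)
D = C_inv @ M @ C_inv.T             # D = C^{-1} M C^{-T}

x = np.array([1.0, 2.0])
y = C.T @ x                         # the transformed vector

generalized = (x @ M @ x) / (x @ B @ x)
ordinary = (y @ D @ y) / (y @ y)
assert np.isclose(generalized, ordinary)
```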

### Minimization

#### Personal explanation

At an extremum we must have

$$\nabla_x R(M, x) = 0$$

But observe that for whatever $x$ we have here, we have

$$R(M, cx) = \frac{(cx)^T M (cx)}{(cx)^T (cx)} = \frac{c^2 \, x^T M x}{c^2 \, x^T x} = R(M, x)$$

i.e. it's invariant to scaling.

Further, since $M$ is symmetric, we have the diagonalization

$$M = Q \Lambda Q^T$$

which means that if we let

$$x = Q y$$

for some $y$, we get

$$R(M, x) = \frac{y^T Q^T M Q y}{y^T Q^T Q y} = \frac{y^T \Lambda y}{y^T y} = \sum_{i=1}^n \lambda_i y_i^2$$

where we've applied the constraint $y^T y = 1$, which we can since, again, the quotient is invariant under scaling.

Therefore, we're left with the following optimization problem:

$$\min_y \sum_{i=1}^n \lambda_i y_i^2 \quad \text{s.t.} \quad \sum_{i=1}^n y_i^2 = 1$$

Since we assume that $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$, we clearly obtain the minimum when $y = e_1$.

Hence,

$$\min_{x \ne 0} R(M, x) = \lambda_1, \quad \text{attained at } x = Q e_1 = q_1$$

where $q_1$ denotes the corresponding eigenvector.

Now suppose we'd like to find the minimum over vectors NOT in the space spanned by the first eigenvector; then we only consider the subspace

$$\{x : q_1^T x = 0\}$$

Thus, the optimization problem is

$$\min_{x \ne 0} R(M, x) \quad \text{s.t.} \quad q_1^T x = 0$$

Finally, observe that $q_1^T x = q_1^T Q y = e_1^T y = y_1$, hence the constraint becomes $y_1 = 0$.

And the subspace is just all $y$ with $y_1 = 0$, giving us

$$\min_{y : y_1 = 0} \sum_{i=1}^n \lambda_i y_i^2 = \lambda_2, \quad \text{attained at } y = e_2, \text{ i.e. } x = q_2$$

And if we just keep going, heyooo, we get all our eigenvalues and eigenvectors!
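
The claims above can be checked numerically (arbitrary symmetric example; `np.linalg.eigh` returns eigenvalues in ascending order):

```python
import numpy as np

M = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])
lam, Q = np.linalg.eigh(M)           # ascending eigenvalues, eigenvectors as columns

def rayleigh(x):
    return (x @ M @ x) / (x @ x)

# The quotient never drops below lambda_1 on random samples ...
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 3))
assert all(rayleigh(x) >= lam[0] - 1e-12 for x in samples)

# ... and the bounds are attained at the eigenvectors q_1 and q_2
assert np.isclose(rayleigh(Q[:, 0]), lam[0])
assert np.isclose(rayleigh(Q[:, 1]), lam[1])

# Orthogonal to q_1, the quotient never drops below lambda_2
for x in samples:
    x_perp = x - (Q[:, 0] @ x) * Q[:, 0]
    assert rayleigh(x_perp) >= lam[1] - 1e-12
```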

#### Stuff

We're going to assume $M$ is real.

Our objective is:

$$\min_{x \ne 0} \frac{x^T M x}{x^T x}$$

Observe the following:

• The objective is invariant to scaling, since we'd be scaling the denominator too
• $M$ is by assumption symmetric: all real eigenvalues

Thus, we have

$$M = Q \Lambda Q^T$$

Further,

$$x^T M x = x^T Q \Lambda Q^T x = (Q^T x)^T \Lambda (Q^T x)$$

Similarly,

$$x^T x = x^T Q Q^T x = (Q^T x)^T (Q^T x)$$

Thus, by letting $y = Q^T x$, we get

$$\frac{x^T M x}{x^T x} = \frac{y^T \Lambda y}{y^T y}$$

Since this quotient is invariant under scaling, we frame it as follows:

$$\min_y y^T \Lambda y \quad \text{s.t.} \quad y^T y = 1$$

Implying that $\min_y y^T \Lambda y = \lambda_1$ (seen from $y^T \Lambda y = \sum_i \lambda_i y_i^2$ and $\sum_i y_i^2 = 1$), since we assume the eigenvalues to be ordered.

Hence, we have proved what's called the Rayleigh Principle:

$$\min_{x \ne 0} \frac{x^T M x}{x^T x} = \lambda_1$$

where $\lambda_1$ is the smallest eigenvalue of $M$.

#### Planar Slices

Suppose we'd like to find the answer to the following:

$$\min_{x : b^T x = 0} \frac{x^T M x}{x^T x}$$

i.e. minimizing the Rayleigh quotient over some hyperplane defined by a normal vector $b$.

This minimum is bounded below as follows:

$$\min_{x : b^T x = 0} \frac{x^T M x}{x^T x} \ge \min_{x \ne 0} \frac{x^T M x}{x^T x} = \lambda_1$$

Further, we can obtain an upper bound for this by considering $y = Q^T x$ again:

$$\min_{y : c^T y = 0} \frac{y^T \Lambda y}{y^T y}, \quad \text{where } c = Q^T b$$

By scaling we can assume $y^T y = 1$, giving us the optimization problem

$$\min_y y^T \Lambda y \quad \text{s.t.} \quad c^T y = 0, \quad y^T y = 1$$

There is at least one feasible vector of the form

$$y = (y_1, y_2, 0, \dots, 0)^T$$

If $y = (y_1, y_2, 0, \dots, 0)^T$, then the above translates to

$$\min \ \lambda_1 y_1^2 + \lambda_2 y_2^2 \quad \text{s.t.} \quad c_1 y_1 + c_2 y_2 = 0, \quad y_1^2 + y_2^2 = 1$$

Assuming $(c_1, c_2) \ne (0, 0)$ ($y = 0$ does not satisfy the second constraint), the first constraint can be represented as:

$$c_1 y_1 + c_2 y_2 = 0$$

which is just a line through the origin. The second constraint is the unit circle. Thus, there are only two points where both these constraints are satisfied (a line through the origin crosses the circle in two places). Let $z = (z_1, z_2, 0, \dots, 0)^T$ be one of these points of intersection; then

$$\min_{x : b^T x = 0} \frac{x^T M x}{x^T x} \le \lambda_1 z_1^2 + \lambda_2 z_2^2 \le \lambda_2 (z_1^2 + z_2^2) = \lambda_2$$

Which means we have bounded the minimum of $R(M, x)$ on the hyperplane $b^T x = 0$:

$$\lambda_1 \le \min_{x : b^T x = 0} \frac{x^T M x}{x^T x} \le \lambda_2$$

This becomes incredibly useful when we have a look at the Minimax Principle for the Second Smallest Eigenvalue.
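
A numerical sketch of this bound: the constrained minimum over a hyperplane equals the smallest eigenvalue of $N^T M N$, where the columns of $N$ form an orthonormal basis for the hyperplane (here obtained from a full QR factorization of $b$; the diagonal $M$ is an arbitrary example, so its eigenvalues are visible directly):

```python
import numpy as np

M = np.diag([1.0, 2.0, 5.0])         # eigenvalues 1 <= 2 <= 5

rng = np.random.default_rng(1)
for _ in range(20):
    b = rng.normal(size=3)
    # Orthonormal basis N for the hyperplane {x : b^T x = 0}
    Q_full, _ = np.linalg.qr(b.reshape(3, 1), mode='complete')
    N = Q_full[:, 1:]                # columns orthogonal to b
    # min of R(M, x) over the hyperplane = smallest eigenvalue of N^T M N
    constrained_min = np.linalg.eigvalsh(N.T @ M @ N)[0]
    assert 1.0 - 1e-9 <= constrained_min <= 2.0 + 1e-9   # lambda_1 <= min <= lambda_2
```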

#### Minimax Principle for the Second Smallest Eigenvalue

This is a slightly different way of viewing the search for the second eigenvalue, and poses the "answer" in a frame where we might have asked the following question:

"Can we obtain $\lambda_2$ without finding $q_1$?"

which is very useful computationally, as we can alternate between minimizing and maximizing to obtain $\lambda_2$.

What if we could attain the upper bound $\lambda_2$ in the inequality above? Then we would know that maximizing $\min_{x : b^T x = 0} R(M, x)$ over the normal vector $b$ gives us the second smallest eigenvalue $\lambda_2$!

Or equivalently, the minimum in the restricted hyperplane $\{x : q_1^T x = 0\}$ is given by $\lambda_2$!

Thus, if we can somehow force $z_1 = 0$ and $z_2 = \pm 1$, then we would attain the upper bound. If we had $c = (c_1, 0, \dots, 0)^T$ with $c_1 \ne 0$, then the only solutions to the two constraints would be:

$$z = (0, \pm 1, 0, \dots, 0)^T$$

To force $c = Q^T b$ into this form, we only consider $b$ in the span of the first eigenvector, i.e. let

$$b = q_1 = Q e_1 \quad \Rightarrow \quad c = Q^T Q e_1 = e_1$$

The upper bound is then clearly attained, since with $z_1 = 0$ and $z_2 = \pm 1$ we have

$$\lambda_1 z_1^2 + \lambda_2 z_2^2 = \lambda_2$$

Hence,

$$\max_b \min_{x : b^T x = 0} \frac{x^T M x}{x^T x} = \lambda_2$$

Thus we've shown that min-maxing over this function is equivalent to finding the second smallest eigenvalue!

Since $q_1^T x = 0$ means that $x$ lies in the orthogonal complement of $q_1$, which is an $(n-1)$-dimensional subspace, we may rewrite the above as

$$\lambda_2 = \max_{S : \dim S = n - 1} \ \min_{x \in S, \, x \ne 0} \frac{x^T M x}{x^T x}$$

In fact, it turns out this works for general $k$:

$$\lambda_k = \max_{S : \dim S = n - k + 1} \ \min_{x \in S, \, x \ne 0} \frac{x^T M x}{x^T x}$$

All this means that if we want to find some vector which minimizes the Rayleigh quotient, then we simply compute the eigenvalues of $M$ and look for the eigenvector corresponding to the smallest one, giving us the solution!
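
A numerical sketch of the minimax principle (arbitrary symmetric example): no hyperplane normal $b$ beats $\lambda_2$, and $b = q_1$ attains it:

```python
import numpy as np

M = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 4.0]])
lam, Q = np.linalg.eigh(M)           # ascending eigenvalues

def constrained_min(b):
    # min of R(M, x) over {x : b^T x = 0}, via an orthonormal basis N of the
    # hyperplane: it equals the smallest eigenvalue of N^T M N
    Q_full, _ = np.linalg.qr(b.reshape(-1, 1), mode='complete')
    N = Q_full[:, 1:]
    return np.linalg.eigvalsh(N.T @ M @ N)[0]

rng = np.random.default_rng(2)
# No hyperplane does better than lambda_2 ...
assert all(constrained_min(rng.normal(size=3)) <= lam[1] + 1e-9 for _ in range(50))

# ... and b = q_1 attains it: max_b min_{b^T x = 0} R(M, x) = lambda_2
assert np.isclose(constrained_min(Q[:, 0]), lam[1])
```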