Linear Algebra

Definitions

Ill-conditioned matrix

A matrix $A$ is ill-conditioned if small changes in its entries can produce large changes in the solutions to $A \mathbf{x} = \mathbf{b}$. If small changes in the entries of $A$ produce small changes in the solutions to $A \mathbf{x} = \mathbf{b}$, then $A$ is called well-conditioned.

Singular values

For any $m \times n$ matrix $A$, the $n \times n$ matrix $A^T A$ is symmetric and thus orthogonally diagonalizable by the Spectral Theorem.

The eigenvalues of $A^T A$ are real. Moreover, they are non-negative, since

\begin{equation*}
\begin{split}
0 \le || A \mathbf{v} ||^2 &= \big(A \mathbf{v} \big) \cdot \big(A \mathbf{v} \big)
 = \big( A \mathbf{v} \big)^T \big( A \mathbf{v} \big) = \mathbf{v}^T \big( A^T A \mathbf{v} \big) \\
&= \mathbf{v}^T \lambda \mathbf{v} = \lambda \mathbf{v}^T \mathbf{v} = \lambda ( \mathbf{v} \cdot \mathbf{v} ) \\
 &= \lambda || \mathbf{v} ||^2 = \lambda
\end{split}
\end{equation*}

where $\mathbf{v}$ is a unit eigenvector of $A^T A$. Hence, we can take the square root of all these eigenvalues, and define what we call singular values:

If $A$ is an $m \times n$ matrix, the singular values of $A$ are the square roots of the eigenvalues of $A^T A$ and are denoted $\sigma_1, \sigma_2, ..., \sigma_n$.

\begin{equation*}
  \sigma_i = \sqrt{\lambda_i} = || A \mathbf{v}_i || \quad \text{where } (A^T A) \mathbf{v}_i = \lambda_i \mathbf{v}_i \text{ and } || \mathbf{v}_i || = 1
\end{equation*}

By convention we order the singular values s.t. $\sigma_1 \ge \sigma_2 \ge ... \ge \sigma_n$.
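A quick numerical sanity check of this definition, as a minimal numpy sketch with an arbitrary random example matrix: the square roots of the eigenvalues of $A^T A$ agree with the singular values reported by np.linalg.svd.

```python
import numpy as np

# Compare sqrt(eigenvalues of A^T A) with the singular values of A for an
# arbitrary random example matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# eigvalsh returns the eigenvalues of the symmetric matrix A^T A in
# ascending order; clip tiny negative round-off before taking square roots.
eigvals = np.linalg.eigvalsh(A.T @ A)
sigma_from_eig = np.sqrt(np.clip(eigvals, 0, None))[::-1]   # descending

sigma_from_svd = np.linalg.svd(A, compute_uv=False)         # descending

print(np.allclose(sigma_from_eig, sigma_from_svd))          # True
```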

Cholesky decomposition

The Cholesky decomposition or Cholesky factorization is a decomposition of a Hermitian positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose:

\begin{equation*}
A = L L^*
\end{equation*}

where $L$ is lower-triangular.

Every Hermitian positive-definite matrix $A$ has a unique Cholesky decomposition.

A real Cholesky factor is unique up to the signs of its columns.

Suppose $L$ is a Cholesky factor of some matrix. The components of the product are then given by

\begin{equation*}
\tensor{(L L^T)}{^{i}_{j}} = \tensor{L}{^{i}_{a}} \tensor{(L^T)}{^{a}_{j}}
\end{equation*}

So by scaling the k-th column by -1 we get

\begin{equation*}
\tensor{L}{^{i}_{a}} \tensor{(L^T)}{^{a}_{j}} = \big( - \tensor{L}{^{i}_{k}} \big) \big( - \tensor{(L^T)}{^{k}_{j}} \big) +  \sum_{a = 1, a \ne k}^{n} \tensor{L}{^{i}_{a}} \tensor{(L^T)}{^{a}_{j}}
\end{equation*}

So flipping the sign of a column leaves the product $L L^T$ unchanged :)

Sidenote: summing all the way to $n$ is redundant, since $L$ is lower-triangular: every term with $a > \min \left\{ i, j \right\}$ vanishes.
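A minimal numpy sketch of the sign-flip observation, using an arbitrary positive-definite example matrix (np.linalg.cholesky returns the lower-triangular factor with positive diagonal):

```python
import numpy as np

# B is an arbitrary real symmetric positive-definite example matrix.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
B = M @ M.T + 4 * np.eye(4)

L = np.linalg.cholesky(B)            # lower-triangular factor, B = L L^T
L_flipped = L.copy()
L_flipped[:, 2] *= -1                # scale the k-th column (here k = 2) by -1

print(np.allclose(L @ L.T, B))                   # True
print(np.allclose(L_flipped @ L_flipped.T, B))   # True: product unchanged
```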

Theorems

Spectral Theorem

Let $A$ be an $n \times n$ real matrix. Then $A$ is symmetric iff it is orthogonally diagonalizable.

Singular Value Decomposition

Let $A$ be an $m \times n$ matrix with singular values $\sigma_1 \ge \sigma_2 \ge ... \ge \sigma_r > 0$ and $\sigma_{r + 1} = ... = \sigma_n = 0$. Then there exist:

  • $m \times m$ orthogonal matrix $U$
  • $n \times n$ orthogonal matrix $V$
  • $m \times n$ "diagonal" matrix $\Sigma = \begin{bmatrix} & D & \vline & O & \\ \hline & O & \vline & O & \end{bmatrix}$, where $D = \begin{bmatrix} \sigma_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \sigma_r \end{bmatrix}$

such that:

\begin{equation*}
 A = U \Sigma V^T
\end{equation*}

See page 615 of "Linear Algebra: A Modern Introduction" :)

Constructing V

To construct the orthogonal matrix $V$, we first find an orthonormal basis $\{ \mathbf{v}_1, ..., \mathbf{v}_n \}$ for $\mathbb{R}^n$ consisting of eigenvectors of the $n \times n$ symmetric matrix $A^T A$. Then

\begin{equation*}
  V = [ \mathbf{v}_1 \cdots \mathbf{v}_n ]
\end{equation*}

is an orthogonal matrix.

Constructing U

For the orthogonal matrix $U$, we first note that $\{A \mathbf{v}_1, ..., A \mathbf{v}_n \}$ is an orthogonal set of vectors in $\mathbb{R}^m$ (we're NOT saying they form a basis of $\mathbb{R}^m$!). To see this, suppose that $\mathbf{v}_i$ is an eigenvector of $A^T A$ corresponding to the eigenvalue $\lambda_i$. Then, for $i \ne j$, we have

\begin{equation*}
\begin{split}
  (A \mathbf{v}_i ) \cdot (A \mathbf{v}_j) &= ( A \mathbf{v}_i )^T ( A \mathbf{v}_j ) \\
  &= \mathbf{v}_i^T (A^T A \mathbf{v}_j) \\
  &= \mathbf{v}_i^T \lambda_j \mathbf{v}_j \\
  &= \lambda_j ( \mathbf{v}_i \cdot \mathbf{v}_j ) = 0
\end{split}
\end{equation*}

since the eigenvectors are orthogonal.

Since $\sigma_i = || A \mathbf{v}_i ||$ we can normalize $A \mathbf{v}_1, ..., A \mathbf{v}_r$ by

\begin{equation*}
  \mathbf{u}_i = \frac{1}{\sigma_i} A \mathbf{v}_i, \quad \forall i = 1, ..., r
\end{equation*}

This guarantees that $\{ \mathbf{u}_1, ..., \mathbf{u}_r \}$ is an orthonormal set in $\mathbb{R}^m$, but if $r < m$ it's not a basis. In that case, we extend $\{ \mathbf{u}_1, ..., \mathbf{u}_r \}$ to an orthonormal basis $\{ \mathbf{u}_1, ..., \mathbf{u}_m \}$ for $\mathbb{R}^m$ by some technique, e.g. the Gram-Schmidt Process, giving us

\begin{equation*}
  U = [ \mathbf{u}_1 \cdots \mathbf{u}_m ]
\end{equation*}
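A minimal numpy sketch of this construction, assuming for simplicity an arbitrary random matrix with full column rank ($r = n$), so the Gram-Schmidt extension step is not needed:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 3
A = rng.standard_normal((m, n))               # full column rank (r = n)

# V: orthonormal eigenvectors of A^T A, ordered by decreasing eigenvalue.
eigvals, V = np.linalg.eigh(A.T @ A)          # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
sigma = np.sqrt(np.clip(eigvals, 0, None))    # singular values, descending

# First r columns of U: u_i = (1 / sigma_i) A v_i.
U_r = (A @ V) / sigma

# D is the r x r block of Sigma; the remaining singular values are zero,
# so U_r D V^T already reproduces A.
D = np.diag(sigma)
print(np.allclose(U_r.T @ U_r, np.eye(n)))    # columns of U_r are orthonormal
print(np.allclose(U_r @ D @ V.T, A))          # True
```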

Outer Product Form of the SVD

Let $A$ be an $m \times n$ matrix with singular values $\sigma_1 \ge \sigma_2 \ge ... \ge \sigma_r > 0$ and $\sigma_{r+1} = \sigma_{r+2} = ... = \sigma_n = 0$, then

\begin{equation*}
  A = \sigma_1 \mathbf{u}_1 \mathbf{v}_1^T + \cdots + \sigma_r \mathbf{u}_r \mathbf{v}_r^T
\end{equation*}
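A short numpy sketch of the outer product form for an arbitrary example matrix: summing the rank-one terms $\sigma_i \mathbf{u}_i \mathbf{v}_i^T$ reproduces $A$, and truncating the sum gives a low-rank approximation.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))                    # arbitrary example

U, sigma, Vt = np.linalg.svd(A)                    # rows of Vt are the v_i^T

# Full sum of the rank-one terms sigma_i * u_i v_i^T reproduces A.
A_rebuilt = sum(sigma[i] * np.outer(U[:, i], Vt[i]) for i in range(len(sigma)))
print(np.allclose(A_rebuilt, A))                   # True

# Truncating the sum gives a low-rank approximation (here rank 2).
A_rank2 = sum(sigma[i] * np.outer(U[:, i], Vt[i]) for i in range(2))
print(np.linalg.matrix_rank(A_rank2))              # 2
```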

Geometric insight into the effect of matrix transformations

Let $A$ be an $m \times n$ matrix with rank $r$. Then the image of the unit sphere in $\mathbb{R}^n$ under the matrix transformation that maps $\mathbf{x}$ to $A \mathbf{x}$ is

  1. the surface of an ellipsoid in $\mathbb{R}^m$ if $r = n$
  2. a solid ellipsoid in $\mathbb{R}^m$ if $r < n$

We have $A = U \Sigma V^T$ with $\text{rank}(A) = r$. Then

\begin{equation*}
  \sigma_1 \ge \sigma_2 \ge ... \ge \sigma_r > 0 \quad \text{and} \quad \sigma_{r+1} = ... = \sigma_n = 0
\end{equation*}

Let $|| \mathbf{x} || = 1$ (i.e. unit vector = surface of unit sphere) in $\mathbb{R}^n$. $V$ is orthogonal, thus $V^T$ is orthogonal and then $V^T \mathbf{x}$ is a unit vector too.

\begin{equation*}
  V^T \mathbf{x} = \begin{bmatrix} \mathbf{v}_1^T \mathbf{x} \\ \vdots \\ \mathbf{v}_n^T \mathbf{x} \end{bmatrix}
    \implies (\mathbf{v}_1^T \mathbf{x})^2 + ... + ( \mathbf{v}_n^T \mathbf{x} )^2 = 1
\end{equation*}

From the Outer Product Form we have:

\begin{equation*}
\begin{split}
  A \mathbf{x} &= \sigma_1 \mathbf{u}_1 \mathbf{v}_1^T \mathbf{x} + ... + \sigma_r \mathbf{u}_r \mathbf{v}_r^T \mathbf{x} \\
    &= ( \sigma_1 \mathbf{v}_1^T \mathbf{x} ) \mathbf{u}_1 + ... + ( \sigma_r \mathbf{v}_r^T \mathbf{x} ) \mathbf{u}_r \\
    &= \gamma_1 \mathbf{u}_1 + ... + \gamma_r \mathbf{u}_r
\end{split}
\end{equation*}

where we are letting $\gamma_i$ denote the scalar $\sigma_i \mathbf{v}_i^T \mathbf{x}$.

(1) If $r = n$, then we must have $n \le m$ and

\begin{equation*}
\begin{split}
  A \mathbf{x} &= \gamma_1 \mathbf{u}_1 + ... + \gamma_n \mathbf{u}_n \\
    &= U \mathbf{\gamma}
\end{split}
\end{equation*}

Since $U$ is orthogonal, we have $|| A \mathbf{x} || = || U \mathbf{\gamma} || = || \mathbf{\gamma} ||$, thus

\begin{equation*}
  \Big( \frac{\gamma_1}{\sigma_1} \Big)^2 + \cdots + \Big( \frac{\gamma_n}{\sigma_n} \Big)^2 = (\mathbf{v}_1^T \mathbf{x})^2 + \cdots + (\mathbf{v}_n^T \mathbf{x})^2 = 1
\end{equation*}

which shows that the vectors $A \mathbf{x}$ form the surface of an ellipsoid in $\mathbb{R}^m$ !

(2) If $r < n$, the same steps give an inequality instead, since the terms for $i > r$ drop out:

\begin{equation*}
  \Big( \frac{\gamma_1}{\sigma_1} \Big)^2 + \cdots + \Big( \frac{\gamma_r}{\sigma_r} \Big)^2 \le 1
\end{equation*}

where the inequality corresponds to a solid ellipsoid in $\mathbb{R}^m$ !

Furthermore, we can view each of the matrix transformations in the SVD separately:

  1. $V^T$ is an orthogonal matrix, thus it maps the unit sphere onto itself (but expressed in a different basis)
  2. the $m \times n$ matrix $\Sigma$ first collapses $n - r$ dimensions of the unit sphere, leaving a unit sphere of $r$ dimensions, and then "distorts" it into an ellipsoid
  3. $U$ is orthogonal, aligning the axes of this ellipsoid with the orthonormal basis vectors $\mathbf{u}_1, \dots, \mathbf{u}_r$ in $\mathbb{R}^m$ (see the sketch below)
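A small numerical sketch of the ellipsoid equation above, using an arbitrary full-column-rank example matrix and a random point on the unit sphere:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 3))               # full column rank, so r = n
U, sigma, Vt = np.linalg.svd(A)

x = rng.standard_normal(3)
x /= np.linalg.norm(x)                        # point on the unit sphere

gamma = sigma * (Vt @ x)                      # gamma_i = sigma_i v_i^T x
print(np.allclose(A @ x, U[:, :3] @ gamma))            # A x = sum_i gamma_i u_i
print(np.isclose(np.sum((gamma / sigma) ** 2), 1.0))   # the ellipsoid equation
```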

Least squares solution for dependent column-vectors

The least squares problem $A \mathbf{x} = \mathbf{b}$ has a unique least squares solution $\bar{\mathbf{x}}$ of minimal length that is given by:

\begin{equation*}
  \bar{\mathbf{x}} = A^+ \mathbf{b} \quad \text{where } A^+ = V \Sigma^+ U^T
\end{equation*}

See page 627 of "Linear Algebra: A Modern Introduction" :)
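A minimal numpy sketch of $\bar{\mathbf{x}} = V \Sigma^+ U^T \mathbf{b}$ for a deliberately rank-deficient example matrix, compared against np.linalg.pinv:

```python
import numpy as np

# A has two dependent columns by construction, so it is rank-deficient.
rng = np.random.default_rng(5)
a = rng.standard_normal(5)
A = np.column_stack([a, 2 * a])
b = rng.standard_normal(5)

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Sigma+: invert the nonzero singular values, leave the zero ones at zero.
tol = max(A.shape) * np.finfo(float).eps * sigma.max()
sigma_plus = np.array([1.0 / s if s > tol else 0.0 for s in sigma])

x_bar = Vt.T @ (sigma_plus * (U.T @ b))            # x_bar = V Sigma+ U^T b
print(np.allclose(x_bar, np.linalg.pinv(A) @ b))   # matches numpy's pinv
```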

Determinant

Trace is the sum of the eigenvalues

Let $\mathbf{A}$ be a matrix, then

\begin{equation*}
\begin{split}
  \text{tr} (\mathbf{A}) &= \lambda_1 + \lambda_2 + \dots + \lambda_n \\
  &= \sum_{i=1}^{n} \lambda_i
\end{split}
\end{equation*}

We start with the characteristic equation

\begin{equation*}
\det(\mathbf{A} - \lambda \mathbf{I}) = 0
\end{equation*}

We then notice that

\begin{equation*}
\det(\mathbf{A} - \lambda \mathbf{I}) = (-1)^n \big( \lambda^n - \text{tr} (\mathbf{A}) \lambda^{n-1} + \dots + (-1)^n \det \mathbf{A} \big)
\end{equation*}

which can be seen by expanding the determinant along the first row:

\begin{equation*}
\begin{split}
  \begin{vmatrix}
  a_{11} - \lambda & \dots & a_{1n} \\
  \vdots & \ddots & \vdots \\
  a_{n1} & \dots & a_{nn} - \lambda
\end{vmatrix} =& (a_{11} - \lambda) \begin{vmatrix}
       a_{22} - \lambda & a_{23} & \dots & a_{2n} \\
       \vdots & \vdots & \ddots & \vdots \\
       a_{n2} & a_{n3} & \dots & a_{nn} - \lambda
     \end{vmatrix} \\
& - a_{12} \begin{vmatrix}
           a_{21} & a_{23} & \dots & a_{2n} \\
           \vdots & \vdots & \ddots & \vdots \\
           a_{n1} & a_{n3} & \dots & a_{nn} - \lambda
         \end{vmatrix}
+ \dots \\
  =& (a_{11} - \lambda) (a_{22} - \lambda) \cdots (a_{nn} - \lambda) + \dots
\end{split}
\end{equation*}

Here we see that the only "sub-determinant" which can contain a factor of $\lambda^{n-1}$ is the very first one, the one multiplied by $(a_{11} - \lambda)$: every other "sub-determinant" is multiplied by a constant $a_{1j}$ and omits both $a_{11} - \lambda$ and one further diagonal entry $a_{jj} - \lambda$, so it contributes at most a factor of $\lambda^{n-2}$. Hence the coefficient of $\lambda^{n-1}$ in $\det(\mathbf{A} - \lambda \mathbf{I})$ comes entirely from $(a_{11} - \lambda) \cdots (a_{nn} - \lambda)$ and equals $(-1)^{n-1} (a_{11} + \dots + a_{nn}) = (-1)^{n-1} \text{tr}(\mathbf{A})$.

Further, since the eigenvalues $\lambda_i$ are the roots of the characteristic polynomial,

\begin{equation*}
\begin{split}
  \det (\mathbf{A} - \lambda \mathbf{I}) &= (-1)^n (\lambda - \lambda_1) \cdots (\lambda - \lambda_n) \\
  &= (-1)^n \big( \lambda^n - (\lambda_1 + \dots + \lambda_n) \lambda^{n-1} + \dots + (-1)^n \lambda_1 \lambda_2 \cdots \lambda_n \big)
\end{split}
\end{equation*}

Comparing the coefficients of $\lambda^{n-1}$ in the two expansions gives $-\text{tr}(\mathbf{A}) = -(\lambda_1 + \dots + \lambda_n)$, i.e.

\begin{equation*}
\text{tr}(\mathbf{A}) = \lambda_1 + \dots + \lambda_n
\end{equation*}

as wanted.

Determinant is the product of the eigenvalues

Let $\mathbf{A}$ be a matrix, then

\begin{equation*}
\begin{split}
  \det(\mathbf{A}) &= \lambda_1 \lambda_2 \cdots \lambda_n \\
  &= \prod_{i=1}^n \lambda_i
\end{split}
\end{equation*}

Let $\mathbf{A}$ be a matrix. Then

\begin{equation*}
\det (\mathbf{A} - \lambda \mathbf{I}) = 0
\end{equation*}

defines the characteristic equation, which has the eigenvalues $\lambda_i$ as its roots, therefore

\begin{equation*}
\begin{split}
  \det (\mathbf{A} - \lambda \mathbf{I}) &= (-1)^n (\lambda - \lambda_1) \cdots (\lambda - \lambda_n) \\
  &= (-1)(\lambda - \lambda_1) \cdots (-1) (\lambda - \lambda_n) \\
  &= (\lambda_1 - \lambda) \cdots (\lambda_n - \lambda)
\end{split}
\end{equation*}

Since $\lambda$ is just a variable, we let $\lambda = 0$ and get

\begin{equation*}
\det(\mathbf{A}) = \lambda_1 \lambda_2 \cdots \lambda_n
\end{equation*}

as wanted.
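A quick numerical check of this identity and the trace identity from the previous subsection, on an arbitrary example matrix (the eigenvalues may be complex, so we compare real parts):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4))
eigvals = np.linalg.eigvals(A)

print(np.isclose(np.trace(A), eigvals.sum().real))        # trace = sum
print(np.isclose(np.linalg.det(A), eigvals.prod().real))  # det = product
```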

Cramer's Rule

Consider a system of $n$ linear equations in $n$ unknowns; in matrix form

\begin{equation*}
\mathbf{A} \mathbf{x} = \mathbf{b}
\end{equation*}

where the $n \times n$ matrix $\mathbf{A}$ has nonzero determinant, then

\begin{equation*}
x_i = \frac{\det \mathbf{A}_i}{\det \mathbf{A}}, \quad i = 1, \dots, n
\end{equation*}

where $\mathbf{A}_i$ is the matrix formed by replacing the i-th column of $\mathbf{A}$ by the column vector $\mathbf{b}$.
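A small sketch of Cramer's rule in numpy; cramer_solve is just an illustrative helper name, and for anything beyond tiny systems np.linalg.solve is the better tool:

```python
import numpy as np

def cramer_solve(A, b):
    """Solve A x = b by Cramer's rule: x_i = det(A_i) / det(A), where A_i
    is A with its i-th column replaced by b."""
    det_A = np.linalg.det(A)
    x = np.empty(len(b))
    for i in range(len(b)):
        A_i = A.copy()
        A_i[:, i] = b                 # replace the i-th column by b
        x[i] = np.linalg.det(A_i) / det_A
    return x

A = np.array([[2.0, 1.0, -1.0],
              [1.0, 3.0,  2.0],
              [0.0, 1.0,  4.0]])
b = np.array([3.0, 13.0, 11.0])
print(np.allclose(cramer_solve(A, b), np.linalg.solve(A, b)))  # True
```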

The reason why we can think of OLS as projection

The proof of the following theorem is really important in my opinion, as it provides an "intuitive" way of viewing least-squares approximations.

The Best Approximation Theorem says the following:

If $W$ is a finite-dimensional subspace of an inner product space $V$ and if $\mathbf{v}$ is a vector in $V$, then $\text{proj}_W(\mathbf{v})$ is the best approximation to $\mathbf{v}$ in $W$.

Let $\mathbf{w}$ be a vector in $W$ different from $\text{proj}_W(\mathbf{v})$. Then $\text{proj}_W(\mathbf{v}) - \mathbf{w}$ is also in $W$, so $\mathbf{v} - \text{proj}_W(\mathbf{v}) = \text{perp}_W(\mathbf{v})$ is orthogonal to $\text{proj}_W(\mathbf{v}) - \mathbf{w}$.

Pythagoras' Theorem now implies that

\begin{equation*}
\begin{split}
|| \mathbf{v} - \text{proj}_W(\mathbf{v}) ||^2 + || \text{proj}_W(\mathbf{v}) - \mathbf{w} ||^2 &= || \big( \mathbf{v} - \text{proj}_W(\mathbf{v}) \big) + \big( \text{proj}_W(\mathbf{v}) - \mathbf{w} \big) ||^2 \\
  &= || \mathbf{v} - \mathbf{w} ||^2
\end{split}
\end{equation*}

But $\mathbf{w} \ne \text{proj}_W(\mathbf{v}) \implies || \text{proj}_W(\mathbf{v}) - \mathbf{w} || > 0$, so

\begin{equation*}
|| \mathbf{v} - \text{proj}_W(\mathbf{v}) ||^2 < || \mathbf{v} - \text{proj}_W(\mathbf{v}) ||^2 + || \text{proj}_W(\mathbf{v}) - \mathbf{w} ||^2 = || \mathbf{v} - \mathbf{w} ||^2
\end{equation*}

i.e. $|| \mathbf{v} - \text{proj}_W(\mathbf{v}) || < || \mathbf{v} - \mathbf{w} ||$ and thus $\text{proj}_W(\mathbf{v})$ is the best approximation.
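A minimal numpy sketch tying this back to OLS for an arbitrary example system: the least squares fit $A \bar{\mathbf{x}}$ is the projection of $\mathbf{b}$ onto $W = \text{col}(A)$, so the residual is orthogonal to $W$ and every other vector in $W$ approximates $\mathbf{b}$ worse.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 2))
b = rng.standard_normal(6)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)   # least squares solution
proj_b = A @ x_hat                              # proj_W(b), W = col(A)
residual = b - proj_b                           # perp_W(b)

print(np.allclose(A.T @ residual, 0))           # residual orthogonal to col(A)

# Any other w in W (almost surely different from proj_b) approximates b worse.
w = A @ rng.standard_normal(2)
print(np.linalg.norm(b - proj_b) < np.linalg.norm(b - w))  # True
```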

Distance and Approximation

Applications

Function-approximation

Given a continuous function $f$ on an interval $[a, b]$ and a subspace $W$ of $\mathcal{C}[a, b]$, find the function "closest" to $f$ in $W$.

  • $\mathcal{C}[a, b]$ denotes the space of all continuous functions defined on the domain $[a, b]$
  • $W$ is a subspace of $\mathcal{C}[a, b]$

The problem is analogous to least squares fitting of data points, except we have infinitely many data points, namely the points on the graph of the function $f$.

What should "approximate" mean in this context? Best Approximation Theorem has the answer!

The given function $f$ lives in the vector space $\mathcal{C}[a, b]$ of continuous functions on the interval $[a, b]$. This is an inner product space, with inner product:

\begin{equation*}
  \langle f, g \rangle = \int_a^b f(x) g(x) dx
\end{equation*}

If $W$ is a finite-dimensional subspace of $\mathcal{C}[a, b]$, then the best approximation to $f$ in $W$ is given by the projection of $f$ onto $W$.

Furthermore, if $\{ \mathbf{u}_1, ..., \mathbf{u}_k \}$ is an orthogonal basis for $W$, then

\begin{equation*}
  \text{proj}_W(f) = \frac{\langle \mathbf{u}_1, f \rangle}{\langle \mathbf{u}_1, \mathbf{u}_1 \rangle} \mathbf{u}_1 + ... + \frac{\langle \mathbf{u}_k, f \rangle}{\langle \mathbf{u}_k, \mathbf{u}_k \rangle} \mathbf{u}_k
\end{equation*}

The error from this approximation is just

\begin{equation*}
  || f - g || = \sqrt{\int_a^b \big( f(x) - g(x) \big)^2 dx}
\end{equation*}

and is often called the root mean square error. For functions we can think of this as the area between the graphs.
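A small sketch of this projection, assuming as an example $f(x) = e^x$ on $[-1, 1]$ and the subspace $W = \text{span}\{1, x\}$ (for this inner product $\{1, x\}$ happens to be orthogonal); the integrals are evaluated with scipy.integrate.quad:

```python
import numpy as np
from scipy.integrate import quad

f = np.exp                                   # example function f(x) = e^x
u = [lambda x: 1.0, lambda x: x]             # orthogonal basis of W on [-1, 1]

def inner(g, h, a=-1.0, b=1.0):
    """Inner product <g, h> = integral of g(x) h(x) over [a, b]."""
    return quad(lambda x: g(x) * h(x), a, b)[0]

# Coefficients <u_i, f> / <u_i, u_i> of proj_W(f).
coeffs = [inner(ui, f) / inner(ui, ui) for ui in u]
proj = lambda x: coeffs[0] + coeffs[1] * x   # best approximation to f in W

# Root mean square error || f - proj_W(f) ||.
rms = np.sqrt(quad(lambda x: (f(x) - proj(x)) ** 2, -1.0, 1.0)[0])
print(np.round(coeffs, 4), round(rms, 4))    # coeffs ~ [1.1752, 1.1036], rms ~ 0.23
```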

Fourier approximation

A function of the form

\begin{equation*}
  p(x) = a_0 + a_1 \cos x + ... + a_n \cos nx + b_1 \sin x + ... + b_n \sin nx
\end{equation*}

is called a trigonometric polynomial.

We restrict our attention to the space of continuous functions on the interval $[- \pi, \pi ]$, i.e. $\mathcal{C}[-\pi, \pi]$, with the inner product

\begin{equation*}
  \langle f, g \rangle = \int_{-\pi}^\pi f(x) g(x) dx
\end{equation*}

Since the trigonometric polynomials are linear combinations of the set

\begin{equation*}
  \mathcal{B} = \{ 1, \cos x, ..., \cos nx, \sin x, ..., \sin nx \}
\end{equation*}

the best approximation to a function $f$ in $\mathcal{C}[-\pi, \pi]$ by trigonometric polynomials of order $n$ is therefore $\text{proj}_W(f)$, where $W = \text{span}(\mathcal{B})$. It turns out that $\mathcal{B}$ is an orthogonal set, hence a basis for $W$.

The approximation to $f$ using trigonometric polynomials is called the n-th order Fourier approximation to $f$ on $[-\pi, \pi]$, and the coefficients are known as the Fourier coefficients of $f$.
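A small sketch computing the n-th order Fourier approximation from the projection formula, assuming as an example $f(x) = x$ on $[-\pi, \pi]$; for this $f$ the classical coefficients are $a_k = 0$ and $b_k = 2(-1)^{k+1}/k$, which the code recovers numerically:

```python
import numpy as np
from scipy.integrate import quad

f = lambda x: x                               # example function
n = 3                                         # order of the approximation

def inner(g, h):
    """Inner product <g, h> on C[-pi, pi]."""
    return quad(lambda x: g(x) * h(x), -np.pi, np.pi)[0]

# The orthogonal set B = {1, cos x, ..., cos nx, sin x, ..., sin nx}
# (k is captured as a default argument in each lambda).
basis = [lambda x: 1.0]
basis += [lambda x, k=k: np.cos(k * x) for k in range(1, n + 1)]
basis += [lambda x, k=k: np.sin(k * x) for k in range(1, n + 1)]

# Fourier coefficients <u, f> / <u, u> for each u in B.
coeffs = [inner(u, f) / inner(u, u) for u in basis]

def fourier_approx(x):
    return sum(c * u(x) for c, u in zip(coeffs, basis))

print(np.round(coeffs, 4))     # ~ [0, 0, 0, 0, 2, -1, 0.6667]
print(fourier_approx(1.0))     # approximation to f(1) = 1
```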

Rayleigh principle

A lot of the stuff here I got from these lecture notes.

Notation

  • $R(x)$ denotes the Rayleigh quotient of some Hermitian / real symmetric matrix $A$
  • $\Lambda$ is the diagonal matrix with entries being eigenvalues of $A$
  • Eigenvalues are ordered as $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$
  • $Q^T$ is the orthogonal matrix with eigenvectors of $A$ as columns, so that $A = Q^T \Lambda Q$
  • $y = Q^T x$, thus $R(y) = R(Q^T x) = \lambda_1 x_1^2 + \dots + \lambda_n x_n^2$ when $x_1^2 + \dots + x_n^2 = 1$
  • $z' = (a_1, a_2, \dots, a_n)$

Definition

For a given complex Hermitian matrix $A$ and non-zero vector $x$, the Rayleigh quotient is defined as:

\begin{equation*}
R(A, x) = \frac{ x^* A x}{x^* x}
\end{equation*}

In the real case, where $A$ is symmetric, this becomes

\begin{equation*}
R(A, x) = \frac{x^T A x}{x^T x}
\end{equation*}

Further, in general,

\begin{equation*}
R(A, B; x) = \frac{x^* A x}{x^* B x}
\end{equation*}

is the generalized Rayleigh quotient, where $B$ is a positive-definite Hermitian matrix.

The generalized Rayleigh quotient can be reduced to $R(D, C^* x)$ through the transformation

\begin{equation*}
D = C^{-1} A C^{*^{-1}}
\end{equation*}

where $CC^*$ is the Cholesky decomposition of the Hermitian positive-definite matrix $B$.
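A small numpy sketch of this reduction for an arbitrary real symmetric $A$ and positive-definite $B$: with $B = C C^T$ and $D = C^{-1} A C^{-T}$, the generalized quotient at $x$ equals the ordinary quotient of $D$ at $C^T x$.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 4
M1 = rng.standard_normal((n, n))
A = (M1 + M1.T) / 2                          # real symmetric
M2 = rng.standard_normal((n, n))
B = M2 @ M2.T + n * np.eye(n)                # symmetric positive-definite

C = np.linalg.cholesky(B)                    # B = C C^T
D = np.linalg.solve(C, A) @ np.linalg.inv(C).T   # D = C^{-1} A C^{-T}

x = rng.standard_normal(n)
R_gen = (x @ A @ x) / (x @ B @ x)            # generalized Rayleigh quotient
y = C.T @ x                                  # the transformed vector C^* x
R_std = (y @ D @ y) / (y @ y)                # ordinary Rayleigh quotient of D
print(np.isclose(R_gen, R_std))              # True
```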

Minimization

Personal explanation

\begin{equation*}
R(x) = \frac{x^T A x}{x^T x}
\end{equation*}

For any candidate $x$, the denominator is just some positive constant,

\begin{equation*}
x^T x = x_1^2 + \dots + x_n^2 = c^2 \in \mathbb{R}
\end{equation*}

But observe that for whatever $c$ we have here, we have

\begin{equation*}
R(x) = \frac{(x / c)^T A (x / c)}{(x / c)^T (x / c)} = \frac{x^T A x}{x^T x}
\end{equation*}

i.e. it's invariant to scaling.

Further, since $A$ is symmetric, we have the diagonalization

\begin{equation*}
A = Q^T \Lambda Q
\end{equation*}

which means that if we let

\begin{equation*}
y = Q^T x
\end{equation*}

for some $x \in \mathbb{R}^n$, we get

\begin{equation*}
R(y) = \frac{(Q^T x)^T A (Q^T x)}{(Q^T x)^T  (Q^T x)} = \frac{(x^T Q) (Q^T \Lambda Q) (Q^T x) }{x^T (Q Q^T) x} = \frac{x^T \Lambda x}{x^T x} = \lambda_1 x_1^2 + \dots + \lambda_n x_n^2
\end{equation*}

where we've applied the constraint $x_1^2 + \dots + x_n^2 = 1$, which we may do since, again, $R$ is invariant under scaling.

Therefore, we're left with the following optimization problem:

\begin{equation*}
\begin{split}
  \min_y R(y) = \min_x R(Q^T x) &= \min_x \lambda_1 x_1^2 + \dots + \lambda_n x_n^2 \\
  \text{subject to} & \quad x_1^2 + \dots + x_n^2 = 1
\end{split}
\end{equation*}

Since we assume that $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$, the minimum is clearly attained at

\begin{equation*}
x = (1, 0, 0, \dots, 0)
\end{equation*}

Hence,

\begin{equation*}
\min_x R(Q^T x) = \lambda_1 \quad \implies \quad y = Q^T (1, 0, 0, \dots, 0) = q_1
\end{equation*}

where $q_1$ denotes the corresponding eigenvector.

Now suppose we'd like to find the minimum over vectors NOT in the space spanned by the first eigenvector; then we only consider the subspace

\begin{equation*}
y^T q_1 = 0
\end{equation*}

Thus, the optimization problem is

\begin{equation*}
\begin{split}
  \min_{y: \ y^T q_1 = 0} R(y) = \min_{x: \ x^T Q q_1 = 0} R(Q^T x) &= \min_{x: \ x^T (Q q_1) = 0 } \lambda_1 x_1^2 + \dots + \lambda_n x_n^2 \\
  \text{subject to} & \quad x_1^2 + \dots + x_n^2 = 1
\end{split}
\end{equation*}

Finally, observe that $Q q_1 = (1, 0, 0, \dots, 0)$, hence

\begin{equation*}
\begin{split}
  \min_{x: \ x^T e_1 = 0} & \lambda_1 x_1^2 + \dots + \lambda_n x_n^2 \\
  \text{subject to} & \quad x_1^2 + \dots + x_n^2 = 1
\end{split}
\end{equation*}

And the subspace $x^T e_1 = 0$ is just all $x = (0, x_2, x_3, \dots, x_n)$, giving us

\begin{equation*}
\min_{x: \ x^T e_1 = 0} R(Q^T x) = \lambda_2 \quad \implies \quad y = Q^T e_2 = q_2
\end{equation*}

And if we just keep going, restricting to the orthogonal complement of the eigenvectors found so far, heyooo, we get all our eigenvalues and eigenvectors!
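A numerical sketch of this argument for an arbitrary symmetric example matrix: the Rayleigh quotient over random unit vectors never drops below $\lambda_1$ (and equals it at $q_1$), and on the hyperplane orthogonal to $q_1$ it never drops below $\lambda_2$.

```python
import numpy as np

rng = np.random.default_rng(8)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                              # real symmetric example

R = lambda x: (x @ A @ x) / (x @ x)            # Rayleigh quotient

eigvals, Q = np.linalg.eigh(A)                 # ascending eigenvalues
q1 = Q[:, 0]                                   # eigenvector for lambda_1

samples = rng.standard_normal((10000, 5))
print(min(R(x) for x in samples) >= eigvals[0] - 1e-12)    # True
print(np.isclose(R(q1), eigvals[0]))                       # R(q1) = lambda_1

# Project the samples onto the hyperplane x^T q1 = 0: the quotient is now
# bounded below by lambda_2, attained at the second eigenvector.
projected = samples - np.outer(samples @ q1, q1)
print(min(R(x) for x in projected) >= eigvals[1] - 1e-12)  # True
print(np.isclose(R(Q[:, 1]), eigvals[1]))                  # R(q2) = lambda_2
```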

Stuff

We're going to assume $A$ is real.

Our objective is:

\begin{equation*}
\min_x R(A, x)
\end{equation*}

Observe the following:

  • Objective is invariant to scaling, since we'd be scaling the denominator too
  • $A$ is by assumption symmetric: all real eigenvalues

Thus, we have

\begin{equation*}
A = Q^T \Lambda Q
\end{equation*}

Further,

\begin{equation*}
\big( Q x \big)^T \big( Q x \big) = || Q x ||^2 = || x ||^2 = x^T x
\end{equation*}

Hence,

\begin{equation*}
R(Q^T x) = \frac{\big( Q Q^T x \big)^T \Lambda \big( Q Q^T x \big)}{\big( Q Q^T x \big)^T \big( Q Q^T x \big)} = \frac{x^T \Lambda x}{x^T x} = \frac{\lambda_1 x_1^2 + \dots + \lambda_n x_n^2}{x_1^2 + \dots + x_n^2}
\end{equation*}

Thus, by letting $y = Q^T x$, we get

\begin{equation*}
R(y) = \frac{\lambda_1 x_1^2 + \dots + \lambda_n x_n^2}{x_1^2 + \dots + x_n^2}
\end{equation*}

Since this quotient is invariant under scaling, we frame it as follows:

\begin{equation*}
R(y) = \lambda_1 x_1^2 + \dots + \lambda_n x_n^2 \quad \text{where} \quad x_1^2 + \dots + x_n^2 = 1
\end{equation*}

This implies that $R(y) \in [\lambda_1, \lambda_n]$ (seen from $x = (1, \dots, 0)$ and $x = (0, \dots, 1)$), since we assume the eigenvalues to be ordered.

Hence, we have proved what's called the Rayleigh Principle:

\begin{equation*}
\min_x R(x) = \lambda_1
\end{equation*}

where $\lambda_1$ is the smallest eigenvalue of $A$.

Planar Slices

Suppose we'd like to find the answer to the following:

\begin{equation*}
\min_{x^T z = 0} R(x)
\end{equation*}

i.e. minimizing the Rayleigh quotient over some hyperplane defined by $x^T z = 0$.

This minimum is bounded below as follows:

\begin{equation*}
\min_x R(x) = \lambda_1 \le \min_{x: \ x^T z = 0} R(x)
\end{equation*}

Further, we can obtain an upper bound for this by considering $R(y)$ again:

\begin{equation*}
\min_{y: \ y^T z = 0} R(y) = \min_{x: \ x^T (Qz) = 0} R(Q^T x) = \min_{x: \ x^T (Qz) = 0} \frac{\lambda_1 x_1^2 + \dots + \lambda_n x_n^2}{x_1^2 + \dots + x_n^2}
\end{equation*}

By scaling we can assume $x_1^2 + \dots + x_n^2 = 1$, giving us the optimization problem (writing $z' = Qz$)

\begin{equation*}
\begin{split}
  \min_{x: \ x^T (z') = 0} & \lambda_1 x_1^2 + \dots + \lambda_n x_n^2 \\
  \text{subject to} \quad & x_1^2 + \dots + x_n^2 = 1
\end{split}
\end{equation*}

There is at least one vector of the form

\begin{equation*}
x = (x_1, x_2, 0, 0, \dots, 0) \quad \text{such that} \quad x^T (z') = 0, \quad x_1^2 + x_2^2 = 1
\end{equation*}

If $z' = (a_1, \dots, a_n)$, then the above translates to

\begin{equation*}
x_1 a_1 + x_2 a_2 = 0, \quad x_1^2 + x_2^2 = 1
\end{equation*}

Assuming $(a_1, a_2) \ne (0, 0)$ and, say, $a_2 \ne 0$, the first constraint can be represented as:

\begin{equation*}
x_2 = - \frac{a_1}{a_2} x_1
\end{equation*}

which is just a line through the origin. The second constraint is the unit-circle. Thus, there are only two points where both these constraints are satisfied (a line crosses the circle in two places). Let $(w_1, w_2)$ be one of these points of intersection, then

\begin{equation*}
\min_{x: \ x^T z = 0} R(x) \le \lambda_1 w_1^2 + \lambda_2 w_2^2 \le \lambda_2, \quad w_1, w_2 : \ w_1^2 + w_2^2 = 1
\end{equation*}

Which means we have bounded the minimum of $R(x)$ on the hyperplane $x^T z = 0$! This becomes incredibly useful when we have a look at the Minimax Principle for the Second Smallest Eigenvalue.

Minimax Principle for the Second Smallest Eigenvalue

This is a slightly different way of viewing how to find the second eigenvalue, and poses the "answer" in a frame where we might have asked the following question:

"Can we obtain $\lambda_2$ without finding $\lambda_1$?"

And ends with the answer:

\begin{equation*}
\max_z \min_{x^T z = 0} R(x) = \lambda_2
\end{equation*}

which is very useful computationally, as we can alternate between minimizing and maximizing $R(x)$ to obtain $\lambda_2$.

What if we could attain the maximum in this equation? Then we would know that maximizing, over $z$, the minimum of $R(x)$ on the hyperplane $x^T z = 0$ gives us the second smallest eigenvalue $\lambda_2$!

Or equivalently, for the maximizing $z$, the minimum on the restricted hyperplane $x^T z = 0$ is exactly $\lambda_2$!

Thus, if we can somehow force $w_1 = 0$ and $w_2 = \pm 1$, the upper bound becomes exactly $\lambda_2$. If we had $z' = (1, 0, 0, \dots, 0)$, then the only solutions to the two constraints would be:

\begin{equation*}
(x_1, x_2) = (0, \pm 1) \quad \implies \quad R(y) = \lambda_2
\end{equation*}

To show that $\min_{x: \ x^T z' = 0} R(Q^T x) = \lambda_2$ in this case, note that the constraint $x^T z' = 0$ just says $x_1 = 0$, i.e. we only consider

\begin{equation*}
\mathcal{X} := \{ x \mid x = (0, x_2, \dots, x_n), \ x_2^2 + \dots + x_n^2 = 1 \}
\end{equation*}
\begin{equation*}
\begin{split}
  \min_\mathcal{X} R(x) &= \min_\mathcal{X} \lambda_1 x_1^2 + \lambda_2 x_2^2 + \dots + \lambda_n x_n^2 \\
  &= \min_{x_2^2 + x_3^2 + \dots + x_n^2 = 1} \lambda_2 x_2^2 + \dots + \lambda_n x_n^2
\end{split}
\end{equation*}

The minimum is clearly attained when $x_2 = \pm 1$ and $x_3 = \dots = x_n = 0$, giving the value $\lambda_2$, since $\lambda_2 \le \dots \le \lambda_n$.

Hence,

\begin{equation*}
\max_z \min_{x^T z = 0} R(x) = \lambda_2
\end{equation*}

Thus we've shown that min-maxing the function $R(x)$ is equivalent to finding the second smallest eigenvalue!

The condition $x^T z = 0$ means that $x$ lies in the orthogonal complement of $z$, which is an $(n-1)$-dimensional subspace. Thus, we may rewrite the above as

\begin{equation*}
\max_{\dim S = n - 1} \min_{x \in S} R(x) = \lambda_2
\end{equation*}

In fact, it turns out this works for general $j$ :

\begin{equation*}
\max_{\dim S = n - j} \min_{x \in S} R(x) = \lambda_{j + 1}
\end{equation*}

All this means that if we want to find some vector $x$ which minimizes the Rayleigh quotient, we simply compute the smallest eigenvalue of $A$ and take a corresponding eigenvector, giving us the solution!
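A numerical sketch of the minimax characterization for an arbitrary symmetric example (min_R_on_hyperplane is just an illustrative helper): for a fixed $z$, the inner minimum over the hyperplane $x^T z = 0$ is the smallest eigenvalue of $A$ restricted to the orthogonal complement of $z$; it never exceeds $\lambda_2$, and the choice $z = q_1$ attains it.

```python
import numpy as np

rng = np.random.default_rng(9)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                              # real symmetric example
eigvals, Q = np.linalg.eigh(A)                 # ascending eigenvalues

def min_R_on_hyperplane(z):
    """min of R(x) over the hyperplane x^T z = 0."""
    # Orthonormal basis of the orthogonal complement of z: drop the first
    # column of the Q factor of [z | I].
    B = np.linalg.qr(np.column_stack([z, np.eye(len(z))]))[0][:, 1:]
    # Restricting R to span(B) gives the Rayleigh quotient of B^T A B.
    return np.linalg.eigvalsh(B.T @ A @ B)[0]

# For arbitrary z the inner minimum never exceeds lambda_2 ...
zs = rng.standard_normal((200, 5))
print(max(min_R_on_hyperplane(z) for z in zs) <= eigvals[1] + 1e-12)  # True
# ... and the choice z = q_1 attains it.
print(np.isclose(min_R_on_hyperplane(Q[:, 0]), eigvals[1]))           # True
```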