Linear Algebra

Definitions

Ill-conditioned matrix

A matrix $A$ is ill-conditioned if small changes in its entries can produce large changes in the solutions to $A \mathbf{x} = \mathbf{b}$. If small changes in the entries of $A$ produce small changes in the solutions to $A \mathbf{x} = \mathbf{b}$, then $A$ is called well-conditioned.

Singular values

For any $m \times n$ matrix $A$, the $n \times n$ matrix $A^T A$ is symmetric and thus orthogonally diagonalizable by the Spectral Theorem.

The eigenvalues of $A^T A$ are real. Moreover, they are non-negative, since

\begin{equation*}
\begin{split}
0 \le || A \mathbf{v} ||^2 &= \big(A \mathbf{v} \big) \cdot \big(A \mathbf{v} \big)
 = \big( A \mathbf{v} \big)^T \big( A \mathbf{v} \big) = \mathbf{v}^T \big( A^T A \mathbf{v} \big) \\
&= \mathbf{v}^T \lambda \mathbf{v} = \lambda \mathbf{v}^T \mathbf{v} = \lambda ( \mathbf{v} \cdot \mathbf{v} ) \\
 &= \lambda || \mathbf{v} ||^2 = \lambda
\end{split}
\end{equation*}

where $\mathbf{v}$ is a unit eigenvector of $A^T A$. Hence, we can take the square root of all these eigenvalues, and define what we call singular values:

If $A$ is an $m \times n$ matrix, the singular values of $A$ are the square roots of the eigenvalues of $A^T A$ and are denoted $\sigma_1, \sigma_2, ..., \sigma_n$.

\begin{equation*}
  \sigma_i = \sqrt{\lambda_i} = || A \mathbf{v}_i || \quad \text{where } (A^T A) \mathbf{v}_i = \lambda_i \mathbf{v}_i \text{ and } || \mathbf{v}_i || = 1
\end{equation*}

By convention we order the singular values s.t. $\sigma_1 \ge \sigma_2 \ge ... \ge \sigma_n$.
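A quick numerical sanity check of this definition, as a minimal numpy sketch with an arbitrary random example matrix: the square roots of the eigenvalues of $A^T A$ agree with the singular values reported by np.linalg.svd.

```python
import numpy as np

# Compare sqrt(eigenvalues of A^T A) with the singular values of A for an
# arbitrary random example matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# eigvalsh returns the eigenvalues of the symmetric matrix A^T A in
# ascending order; clip tiny negative round-off before taking square roots.
eigvals = np.linalg.eigvalsh(A.T @ A)
sigma_from_eig = np.sqrt(np.clip(eigvals, 0, None))[::-1]   # descending

sigma_from_svd = np.linalg.svd(A, compute_uv=False)         # descending

print(np.allclose(sigma_from_eig, sigma_from_svd))          # True
```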

Cholesky decomposition

The Cholesky decomposition or Cholesky factorization is a decomposition of a Hermitian positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose:

\begin{equation*}
A = L L^*
\end{equation*}

where $L$ is lower-triangular.

Every Hermitian positive-definite matrix $A$ has a unique Cholesky decomposition.

A real Cholesky factor is unique up to the signs of its columns.

Suppose $L$ is a Cholesky factor of some matrix. The components of the product are then given by

\begin{equation*}
\tensor{(L L^T)}{^{i}_{j}} = \tensor{L}{^{i}_{a}} \tensor{(L^T)}{^{a}_{j}}
\end{equation*}

So by scaling the k-th column by -1 we get

\begin{equation*}
\tensor{L}{^{i}_{a}} \tensor{(L^T)}{^{a}_{j}} = \big( - \tensor{L}{^{i}_{k}} \big) \big( - \tensor{(L^T)}{^{k}_{j}} \big) +  \sum_{a = 1, a \ne k}^{n} \tensor{L}{^{i}_{a}} \tensor{(L^T)}{^{a}_{j}}
\end{equation*}

So flipping the sign of a column leaves the product $L L^T$ unchanged :)

Sidenote: summing all the way to $n$ is redundant, since $L$ is lower-triangular: every term with $a > \min \left\{ i, j \right\}$ vanishes.
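A minimal numpy sketch of the sign-flip observation, using an arbitrary positive-definite example matrix (np.linalg.cholesky returns the lower-triangular factor with positive diagonal):

```python
import numpy as np

# B is an arbitrary real symmetric positive-definite example matrix.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
B = M @ M.T + 4 * np.eye(4)

L = np.linalg.cholesky(B)            # lower-triangular factor, B = L L^T
L_flipped = L.copy()
L_flipped[:, 2] *= -1                # scale the k-th column (here k = 2) by -1

print(np.allclose(L @ L.T, B))                   # True
print(np.allclose(L_flipped @ L_flipped.T, B))   # True: product unchanged
```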

Theorems

Spectral Theorem

Let $A$ be an $n \times n$ real matrix. Then $A$ is symmetric iff it is orthogonally diagonalizable.

Singular Value Decomposition

Let $A$ be an $m \times n$ matrix with singular values $\sigma_1 \ge \sigma_2 \ge ... \ge \sigma_r > 0$ and $\sigma_{r + 1} = ... = \sigma_n = 0$. Then there exist:

  • $m \times m$ orthogonal matrix $U$
  • $n \times n$ orthogonal matrix $V$
  • $m \times n$ "diagonal" matrix $\Sigma = \begin{bmatrix} & D & \vline & O & \\ \hline & O & \vline & O & \end{bmatrix}$, where $D = \begin{bmatrix} \sigma_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \sigma_r \end{bmatrix}$

such that:

\begin{equation*}
 A = U \Sigma V^T
\end{equation*}

See page 615 of "Linear Algebra: A Modern Introduction" :)

Constructing V

To construct the orthogonal matrix $V$, we first find an orthonormal basis $\{ \mathbf{v}_1, ..., \mathbf{v}_n \}$ for $\mathbb{R}^n$ consisting of eigenvectors of the $n \times n$ symmetric matrix $A^T A$. Then

\begin{equation*}
  V = [ \mathbf{v}_1 \cdots \mathbf{v}_n ]
\end{equation*}

is an orthogonal matrix.

Constructing U

For the orthogonal matrix $U$, we first note that $\{A \mathbf{v}_1, ..., A \mathbf{v}_n \}$ is an orthogonal set of vectors in $\mathbb{R}^m$ (we're NOT saying they form a basis of $\mathbb{R}^m$!). To see this, suppose that $\mathbf{v}_i$ is an eigenvector of $A^T A$ corresponding to the eigenvalue $\lambda_i$. Then, for $i \ne j$, we have

\begin{equation*}
\begin{split}
  (A \mathbf{v}_i ) \cdot (A \mathbf{v}_j) &= ( A \mathbf{v}_i )^T ( A \mathbf{v}_j ) \\
  &= \mathbf{v}_i^T (A^T A \mathbf{v}_j) \\
  &= \mathbf{v}_i^T \lambda_j \mathbf{v}_j \\
  &= \lambda_j ( \mathbf{v}_i \cdot \mathbf{v}_j ) = 0
\end{split}
\end{equation*}

since the eigenvectors are orthogonal.

Since $\sigma_i = || A \mathbf{v}_i ||$ we can normalize $A \mathbf{v}_1, ..., A \mathbf{v}_r$ by

\begin{equation*}
  \mathbf{u}_i = \frac{1}{\sigma_i} A \mathbf{v}_i, \quad \forall i = 1, ..., r
\end{equation*}

This guarantees that $\{ \mathbf{u}_1, ..., \mathbf{u}_r \}$ is an orthonormal set in $\mathbb{R}^m$, but if $r < m$ it's not a basis. In that case, we extend $\{ \mathbf{u}_1, ..., \mathbf{u}_r \}$ to an orthonormal basis $\{ \mathbf{u}_1, ..., \mathbf{u}_m \}$ for $\mathbb{R}^m$ by some technique, e.g. the Gram-Schmidt Process, giving us

\begin{equation*}
  U = [ \mathbf{u}_1 \cdots \mathbf{u}_m ]
\end{equation*}
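A minimal numpy sketch of this construction, assuming for simplicity an arbitrary random matrix with full column rank ($r = n$), so the Gram-Schmidt extension step is not needed:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 3
A = rng.standard_normal((m, n))               # full column rank (r = n)

# V: orthonormal eigenvectors of A^T A, ordered by decreasing eigenvalue.
eigvals, V = np.linalg.eigh(A.T @ A)          # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
sigma = np.sqrt(np.clip(eigvals, 0, None))    # singular values, descending

# First r columns of U: u_i = (1 / sigma_i) A v_i.
U_r = (A @ V) / sigma

# D is the r x r block of Sigma; the remaining singular values are zero,
# so U_r D V^T already reproduces A.
D = np.diag(sigma)
print(np.allclose(U_r.T @ U_r, np.eye(n)))    # columns of U_r are orthonormal
print(np.allclose(U_r @ D @ V.T, A))          # True
```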

Outer Product Form of the SVD

Let $A$ be an $m \times n$ matrix with singular values $\sigma_1 \ge \sigma_2 \ge ... \ge \sigma_r > 0$ and $\sigma_{r+1} = \sigma_{r+2} = ... = \sigma_n = 0$, then

\begin{equation*}
  A = \sigma_1 \mathbf{u}_1 \mathbf{v}_1^T + \cdots + \sigma_r \mathbf{u}_r \mathbf{v}_r^T
\end{equation*}
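A short numpy sketch of the outer product form for an arbitrary example matrix: summing the rank-one terms $\sigma_i \mathbf{u}_i \mathbf{v}_i^T$ reproduces $A$, and truncating the sum gives a low-rank approximation.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))                    # arbitrary example

U, sigma, Vt = np.linalg.svd(A)                    # rows of Vt are the v_i^T

# Full sum of the rank-one terms sigma_i * u_i v_i^T reproduces A.
A_rebuilt = sum(sigma[i] * np.outer(U[:, i], Vt[i]) for i in range(len(sigma)))
print(np.allclose(A_rebuilt, A))                   # True

# Truncating the sum gives a low-rank approximation (here rank 2).
A_rank2 = sum(sigma[i] * np.outer(U[:, i], Vt[i]) for i in range(2))
print(np.linalg.matrix_rank(A_rank2))              # 2
```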

Geometric insight into the effect of matrix transformations

Let $A$ be an $m \times n$ matrix with rank $r$. Then the image of the unit sphere in $\mathbb{R}^n$ under the matrix transformation that maps $\mathbf{x}$ to $A \mathbf{x}$ is

  1. the surface of an ellipsoid in $\mathbb{R}^m$ if $r = n$
  2. a solid ellipsoid in $\mathbb{R}^m$ if $r < n$

We have $A = U \Sigma V^T$ with $\text{rank}(A) = r$. Then

\begin{equation*}
  \sigma_1 \ge \sigma_2 \ge ... \ge \sigma_r > 0 \quad \text{and} \quad \sigma_{r+1} = ... = \sigma_n = 0
\end{equation*}

Let $|| \mathbf{x} || = 1$ (i.e. unit vector = surface of unit sphere) in $\mathbb{R}^n$. $V$ is orthogonal, thus $V^T$ is orthogonal and then $V^T \mathbf{x}$ is a unit vector too.

\begin{equation*}
  V^T \mathbf{x} = \begin{bmatrix} \mathbf{v}_1^T \mathbf{x} \\ \vdots \\ \mathbf{v}_n^T \mathbf{x} \end{bmatrix}
    \implies (\mathbf{v}_1^T \mathbf{x})^2 + ... + ( \mathbf{v}_n^T \mathbf{x} )^2 = 1
\end{equation*}

From the Outer Product Form we have:

\begin{equation*}
\begin{split}
  A \mathbf{x} &= \sigma_1 \mathbf{u}_1 \mathbf{v}_1^T \mathbf{x} + ... + \sigma_r \mathbf{u}_r \mathbf{v}_r^T \mathbf{x} \\
    &= ( \sigma_1 \mathbf{v}_1^T \mathbf{x} ) \mathbf{u}_1 + ... + ( \sigma_r \mathbf{v}_r^T \mathbf{x} ) \mathbf{u}_r \\
    &= \gamma_1 \mathbf{u}_1 + ... + \gamma_r \mathbf{u}_r
\end{split}
\end{equation*}

where we are letting $\gamma_i$ denote the scalar $\sigma_i \mathbf{v}_i^T \mathbf{x}$.

(1) If $r = n$, then we must have $n \le m$ and

\begin{equation*}
\begin{split}
  A \mathbf{x} &= \gamma_1 \mathbf{u}_1 + ... + \gamma_n \mathbf{u}_n \\
    &= U \mathbf{\gamma}
\end{split}
\end{equation*}

Since $U$ is orthogonal, we have $|| A \mathbf{x} || = || U \mathbf{\gamma} || = || \mathbf{\gamma} ||$, thus

\begin{equation*}
  \Big( \frac{\gamma_1}{\sigma_1} \Big)^2 + \cdots + \Big( \frac{\gamma_n}{\sigma_n} \Big)^2 = (\mathbf{v}_1^T \mathbf{x})^2 + \cdots + (\mathbf{v}_n^T \mathbf{x})^2 = 1
\end{equation*}

which shows that the vectors $A \mathbf{x}$ form the surface of an ellipsoid in $\mathbb{R}^m$ !

(2) If $r < n$, the same steps give an inequality instead, since the terms for $i > r$ drop out:

\begin{equation*}
  \Big( \frac{\gamma_1}{\sigma_1} \Big)^2 + \cdots + \Big( \frac{\gamma_r}{\sigma_r} \Big)^2 \le 1
\end{equation*}

where the inequality corresponds to a solid ellipsoid in $\mathbb{R}^m$ !

Furthermore, we can view each of the matrix transformations in the SVD separately:

  1. $V^T$ is an orthogonal matrix, thus it maps the unit sphere onto itself (but expressed in a different basis)
  2. the $m \times n$ matrix $\Sigma$ first collapses $n - r$ dimensions of the unit sphere, leaving a unit sphere of $r$ dimensions, and then "distorts" it into an ellipsoid
  3. $U$ is orthogonal, aligning the axes of this ellipsoid with the orthonormal basis vectors $\mathbf{u}_1, \dots, \mathbf{u}_r$ in $\mathbb{R}^m$ (see the sketch below)
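A small numerical sketch of the ellipsoid equation above, using an arbitrary full-column-rank example matrix and a random point on the unit sphere:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 3))               # full column rank, so r = n
U, sigma, Vt = np.linalg.svd(A)

x = rng.standard_normal(3)
x /= np.linalg.norm(x)                        # point on the unit sphere

gamma = sigma * (Vt @ x)                      # gamma_i = sigma_i v_i^T x
print(np.allclose(A @ x, U[:, :3] @ gamma))            # A x = sum_i gamma_i u_i
print(np.isclose(np.sum((gamma / sigma) ** 2), 1.0))   # the ellipsoid equation
```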

Least squares solution for dependent column-vectors

The least squares problem $A \mathbf{x} = \mathbf{b}$ has a unique least squares solution $\bar{\mathbf{x}}$ of minimal length that is given by:

\begin{equation*}
  \bar{\mathbf{x}} = A^+ \mathbf{b} \quad \text{where } A^+ = V \Sigma^+ U^T
\end{equation*}

See page 627 of "Linear Algebra: A Modern Introduction" :)
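A minimal numpy sketch of $\bar{\mathbf{x}} = V \Sigma^+ U^T \mathbf{b}$ for a deliberately rank-deficient example matrix, compared against np.linalg.pinv:

```python
import numpy as np

# A has two dependent columns by construction, so it is rank-deficient.
rng = np.random.default_rng(5)
a = rng.standard_normal(5)
A = np.column_stack([a, 2 * a])
b = rng.standard_normal(5)

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Sigma+: invert the nonzero singular values, leave the zero ones at zero.
tol = max(A.shape) * np.finfo(float).eps * sigma.max()
sigma_plus = np.array([1.0 / s if s > tol else 0.0 for s in sigma])

x_bar = Vt.T @ (sigma_plus * (U.T @ b))            # x_bar = V Sigma+ U^T b
print(np.allclose(x_bar, np.linalg.pinv(A) @ b))   # matches numpy's pinv
```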

Determinant

Trace is the sum of the eigenvalues

Let $\mathbf{A}$ be a matrix, then

\begin{equation*}
\begin{split}
  \text{tr} (\mathbf{A}) &= \lambda_1 + \lambda_2 + \dots + \lambda_n \\
  &= \sum_{i=1}^{n} \lambda_i
\end{split}
\end{equation*}

We start with the characteristic equation

\begin{equation*}
\det(\mathbf{A} - \lambda \mathbf{I}) = 0
\end{equation*}

We then notice that

\begin{equation*}
\det(\mathbf{A} - \lambda \mathbf{I}) = (-1)^n \big( \lambda^n - \text{tr} (\mathbf{A}) \lambda^{n-1} + \dots + (-1)^n \det \mathbf{A} \big)
\end{equation*}

which can be seen by expanding the determinant along the first row:

\begin{equation*}
\begin{split}
  \begin{vmatrix}
  a_{11} - \lambda & \dots & a_{1n} \\
  \vdots & \ddots & \vdots \\
  a_{n1} & \dots & a_{nn} - \lambda
\end{vmatrix} =& (a_{11} - \lambda) \begin{vmatrix}
       a_{22} - \lambda & a_{23} & \dots & a_{2n} \\
       \vdots & \vdots & \ddots & \vdots \\
       a_{n2} & a_{n3} & \dots & a_{nn} - \lambda
     \end{vmatrix} \\
& - a_{12} \begin{vmatrix}
           a_{21} & a_{23} & \dots & a_{2n} \\
           \vdots & \vdots & \ddots & \vdots \\
           a_{n1} & a_{n3} & \dots & a_{nn} - \lambda
         \end{vmatrix}
+ \dots \\
  =& (a_{11} - \lambda) (a_{22} - \lambda) \cdots (a_{nn} - \lambda) + \dots
\end{split}
\end{equation*}

Here we see that the only "sub-determinant" which can contain a factor of $\lambda^{n-1}$ is the very first one, the one multiplied by $(a_{11} - \lambda)$: every other "sub-determinant" is multiplied by a constant $a_{1j}$ and omits both $a_{11} - \lambda$ and one further diagonal entry $a_{jj} - \lambda$, so it contributes at most a factor of $\lambda^{n-2}$. Hence the coefficient of $\lambda^{n-1}$ in $\det(\mathbf{A} - \lambda \mathbf{I})$ comes entirely from $(a_{11} - \lambda) \cdots (a_{nn} - \lambda)$ and equals $(-1)^{n-1} (a_{11} + \dots + a_{nn}) = (-1)^{n-1} \text{tr}(\mathbf{A})$.

Further, since the eigenvalues $\lambda_i$ are the roots of the characteristic polynomial,

\begin{equation*}
\begin{split}
  \det (\mathbf{A} - \lambda \mathbf{I}) &= (-1)^n (\lambda - \lambda_1) \cdots (\lambda - \lambda_n) \\
  &= (-1)^n \big( \lambda^n - (\lambda_1 + \dots + \lambda_n) \lambda^{n-1} + \dots + (-1)^n \lambda_1 \lambda_2 \cdots \lambda_n \big)
\end{split}
\end{equation*}

Comparing the coefficients of $\lambda^{n-1}$ in the two expansions gives $-\text{tr}(\mathbf{A}) = -(\lambda_1 + \dots + \lambda_n)$, i.e.

\begin{equation*}
\text{tr}(\mathbf{A}) = \lambda_1 + \dots + \lambda_n
\end{equation*}

as wanted.

Determinant is the product of the eigenvalues

Let $\mathbf{A}$ be a matrix, then

\begin{equation*}
\begin{split}
  \det(\mathbf{A}) &= \lambda_1 \lambda_2 \cdots \lambda_n \\
  &= \prod_{i=1}^n \lambda_i
\end{split}
\end{equation*}

Let $\mathbf{A}$ be a matrix. Then

\begin{equation*}
\det (\mathbf{A} - \lambda \mathbf{I}) = 0
\end{equation*}

defines the characteristic equation, which has the eigenvalues $\lambda_i$ as its roots, therefore

\begin{equation*}
\begin{split}
  \det (\mathbf{A} - \lambda \mathbf{I}) &= (-1)^n (\lambda - \lambda_1) \cdots (\lambda - \lambda_n) \\
  &= (-1)(\lambda - \lambda_1) \cdots (-1) (\lambda - \lambda_n) \\
  &= (\lambda_1 - \lambda) \cdots (\lambda_n - \lambda)
\end{split}
\end{equation*}

Since $\lambda$ is just a variable, we let $\lambda = 0$ and get

\begin{equation*}
\det(\mathbf{A}) = \lambda_1 \lambda_2 \cdots \lambda_n
\end{equation*}

as wanted.
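A quick numerical check of this identity and the trace identity from the previous subsection, on an arbitrary example matrix (the eigenvalues may be complex, so we compare real parts):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4))
eigvals = np.linalg.eigvals(A)

print(np.isclose(np.trace(A), eigvals.sum().real))        # trace = sum
print(np.isclose(np.linalg.det(A), eigvals.prod().real))  # det = product
```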

Cramer's Rule

Consider a system of $n$ linear equations in $n$ unknowns; in matrix form

\begin{equation*}
\mathbf{A} \mathbf{x} = \mathbf{b}
\end{equation*}

where the $n \times n$ matrix $\mathbf{A}$ has nonzero determinant, then

\begin{equation*}
x_i = \frac{\det \mathbf{A}_i}{\det \mathbf{A}}, \quad i = 1, \dots, n
\end{equation*}

where $\mathbf{A}_i$ is the matrix formed by replacing the i-th column of $\mathbf{A}$ by the column vector $\mathbf{b}$.
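A small sketch of Cramer's rule in numpy; cramer_solve is just an illustrative helper name, and for anything beyond tiny systems np.linalg.solve is the better tool:

```python
import numpy as np

def cramer_solve(A, b):
    """Solve A x = b by Cramer's rule: x_i = det(A_i) / det(A), where A_i
    is A with its i-th column replaced by b."""
    det_A = np.linalg.det(A)
    x = np.empty(len(b))
    for i in range(len(b)):
        A_i = A.copy()
        A_i[:, i] = b                 # replace the i-th column by b
        x[i] = np.linalg.det(A_i) / det_A
    return x

A = np.array([[2.0, 1.0, -1.0],
              [1.0, 3.0,  2.0],
              [0.0, 1.0,  4.0]])
b = np.array([3.0, 13.0, 11.0])
print(np.allclose(cramer_solve(A, b), np.linalg.solve(A, b)))  # True
```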

The reason why we can think of OLS as projection

The proof of the following theorem is really important in my opinion, as it provides an "intuitive" way of viewing least-squares approximations.

The Best Approximation Theorem says the following:

If $W$ is a finite-dimensional subspace of an inner product space $V$ and if $\mathbf{v}$ is a vector in $V$, then $\text{proj}_W(\mathbf{v})$ is the best approximation to $\mathbf{v}$ in $W$.

Let $\mathbf{w}$ be a vector in $W$ different from $\text{proj}_W(\mathbf{v})$. Then $\text{proj}_W(\mathbf{v}) - \mathbf{w}$ is also in $W$, so $\mathbf{v} - \text{proj}_W(\mathbf{v}) = \text{perp}_W(\mathbf{v})$ is orthogonal to $\text{proj}_W(\mathbf{v}) - \mathbf{w}$.

Pythagoras' Theorem now implies that

\begin{equation*}
\begin{split}
|| \mathbf{v} - \text{proj}_W(\mathbf{v}) ||^2 + || \text{proj}_W(\mathbf{v}) - \mathbf{w} ||^2 &= || \big( \mathbf{v} - \text{proj}_W(\mathbf{v}) \big) + \big( \text{proj}_W(\mathbf{v}) - \mathbf{w} \big) ||^2 \\
  &= || \mathbf{v} - \mathbf{w} ||^2
\end{split}
\end{equation*}

But $\mathbf{w} \ne \text{proj}_W(\mathbf{v}) \implies || \text{proj}_W(\mathbf{v}) - \mathbf{w} || > 0$, so

\begin{equation*}
|| \mathbf{v} - \text{proj}_W(\mathbf{v}) ||^2 < || \mathbf{v} - \text{proj}_W(\mathbf{v}) ||^2 + || \text{proj}_W(\mathbf{v}) - \mathbf{w} ||^2 = || \mathbf{v} - \mathbf{w} ||^2
\end{equation*}

i.e. $|| \mathbf{v} - \text{proj}_W(\mathbf{v}) || < || \mathbf{v} - \mathbf{w} ||$ and thus $\text{proj}_W(\mathbf{v})$ is the best approximation.
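A minimal numpy sketch tying this back to OLS for an arbitrary example system: the least squares fit $A \bar{\mathbf{x}}$ is the projection of $\mathbf{b}$ onto $W = \text{col}(A)$, so the residual is orthogonal to $W$ and every other vector in $W$ approximates $\mathbf{b}$ worse.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 2))
b = rng.standard_normal(6)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)   # least squares solution
proj_b = A @ x_hat                              # proj_W(b), W = col(A)
residual = b - proj_b                           # perp_W(b)

print(np.allclose(A.T @ residual, 0))           # residual orthogonal to col(A)

# Any other w in W (almost surely different from proj_b) approximates b worse.
w = A @ rng.standard_normal(2)
print(np.linalg.norm(b - proj_b) < np.linalg.norm(b - w))  # True
```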

Distance and Approximation

Applications

Function-approximation

Given a continuous function $f$ on an interval $[a, b]$ and a subspace $W$ of $\mathcal{C}[a, b]$, find the function "closest" to $f$ in $W$.

  • $\mathcal{C}[a, b]$ denotes the space of all continuous functions defined on the domain $[a, b]$
  • $W$ is a subspace of $\mathcal{C}[a, b]$

The problem is analogous to least squares fitting of data points, except we have infinitely many data points, namely the points on the graph of the function $f$.

What should "approximate" mean in this context? Best Approximation Theorem has the answer!

The given function $f$ lives in the vector space $\mathcal{C}[a, b]$ of continuous functions on the interval $[a, b]$. This is an inner product space, with inner product:

\begin{equation*}
  \langle f, g \rangle = \int_a^b f(x) g(x) dx
\end{equation*}

If $W$ is a finite-dimensional subspace of $\mathcal{C}[a, b]$, then the best approximation to $f$ in $W$ is given by the projection of $f$ onto $W$.

Furthermore, if $\{ \mathbf{u}_1, ..., \mathbf{u}_k \}$ is an orthogonal basis for $W$, then

\begin{equation*}
  \text{proj}_W(f) = \frac{\langle \mathbf{u}_1, f \rangle}{\langle \mathbf{u}_1, \mathbf{u}_1 \rangle} \mathbf{u}_1 + ... + \frac{\langle \mathbf{u}_k, f \rangle}{\langle \mathbf{u}_k, \mathbf{u}_k \rangle} \mathbf{u}_k
\end{equation*}

The error from this approximation is just

\begin{equation*}
  || f - g || = \sqrt{\int_a^b \big( f(x) - g(x) \big)^2 dx}
\end{equation*}

and is often called the root mean square error. For functions we can think of this as the area between the graphs.
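A small sketch of this projection, assuming as an example $f(x) = e^x$ on $[-1, 1]$ and the subspace $W = \text{span}\{1, x\}$ (for this inner product $\{1, x\}$ happens to be orthogonal); the integrals are evaluated with scipy.integrate.quad:

```python
import numpy as np
from scipy.integrate import quad

f = np.exp                                   # example function f(x) = e^x
u = [lambda x: 1.0, lambda x: x]             # orthogonal basis of W on [-1, 1]

def inner(g, h, a=-1.0, b=1.0):
    """Inner product <g, h> = integral of g(x) h(x) over [a, b]."""
    return quad(lambda x: g(x) * h(x), a, b)[0]

# Coefficients <u_i, f> / <u_i, u_i> of proj_W(f).
coeffs = [inner(ui, f) / inner(ui, ui) for ui in u]
proj = lambda x: coeffs[0] + coeffs[1] * x   # best approximation to f in W

# Root mean square error || f - proj_W(f) ||.
rms = np.sqrt(quad(lambda x: (f(x) - proj(x)) ** 2, -1.0, 1.0)[0])
print(np.round(coeffs, 4), round(rms, 4))    # coeffs ~ [1.1752, 1.1036], rms ~ 0.23
```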

Fourier approximation

A function of the form

\begin{equation*}
  p(x) = a_0 + a_1 \cos x + ... + a_n \cos nx + b_1 \sin x + ... + b_n \sin nx
\end{equation*}

is called a trigonometric polynomial.

We restrict our attention to the space of continuous functions on the interval $[- \pi, \pi ]$, i.e. $\mathcal{C}[-\pi, \pi]$, with the inner product

\begin{equation*}
  \langle f, g \rangle = \int_{-\pi}^\pi f(x) g(x) dx
\end{equation*}

Since the trigonometric polynomials are linear combinations of the set

\begin{equation*}
  \mathcal{B} = \{ 1, \cos x, ..., \cos nx, \sin x, ..., \sin nx \}
\end{equation*}

the best approximation to a function $f$ in $\mathcal{C}[-\pi, \pi]$ by trigonometric polynomials of order $n$ is therefore $\text{proj}_W(f)$, where $W = \text{span}(\mathcal{B})$. It turns out that $\mathcal{B}$ is an orthogonal set, hence a basis for $W$.

The approximation to $f$ using trigonometric polynomials is called the n-th order Fourier approximation to $f$ on $[-\pi, \pi]$, and the coefficients are known as the Fourier coefficients of $f$.
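A small sketch computing the n-th order Fourier approximation from the projection formula, assuming as an example $f(x) = x$ on $[-\pi, \pi]$; for this $f$ the classical coefficients are $a_k = 0$ and $b_k = 2(-1)^{k+1}/k$, which the code recovers numerically:

```python
import numpy as np
from scipy.integrate import quad

f = lambda x: x                               # example function
n = 3                                         # order of the approximation

def inner(g, h):
    """Inner product <g, h> on C[-pi, pi]."""
    return quad(lambda x: g(x) * h(x), -np.pi, np.pi)[0]

# The orthogonal set B = {1, cos x, ..., cos nx, sin x, ..., sin nx}
# (k is captured as a default argument in each lambda).
basis = [lambda x: 1.0]
basis += [lambda x, k=k: np.cos(k * x) for k in range(1, n + 1)]
basis += [lambda x, k=k: np.sin(k * x) for k in range(1, n + 1)]

# Fourier coefficients <u, f> / <u, u> for each u in B.
coeffs = [inner(u, f) / inner(u, u) for u in basis]

def fourier_approx(x):
    return sum(c * u(x) for c, u in zip(coeffs, basis))

print(np.round(coeffs, 4))     # ~ [0, 0, 0, 0, 2, -1, 0.6667]
print(fourier_approx(1.0))     # approximation to f(1) = 1
```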

Rayleigh principle

A lot of the stuff here I got from these lecture notes.

Notation

  • $R(x)$ denotes the Rayleigh quotient of some Hermitian / real symmetric matrix $A$
  • $\Lambda$ is the diagonal matrix with entries being eigenvalues of $A$
  • Eigenvalues are ordered as $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$
  • $Q^T$ is the orthogonal matrix with eigenvectors of $A$ as columns, so that $A = Q^T \Lambda Q$
  • $y = Q^T x$, thus $R(y) = R(Q^T x) = \lambda_1 x_1^2 + \dots + \lambda_n x_n^2$ when $x_1^2 + \dots + x_n^2 = 1$
  • $z' = (a_1, a_2, \dots, a_n)$

Definition

For a given complex Hermitian matrix $A$ and non-zero vector $x$, the Rayleigh quotient is defined as:

\begin{equation*}
R(A, x) = \frac{ x^* A x}{x^* x}
\end{equation*}

In the real case, where $A$ is symmetric, this becomes

\begin{equation*}
R(A, x) = \frac{x^T A x}{x^T x}
\end{equation*}

Further, in general,

\begin{equation*}
R(A, B; x) = \frac{x^* A x}{x^* B x}
\end{equation*}

is the generalized Rayleigh quotient, where $B$ is a positive-definite Hermitian matrix.

The generalized Rayleigh quotient can be reduced to $R(D, C^* x)$ through the transformation

\begin{equation*}
D = C^{-1} A C^{*^{-1}}
\end{equation*}

where $CC^*$ is the Cholesky decomposition of the Hermitian positive-definite matrix $B$.
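A small numpy sketch of this reduction for an arbitrary real symmetric $A$ and positive-definite $B$: with $B = C C^T$ and $D = C^{-1} A C^{-T}$, the generalized quotient at $x$ equals the ordinary quotient of $D$ at $C^T x$.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 4
M1 = rng.standard_normal((n, n))
A = (M1 + M1.T) / 2                          # real symmetric
M2 = rng.standard_normal((n, n))
B = M2 @ M2.T + n * np.eye(n)                # symmetric positive-definite

C = np.linalg.cholesky(B)                    # B = C C^T
D = np.linalg.solve(C, A) @ np.linalg.inv(C).T   # D = C^{-1} A C^{-T}

x = rng.standard_normal(n)
R_gen = (x @ A @ x) / (x @ B @ x)            # generalized Rayleigh quotient
y = C.T @ x                                  # the transformed vector C^* x
R_std = (y @ D @ y) / (y @ y)                # ordinary Rayleigh quotient of D
print(np.isclose(R_gen, R_std))              # True
```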

Minimization

Personal explanation

\begin{equation*}
R(x) = \frac{x^T A x}{x^T x}
\end{equation*}

For any candidate $x$, the denominator is just some positive constant,

\begin{equation*}
x^T x = x_1^2 + \dots + x_n^2 = c^2 \in \mathbb{R}
\end{equation*}

But observe that for whatever $c$ we have here, we have

\begin{equation*}
R(x) = \frac{(x / c)^T A (x / c)}{(x / c)^T (x / c)} = \frac{x^T A x}{x^T x}
\end{equation*}

i.e. it's invariant to scaling.

Further, since $A$ is symmetric, we have the diagonalization

\begin{equation*}
A = Q^T \Lambda Q
\end{equation*}

which means that if we let

\begin{equation*}
y = Q^T x
\end{equation*}

for some $x \in \mathbb{R}^n$, we get

\begin{equation*}
R(y) = \frac{(Q^T x)^T A (Q^T x)}{(Q^T x)^T  (Q^T x)} = \frac{(x^T Q) (Q^T \Lambda Q) (Q^T x) }{x^T (Q Q^T) x} = \frac{x^T \Lambda x}{x^T x} = \lambda_1 x_1^2 + \dots + \lambda_n x_n^2
\end{equation*}

where we've applied the constraint $x_1^2 + \dots + x_n^2 = 1$, which we may do since, again, $R$ is invariant under scaling.

Therefore, we're left with the following optimization problem:

\begin{equation*}
\begin{split}
  \min_y R(y) = \min_x R(Q^T x) &= \min_x \lambda_1 x_1^2 + \dots + \lambda_n x_n^2 \\
  \text{subject to} & \quad x_1^2 + \dots + x_n^2 = 1
\end{split}
\end{equation*}

Since we assume that $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$, the minimum is clearly attained at

\begin{equation*}
x = (1, 0, 0, \dots, 0)
\end{equation*}

Hence,

\begin{equation*}
\min_x R(Q^T x) = \lambda_1 \quad \implies \quad y = Q^T (1, 0, 0, \dots, 0) = q_1
\end{equation*}

where $q_1$ denotes the corresponding eigenvector.

Now suppose we'd like to find the minimum over vectors NOT in the space spanned by the first eigenvector; then we only consider the subspace

\begin{equation*}
y^T q_1 = 0
\end{equation*}

Thus, the optimization problem is

\begin{equation*}
\begin{split}
  \min_{y: \ y^T q_1 = 0} R(y) = \min_{x: \ x^T Q q_1 = 0} R(Q^T x) &= \min_{x: \ x^T (Q q_1) = 0 } \lambda_1 x_1^2 + \dots + \lambda_n x_n^2 \\
  \text{subject to} & \quad x_1^2 + \dots + x_n^2 = 1
\end{split}
\end{equation*}

Finally, observe that $Q q_1 = (1, 0, 0, \dots, 0)$, hence

\begin{equation*}
\begin{split}
  \min_{x: \ x^T e_1 = 0} & \lambda_1 x_1^2 + \dots + \lambda_n x_n^2 \\
  \text{subject to} & \quad x_1^2 + \dots + x_n^2 = 1
\end{split}
\end{equation*}

And the subspace $x^T e_1 = 0$ is just all $x = (0, x_2, x_3, \dots, x_n)$, giving us

\begin{equation*}
\min_{x: \ x^T e_1 = 0} R(Q^T x) = \lambda_2 \quad \implies \quad y = Q^T e_2 = q_2
\end{equation*}

And if we just keep going, restricting to the orthogonal complement of the eigenvectors found so far, heyooo, we get all our eigenvalues and eigenvectors!
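A numerical sketch of this argument for an arbitrary symmetric example matrix: the Rayleigh quotient over random unit vectors never drops below $\lambda_1$ (and equals it at $q_1$), and on the hyperplane orthogonal to $q_1$ it never drops below $\lambda_2$.

```python
import numpy as np

rng = np.random.default_rng(8)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                              # real symmetric example

R = lambda x: (x @ A @ x) / (x @ x)            # Rayleigh quotient

eigvals, Q = np.linalg.eigh(A)                 # ascending eigenvalues
q1 = Q[:, 0]                                   # eigenvector for lambda_1

samples = rng.standard_normal((10000, 5))
print(min(R(x) for x in samples) >= eigvals[0] - 1e-12)    # True
print(np.isclose(R(q1), eigvals[0]))                       # R(q1) = lambda_1

# Project the samples onto the hyperplane x^T q1 = 0: the quotient is now
# bounded below by lambda_2, attained at the second eigenvector.
projected = samples - np.outer(samples @ q1, q1)
print(min(R(x) for x in projected) >= eigvals[1] - 1e-12)  # True
print(np.isclose(R(Q[:, 1]), eigvals[1]))                  # R(q2) = lambda_2
```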

Stuff

We're going to assume $A$ is real.

Our objective is:

\begin{equation*}
\min_x R(A, x)
\end{equation*}

Observe the following:

  • Objective is invariant to scaling, since we'd be scaling the denominator too
  • $A$ is by assumption symmetric: all real eigenvalues

Thus, we have

\begin{equation*}
A = Q^T \Lambda Q
\end{equation*}

Further,

\begin{equation*}
\big( Q x \big)^T \big( Q x \big) = || Q x ||^2 = || x ||^2 = x^T x
\end{equation*}

Hence,

\begin{equation*}
R(Q^T x) = \frac{\big( Q Q^T x \big)^T \Lambda \big( Q Q^T x \big)}{\big( Q Q^T x \big)^T \big( Q Q^T x \big)} = \frac{x^T \Lambda x}{x^T x} = \frac{\lambda_1 x_1^2 + \dots + \lambda_n x_n^2}{x_1^2 + \dots + x_n^2}
\end{equation*}

Thus, by letting $y = Q^T x$, we get

\begin{equation*}
R(y) = \frac{\lambda_1 x_1^2 + \dots + \lambda_n x_n^2}{x_1^2 + \dots + x_n^2}
\end{equation*}

Since this quotient is invariant under scaling, we frame it as follows:

\begin{equation*}
R(y) = \lambda_1 x_1^2 + \dots + \lambda_n x_n^2 \quad \text{where} \quad x_1^2 + \dots + x_n^2 = 1
\end{equation*}

This implies that $R(y) \in [\lambda_1, \lambda_n]$ (seen from $x = (1, \dots, 0)$ and $x = (0, \dots, 1)$), since we assume the eigenvalues to be ordered.

Hence, we have proved what's called the Rayleigh Principle:

\begin{equation*}
\min_x R(x) = \lambda_1
\end{equation*}

where $\lambda_1$ is the smallest eigenvalue of $A$.

Planar Slices

Suppose we'd like to find the answer to the following:

\begin{equation*}
\min_{x^T z = 0} R(x)
\end{equation*}

i.e. minimizing the Rayleigh quotient over some hyperplane defined by $x^T z = 0$.

This minimum is bounded below as follows:

\begin{equation*}
\min_x R(x) = \lambda_1 \le \min_{x: \ x^T z = 0} R(x)
\end{equation*}

Further, we can obtain an upper bound for this by considering $R(y)$ again:

\begin{equation*}
\min_{y: \ y^T z = 0} R(y) = \min_{x: \ x^T (Qz) = 0} R(Q^T x) = \min_{x: \ x^T (Qz) = 0} \frac{\lambda_1 x_1^2 + \dots + \lambda_n x_n^2}{x_1^2 + \dots + x_n^2}
\end{equation*}

By scaling we can assume $x_1^2 + \dots + x_n^2 = 1$, giving us the optimization problem (writing $z' = Qz$)

\begin{equation*}
\begin{split}
  \min_{x: \ x^T (z') = 0} & \lambda_1 x_1^2 + \dots + \lambda_n x_n^2 \\
  \text{subject to} \quad & x_1^2 + \dots + x_n^2 = 1
\end{split}
\end{equation*}

There is at least one vector of the form

\begin{equation*}
x = (x_1, x_2, 0, 0, \dots, 0) \quad \text{such that} \quad x^T (z') = 0, \quad x_1^2 + x_2^2 = 1
\end{equation*}

If $z' = (a_1, \dots, a_n)$, then the above translates to

\begin{equation*}
x_1 a_1 + x_2 a_2 = 0, \quad x_1^2 + x_2^2 = 1
\end{equation*}

Assuming $(a_1, a_2) \ne (0, 0)$ and, say, $a_2 \ne 0$, the first constraint can be represented as:

\begin{equation*}
x_2 = - \frac{a_1}{a_2} x_1
\end{equation*}

which is just a line through the origin. The second constraint is the unit-circle. Thus, there are only two points where both these constraints are satisfied (a line crosses the circle in two places). Let $(w_1, w_2)$ be one of these points of intersection, then

\begin{equation*}
\min_{x: \ x^T z = 0} R(x) \le \lambda_1 w_1^2 + \lambda_2 w_2^2 \le \lambda_2, \quad w_1, w_2 : \ w_1^2 + w_2^2 = 1
\end{equation*}

Which means we have bounded the minimum of $R(x)$ on the hyperplane $x^T z = 0$! This becomes incredibly useful when we have a look at the Minimax Principle for the Second Smallest Eigenvalue.

Minimax Principle for the Second Smallest Eigenvalue

This is a slightly different way of viewing how to find the second eigenvalue, and poses the "answer" in a frame where we might have asked the following question:

"Can we obtain $\lambda_2$ without finding $\lambda_1$?"

And ends with the answer:

\begin{equation*}
\max_z \min_{x^T z = 0} R(x) = \lambda_2
\end{equation*}

which is very useful computationally, as we can alternate between minimizing and maximizing $R(x)$ to obtain $\lambda_2$.

What if we could attain the maximum in this equation? Then we would know that maximizing, over $z$, the minimum of $R(x)$ on the hyperplane $x^T z = 0$ gives us the second smallest eigenvalue $\lambda_2$!

Or equivalently, for the maximizing $z$, the minimum on the restricted hyperplane $x^T z = 0$ is exactly $\lambda_2$!

Thus, if we can somehow force $w_1 = 0$ and $w_2 = \pm 1$, the upper bound becomes exactly $\lambda_2$. If we had $z' = (1, 0, 0, \dots, 0)$, then the only solutions to the two constraints would be:

\begin{equation*}
(x_1, x_2) = (0, \pm 1) \quad \implies \quad R(y) = \lambda_2
\end{equation*}

To show that $\min_{x: \ x^T z' = 0} R(Q^T x) = \lambda_2$ in this case, note that the constraint $x^T z' = 0$ just says $x_1 = 0$, i.e. we only consider

\begin{equation*}
\mathcal{X} := \{ x \mid x = (0, x_2, \dots, x_n), \ x_2^2 + \dots + x_n^2 = 1 \}
\end{equation*}
\begin{equation*}
\begin{split}
  \min_\mathcal{X} R(x) &= \min_\mathcal{X} \lambda_1 x_1^2 + \lambda_2 x_2^2 + \dots + \lambda_n x_n^2 \\
  &= \min_{x_2^2 + x_3^2 + \dots + x_n^2 = 1} \lambda_2 x_2^2 + \dots + \lambda_n x_n^2
\end{split}
\end{equation*}

The minimum is clearly attained when $x_2 = \pm 1$ and $x_3 = \dots = x_n = 0$, giving the value $\lambda_2$, since $\lambda_2 \le \dots \le \lambda_n$.

Hence,

\begin{equation*}
\max_z \min_{x^T z = 0} R(x) = \lambda_2
\end{equation*}

Thus we've shown that min-maxing the function $R(x)$ is equivalent to finding the second smallest eigenvalue!

The condition $x^T z = 0$ means that $x$ lies in the orthogonal complement of $z$, which is an $(n-1)$-dimensional subspace. Thus, we may rewrite the above as

\begin{equation*}
\max_{\dim S = n - 1} \min_{x \in S} R(x) = \lambda_2
\end{equation*}

In fact, it turns out this works for general $j$ :

\begin{equation*}
\max_{\dim S = n - j} \min_{x \in S} R(x) = \lambda_{j + 1}
\end{equation*}

All this means that if we want to find some vector $x$ which minimizes the Rayleigh quotient, we simply compute the smallest eigenvalue of $A$ and take a corresponding eigenvector, giving us the solution!
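A numerical sketch of the minimax characterization for an arbitrary symmetric example (min_R_on_hyperplane is just an illustrative helper): for a fixed $z$, the inner minimum over the hyperplane $x^T z = 0$ is the smallest eigenvalue of $A$ restricted to the orthogonal complement of $z$; it never exceeds $\lambda_2$, and the choice $z = q_1$ attains it.

```python
import numpy as np

rng = np.random.default_rng(9)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                              # real symmetric example
eigvals, Q = np.linalg.eigh(A)                 # ascending eigenvalues

def min_R_on_hyperplane(z):
    """min of R(x) over the hyperplane x^T z = 0."""
    # Orthonormal basis of the orthogonal complement of z: drop the first
    # column of the Q factor of [z | I].
    B = np.linalg.qr(np.column_stack([z, np.eye(len(z))]))[0][:, 1:]
    # Restricting R to span(B) gives the Rayleigh quotient of B^T A B.
    return np.linalg.eigvalsh(B.T @ A @ B)[0]

# For arbitrary z the inner minimum never exceeds lambda_2 ...
zs = rng.standard_normal((200, 5))
print(max(min_R_on_hyperplane(z) for z in zs) <= eigvals[1] + 1e-12)  # True
# ... and the choice z = q_1 attains it.
print(np.isclose(min_R_on_hyperplane(Q[:, 0]), eigvals[1]))           # True
```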