Factor-based models

Principal Component Analysis (PCA)

Notation

  • $\mathbf{v}_i$ and $v_i$ denote the i-th eigenvector
  • $\Sigma$ denotes the covariance matrix
  • $\mathbf{V}$ is the eigenmatrix

Stuff

Principal Component Analysis (PCA) corresponds to finding the eigenvectors and eigenvalues of the covariance matrix of the (column-centered, i.e. $\mu_i = 0$ for each column $i$) data. We can do this since the covariance matrix is the positive semi-definite matrix given by $C = \frac{1}{n - 1} X^T X$, where:

  • $X$ is the $n \times p$ (centered) data matrix
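
As a quick sanity check, here is a minimal R sketch (simulated data, my own variable names) showing that this formula agrees with R's built-in cov once the columns have been centered:

set.seed(42)
X <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)
Xc <- scale(X, center = TRUE, scale = FALSE)    # subtract the column means

C <- t(Xc) %*% Xc / (nrow(X) - 1)               # C = X^T X / (n - 1), X centered
all.equal(C, cov(X), check.attributes = FALSE)  # TRUE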

Think about it in this way:

  • The eigenvectors of the covariance matrix of the variables / features form an orthogonal basis, i.e. they give us a way of expressing the variance along directions that are uncorrelated with each other.
  • For each eigenvector / basis $v_i$, the greater the corresponding eigenvalue $\lambda_i$, the greater the explained variance.
  • By projecting the data $X$ onto this basis, simply by taking the matrix product $XV$, where $V$ is the matrix with columns corresponding to the eigenvectors of the $p \times p$ covariance matrix $C = \frac{1}{n - 1}  X^T X$, we have a new "view" of the features where

    $x_i^T V = [x_i^T v_1, ..., x_i^T v_p], \quad \forall i \in \{1, ..., n\}$ i.e. we are now looking at each data point transformed to a space whose basis vectors are orthogonal, each explaining some of the variance of the features in the data.

    Now we make another observation:

    1. Take the Singular Value Decomposition (SVD) of $X$:

       \begin{equation*}
       X = U \Sigma V^T
       \end{equation*}

    2. Then,

       \begin{equation*}
       \begin{split}
       C &= \frac{1}{n - 1} X^T X = \frac{1}{n - 1} (U \Sigma V^T)^T (U \Sigma V^T) \\
       &= \frac{1}{n - 1} V \Sigma U^T U \Sigma V^T \\
       &= V \frac{\Sigma^2}{n - 1} V^T
       \end{split}
       \end{equation*}

    Thus we actually only need to compute the SVD of the data matrix $X$ to obtain the eigenvectors and eigenvalues of the covariance matrix $C$, without ever forming $C$ explicitly (see the R sketch after this list)!

    What does the eigenvalue $\lambda_i$ actually tell us?

    • $C \mathbf{v}_i = \lambda_i \mathbf{v}_i$ => $CV = [\lambda_1 \mathbf{v}_1, ..., \lambda_p \mathbf{v}_p]$, where $\lambda_i = \frac{\sigma_i^2}{n - 1}$ and $\sigma_i = \Sigma_{ii}$ is the i-th singular value of $X$. In other words, $\lambda_i$ is the variance of the data along the direction $\mathbf{v}_i$.
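
A minimal R sketch (simulated data, my own variable names) checking this equivalence: the eigenvalues of $C$ equal the squared singular values of the centered data matrix divided by $n - 1$, and both match the component variances reported by prcomp.

set.seed(1)
n <- 200; p <- 3
X <- matrix(rnorm(n * p), n, p) %*% matrix(c(2, 1, 0, 1, 1, 0, 0, 0, 0.5), p, p)

Xc <- scale(X, center = TRUE, scale = FALSE)  # center the columns

eig <- eigen(cov(X))       # eigendecomposition of the covariance matrix
s   <- svd(Xc)             # SVD of the centered data matrix

eig$values                 # lambda_i
s$d^2 / (n - 1)            # sigma_i^2 / (n - 1) -- same values
prcomp(X)$sdev^2           # variances of the principal components -- same again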

PCA maximum variance formulation using Lagrange multipliers

We want to find a projection of the data onto a subspace which maximizes the variance of the projected data, i.e. explains as much of the variance as possible.

  • Let $\mathbf{v}_1$ be the direction of this subspace, and we want $\mathbf{v}_1$ to be a unit vector, i.e. $\mathbf{v}_1^T \mathbf{v}_1 = 1$
  • Each data point $\mathbf{x}_i$ is then projected onto the scalar $\mathbf{v}_1^T \mathbf{x}_i$
  • The mean of the projected data is $\mathbf{v}_1^T \bar{\mathbf{x}}$
  • Variance of projected data is:

    \begin{equation*}
  \text{Var}(\mathbf{v}_1^T X) = \frac{1}{n} \sum_{i=1}^{n} \Big( \mathbf{v}_1^T \mathbf{x}_i - \mathbf{v}_1^T \bar{\mathbf{x}} \Big)^2 = \mathbf{v}_1^T \bigg( \frac{1}{n} \sum_{i=1}^{n} \big( \mathbf{x}_i - \bar{\mathbf{x}} \big) \big( \mathbf{x}_i - \bar{\mathbf{x}} \big)^T \bigg) \mathbf{v}_1 = \mathbf{v}_1^T \Sigma \mathbf{v}_1
\end{equation*}

We then want to maximize the variance of the projected data $\mathbf{v}_1^T \Sigma \mathbf{v}_1$ wrt. $\mathbf{v}_1$ under the constraint $\mathbf{v}_1^T \mathbf{v}_1 = 1$ (without the constraint the variance could be made arbitrarily large simply by scaling $\mathbf{v}_1$).

We then introduce the Lagrange multiplier $\lambda_1$ and define the unconstrained maximization of

\begin{equation*}
\max_{\mathbf{v}_1, \lambda_1} \Big\{ \mathbf{v}_1^T \Sigma \mathbf{v}_1 + \lambda_1 \big( 1 - \mathbf{v}_1^T \mathbf{v}_1 \big) \Big\}
\end{equation*}

Taking the derivative wrt. $\mathbf{v}_1$ and setting it to zero gives $2 \Sigma \mathbf{v}_1 - 2 \lambda_1 \mathbf{v}_1 = 0$, so the solution satisfies:

\begin{equation*}
\Sigma \mathbf{v}_1 = \lambda_1 \mathbf{v}_1
\end{equation*}

i.e. $\mathbf{v}_1$ must be an eigenvector of $\Sigma$ with eigenvalue $\lambda_1$.

We also observe that,

\begin{equation*}
\mathbf{v}_1^T \Sigma \mathbf{v}_1 = \lambda_1
\end{equation*}

due to $\mathbf{v}_1$ being an eigenvector of $\Sigma$. We call $\mathbf{v}_1$ the first principal component.

To get the rest of the principal components, we simply maximize the projected variance amongst all directions orthogonal to those already chosen, i.e. constrain the next direction $\mathbf{v}_2$ such that $\mathbf{v}_2^T \mathbf{v}_1 = 0$, and so on for the subsequent components.
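
A small R sketch (simulated data, hypothetical names) of the maximum-variance property: the variance of the data projected onto the first eigenvector equals $\lambda_1$, and no other unit direction does better.

set.seed(3)
n <- 500
X <- matrix(rnorm(n * 2), n, 2) %*% matrix(c(3, 1, 1, 1), 2, 2)  # correlated 2-D data

eig <- eigen(cov(X))
v1  <- eig$vectors[, 1]

var(X %*% v1)        # variance along v1 ...
eig$values[1]        # ... equals the largest eigenvalue lambda_1

# compare against many random unit directions: none exceeds lambda_1
theta <- runif(1000, 0, 2 * pi)
proj_var <- sapply(theta, function(a) var(X %*% c(cos(a), sin(a))))
max(proj_var) <= eig$values[1] + 1e-8  # TRUE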

TODO Figure out what the QR representation has to do with all of this

Whitening

Notation

  • $\mathbf{L} = \text{diag}(\lambda_1, \dots, \lambda_D)$
  • $\mathbf{V} = [\mathbf{v}_1 \ \dots \ \mathbf{v}_D]$ is the matrix with the normalized eigenvectors as columns, i.e. an orthogonal matrix
  • $\mathbf{y}_i = \mathbf{L}^{- 1 / 2} \mathbf{V}^T \big( \mathbf{x}_i - \bar{\mathbf{x}} \big)$ is the whitened data

Stuff

  • Can normalize the data to give zero-mean and unit-covariance (i.e. variables are decorrelated)

Consider the key eigenvalue problem in PCA in matrix form:

\begin{equation*}
\boldsymbol{\Sigma} \mathbf{V} = \mathbf{V} \mathbf{L}, \quad \mathbf{L} = \text{diag} \big( \lambda_1, \dots, \lambda_D \big), \quad \mathbf{V} = [\mathbf{v}_1 \ \dots \ \mathbf{v}_D] \ \text{(orthogonal)}
\end{equation*}

For each data point $\mathbf{x}_i$, we define a transformed value as:

\begin{equation*}
  \mathbf{y}_i = \mathbf{L}^{- 1 / 2} \mathbf{V}^T \Big( \mathbf{x}_i - \bar{\mathbf{x}} \Big)
\end{equation*}

Then the set $\{ \mathbf{y}_i \}$ has zero mean and its covariance is the identity:

\begin{equation*}
\begin{split}
  \frac{1}{n} \sum_{i=1}^{n} \mathbf{y}_i \mathbf{y}_i^T &= \frac{1}{n} \sum_{i=1}^{n} \mathbf{L}^{- 1 / 2} \mathbf{V}^T \textcolor{green}{\Big( \mathbf{x}_i - \bar{\mathbf{x}} \Big) \Big( \mathbf{x}_i - \bar{\mathbf{x}} \Big)^T} \mathbf{V} \mathbf{L}^{- 1 / 2} \\
  &= \mathbf{L}^{- 1 / 2} \mathbf{V}^T \textcolor{green}{\boldsymbol{\Sigma}} \mathbf{V} \mathbf{L}^{- 1 / 2} \\
  &= \mathbf{L}^{- 1 / 2} \mathbf{V}^T \textcolor{green}{\mathbf{V} \mathbf{L} \mathbf{V}^T} \mathbf{V} \mathbf{L}^{- 1 / 2} \\
  &= \mathbf{I}
\end{split}
\end{equation*}
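
A minimal R sketch of the whitening transform (simulated data, names are mine): after transforming, the sample mean is zero and the sample covariance is (numerically) the identity.

set.seed(4)
n <- 1000
X <- matrix(rnorm(n * 3), n, 3) %*% matrix(c(2, 1, 0, 1, 2, 1, 0, 1, 2), 3, 3)

Xc  <- scale(X, center = TRUE, scale = FALSE)  # x_i - x_bar, row-wise
eig <- eigen(cov(X))
V   <- eig$vectors                             # columns are the eigenvectors
L_inv_sqrt <- diag(1 / sqrt(eig$values))       # L^{-1/2}

Y <- Xc %*% V %*% L_inv_sqrt                   # y_i = L^{-1/2} V^T (x_i - x_bar)

round(colMeans(Y), 10)  # ~ 0
round(cov(Y), 10)       # ~ identity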

Bayesian PCA

Maximum likelihood PCA

Have a look at this paper! Sry mate..

Difference between PCA and OLS

set.seed(2)

x <- 1:100
epsilon <- rnorm(100, 0, 60)
y <- 20 + 3 * x + epsilon

plot(x, y)

yx.lm <- lm(y ~ x)  # linear model y ~ x
lines(x, predict(yx.lm), col="red")

xy.lm <- lm(x ~ y)  # the reverse regression x ~ y
lines(predict(xy.lm), y, col="blue")

# normalize means and cbind together
xyNorm <- cbind(x = x - mean(x), y = y - mean(y))
plot(xyNorm)

# covariance of the centered data
xyCov <- cov(xyNorm)
xyEigen <- eigen(xyCov)        # compute the eigendecomposition once
eigenValues <- xyEigen$values
eigenVectors <- xyEigen$vectors

plot(xyNorm, ylim=c(-200, 200), xlim=c(-200, 200))
# draw each eigenvector as a line through the origin (slope = v_y / v_x)
lines(xyNorm[, 1], eigenVectors[2, 1] / eigenVectors[1, 1] * xyNorm[, 1])
lines(xyNorm[, 1], eigenVectors[2, 2] / eigenVectors[1, 2] * xyNorm[, 1])

# the largest eigenValue is the first one
# so that's our principal component.
# But the principal component is in normalized terms (mean = 0)
# and we want it back in real terms like our starting data
# so let's denormalize it

plot(x, y)
lines(x, (eigenVectors[2, 1] / eigenVectors[1, 1] * xyNorm[, 1]) + mean(y))

# what if we bring back our other two regressions?
lines(x, predict(yx.lm), col="red")
lines(predict(xy.lm), y, col="blue")

Factor Analysis (FA)

Overview

Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a smaller number of unobserved variables called factors.

It searches for joint variations in response to unobserved latent variables.

Notation

  • set of $p$ observable random variables $x_1, x_2, \dots, x_p$ with means $\mu_1, \mu_2, \dots, \mu_p$
  • $F_j$ are common factors because they influence all observed random variables, or in vector notation $\mathbf{F}$
  • $F$ denotes the $k \times n$ matrix which represents our samples for the rvs. $\mathbf{F}$
  • $k$ denotes the number of common factors, $k < p$
  • $L$ is called the loading matrix
  • $\varepsilon_i$ is an unobserved stochastic error term with zero mean and finite variance; $\Psi = \text{Cov}(\boldsymbol{\varepsilon})$ denotes the (diagonal) covariance matrix of the errors

Definition

We suppose that for the set of $p$ observable random variables $x_1, x_2, \dots, x_p$ with means $\mu_1, \mu_2, \dots, \mu_p$, we have the following model:

\begin{equation*}
x_i - \mu_i = l_{i1} F_1 + \dots + l_{ik} F_k + \varepsilon_i
\end{equation*}

or in matrix terms,

\begin{equation*}
\mathbf{x} - \boldsymbol{\mu} = L \mathbf{F} + \boldsymbol{\varepsilon}
\end{equation*}

where, if we have $n$ observations, the dimensions are $x_{p \times n}, L_{p \times k}, F_{k \times n}$. (Notice that here $F$ is a matrix, NOT a vector, but in our model $\mathbf{F}$ is a vector. The matrix is for fitting; $\mathbf{F}$ is the random variable we're modelling. Confusing notation, yeah I know.)

We impose the following conditions on $\mathbf{F}$ :

  1. $\mathbf{F}$ and $\boldsymbol{\varepsilon}$ are independent
  2. $\mathbb{E}[\mathbf{F}] = 0$
  3. $\text{Cov}(\mathbf{F}) = I$ (to make sure that the factors are uncorrelated, as we hypothesize)

Let $\Sigma = \text{Cov}(\mathbf{x})$. Then, from the conditions just imposed on $\mathbf{F}$, we have

\begin{equation*}
\begin{split}
  \text{Cov}(\mathbf{x} - \boldsymbol{\mu}) &= \text{Cov}(L \mathbf{F} + \boldsymbol{\varepsilon}) \\
  \Sigma &= L \text{Cov}(\mathbf{F}) L^T + \text{Cov}(\boldsymbol{\varepsilon}) \\
  \Sigma &= L L^T + \Psi
\end{split}
\end{equation*}

In words

What we're saying here is that we believe the data can be described by some linear (lower-dimensional) subspace spanned by the common factors $\mathbf{F}$, and we attempt to find the best components for producing this fit. The matrix $L$ describes the coefficients for the $p$ observed rvs. when projecting these onto the common factors.

This is basically a form of linear regression, but instead of taking the observed rvs. $x_i$ and directly finding a subspace to project the data onto in order to predict some target variable, we instead make the observed "features" $\mathbf{x}$ the target and hypothesize that there exist some $k$ common factors which produce the covariance seen between the observed variables.

In a way, it's very similar to PCA, but see PCA vs. Factor analysis for more on that.
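
As a rough illustration in R (simulated data; the particular numbers are not meaningful), base R's factanal fits this model by maximum likelihood on the correlation matrix, so the estimated loadings and uniquenesses should approximately reproduce the observed correlation matrix as $L L^T + \Psi$:

set.seed(5)
n <- 500; p <- 6; k <- 2

Fm  <- matrix(rnorm(n * k), n, k)            # common factors, Cov(F) = I
Lo  <- matrix(runif(p * k, 0.3, 0.9), p, k)  # "true" loadings
eps <- matrix(rnorm(n * p, sd = 0.5), n, p)  # unique errors
X   <- Fm %*% t(Lo) + eps                    # x - mu = L F + eps, row-wise

fa   <- factanal(X, factors = k)
Lhat <- fa$loadings                          # estimated p x k loading matrix
Psi  <- diag(fa$uniquenesses)                # estimated diagonal Psi

# Sigma ~= L L^T + Psi, here on the correlation scale; residuals should be small
round(cor(X) - (Lhat %*% t(Lhat) + Psi), 2)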

Exploratory factor analysis (EFA)

Same as FA, but does not make any prior assumptions about the relationships among the factors themselves, whereas FA assumes them to be independent.

PCA vs. Factor analysis

PCA does not account for inherent random error

In PCA, 1s are put on the diagonal of the correlation matrix, meaning that all of the variance is to be accounted for (including variance unique to each variable, variance common among variables, and error variance).

In EFA (Exploratory Factor Analysis), the communalities are put on the diagonal, meaning that only the variance shared with other variables is accounted for (excluding variance unique to each variable and error variance). That, therefore, by definition, includes only variance that is common among the variables.

Summary

  • PCA is simply a variable reduction technique; FA makes the assumption that an underlying causal model exists
  • PCA results in components that account for a maximal amount of the variance of the observed variables; FA accounts for the variance shared between the observed variables in the data.
  • PCA inserts ones on the diagonals of the correlation matrix; FA adjusts the diagonals of the correlation matrix with the unique factors.
  • PCA minimizes the sum of squared perpendicular distance to the component axis; FA estimates factors which influence responses on observed variables.
  • The component scores in PCA represent a linear combination of the observed variables weighted by the eigenvectors; the observed variables in FA are linear combinations of the underlying common factors plus a unique error term.

Independent Component Analysis (ICA)

Overview

  • Method for separating a multivariate signal into additive subcomponents
  • Assumes subcomponents are non-Gaussian signals

Assumptions

  1. Different factors are independent of each other (in a probabilistic sense)
  2. Values in each factor have non-Gaussian distributions

Defining independence

We want to maximize the statistical independence between the factors. We may choose one of many ways to define a proxy for independence, the two broadest being:

  1. Minimization of mutual information
  2. Maximization of non-Gaussianity

Definition

In a Linear Noiseless ICA we assume components $x_i$ of an observed random vector $\mathbf{x} = (x_1, \dots, x_m)^T$ are generated as a sum of $n$ (statistically) independent components $s_j$, i.e.

\begin{equation*}
x_i = a_{i, 1} s_1 + \dots + a_{i, n} s_n = \sum_{j=1}^{n} a_{i, j} s_j
\end{equation*}

for some $\{ a_{i, j} \}$.

In matrix notation,

\begin{equation*}
\mathbf{x} = \mathbf{A} \mathbf{s}
\end{equation*}

where the problem is to find the matrix $\mathbf{A}$.

In a Linear Noisy ICA we follow the same model as in the noiseless case, but with an additional zero-mean, uncorrelated Gaussian noise term

\begin{equation*}
\mathbf{x} = \mathbf{A} \mathbf{s} + \boldsymbol{\varepsilon}
\end{equation*}

where $\boldsymbol{\varepsilon} \sim \mathcal{N} \big( \mathbf{0}, \text{diag}(\Sigma) \big)$.
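
A minimal sketch of the noiseless model in R, using the fastICA package (an assumption: it is installed from CRAN): two non-Gaussian sources are mixed by an unknown $\mathbf{A}$ and then recovered up to sign, scale and permutation.

library(fastICA)

set.seed(6)
n <- 1000

s1 <- runif(n, -1, 1)                  # uniform source (non-Gaussian)
s2 <- sin(seq(0, 20, length.out = n))  # sinusoidal source (non-Gaussian)
S  <- cbind(s1, s2)

A <- matrix(c(1, 1, 0.5, 2), 2, 2)     # mixing matrix (unknown in practice)
X <- S %*% t(A)                        # observed mixtures: x = A s, row-wise

ica <- fastICA(X, n.comp = 2)

# ica$S contains the estimated sources; their correlation with the true
# sources should be close to +/- 1 (up to permutation)
round(cor(ica$S, S), 2)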

Comparison

  • PCA maximizes 2nd moments, i.e. variance
    • Finding basis vectors which "best" explain the variance of the data
  • FA attempts to explain the data through a small number of latent factors
    • Generative model
    • Allows a different (unique) noise variance for each observed variable
  • ICA attempts to maximize 4th moments, i.e. kurtosis
    • Finding basis vectors such that resulting vector is one of the independent components of the original data
      • Can do this by maximizing kurtosis or minimizing mutual information
    • Motivated by the idea that when you add things up, you get something normal, due to CLT
    • Hopes that data is non-normal, such that non-normal components can be extracted from them
    • In attempt to exploit non-normality, ICA tries to maximize the 4th moment of a linear combination of the inputs
    • Compare this to PCA, which does the same but for 2nd moments