Factor-based models

Principal Component Analysis (PCA)

Notation

  • $u_i$ denotes the $i$-th eigenvector (of the covariance matrix)
  • $S$ denotes the covariance matrix
  • $U$ is the eigenmatrix, i.e. the matrix whose columns are the eigenvectors $u_i$

Stuff

Principal Component Analysis (PCA) corresponds to finding the eigenvectors and eigenvalues of the covariance matrix of the (centered, i.e. $x_{nj} \mapsto x_{nj} - \bar{x}_j$ for each column $j$) data. We can do this since the covariance matrix is the positive semi-definite matrix given by $S = \frac{1}{N} X^\top X$.

where:

  • $X$ is the $N \times D$ data matrix

Think about it in this way:

  • The eigenvectors form an orthogonal basis which diagonalizes the covariance matrix of the variables / features, i.e. we have a way of expressing the variances independently of each other.
  • For each eigenvector / basis vector $u_i$, the greater the corresponding eigenvalue $\lambda_i$, the greater the explained variance.
  • By projecting the data $X$ onto this basis, simply by taking the matrix product $XU$, where $U$ is the matrix with columns corresponding to the eigenvectors of the $D \times D$ covariance matrix $S = \frac{1}{N} X^\top X$, we have a new "view" of the features where

    $$\tilde{X} = X U$$

    i.e. we are now looking at each data point transformed to a space where the basis vectors are orthogonal and the transformed features are uncorrelated, each basis vector explaining some of the variance of the features in the data.

    Now we make another observation:

    1. Take the Singular Value Decomposition (SVD) of $X$:

      $$X = V \Sigma U^\top$$

    2. Then,

      $$X^\top X = \left( V \Sigma U^\top \right)^\top \left( V \Sigma U^\top \right) = U \Sigma^\top V^\top V \Sigma U^\top = U \left( \Sigma^\top \Sigma \right) U^\top$$

      Thus we actually only need to compute the SVD of the data matrix $X$ to obtain the eigendecomposition of the entire covariance matrix $\frac{1}{N} X^\top X$: the right singular vectors (the columns of $U$) are the eigenvectors, and the eigenvalues are $\lambda_i = \sigma_i^2 / N$ with $\sigma_i$ the singular values. (See the R sketch after this list.)

      What does the eigenvalue $\lambda_i$ actually tell us?

      • $S u_i = \lambda_i u_i \implies u_i^\top S u_i = \lambda_i$, where $u_i^\top u_i = 1$, i.e. $\lambda_i$ is the variance of the data projected onto the direction $u_i$
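
A small R sketch of this equivalence (the data and variable names here are my own toy example, not from the text): the eigendecomposition of the covariance matrix and the SVD of the centered data matrix give the same directions and matching eigenvalues.

set.seed(1)

N <- 200; D <- 3
X <- matrix(rnorm(N * D), N, D) %*% matrix(c(2, 0, 0,
                                             1, 1, 0,
                                             0, 0.5, 0.2), D, D)  # correlated toy data
Xc <- scale(X, center = TRUE, scale = FALSE)   # center each column

# eigendecomposition of S = (1/N) Xc^T Xc
S <- crossprod(Xc) / N
eig <- eigen(S)

# SVD of the centered data matrix
sv <- svd(Xc)

eig$values                     # eigenvalues lambda_i
sv$d^2 / N                     # sigma_i^2 / N -- the same values
abs(eig$vectors) - abs(sv$v)   # columns agree up to sign, so this is ~0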

PCA maximum variance formulation using Lagrange multipliers

Want to consider a projection of the data onto a subspace which maximizes the variance, i.e. explains the most variance.

  • Let $u_1$ be the direction of this subspace, and we want $u_1$ to be a unit vector, i.e. $u_1^\top u_1 = 1$
  • Each data point $x_n$ is then projected onto the scalar $u_1^\top x_n$
  • The mean of the projected data is $u_1^\top \bar{x}$
  • Variance of projected data is:

    $$\frac{1}{N} \sum_{n=1}^{N} \left( u_1^\top x_n - u_1^\top \bar{x} \right)^2 = u_1^\top S u_1, \qquad S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x}) (x_n - \bar{x})^\top$$

We then want to maximize the variance of the projected data, $u_1^\top S u_1$, wrt. $u_1$ under the constraint $u_1^\top u_1 = 1$.

We then introduce the Lagrange multiplier $\lambda_1$ and perform an unconstrained maximization of

$$u_1^\top S u_1 + \lambda_1 \left( 1 - u_1^\top u_1 \right)$$

By taking the derivative wrt. factor_based_models_e17835ebefa43e53d5a5d72ff9a64f0befa9a29a.png and setting to zero, we observe that the solution satisfies:

$$S u_1 = \lambda_1 u_1$$

i.e. $u_1$ must be an eigenvector of $S$ with eigenvalue $\lambda_1$.

We also observe that,

$$u_1^\top S u_1 = \lambda_1$$

due to $u_1$ being a unit eigenvector of $S$. Hence the projected variance equals $\lambda_1$, so it is maximized by choosing $u_1$ to be the eigenvector with the largest eigenvalue. We call $u_1$ the first principal component.

To get the rest of the principal components, we simply maximize the projected variance amongst all possible directions orthogonal to those already considered, i.e. constrain the next $u_i$ such that $u_i^\top u_j = 0$ for all $j < i$.
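
A quick numerical check of this result in R (a self-contained toy example with my own variable names): the projected variance $u^\top S u$ is maximized by the first eigenvector, which attains the value $\lambda_1$, and no random unit direction does better.

set.seed(2)

# a small covariance matrix for illustration
S <- cov(matrix(rnorm(200 * 3), 200, 3) %*% diag(c(3, 2, 1)))

eig <- eigen(S)
u1 <- eig$vectors[, 1]
lambda1 <- eig$values[1]

proj_var <- function(u) drop(t(u) %*% S %*% u)    # u^T S u

proj_var(u1) - lambda1                            # ~0: u1 attains lambda_1

# random unit directions never exceed lambda_1
rand_dirs <- replicate(1000, { u <- rnorm(3); u / sqrt(sum(u^2)) })
max(apply(rand_dirs, 2, proj_var)) <= lambda1     # TRUE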

TODO Figure out what the QR representation has to do with all of this

Whitening

Notation

  • $L = \operatorname{diag}(\lambda_1, \dots, \lambda_D)$ is the diagonal matrix of eigenvalues
  • $U$ is the matrix with the normalized eigenvectors as columns, i.e. an orthonormal matrix
  • $y_n$ is the whitened data

Stuff

  • Can normalize the data to give zero-mean and unit-covariance (i.e. variables are decorrelated)

Consider the key eigenvalue problem in PCA in matrix form:

$$S U = U L$$

For each data point $x_n$, we define a transformed value as:

$$y_n = L^{-1/2} U^\top \left( x_n - \bar{x} \right)$$

Then the set $\{ y_n \}$ has zero mean and its covariance is the identity:

$$\frac{1}{N} \sum_{n=1}^{N} y_n y_n^\top = L^{-1/2} U^\top \left( \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x}) (x_n - \bar{x})^\top \right) U L^{-1/2} = L^{-1/2} U^\top S U L^{-1/2} = L^{-1/2} L L^{-1/2} = I$$
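
A minimal R sketch of this transformation (toy data and variable names of my own choosing): whiten with $L^{-1/2} U^\top (x_n - \bar{x})$ and check that the result has zero mean and identity covariance. The only subtlety is using the same normalization convention for the covariance in both places (here cov()'s $1/(N-1)$).

set.seed(3)
N <- 500
X <- matrix(rnorm(N * 2), N, 2) %*% matrix(c(2, 1, 0, 1), 2, 2)  # correlated toy data

xbar <- colMeans(X)
S <- cov(X)                        # sample covariance
eig <- eigen(S)
U <- eig$vectors                   # orthonormal eigenvectors
Linv_sqrt <- diag(1 / sqrt(eig$values))

# y_n = L^{-1/2} U^T (x_n - xbar), applied to all rows at once
Y <- t(Linv_sqrt %*% t(U) %*% t(sweep(X, 2, xbar)))

round(colMeans(Y), 10)             # ~ (0, 0)
round(cov(Y), 10)                  # ~ identity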

Bayesian PCA

Maximum likelihood PCA

Have a look at this paper! Sry mate..

Difference between PCA and OLS

set.seed(2)

x <- 1:100
epsilon <- rnorm(100, 0, 60)
y <- 20 + 3 * x + epsilon

plot(x, y)

yx.lm <- lm(y ~ x)  # linear model y ~ x
lines(x, predict(yx.lm), col="red")

xy.lm <- lm(x ~ y)  # reverse regression: x on y
lines(predict(xy.lm), y, col="blue")

# normalize means and cbind together
xyNorm <- cbind(x = x - mean(x), y = y - mean(y))
plot(xyNorm)

# covariance of the centered data
xyCov <- cov(xyNorm)
xyEigen <- eigen(xyCov)
eigenValues <- xyEigen$values    # sorted in decreasing order
eigenVectors <- xyEigen$vectors  # columns are the eigenvectors

plot(xyNorm, ylim=c(-200, 200), xlim=c(-200, 200))
# draw both eigenvector directions through the origin
# (slope of the i-th direction is eigenVectors[2, i] / eigenVectors[1, i])
lines(xyNorm[, "x"], eigenVectors[2, 1] / eigenVectors[1, 1] * xyNorm[, "x"])
lines(xyNorm[, "x"], eigenVectors[2, 2] / eigenVectors[1, 2] * xyNorm[, "x"])

# the largest eigenValue is the first one
# so that's our principal component.
# But the principal component is in normalized terms (mean = 0)
# and we want it back in real terms like our starting data
# so let's denormalize it

plot(x, y)
lines(x, (eigenVectors[2, 1] / eigenVectors[1, 1] * xyNorm[, "x"]) + mean(y))

# what if we bring back our other two regressions?
lines(x, predict(yx.lm), col="red")
lines(predict(xy.lm), y, col="blue")
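
As a cross-check (not part of the original snippet), base R's prcomp should give the same principal direction and variances, up to sign:

# prcomp centers the data by default and uses the same 1/(n-1) convention as cov()
pca <- prcomp(cbind(x, y))
pca$rotation[, 1]   # should match eigenVectors[, 1] up to a sign flip
pca$sdev^2          # should match eigenValues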

Factor Analysis (FA)

Overview

Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a lower number of unobserved variables called factors.

It searches for joint variations in response to unobserved latent variables.

Notation

  • set of $p$ observable random variables $x_1, \dots, x_p$ with means $\mu_1, \dots, \mu_p$
  • $F_1, \dots, F_k$ are common factors because they influence all observed random variables, or in vector notation $F$
  • $\mathbf{F}$ denotes the $k \times n$ matrix which represents our samples of the rvs. $F$
  • $k$ denotes the number of common factors, $k < p$
  • $L$ is called the loading matrix
  • $\varepsilon_i$ is an unobserved stochastic error term with zero mean and finite variance

Definition

Suppose that for the set of $p$ observable random variables $x_1, \dots, x_p$ with means $\mu_1, \dots, \mu_p$ we have the following model:

$$x_i - \mu_i = l_{i1} F_1 + \dots + l_{ik} F_k + \varepsilon_i, \qquad i = 1, \dots, p$$

or in matrix terms,

$$x - \mu = L F + \varepsilon$$

where, if we have $n$ observations, the fitted quantities have dimensions $\mathbf{x}_{p \times n}$, $L_{p \times k}$ and $\mathbf{F}_{k \times n}$. (Notice that here $\mathbf{F}$ is a matrix, NOT a vector, but in our model $F$ is a vector. The matrix is for fitting, while $F$ is the random variable we're modelling. Confusing notation, yeah I know.)

We impose the following conditions on $F$:

  1. $F$ and $\varepsilon$ are independent
  2. $\mathbb{E}[F] = 0$
  3. $\mathrm{Cov}(F) = I$ (to make sure that the factors are uncorrelated, as we hypothesize)

Suppose $\mathrm{Cov}(x - \mu) = \Sigma$. Then note that from the conditions just imposed on $F$, we have

$$\Sigma = \mathrm{Cov}(x - \mu) = \mathrm{Cov}(L F + \varepsilon) = L \, \mathrm{Cov}(F) \, L^\top + \mathrm{Cov}(\varepsilon) = L L^\top + \mathrm{Cov}(\varepsilon)$$
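
A minimal R sketch of fitting such a model (the dimensions, loadings and names here are arbitrary illustrations, not from the text): simulate data from a two-factor model and recover the loading structure with the built-in factanal.

set.seed(1)
n <- 500; p <- 6; k <- 2

# true loading matrix L (p x k) with a simple block structure
L_true <- matrix(c(0.9, 0.8, 0.7, 0.0, 0.0, 0.0,
                   0.0, 0.0, 0.0, 0.9, 0.8, 0.7), nrow = p, ncol = k)
F_scores <- matrix(rnorm(k * n), nrow = k, ncol = n)        # factor samples, k x n
eps <- matrix(rnorm(p * n, sd = 0.5), nrow = p, ncol = n)   # noise

# x - mu = L F + eps (mu = 0 here); factanal wants observations as rows
X <- t(L_true %*% F_scores + eps)

fa <- factanal(X, factors = 2, rotation = "varimax")
print(fa$loadings)   # should roughly recover the block structure of L_true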

In words

What we're saying here is that we believe the data can be described by some linear (lower-dimensional) subspace spanned by the common factors $F$, and we attempt to find the best components for producing this fit. The matrix $L$ describes the coefficients for the $p$ observed rvs. when projecting these onto the common factors.

This is basically a form of linear regression, but instead of taking the observed rvs. $x_i$ and directly finding a subspace to project the data onto in order to predict some target variable, we instead make the observed "features" $x$ the target and hypothesize that there exist some $k$ common factors which produce the variance seen between the observed variables.

In a way, it's very similar to PCA, but see PCA vs. Factor analysis for more on that.

Exploratory factor analysis (EFA)

Same as FA, but does not make any prior assumptions about the relationships among the factors themselves, whereas FA assumes them to be independent.

PCA vs. Factor analysis

PCA does not account for inherent random error

In PCA, ones are put on the diagonal of the correlation matrix, meaning that all of the variance in the matrix is to be accounted for (including variance unique to each variable, variance common among variables, and error variance).

In EFA (Exploratory Factor Analysis), the communalities are put on the diagonal, meaning that only the variance shared with other variables is accounted for (excluding variance unique to each variable and error variance). That would, therefore, by definition, include only variance that is common among the variables.

Summary

  • PCA is simply a variable reduction technique; FA makes the assumption that an underlying causal model exists
  • PCA results in principal components that account for the maximal amount of variance of the observed variables; FA accounts for the variance shared between the observed variables in the data.
  • PCA inserts ones on the diagonal of the correlation matrix; FA adjusts the diagonal of the correlation matrix with the unique factors.
  • PCA minimizes the sum of squared perpendicular distance to the component axis; FA estimates factors which influence responses on observed variables.
  • The component scores in PCA represent a linear combination of observed variables weighted by the eigenvectors; the observed variables in FA are linear combinations of the underlying unique factors.

Independent Component Analysis (ICA)

Overview

  • Method for separating a multivariate signal into additive subcomponents
  • Assumes subcomponents are non-Gaussian signals

Assumptions

  1. Different factors are independent of each other (in a probabilistic sense)
  2. Values in each factor have non-Gaussian distributions

Defining independence

We want to maximize the statistical independence between the factors. We may choose one of many ways to define a proxy for independence, with the two broadest ones (the second of which is illustrated in the R snippet after this list) being:

  1. Minimization of mutual information
  2. Maximization of non-Gaussianity
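
A small R illustration of the non-Gaussianity idea (the excess-kurtosis helper below is ad hoc, purely for illustration): mixing independent non-Gaussian sources pushes the result towards a Gaussian (by the CLT), so un-mixing can proceed by making the outputs as non-Gaussian as possible.

set.seed(42)

# ad-hoc excess kurtosis (0 for a Gaussian)
excess_kurtosis <- function(z) mean((z - mean(z))^4) / mean((z - mean(z))^2)^2 - 3

s1 <- runif(1e5, -1, 1)        # independent uniform sources,
s2 <- runif(1e5, -1, 1)        # excess kurtosis about -1.2 each
mix <- (s1 + s2) / sqrt(2)     # an equal mix is already closer to Gaussian

excess_kurtosis(s1)    # ~ -1.2
excess_kurtosis(s2)    # ~ -1.2
excess_kurtosis(mix)   # ~ -0.6, i.e. closer to 0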

Definition

In linear noiseless ICA we assume the components $x_i$ of an observed random vector $x = (x_1, \dots, x_m)^\top$ are generated as a sum of $n$ (statistically) independent components $s_1, \dots, s_n$, i.e.

$$x_i = a_{i,1} s_1 + \dots + a_{i,n} s_n$$

for some mixing weights $a_{i,j} \in \mathbb{R}$.

In matrix notation,

$$x = A s$$

where the problem is to find the mixing matrix $A$ (equivalently, its inverse, the unmixing matrix $W = A^{-1}$).

In linear noisy ICA we follow the same model as for the noiseless ICA, but with the added assumption of zero-mean and uncorrelated Gaussian noise:

$$x = A s + n$$

where $n \sim \mathcal{N}\left( 0, \operatorname{diag}(\Sigma) \right)$.
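
A minimal sketch of the noiseless model in R, assuming the third-party fastICA package is installed (the sources and mixing matrix below are arbitrary choices for illustration):

library(fastICA)   # install.packages("fastICA") if missing

set.seed(0)
n <- 2000
s <- cbind(runif(n, -1, 1),
           sin(seq(0, 40 * pi, length.out = n)))   # two non-Gaussian sources
A <- matrix(c(1.0, 0.5,
              0.3, 1.2), nrow = 2, byrow = TRUE)   # arbitrary mixing matrix
x <- s %*% t(A)                                    # observed signals: each row is x = A s

ica <- fastICA(x, n.comp = 2)
# ica$S holds the estimated components (up to permutation, sign and scale),
# ica$A the estimated mixing matrix
cor(ica$S, s)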

Comparison

  • PCA maximizes 2nd moments, i.e. variance
    • Finding basis vectors which "best" explain the variance of the data
  • FA attempts to explain the covariance among the observed variables with a small number of latent factors
    • Generative
    • Allows different variances across the different basis vectors
  • ICA attempts to maximize 4th moments, i.e. kurtosis
    • Finding basis vectors such that resulting vector is one of the independent components of the original data
      • Can do this by maximizing kurtosis or minimizing mutual information
    • Motivated by the idea that when you add things up, you get something normal, due to CLT
    • Hopes that data is non-normal, such that non-normal components can be extracted from them
    • In attempt to exploit non-normality, ICA tries to maximize the 4th moment of a linear combination of the inputs
    • Compares to PCA which attempts to do the same, but for 2nd moments