Linear Regression


Multivariable linear regression

Assumptions

  1. Linearity
  2. Homoscedasticity
    • A sequence of rvs. is homoscedastic if all rvs. in the sequence have the same finite variance
    • The complementary notion is called heteroscedasticity
  3. Multivariate normal
  4. Independence of errors
  5. Lack of multicollinearity
    • For our estimator to be a Best Linear Unbiased Estimator (BLUE) we require that no explanatory variable is an exact linear combination of the others, i.e. no perfect multicollinearity

Dummy variable trap

Let's say we have a categorical variable in our data set which can take on two values, e.g. raining or not raining. We can then represent this either as one binary variable, $x$, or as two binary variables, $x_1$ and $x_2$.

If we do one variable, we have the model:

\begin{equation*}
y = \beta_0 + \beta_1 x, \quad x \in \{0, 1\}
\end{equation*}

If we have two variables, we have the model:

\begin{equation*}
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2, \quad x_1 = 1 - x_2 \quad x_1, x_2 \in \{0, 1\}
\end{equation*}

Now, do you see something wrong with the second option? To obtain our estimated parameters $\boldsymbol{\hat{\beta}}$ (when using Ordinary Least Squares (OLS)) we compute the following:

\begin{equation*}
\boldsymbol{\hat{\beta}} = \Big(X^T X \Big)^{-1} X^T \mathbf{y}
\end{equation*}

where $X$ is the $n \times p$ data matrix and $\mathbf{y}$ is the $n \times 1$ vector of labels.

The issue lies in the $\Big( X^T X \Big)^{-1}$. To compute the inverse of a matrix, we require it to be invertible / non-singular. This is equivalent to saying it must have full rank, i.e. all columns must be linearly independent (the same holds for rows). In the case where $x_1 = 1 - x_2$ this fails: together with the intercept's column of ones we have $x_1 + x_2 = 1$, so the columns of $X$ are linearly dependent ($x_1$ and $x_2$ are perfectly correlated, with correlation $-1$).

Therefore, you always go with the first solution. The same holds for categorical variables which can take on more than 2 values: you only include a parameter $\beta_i$ for $n - 1$ of the $n$ values the categorical variable can take on.

Intuitively, you can also view it as the left-out value of the categorical variable being absorbed into the intercept $\beta_0$.
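
To make this concrete, here is a minimal numpy sketch (the raining / not-raining sample below is made up purely for illustration): with an intercept column plus both dummies, $X^T X$ is rank-deficient, while dropping one dummy restores full rank.

import numpy as np

# hypothetical sample: 1 = raining, 0 = not raining
raining = np.array([1, 0, 0, 1, 1, 0])
x1 = raining
x2 = 1 - raining

# intercept + both dummies: the columns are linearly dependent, since 1 = x1 + x2
X_bad = np.column_stack((np.ones_like(x1), x1, x2))
print(np.linalg.matrix_rank(X_bad.T @ X_bad))  # 2 < 3, so (X^T X)^{-1} does not exist

# intercept + one dummy: full rank, OLS is fine
X_ok = np.column_stack((np.ones_like(x1), x1))
print(np.linalg.matrix_rank(X_ok.T @ X_ok))  # 2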

Variable selection

This section is regarding choosing variables to include in your model, as you usually don't want to include all of them.

The first three methods (all different versions of "step-wise" regression) are apparently not a good idea:

Backward elimination

  1. Decide on some significance level, e.g. $SL = 0.05$
  2. Fit full model using all variables
  3. Consider the variable with the largest P-value. If $P > SL$:
    • Go to step 4
    • Otherwise: go to FIN
  4. Remove this variable.
  5. Fit the model without the removed variable(s). Go back to step 3 (a sketch of this loop follows below)
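
The lectures don't give code for this, but here is a rough sketch of the loop using statsmodels (the helper name backward_elimination and the synthetic data are my own, just for illustration):

import numpy as np
import pandas as pd
import statsmodels.api as sm


def backward_elimination(X, y, sl=0.05):
    """X: DataFrame of predictors (no intercept column), y: array-like of labels."""
    X = sm.add_constant(X)                        # adds an intercept column named 'const'
    while True:
        results = sm.OLS(y, X).fit()
        pvalues = results.pvalues.drop('const')   # never remove the intercept
        if pvalues.empty or pvalues.max() <= sl:
            return results                        # FIN: everything left is significant
        X = X.drop(columns=[pvalues.idxmax()])    # step 4: remove the worst, then refit


# made-up data: x3 is pure noise and should get eliminated
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=['x1', 'x2', 'x3'])
y = 2 * X['x1'] - X['x2'] + rng.normal(scale=0.5, size=100)
print(backward_elimination(X, y).params)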

Forward selection

  1. Decide on some significance level, e.g. $SL = 0.05$
  2. Fit all simple regression models $y \sim x_n$, and select the one with the lowest P-value
  3. Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have
  4. Consider the predictor with the lowest P-value. If $P < SL$:
    • Go to step 3, i.e. keep introducing predictors until we reach an insignificant one
    • Otherwise: go to FIN

Since the last added predictor has too high a P-value, we actually use the model fitted just before adding this last predictor, not the model we have when we reach FIN.

Bi-directional elimination (step-wise regression)

  1. Select a significance level to enter and to stay in the model, e.g.: SLENTER = 0.05 and SLSTAY = 0.05
  2. Perform the next step of Forward Selection (new variables must have $P < SLENTER$ to enter)
  3. Perform ALL steps of Backward Elimination (old variables must have P < SLSTAY to stay)
  4. No new variables can enter and no old variables can exit => FIN

Recursive feature elimination

This one was not actually in the course lectures, but it is the only one provided as a function by scikit-learn, so I figured I'd include it: sklearn.feature_selection.RFE (see the sketch after the steps below).

  1. Decide on some number of features you want.
  2. Train an estimator on the initial set of features; weights are assigned to each of them
  3. Features whose absolute weights are the smallest are pruned from the current set of features
  4. The procedure is recursively repeated on the pruned set until the desired number of features is reached
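
A minimal sketch of using it (the synthetic regression data below is made up for illustration):

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# synthetic data: 10 features, of which only 3 are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# step 1: we want 3 features; steps 2-4 happen inside fit()
selector = RFE(estimator=LinearRegression(), n_features_to_select=3)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected, larger = pruned earlier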

Geometric interpretation

Before you embark on this journey of viewing Multivariate Linear Regression in the light of geometry, you need to get the following into your head:

  • Linear! Say it after me: lineeaar.

OKAY! So I'm smart now, we got dis.

SO! Let's consider the case of simple univariate linear regression (ONE SINGLE FREAKING VARIABLE), even excluding the bias / intercept, with the model:

\begin{equation*}
Y = \beta X + \epsilon
\end{equation*}
  • $X$ is the input random variable
  • $Y$ is the dependent random variable
  • $\epsilon \sim \mathcal{N}(0, 1)$

We then obtain our least squares estimate to be:

\begin{equation*}
\hat{\boldsymbol{\beta}} = \big( \mathbf{x}^T \mathbf{x} \big)^{-1} \mathbf{x}^T \mathbf{y}
\end{equation*}
  • $\mathbf{x}$ is a $N \times 1$ matrix (generally $N \times p$)
  • $\mathbf{y}$ is a $N \times 1$ matrix
  • $\boldsymbol{\beta}$ is a $1 \times 1$ matrix (generally $p \times 1$)

Now, since we're only considering the univariate linear regression, we can instead use vector notation and end up with the following:

\begin{equation*}
\begin{split}
\hat{\beta} &= \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\langle \mathbf{x}, \mathbf{x} \rangle} \\
&= \frac{\sum^N_{n=1} x_n y_n}{\sum^N_{n=1} x_n^2} \\
&= \frac{\mathbf{x} \cdot \mathbf{y}}{||\mathbf{x}||^2}
\end{split}
\end{equation*}

DO YOU SEE IT?!

$\hat{\beta}$ just happens to be the coefficient we get by projecting our samples for the dependent variable onto the space spanned by our samples for the feature / input variable (the projection itself being $\hat{\beta} \mathbf{x}$). Interesting, or what?!

All-in-all, our least squares fit $\hat{\beta} \mathbf{x}$ is then the projection of our samples, $\mathbf{y}$, for the dependent rv. $Y$, onto the column-space of our samples, $\mathbf{x}$, for the input rv. $X$.
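
A quick numerical sanity check of this (made-up data, no intercept, purely for illustration): the inner-product formula above gives the same coefficient as np.linalg.lstsq.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 * x + rng.normal(scale=0.5, size=50)

beta_proj = np.dot(x, y) / np.dot(x, x)                        # projection coefficient
beta_lstsq = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0][0]
print(beta_proj, beta_lstsq)                                   # identical up to float error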

When we are making a prediction on some new input $x_0$, we're not computing a projection!

This isn't hard to tell by looking at the equations, but I've found myself pondering how "making a prediction is computing a projection", when we're not actually doing that. #BrainFarts

Projections are for fitting and we're computing them for each feature / input variable with the corresponding observed label / dependent variable. These projections then give us the coefficient for the corresponding feature / input variable. Thus, when making a prediction we're computing a linear combination of the features / input variables where the coefficient for each of these are computed by what we described above.

TL;DR: Projections are for fitting, not predicting.

"But Tor, this is just univariate linear regression! We want to play with the big-boys doing multivariate linear regression!" You see, my sweet summerchild, by making sure that our input variables are linearly independent, we can simply provide each of the symbols above with a subscript $i$ denoting the $i^\text{th}$ feature / input variable, and we're good!

For a more complete explanation of the multivariate case, check out section 3.2 in The Elements of Statistical Learning.

I tried…

We'll be viewing the least squares estimate as a projection onto the column space of the data, $X$. Why does this make any sense?

  • Consider $y$ as a function of $x$ instead of the other way around, i.e. given some $y$ we want to define its best approximation in the space spanned by $x$
  • Basically, the question we have here is "why does making the predicted value for $y$ the projection onto the space spanned by the indep. variable $x$ make the residuals smallest?"
    • By def. the projection of some vector $y$ onto a subspace is defined such that the vector from the projection to $y$ is perpendicular to that subspace, i.e. $y - \text{proj}_x(y)$ is perpendicular to the subspace spanned by $x$, and hence of minimal length (since the perpendicular distance is the smallest distance to the plane)
    • Why is this relevant? Well, it tells us that if we have a set of orthogonal vectors we can view our fit as a projection onto the column-space of all the $x_i$, i.e. the projection defines how much we move along each of the $x_i$ to get our least squares estimate for some $y$.
    • EACH COMPONENT OF $y$ IS PROJECTED ONTO A SUBSPACE SPANNED BY THE INPUTS FOR THIS SPECIFIC COMPONENT!!!
import numpy as np
import matplotlib.pyplot as plt


def ols_fit(X, y):
    """OLS fit, where X is the p x n data matrix (features as rows) and y is 1 x n."""
    # the code below is to standardize the inputs
    # X = X - X.mean()  # p x n
    # X = X / X.var()
    # need to account for intercept / bias: prepend a row of ones
    ones = np.ones((1, X.shape[1]))
    X = np.vstack((ones, X))                # (p + 1) x n
    XX_inv = np.linalg.inv(np.dot(X, X.T))  # (p + 1) x (p + 1)
    beta = np.dot(np.dot(XX_inv, X), y.T)   # (p + 1) x 1

    def predict(x):
        x = np.vstack((np.ones((1, x.shape[1])), x))
        return np.dot(beta.T, x)            # 1 x n
    return predict

n = 10
x = np.arange(n).reshape(1, -1)
epsilon = np.random.normal(0, 1, n).reshape(1, -1)
y = np.arange(10, 10 + n).reshape(1, -1) + epsilon
fit = ols_fit(x, y)
y_fit = fit(x)
print(y_fit)

plt.scatter(x, y, label="$y$")
plt.scatter(x, y_fit, c="g", label=r"$\hat{y}$")
x_extra = (np.max(x) - np.min(x)) * 0.2
plt.xlim(np.min(x) - x_extra * 0.5, np.max(x) + x_extra)
y_extra = (np.max(y) - np.min(y)) * 0.2
plt.ylim(np.min(y) - y_extra, np.max(y) + y_extra)
plt.legend()
plt.title("Linear with white noise and fitted values")
plt.savefig(".linear_regression/1d_data.png")
return ".linear_regression/1d_data.png"
import numpy as np

n = 10
x = np.arange(n).reshape(1, -1)
epsilon = np.random.normal(0, 1, n).reshape(1, -1)
y = np.arange(10, 10 + n).reshape(1, -1) + epsilon

x = x - x.mean()
x /= x.var()
print(np.linalg.matrix_rank(x))  # rank 1: a single (centered and scaled) feature

Estimating std. err. of parameter-estimates

So, we now have our parameter-estimates:

\begin{equation*}
\boldsymbol{\hat{\beta}} = \Big( X^T X \Big)^{-1} X^T \mathbf{y}
\end{equation*}

Now, the linear regression model is

\begin{equation*}
Y = X \boldsymbol{\beta} + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)
\end{equation*}

Thus, with our estimate for the parameters,

\begin{equation*}
\begin{split}
  \boldsymbol{\hat{\beta}} &= \Big( X^T X \Big)^{-1} X^T Y = \Big( X^T X \Big)^{-1} X^T (X \boldsymbol{\beta} + \varepsilon) \\
  &= \Big( X^T X \Big)^{-1} \Big( X^T X \Big) \boldsymbol{\beta} + \Big( X^T X \Big)^{-1} X^T \varepsilon \\
  &= \boldsymbol{\beta} + \Big( X^T X \Big)^{-1} X^T \varepsilon
\end{split}
\end{equation*}

Here we're using $\boldsymbol{\beta}$ to denote the true parameters.

which is a random variable following the distribution

\begin{equation*}
\boldsymbol{\hat{\beta}} \sim \mathcal{N} \bigg( \boldsymbol{\beta},\ \Big( X^T X \Big)^{-1} X^T \big( \sigma^2 I \big) X \Big( X^T X \Big)^{-1} \bigg) = \mathcal{N} \bigg( \boldsymbol{\beta},\ \sigma^2 \Big( X^T X \Big)^{-1} \bigg)
\end{equation*}

Thus we can compute the variance of each of the parameters, and therefore the standard error! The standard error of $\hat{\beta}_j$ is $\sqrt{\hat{\sigma}^2 \big[ (X^T X)^{-1} \big]_{jj}}$, where $\hat{\sigma}^2$ is an estimate of the noise variance, e.g. $\text{RSS} / (n - p)$.
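
As a sketch, these standard errors can be computed directly from the covariance matrix above; the synthetic data and variable names below are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2
X = np.column_stack((np.ones(n), rng.normal(size=(n, p))))   # n x (p + 1), with intercept
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - X.shape[1])         # RSS / (n - p)
std_err = np.sqrt(sigma2_hat * np.diag(XtX_inv))
print(beta_hat, std_err)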

Least Squares Estimate

Bias-variance decomposition

Consider the mean squared error of an estimator $\hat{\theta}$ in estimating $\theta$

\begin{equation*}
\text{MSE}(\hat{\theta}) = E \Big[ (\hat{\theta} - \theta)^2 \Big]
\end{equation*}

The bias-variance decomposition would then be:

\begin{equation*}
\begin{split}
\text{MSE}(\hat{\theta}) &= E \Big[ \big( \hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta \big)^2 \Big] \\
&= E \Big[ \big(\hat{\theta} - E(\hat{\theta}) \big)^2 + 2 \big(\hat{\theta} - E(\hat{\theta}) \big) \big(E(\hat{\theta}) - \theta \big) + \big(E(\hat{\theta}) - \theta \big)^2 \Big] \\
&= E \Big[ \big(\hat{\theta} - E(\hat{\theta}) \big)^2 \Big] + E \Big[ \big(E(\hat{\theta}) - \theta \big)^2 \Big] \\
&= \text{Var}(\hat{\theta}) + \big( E[\hat{\theta}] - \theta \big)^2 \\
&= \text{Var}(\hat{\theta}) + E[\hat{\theta} - \theta]^2 \\
&= \text{Var}(\hat{\theta}) + \text{Bias}^2 (\hat{\theta})
\end{split}
\end{equation*}

Here $E \Big[ \big(E(\hat{\theta}) - \theta \big)^2 \Big] = \big( E(\hat{\theta}) - \theta \big)^2$ since both $E(\hat{\theta})$ and $\theta$ are constants, and the cross term vanishes because $E \big[ \hat{\theta} - E(\hat{\theta}) \big] = 0$.
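
A quick Monte Carlo sanity check of the decomposition, using a deliberately biased (shrunk) estimator of a normal mean as a made-up example:

import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, trials = 2.0, 1.0, 20, 200_000

# a deliberately biased estimator: shrink the sample mean towards zero
samples = rng.normal(theta, sigma, size=(trials, n))
theta_hat = 0.8 * samples.mean(axis=1)

mse = np.mean((theta_hat - theta) ** 2)
var = theta_hat.var()
bias2 = (theta_hat.mean() - theta) ** 2
print(mse, var + bias2)   # the two agree up to Monte Carlo error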

Regularization methods

Ridge regression

Ridge regression shrinks the regression coefficients by imposing a penalty on their size.

Ridge regression minimizes a penalized residual sum of squares,

\begin{equation*}
\hat{\beta}^{\text{ridge}} = \underset{\beta}{\text{argmin}}\ \left\{ \sum_{i=1}^{N} \bigg( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \bigg)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
\end{equation*}

Here $\lambda \ge 0$ is a complexity parameter that controls the amount of shrinkage; the larger the value of $\lambda$, the greater the amount of shrinkage.

An equivalent way to write the ridge problem is

\begin{equation*}
\begin{split}
  \hat{\beta}^{\text{ridge}} &= \underset{\beta}{\text{argmin}}\ \sum_{i=1}^{N} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Bigg)^2 \\
  & \text{subject to } \sum_{j=1}^{p} \beta_j^2 \le t
\end{split}
\end{equation*}

for some $t > 0$.

There is a one-to-one correspondence between these two formulations: the Lagrangian of the constrained problem differs from the penalized criterion only by the term $-\lambda t$, which is constant in $\beta$ and therefore does not affect the minimizer.

Using centered inputs, we drop $\beta_0$ and can write the problem in matrix form:

\begin{equation*}
\text{RSS}(\lambda) = \big( \mathbf{y} - \mathbf{X} \beta \big)^T \big( \mathbf{y} - \mathbf{X} \beta \big) + \lambda \beta^T \beta
\end{equation*}

which has the solution

\begin{equation*}
\hat{\beta}^{\text{ridge}} = \big( \mathbf{X}^T \mathbf{X} + \lambda \mathbf{I} \big)^{- 1} \mathbf{X}^T \mathbf{y}
\end{equation*}

Ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving.

The intercept $\beta_0$ has been left out of the penalty term! Penalization of the intercept would make the procedure depend on the origin chosen for $Y$; that is, adding a constant $c$ to each of the targets would not simply result in a shift of the predictions by the same amount $c$.

From ridge solution in matrix form, we see that the solution adds a positive constant to the diagonal of $\mathbf{X}^T \mathbf{X}$ before inversion. This makes the problem non-singular, even if $\mathbf{X}^T \mathbf{X}$ is not!

This was the main motivation for ridge regression when it was first introduced [trevor09].
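
A small numpy sketch of that point (the perfectly collinear data is made up for illustration): with a duplicated column, $\mathbf{X}^T \mathbf{X}$ is singular, but $\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I}$ is invertible and the ridge solution is well-defined.

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack((x1, x1))                  # perfectly collinear columns
y = 2 * x1 + rng.normal(scale=0.1, size=50)

lam = 0.1
print(np.linalg.matrix_rank(X.T @ X))          # 1 < 2: the OLS inverse does not exist
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(beta_ridge)                              # well-defined; the weight is split between the twin columns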

Using Singular Value Decomposition of the centered input matrix $\mathbf{X}$:

\begin{equation*}
\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^T
\end{equation*}

For "standard" least squares we have

\begin{equation*}
\begin{split}
  \mathbf{X} \hat{\beta} &= \mathbf{X} \big( \mathbf{X}^T \mathbf{X} \big)^{-1} \mathbf{X}^T \mathbf{y} \\
  &= \mathbf{U} \mathbf{U}^T \mathbf{y}
\end{split}
\end{equation*}

Observe that $\mathbf{U} \mathbf{U}^T \mathbf{y} = \text{proj}_{\mathcal{U}}(\mathbf{y})$, where $\mathcal{U} = \text{span} \big( \left\{ \mathbf{u}_j \right\} \big)$, and that $\mathbf{U}^T \mathbf{y}$ are the coordinates of $\mathbf{y}$ with respect to the orthonormal basis $\left\{ \mathbf{u}_j \right\}$.

In comparison, the ridge solution gives us

\begin{equation*}
\begin{split}
  \mathbf{X} \hat{\beta}^{\text{ridge}} &= \mathbf{X} \big( \mathbf{X}^T \mathbf{X} + \lambda \mathbf{I} \big)^{-1} \mathbf{X}^T \mathbf{y} \\
  &= \mathbf{U} \mathbf{D} \big( \mathbf{D}^2 + \lambda \mathbf{I} \big)^{-1} \mathbf{D} \mathbf{U}^T \mathbf{y} \\
  &= \sum_{j=1}^{p} \mathbf{u}_j \frac{d_j^2}{d_j^2 + \lambda} \mathbf{u}_j^T \mathbf{y}
\end{split}
\end{equation*}

Note that since $\lambda \ge 0$, we have

\begin{equation*}
\frac{d_j^2}{d_j^2 + \lambda} \le 1
\end{equation*}

Observe that, like linear regression, ridge regression computes the projection of $\mathbf{y}$ onto the subspace $\mathcal{U}$, but it also shrinks the coordinates by the factors $\frac{d_j^2}{d_j^2 + \lambda}$! This is why you will sometimes see such methods under the name "shrinkage methods".
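
A numerical check of that identity (synthetic centered data, purely for illustration): the ridge fitted values computed directly agree with the shrunken-projection form above.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)                                   # centered inputs
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
lam = 5.0

# direct ridge fitted values
fitted_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# SVD form: sum_j u_j * d_j^2 / (d_j^2 + lam) * (u_j^T y)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
fitted_svd = U @ (d**2 / (d**2 + lam) * (U.T @ y))
print(np.allclose(fitted_direct, fitted_svd))            # True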

Variance

Since $\mathbf{X}^T \mathbf{X} / N$ is the sample covariance matrix (for centered $\mathbf{X}$), we know that its eigenvectors $\{ \mathbf{v}_j \}$ are the principal component directions of the variables in $\mathbf{X}$. Therefore, letting $\mathbf{z}_i = \mathbf{X} \mathbf{v}_i$ (the principal components), we have

\begin{equation*}
\text{Var}(\mathbf{z}_i) = \text{Var}(\mathbf{X} \mathbf{v}_i) = \frac{d_i^2}{N}
\end{equation*}

Hence, the small singular values $d_j$ correspond to the directions in the column space of $\mathbf{X}$ having small variance, and ridge regression shrinks these directions the most.

Lasso regression

Lasso regression is defined by the estimate

\begin{equation*}
\begin{split}
  \hat{\beta}^{\text{lasso}} &= \underset{\beta}{\text{argmin}}\ \sum_{i=1}^{N} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Bigg)^2 \\
  & \text{subject to } \sum_{j=1}^{p} |\beta_j| \le t
\end{split}
\end{equation*}

In the signal processing literature, the lasso is also known as basis pursuit.

We can write the lasso in the equivalent Lagrangian form:

\begin{equation*}
\hat{\beta}^{\text{lasso}} = \underset{\beta}{\text{argmin}}\ \left\{ \frac{1}{2} \sum_{i=1}^{N} \bigg( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \bigg)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}
\end{equation*}

The absolute value here makes the solutions non-linear in the $y_i$, and so there is no closed-form expression as in the case of ridge regression; the problem is instead solved numerically, e.g. with coordinate descent or LARS.
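
Solvers are readily available though; here is a minimal sketch using scikit-learn's Lasso, which uses coordinate descent under the hood (the synthetic data is for illustration only).

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
beta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])
y = X @ beta_true + rng.normal(size=200)

lasso = Lasso(alpha=0.5)      # alpha plays the role of lambda
lasso.fit(X, y)
print(lasso.coef_)            # most coefficients are driven exactly to zero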

Kernel methods

Different regression-models

Poisson regression

Instead of assuming that $Y \mid X_1, \dots, X_p$ follows

\begin{equation*}
Y \mid X_1 = x_1, \dots, X_p = x_p \sim \mathcal{N}(\mathbf{w}^T \mathbf{x}, \sigma^2)
\end{equation*}

we assume

\begin{equation*}
Y \mid X_1 = x_1, \dots, X_p = x_p \sim \text{Poisson}(\lambda), \quad \lambda = g^{-1}(\mathbf{w}^T \mathbf{x})
\end{equation*}

where $g$ is the link function. The canonical choice is the log link, giving $\lambda = \exp(\mathbf{w}^T \mathbf{x})$, which guarantees $\lambda > 0$; the identity link $\lambda = \mathbf{w}^T \mathbf{x}$ is also possible, but does not guarantee a positive rate.
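
A minimal sketch of fitting such a model with statsmodels, which uses the log link by default (the synthetic count data is made up for illustration):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 2)))          # intercept + two features
w_true = np.array([0.3, 0.8, -0.5])
y = rng.poisson(np.exp(X @ w_true))                     # counts with log-link mean

model = sm.GLM(y, X, family=sm.families.Poisson())      # Poisson GLM, log link by default
print(model.fit().params)                               # should be close to w_true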

Bibliography


  • [trevor09] Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009).