Linear Regression

Table of Contents

Multivariable linear regression


  1. Linearity
  2. Homoscedasticity
    • A sequence of rvs. is homoscedastic if all rvs. in the sequence have the same finite variance
    • The complementary notion sis called heteroscedasticity
  3. Multivariate normal
  4. Independence of errors
  5. Lack of multicolinearity
    • For our estimator to be a Best Linear Unbiased Estimator (BLUE) we require that the variables are independent

Dummy variable trap

Let's say we have a categorical variable in our data set, which can take on two values, e.g. raining or not raining We can then either represent this as one binary variable, linear_regression_3c314f80373742988ad542a6d4ce66a111b17847.png or two binary variables, linear_regression_e6d4f3d47d675d6505dd6f594394aa6bb212dfeb.png and linear_regression_2c5024c544eadf13553ff6f41d60836455b37ed9.png.

If we do one variable, we have the model:


If we have two variables, we have the model:


Now, do you see something wrong with the second option? To obtain our estimated parameters linear_regression_97c33ef3b55e11a0480aae460b585daa8a8ffdc0.png (when using Ordinary Least Squares (OLS)) we compute the following:


where linear_regression_0207be880056b9a69e22e729dd37bced29cd174a.png is the linear_regression_fd319580cd0fdb0030aa7a530ff95445265b7b92.png data matrix and linear_regression_29c26837fff0673a98d6523a80b727d38797f426.png is the linear_regression_f07b932b12c91bca424fd428d383495413b5b6a9.png label-vector.

The issue lies in the linear_regression_19da9e5bbedfd7981252f38c7cf5aa3394633dba.png. To be able to compute a the inverse of a matrix we require it to be invertible / non-singular. This is equivalent of saying it needs to have full rank, i.e. all columns must be linearly independent (The same holds for rows). In the case where linear_regression_048fe315ee645c45e7af67cbebeadf39376fcd2e.png, this is not the case! The covariance between linear_regression_e6d4f3d47d675d6505dd6f594394aa6bb212dfeb.png and linear_regression_2c5024c544eadf13553ff6f41d60836455b37ed9.png will be 1!

Therefore, you always go with the first solution. This also is the case for categorical variables which can take on more than 2 values: you only include a parameter linear_regression_71311614e452e17501211d008bb0f24c29e86d08.png for linear_regression_c0bc17f451d8a243f5e125bf2d4efbf8877ed765.png of these linear_regression_f07b932b12c91bca424fd428d383495413b5b6a9.png values the categorical variable can take on.

Intuitively, you can also sort of view it as the last value for the categorical variable being included in the linear_regression_e022aef44492777b6daec69f88093d6f042204c0.png.

Variable selection

This section is regarding choosing variables to include in your model, as you usually don't want to include all of them.

The three first methods (all different versions of "step-wise" regression) are apparently not a good idea:

Backward elimination

  1. Decide on some significance level, e.g. linear_regression_93f90a9abf03415109bbb3ccd26624e1eff5ec80.png
  2. Fit full model using all variables
  3. Consider the variable with the largest P-value. If linear_regression_0d550eb2b78b4e711f4693c1371aa03c01f95cb5.png:
    • Go to step 4
    • Otherwise: go to FIN
  4. Remove this variable.
  5. Fit full model without the removed variables. Go back to step 3

Forward elimination

  1. Decide on some significance level, e.g. linear_regression_93f90a9abf03415109bbb3ccd26624e1eff5ec80.png
  2. Fit all simple regression models linear_regression_58153bc97b720d793d55e7adec2b825546a18e3f.png, and select the one with the lowest P-value
  3. Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have
  4. Consider the predictor with the lowest P-value. If linear_regression_72f93a2807b68655074470d8547c66b7a663e0b6.png:
    • Go to step 3, i.e. keep introductin predictors until we reach a insignificant one
    • Otherwise: go to FIN

Since the last added predictor is of a too high P-value, we actually use the model fitted just before adding this last predictor, not the model we have when we go to FIN.

Bi-directional elimination (step-wise regression)

  1. Select a significance level to enter and to stay in the model, e.g.: SLENTER = 0.05 and SLSTAY = 0.05
  2. Perform the next of Forward Selection (new variables must have: =P < SLENTER to enter)
  3. Perform ALL steps of Backward Elimination (old variables must have P < SLSTAY to stay)
  4. No new variables can enter and no old variables can exit => FIN

Recursive feature elimination

This one was not actually in the course-lectures, but is the only one provided as a function by scikit-learn so I figured I'd include it. sklearn.featureselection.RFE.

  1. Decide on some number of features you want.
  2. Train estimator on the initial set of features and weights are assigned to each one of them
  3. Features whose absolute weights are the smallest, are pruned from the current set of features
  4. Procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

Geometric interpretation

Before you engage on this journey of viewing Multivariate Linear Regression in the light of geometry, you need to get the following into your head:

  • Linear! Say it after me: lineeaar.

OKAY! So I'm smart now, we got dis.

SO! Let's consider the case of simple univariate (ONE SINGLE FREAKING VARIABLE), even excluding the bias / intercept, with the model:


  • linear_regression_0207be880056b9a69e22e729dd37bced29cd174a.png is the input random variable
  • linear_regression_3ca302b80fb078e124ac6b194794815069394f1a.png is the dependent random variable
  • linear_regression_2ac3ae7b30bf05cd1007bf0828df5c932731e0e4.png

We then obtain our least squares estimate to be:


  • linear_regression_ed39d9a397196f8f0ce6388b0ea4e0c1dd8becee.png is a linear_regression_50ca1f6178ea4c01b7b8aa2d8f753fe1202bd6cd.png matrix (generally linear_regression_92372441d5fdd97cef43dce788e12e29ea8d08d8.png)
  • linear_regression_29c26837fff0673a98d6523a80b727d38797f426.png is a linear_regression_50ca1f6178ea4c01b7b8aa2d8f753fe1202bd6cd.png matrix
  • linear_regression_86b8358ce75468a54566791b10e4d79e3fcf616b.png is a linear_regression_bf7fedf9a227562321c0159af638666f1debfade.png matrix (generally linear_regression_0ac76bfcaa502912d8e49bc57b8f8af5975cf87c.png)

Now, since we're only considering the univariate linear regression, we can instead use vector notation and end up with the following:



linear_regression_86b8358ce75468a54566791b10e4d79e3fcf616b.png just happens to be equal to the projection of our samples for the depedent variable onto the column space of our samples for the feature / input variable. Interesting, or what?!

All-in-all, our least squares estimate is then the projection of our samples, linear_regression_29c26837fff0673a98d6523a80b727d38797f426.png, for the dependent rv. linear_regression_3ca302b80fb078e124ac6b194794815069394f1a.png, onto the column-space of our samples, linear_regression_ed39d9a397196f8f0ce6388b0ea4e0c1dd8becee.png, for the input rv. linear_regression_0207be880056b9a69e22e729dd37bced29cd174a.png.

When we are making a prediction on some new input linear_regression_01df804acf60553200444b51ed91c83a9d8283d4.png, we're not computing a projection!

This isn't hard to tell by looking at the equations, but I've found myself pondering about "how making a prediction is computing a projection" when we're not actually doing that.. #BrainFarts

Projections are for fitting and we're computing them for each feature / input variable with the corresponding observed label / dependent variable. These projections then give us the coefficient for the corresponding feature / input variable. Thus, when making a prediction we're computing a linear combination of the features / input variables where the coefficient for each of these are computed by what we described above.

TL;DR: Projections are for fitting, not predicting.

"But Tor, this is just univariate linear regression! We want to play with the big-boys doing multivariate linear regression!" You see, my sweet summerchild, by making sure that our input variables are linearly independent, we can simply provide each of the symbols above with a subscript linear_regression_97eb714dfbd8abb06c6ee1fb2cb049cdaa7defd1.png denoting the linear_regression_066bafe42b878a7e40dda3cac0cd058aa2350267.png feature / input variable, and we're good!

For a more complete explanation of the multivariate case, checkout section 3.2 in The Elements Of Statistical Learning.

I tried…

We'll be viewing the least squares estimate as a projection onto the column space of the data, linear_regression_0207be880056b9a69e22e729dd37bced29cd174a.png. Why does this make any sense?

  • Consider linear_regression_a3a7f43f807b9e381fc50e0fab140c0df0a03e17.pnglinear_regression_3c314f80373742988ad542a6d4ce66a111b17847.png instead of the other way around, i.e. given some linear_regression_a3a7f43f807b9e381fc50e0fab140c0df0a03e17.png we want can define it
  • Basically, the question we have here is "why does making the predicted value for linear_regression_a3a7f43f807b9e381fc50e0fab140c0df0a03e17.png the projection onto the space spanned by the indep. variable linear_regression_3c314f80373742988ad542a6d4ce66a111b17847.png make the residuals smallest?"
    • By def. the projection of some vector linear_regression_a3a7f43f807b9e381fc50e0fab140c0df0a03e17.png onto a subspace is defined to such that the vector from its projected vector to linear_regression_a3a7f43f807b9e381fc50e0fab140c0df0a03e17.png is perpendicular to the plane, i.e. linear_regression_f04c32cb63058c72a90585a6bbedf481347c121c.png is perpendicular to subspace spanned by linear_regression_3c314f80373742988ad542a6d4ce66a111b17847.png, i.e. minimal! (since perpendicular distance is the smallest distance from the plane)
    • Why is this relevant? Well, it tells us that if we have a set of orthogonal vectors we can view our fit as a projection onto the column-space of all linear_regression_e9f4d218474bc10aa94958ec30139aee865c0173.png, i.e. the projection defines how much we move along each of the linear_regression_e9f4d218474bc10aa94958ec30139aee865c0173.png to get our least estimate for some linear_regression_a3a7f43f807b9e381fc50e0fab140c0df0a03e17.png.
    • EACH COMPONENT OF linear_regression_a3a7f43f807b9e381fc50e0fab140c0df0a03e17.png IS PROJECTED ONTO A SUBSPACE SPANNED BY THE INPUTS FOR THIS SPECIFIC COMPONENENT!!!
from functools import wraps

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def ols_fit(X, y):
    # the code below is to standardize the inputs
    # X = X - X.mean()  # p x n
    # X = X / X.var()
    # need to account for intercept / bias
    beta_0 = np.ones(X.shape)
    X = np.vstack((beta_0, X))
    XX_inv = np.linalg.inv(, X.T))  # p x p
    beta =, X), y.T)  # p x 1

    def predict(x):
        x = np.vstack((np.ones(x.shape), x))
        return, x)  # p x n
    return predict

n = 10
x = np.arange(n).reshape(1, -1)
epsilon = np.random.normal(0, 1, n).reshape(1, -1)
y = np.arange(10, 10 + n).reshape(1, -1) + epsilon
fit = ols_fit(x, y)
y_fit = fit(x)

plt.scatter(x, y, label="$y$")
plt.scatter(x, y_fit, c="g", label="$\hat{y}$")
x_extra = (np.max(x) - np.min(x)) * 0.2
plt.xlim(np.min(x) - x_extra * 0.5, np.max(x) + x_extra)
y_extra = (np.max(y) - np.min(y)) * 0.2
plt.ylim(np.min(y) - y_extra, np.max(y) + y_extra)
plt.title("Linear with white noise and fitted values")
return ".linear_regression/1d_data.png"
import numpy as np

n = 10
x = np.arange(n).reshape(1, -1)
epsilon = np.random.normal(0, 1, n).reshape(1, -1)
y = np.arange(10, 10 + n).reshape(1, -1) + epsilon

x = x - x.mean()
x /= x.var()

Estimating std. err. of parameter-estimates

So, we now have our parameter-estimates:


Now, the linear regression model is


Thus, with our estimate for the parameters,


Here we're using linear_regression_86b8358ce75468a54566791b10e4d79e3fcf616b.png to denote the true parameters.

which is a random variable following the distribution


Thus we can compute the variance of each of the parameters, and therefore the standard error!

Least Squares Estimate

Bias-variance decomposition

Consider the mean squared error of an estimator linear_regression_3647d4bab6b6e9c8d0e5a5fd22ad6a80eac3d0de.png in estimating linear_regression_e7a86163f39fa21d4a2ed66946369cdeb900ef42.png


The bias-variance decomposition would then be:


Where linear_regression_c002b5f9ecc48d75e76872122d695f3b8f398912.png since both linear_regression_a5e6509b52edfe4e1953cc9746b09455586e372a.png and linear_regression_e7a86163f39fa21d4a2ed66946369cdeb900ef42.png are constant

Regularization methods

Ridge regression

Ridge regression shrinks the regression coefficients by imposing a penalty on their size.

Ridge regression minimizes a penalized residual sum of squares,


Here linear_regression_a7ad046e58736d7fefae327145a8823d3b6ebc31.png is a complexity parameter that controls the amount of shrinkage; the larger the value of linear_regression_ec150cd354803373f200e63346b8a8df0f63f554.png, the greater the amount of shrinkage.

An equivalent way to write the ridge problem is


for some linear_regression_8598c49b0d06464b104343a50457f795454ec79e.png.

There is a one-to-one correspondance between these two formalations, since the Lagrangian of this constrained optimization problem would only have an extra term of linear_regression_cd394b86f0bedb1ba53097573a5a87b8df3daadb.png, which would simply be a positive constant.

Using centered inputs, we drop linear_regression_e022aef44492777b6daec69f88093d6f042204c0.png and can write the problem in matrix form:


which have solutions


Ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizest he inputs before solving.

The intercept linear_regression_e022aef44492777b6daec69f88093d6f042204c0.png has been left out of the penalty term! Penalization of the intercept would make the procedure depend on the origin chosen for linear_regression_3ca302b80fb078e124ac6b194794815069394f1a.png, that is, adding a constant to each of the targets would not simply result in a shift of the predictions by the same amount linear_regression_32e55c6e155c100dc50ef35a36a56df33f4c0687.png.

From ridge solution in matrix form, we see that the solution adds a positive constant to the diagonal of linear_regression_583037170a4e7a8a58bfed551541540ad930ad65.png before inversion. This makes the problem non-singular, even if linear_regression_583037170a4e7a8a58bfed551541540ad930ad65.png is not!

This was the main motivation for ridge regression when it was first introduced.trevor09

Using Singular Value Decomposition of the centered input matrix linear_regression_553e3281e8d268e42b95a8a4dd22e70563a7024d.png:


For "standard" least squares we have


Observe that linear_regression_f905ba0ef7e220e1c38df6aa1cb9ed4726fd255b.png, where linear_regression_04de495753095ac48526fd3f97632e5a2c48720f.png.

In comparison, the ridge solution is gives us


Note that since linear_regression_a7ad046e58736d7fefae327145a8823d3b6ebc31.png, we have


Observe that like linear regression, ridge regression computes the projection of linear_regression_29c26837fff0673a98d6523a80b727d38797f426.png onto the subspace spanned by linear_regression_ff20c0551c12baae49dd7bb5fbc9a3451b69123d.png, but also shrinks these coordinates by the factors linear_regression_2dd9b94efa0414eb6f930f37215f15d45d202a1a.png! This is why you will sometimes see such methods under the name "shrinkage methods".


Since linear_regression_ab81f5e2edf10aac2d0ddb72d9c04be570e86dee.png is the sample covariance, we know that the eigenvectors of this, linear_regression_511bcf2da8626f41fada28eaf96ee4078dae4274.png are the principal components of the variables in linear_regression_553e3281e8d268e42b95a8a4dd22e70563a7024d.png. Therefore, letting linear_regression_a744aa56eda4e55677f0e61858643edca186ce49.png, we have


Hence, the small singular values linear_regression_97db602b4f00a36c290e3a54fd7bafa006fd0b2c.png correspond to the directions in the column space of linear_regression_553e3281e8d268e42b95a8a4dd22e70563a7024d.png having small variance, and ridge regression shrinks these directions the most.

Lasso regression

Lasso regression is defined by the estimate


In the signal processing literature, the lasso is also known as basis pursuit.

We can write the lasso in the equivalent Lagrangian form:


The absolute value here makes the optimization problem non-linear, and so there is no closed form expression as in the case of ridge regression.

Kernel methods

Different regression-models

Poission regression

Instead of assuming that linear_regression_c6bdf0551175803079efd3f8a991efb35638c693.png to follow


we assume



  • [trevor09] Trevor Hastie, The elements of statistical learning : data mining, inference, and prediction, Springer (2009).