# Linear Regression

## Table of Contents

## Multivariable linear regression

### Assumptions

- Linearity
- Homoscedasticity
- A sequence of rvs. is
**homoscedastic**if all rvs. in the sequence have the*same*finite variance - The complementary notion sis called
**heteroscedasticity**

- A sequence of rvs. is
- Multivariate normal
- Independence of errors
- Lack of multicolinearity
- For our estimator to be a Best Linear Unbiased Estimator (BLUE) we require that the variables are independent

#### Dummy variable trap

Let's say we have a categorical variable in our data set, which can
take on two values, e.g. `raining`

or `not raining`

We can then either represent this as *one* binary variable,
or two binary variables, and .

If we do one variable, we have the model:

If we have two variables, we have the model:

Now, do you see something wrong with the second option? To obtain our estimated parameters (when using Ordinary Least Squares (OLS)) we compute the following:

where is the data matrix and is the label-vector.

The issue lies in the . To be able to compute a the inverse of
a matrix we require it to be invertible / non-singular. This is equivalent
of saying it needs to have *full rank*, i.e. all columns must be linearly *independent*
(The same holds for rows). In the case where , this is not the case!
The covariance between and will be 1!

Therefore, you *always* go with the first solution. This also is the case for
categorical variables which can take on more than 2 values: you only include
a parameter for of these values the categorical variable
can take on.

Intuitively, you can also sort of view it as the last value for the categorical variable being included in the .

### Variable selection

This section is regarding choosing variables to include in your model,
as you usually don't want to include *all* of them.

The three first methods (all different versions of "step-wise" regression) are apparently not a good idea:

#### Backward elimination

- Decide on some significance level, e.g.
- Fit full model using all variables
- Consider the variable with the largest P-value.
If :
- Go to step 4
- Otherwise: go to FIN

- Remove this variable.
- Fit full model without the removed variables. Go back to step 3

#### Forward elimination

- Decide on some significance level, e.g.
- Fit all simple regression models , and select the one with the lowest P-value
- Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have
- Consider the predictor with the
*lowest*P-value. If :- Go to step 3, i.e. keep introductin predictors until we reach a insignificant one
- Otherwise: go to FIN

Since the last added predictor is of a too high P-value, we actually use the model fitted just before adding this last predictor, not the model we have when we go to FIN.

#### Bi-directional elimination (step-wise regression)

- Select a significance level to enter and to stay in the model,
e.g.:
`SLENTER = 0.05`

and`SLSTAY = 0.05`

- Perform the next of Forward Selection (new variables must have: =P < SLENTER to enter)
- Perform ALL steps of Backward Elimination (old variables must have
`P < SLSTAY`

to stay) - No new variables can enter and no old variables can exit => FIN

#### Recursive feature elimination

This one was not actually in the course-lectures, but is the only one provided as a
function by `scikit-learn`

so I figured I'd include it. sklearn.feature_{selection.RFE}.

- Decide on some number of features you want.
- Train estimator on the initial set of features and weights are assigned to each one of them
- Features whose absolute weights are the smallest, are pruned from the current set of features
- Procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

### Geometric interpretation

Before you engage on this journey of viewing Multivariate Linear Regression in the light of geometry, you need to get the following into your head:

*Linear*! Say it after me:*lineeaar*.

#### OKAY! So I'm smart now, we got dis.

SO! Let's consider the case of simple univariate (ONE SINGLE FREAKING VARIABLE), even excluding the bias / intercept, with the model:

- is the
*input*random variable - is the
*dependent*random variable

We then obtain our least squares estimate to be:

- is a matrix (generally )
- is a matrix
- is a matrix (generally )

Now, since we're only considering the univariate linear regression, we can instead use vector notation and end up with the following:

DO YOU SEE IT?!

just happens to be equal to the projection of our samples for the depedent variable onto the column space of our samples for the feature / input variable. Interesting, or what?!

All-in-all, our least squares estimate is then the projection of our samples, , for the dependent rv. , onto the column-space of our samples, , for the input rv. .

When we are making a prediction on some new input , we're *not*
computing a projection!

This isn't hard to tell by looking at the equations, but I've found myself pondering about "how making a prediction is computing a projection" when we're not actually doing that.. #BrainFarts

Projections are for *fitting* and we're computing them for each feature / input variable
with the corresponding observed label / dependent variable. These projections then
give us the *coefficient* for the corresponding feature / input variable. Thus,
when making a prediction we're computing a linear combination of the features / input variables
where the coefficient for each of these are computed by what we described above.

**TL;DR:** Projections are for fitting, *not* predicting.

"But Tor, this is just univariate linear regression! We want to play with the big-boys doing *multivariate* linear regression!"
You see, my sweet summerchild, by making sure that our input variables are linearly independent, we can
simply provide each of the symbols above with a subscript denoting the feature / input variable, and we're good!

For a more complete explanation of the multivariate case, checkout section 3.2 in The Elements Of Statistical Learning.

### I tried…

We'll be viewing the least squares estimate as a *projection* onto the
column space of the data, . Why does this make any sense?

- Consider → instead of the other way around, i.e. given some we want can define it
- Basically, the question we have here is "why does making the predicted
value for the projection onto the space spanned by the indep. variable
make the residuals smallest?"
- By def. the projection of some vector onto a subspace is defined to such that the vector from its projected vector to is perpendicular to the plane, i.e. is perpendicular to subspace spanned by , i.e. minimal! (since perpendicular distance is the smallest distance from the plane)
- Why is this relevant? Well, it tells us that if we have a set of orthogonal vectors we can view our fit as a projection onto the column-space of all , i.e. the projection defines how much we move along each of the to get our least estimate for some .
- EACH COMPONENT OF IS PROJECTED ONTO A SUBSPACE SPANNED BY THE INPUTS FOR THIS SPECIFIC COMPONENENT!!!

from functools import wraps import numpy as np import matplotlib.pyplot as plt import seaborn as sns def ols_fit(X, y): # the code below is to standardize the inputs # X = X - X.mean() # p x n # X = X / X.var() # need to account for intercept / bias beta_0 = np.ones(X.shape) X = np.vstack((beta_0, X)) XX_inv = np.linalg.inv(np.dot(X, X.T)) # p x p beta = np.dot(np.dot(XX_inv, X), y.T) # p x 1 def predict(x): x = np.vstack((np.ones(x.shape), x)) return np.dot(beta.T, x) # p x n return predict n = 10 x = np.arange(n).reshape(1, -1) epsilon = np.random.normal(0, 1, n).reshape(1, -1) y = np.arange(10, 10 + n).reshape(1, -1) + epsilon fit = ols_fit(x, y) y_fit = fit(x) print(y_fit) plt.scatter(x, y, label="$y$") plt.scatter(x, y_fit, c="g", label="$\hat{y}$") x_extra = (np.max(x) - np.min(x)) * 0.2 plt.xlim(np.min(x) - x_extra * 0.5, np.max(x) + x_extra) y_extra = (np.max(y) - np.min(y)) * 0.2 plt.ylim(np.min(y) - y_extra, np.max(y) + y_extra) plt.legend() plt.title("Linear with white noise and fitted values") plt.savefig(".linear_regression/1d_data.png") return ".linear_regression/1d_data.png"

import numpy as np n = 10 x = np.arange(n).reshape(1, -1) epsilon = np.random.normal(0, 1, n).reshape(1, -1) y = np.arange(10, 10 + n).reshape(1, -1) + epsilon x = x - x.mean() x /= x.var() np.linalg.matrix_rank(x)

### Estimating std. err. of parameter-estimates

So, we now have our parameter-estimates:

Now, the linear regression model is

Thus, with our estimate for the parameters,

Here we're using to denote the true parameters.

which is a random variable following the distribution

Thus we can compute the variance of each of the parameters, and therefore the standard error!

## Least Squares Estimate

### Bias-variance decomposition

Consider the mean squared error of an estimator in estimating

The *bias-variance decomposition* would then be:

Where since both and are constant

## Regularization methods

### Ridge regression

Ridge regression shrinks the regression coefficients by imposing a penalty on their size.

**Ridge regression** minimizes a penalized residual sum of squares,

Here is a complexity parameter that controls the amount of *shrinkage*; the larger the value of , the greater the amount of shrinkage.

An equivalent way to write the ridge problem is

for some .

*There is a one-to-one correspondance between these two formalations, since the Lagrangian of this constrained optimization problem would only have an extra term of , which would simply be a positive constant.*

Using *centered* inputs, we drop and can write the problem in matrix form:

which have solutions

Ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizest he inputs before solving.

The intercept has been left out of the penalty term! Penalization of the intercept would make the procedure depend on the origin chosen for , that is, adding a constant to each of the targets would not simply result in a shift of the predictions by the same amount .

From ridge solution in matrix form, we see that the solution adds a positive constant to the diagonal of before inversion. This makes the problem *non-singular*, even if is not!

This was the main motivation for ridge regression when it was first introduced.trevor09

Using Singular Value Decomposition of the centered input matrix :

For "standard" least squares we have

Observe that , where .

In comparison, the ridge solution is gives us

Note that since , we have

Observe that like linear regression, ridge regression computes the projection of onto the subspace spanned by , but also *shrinks* these coordinates by the factors ! This is why you will sometimes see such methods under the name "shrinkage methods".

#### Variance

Since is the sample covariance, we know that the eigenvectors of this, are the principal components of the variables in . Therefore, letting , we have

Hence, the small singular values correspond to the directions in the column space of having small variance, and ridge regression shrinks these directions the most.

### Lasso regression

**Lasso regression** is defined by the estimate

In the signal processing literature, the **lasso** is also known as *basis pursuit*.

We can write the **lasso** in the equivalent *Lagrangian* form:

The absolute value here makes the optimization problem non-linear, and so there is no closed form expression as in the case of ridge regression.

## Kernel methods

## Different regression-models

### Poission regression

Instead of assuming that to follow

we assume

## Bibliography

# Bibliography

- [trevor09] Trevor Hastie, The elements of statistical learning : data mining, inference, and prediction, Springer (2009).