Bias-Variance Tradeoff

Wikipedia definition

Defined as:

  \text{Bias}(\hat{\theta}) = \mathbb{E} \big[ \hat{\theta} - \theta \big]

where the expectation is taken over $p(x | \theta)$, i.e. averaging over all possible observations.

Here we assume that the data were actually generated by the same model family we use as an estimator. We then ask how the expected value of our parameter estimate differs from the true parameter of that model.
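As a concrete check of this definition, here is a small simulation (a sketch, assuming Gaussian data, with all constants chosen just for illustration) that estimates the bias of the maximum-likelihood variance estimator, which divides by $n$ instead of $n - 1$ and therefore has bias $-\theta / n$:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 4.0       # true variance of the data-generating distribution
n = 10            # sample size per replication
reps = 100_000    # number of simulated datasets

# MLE of the variance (np.var divides by n by default), which is biased:
# E[theta_hat] = (n - 1)/n * theta, so Bias = -theta/n = -0.4 here.
estimates = np.array([
    np.var(rng.normal(0.0, np.sqrt(theta), size=n))
    for _ in range(reps)
])

bias = estimates.mean() - theta
print(f"empirical bias: {bias:.3f}  (theory: {-theta / n:.3f})")
```

The expectation "over all possible observations" becomes an average over many simulated datasets, which is exactly how the definition is usually checked in practice.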

ESL definition

From The Elements of Statistical Learning, we have the following definition:

  Err(x_0) &= \mathbb{E}_{\tau} \Big[ \big(Y - \hat{f}(x_0) \big)^2 \Big] \\
           &= \Big( \mathbb{E} \big[ \hat{f}(x_0) \big] - f(x_0) \Big)^2 + \mathbb{E} \Big[ \big( \hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)] \big)^2 \Big] + \sigma_e^2 \\
           &= \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}

where $x_0$ is a single test point and the expectation is taken over training sets $\tau$. We can interpret $\mathbb{E}[\hat{f}(x_0)]$ as training the same model over and over on fresh training sets and averaging the predictions of all these models, basically like we do in bagging (Bootstrap Aggregation).

Notice how this differs from the Wikipedia definition, where we assume the estimator belongs to the same model family as the true data-generating process, just with (potentially) different parameters.
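The decomposition can be verified numerically. The sketch below (assuming a sine target and a straight-line fit, both chosen purely for illustration) trains the same model on many fresh training sets $\tau$ and compares a direct Monte Carlo estimate of $Err(x_0)$ against $\text{Bias}^2 + \text{Variance} + \sigma_e^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.sin          # true regression function (illustrative assumption)
sigma = 0.3         # noise std, so the irreducible error is sigma^2
x0 = 1.0            # the test point x_0
n, reps = 30, 5000  # training-set size and number of training sets tau

preds = np.empty(reps)
for i in range(reps):
    # Draw a fresh training set tau and fit a degree-1 polynomial to it.
    x = rng.uniform(0.0, 2.0, size=n)
    y = f(x) + rng.normal(0.0, sigma, size=n)
    coef = np.polyfit(x, y, deg=1)
    preds[i] = np.polyval(coef, x0)

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()

# Direct estimate of Err(x_0): squared error against fresh noisy
# observations Y = f(x_0) + e, one per trained model.
y_new = f(x0) + rng.normal(0.0, sigma, size=reps)
err = np.mean((y_new - preds) ** 2)

print(f"bias^2 + variance + sigma^2 = {bias2 + variance + sigma**2:.4f}")
print(f"direct estimate of Err(x_0) = {err:.4f}")
```

The two printed numbers should agree up to Monte Carlo noise, since the decomposition is an exact identity for squared-error loss.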


Bias-Variance tradeoff

From what I understand, this all makes sense for models where we can analytically determine the bias and variance of our estimator, but what about more complex models for which there is no clear way of performing a bias-variance decomposition?

From what I can tell, people then use the following not-so-rigorous "definitions":

  • Bias relates to underfitting, which can be observed when the training loss stops decreasing with more training while the error is still quite large. This might be due to one of the following:
    • Our model may not be "complex" enough to handle the target function → increase complexity
    • We're stuck in a local minimum → might be worth changing optimizer (something with momentum, e.g. Adam)
  • Variance relates to overfitting, which can be observed when the gap between the loss on the train and test data is quite large, i.e. the model doesn't generalize well. This usually means we need to make use of some regularization methods.
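These symptoms are easy to reproduce on toy data. The sketch below (polynomial regression on a noisy sine, with all constants chosen hypothetically) shows the typical pattern: a too-simple model has high error on both sets (underfitting / high bias), while a too-flexible one fits the training set almost perfectly but leaves a large train/test gap (overfitting / high variance):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: noisy sine, with x rescaled to [-1, 1] so that the
# high-degree polynomial fit stays well-conditioned.
def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=n)

x_tr, y_tr = make_data(20)    # small training set -> easy to overfit
x_te, y_te = make_data(500)   # held-out data to measure generalization

results = {}
for deg in (1, 4, 15):
    coef = np.polyfit(x_tr, y_tr, deg)
    mse_tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    results[deg] = (mse_tr, mse_te)
    print(f"degree {deg:2d}: train MSE {mse_tr:.3f}, test MSE {mse_te:.3f}")
```

Degree 1 should show large errors on both sets, degree 4 should do well on both, and degree 15 should show a near-zero training error but a clearly worse test error, which is the not-so-rigorous "variance" symptom described above.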