Gradient Boosting

Overview

Gradient boosting is a technique for regression and classification problems which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion, like other boosting methods, and generalizes them by allowing optimization of an arbitrary differentiable loss function.

Motivation

Consider a least-squares regression setting, where the goal is to "teach" a model $F$ to predict values of the form $\hat{y} = F(x)$ by minimizing the mean squared error $(\hat{y} - y)^2$, averaged over some training set of actual values of the output variable $y$.

At each stage $1 \le m \le M$ of gradient boosting, it may be assumed that there is some imperfect model $F_m$. The gradient boosting algorithm improves on $F_m$ by constructing a new model that adds an estimator $h(x)$ to provide a better model: $F_{m + 1}(x) = F_m(x) + h(x)$.

To find $h(x)$, the gradient boosting solution starts with the observation that a perfect $h(x)$ would imply

\begin{equation*}
F_{m+1}(x) = F_m(x) + h(x) = y
\end{equation*}

i.e. that adding the extra predictor $h(x)$ will make up for all the mistakes of our previous model $F_m(x)$, or equivalently

\begin{equation*}
h(x) = y - F_m(x)
\end{equation*}

Therefore, gradient boosting will fit $h(x)$ to the residual $y - F_m(x)$. Like in other boosting variants, each $F_{m+1}$ learns to correct its predecessor $F_m(x)$.
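
As a quick illustration, here is a minimal sketch of a single correction step in Python (assuming scikit-learn's DecisionTreeRegressor as the estimator for $h$; the data and names are purely illustrative):

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  # Toy 1-D regression data (illustrative only).
  rng = np.random.default_rng(0)
  X = rng.uniform(-3, 3, size=(200, 1))
  y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

  # An imperfect starting model F_m: here simply the mean of y.
  F_m = np.full_like(y, y.mean())

  # Fit h to the residuals y - F_m and form the improved model F_{m+1}.
  h = DecisionTreeRegressor(max_depth=2).fit(X, y - F_m)
  F_m_plus_1 = F_m + h.predict(X)

  print("MSE before:", np.mean((y - F_m) ** 2))
  print("MSE after: ", np.mean((y - F_m_plus_1) ** 2))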

The generalization to loss functions other than least squares comes from the observation that the residuals $y - F_m(x)$ for a given model are the negative gradients (with respect to $F(x)$) of the squared error loss function; hence for any loss function $L$ we simply fit $h$ to the negative gradient $- \frac{\partial L}{\partial F_m}$.
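
As a quick check, take the squared error loss with the conventional $\tfrac{1}{2}$ factor:

\begin{equation*}
L(y, F(x)) = \tfrac{1}{2} \big( y - F(x) \big)^2
\quad \Longrightarrow \quad
- \frac{\partial L}{\partial F(x)} = y - F(x)
\end{equation*}

so the negative gradient of the squared error, evaluated at the current model $F_m$, is exactly the residual $y - F_m(x)$ that we were fitting above.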

Algorithm

Input: training set $\{(x_i, y_i)\}_{i=1}^n$, a differentiable loss function $L(y, F(x))$, number of iterations $M$.

  1. Initialize model with a constant value:

    \begin{equation*}
F_0(x) = \underset{\gamma}{\arg \min} \sum_{i = 1}^{n} L(y_i, \gamma)
\end{equation*}
  2. For $m = 1$ to $M$:

    1. Compute the so-called pseudo-residuals:
    \begin{equation*}
r_{im} = - \Bigg[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \Bigg]_{F(x) = F_{m-1}(x)} \quad \text{for } i = 1, \dots, n
\end{equation*}
    2. Fit a base learner (e.g. a tree) $h_m(x)$ to the pseudo-residuals, i.e. train it using the training set $\{(x_i, r_{im})\}_{i=1}^n$
    3. Compute the multiplier $\gamma_m$ by solving the following one-dimensional optimization problem:
    \begin{equation*}
\gamma_m = \underset{\gamma}{\arg \min} \sum_{i = 1}^{n} L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i) \big)
\end{equation*}
    4. Update the model:
    \begin{equation*}
F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)
\end{equation*}
  3. Output $F_M(x)$
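
A minimal Python sketch of the algorithm above, assuming scikit-learn trees as base learners, SciPy's minimize_scalar for the line search, and squared-error loss; names such as gradient_boost are illustrative, not a reference implementation:

  import numpy as np
  from scipy.optimize import minimize_scalar
  from sklearn.tree import DecisionTreeRegressor

  def gradient_boost(X, y, loss, loss_grad, M=100, max_depth=3):
      """Generic gradient boosting sketch following steps 1-3 above."""
      y = np.asarray(y, dtype=float)
      # Step 1: initialize with the constant value that minimizes the loss.
      F0 = minimize_scalar(lambda c: loss(y, np.full_like(y, c))).x
      F = np.full_like(y, F0)
      learners, gammas = [], []
      # Step 2: for m = 1 to M
      for _ in range(M):
          r = -loss_grad(y, F)                                        # 2.1 pseudo-residuals
          h = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)    # 2.2 fit base learner
          pred = h.predict(X)
          gamma = minimize_scalar(lambda g: loss(y, F + g * pred)).x  # 2.3 line search
          F = F + gamma * pred                                        # 2.4 update the model
          learners.append(h)
          gammas.append(gamma)
      # Step 3: output F_M(x), packaged as a prediction function.
      def predict(X_new):
          out = np.full(len(X_new), F0)
          for h, g in zip(learners, gammas):
              out = out + g * h.predict(X_new)
          return out
      return predict

  # Squared-error loss L(y, F) = 1/2 * sum (y - F)^2 and its per-sample gradient dL/dF.
  sq_loss = lambda y, F: 0.5 * np.sum((y - F) ** 2)
  sq_grad = lambda y, F: F - y
  # Usage: model = gradient_boost(X_train, y_train, sq_loss, sq_grad, M=50)
  #        y_hat = model(X_test)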

Gradient Tree Boosting

  • Uses decision trees of fixed size as base learners

Modified gradient boosting method ("TreeBoost")

  • Generic gradient boosting at the $m$-th step would fit a decision tree $h_m(x)$ to pseudo-residuals
  • Let $J_m$ be the number of its leaves
  • Tree partitions input space into $J_m$ disjoint regions $R_{1m}, \dots, R_{J_m, m}$ and predicts a constant value in each region, i.e.

    \begin{equation*}
h_m(x) = \sum_{j=1}^{J_m} b_{jm} \mathbf{1}_{R_{jm}} (x)
\end{equation*}

    where $b_{jm}$ is the value predicted in the region $R_{jm}$.

In the case of usual CART regression trees, the trees are fitted to the pseudo-residuals using a least-squares criterion, so the coefficient $b_{jm}$ for the region $R_{jm}$ is simply the value of the output variable (here, the pseudo-residual) averaged over all training instances in $R_{jm}$.
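
In symbols (a restatement of the above, with the pseudo-residuals $r_{im}$ as the fitting target):

\begin{equation*}
b_{jm} = \frac{1}{\left| \{ i : x_i \in R_{jm} \} \right|} \sum_{x_i \in R_{jm}} r_{im}
\end{equation*}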

The modified gradient tree boosting ("TreeBoost") algorithm instead chooses a separate optimal value $\gamma_{jm}$ for each of the tree's regions, rather than a single $\gamma_m$ for the whole tree:

\begin{equation*}
\begin{split}
  F_m(x) &= F_{m - 1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} \mathbf{1}_{R_{jm}}(x) \\
  \gamma_{jm} &= \underset{\gamma}{\arg \min} \sum_{x_i \in R_{jm}} L \big( y_i, F_{m - 1}(x_i) + \gamma \big)
\end{split}
\end{equation*}

where we've simply absorbed $b_{jm}$ into $\gamma_{jm}$.
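
A minimal sketch of a single TreeBoost step, taking absolute-error loss as an example (for which the per-leaf optimum $\gamma_{jm}$ reduces to the median of the residuals in the leaf); scikit-learn's DecisionTreeRegressor is assumed and the names are illustrative:

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  def treeboost_step(X, y, F_prev, max_depth=3):
      """One TreeBoost update for absolute-error loss L(y, F) = |y - F|."""
      # Pseudo-residuals for absolute error: the sign of the residual.
      r = np.sign(y - F_prev)
      # Fit a regression tree h_m to the pseudo-residuals.
      tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
      # Region R_jm that each training point falls into (leaf index).
      leaf = tree.apply(X)
      # Per-leaf optimum: for absolute error, gamma_jm is the median of the
      # raw residuals y_i - F_{m-1}(x_i) over the points in R_jm.
      gamma = {j: np.median((y - F_prev)[leaf == j]) for j in np.unique(leaf)}
      # Update: F_m(x) = F_{m-1}(x) + sum_j gamma_jm * 1_{R_jm}(x).
      F_new = F_prev + np.array([gamma[j] for j in leaf])
      return tree, gamma, F_new

At prediction time the same lookup applies: evaluate tree.apply on the new inputs and add the stored $\gamma_{jm}$ for the leaf each point lands in.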

Terminology

  • Out-of-bag (OOB) error / estimate is a method for measuring the prediction error of methods that use bootstrap aggregating (bagging) to sub-sample the data used for training. The OOB error is the mean prediction error on each training sample $x_i$, computed using only the trees that did not have $x_i$ in their bootstrap sample.
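
A minimal sketch of the OOB estimate for a bagged ensemble of regression trees (numpy arrays and scikit-learn's DecisionTreeRegressor assumed; names are illustrative):

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  def oob_mse(X, y, n_trees=100, seed=0):
      """Bag regression trees and return the out-of-bag mean squared error."""
      rng = np.random.default_rng(seed)
      n = len(y)
      oob_sum = np.zeros(n)   # sum of predictions from trees that excluded sample i
      oob_cnt = np.zeros(n)   # number of such trees per sample
      for _ in range(n_trees):
          idx = rng.integers(0, n, size=n)           # bootstrap sample (with replacement)
          tree = DecisionTreeRegressor().fit(X[idx], y[idx])
          oob = np.setdiff1d(np.arange(n), idx)      # samples not drawn this round
          oob_sum[oob] += tree.predict(X[oob])
          oob_cnt[oob] += 1
      covered = oob_cnt > 0                          # samples left out at least once
      oob_pred = oob_sum[covered] / oob_cnt[covered]
      return np.mean((y[covered] - oob_pred) ** 2)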

Regularization

  • Number of trees: a large number of trees can overfit
  • Shrinkage: use a modified update rule (see the usage sketch after this list):

    \begin{equation*}
F_m(x) = F_{m - 1}(x) + \nu \ \gamma_m h_m(x), \quad 0 < \nu \le 1
\end{equation*}

    where $\nu$ is the learning rate.

  • Subsampling: at each iteration of the algorithm, the base learner is fit on a subsample of the training set drawn at random without replacement. The subsample size $f$ is the fraction of the training set to use; if $f = 1$ then the algorithm is deterministic and identical to TreeBoost.
  • Limit the number of observations in trees' terminal nodes: during tree building, ignore any split that leads to nodes containing fewer than this number of training set instances. Helps reduce variance in predictions at the leaves.
  • Penalize complexity of the tree: model complexity can be defined as proportional to the number of leaves in the learned trees. Joint optimization of loss and model complexity corresponds to a post-pruning algorithm.
  • Regularization (e.g. $\ell_1$, $\ell_2$) on leaf nodes
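
For reference, most of these knobs map onto parameters of scikit-learn's GradientBoostingRegressor; a usage sketch with arbitrary values:

  from sklearn.datasets import make_regression
  from sklearn.ensemble import GradientBoostingRegressor

  X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

  model = GradientBoostingRegressor(
      n_estimators=500,      # number of trees M
      learning_rate=0.05,    # shrinkage factor nu
      subsample=0.5,         # fraction f for stochastic gradient boosting
      min_samples_leaf=10,   # limit on observations in terminal nodes
      max_depth=3,           # fixed-size trees as base learners
      random_state=0,
  ).fit(X, y)

The $\ell_1$/$\ell_2$ penalties on leaf values from the last bullet are not exposed by this estimator; they appear in other implementations, e.g. XGBoost's reg_alpha and reg_lambda parameters.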