Gradient Boosting

Gradient boosting
Gradient Tree Boosting

Gradient boosting

Overview

Gradient boosting is a technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically prediction trees. It builds the model in a stage-wise fashion like other boosting methods do, and generalizes them by allowing optimization of an arbitrary differentiable loss function.

Motivation

Consider a least-squares regression setting, where the goal is to "teach" a model $F$ to predict values in the form $\hat{y} = F(x)$ by minimizing the mean squared error $(\hat{y} - y)^2$ , averaged over some training set of actualy values of the output variable $y$ .

At each stage $1 \ge m \ge M$ , of gradient boosting, it may be assumed that there is a some imperfect model $F_m$ . The gradient boosing algorithm improves on $F_m$ by constructing a new model that adds an estimator $h(x)$ to provide a better model: $F_{m + 1}(x) = F_m (x) + h(x)$ .

To find $h(x)$ , the gradient boosting solution starts with the observation that a perfect $h(x)$ would imply

$\begin{equation*} F_{m+1}(x) = F_m + h(x) = y \end{equation*}$

i.e. that adding the extra predictor $h(x)$ will make up for all the mistakes our previous model $F_m(x)$ did, or equivalently

$\begin{equation*} h(x) = y - F_m(x) \end{equation*}$

Therefore, gradient boosting will fit $h(x)$ to the residual $y - F_m(x)$ . Like in other boosting variants, each $F_{m+1}$ learns to correct its predecessor $F_m(x)$ .

The generalization to other loss-functions than the least-squares, comes from the observation that $y - F_m(x)$ for a given model are the negative gradients (wrt. $F(x)$ ) of the squared error loss function, hence for any loss function $L$ we'll simply fit to the negative gradient $- \frac{\partial L}{\partial F_m}$ .

Algorithm

Input: training set $\{(x_i, y_i)}_{i=1}^n$ , a differentiable loss function $L(y, F(x))$ , number of iterations $M$ .

Initialize model with a constant value:

$\begin{equation*} F_0(x) = \underset{\gamma}{\arg \min} \sum_{i = 1}^{n} L(y_i, \gamma) \end{equation*}$
For $m = 1$ to $M$ :
1. Compute so-called psuedo-residuals :
$\begin{equation*} r_{im} = - \Bigg[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \Bigg]_{F(x) = F_{m-1}(x)} \quad \text{for } i = 1, \dots, n \end{equation*}$
1. Fit a base learner (e.g. tree) $h(x)$ to psuedo-residuals, i.e. train it using the training set $\{(x_i, r_{im})\}_{i=1}^n$
2. Compute multiplier $\gamma_m$ sby solving the following one-dimension optimization problem
$\begin{equation*} \gamma_m = \underset{\gamma}{\arg \min} \sum_{i = 1}^{n} L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i) \big) \end{equation*}$
1. Update the model:
$\begin{equation*} F_m(x) = F_{m-1}(x) + \gamma_m h_m(x) \end{equation*}$
Output $F_M(x)$

Gradient Tree Boosting

Uses decision trees of fixed size as base learners

Modified gradient boosting method ("TreeBoost")

Generic gradient boosting at the m-th step would fit a decision tree $h_m(x)$ to psuedo-residuals
Let $J_m$ be the number of its leaves
Tree partitions input space into $J_m$ disjoint regions $R_{1m}, \dots, R_{J_m, m}$ and predicts a constant value in each region, i.e.

$\begin{equation*} h_m(x) = \sum_{j=1}^{J_m} b_{jm} \mathbf{1}_{R_{jm}} (x) \end{equation*}$

where $b_{jm}$ is the value predicted in the region $R_{jm}$ .

In the case of usual CART trees, where the trees are fitted using least-squares loss, and so the coefficient $b_{jm}$ for the region $R_{jm}$ is equal to just the value of the output variable, averaged over all training instances in $R_{jm}$ .

Continuing with generic gradient boosting, the model is:

$\begin{equation*} \begin{split} F_m(x) &= F_{m - 1}(x) + \gamma_m h_m(x) \\ \gamma_m &= \underset{\gamma}{\arg \min} \sum_{i=1}^{n} L \big( y_i, F_{m - 1}(x_i) \big) + \gamma h_m(x_i) \end{split} \end{equation*}$

The modified gradient tree boosting ("TreeBoost") algorithm is then to instead choose a separate optimal value $\gamma_{jm}$ for each of the tree's regions, instead of a single $\gamma_m$ for whole tree:

$\begin{equation*} \begin{split} F_m(x) &= F_{m - 1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} \mathbf{1}_{R_jm}(x) \\ \gamma_m &= \underset{\gamma}{\arg \min} \sum_{x_i \in R_{jm}} L \Big( y_i, F_{m - 1}(x_i) \Big) + \gamma h_m(x_i) \end{split} \end{equation*}$

where we've simply absorbed $b_{jm}$ into $y_{jm}$ .

Terminology

Out-of-bag error / estimate is a method for measuring prediction error of methods utilizing bootstrap aggregating (bagging) to sub-sample data used for training. OOB is the mean prediction error on each training sample $x_i$ , using only the trees that did not have $x_i$ in their bootstrap sample

Regularization

Number of trees: Large number of trees can overfit
Shrinkage: use a modified update rule:

$\begin{equation*} F_m(x) = F_{m - 1}(x) + \nu \ \gamma_m h_m(x), \quad 0 < \nu \le 1 \end{equation*}$

where $\nu$ is the learnin rate.
Subsampling is when at each iteration of the algorithm, a base learner should be fit on a subsample of the training set drawn at random without replacement. Sumsample size $f$ is the fraction of training set to use; if $f = 1$ then algorithm is deterministic and identitical to TreeBoost
Limit number of observations in trees' terminal nodes: in the tree building process by ignoring any split that lead to nodes containing fewer than this number of training set instances. Helps reduce variance in predictions at leaves.
Penalize complexity of tree: model complexity can be defined as proportional number of leaves in the learned trees. Joint optimization of loss and model complexity corresponds to a post-pruning algorithm.
Regularization (e.g. $\ell_1$ , $\ell_2$ ) on leaf nodes

Gradient Boosting

Table of Contents