Gradient Boosting

Gradient boosting is a technique for regression and classification problems which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion, like other boosting methods, and generalizes them by allowing optimization of an arbitrary differentiable loss function.


Consider a least-squares regression setting, where the goal is to "teach" a model $F$ to predict values of the form $\hat{y} = F(x)$ by minimizing the mean squared error $\frac{1}{n}\sum_i (\hat{y}_i - y_i)^2$, where $i$ indexes a training set of size $n$ of actual values of the output variable $y$.

At each stage $m$ of gradient boosting, it may be assumed that there is some imperfect model $F_m$. The gradient boosting algorithm improves on $F_m$ by constructing a new model that adds an estimator $h$ to provide a better model: $F_{m+1}(x) = F_m(x) + h(x)$.

To find $h$, the gradient boosting solution starts with the observation that a perfect $h$ would imply

$$F_{m+1}(x) = F_m(x) + h(x) = y$$

i.e. that adding the extra predictor $h$ will make up for all the mistakes our previous model $F_m$ made, or equivalently

$$h(x) = y - F_m(x)$$

Therefore, gradient boosting will fit $h$ to the residual $y - F_m(x)$. As in other boosting variants, each $F_{m+1}$ learns to correct its predecessor $F_m$.
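To make this concrete, here is a minimal sketch of one such correction step for squared error. The toy data and all names are our own, and the base learner $h$ is a simple centered linear fit rather than a tree.

```python
# One boosting-style correction step for least squares (toy data; all
# names here are illustrative, not from any library).

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.5, 2.5, 3.5, 4.5]  # targets happen to follow y = x + 0.5

# Imperfect starting model F0: the constant mean of y
f0 = sum(ys) / len(ys)

# Residuals y - F0(x) that the next learner h must account for
residuals = [y - f0 for y in ys]

# Fit h to the residuals; here a centered linear fit h(x) = a * (x - mean_x)
mean_x = sum(xs) / len(xs)
a = sum((x - mean_x) * r for x, r in zip(xs, residuals)) \
    / sum((x - mean_x) ** 2 for x in xs)

def f1(x):
    """Improved model F1(x) = F0(x) + h(x)."""
    return f0 + a * (x - mean_x)
```

Because this toy target is exactly linear, a single correction removes the entire residual; with real data and weak learners, each stage only shrinks the residual and many stages are needed.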

The generalization to loss functions other than least squares comes from the observation that the residuals $y - F_m(x)$ for a given model are the negative gradients (with respect to $F(x)$) of the squared error loss function $\frac{1}{2}(y - F(x))^2$; hence, for any differentiable loss function $L(y, F(x))$ we simply fit $h$ to the negative gradient $-\frac{\partial L(y, F(x))}{\partial F(x)}$.
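A quick numerical check of this observation (the numbers are made up): for the squared error loss $\frac{1}{2}(y - F(x))^2$, the negative gradient with respect to $F(x)$ is exactly the residual $y - F(x)$.

```python
# Verify numerically that the residual equals the negative gradient of
# the squared error loss L(y, F) = (y - F)^2 / 2 (toy values).

y, f = 3.0, 2.2

residual = y - f

# Negative gradient -dL/dF via a central finite difference
eps = 1e-6
loss = lambda F: 0.5 * (y - F) ** 2
neg_grad = -(loss(f + eps) - loss(f - eps)) / (2 * eps)

# neg_grad and residual agree up to floating-point error
```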


Input: training set $\{(x_i, y_i)\}_{i=1}^n$, a differentiable loss function $L(y, F(x))$, number of iterations $M$.

  1. Initialize model with a constant value:

     $$F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma)$$

  2. For $m = 1$ to $M$:

    1. Compute the so-called pseudo-residuals:

       $$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} \quad \text{for } i = 1, \ldots, n$$
    2. Fit a base learner (e.g. a tree) $h_m(x)$ to the pseudo-residuals, i.e. train it using the training set $\{(x_i, r_{im})\}_{i=1}^n$
    3. Compute the multiplier $\gamma_m$ by solving the following one-dimensional optimization problem:

       $$\gamma_m = \arg\min_\gamma \sum_{i=1}^n L\left(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\right)$$

    4. Update the model:

       $$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$$

  3. Output $F_M(x)$
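The steps above can be sketched as a short, self-contained program. The stump learner, function names, and toy data are our own, and the example hard-codes the squared-error loss, for which the pseudo-residuals are just $y - F(x)$ and the line search gives $\gamma_m = 1$.

```python
# A runnable sketch of the gradient boosting algorithm above for the
# squared-error loss, with one-split "decision stumps" as base learners.
# All names and data are illustrative, not from any library.

def fit_stump(xs, rs):
    """Least-squares stump: pick the split threshold minimizing the
    squared error, predicting the mean residual on each side."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=50):
    # Step 1: initialize with the constant minimizing squared error (the mean)
    f0 = sum(ys) / len(ys)
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Step 2.1: pseudo-residuals; for L = (y - F)^2 / 2 these are y - F(x)
        rs = [y - p for y, p in zip(ys, preds)]
        # Step 2.2: fit the base learner h_m to the pseudo-residuals
        h = fit_stump(xs, rs)
        # Steps 2.3-2.4: for squared error the line search gives gamma_m = 1,
        # so the update is simply F_m = F_{m-1} + h_m
        stumps.append(h)
        preds = [p + h(x) for p, x in zip(preds, xs)]
    # Step 3: output F_M(x) = F_0 plus the sum of stump corrections
    return lambda x: f0 + sum(h(x) for h in stumps)

# Toy usage: fit a bumpy 1-D target
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.8, 0.9, 0.1, -0.8, -1.0]
model = gradient_boost(xs, ys)
```

With enough rounds the stump ensemble drives the training residuals toward zero on this toy data; on real data one would stop earlier or regularize, as discussed below.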

Gradient Tree Boosting

  • Uses decision trees of fixed size as base learners

Modified gradient boosting method ("TreeBoost")

  • Generic gradient boosting at the $m$-th step would fit a decision tree $h_m(x)$ to the pseudo-residuals
  • Let $J_m$ be the number of its leaves
  • The tree partitions the input space into $J_m$ disjoint regions $R_{1m}, \ldots, R_{J_m m}$ and predicts a constant value in each region, i.e.

    $$h_m(x) = \sum_{j=1}^{J_m} b_{jm} \mathbf{1}_{R_{jm}}(x)$$

    where $b_{jm}$ is the value predicted in the region $R_{jm}$.

In the case of usual CART trees, the trees are fitted using least-squares loss, so the coefficient $b_{jm}$ for the region $R_{jm}$ is simply the value of the output variable averaged over all training instances in $R_{jm}$.

The modified gradient tree boosting ("TreeBoost") algorithm instead chooses a separate optimal value $\gamma_{jm}$ for each of the tree's regions, rather than a single $\gamma_m$ for the whole tree:

$$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} \mathbf{1}_{R_{jm}}(x), \qquad \gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L\left(y_i, F_{m-1}(x_i) + \gamma\right)$$

where we've simply absorbed $b_{jm}$ into $\gamma_{jm}$.
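A small illustration of why the per-region line search matters (the leaf assignments and numbers below are made up): for the absolute-error loss $L(y, F) = |y - F|$, the optimal leaf value $\gamma_{jm}$ is the median of the residuals in that leaf, not the mean a least-squares CART fit would store.

```python
# Per-leaf TreeBoost values under absolute-error loss (toy numbers):
# the optimal gamma for each region is the median of its residuals.
from statistics import median

# Residuals y_i - F_{m-1}(x_i), grouped by the region R_jm that each
# training point falls into (hypothetical leaf assignments)
leaf_residuals = {
    "R1": [0.2, 0.4, 9.0],    # a leaf containing an outlier
    "R2": [-0.3, -0.1, -0.2],
}

# TreeBoost: one gamma per region, chosen for the actual loss (median)
gammas = {leaf: median(rs) for leaf, rs in leaf_residuals.items()}

# For comparison: the leaf means a plain least-squares tree would use;
# note how the outlier in R1 drags the mean far from the median
means = {leaf: sum(rs) / len(rs) for leaf, rs in leaf_residuals.items()}
```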


  • Out-of-bag (OOB) error / estimate is a method for measuring the prediction error of methods that use bootstrap aggregating (bagging) to sub-sample the data used for training. The OOB error is the mean prediction error on each training sample $x_i$, using only the trees that did not have $x_i$ in their bootstrap sample


  • Number of trees: a large number of trees can overfit
  • Shrinkage: use a modified update rule:

    $$F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x), \qquad 0 < \nu \le 1$$

    where $\nu$ is the learning rate.

  • Subsampling: at each iteration of the algorithm, a base learner is fit on a subsample of the training set drawn at random without replacement. The subsample size $f$ is the fraction of the training set to use; if $f = 1$, the algorithm is deterministic and identical to TreeBoost
  • Limit the number of observations in trees' terminal nodes: during tree building, ignore any split that leads to nodes containing fewer than this number of training set instances. Helps reduce variance in predictions at leaves.
  • Penalize complexity of the tree: model complexity can be defined as proportional to the number of leaves in the learned trees. Joint optimization of loss and model complexity corresponds to a post-pruning algorithm.
  • Regularization (e.g. $\ell_1$, $\ell_2$) on leaf nodes
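The effect of the shrinkage rule above can be seen with the simplest possible base learner, a constant fit to the current residuals; the function and toy data below are our own.

```python
# Shrinkage sketch: each stage's contribution is scaled by a learning
# rate nu, so the model approaches the target gradually (toy example).

def boost_constant(ys, nu, n_rounds):
    """Boost with the simplest base learner: a constant fit to the
    current residuals, applied with the shrunken update
    F_m = F_{m-1} + nu * h_m."""
    pred = 0.0
    for _ in range(n_rounds):
        residual_mean = sum(y - pred for y in ys) / len(ys)
        pred += nu * residual_mean
    return pred

ys = [1.0, 2.0, 3.0]
full = boost_constant(ys, nu=1.0, n_rounds=1)     # jumps straight to the mean
shrunk = boost_constant(ys, nu=0.1, n_rounds=50)  # creeps toward the mean
```

With $\nu = 1$ a single stage reaches the least-squares optimum (the mean); with $\nu = 0.1$ the model closes the gap by only 10% per stage, which is why shrinkage is traded off against a larger number of trees.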