# Gradient Boosting

## Gradient boosting

### Overview

**Gradient boosting** is a technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and generalizes them by allowing optimization of an arbitrary differentiable loss function.

### Motivation

Consider a least-squares regression setting, where the goal is to "teach" a model $F$ to predict values of the form $\hat{y} = F(x)$ by minimizing the mean squared error $\tfrac{1}{n}\sum_i (\hat{y}_i - y_i)^2$, averaged over some training set of actual values $y_i$ of the output variable $y$.

At each stage $m$, $1 \le m \le M$, of gradient boosting, it may be assumed that there is some imperfect model $F_m$. The gradient boosting algorithm improves on $F_m$ by constructing a new model that adds an estimator $h$ to provide a better model: $F_{m+1}(x) = F_m(x) + h(x)$.

To find $h$, the gradient boosting solution starts with the observation that a perfect $h$ would imply

$$F_{m+1}(x) = F_m(x) + h(x) = y,$$

i.e. that adding the extra predictor makes up for all the mistakes of the previous model, or equivalently

$$h(x) = y - F_m(x).$$

Therefore, gradient boosting will fit $h$ to the *residual* $y - F_m(x)$. Like in other boosting variants, each $F_{m+1}$ learns to correct its predecessor $F_m$.

The **generalization** to loss functions other than least squares comes from the observation that the residuals $y - F_m(x)$ for a given model are the negative gradients (w.r.t. $F(x)$) of the squared error loss function $\tfrac{1}{2}\bigl(y - F(x)\bigr)^2$; hence, for any differentiable loss function $L$ we'll simply fit $h$ to the negative gradient $-\frac{\partial L(y, F(x))}{\partial F(x)}$.
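
Writing this out for the squared-error case makes the link explicit: differentiating the (half-scaled) squared-error loss with respect to the prediction and negating the result recovers exactly the residual fitted above.

```latex
% Negative gradient of the squared-error loss L(y, F(x)) = (1/2)(y - F(x))^2
% with respect to the current prediction F(x):
-\frac{\partial L}{\partial F(x)}
  = -\frac{\partial}{\partial F(x)}\left[\tfrac{1}{2}\bigl(y - F(x)\bigr)^{2}\right]
  = y - F(x)
```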

### Algorithm

**Input:** training set $\{(x_i, y_i)\}_{i=1}^n$, a differentiable loss function $L(y, F(x))$, number of iterations $M$.

Initialize model with a constant value:

$$F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma).$$

For $m = 1$ to $M$:

- Compute so-called *pseudo-residuals*:

  $$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} \qquad \text{for } i = 1, \ldots, n.$$

- Fit a base learner (e.g. a tree) $h_m(x)$ to the pseudo-residuals, i.e. train it using the training set $\{(x_i, r_{im})\}_{i=1}^n$.
- Compute multiplier $\gamma_m$ by solving the following one-dimensional optimization problem:

  $$\gamma_m = \arg\min_\gamma \sum_{i=1}^n L\bigl(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\bigr).$$

- Update the model:

  $$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x).$$

- Output $F_M(x)$.
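
The listing below is a minimal from-scratch sketch of this algorithm for the squared-error loss, using scikit-learn's `DecisionTreeRegressor` as the base learner; the function names and default parameters are illustrative, not taken from any particular library's API.

```python
# Minimal gradient boosting sketch for squared-error loss L(y, F) = (1/2)(y - F)^2,
# so the pseudo-residuals r_im are simply y_i - F_{m-1}(x_i).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_iterations=100, max_depth=2):
    # Initialize with the constant that minimizes squared error: the mean of y.
    f0 = np.mean(y)
    current = np.full(len(y), f0)
    trees, gammas = [], []

    for _ in range(n_iterations):
        # Pseudo-residuals = negative gradient of the squared-error loss.
        residuals = y - current
        # Fit a small regression tree h_m to the pseudo-residuals.
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        h = tree.predict(X)
        # One-dimensional line search for gamma_m; for squared error the
        # optimum is available in closed form as <r, h> / <h, h>.
        gamma = residuals @ h / (h @ h + 1e-12)
        # Update the model: F_m = F_{m-1} + gamma_m * h_m.
        current = current + gamma * h
        trees.append(tree)
        gammas.append(gamma)
    return f0, trees, gammas

def predict(X, f0, trees, gammas):
    # Output F_M(x): the initial constant plus all weighted tree contributions.
    pred = np.full(X.shape[0], f0)
    for tree, gamma in zip(trees, gammas):
        pred += gamma * tree.predict(X)
    return pred
```

For losses without a closed-form line search (e.g. absolute error or logistic loss), the `gamma` step would instead be a scalar minimization over the chosen loss.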

## Gradient *Tree* Boosting

- Uses decision trees of fixed size as base learners

### Modified gradient boosting method ("TreeBoost")

- Generic gradient boosting at the m-th step would fit a decision tree $h_m(x)$ to the pseudo-residuals
- Let $J_m$ be the *number of its leaves*. The tree partitions the input space into $J_m$ disjoint regions $R_{1m}, \ldots, R_{J_m m}$ and predicts a constant value in each region, i.e.

  $$h_m(x) = \sum_{j=1}^{J_m} b_{jm} \, \mathbf{1}_{R_{jm}}(x),$$

  where $b_{jm}$ is the value predicted in the region $R_{jm}$.

In the case of usual CART trees, the trees are fitted using least-squares loss, so the coefficient $b_{jm}$ for the region $R_{jm}$ is equal to just the *value of the output variable, averaged over all training instances in $R_{jm}$*.

Continuing with generic gradient boosting, the model update is:

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \qquad \gamma_m = \arg\min_\gamma \sum_{i=1}^n L\bigl(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\bigr).$$

The **modified gradient tree boosting ("TreeBoost")** algorithm instead chooses a *separate optimal value $\gamma_{jm}$ for each of the tree's regions*, rather than a single $\gamma_m$ for the whole tree:

$$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} \, \mathbf{1}_{R_{jm}}(x), \qquad \gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L\bigl(y_i, F_{m-1}(x_i) + \gamma\bigr),$$

where we've simply absorbed $b_{jm}$ into $\gamma_{jm}$.
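
As a concrete illustration, here is a sketch of a single TreeBoost step for the absolute-error loss $L(y, F) = |y - F|$, where the separate per-region optimum happens to be the median residual in each leaf; the helper `treeboost_step` and its parameters are hypothetical names for this sketch, using scikit-learn's `DecisionTreeRegressor` to define the regions.

```python
# One TreeBoost step for absolute-error loss L(y, F) = |y - F|.
# Pseudo-residuals are sign(y - F); the per-leaf value gamma_jm minimizing
# sum_{x_i in R_jm} |y_i - (F_{m-1}(x_i) + gamma)| is the *median* of the
# ordinary residuals y_i - F_{m-1}(x_i) within that leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def treeboost_step(X, y, current_pred, max_depth=3):
    residuals = y - current_pred
    pseudo_residuals = np.sign(residuals)      # negative gradient of |y - F|

    # Fit the tree to the pseudo-residuals; its leaves define the regions R_jm.
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, pseudo_residuals)
    leaf_ids = tree.apply(X)                   # which region each x_i falls into

    # Separate optimal value per region: the median residual in that leaf.
    gamma = {leaf: np.median(residuals[leaf_ids == leaf])
             for leaf in np.unique(leaf_ids)}

    # Updated training-set predictions F_m(x_i) = F_{m-1}(x_i) + gamma_{j(i),m}.
    new_pred = current_pred + np.array([gamma[leaf] for leaf in leaf_ids])
    return tree, gamma, new_pred
```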

### Terminology

**Out-of-bag error / estimate** is a method for *measuring prediction error* of methods utilizing *bootstrap aggregating (bagging)* to sub-sample data used for training. **OOB** is the mean prediction error on each training sample $x_i$, using *only the trees that did not have $x_i$ in their bootstrap sample*.
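
A minimal sketch of this computation for a bagged ensemble of regression trees is shown below; the loop structure and names such as `oob_error` are illustrative and not tied to any library's API.

```python
# Out-of-bag (OOB) error estimate for a bagged ensemble of regression trees:
# each sample is scored only by trees whose bootstrap sample did not contain it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def oob_error(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    oob_sum = np.zeros(n)     # sum of predictions from trees that did NOT see sample i
    oob_count = np.zeros(n)   # number of such trees per sample

    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)           # bootstrap sample (with replacement)
        oob_mask = ~np.isin(np.arange(n), idx)     # samples left out of this bootstrap
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        if oob_mask.any():
            oob_sum[oob_mask] += tree.predict(X[oob_mask])
            oob_count[oob_mask] += 1

    seen = oob_count > 0
    # Mean squared error of each sample's averaged OOB prediction.
    return np.mean((y[seen] - oob_sum[seen] / oob_count[seen]) ** 2)
```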

### Regularization

- **Number of trees**: a large number of trees can overfit.
- **Shrinkage**: use a modified update rule

  $$F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x), \qquad 0 < \nu \le 1,$$

  where $\nu$ is the *learning rate*.
- **Subsampling**: at each iteration of the algorithm, a base learner is fit on a *subsample of the training set* drawn at random without replacement. The subsample size $f$ is the fraction of the training set to use; if $f = 1$ the algorithm is deterministic and identical to TreeBoost.
- **Limit the number of observations in trees' terminal nodes**: in the tree-building process, ignore any split that leads to nodes containing fewer than this number of training set instances. Helps reduce variance in predictions at leaves.
- **Penalize complexity of the tree**: model complexity can be defined as the *proportional number of leaves* in the learned trees. Joint optimization of loss and model complexity corresponds to a *post-pruning algorithm*.
- **Regularization (e.g. $\ell_1$, $\ell_2$) on leaf nodes**
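
As a rough illustration of how these knobs appear in practice, the sketch below maps them onto scikit-learn's `GradientBoostingRegressor`; the specific parameter values are arbitrary examples, not recommendations.

```python
# Regularization knobs for gradient tree boosting, expressed via scikit-learn.
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=500,       # number of trees M (many trees can overfit without shrinkage)
    learning_rate=0.05,     # shrinkage factor (the learning rate nu above)
    subsample=0.8,          # stochastic gradient boosting: fraction f of rows per tree
    min_samples_leaf=10,    # lower limit on observations in terminal nodes
    max_leaf_nodes=8,       # caps the number of leaves, i.e. tree complexity
)
# model.fit(X_train, y_train) on your data; with subsample < 1, the rows held out
# at each iteration also provide an out-of-bag estimate via model.oob_improvement_.
```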