# CART: Classification & Regression Trees

## Metrics

### Gini impurity

- Measure of how often a randomly chosen element from the *set* would be incorrectly labeled if it were labeled randomly according to the distribution of the labels in the *subset*.

#### Procedure

1. Draw a random sample from the set (which in reality has some label, say $i$)
2. Classify it according to the distribution of the labels in the set
3. Compute the probability of misclassification, i.e. the proportion of elements *not* labeled $i$
4. Compute the Gini coefficient $I_G = \sum_{i=1}^{J} p_i (1 - p_i) = 1 - \sum_{i=1}^{J} p_i^2$, where $p_i$ is the proportion of label $i$ in the set
5. Perform steps 1-4 for the child nodes (nodes after making some split)
6. Compare the weighted sum of the coefficients for the child nodes with that of the parent node
7. Perform steps 1-6 for different features / independent variables / attributes to find the one with the best (i.e. smallest) coefficient and go ham (i.e. make a split; see the sketch below).
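
A minimal sketch of steps 4-7 in plain Python (the helper names `gini` and `weighted_child_gini` are my own, not from any particular library):

```python
from collections import Counter

def gini(labels):
    """Gini impurity I_G = 1 - sum_i p_i^2, with p_i the proportion of label i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_child_gini(left, right):
    """Weighted sum of the child impurities (steps 5-6)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = ["yes"] * 5 + ["no"] * 5
print(gini(parent))                                   # 0.5, maximally impure
# A candidate split that separates the classes fairly well:
print(weighted_child_gini(["yes"] * 4 + ["no"] * 1,
                          ["yes"] * 1 + ["no"] * 4))  # 0.32 < 0.5, so split here
```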

### Information gain

- In *information theory and machine learning*: **information gain** is a synonym for the Kullback-Leibler divergence
- In the context of *decision trees*: sometimes used synonymously with **mutual information**, which is the expected value of the KL divergence of the *univariate probability distribution* of one variable from the conditional distribution of this variable given the other one

#### Notation

- $T$ is the set of training examples
- $(\mathbf{x}, y)$ is a training example
- $a$ is an attribute
- $\mathrm{vals}(a)$ is the set of all possible values $a$ can take

#### Definition

$$\mathrm{IG}(T, a) = H(T) - H(T \mid a)$$

or

$$\mathrm{IG}(T, a) = H(T) - \sum_{v \in \mathrm{vals}(a)} \frac{\left|\{\mathbf{x} \in T \mid x_a = v\}\right|}{|T|} \cdot H\left(\{\mathbf{x} \in T \mid x_a = v\}\right)$$

or **best**:

$$\mathrm{IG}(T, a) = H(T) - \sum_{v \in \mathrm{vals}(a)} \frac{|T_v|}{|T|} \cdot H(T_v), \quad \text{where } T_v = \{\mathbf{x} \in T \mid x_a = v\}$$

which is just saying that Information Gain is the difference between the *entropy*
of the entire training set and the weighted (by the proportion of samples that fall
into each child node) sum of the entropies of the child nodes.

Remember, Information Gain only makes sense when it comes to *classification*, not regression.

- Difference between the entropy of a node and the sum of the entropies of the child nodes, *given* that we decide to split on attribute $a$
- We take the difference between the uncertainty before and after the split, which gives us a measure of how much more certain we are after performing said split
- How much uncertainty we remove by making the split on this independent variable (a code sketch follows below)
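
A small self-contained sketch of the definition, assuming class labels stored in plain Python lists (the helpers `entropy` and `information_gain` are my own naming):

```python
from math import log2

def entropy(labels):
    """Shannon entropy H(T) = -sum_i p_i * log2(p_i) over the label proportions."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    """IG(T, a) = H(T) - sum_v |T_v|/|T| * H(T_v); children is [T_v for v in vals(a)]."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["+"] * 8 + ["-"] * 8                 # H(parent) = 1.0 bit
children = [["+"] * 7 + ["-"] * 1,             # a fairly pure split...
            ["+"] * 1 + ["-"] * 7]
print(information_gain(parent, children))      # ~0.456 bits of uncertainty removed
```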

#### Why the entropy?

- $H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$, where $p_i$ is the probability of class $i$
- Entropy is simply a function chosen to satisfy some criteria that one would intuitively *want* from an uncertainty measure (a numeric check follows below):
  - $H$ is maximal when $p_1 = p_2 = \dots = p_n = \frac{1}{n}$, i.e. we are most uncertain when every possibility is equally likely
  - $H(X, Y) \le H(X) + H(Y)$, with equality iff $X$ and $Y$ are independent
  - $H(X \mid Y) \le H(X)$, i.e. uncertainty regarding $X$ is never increased by knowing $Y$
  - Any change towards equalization of the probabilities $p_1, \dots, p_n$ increases $H$. Greater uncertainty ⇒ greater entropy.
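
A quick numeric check of the first and last criteria, with entropy written over an explicit probability vector (the helper `H` is my own naming):

```python
from math import log2

def H(ps):
    """Entropy of a discrete distribution given as probabilities summing to 1."""
    return -sum(p * log2(p) for p in ps if p > 0)

print(H([0.5, 0.5]))  # 1.0  -> maximal for two outcomes
print(H([0.9, 0.1]))  # ~0.47
print(H([0.6, 0.4]))  # ~0.97 -> equalizing 0.9/0.1 towards 0.5/0.5 raises H
print(H([1.0, 0.0]))  # 0.0  -> a certain outcome carries no uncertainty
```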

#### Example

Consider an example data-set of 14 samples (the classic "play tennis" weather data), with attributes `outlook`, `temperature`, `humidity`, and `windy`.

And our (binary) dependent variable / output is `play`.

Let's assume we make the first split on `windy` (in reality we would consider splitting on each of the independent variables and then use the one with the best metric), where the resulting tree has two child nodes: one for `windy == true` and one for `windy == false`.
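
A worked computation for this split; the label counts below are my assumption, taken from the classic 14-sample weather data (8 samples with `windy == false`, 6 with `windy == true`):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

windy_false = ["yes"] * 6 + ["no"] * 2   # 8 samples where windy == false
windy_true  = ["yes"] * 3 + ["no"] * 3   # 6 samples where windy == true
parent = windy_false + windy_true        # all 14 samples: 9 yes / 5 no

gain = entropy(parent) - (8 / 14) * entropy(windy_false) \
                       - (6 / 14) * entropy(windy_true)
print(round(gain, 3))  # ~0.048, so `windy` alone is a rather weak first split
```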

### Variance reduction

- Often used when the `target` is continuous (regression tree)
- The variance reduction of a node $N$ is the *total reduction of the variance of the target variable due to the split at this node*, the *target variable* being the value $y_i$ of the sample
- Look for the attribute / independent variable whose split "explains" the most variance of our *target variable*, i.e. reduces the variance the most

#### Definition

$$
I_V(N) = \frac{1}{|S|^2} \sum_{i \in S} \sum_{j \in S} \frac{1}{2} (y_i - y_j)^2 - \left( \frac{|S_t|}{|S|} \cdot \frac{1}{|S_t|^2} \sum_{i \in S_t} \sum_{j \in S_t} \frac{1}{2} (y_i - y_j)^2 + \frac{|S_f|}{|S|} \cdot \frac{1}{|S_f|^2} \sum_{i \in S_f} \sum_{j \in S_f} \frac{1}{2} (y_i - y_j)^2 \right)
$$

where

- $S$ is the set of *presplit sample indices*
- $S_t$ is the set of sample indices for which the split test is `true`
- $S_f$ is the set of sample indices for which the split test is `false`

Each of the above summands is indeed a variance estimate, written in a form that does not directly refer to the mean.
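
A sketch of the formula with NumPy, assuming a boolean split mask over an array of target values (`pairwise_variance` and `variance_reduction` are my own names); it also checks that the pairwise form equals the plain (biased) variance estimator:

```python
import numpy as np

def pairwise_variance(y):
    """Variance without the mean: (1/|S|^2) * sum_ij 0.5 * (y_i - y_j)^2."""
    y = np.asarray(y, dtype=float)
    diffs = y[:, None] - y[None, :]
    return 0.5 * np.mean(diffs ** 2)  # equals np.var(y), the biased estimator

def variance_reduction(y, mask):
    """I_V(N): parent variance minus the weighted child variances."""
    y = np.asarray(y, dtype=float)
    y_t, y_f = y[mask], y[~mask]
    n = len(y)
    return pairwise_variance(y) - (len(y_t) / n * pairwise_variance(y_t)
                                   + len(y_f) / n * pairwise_variance(y_f))

y = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])
split = np.array([True, True, True, False, False, False])   # a clean split
print(np.isclose(pairwise_variance(y), np.var(y)))  # True: same estimator
print(variance_reduction(y, split))                 # ~4.0 of ~4.03 explained
```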