CART: Classification & Regression Trees

Table of Contents

Metrics

Gini impurity

  • Measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of the labels in the set.

$$I_G(p) = \sum_{i=1}^{J} p_i (1 - p_i) = 1 - \sum_{i=1}^{J} p_i^2$$

Procedure

  1. Draw a random sample $x$ from the set (which in reality has label, say, $i$)
  2. Classify $x$ according to the distribution of the labels
  3. Compute the probability of misclassification, i.e. the proportion of elements not labeled $i$
  4. Compute the Gini coefficient $I_G = \sum_i p_i (1 - p_i)$, where $p_i$ is the proportion of label $i$ in the set
  5. Perform steps 1-4 for the child nodes (the nodes after making some split)
  6. Compare the (weighted) sum of the coefficients for the child nodes with the coefficient of the parent node
  7. Perform steps 1-6 for the different features / independent variables / attributes to find the one with the best (i.e. smallest) coefficient and go ham (i.e. make a split); a minimal sketch of this procedure follows below.
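As a sketch of steps 4-7, assuming categorical features (the feature names and toy labels below are made up purely for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: sum_i p_i * (1 - p_i) over the label proportions."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def weighted_child_gini(labels, feature_values):
    """Weighted (by child size) sum of the Gini impurities of the child
    nodes obtained by splitting on a categorical feature."""
    n = len(labels)
    total = 0.0
    for v in set(feature_values):
        child = [y for y, x in zip(labels, feature_values) if x == v]
        total += (len(child) / n) * gini(child)
    return total

# Made-up toy data: pick the feature whose split yields the smallest
# weighted child impurity.
labels = ["yes", "yes", "no", "no", "yes", "no"]
features = {
    "windy": [True, False, True, True, False, False],
    "humid": [True, True, True, True, False, False],
}
best = min(features, key=lambda f: weighted_child_gini(labels, features[f]))
print(best)  # "windy" has the lower weighted impurity here
```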

Information gain

  • In information theory and machine learning: information gain is a synonym for the Kullback-Leibler divergence
  • In the context of decision trees: sometimes used synonymously with mutual information, which is the expected value of the KL divergence of the univariate probability distribution of one variable from the conditional distribution of that variable given the other.

Notation

  • $T$ is the set of training examples
  • $(\mathbf{x}, y) = (x_1, x_2, \dots, x_k, y)$ is a training example
  • $x_a$ is the value of the $a$-th attribute of $\mathbf{x}$
  • $\mathrm{vals}(a)$ is the set of all possible values attribute $a$ can take

Definition

$$IG(T, a) = H(T) - H(T \mid a)$$

or

$$IG(T, a) = H(T) - \sum_{v \in \mathrm{vals}(a)} P(x_a = v)\, H(T \mid x_a = v)$$

or best:

$$IG(T, a) = H(T) - \sum_{v \in \mathrm{vals}(a)} \frac{\left|\{\mathbf{x} \in T \mid x_a = v\}\right|}{|T|} \; H\!\left(\{\mathbf{x} \in T \mid x_a = v\}\right)$$

which is just saying that Information Gain is the difference between the entropy of the entire training set and the weighted (by the proportion of samples falling into each child node) sum of the entropies of the child nodes.

Remember, Information Gain only makes sense when it comes to classification, not regression.

  • Difference between the entropy of a node $T$ and the (weighted) sum of the entropies of the children given that we decide to split on attribute $a$
  • We take the difference between the uncertainty before and after the split, which gives us a measure of how much more certain we are after performing said split.
  • How much uncertainty we remove by making the split on this independent variable (see the sketch after this list)
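A minimal sketch of the weighted-entropy computation above, again assuming categorical attributes (the helper names are my own, not from the source):

```python
import math
from collections import Counter

def entropy(labels):
    """H(T) = -sum_i p_i * log2(p_i), with p_i the label proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """IG(T, a) = H(T) - sum_v |T_v|/|T| * H(T_v) for a categorical feature,
    where T_v is the subset of samples with feature value v."""
    n = len(labels)
    child_term = 0.0
    for v in set(feature_values):
        child = [y for y, x in zip(labels, feature_values) if x == v]
        child_term += (len(child) / n) * entropy(child)
    return entropy(labels) - child_term
```

Splitting then amounts to picking the attribute with the largest information gain, mirroring the choice of the smallest weighted Gini impurity above.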

Why the entropy?

$$H(T) = -\sum_{i=1}^{J} p_i \log_2 p_i$$

  • $p_i$ is the probability of class $i$
  • Entropy is simply a function chosen to satisfy some criteria that one would intuitively want from an uncertainty measure (a quick numerical check follows this list):
    1. $H = 0$ if and only if all the $p_i$ but one are zero, i.e. there is no uncertainty about the outcome
    2. For a given $n$, $H(p_1, \dots, p_n)$ is maximum when $p_i = \frac{1}{n}$ for all $i$, i.e. we are most uncertain when every possibility is equally likely.
    3. $H(X, Y) \le H(X) + H(Y)$, with equality iff $X$ and $Y$ are independent
    4. $H(Y \mid X) \le H(Y)$, i.e. uncertainty regarding $Y$ is never increased by knowing $X$
    5. Any change towards equalization of the probabilities increases $H$. Greater uncertainty ⇒ greater entropy.
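A quick numerical check of properties 1, 2 and 5 (a self-contained snippet; `H` here takes a probability vector directly):

```python
import math

def H(ps):
    """Shannon entropy (base 2) of a probability vector."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(H([1.0]))          # 0.0    -> certainty, property 1
print(H([0.5, 0.5]))     # 1.0    -> maximum for two classes, property 2
print(H([0.75, 0.25]))   # ~0.811 -> less uncertain than the uniform case, property 5
```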

Example

Consider an example data-set of 14 samples, with attributes:

  • outlook
  • temperature
  • humidity
  • windy

And our (binary) dependent variable / output is play (yes / no).

Let's assume we make the first split on windy (in reality we would consider splitting on each of the independent variables and then use the one with the best metric), where the resulting tree has two child nodes (windy = true and windy = false).

The worked example continues from here.
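The original table is not reproduced here; assuming the label counts of the classic 14-sample play-tennis data (9 yes / 5 no overall; windy = true: 3 yes / 3 no; windy = false: 6 yes / 2 no), the information gain of the windy split works out roughly as follows:

$$
\begin{aligned}
H(T) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
H(T_{\text{windy}=\text{true}}) &= -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1.000 \\
H(T_{\text{windy}=\text{false}}) &= -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.811 \\
IG(T, \text{windy}) &\approx 0.940 - \tfrac{6}{14}\cdot 1.000 - \tfrac{8}{14}\cdot 0.811 \approx 0.048
\end{aligned}
$$

In other words, under these assumed counts the windy split removes only a small fraction of the initial uncertainty.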

Variance reduction

  • Often used when the target is continuous (regression tree)
  • Variance reduction of a node $N$ is the total reduction of the variance of the target variable $Y$ due to the split at this node
    • the target variable $Y$ being the value of the sample
  • We look for the attribute / independent variable whose split "explains" the most variance of our target variable $Y$, i.e. reduces the variance the most

Definition

$$I_V(N) = \frac{1}{|S|^2} \sum_{i \in S} \sum_{j \in S} \frac{1}{2}(y_i - y_j)^2 - \left( \frac{|S_t|}{|S|} \cdot \frac{1}{|S_t|^2} \sum_{i \in S_t} \sum_{j \in S_t} \frac{1}{2}(y_i - y_j)^2 + \frac{|S_f|}{|S|} \cdot \frac{1}{|S_f|^2} \sum_{i \in S_f} \sum_{j \in S_f} \frac{1}{2}(y_i - y_j)^2 \right)$$

where

  • $S$ - the set of presplit sample indices
  • $S_t$ - the set of sample indices for which the split test is true
  • $S_f$ - the set of sample indices for which the split test is false

Each of the above summands is indeed a variance estimate, just written in a form that does not refer directly to the mean: for any index set $S$, $\frac{1}{|S|^2}\sum_{i \in S}\sum_{j \in S}\frac{1}{2}(y_i - y_j)^2 = \frac{1}{|S|}\sum_{i \in S}(y_i - \bar{y})^2$.
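A minimal sketch of the same quantity using the ordinary mean-based variance (equivalent to the pairwise form above); the target values and split mask are made up for illustration:

```python
def variance(ys):
    """Biased (1/n) variance estimate, equal to the pairwise form above."""
    n = len(ys)
    mean = sum(ys) / n
    return sum((y - mean) ** 2 for y in ys) / n

def variance_reduction(ys, split_mask):
    """Var(parent) minus the size-weighted variances of the two children;
    split_mask[i] is True if sample i passes the split test."""
    n = len(ys)
    true_side = [y for y, passes in zip(ys, split_mask) if passes]
    false_side = [y for y, passes in zip(ys, split_mask) if not passes]
    return (variance(ys)
            - len(true_side) / n * variance(true_side)
            - len(false_side) / n * variance(false_side))

# Made-up target values: the split separates low from high values, so it
# "explains" almost all of the parent node's variance.
ys = [1.0, 1.2, 0.8, 5.0, 5.5, 4.5]
mask = [True, True, True, False, False, False]
print(variance_reduction(ys, mask))  # ~4.0 out of a parent variance of ~4.1
```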