Decision Trees

As we have learnt by now, there is no single best analytics technique. Each technique is best suited to a certain type and size of data and to a given objective, with its own strengths and weaknesses. If one had to pick the **most versatile and flexible technique**, suitable for a wide array of problems, then the **Decision Tree would be a strong contender**.

Decision Tree is a broad, generic name covering several techniques, with **CART, CHAID, C4.5 and QUEST** being the most widely used. The underlying concept is the same for all of them. We will cover the basic concepts first, and then discuss each technique in detail.

Let us understand the concept with the help of the same example on Home Ownership: 20 records, with Income and Household size as the two measurements and Home ownership as the outcome. The sample data and an X-Y plot, with Income on the x-axis and Household size on the y-axis, are shown below. The blue dots are the "Own" class and the orange dots the "Rent" class.

Can we pick a boundary in terms of Income or Household size that separates the two classes (Own and Rent) in the best possible way? In this simple case, a household size of 3 seems to separate the classes very well: most of the records with size > 3 are "Own" and most with size <= 3 are "Rent". This separation boundary (size = 3) partitions the overall data into two homogeneous subgroups, one with mostly the "Own" class and the other with mostly the "Rent" class. Within these two subgroups, Income values of 800 and 2000 (the vertical lines in the two partitions) further partition them into four homogeneous groups. Effectively, these boundaries create smaller subsets, in a hierarchical fashion, which are more and more "pure" than their parent sets. For instance, the lower-left partition bounded by size <= 3 and Income < 2000 is populated entirely by the "Rent" class. This sequence of splits becomes a decision rule for classifying a record.
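The sequence of splits can be read as a nested rule. A minimal sketch is below; the split thresholds are taken from the example, but the direction of each second-level split (which side is "Own") is inferred from the description and is an assumption:

```python
def classify(income, household_size):
    """Walk the example's splits: household size first, then income
    within each branch. Split directions are inferred assumptions."""
    if household_size > 3:
        # Mostly "Own"; the income = 800 line separates a small "Rent" pocket.
        return "Own" if income >= 800 else "Rent"
    # Mostly "Rent"; only high incomes (>= 2000) flip to "Own".
    return "Own" if income >= 2000 else "Rent"

print(classify(1500, 2))  # Rent
print(classify(2500, 2))  # Own
```

Each path from the root condition to a returned label is one decision rule.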

This is the concept followed by Decision Trees in general: a dataset is partitioned into more and more homogeneous subgroups in a hierarchical fashion, and the criteria used to create these subgroups serve as a "Predictive Rule" for assigning records to appropriate classes. Another valuable aspect of this method is that the strong predictors (important X variables) emerge at each level of the partition. Hence "variable importance" is another useful result of these techniques.

Although in this simple case we could pick the decision boundaries visually, on practical data sets this may not be possible, so we need an automated procedure to carry out the partitioning. The criterion for selecting the best "boundary" at each stage is based on the concept of "purity": among the various possibilities, the criterion that improves the overall purity of the data the most at each level is picked.

A data set in which all records belong to the same class is completely pure. Maximum impurity occurs when there are equal numbers of records in each class; for example, a data set with 50% of records from class 1 and 50% from class 0 is completely impure. There are several measures to capture the impurity of a data set; two of the most widely used, the Gini Index and Entropy, are given below.

Gini Index and Entropy attain their maximum values of 0.5 and 1.0 respectively for a binary class problem, which corresponds to the highest impurity. We need to find a splitting criterion that brings the overall impurity down to the lowest possible value at each level.
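These two measures have standard definitions (Gini = 1 − Σ pₖ², Entropy = −Σ pₖ log₂ pₖ over the class proportions pₖ), and a short sketch confirms the maxima quoted above:

```python
import math

def gini(counts):
    """Gini Index: 1 - sum of squared class proportions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy: -sum of p * log2(p) over the class proportions."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# A 50-50 binary node is maximally impure; a one-class node is pure:
print(gini([10, 10]), entropy([10, 10]))  # 0.5 1.0
print(gini([20, 0]))                      # 0.0 (completely pure)
```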

Please see in the example how different criteria perform in terms of improving the overall purity (lowering the Gini Index).

As an example, Income = 1500 splits the data into two subgroups. Records with Income above 1500 number 5, with 4 "Own" and 1 "Rent"; records at or below 1500 number 15, with 4 "Own" and 11 "Rent". These are shown as (4, 1) and (4, 11) in the calculation. The Gini Indices of these subgroups are 32% and 39% respectively, and the overall impurity, a weighted average of the two, comes to 37%. Among all the possible criteria, Household size = 3 improves the purity the most (bringing the overall Gini Index down to 18%), hence this criterion is used at the first level. The same process is then applied to the two resulting subsets (one above size = 3 and one at or below it). For these, Income = 810 and Income = 2000 emerge as the best splitting criteria, so these are used at the second level.
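The arithmetic for the Income = 1500 split can be checked directly; this minimal sketch reproduces the 32%, 39% and 37% figures quoted above:

```python
def gini(counts):
    """Gini Index of a node given its (Own, Rent) counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Income = 1500 splits the 20 records into (Own, Rent) = (4, 1) and (4, 11).
high, low = [4, 1], [4, 11]
# Overall impurity is the size-weighted average of the subgroup impurities.
overall = (sum(high) * gini(high) + sum(low) * gini(low)) / 20
print(round(gini(high), 2), round(gini(low), 2), round(overall, 2))
# 0.32 0.39 0.37
```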

The same process is shown pictorially in the picture on the right-hand side above.

A measure called **Gain**, defined as Impurity (Parent Node) – Weighted Impurity (Child Nodes), is used to judge the best splitting criterion. The impurity can be based on the Gini Index, Entropy or any other such measure. When Entropy is used, the measure is called Information Gain.
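Continuing the running example, the gain of the Income = 1500 split works out as follows (the parent node holds all 20 records, 8 "Own" and 12 "Rent"):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gain(parent, children):
    """Gain = impurity of the parent minus the size-weighted
    impurity of the child nodes (here using the Gini Index)."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * gini(ch) for ch in children)
    return gini(parent) - weighted

# Parent (Own, Rent) = (8, 12); Income = 1500 yields (4, 1) and (4, 11).
print(round(gain([8, 12], [[4, 1], [4, 11]]), 3))  # 0.107
```

The criterion with the largest gain is chosen at each level.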

A criterion that divides the data set into many branches shows an artificially high gain. As an example, Employee ID would divide the data into as many branches as there are records, compared to a coarse variable like Age, and each node split on ID would be pure since it contains only one employee of a particular class; yet such a split is useless for prediction. Two approaches are used to overcome this:

- Always use binary splits (CART)
- Penalize the number of branches (C4.5) by using **Gain Ratio = Information Gain / Split Information**; as the number of branches goes up, Split Information rises and the Gain Ratio comes down, annulling the effect of this artificial purity
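The penalty can be illustrated by computing Split Information (the entropy of the branch-size distribution) for two splits of the same 20 records; this is a minimal sketch of the denominator only, not the full C4.5 algorithm:

```python
import math

def split_info(branch_sizes):
    """Split Information: entropy of the branch-size proportions."""
    n = sum(branch_sizes)
    return -sum((b / n) * math.log2(b / n) for b in branch_sizes if b > 0)

# A binary split of 20 records vs. an ID-like split into 20 branches of 1:
print(round(split_info([5, 15]), 3))   # 0.811
print(round(split_info([1] * 20), 3))  # 4.322 -- a much larger denominator
```

Dividing the Information Gain by the second value shrinks the gain ratio of the ID-like split far more than that of the binary split.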

The outcome could be either a categorical or a continuous variable, and depending on this, different methods are used to select the splitting criterion. The following methods are widely used.

Categorical target variables:

- Gini Index (used in the program CART)
- Entropy (also called information gain)
- **Chi-square** (used in the program CHAID)

Continuous target variables:

- Reduction in variance
- The **F-test**
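For a continuous target, "reduction in variance" plays the same role that gain plays for a categorical target; a minimal sketch with hypothetical numbers:

```python
def variance(values):
    """Population variance of a list of target values."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the children."""
    n = len(parent)
    return variance(parent) - (len(left) / n * variance(left)
                               + len(right) / n * variance(right))

# A split that separates low from high target values reduces variance sharply:
parent = [1, 2, 3, 10, 11, 12]
print(variance_reduction(parent, [1, 2, 3], [10, 11, 12]))  # 20.25
```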

Theoretically it is possible to grow the tree until all the terminal (leaf) nodes are completely homogeneous, or pure. But that amounts to overfitting and may start capturing the noise in the data. One has to stop the tree's growth before it overfits and starts showing high error on validation data. Several criteria can be used to stop the growth.

A minimum number of records in a node, a minimum reduction in the Gini Index or Entropy, or a minimum value of Gain can all be used to stop the tree's growth.
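Two of these stopping rules can be sketched together; the threshold values here are illustrative assumptions, not standard defaults:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def should_stop(parent, children, min_records=5, min_gain=0.01):
    """Stop growing if the node is too small, or if the proposed
    split's reduction in Gini impurity falls below a threshold."""
    n = sum(parent)
    if n < min_records:
        return True
    weighted = sum(sum(ch) / n * gini(ch) for ch in children)
    return gini(parent) - weighted < min_gain

# The Income = 1500 split on the example data clears both hurdles:
print(should_stop([8, 12], [[4, 1], [4, 11]]))  # False
```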

There are two approaches to avoiding overfitting: either the tree is **prevented from growing** beyond the point where it starts overfitting the data, or a fully grown tree is **pruned back** until it minimizes the validation error. Both techniques are equally popular and form the foundation of the Decision Tree algorithms in commercial programs.

The stopping approach uses the Chi-square test for independence to assess whether a split improves the purity by a statistically significant amount. The best splitting criterion is the one that minimizes the p-value of the Chi-square test (which indicates the strength of the relation between the X and Y variables). The algorithm using this approach is therefore called CHAID (Chi-squared Automatic Interaction Detection).
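The statistic behind the test can be sketched for the running example (this shows only the Chi-square statistic for one 2x2 split table, not the full CHAID procedure):

```python
def chi_square(table):
    """Chi-square statistic for a 2x2 contingency table
    (rows = split branches, columns = classes)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n  # count expected under independence
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# The Income = 1500 split: branches (4, 1) and (4, 11) over (Own, Rent).
print(round(chi_square([[4, 1], [4, 11]]), 2))  # 4.44
```

A larger statistic (smaller p-value) indicates a stronger relation between the split variable and the class.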

An alternative to stopping the tree is pruning the fully grown tree; Classification & Regression Trees (CART) and C4.5/C5.0 use this approach. CART uses validation data to prune back the full tree grown on training data, whereas C4.5 uses the training data itself for pruning. CART is the basis of SAS Enterprise Miner, and C4.5 is used in SPSS.

**Strengths of Decision Trees**

- Interpretation is usually straightforward and easy to demonstrate and explain
- There are few underlying assumptions that must be met
- Interactions among predictor variables can be identified
- Outliers and missing values can be handled without problems
- Non-linear relationships are handled without problems
- Predictor variable selection is automatic (no prior variable screening technique is needed)
- Binary, categorical, ordinal, and interval target and predictor variables can be used

**Weaknesses of Decision Trees**

- Slight changes in the data set can produce dramatically different results
- Large datasets are needed
- Careful validation is required to avoid overfitting
- The data snooping process can be misleading
- Considerable judgement and experimentation may be needed to develop a suitable result

Decision Trees work well when boundaries of the form X1 = a or X2 = b split the dataset horizontally or vertically. However, there can be cases where a combination of predictors, such as aX1 + bX2, separates the classes better. In such cases Discriminant Analysis or a Regression model would be a better classifier.

A Decision Tree usually requires a huge data set in order to produce a good classifier. To overcome this limitation, CART has been enhanced into the **Random Forest** model, where multiple trees are generated from the same data and their results are combined to enhance the prediction.

Breiman (2001) proposed Random Forests, which generate multiple trees from random samples drawn from the data. An additional layer of randomness is added on top of the random sampling of records, by taking a random sample of the predictors as well.

In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counter-intuitive strategy turns out to perform very well compared to many other classifiers, including Discriminant analysis, Support Vector Machines and Neural Networks, and is robust against overfitting (Breiman, 2001). In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values.
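The two sources of randomness described above can be sketched in isolation; this is an illustrative fragment of the idea (the record indices and feature counts are hypothetical), not a full forest implementation:

```python
import random

def bootstrap_sample(data, rng):
    """Draw n records with replacement: the random sample
    from which one tree of the forest is grown."""
    return [rng.choice(data) for _ in data]

def random_feature_subset(n_features, m, rng):
    """At each node, only m randomly chosen features (of n_features)
    are considered when searching for the best split."""
    return rng.sample(range(n_features), m)

rng = random.Random(0)
records = list(range(10))           # stand-in record indices
print(bootstrap_sample(records, rng))       # drawn with replacement
print(random_feature_subset(8, 3, rng))     # 3 of 8 candidate features
```

The number of features per node (m) and the number of trees are exactly the two parameters the paragraph above refers to.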

The autonomous nature of Decision Trees (little user intervention is needed) makes them a very valuable technique. For this reason, they are quite frequently used as a complementary technique for variable screening: the variables that emerge as part of the decision rules are considered important and are then used with techniques like KNN, Logistic Regression etc. Decision Trees can also be used for category merging, when a categorical predictor has too many levels, without losing predictive power.

The other very powerful feature of these techniques is their ability to detect interactions among variables. Please see the example on the "**Interesting Stuff**" link on the Home page for a better appreciation. With all this flexibility and autonomy comes susceptibility to overfitting and high variance error. As pointed out in the weaknesses, careful use of validation data becomes extremely important for these techniques.

There is often a dilemma over whether to use a Decision Tree or Logistic Regression. Both can be used in similar cases, with their own strengths and weaknesses. Regression, as mentioned earlier, can be used on a relatively smaller data set and is more stable. A Decision Tree, on the other hand, is far more flexible, but usually works well only with larger data sets and is very prone to overfitting. It is a good idea to try both and use the insights from each to enhance the other; a Decision Tree can serve as a complementary technique for Logistic Regression.

**Please watch the YouTube video on this (from the link at the top)**, and observe how Decision Tree and Logistic Regression go hand-in-hand in developing a better predictive model.