12.1 Decision Trees



Decision trees are defined by recursively partitioning the input space and fitting a local model in each resulting region. Consequently, the overall model can be represented by a tree.

Although decision trees can also be used for regression, here we study only classification trees.

Tip Trees

Trees are special graphs with a designated root node and no cycles.

Decision Trees are Everywhere!

It’s quite common in real life to use nested decision rules or a decision tree for making day-to-day choices.

For example, a GW data science student like yourself may choose a DATS course based on the following decision tree.

Figure 31.1: Choosing a DATS course
Tip Internal vs External Nodes

Exercise 31.1 How many internal and external nodes are there in the tree shown in Figure 31.1?

In the diagram above, we use four binary features: \{x_1\ ({\tt required}), x_2\ ({\tt exam}), x_3\ ({\tt takehome}), x_4\ ({\tt curve})\} and two class labels: \{{\color{green}\tt Y}, {\color{red}\tt N}\}.

In practice, the features are continuous (e.g., average final score, number of students, etc.). In that case, it is most natural to use a threshold value for each feature when creating the nested decision rules.

Tip More Decision Trees

Exercise 31.2 Give examples of real-life scenarios where nested decision rules are used.

Mathematical Formulation

Let us consider a classification problem with features \bold{x}=(x_1, x_2, \ldots, x_d) and class labels \{{\tt 1}, {\tt 2}, \ldots, {\tt C}\}.

The idea behind a decision tree is to divide the feature space \mathcal{X} into non-overlapping regions \mathcal{T}=\{\mathcal{R}_i\}, where each region \mathcal{R}_i is given by

  1. a sequence of nested conditions, each defined by a threshold t\in\mathbb{R} (e.g., x_j\leq t), and
  2. a class label \tt c_j\in \{{\tt 1}, {\tt 2}, \ldots, {\tt C}\}.

The regions are commonly taken to be rectangular (axis-parallel), and the entire decision boundary looks like a colored maze.

For example, below is a decision tree for a two-dimensional feature vector \bold{x}=(x_1, x_2) and three class labels \{{\color{orange}\tt 0}, {\color{green}\tt 1}, {\color{royalblue}\tt 2}\}; see Figure 31.2.

Figure 31.2: The decision boundary of an example decision tree

The governing decision tree is shown below in Figure 31.3.

Figure 31.3: An example decision tree for three-class problem
Tip Prediction using Decision Tree

Exercise 31.3 Using the decision tree (Figure 31.3), predict the label of the point (7, 1.5) in the feature space.

Prediction

Given such a decision tree with colored regions, a new data point \bold{x} is classified according to the color (label) of the unique region \mathcal{R}_i that contains \bold{x}.
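To make the prediction rule concrete, here is a minimal sketch in Python, assuming a simple dictionary-based node representation and a small hypothetical tree (not the tree of Figure 31.3): each internal node stores a feature index and a threshold, and each leaf stores the label of its region.

```python
def predict(node, x):
    """Route a point x down the tree until it reaches a leaf (a region R_i)."""
    while "label" not in node:          # internal nodes have no "label" key
        j, t = node["feature"], node["threshold"]
        node = node["left"] if x[j] <= t else node["right"]
    return node["label"]

# A hypothetical two-level tree over x = (x_1, x_2) with labels {0, 1, 2}.
tree = {
    "feature": 0, "threshold": 5.0,
    "left": {"label": 0},
    "right": {
        "feature": 1, "threshold": 2.0,
        "left": {"label": 1},
        "right": {"label": 2},
    },
}

print(predict(tree, (7, 1.5)))          # x_1 > 5.0, then x_2 <= 2.0, so label 1
```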

Training

To obtain such a decision tree, we use training data \mathcal{D} to choose a sequence of features and thresholds that minimizes some notion of loss. The two commonly used loss functions are the Gini index (Equation 31.2) and entropy (Equation 31.3). See the section The Training Algorithm (Optional) below for more details.

Tip Parametric vs Non-parametric

Exercise 31.4 Is the decision tree model parametric or non-parametric?

Classifying Iris

Before we describe the loss functions and the training algorithm in the next section, let us first work through an easy example.

Recall that the classic Iris dataset has four features and three class labels: \color{orange}\tt setosa, \color{green}\tt versicolor, and \color{royalblue}\tt virginica. To keep things simple, let us assume that we have only two (continuous) features of the Iris flower: petalLength and petalWidth.

Try to pick thresholds t_1 and t_2 that visually classify the training labels with as little misclassification as possible.
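As a sketch of how this is done in practice (assuming scikit-learn and its bundled copy of the Iris data), a shallow tree fit on these two features recovers such thresholds automatically:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:4]                 # petal length, petal width
y = iris.target                       # setosa, versicolor, virginica

clf = DecisionTreeClassifier(max_depth=2, criterion="gini").fit(X, y)
print(clf.tree_.feature)              # feature used at each node (-2 marks a leaf)
print(clf.tree_.threshold)            # learned thresholds
print(clf.score(X, y))                # training accuracy
```

The choice max_depth=2 matches the two-threshold picture above; deeper trees trade simplicity for lower training error.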

Below is the associated decision tree.

The Training Algorithm (Optional)

Finding the optimal partition of the feature space that minimizes the loss is known to be an \mathcal{NP}-hard problem. The standard approach is to use a greedy procedure, in which we iteratively grow the tree one node at a time.

Step: i\geq1

Suppose we are at node i; let \mathcal{D}_i=\{(\bold{x}_n, y_n)\in N_i\} be the subset of the training data that reaches the node. We consider how to split this node into a left branch and right branch so as to minimize the training error in each child subtree.

  1. We first sort the unique values of the jth feature x_j in \mathcal{D}_i to get the set of all possible thresholds \mathcal{T}_{j}=\{x_{nj}\} for x_j.

  2. For each possible threshold t\in\mathcal{T}_j, define the left and right splits, respectively: \mathcal{D}_i^L(j, t)=\{(\bold{x}_n, y_n)\in N_i\mid x_{n,j}\leq t\},\text{ and} \mathcal{D}^R_i(j, t)=\{(\bold{x}_n, y_n)\in N_i\mid x_{n,j}> t\}.

  3. Finally, to find the best split at node i, we choose the feature x_{j_i} and threshold t_i that minimize the following loss: \mathcal{L}_i(j, t)=\frac{|\mathcal{D}^L_i(j,t)|}{|\mathcal{D}_i|}{\tt cost}(\mathcal{D}^L_i(j,t)) + \frac{|\mathcal{D}^R_i(j,t)|}{|\mathcal{D}_i|}{\tt cost}(\mathcal{D}^R_i(j,t)). \tag{31.1} Here, the cost {\tt cost}(\mathcal{D}_i) is computed by the Gini index (or entropy) as described below; a sketch of this greedy split search in code is also given below.
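The following is a sketch of the greedy split search at a single node (not a full training routine), using the Gini index defined in the next subsection as the {\tt cost}; the function and variable names are ours.

```python
import numpy as np

def gini(y):
    """Gini index of a label vector (Equation 31.2)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return the (feature j, threshold t, loss) minimizing Equation 31.1 at a node."""
    n, d = X.shape
    best_j, best_t, best_loss = None, None, np.inf
    for j in range(d):                          # step 1: candidate thresholds for feature j
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t                 # step 2: left/right splits
            right = ~left
            if left.all() or right.all():       # skip splits that leave a child empty
                continue
            loss = (left.sum() / n) * gini(y[left]) \
                 + (right.sum() / n) * gini(y[right])   # Equation 31.1
            if loss < best_loss:
                best_j, best_t, best_loss = j, t, loss
    return best_j, best_t, best_loss
```

Applying best_split recursively to the left and right subsets, until a node is pure or a depth limit is reached, grows the full tree.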

Gini and Entropy

Let us consider any node with data \mathcal{D}_i. We first compute the empirical distribution over class labels for this node: \hat\pi_{ic}=\frac{|\{ n\in\mathcal{D}_i\mid y_n=c\}|}{|\mathcal{D}_i|}. From this, we can define the Gini index. G_i=\sum_{c=1}^C\hat\pi_{ic}(1-\hat\pi_{ic})=1-\sum_{c=1}^C\hat\pi_{ic}^2. \tag{31.2} Alternatively, the entropy is defined by H_i=-\sum_{c=1}^C\hat\pi_{ic}\log{\hat\pi_{ic}}. \tag{31.3}
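For a concrete (hypothetical) example, suppose a node contains 10 samples, 6 from class 1 and 4 from class 2. Then \hat\pi_i=(0.6,\ 0.4), so G_i=1-(0.6^2+0.4^2)=0.48 and H_i=-(0.6\log 0.6+0.4\log 0.4)\approx 0.67 (using the natural logarithm). Both costs are zero for a pure node and largest when the classes are evenly mixed.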

In Equation 31.1, either of the above costs can be used to compute the loss used by the decision tree algorithm.

Training Demo

In this demo, we see the steps of the training algorithm as discussed above. We still use the Iris dataset with just two features petalLength and petalWidth.

Gini loss:

Below is the final decision tree.

Figure 31.4: The output decision tree
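As a sketch (assuming scikit-learn), the trained tree can also be printed in text form and compared against Figure 31.4; the depth limit here is an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data[:, 2:4], iris.target             # petalLength, petalWidth
clf = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(clf, feature_names=["petalLength", "petalWidth"]))
```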
Tip Training Error

Exercise 31.5 What’s the minimum training error that can be achieved?

Feature Importance

For a decision tree \mathcal{T}, the feature importance of a feature x_k is defined by its total contribution to reducing impurity (the Gini index).

Let us assume that the feature x_k appears in J internal nodes of the decision tree \mathcal{T}. The contribution, or importance, of x_k in reducing impurity at the jth such node can be defined as: {\tt Imp}_k(j)=\frac{|\mathcal{D}_j|}{|\mathcal{D}|}G(\mathcal{D}_j) -\left(\frac{|\mathcal{D}_j^L|}{|\mathcal{D}|}G(\mathcal{D}_j^L) +\frac{|\mathcal{D}^R_j|}{|\mathcal{D}|}G(\mathcal{D}_j^R)\right). Adding the contributions from all J nodes, we get the overall importance of the feature: {\tt Imp}_k=\sum_{j=1}^J{\tt Imp}_k(j).
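As a sketch (assuming scikit-learn), the impurity-based importances reported by DecisionTreeClassifier follow the same idea: the Gini reductions contributed by each feature are summed over the tree and normalized to sum to one.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(iris.data, iris.target)

for name, imp in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name:20s} {imp:.3f}")                # larger = more impurity reduced
```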

Pros and Cons

Tip Pros and Cons

Exercise 31.6 Discuss some of the pros and cons of the decision tree classifier.