[Interactive figure: scatter plot of the Iris training data, petal length (x_1) versus petal width (x_2), colored by species, with adjustable decision-boundary thresholds t_1 and t_2 overlaid.]

12.1 Decision Trees
Duration: 60 minutes
Decision trees are defined by recursively partitioning the input space, and defining a local model in each resulting region of the input space. Consequently, the overall model can be represented by a tree.
Although decision trees can also be used for regression, here we study only classification trees.
Tip: Trees
Trees are special graphs with a designated root node and no cycles.
Decision Trees are Everywhere!
It’s quite common in real life to use nested decision rules or a decision tree for making day-to-day choices.
For example, a GW data science student like yourself may choose a DATS course based on the following decision tree.
In the diagram above, we pretend to use four binary features \{x_1\ ({\tt required}), x_2\ ({\tt exam}), x_3\ ({\tt takehome}), x_4\ ({\tt curve})\} and two class labels \{{\color{green}\tt Y}, {\color{red}\tt N}\}.
In practice, the features are continuous (e.g., average final score, number of students, etc.). In that case, it is most natural to use a threshold value on each feature to create the nested decision rules.
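As a concrete sketch, such nested decision rules are nothing more than if/else statements. The Python snippet below encodes one possible rule set over the four binary features; the ordering and outcomes are made up for illustration and need not match the diagram.

```python
def take_course(required: bool, exam: bool, takehome: bool, curve: bool) -> str:
    """Hypothetical nested decision rules for picking a DATS course.

    Features: x1 = required, x2 = exam, x3 = takehome, x4 = curve.
    Class labels: "Y" (take it) or "N" (skip it).
    """
    if required:          # x1: a required course is always taken
        return "Y"
    if exam:              # x2: the course has an in-class exam
        return "Y" if curve else "N"   # x4: curved grading softens the exam
    return "Y" if takehome else "N"    # x3: take-home assessment only

# Example: an elective with a curved exam.
print(take_course(required=False, exam=True, takehome=False, curve=True))  # "Y"
```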
Mathematical Formulation
Let us consider a classification problem with features \bold{x}=(x_1, x_2, \ldots, x_d) and class labels \{{\tt 1}, {\tt 2}, \ldots, {\tt C}\}.
The idea behind a decision tree is to divide the feature space \mathcal{X} into non-overlapping regions \mathcal{T}=\{\mathcal{R}_i\}, where each region \mathcal{R}_i is given by
- a sequence of nested conditions, each defined by a threshold t\in\mathbb{R} (e.g., x_j\leq t), and
- a class label \tt c_j\in \{{\tt 1}, {\tt 2}, \ldots, {\tt C}\}.
The regions are commonly taken to be rectangular (axis-parallel), and the entire decision boundary looks like a colored maze.
For example, below is such a partition of the feature space for a two-dimensional feature vector \bold{x}=(x_1, x_2) and three class labels \{{\color{orange}\tt 0}, {\color{green}\tt 1}, {\color{royalblue}\tt 2}\}; see Figure 31.2.
The corresponding decision tree is shown in Figure 31.3.
Prediction
Given such a decision tree with colored regions, a new data point \bold{x} can be classified depending on the color or label of the unique region \mathcal{R}_i that contains \bold{x}.
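A minimal sketch of this prediction rule in Python, assuming a toy tree stored as nested dictionaries; the structure and numbers are illustrative, not the tree of Figure 31.3.

```python
# Internal nodes store (feature index, threshold); leaves store a class label.
tree = {
    "feature": 0, "threshold": 3.0,        # test x1 <= 3.0
    "left": {"label": 0},                   # region R1 -> class 0
    "right": {
        "feature": 1, "threshold": 1.5,     # test x2 <= 1.5
        "left": {"label": 1},               # region R2 -> class 1
        "right": {"label": 2},              # region R3 -> class 2
    },
}

def predict(node, x):
    """Descend from the root until reaching the leaf (region) containing x."""
    while "label" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

print(predict(tree, [2.0, 0.5]))  # 0
print(predict(tree, [5.0, 2.0]))  # 2
```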
Training
To obtain such a decision tree, we use the training data \mathcal{D} to choose a sequence of features and thresholds that minimizes some notion of loss. The two commonly used loss functions are the Gini index (Equation 31.2) and entropy (Equation 31.3). See The Training Algorithm (Optional) below for more details.
Classifying Iris
Before we describe the loss functions and the training algorithm in the next section, we first work through an easy example.
Recall that the classic Iris dataset has four features and three class labels: \color{orange}\tt setosa, \color{green}\tt versicolor, and \color{royalblue}\tt virginica. To keep it simple, let us assume that we have only two (continuous) features of the Iris flower: petalLength and petalWidth.
Try to pick thresholds t_1 and t_2 that visually classify the training labels with as little misclassification as possible.
Below is the associated decision tree.
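For comparison, here is a sketch, assuming scikit-learn is available, that fits a depth-2 tree on the same two features; the two learned thresholds play the role of t_1 and t_2.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:4]   # keep only petal length and petal width
y = iris.target         # 0 = setosa, 1 = versicolor, 2 = virginica

# A depth-2 tree uses at most one threshold per level, mirroring t1 and t2.
clf = DecisionTreeClassifier(max_depth=2, criterion="gini", random_state=0)
clf.fit(X, y)

print(export_text(clf, feature_names=["petalLength", "petalWidth"]))
print("training accuracy:", clf.score(X, y))
```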
The Training Algorithm (Optional)
Finding the partition of the feature space that exactly minimizes the loss is known to be an \mathcal{NP}-hard problem. The standard approach is therefore a greedy procedure, in which we iteratively grow the tree one node at a time.
Step i\geq 1
Suppose we are at node i; let \mathcal{D}_i=\{(\bold{x}_n, y_n)\in N_i\} be the subset of the training data that reaches the node. We consider how to split this node into a left branch and right branch so as to minimize the training error in each child subtree.
We first sort the unique values of the jth feature x_j in \mathcal{D}_i to get the set of all possible thresholds \mathcal{T}_{j}=\{x_{nj}\} for x_j.
For each possible threshold t\in\mathcal{T}_j, define the left and right splits, respectively: \mathcal{D}_i^L(j, t)=\{(\bold{x}_n, y_n)\in N_i\mid x_{n,j}\leq t\},\text{ and} \mathcal{D}^R_i(j, t)=\{(\bold{x}_n, y_n)\in N_i\mid x_{n,j}> t\}.
Finally, to find the best split at node i, we choose the feature x_{j_i} and threshold t_i that minimize the following loss: \mathcal{L}_i(j, t)=\frac{|\mathcal{D}^L_i(j,t)|}{|\mathcal{D}_i|}{\tt cost}(\mathcal{D}^L_i(j,t)) + \frac{|\mathcal{D}^R_i(j,t)|}{|\mathcal{D}_i|}{\tt cost}(\mathcal{D}^R_i(j,t)). \tag{31.1} Here, the cost {\tt cost}(\mathcal{D}_i) is computed using the Gini index (or entropy), as described below.
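Below is a minimal sketch of this split search (plain NumPy, not a reference implementation). It scans every feature j and every candidate threshold t\in\mathcal{T}_j and returns the pair minimizing Equation 31.1; the cost argument can be any impurity function, such as the Gini index or entropy defined next.

```python
import numpy as np

def best_split(X, y, cost):
    """Greedy search for the (feature, threshold) pair minimizing Equation 31.1.

    X: (n, d) array of features at the current node, y: (n,) labels,
    cost: impurity function mapping an array of labels to a number.
    """
    n, d = X.shape
    best_j, best_t, best_loss = None, None, np.inf
    for j in range(d):
        for t in np.unique(X[:, j]):         # candidate thresholds T_j
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue                      # skip degenerate splits
            loss = (len(left) / n) * cost(left) + (len(right) / n) * cost(right)
            if loss < best_loss:
                best_j, best_t, best_loss = j, t, loss
    return best_j, best_t, best_loss
```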
Gini and Entropy
Let us consider any node with data \mathcal{D}_i. We first compute the empirical distribution over class labels for this node: \hat\pi_{ic}=\frac{|\{ n\in\mathcal{D}_i\mid y_n=c\}|}{|\mathcal{D}_i|}. From this, we can define the Gini index: G_i=\sum_{c=1}^C\hat\pi_{ic}(1-\hat\pi_{ic})=1-\sum_{c=1}^C\hat\pi_{ic}^2. \tag{31.2} Alternatively, the entropy is defined by H_i=-\sum_{c=1}^C\hat\pi_{ic}\log{\hat\pi_{ic}}. \tag{31.3}
In Equation 31.1, either of the above cost functions can be used to compute the loss minimized by the decision tree algorithm.
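A sketch of the two impurity measures (Equations 31.2 and 31.3); either can be passed as the cost function in the split-search sketch above.

```python
import numpy as np

def gini(y):
    """Gini index (Equation 31.2): 1 - sum_c pi_c^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Entropy (Equation 31.3): -sum_c pi_c log pi_c."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

labels = np.array([0, 0, 1, 1, 2, 2])   # three equally frequent classes
print(gini(labels))      # 0.666...
print(entropy(labels))   # 1.0986... (= log 3)
```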
Training Demo
In this demo, we see the steps of the training algorithm as discussed above. We still use the Iris dataset with just two features petalLength and petalWidth.
The demo also displays the Gini loss of the current split.
Below is the final decision tree.
Feature Importance
For a decision tree \mathcal{T}, the feature importance of a feature x_k is defined as its total contribution to reducing the impurity (Gini).
Let us assume that the feature x_k appears in J internal nodes of the decision tree \mathcal{T}. The contribution, or importance, of x_k in reducing impurity at the jth such node can be defined as: {\tt Imp}_k(j)=\frac{|\mathcal{D}_j|}{|\mathcal{D}|}G(\mathcal{D}_j) -\left(\frac{|\mathcal{D}_j^L|}{|\mathcal{D}|}G(\mathcal{D}_j^L) +\frac{|\mathcal{D}^R_j|}{|\mathcal{D}|}G(\mathcal{D}_j^R)\right). Adding the contributions from all J nodes, we get the feature importance of x_k: {\tt Imp}_k=\sum_{j=1}^J{\tt Imp}_k(j).
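In practice these sums are rarely computed by hand. Assuming scikit-learn, a fitted tree exposes its impurity-based importances (normalized to sum to one) through feature_importances_, as in this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# Normalized impurity-based importance of each of the four features.
for name, imp in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name:20s} {imp:.3f}")
```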