Lecture 10: Decision Trees and Ensemble Methods¶
📹 Watch Lecture
From Andrew Ng's CS229 Lecture 10 (Autumn 2018)¶
"Decision trees are one of our first examples of a non-linear model" - TA Raphael Townshend
Why Decision Trees?¶
The Skiing Example from Lecture:
Task: Predict if you can ski given time (month) and location (latitude)
Problem: Data has separate regions (Northern Hemisphere winter, Southern Hemisphere winter)
Challenge: Linear classifiers can't separate these regions
Solution: Partition the space into individual regions
Northern Hemisphere: Ski in Jan-Mar (positive latitude)
Southern Hemisphere: Ski in Jun-Aug (negative latitude)
Equator: Never ski (latitude near 0)
Recursive Partitioning¶
Decision trees work through greedy, top-down, recursive partitioning:
Top-down: Start with overall region, slowly partition it
Greedy: At each step, pick the best partition possible
Recursive: After splitting, treat each region as a new problem
"The tree is basically gonna play 20 Questions with this space"
Example Questions:
"Is latitude > 30 degrees?" → Split horizontally
"Is month < 3?" → Split vertically
Continue recursively on each partition (a minimal sketch of this recursion follows)
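A minimal sketch of the recursion (not the lecture's code), assuming a best_split helper that returns the greedy-best (feature, threshold) pair; one such helper is sketched under "Choosing Splits" below:
import numpy as np
def grow_tree(X, y, depth=0, max_depth=3):
    """Hedged sketch of greedy, top-down, recursive partitioning."""
    # Stop when the node is pure or the depth budget is spent
    if depth == max_depth or len(np.unique(y)) == 1:
        values, counts = np.unique(y, return_counts=True)
        return {'leaf': True, 'prediction': values[np.argmax(counts)]}
    j, t = best_split(X, y)  # greedy: best question right now (assumed helper)
    left = X[:, j] < t       # then recurse on each child region
    return {'leaf': False, 'feature': j, 'threshold': t,
            'left': grow_tree(X[left], y[left], depth + 1, max_depth),
            'right': grow_tree(X[~left], y[~left], depth + 1, max_depth)}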
Formal Definition: Split Function¶
Split Function S_p(j, t) on region R_p:
R₁ = {x ∈ R_p : x_j < t}
R₂ = {x ∈ R_p : x_j ≥ t}
where:
j = which feature to split on
t = threshold value
R_p = parent region
R₁, R₂ = child regions
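In code, the split function is just a boolean mask on feature j; a minimal sketch:
import numpy as np
def split_region(X, j, t):
    """S_p(j, t): partition the rows of X on feature j at threshold t."""
    mask = X[:, j] < t
    return X[mask], X[~mask]  # R₁ = {x : x_j < t}, R₂ = {x : x_j ≥ t}
# Hypothetical example: split (month, latitude) rows on latitude at the equator
X = np.array([[1.0, 45.0], [7.0, -60.0], [3.0, 10.0]])
R1, R2 = split_region(X, j=1, t=0.0)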
Choosing Splits: Loss Functions¶
Goal: Pick the split that decreases the loss as much as possible
Maximize: L(R_p) - (|R₁| L(R₁) + |R₂| L(R₂)) / |R_p|, i.e., the parent loss minus the size-weighted average of the child losses
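A hedged brute-force sketch of this search: scan every (feature, threshold) pair and weight child losses by region size. The loss is assumed here to take a label array; Gini impurity (Option 3 below) serves as the default:
import numpy as np
def gini_from_labels(y):
    """Gini impurity of a label array (see Option 3 below)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)
def best_split(X, y, loss=gini_from_labels):
    """Return the (feature, threshold) pair with the largest loss reduction."""
    best_j, best_t, best_gain = None, None, 0.0
    parent = loss(y)
    for j in range(X.shape[1]):            # all features
        for t in np.unique(X[:, j]):       # all candidate thresholds
            left = X[:, j] < t
            if left.all() or not left.any():
                continue                   # skip degenerate splits
            weighted = (left.mean() * loss(y[left])
                        + (1 - left.mean()) * loss(y[~left]))
            if parent - weighted > best_gain:
                best_j, best_t, best_gain = j, t, parent - weighted
    return best_j, best_t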
Option 1: Misclassification Loss (Don't Use!)¶
L_misclass(R) = 1 - max_c p̂_c
where p̂_c = proportion of class c in region R
Problem from the lecture example (losses below are counts of misclassified examples):
Parent: 900 pos, 100 neg → Loss = 100
Split 1: (700 pos, 100 neg) + (200 pos, 0 neg) → Loss = 100 + 0 = 100
Split 2: (400 pos, 100 neg) + (500 pos, 0 neg) → Loss = 100 + 0 = 100
"Split 2 is clearly better (it isolates a larger pure region), but misclassification loss can't tell the difference!"
Option 2: Cross-Entropy Loss (Use This!)¶
L_cross(R) = -Σ_c p̂_c log₂(p̂_c)
From information theory: "the number of bits needed to communicate which class, given that the probabilities are known"
Why better:
100% one class: No bits needed (L = 0)
Even split: Maximum bits needed (L is highest)
Sensitive to purity: Prefers splits that create pure regions
Option 3: Gini Impurity (Also Good)¶
L_Gini(R) = Σ_c p̂_c(1 - p̂_c) = 1 - Σ_c p̂_c²
Interpretation: the probability of misclassifying a randomly chosen element, if it were labeled randomly according to the region's class distribution
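For a two-class region with positive-class proportion p, the three losses are easy to compare visually; a short sketch (misclassification loss is piecewise linear, which is exactly why it could not separate the two splits above, while cross-entropy and Gini are strictly concave):
import numpy as np
import matplotlib.pyplot as plt
p = np.linspace(0.001, 0.999, 200)  # proportion of the positive class
misclass = 1 - np.maximum(p, 1 - p)
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
gini = 2 * p * (1 - p)              # equals 1 - p² - (1-p)²
plt.plot(p, misclass, label='Misclassification')
plt.plot(p, entropy, label='Cross-entropy (bits)')
plt.plot(p, gini, label='Gini impurity')
plt.xlabel('p̂ (positive-class proportion)')
plt.ylabel('Loss')
plt.legend()
plt.show()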
When to Stop Splitting?¶
Stopping Criteria:
All examples in node have same label (pure node)
Maximum depth reached
Minimum samples per leaf
No split improves loss by minimum threshold
Prediction:
For leaf node: Predict most common class (majority vote)
Or predict class probabilities based on leaf composition
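In scikit-learn these criteria map directly onto DecisionTreeClassifier hyperparameters; a small illustration (the values are illustrative, not recommendations):
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(
    max_depth=4,                 # maximum depth reached
    min_samples_leaf=5,          # minimum samples per leaf
    min_impurity_decrease=0.01,  # no split improves loss by minimum threshold
)
# Pure nodes are never split further, regardless of these settings.
# dt.predict(X) returns the majority vote per leaf; dt.predict_proba(X)
# returns class probabilities based on leaf composition.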
Implementation Notes¶
Computational Complexity:
For each split: try all features (n) × all candidate thresholds (m), i.e., O(nm) candidate splits per node
Greedy algorithm: Not guaranteed to find global optimum
But works well in practice!
Advantages:
Easy to interpret (human-readable rules)
Handles non-linear decision boundaries naturally
No feature scaling needed
Can handle mixed data types
Disadvantages:
High variance (sensitive to training data)
Greedy splits may miss better global structure
Tends to overfit without pruning
Solution to high variance: Use ensemble methods (bagging, boosting)!
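As a quick preview of that fix (a sketch, not the lecture's code), bagging many trees already tames the variance of one unpruned tree:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unpruned: high variance
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"Single tree test accuracy:   {tree.score(X_te, y_te):.3f}")
print(f"Random forest test accuracy: {forest.score(X_te, y_te):.3f}")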
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)
print("Libraries loaded!")
10.1 The Skiing Example from Lecture¶
Let's recreate the skiing classification example from the lecture.
# Generate skiing data
# Northern Hemisphere: Ski in months 1-3, 11-12 (winter) at positive latitude
# Southern Hemisphere: Ski in months 6-8 (winter) at negative latitude
# Equator: Never ski (too warm)
np.random.seed(42)
n_samples = 300
# Northern Hemisphere winter (Jan-Mar, Nov-Dec)
nh_winter_months = np.concatenate([np.random.uniform(1, 3.5, 50),
                                   np.random.uniform(11, 12.5, 50)])
nh_winter_lat = np.random.uniform(30, 90, 100)
nh_winter_label = np.ones(100) # Can ski
# Northern Hemisphere summer (May-Sep)
nh_summer_months = np.random.uniform(5, 9, 50)
nh_summer_lat = np.random.uniform(30, 90, 50)
nh_summer_label = np.zeros(50) # Cannot ski
# Southern Hemisphere winter (Jun-Aug)
sh_winter_months = np.random.uniform(6, 8.5, 50)
sh_winter_lat = np.random.uniform(-90, -30, 50)
sh_winter_label = np.ones(50) # Can ski
# Southern Hemisphere summer (Dec-Feb)
sh_summer_months = np.concatenate([np.random.uniform(1, 2.5, 25),
                                   np.random.uniform(11.5, 12.5, 25)])
sh_summer_lat = np.random.uniform(-90, -30, 50)
sh_summer_label = np.zeros(50) # Cannot ski
# Equator (never ski)
equator_months = np.random.uniform(1, 12, 50)
equator_lat = np.random.uniform(-20, 20, 50)
equator_label = np.zeros(50) # Cannot ski
# Combine all data
X_ski = np.column_stack([
    np.concatenate([nh_winter_months, nh_summer_months, sh_winter_months,
                    sh_summer_months, equator_months]),
    np.concatenate([nh_winter_lat, nh_summer_lat, sh_winter_lat,
                    sh_summer_lat, equator_lat])
])
y_ski = np.concatenate([nh_winter_label, nh_summer_label, sh_winter_label,
                        sh_summer_label, equator_label])
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Raw data
axes[0].scatter(X_ski[y_ski==0, 0], X_ski[y_ski==0, 1],
                c='red', marker='x', s=50, alpha=0.6, label='Cannot Ski')
axes[0].scatter(X_ski[y_ski==1, 0], X_ski[y_ski==1, 1],
                c='blue', marker='o', s=50, alpha=0.6, label='Can Ski')
axes[0].set_xlabel('Month (1=Jan, 12=Dec)', fontsize=12)
axes[0].set_ylabel('Latitude (degrees)', fontsize=12)
axes[0].set_title('Skiing Dataset from Lecture\n(Non-linearly Separable)',
                  fontsize=13, fontweight='bold')
axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5, label='Equator')
axes[0].axhline(y=30, color='lightblue', linestyle='--', alpha=0.5)
axes[0].axhline(y=-30, color='lightblue', linestyle='--', alpha=0.5)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)
axes[0].set_xlim(0, 13)
axes[0].set_ylim(-100, 100)
# Train decision tree
dt_ski = DecisionTreeClassifier(max_depth=4, random_state=42)
dt_ski.fit(X_ski, y_ski)
# Decision boundary
h = 0.1
x_min, x_max = 0, 13
y_min, y_max = -100, 100
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = dt_ski.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
axes[1].contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
axes[1].scatter(X_ski[y_ski==0, 0], X_ski[y_ski==0, 1],
                c='red', marker='x', s=50, label='Cannot Ski')  # 'x' markers have no face, so no edgecolors
axes[1].scatter(X_ski[y_ski==1, 0], X_ski[y_ski==1, 1],
                c='blue', marker='o', s=50, edgecolors='k', label='Can Ski')
axes[1].set_xlabel('Month', fontsize=12)
axes[1].set_ylabel('Latitude', fontsize=12)
axes[1].set_title(f'Decision Tree (max_depth=4)\nAccuracy: {dt_ski.score(X_ski, y_ski):.3f}',
                  fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim(0, 13)
axes[1].set_ylim(-100, 100)
plt.tight_layout()
plt.show()
print("\nObservations:")
print("- Decision tree naturally partitions space into rectangular regions")
print("- Captures Northern Hemisphere winter (high latitude, early/late months)")
print("- Captures Southern Hemisphere winter (low latitude, middle months)")
print("- Linear classifier would fail miserably on this data!")
10.2 Visualizing the Tree Structure¶
Let's see the "20 Questions" the tree is asking.
# Visualize tree structure
plt.figure(figsize=(20, 10))
plot_tree(dt_ski,
          feature_names=['Month', 'Latitude'],
          class_names=['Cannot Ski', 'Can Ski'],
          filled=True,
          fontsize=10,
          rounded=True)
plt.title('Decision Tree Structure: Playing 20 Questions',
          fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()
print("\nHow to read this tree:")
print("- Start at the top (root node)")
print("- Each box asks a yes/no question about a feature")
print("- Left branch = True, Right branch = False")
print("- Leaf nodes (bottom) show final prediction")
print("- Color intensity shows class probability")
10.3 Comparing Loss Functions¶
Let's verify the lecture's claim that cross-entropy is more sensitive than misclassification loss.
def misclassification_loss(p_c):
    """L = 1 - max(p_c)"""
    return 1 - np.max(p_c)
def cross_entropy_loss(p_c):
    """L = -Σ p_c log2(p_c)"""
    p_c = np.clip(p_c, 1e-10, 1)  # avoid log(0)
    return -np.sum(p_c * np.log2(p_c))
def gini_impurity(p_c):
    """L = 1 - Σ p_c²"""
    return 1 - np.sum(p_c ** 2)
# Example from lecture
print("Example from Lecture:")
print("Parent: 900 pos, 100 neg")
p_parent = np.array([0.9, 0.1])
print("\nSplit 1: (700 pos, 100 neg) + (200 pos, 0 neg)")
p_split1_left = np.array([700/800, 100/800])
p_split1_right = np.array([1.0, 0.0])
print("\nSplit 2: (400 pos, 100 neg) + (500 pos, 0 neg)")
p_split2_left = np.array([400/500, 100/500])
p_split2_right = np.array([1.0, 0.0])
# Compute weighted losses
print("\n" + "="*60)
print("MISCLASSIFICATION LOSS:")
print("="*60)
parent_misclass = misclassification_loss(p_parent)
split1_misclass = (800/1000) * misclassification_loss(p_split1_left) + \
                  (200/1000) * misclassification_loss(p_split1_right)
split2_misclass = (500/1000) * misclassification_loss(p_split2_left) + \
                  (500/1000) * misclassification_loss(p_split2_right)
print(f"Parent loss: {parent_misclass:.4f}")
print(f"Split 1 loss: {split1_misclass:.4f} (reduction: {parent_misclass - split1_misclass:.4f})")
print(f"Split 2 loss: {split2_misclass:.4f} (reduction: {parent_misclass - split2_misclass:.4f})")
print("\nβ οΈ PROBLEM: Both splits show SAME reduction! Cannot distinguish better split.")
print("\n" + "="*60)
print("CROSS-ENTROPY LOSS:")
print("="*60)
parent_ce = cross_entropy_loss(p_parent)
split1_ce = (800/1000) * cross_entropy_loss(p_split1_left) + \
            (200/1000) * cross_entropy_loss(p_split1_right)
split2_ce = (500/1000) * cross_entropy_loss(p_split2_left) + \
            (500/1000) * cross_entropy_loss(p_split2_right)
print(f"Parent loss: {parent_ce:.4f}")
print(f"Split 1 loss: {split1_ce:.4f} (reduction: {parent_ce - split1_ce:.4f})")
print(f"Split 2 loss: {split2_ce:.4f} (reduction: {parent_ce - split2_ce:.4f})")
print("\nβ BETTER: Split 2 shows LARGER reduction! Correctly prefers purer split.")
print("\n" + "="*60)
print("GINI IMPURITY:")
print("="*60)
parent_gini = gini_impurity(p_parent)
split1_gini = (800/1000) * gini_impurity(p_split1_left) + \
              (200/1000) * gini_impurity(p_split1_right)
split2_gini = (500/1000) * gini_impurity(p_split2_left) + \
              (500/1000) * gini_impurity(p_split2_right)
print(f"Parent loss: {parent_gini:.4f}")
print(f"Split 1 loss: {split1_gini:.4f} (reduction: {parent_gini - split1_gini:.4f})")
print(f"Split 2 loss: {split2_gini:.4f} (reduction: {parent_gini - split2_gini:.4f})")
print("\nβ BETTER: Split 2 shows LARGER reduction! Also works well.")
10.4 Effect of Tree Depth¶
Decision trees can overfit if grown too deep (high variance problem).
# Generate moon dataset
X_moon, y_moon = make_moons(n_samples=200, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_moon, y_moon,
                                                    test_size=0.3, random_state=42)
# Try different depths
depths = [2, 4, 8, None] # None = unlimited
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
for idx, max_depth in enumerate(depths):
    # Train
    dt = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    dt.fit(X_train, y_train)
    # Accuracies
    train_acc = dt.score(X_train, y_train)
    test_acc = dt.score(X_test, y_test)
    # Decision boundary
    h = 0.02
    x_min, x_max = X_moon[:, 0].min() - 0.5, X_moon[:, 0].max() + 0.5
    y_min, y_max = X_moon[:, 1].min() - 0.5, X_moon[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = dt.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot decision boundary
    axes[0, idx].contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
    axes[0, idx].scatter(X_train[y_train==0, 0], X_train[y_train==0, 1],
                         c='red', marker='x', s=50, alpha=0.7)  # 'x' markers have no face, so no edgecolors
    axes[0, idx].scatter(X_train[y_train==1, 0], X_train[y_train==1, 1],
                         c='blue', marker='o', s=50, edgecolors='k', alpha=0.7)
    depth_str = f"{max_depth}" if max_depth else "∞ (Unlimited)"
    if max_depth == 2:
        title = f"Depth = {depth_str}\n(Underfit)\nTrain: {train_acc:.3f}, Test: {test_acc:.3f}"
    elif max_depth is None:
        title = f"Depth = {depth_str}\n(Overfit)\nTrain: {train_acc:.3f}, Test: {test_acc:.3f}"
    else:
        title = f"Depth = {depth_str}\nTrain: {train_acc:.3f}, Test: {test_acc:.3f}"
    axes[0, idx].set_title(title, fontsize=11, fontweight='bold')
    axes[0, idx].set_xlabel('Feature 1')
    axes[0, idx].set_ylabel('Feature 2')
    axes[0, idx].grid(True, alpha=0.3)
    # Plot tree structure (smaller)
    plot_tree(dt, ax=axes[1, idx], filled=True, fontsize=7,
              class_names=['Class 0', 'Class 1'], rounded=True)
    axes[1, idx].set_title(f'Tree Structure (Depth={depth_str})',
                           fontsize=10, fontweight='bold')
plt.tight_layout()
plt.show()
print("\nKey Observations:")
print("- Depth 2: Underfits (too simple, high bias)")
print("- Depth 4-8: Good balance")
print("- Unlimited depth: Overfits (memorizes training data, high variance)")
print("- Gap between train/test accuracy indicates overfitting")