Lecture 10: Decision Trees and Ensemble Methods

📹 Watch Lecture

From Andrew Ng's CS229 Lecture 10 (Autumn 2018)

"Decision trees are one of our first examples of a non-linear model" - TA Raphael Townshend

Why Decision Trees?

The Skiing Example from Lecture:

  • Task: Predict if you can ski given time (month) and location (latitude)

  • Problem: Data has separate regions (Northern Hemisphere winter, Southern Hemisphere winter)

  • Challenge: Linear classifiers can't separate these regions

  • Solution: Partition the space into individual regions

Northern Hemisphere: Ski in Jan-Mar (positive latitude)
Southern Hemisphere: Ski in Jun-Aug (negative latitude)  
Equator: Never ski (latitude near 0)

Recursive Partitioning

Decision trees work through greedy, top-down, recursive partitioning:

  1. Top-down: Start with overall region, slowly partition it

  2. Greedy: At each step, pick the best partition possible

  3. Recursive: After splitting, treat each region as a new problem

"The tree is basically gonna play 20 Questions with this space"

Example Questions:

  • "Is latitude > 30 degrees?" → Split horizontally

  • "Is month < 3?" → Split vertically

  • Continue recursively on each partition

Formal Definition: Split Function

Split Function S_p(j, t) on region R_p:

R₁ = {x ∈ R_p : x_j < t}
R₂ = {x ∈ R_p : x_j ≥ t}

where:

  • j = which feature to split on

  • t = threshold value

  • R_p = parent region

  • R₁, R₂ = child regions
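In code, the split function is just a Boolean mask over the parent region (a minimal sketch; the function name `split` and the toy array are ours, not from the lecture):

```python
import numpy as np

def split(X, j, t):
    # S_p(j, t): partition region R_p on feature j at threshold t
    mask = X[:, j] < t
    return X[mask], X[~mask]  # R1 = {x : x_j < t}, R2 = {x : x_j >= t}

# Three (month, latitude) points, split on month (j=0) at t=3
X = np.array([[1.0, 40.0], [7.0, -50.0], [2.0, 10.0]])
R1, R2 = split(X, 0, 3.0)
print(len(R1), len(R2))  # 2 1
```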

Choosing Splits: Loss Functions

Goal: Pick the split that decreases loss as much as possible

Maximize the reduction: L(R_p) - [|R₁| L(R₁) + |R₂| L(R₂)] / |R_p|

where each child's loss is weighted by the fraction of examples falling in it
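Combining the split function, the loss, and the greedy/top-down/recursive recipe gives a complete (if minimal) tree learner. The sketch below uses Gini impurity; the function names `gini`, `best_split`, and `grow` are ours, not from the lecture:

```python
import numpy as np

def gini(y):
    # Gini impurity of a label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Greedy step: try every feature j and threshold t, keep the split
    # with the lowest weighted child impurity
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] < t
            if left.sum() == 0 or (~left).sum() == 0:
                continue  # skip degenerate splits
            loss = (left.sum() * gini(y[left])
                    + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or loss < best[0]:
                best = (loss, j, t)
    return best

def grow(X, y, depth=0, max_depth=3):
    # Recursive step: stop on pure nodes or max depth, else split and recurse
    if depth == max_depth or len(np.unique(y)) == 1:
        return {'leaf': np.bincount(y).argmax()}
    found = best_split(X, y)
    if found is None:
        return {'leaf': np.bincount(y).argmax()}
    _, j, t = found
    left = X[:, j] < t
    return {'j': j, 't': t,
            'left': grow(X[left], y[left], depth + 1, max_depth),
            'right': grow(X[~left], y[~left], depth + 1, max_depth)}

# Toy data: class 1 iff feature 0 >= 6
X_toy = np.array([[1.0], [2.0], [6.0], [7.0]])
y_toy = np.array([0, 0, 1, 1])
tree = grow(X_toy, y_toy)
print(tree)  # root splits on feature 0 at t = 6.0; both children are pure leaves
```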

Option 1: Misclassification Loss (Don't Use!)

L_misclass(R) = 1 - max_c p̂_c

where p̂_c = proportion of class c in region R

Problem from lecture example (losses below are written as counts of misclassified examples, i.e. |R|·L):

Parent: 900 pos, 100 neg → Loss = 100

Split 1: (700 pos, 100 neg) + (200 pos, 0 neg) → Loss = 100 + 0 = 100
Split 2: (400 pos, 100 neg) + (500 pos, 0 neg) → Loss = 100 + 0 = 100

"Split 2 is clearly better (it isolates a larger pure region), but misclassification loss can't tell the difference!"

Option 2: Cross-Entropy Loss (Use This!)

L_cross(R) = -Σ_c p̂_c log₂(p̂_c)

From information theory: "Number of bits needed to communicate which class, given probabilities are known" (hence the base-2 log)

Why better:

  • 100% one class: No bits needed (L = 0)

  • Even split: Maximum bits needed (L = 1 bit for two classes)

  • Sensitive to purity: Prefers splits that create pure regions
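These properties are easy to check numerically (a quick sketch; the helper name `entropy` is ours):

```python
import numpy as np

def entropy(p):
    # Cross-entropy loss of a class distribution, in bits (0*log 0 treated as 0)
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum() + 0.0)  # +0.0 normalizes -0.0

print(entropy([1.0, 0.0]))  # 0.0  -> pure region, no bits needed
print(entropy([0.5, 0.5]))  # 1.0  -> even split, maximum for two classes
print(entropy([0.9, 0.1]))  # ~0.469 -> mostly pure, few bits needed
```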

Option 3: Gini Impurity (Also Good)

L_Gini(R) = Σ_c p̂_c(1 - p̂_c) = 1 - Σ_c p̂_c²

Interpretation: Probability of misclassifying a randomly chosen element, if its label is guessed by drawing from the region's class distribution
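That interpretation can be verified by simulation: draw both the true class and the guessed class from the region's distribution, and the misclassification rate approaches Σ p̂_c(1 - p̂_c). A small sketch (the distribution `p` is an illustrative example):

```python
import numpy as np

def gini(p):
    # Gini impurity: 1 - sum of squared class proportions
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])

# True class and guessed class both drawn from the same distribution p
true = rng.choice(3, size=200_000, p=p)
guess = rng.choice(3, size=200_000, p=p)
print(gini(p))                 # 0.46 exactly
print(np.mean(true != guess))  # ~0.46 by Monte Carlo
```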

When to Stop Splitting?

Stopping Criteria:

  1. All examples in node have same label (pure node)

  2. Maximum depth reached

  3. Minimum samples per leaf

  4. No split improves loss by minimum threshold

Prediction:

  • For leaf node: Predict most common class (majority vote)

  • Or predict class probabilities based on leaf composition
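Both prediction modes fall out of the same leaf statistics: a majority vote for the hard label, class proportions for the probabilities (a minimal sketch with made-up leaf labels):

```python
import numpy as np

leaf_labels = np.array([1, 1, 1, 0, 1])         # labels landing in one leaf
counts = np.bincount(leaf_labels, minlength=2)  # per-class counts

print(counts.argmax())        # 1 -> majority-vote prediction
print(counts / counts.sum())  # [0.2 0.8] -> class probabilities
```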

Implementation Notes

Computational Complexity:

  • For each split: Try all features (n) × all candidate thresholds (m, up to one per training example)

  • Greedy algorithm: Not guaranteed to find global optimum

  • But works well in practice!

Advantages:

  • Easy to interpret (human-readable rules)

  • Handles non-linear decision boundaries naturally

  • No feature scaling needed

  • Can handle mixed data types

Disadvantages:

  • High variance (sensitive to training data)

  • Greedy splits may miss better global structure

  • Tends to overfit without pruning

Solution to high variance: Use ensemble methods (bagging, boosting)!
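As a quick preview of that fix (ensembles are their own topic; this sketch only demonstrates the variance-reduction effect using scikit-learn's `RandomForestClassifier`, a bagging-style ensemble of trees):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

# One fully grown tree (high variance) vs. an average of 100 bagged trees
tree_score = cross_val_score(DecisionTreeClassifier(random_state=0),
                             X, y, cv=5).mean()
forest_score = cross_val_score(RandomForestClassifier(n_estimators=100,
                                                      random_state=0),
                               X, y, cv=5).mean()
print(f"single tree:   {tree_score:.3f}")
print(f"random forest: {forest_score:.3f}")  # typically higher on noisy data
```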

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded!")

10.1 The Skiing Example from Lecture

Let's recreate the skiing classification example from the lecture.

# Generate skiing data
# Northern Hemisphere: Ski in months 1-3, 11-12 (winter) at positive latitude
# Southern Hemisphere: Ski in months 6-8 (winter) at negative latitude
# Equator: Never ski (too warm)

np.random.seed(42)
n_samples = 300

# Northern Hemisphere winter (Jan-Mar, Nov-Dec)
nh_winter_months = np.concatenate([np.random.uniform(1, 3.5, 50), 
                                   np.random.uniform(11, 12.5, 50)])
nh_winter_lat = np.random.uniform(30, 90, 100)
nh_winter_label = np.ones(100)  # Can ski

# Northern Hemisphere summer (May-Sep)
nh_summer_months = np.random.uniform(5, 9, 50)
nh_summer_lat = np.random.uniform(30, 90, 50)
nh_summer_label = np.zeros(50)  # Cannot ski

# Southern Hemisphere winter (Jun-Aug)
sh_winter_months = np.random.uniform(6, 8.5, 50)
sh_winter_lat = np.random.uniform(-90, -30, 50)
sh_winter_label = np.ones(50)  # Can ski

# Southern Hemisphere summer (Dec-Feb)
sh_summer_months = np.concatenate([np.random.uniform(1, 2.5, 25),
                                   np.random.uniform(11.5, 12.5, 25)])
sh_summer_lat = np.random.uniform(-90, -30, 50)
sh_summer_label = np.zeros(50)  # Cannot ski

# Equator (never ski)
equator_months = np.random.uniform(1, 12, 50)
equator_lat = np.random.uniform(-20, 20, 50)
equator_label = np.zeros(50)  # Cannot ski

# Combine all data
X_ski = np.column_stack([
    np.concatenate([nh_winter_months, nh_summer_months, sh_winter_months, 
                    sh_summer_months, equator_months]),
    np.concatenate([nh_winter_lat, nh_summer_lat, sh_winter_lat,
                    sh_summer_lat, equator_lat])
])
y_ski = np.concatenate([nh_winter_label, nh_summer_label, sh_winter_label,
                        sh_summer_label, equator_label])

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Raw data
axes[0].scatter(X_ski[y_ski==0, 0], X_ski[y_ski==0, 1], 
               c='red', marker='x', s=50, alpha=0.6, label='Cannot Ski')
axes[0].scatter(X_ski[y_ski==1, 0], X_ski[y_ski==1, 1], 
               c='blue', marker='o', s=50, alpha=0.6, label='Can Ski')
axes[0].set_xlabel('Month (1=Jan, 12=Dec)', fontsize=12)
axes[0].set_ylabel('Latitude (degrees)', fontsize=12)
axes[0].set_title('Skiing Dataset from Lecture\n(Non-linearly Separable)', 
                 fontsize=13, fontweight='bold')
axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5, label='Equator')
axes[0].axhline(y=30, color='lightblue', linestyle='--', alpha=0.5)
axes[0].axhline(y=-30, color='lightblue', linestyle='--', alpha=0.5)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)
axes[0].set_xlim(0, 13)
axes[0].set_ylim(-100, 100)

# Train decision tree
dt_ski = DecisionTreeClassifier(max_depth=4, random_state=42)
dt_ski.fit(X_ski, y_ski)

# Decision boundary
h = 0.1
x_min, x_max = 0, 13
y_min, y_max = -100, 100
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = dt_ski.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

axes[1].contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
axes[1].scatter(X_ski[y_ski==0, 0], X_ski[y_ski==0, 1], 
               c='red', marker='x', s=50, label='Cannot Ski')
axes[1].scatter(X_ski[y_ski==1, 0], X_ski[y_ski==1, 1], 
               c='blue', marker='o', s=50, edgecolors='k', label='Can Ski')
axes[1].set_xlabel('Month', fontsize=12)
axes[1].set_ylabel('Latitude', fontsize=12)
axes[1].set_title(f'Decision Tree (max_depth=4)\nAccuracy: {dt_ski.score(X_ski, y_ski):.3f}',
                 fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim(0, 13)
axes[1].set_ylim(-100, 100)

plt.tight_layout()
plt.show()

print("\nObservations:")
print("- Decision tree naturally partitions space into rectangular regions")
print("- Captures Northern Hemisphere winter (high latitude, early/late months)")
print("- Captures Southern Hemisphere winter (low latitude, middle months)")
print("- Linear classifier would fail miserably on this data!")

10.2 Visualizing the Tree Structure

Let's see the "20 Questions" the tree is asking.

# Visualize tree structure
plt.figure(figsize=(20, 10))
plot_tree(dt_ski, 
          feature_names=['Month', 'Latitude'],
          class_names=['Cannot Ski', 'Can Ski'],
          filled=True,
          fontsize=10,
          rounded=True)
plt.title('Decision Tree Structure: Playing 20 Questions', 
         fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

print("\nHow to read this tree:")
print("- Start at the top (root node)")
print("- Each box asks a yes/no question about a feature")
print("- Left branch = True, Right branch = False")
print("- Leaf nodes (bottom) show final prediction")
print("- Color intensity shows class probability")

10.3 Comparing Loss Functions

Let's verify the lecture's claim that cross-entropy is more sensitive than misclassification loss.

def misclassification_loss(p_c):
    """L = 1 - max(p_c)"""
    return 1 - np.max(p_c)

def cross_entropy_loss(p_c):
    """L = -Σ p_c log2(p_c)  (base-2 log, so units are bits)"""
    # Avoid log(0)
    p_c = np.clip(p_c, 1e-10, 1)
    return -np.sum(p_c * np.log2(p_c))

def gini_impurity(p_c):
    """L = 1 - Σ p_c²"""
    return 1 - np.sum(p_c ** 2)

# Example from lecture
print("Example from Lecture:")
print("Parent: 900 pos, 100 neg")
p_parent = np.array([0.9, 0.1])

print("\nSplit 1: (700 pos, 100 neg) + (200 pos, 0 neg)")
p_split1_left = np.array([700/800, 100/800])
p_split1_right = np.array([1.0, 0.0])

print("\nSplit 2: (400 pos, 100 neg) + (500 pos, 0 neg)")
p_split2_left = np.array([400/500, 100/500])
p_split2_right = np.array([1.0, 0.0])

# Compute weighted losses
print("\n" + "="*60)
print("MISCLASSIFICATION LOSS:")
print("="*60)
parent_misclass = misclassification_loss(p_parent)
split1_misclass = (800/1000) * misclassification_loss(p_split1_left) + \
                  (200/1000) * misclassification_loss(p_split1_right)
split2_misclass = (500/1000) * misclassification_loss(p_split2_left) + \
                  (500/1000) * misclassification_loss(p_split2_right)

print(f"Parent loss:        {parent_misclass:.4f}")
print(f"Split 1 loss:       {split1_misclass:.4f}  (reduction: {parent_misclass - split1_misclass:.4f})")
print(f"Split 2 loss:       {split2_misclass:.4f}  (reduction: {parent_misclass - split2_misclass:.4f})")
print("\n⚠️  PROBLEM: Both splits show SAME reduction! Cannot distinguish better split.")

print("\n" + "="*60)
print("CROSS-ENTROPY LOSS:")
print("="*60)
parent_ce = cross_entropy_loss(p_parent)
split1_ce = (800/1000) * cross_entropy_loss(p_split1_left) + \
            (200/1000) * cross_entropy_loss(p_split1_right)
split2_ce = (500/1000) * cross_entropy_loss(p_split2_left) + \
            (500/1000) * cross_entropy_loss(p_split2_right)

print(f"Parent loss:        {parent_ce:.4f}")
print(f"Split 1 loss:       {split1_ce:.4f}  (reduction: {parent_ce - split1_ce:.4f})")
print(f"Split 2 loss:       {split2_ce:.4f}  (reduction: {parent_ce - split2_ce:.4f})")
print("\n✓ BETTER: Split 2 shows LARGER reduction! Correctly prefers purer split.")

print("\n" + "="*60)
print("GINI IMPURITY:")
print("="*60)
parent_gini = gini_impurity(p_parent)
split1_gini = (800/1000) * gini_impurity(p_split1_left) + \
              (200/1000) * gini_impurity(p_split1_right)
split2_gini = (500/1000) * gini_impurity(p_split2_left) + \
              (500/1000) * gini_impurity(p_split2_right)

print(f"Parent loss:        {parent_gini:.4f}")
print(f"Split 1 loss:       {split1_gini:.4f}  (reduction: {parent_gini - split1_gini:.4f})")
print(f"Split 2 loss:       {split2_gini:.4f}  (reduction: {parent_gini - split2_gini:.4f})")
print("\n✓ BETTER: Split 2 shows LARGER reduction! Also works well.")

10.4 Effect of Tree Depth

Decision trees can overfit if grown too deep (high variance problem).

# Generate moon dataset
X_moon, y_moon = make_moons(n_samples=200, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_moon, y_moon, 
                                                     test_size=0.3, random_state=42)

# Try different depths
depths = [2, 4, 8, None]  # None = unlimited
fig, axes = plt.subplots(2, 4, figsize=(20, 10))

for idx, max_depth in enumerate(depths):
    # Train
    dt = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    dt.fit(X_train, y_train)
    
    # Accuracies
    train_acc = dt.score(X_train, y_train)
    test_acc = dt.score(X_test, y_test)
    
    # Decision boundary
    h = 0.02
    x_min, x_max = X_moon[:, 0].min() - 0.5, X_moon[:, 0].max() + 0.5
    y_min, y_max = X_moon[:, 1].min() - 0.5, X_moon[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = dt.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot decision boundary
    axes[0, idx].contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
    axes[0, idx].scatter(X_train[y_train==0, 0], X_train[y_train==0, 1],
                        c='red', marker='x', s=50, alpha=0.7)
    axes[0, idx].scatter(X_train[y_train==1, 0], X_train[y_train==1, 1],
                        c='blue', marker='o', s=50, edgecolors='k', alpha=0.7)
    
    depth_str = f"{max_depth}" if max_depth else "∞ (Unlimited)"
    if max_depth == 2:
        title = f"Depth = {depth_str}\n(Underfit)\nTrain: {train_acc:.3f}, Test: {test_acc:.3f}"
    elif max_depth is None:
        title = f"Depth = {depth_str}\n(Overfit)\nTrain: {train_acc:.3f}, Test: {test_acc:.3f}"
    else:
        title = f"Depth = {depth_str}\nTrain: {train_acc:.3f}, Test: {test_acc:.3f}"
    
    axes[0, idx].set_title(title, fontsize=11, fontweight='bold')
    axes[0, idx].set_xlabel('Feature 1')
    axes[0, idx].set_ylabel('Feature 2')
    axes[0, idx].grid(True, alpha=0.3)
    
    # Plot tree structure (smaller)
    plot_tree(dt, ax=axes[1, idx], filled=True, fontsize=7, 
             class_names=['Class 0', 'Class 1'], rounded=True)
    axes[1, idx].set_title(f'Tree Structure (Depth={depth_str})', 
                          fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nKey Observations:")
print("- Depth 2: Underfits (too simple, high bias)")
print("- Depth 4-8: Good balance")
print("- Unlimited depth: Overfits (memorizes training data, high variance)")
print("- Gap between train/test accuracy indicates overfitting")