Chapter 8: Tree-Based Methods
Overview
Tree-based methods partition the feature space into simple regions and make predictions based on the mean or mode of the training observations in each region.
Methods Covered
1. Decision Trees
Single tree: Simple, interpretable
CART algorithm: Recursive binary splitting
Pruning: Control complexity
2. Bagging (Bootstrap Aggregating)
Build multiple trees on bootstrap samples
Average predictions (regression) or vote (classification)
Reduces variance
3. Random Forests
Bagging + random feature selection
Decorrelates trees
Often best out-of-the-box performance
4. Boosting
Sequential: Each tree corrects previous trees
AdaBoost: Reweight misclassified samples
Gradient Boosting: Fit residuals
XGBoost, LightGBM: Modern implementations
Key Advantages
✅ Interpretability (single trees)
✅ Handle non-linearity naturally
✅ No feature scaling needed
✅ Automatic feature interactions
✅ Missing value handling
✅ Feature importance built-in
✅ Classification and regression
Key Disadvantages
❌ High variance (single trees)
❌ Lack of smoothness (piecewise-constant predictions)
❌ Poor prediction accuracy (single trees)
❌ Limited interpretability of ensembles (forests, boosting)
8.1 Decision Trees
The Algorithm: Recursive Binary Splitting
For Regression:
Find the best split: minimize the RSS of the resulting regions
Split the data into two regions
Repeat for each region
Stop when a criterion is met (min samples, max depth)
The objective is to minimize
$$\text{RSS} = \sum_{j=1}^{J} \sum_{i:\, x_i \in R_j} (y_i - \hat{y}_{R_j})^2$$
where \(\hat{y}_{R_j}\) is the mean of the training observations in region \(R_j\)
For Classification: Minimize Gini impurity or entropy:
$$G_m = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}), \qquad D_m = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$$
where \(\hat{p}_{mk}\) is the proportion of class \(k\) in region \(m\)
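The regression split search described above can be sketched in a few lines: a brute-force scan over candidate thresholds for a single feature, keeping the split that minimizes total RSS. This is a teaching sketch, not how library implementations do it (they use incremental statistics for speed):

```python
# Brute-force best binary split for one regression feature:
# try every midpoint between sorted values, keep the RSS-minimizing one.
import numpy as np

def best_split_1d(x, y):
    """Return (threshold, rss) of the RSS-minimizing binary split."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_t, best_rss = None, np.inf
    for i in range(1, len(x_sorted)):
        t = (x_sorted[i - 1] + x_sorted[i]) / 2      # candidate threshold
        left, right = y_sorted[:i], y_sorted[i:]
        # RSS = sum of squared deviations from each region's mean
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_t, best_rss = t, rss
    return best_t, best_rss

# Two clusters: the best split should fall between x=3 and x=10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
t, rss = best_split_1d(x, y)
print(t)  # 6.5 (midpoint between the clusters)
```

Recursive binary splitting simply applies this search to every feature in every region, then recurses on the two children.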
Tree Pruning
Problem: Large trees overfit
Solution: Cost complexity pruning
Minimize:
$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$$
where \(|T|\) is the number of terminal nodes and \(\alpha\) controls the tradeoff
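scikit-learn exposes this criterion through the `ccp_alpha` parameter. A minimal sketch on synthetic data (not the chapter's exact setup): compute the pruning path of effective alphas, then refit with increasing `ccp_alpha` and watch the tree shrink.

```python
# Cost-complexity pruning with scikit-learn's ccp_alpha parameter.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# The pruning path lists the effective alpha at which each subtree is removed
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0).fit(X, y)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}")
```

In practice, `ccp_alpha` is chosen by cross-validation over the values in `path.ccp_alphas`.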
# Regression Tree Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
n = 100
X = np.linspace(0, 10, n).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.randn(n) * 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Compare different tree depths
depths = [2, 4, 6, 10]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()
X_plot = np.linspace(0, 10, 300).reshape(-1, 1)

for idx, depth in enumerate(depths):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    y_pred = tree.predict(X_plot)
    train_score = tree.score(X_train, y_train)
    test_score = tree.score(X_test, y_test)

    axes[idx].scatter(X_train, y_train, alpha=0.5, label='Train')
    axes[idx].scatter(X_test, y_test, alpha=0.5, label='Test', color='orange')
    axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label='Tree')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')
    axes[idx].set_title(f'Depth={depth}\nTrain R²={train_score:.3f}, Test R²={test_score:.3f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("  • Depth 2: Underfits (too simple)")
print("  • Depth 4-6: Good balance")
print("  • Depth 10: Overfits (train R² high, test R² lower)")
print("  • Trees create piecewise constant predictions")
# Visualize tree structure
from sklearn.tree import plot_tree

tree_viz = DecisionTreeRegressor(max_depth=3, random_state=42)
tree_viz.fit(X_train, y_train)

plt.figure(figsize=(16, 8))
plot_tree(tree_viz, filled=True, feature_names=['X'],
          rounded=True, fontsize=10)
plt.title('Decision Tree Structure (max_depth=3)', fontsize=14)
plt.show()

print("\n📊 Tree Information:")
print(f"Number of leaves: {tree_viz.get_n_leaves()}")
print(f"Max depth: {tree_viz.get_depth()}")
print(f"Train R²: {tree_viz.score(X_train, y_train):.4f}")
print(f"Test R²: {tree_viz.score(X_test, y_test):.4f}")
# Classification Tree Example
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_class, y_class = make_classification(n_samples=200, n_features=2,
                                       n_informative=2, n_redundant=0,
                                       n_clusters_per_class=1, random_state=42)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_class, y_class, test_size=0.3, random_state=42)

# Fit tree
tree_class = DecisionTreeClassifier(max_depth=4, random_state=42)
tree_class.fit(X_train_c, y_train_c)

# Create decision boundary plot
h = 0.02
x_min, x_max = X_class[:, 0].min() - 1, X_class[:, 0].max() + 1
y_min, y_max = X_class[:, 1].min() - 1, X_class[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = tree_class.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
plt.scatter(X_train_c[:, 0], X_train_c[:, 1], c=y_train_c,
            edgecolors='k', marker='o', s=100, cmap='RdYlBu', label='Train')
plt.scatter(X_test_c[:, 0], X_test_c[:, 1], c=y_test_c,
            edgecolors='k', marker='s', s=100, cmap='RdYlBu', label='Test')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title(f'Decision Tree Classification (max_depth=4)\n'
          f'Train Acc: {tree_class.score(X_train_c, y_train_c):.3f}, '
          f'Test Acc: {tree_class.score(X_test_c, y_test_c):.3f}')
plt.legend()
plt.show()

print("\n💡 Note:")
print("  • Decision boundaries are axis-aligned (rectangular regions)")
print("  • Cannot capture diagonal or curved boundaries well")
8.2 Bagging (Bootstrap Aggregating)
The Problem
Decision trees have high variance: small changes in the data → large changes in the tree
The Solution: Bagging
Generate B bootstrap samples from the training data
Train a tree on each bootstrap sample
Average the predictions (regression) or take a majority vote (classification)
Why It Works
Variance reduction through averaging:
If we have B independent models, each with variance \(\sigma^2\): $$\text{Var}(\text{average}) = \frac{\sigma^2}{B}$$
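This identity is easy to check numerically for truly independent draws (bagged trees are correlated, so the real-world reduction is smaller, which motivates random forests below):

```python
# Numerical check of Var(average) = sigma^2 / B for independent estimators.
import numpy as np

rng = np.random.default_rng(42)
sigma, B, trials = 2.0, 25, 100_000

draws = rng.normal(0, sigma, size=(trials, B))  # B independent "models" per trial
averages = draws.mean(axis=1)                   # the "bagged" prediction

print(averages.var())   # empirical variance of the average, approx. 0.16
print(sigma**2 / B)     # theoretical sigma^2 / B = 4 / 25 = 0.16
```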
Out-of-Bag (OOB) Error
Each bootstrap sample uses ~63% of the data
The remaining ~37% can be used for validation
OOB error ≈ cross-validation error
No need for a separate validation set!
Key Parameters
n_estimators: Number of trees (B)
max_features: Features to consider at each split
max_depth: Tree depth
# Bagging vs Single Tree
from sklearn.ensemble import BaggingRegressor

# Compare single tree vs bagging with different numbers of trees
single_tree = DecisionTreeRegressor(random_state=42)
single_tree.fit(X_train, y_train)

n_estimators_list = [5, 10, 50, 100]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, n_est in enumerate(n_estimators_list):
    bagging = BaggingRegressor(
        estimator=DecisionTreeRegressor(),
        n_estimators=n_est,
        random_state=42
    )
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_plot)

    axes[idx].scatter(X_train, y_train, alpha=0.3, label='Data')
    axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label=f'Bagging ({n_est} trees)')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')
    axes[idx].set_title(f'Bagging: {n_est} trees\n'
                        f'Test R²: {bagging.score(X_test, y_test):.3f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Comparison:\n")
print(f"Single Tree Test R²: {single_tree.score(X_test, y_test):.4f}")
for n_est in n_estimators_list:
    bagging = BaggingRegressor(
        estimator=DecisionTreeRegressor(),
        n_estimators=n_est,
        random_state=42
    )
    bagging.fit(X_train, y_train)
    print(f"Bagging ({n_est:3d} trees) Test R²: {bagging.score(X_test, y_test):.4f}")

print("\n💡 Key Insight: More trees → smoother predictions, better performance")
8.3 Random Forests
Problem with Bagging
Bootstrapped trees are correlated (they all see the same features)
If one strong predictor exists, most trees will split on it first
The trees become similar
Less variance reduction from averaging
Random Forest Solution
Decorrelate the trees by randomly selecting features at each split.
Algorithm:
Generate B bootstrap samples
For each tree, at each split, randomly select m features from the p total and choose the best split among these m features only
Typical choice: \(m = \sqrt{p}\) (classification) or \(m = p/3\) (regression)
Average the predictions
Feature Importance
Measure how much each feature decreases impurity (Gini or MSE)
Sum over all trees and splits
Normalize so the importances sum to 100%
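Impurity-based importances can favor high-cardinality features, so a common complementary check is permutation importance on held-out data. A sketch on synthetic data (not the chapter's dataset):

```python
# Permutation importance: shuffle one feature at a time on held-out data
# and measure the drop in score — a model-agnostic importance measure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```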
Advantages
✅ Excellent performance out-of-the-box
✅ Handles high-dimensional data
✅ Feature importance built-in
✅ OOB error for free
✅ Parallel training
✅ Robust to overfitting
Typical Hyperparameters
n_estimators: 100-1000 (more is better, with diminishing returns)
max_features: sqrt(p) or p/3
max_depth: often unlimited
min_samples_split: 2-10
min_samples_leaf: 1-5
# Random Forest Example
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Load breast cancer dataset
data = load_breast_cancer()
X_bc, y_bc = data.data, data.target
feature_names = data.feature_names
X_train_bc, X_test_bc, y_train_bc, y_test_bc = train_test_split(
    X_bc, y_bc, test_size=0.3, random_state=42)

# Compare models
models = {
    'Single Tree': DecisionTreeClassifier(random_state=42),
    'Bagging': BaggingClassifier(n_estimators=100, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

results = {}
for name, model in models.items():
    model.fit(X_train_bc, y_train_bc)
    train_acc = model.score(X_train_bc, y_train_bc)
    test_acc = model.score(X_test_bc, y_test_bc)
    results[name] = {'train': train_acc, 'test': test_acc}

# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
model_names = list(results.keys())
train_accs = [results[m]['train'] for m in model_names]
test_accs = [results[m]['test'] for m in model_names]
x = np.arange(len(model_names))
width = 0.35

ax1.bar(x - width/2, train_accs, width, label='Train', alpha=0.8)
ax1.bar(x + width/2, test_accs, width, label='Test', alpha=0.8)
ax1.set_ylabel('Accuracy')
ax1.set_title('Model Comparison on Breast Cancer Dataset')
ax1.set_xticks(x)
ax1.set_xticklabels(model_names)
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')

# Feature importance from Random Forest
rf = models['Random Forest']
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1][:10]  # Top 10

ax2.barh(range(10), importances[indices], alpha=0.8)
ax2.set_yticks(range(10))
ax2.set_yticklabels([feature_names[i] for i in indices])
ax2.set_xlabel('Importance')
ax2.set_title('Top 10 Feature Importances (Random Forest)')
ax2.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n📊 Results:\n")
print(f"{'Model':<20} {'Train Acc':<12} {'Test Acc':<12} {'Overfit Gap'}")
print("=" * 60)
for name in model_names:
    train = results[name]['train']
    test = results[name]['test']
    gap = train - test
    marker = '✅' if test == max(test_accs) else ''
    print(f"{name:<20} {train:>10.4f} {test:>10.4f} {gap:>10.4f} {marker}")

print("\n💡 Observations:")
print("  • Single tree: High variance (overfits)")
print("  • Bagging: Reduces variance significantly")
print("  • Random Forest: Best test performance (decorrelated trees)")
8.4 Boosting
Key Difference from Bagging/RF
Bagging/RF: build trees independently, in parallel
Boosting: build trees sequentially, each correcting the previous trees' errors
AdaBoost (Adaptive Boosting)
For Classification:
Start with equal weights for all samples
Train a weak learner (a shallow tree)
Increase the weights of misclassified samples
Repeat: the next tree focuses on the hard examples
Combine the trees with a weighted vote
Gradient Boosting
For Regression and Classification:
Initialize with a simple model (a constant)
Compute the residuals (errors)
Fit a tree to predict the residuals
Add the tree to the ensemble (scaled by the learning rate)
Repeat: each tree fits the residuals of the current ensemble
Update rule: $$f_m(x) = f_{m-1}(x) + \nu \cdot h_m(x)$$
where:
\(f_m\) = ensemble after m trees
\(\nu\) = learning rate (shrinkage)
\(h_m\) = new tree fit to the residuals
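The update rule can be written out directly for squared-error regression, where the residuals are exactly the negative gradient. A teaching sketch on synthetic data, not a production implementation:

```python
# Gradient boosting from scratch for squared-error regression:
# start from the mean, then repeatedly fit a shallow tree to the
# residuals and add nu times its prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)

nu, n_trees = 0.1, 100
f0 = y.mean()                      # initial constant model
pred = np.full_like(y, f0)
trees = []

for _ in range(n_trees):
    residuals = y - pred           # negative gradient of squared error
    h = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    pred += nu * h.predict(X)      # f_m = f_{m-1} + nu * h_m
    trees.append(h)

def boosted_predict(X_new):
    """Sum the initial constant and all shrunken tree contributions."""
    return f0 + nu * sum(t.predict(X_new) for t in trees)

print(np.mean((y - pred) ** 2))    # training MSE shrinks as trees are added
```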
Key Hyperparameters
n_estimators: number of boosting stages (50-500)
learning_rate: shrinkage (0.01-0.3)
Lower rate → more trees needed
Typical: 0.1
max_depth: tree depth (1-10)
Boosting works with shallow trees!
Often 3-5
subsample: fraction of samples used for each tree (0.5-1.0)
Regularization
Shrinkage (learning_rate): Slow learning prevents overfitting
Subsampling: Randomness helps generalization
Tree constraints: max_depth, min_samples_split
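One way to watch these regularizers at work: scikit-learn's `staged_predict` yields the ensemble's predictions after each boosting stage, so held-out error can be tracked as trees are added. A sketch on synthetic data:

```python
# Track held-out MSE stage by stage with staged_predict to see where
# a shrunken, subsampled ensemble stops improving.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gb = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                               max_depth=3, subsample=0.8,
                               random_state=0).fit(X_tr, y_tr)

# One test-set MSE per boosting stage
test_mse = [np.mean((y_te - p) ** 2) for p in gb.staged_predict(X_te)]
best_stage = int(np.argmin(test_mse)) + 1
print(best_stage, min(test_mse))
```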
# Gradient Boosting Example
from sklearn.ensemble import GradientBoostingRegressor

# Compare different learning rates and n_estimators
learning_rates = [0.01, 0.1, 0.3]
n_estimators_range = range(10, 201, 10)

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, lr in enumerate(learning_rates):
    train_scores = []
    test_scores = []
    for n_est in n_estimators_range:
        gb = GradientBoostingRegressor(
            n_estimators=n_est,
            learning_rate=lr,
            max_depth=3,
            random_state=42
        )
        gb.fit(X_train, y_train)
        train_scores.append(gb.score(X_train, y_train))
        test_scores.append(gb.score(X_test, y_test))

    axes[idx].plot(n_estimators_range, train_scores, label='Train', linewidth=2)
    axes[idx].plot(n_estimators_range, test_scores, label='Test', linewidth=2)
    axes[idx].set_xlabel('Number of Trees')
    axes[idx].set_ylabel('R²')
    axes[idx].set_title(f'Gradient Boosting\nLearning Rate = {lr}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

    best_n = n_estimators_range[np.argmax(test_scores)]
    best_score = max(test_scores)
    axes[idx].axvline(best_n, color='r', linestyle='--', alpha=0.5)
    axes[idx].text(best_n, 0.5, f'Best: {best_n}\nR²={best_score:.3f}',
                   fontsize=9, ha='left')

plt.tight_layout()
plt.show()

print("\n💡 Learning Rate Effects:")
print("  • lr = 0.01: Slow learning, needs many trees, less overfitting")
print("  • lr = 0.1: Good balance (common choice)")
print("  • lr = 0.3: Fast learning, fewer trees, more overfitting risk")
print("\n  Rule: Lower learning rate → more trees needed")
# Compare all ensemble methods
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

ensemble_models = {
    'Bagging': BaggingClassifier(
        n_estimators=100, random_state=42),
    'Random Forest': RandomForestClassifier(
        n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(
        n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
}

ensemble_results = {}
for name, model in ensemble_models.items():
    model.fit(X_train_bc, y_train_bc)
    train_acc = model.score(X_train_bc, y_train_bc)
    test_acc = model.score(X_test_bc, y_test_bc)
    # Cross-validation
    cv_scores = cross_val_score(model, X_bc, y_bc, cv=5)
    ensemble_results[name] = {
        'train': train_acc,
        'test': test_acc,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std()
    }

# Visualize comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Train vs Test accuracy
names = list(ensemble_results.keys())
train_acc_list = [ensemble_results[n]['train'] for n in names]
test_acc_list = [ensemble_results[n]['test'] for n in names]
x = np.arange(len(names))
width = 0.35

ax1.bar(x - width/2, train_acc_list, width, label='Train', alpha=0.8)
ax1.bar(x + width/2, test_acc_list, width, label='Test', alpha=0.8)
ax1.set_ylabel('Accuracy')
ax1.set_title('Ensemble Methods Comparison')
ax1.set_xticks(x)
ax1.set_xticklabels(names, rotation=15, ha='right')
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')
ax1.set_ylim([0.9, 1.0])

# CV scores with error bars
cv_means = [ensemble_results[n]['cv_mean'] for n in names]
cv_stds = [ensemble_results[n]['cv_std'] for n in names]

ax2.barh(range(len(names)), cv_means, xerr=cv_stds, alpha=0.8, capsize=5)
ax2.set_yticks(range(len(names)))
ax2.set_yticklabels(names)
ax2.set_xlabel('Cross-Validation Accuracy')
ax2.set_title('5-Fold CV Performance')
ax2.grid(True, alpha=0.3, axis='x')
ax2.set_xlim([0.92, 0.98])

plt.tight_layout()
plt.show()

# Print detailed results
print("\n📊 Detailed Comparison:\n")
print(f"{'Method':<20} {'Train Acc':<12} {'Test Acc':<12} {'CV Mean±Std':<20} {'Overfit'}")
print("=" * 85)
for name in names:
    r = ensemble_results[name]
    overfit = r['train'] - r['test']
    cv_str = f"{r['cv_mean']:.4f}±{r['cv_std']:.4f}"
    marker = '✅' if r['test'] == max(test_acc_list) else ''
    print(f"{name:<20} {r['train']:>10.4f} {r['test']:>10.4f} {cv_str:<20} {overfit:>7.4f} {marker}")
Key Takeaways
Method Comparison

| Method | Pros | Cons | Best For |
|---|---|---|---|
| Single Tree | Interpretable, fast | High variance, poor accuracy | Exploration, interpretability |
| Bagging | Reduces variance, stable | Less interpretable | General purpose |
| Random Forest | Excellent performance, robust | Black box, slower | Default choice for tabular data |
| AdaBoost | Good with weak learners | Sensitive to noise | Clean data with few outliers |
| Gradient Boosting | Often best performance | Prone to overfitting, needs tuning | Competitions, best accuracy |
When to Use Each
Decision Trees:
Need interpretability
Quick baseline
Exploratory analysis
Embedded in other methods
Random Forest:
Default choice for tabular data ✅
Need feature importance
Want robust out-of-the-box performance
Don't want to tune many hyperparameters
Gradient Boosting:
Squeezing out the last % of performance
Kaggle competitions
Have time for hyperparameter tuning
Structured/tabular data
Hyperparameter Guidelines
Random Forest:
RandomForestClassifier(
    n_estimators=100,      # More is better (diminishing returns)
    max_features='sqrt',   # sqrt(p) for classification
    max_depth=None,        # Grow full trees
    min_samples_split=2,   # Default usually good
    min_samples_leaf=1     # Default usually good
)
Gradient Boosting:
GradientBoostingClassifier(
    n_estimators=100,      # Start with 100
    learning_rate=0.1,     # Lower if overfitting
    max_depth=3,           # Shallow trees! (3-5)
    subsample=0.8,         # Stochastic gradient boosting
    min_samples_split=2
)
Tuning Strategy
Random Forest (Less Tuning Needed):
Set n_estimators=100-500
Try max_features in ['sqrt', 'log2', 0.5]
Adjust max_depth if overfitting
Gradient Boosting (More Tuning Needed):
Fix learning_rate=0.1, max_depth=3
Tune n_estimators with early stopping
Lower learning_rate and increase n_estimators together
Tune max_depth, min_samples_split
Add subsample for regularization
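scikit-learn's gradient boosting estimators have built-in early stopping via `n_iter_no_change` and `validation_fraction`, which automates the n_estimators tuning step above. A sketch on synthetic data:

```python
# Early stopping: hold out a validation fraction of the training data and
# stop boosting once the validation score stops improving.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=600, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=1000,        # an upper bound, not the final count
    learning_rate=0.1,
    max_depth=3,
    n_iter_no_change=10,      # stop after 10 stagnant validation rounds
    validation_fraction=0.2,
    random_state=0,
).fit(X, y)

print(gb.n_estimators_)       # actual number of stages used
```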
Common Pitfalls
❌ Using deep trees in boosting → overfitting
❌ Too high a learning rate in boosting → unstable
❌ Not using OOB error in RF → wasting free validation
❌ Comparing train accuracy → use test or CV!
❌ Not checking feature importance → missing insights
❌ Forgetting to set random_state → irreproducible results
Best Practices
✅ Start with Random Forest - usually an excellent baseline
✅ Use cross-validation - don't rely on a single train/test split
✅ Check feature importance - understand what the model learns
✅ Monitor train vs test - watch for overfitting
✅ Use OOB error - free validation for RF
✅ Ensemble different types - combine RF + GB for best results
Production Considerations
Random Forest:
✅ Parallelizable (fast prediction)
✅ Robust (less tuning in production)
❌ Large memory footprint (many trees)
Gradient Boosting:
✅ Smaller models (fewer trees)
✅ Often best accuracy
❌ Sequential prediction (slower)
❌ More sensitive to input changes
Modern Variants
Consider these for production:
XGBoost: optimized gradient boosting
LightGBM: fast, efficient, handles categorical features
CatBoost: great for categorical features
Next Chapter
Chapter 9: Support Vector Machines
Maximal Margin Classifier
Support Vector Classifiers
Support Vector Machines (Kernels)
Multi-class SVMs
Practice Exercises
Exercise 1: Tree Depth Analysis
Generate regression data with true polynomial relationship
Fit trees with depths 1-20
Plot train MSE, test MSE, and tree complexity
Find optimal depth via cross-validation
Exercise 2: Bagging Convergence
Fit bagging with n_estimators from 1 to 200
Plot test error vs number of trees
Identify when performance plateaus
Calculate OOB error and compare to test error
Exercise 3: Feature Importance
Using a real dataset (e.g., housing prices):
Fit Random Forest
Extract and visualize feature importances
Retrain with only top 5 features
Compare performance
Exercise 4: Boosting Hyperparameters
Grid search over:
learning_rate: [0.01, 0.05, 0.1, 0.2]
n_estimators: [50, 100, 200]
max_depth: [2, 3, 4, 5]
Find best combination and visualize trade-offs
Exercise 5: Ensemble Comparison
On a classification dataset:
Implement: Single Tree, Bagging, RF, AdaBoost, GB
Compare accuracy, training time, prediction time
Plot ROC curves for all methods
Analyze confusion matrices
Recommend best method with justification
Exercise 6: Overfitting Investigation
Create small dataset (n=100)
Fit RF and GB with varying complexity
Plot learning curves (train/test vs training size)
Identify overfitting signatures
Apply regularization to mitigate