Chapter 8: Tree-Based Methods
Overview
Tree-based methods partition the feature space into simple regions and make predictions based on the mean or mode of the training observations in each region.
Methods Covered
1. Decision Trees
Single tree: Simple, interpretable
CART algorithm: Recursive binary splitting
Pruning: Control complexity
2. Bagging (Bootstrap Aggregating)
Build multiple trees on bootstrap samples
Average predictions (regression) or vote (classification)
Reduces variance
3. Random Forests
Bagging + random feature selection
Decorrelates trees
Often best out-of-the-box performance
4. Boosting
Sequential: Each tree corrects previous trees
AdaBoost: Reweight misclassified samples
Gradient Boosting: Fit residuals
XGBoost, LightGBM: Modern implementations
Key Advantages
✅ Interpretability (single trees)
✅ Handle non-linearity naturally
✅ No feature scaling needed
✅ Automatic feature interactions
✅ Missing value handling
✅ Feature importance built-in
✅ Classification and regression
Key Disadvantages
❌ High variance (single trees)
❌ Lack of smoothness (piecewise-constant predictions)
❌ Poor prediction accuracy (single trees)
❌ Limited interpretability of ensembles (forests, boosting)
8.1 Decision Trees
The Algorithm: Recursive Binary Splitting
For Regression:
Find the best split: minimize the RSS of the resulting regions
Split the data into two regions
Repeat for each region
Stop when a criterion is met (min samples, max depth)
The objective is to minimize
$$\text{RSS} = \sum_{j=1}^{J} \sum_{i:\, x_i \in R_j} (y_i - \hat{y}_{R_j})^2$$
where \(\hat{y}_{R_j}\) is the mean of the training observations in region \(R_j\)
For Classification: Minimize Gini impurity or entropy:
$$G_m = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}), \qquad D_m = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$$
where \(\hat{p}_{mk}\) is the proportion of class \(k\) in region \(m\)
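The regression split search described above can be sketched in a few lines: a brute-force scan over candidate thresholds for a single feature, keeping the split that minimizes total RSS. This is a teaching sketch, not how library implementations do it (they use incremental statistics for speed):

```python
# Brute-force best binary split for one regression feature:
# try every midpoint between sorted values, keep the RSS-minimizing one.
import numpy as np

def best_split_1d(x, y):
    """Return (threshold, rss) of the RSS-minimizing binary split."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_t, best_rss = None, np.inf
    for i in range(1, len(x_sorted)):
        t = (x_sorted[i - 1] + x_sorted[i]) / 2      # candidate threshold
        left, right = y_sorted[:i], y_sorted[i:]
        # RSS = sum of squared deviations from each region's mean
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_t, best_rss = t, rss
    return best_t, best_rss

# Two clusters: the best split should fall between x=3 and x=10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
t, rss = best_split_1d(x, y)
print(t)  # 6.5 (midpoint between the clusters)
```

Recursive binary splitting simply applies this search to every feature in every region, then recurses on the two children.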
Tree Pruning
Problem: Large trees overfit
Solution: Cost complexity pruning
Minimize:
$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$$
where \(|T|\) is the number of terminal nodes and \(\alpha\) controls the tradeoff
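scikit-learn exposes this criterion through the `ccp_alpha` parameter. A minimal sketch on synthetic data (not the chapter's exact setup): compute the pruning path of effective alphas, then refit with increasing `ccp_alpha` and watch the tree shrink.

```python
# Cost-complexity pruning with scikit-learn's ccp_alpha parameter.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# The pruning path lists the effective alpha at which each subtree is removed
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0).fit(X, y)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}")
```

In practice, `ccp_alpha` is chosen by cross-validation over the values in `path.ccp_alphas`.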
# Regression Tree Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
n = 100
X = np.linspace(0, 10, n).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.randn(n) * 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Compare different tree depths
depths = [2, 4, 6, 10]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()
X_plot = np.linspace(0, 10, 300).reshape(-1, 1)

for idx, depth in enumerate(depths):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    y_pred = tree.predict(X_plot)
    train_score = tree.score(X_train, y_train)
    test_score = tree.score(X_test, y_test)

    axes[idx].scatter(X_train, y_train, alpha=0.5, label='Train')
    axes[idx].scatter(X_test, y_test, alpha=0.5, label='Test', color='orange')
    axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label='Tree')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')
    axes[idx].set_title(f'Depth={depth}\nTrain R²={train_score:.3f}, Test R²={test_score:.3f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("  • Depth 2: Underfits (too simple)")
print("  • Depth 4-6: Good balance")
print("  • Depth 10: Overfits (train R² high, test R² lower)")
print("  • Trees create piecewise constant predictions")
# Visualize tree structure
from sklearn.tree import plot_tree

tree_viz = DecisionTreeRegressor(max_depth=3, random_state=42)
tree_viz.fit(X_train, y_train)

plt.figure(figsize=(16, 8))
plot_tree(tree_viz, filled=True, feature_names=['X'],
          rounded=True, fontsize=10)
plt.title('Decision Tree Structure (max_depth=3)', fontsize=14)
plt.show()

print("\n📊 Tree Information:")
print(f"Number of leaves: {tree_viz.get_n_leaves()}")
print(f"Max depth: {tree_viz.get_depth()}")
print(f"Train R²: {tree_viz.score(X_train, y_train):.4f}")
print(f"Test R²: {tree_viz.score(X_test, y_test):.4f}")
# Classification Tree Example
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_class, y_class = make_classification(n_samples=200, n_features=2,
                                       n_informative=2, n_redundant=0,
                                       n_clusters_per_class=1, random_state=42)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_class, y_class, test_size=0.3, random_state=42)

# Fit tree
tree_class = DecisionTreeClassifier(max_depth=4, random_state=42)
tree_class.fit(X_train_c, y_train_c)

# Create decision boundary plot
h = 0.02
x_min, x_max = X_class[:, 0].min() - 1, X_class[:, 0].max() + 1
y_min, y_max = X_class[:, 1].min() - 1, X_class[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = tree_class.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
plt.scatter(X_train_c[:, 0], X_train_c[:, 1], c=y_train_c,
            edgecolors='k', marker='o', s=100, cmap='RdYlBu', label='Train')
plt.scatter(X_test_c[:, 0], X_test_c[:, 1], c=y_test_c,
            edgecolors='k', marker='s', s=100, cmap='RdYlBu', label='Test')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title(f'Decision Tree Classification (max_depth=4)\n'
          f'Train Acc: {tree_class.score(X_train_c, y_train_c):.3f}, '
          f'Test Acc: {tree_class.score(X_test_c, y_test_c):.3f}')
plt.legend()
plt.show()

print("\n💡 Note:")
print("  • Decision boundaries are axis-aligned (rectangular regions)")
print("  • Cannot capture diagonal or curved boundaries well")
8.2 Bagging (Bootstrap Aggregating)
The Problem
Decision trees have high variance: small changes in the data → large changes in the tree
The Solution: Bagging
Generate B bootstrap samples from the training data
Train a tree on each bootstrap sample
Average the predictions (regression) or take a majority vote (classification)
Why It Works
Variance reduction through averaging:
If we have B independent models, each with variance \(\sigma^2\): $$\text{Var}(\text{average}) = \frac{\sigma^2}{B}$$
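This identity is easy to check numerically for truly independent draws (bagged trees are correlated, so the real-world reduction is smaller, which motivates random forests below):

```python
# Numerical check of Var(average) = sigma^2 / B for independent estimators.
import numpy as np

rng = np.random.default_rng(42)
sigma, B, trials = 2.0, 25, 100_000

draws = rng.normal(0, sigma, size=(trials, B))  # B independent "models" per trial
averages = draws.mean(axis=1)                   # the "bagged" prediction

print(averages.var())   # empirical variance of the average, approx. 0.16
print(sigma**2 / B)     # theoretical sigma^2 / B = 4 / 25 = 0.16
```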
Out-of-Bag (OOB) Error
Each bootstrap sample uses ~63% of the data
The remaining ~37% can be used for validation
OOB error ≈ cross-validation error
No need for a separate validation set!
Key Parameters
n_estimators: Number of trees (B)
max_features: Features to consider at each split
max_depth: Tree depth
# Bagging vs Single Tree
from sklearn.ensemble import BaggingRegressor

# Compare single tree vs bagging with different numbers of trees
single_tree = DecisionTreeRegressor(random_state=42)
single_tree.fit(X_train, y_train)

n_estimators_list = [5, 10, 50, 100]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, n_est in enumerate(n_estimators_list):
    bagging = BaggingRegressor(
        estimator=DecisionTreeRegressor(),
        n_estimators=n_est,
        random_state=42
    )
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_plot)

    axes[idx].scatter(X_train, y_train, alpha=0.3, label='Data')
    axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label=f'Bagging ({n_est} trees)')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')
    axes[idx].set_title(f'Bagging: {n_est} trees\n'
                        f'Test R²: {bagging.score(X_test, y_test):.3f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Comparison:\n")
print(f"Single Tree Test R²: {single_tree.score(X_test, y_test):.4f}")
for n_est in n_estimators_list:
    bagging = BaggingRegressor(
        estimator=DecisionTreeRegressor(),
        n_estimators=n_est,
        random_state=42
    )
    bagging.fit(X_train, y_train)
    print(f"Bagging ({n_est:3d} trees) Test R²: {bagging.score(X_test, y_test):.4f}")

print("\n💡 Key Insight: More trees → smoother predictions, better performance")
8.3 Random Forests
Problem with Bagging
Bootstrapped trees are correlated (they all see the same features)
If one strong predictor exists, most trees will split on it first
The trees become similar
Less variance reduction from averaging
Random Forest Solution
Decorrelate the trees by randomly selecting features at each split.
Algorithm:
Generate B bootstrap samples
For each tree, at each split, randomly select m features from the p total and choose the best split among these m features only
Typical choice: \(m = \sqrt{p}\) (classification) or \(m = p/3\) (regression)
Average the predictions
Feature Importance
Measure how much each feature decreases impurity (Gini or MSE)
Sum over all trees and splits
Normalize so the importances sum to 100%
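Impurity-based importances can favor high-cardinality features, so a common complementary check is permutation importance on held-out data. A sketch on synthetic data (not the chapter's dataset):

```python
# Permutation importance: shuffle one feature at a time on held-out data
# and measure the drop in score — a model-agnostic importance measure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```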
Advantages
✅ Excellent performance out-of-the-box
✅ Handles high-dimensional data
✅ Feature importance built-in
✅ OOB error for free
✅ Parallel training
✅ Robust to overfitting
Typical Hyperparameters
n_estimators: 100-1000 (more is better, with diminishing returns)
max_features: sqrt(p) or p/3
max_depth: often unlimited
min_samples_split: 2-10
min_samples_leaf: 1-5
# Random Forest Example
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Load breast cancer dataset
data = load_breast_cancer()
X_bc, y_bc = data.data, data.target
feature_names = data.feature_names
X_train_bc, X_test_bc, y_train_bc, y_test_bc = train_test_split(
    X_bc, y_bc, test_size=0.3, random_state=42)

# Compare models
models = {
    'Single Tree': DecisionTreeClassifier(random_state=42),
    'Bagging': BaggingClassifier(n_estimators=100, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

results = {}
for name, model in models.items():
    model.fit(X_train_bc, y_train_bc)
    train_acc = model.score(X_train_bc, y_train_bc)
    test_acc = model.score(X_test_bc, y_test_bc)
    results[name] = {'train': train_acc, 'test': test_acc}

# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
model_names = list(results.keys())
train_accs = [results[m]['train'] for m in model_names]
test_accs = [results[m]['test'] for m in model_names]
x = np.arange(len(model_names))
width = 0.35

ax1.bar(x - width/2, train_accs, width, label='Train', alpha=0.8)
ax1.bar(x + width/2, test_accs, width, label='Test', alpha=0.8)
ax1.set_ylabel('Accuracy')
ax1.set_title('Model Comparison on Breast Cancer Dataset')
ax1.set_xticks(x)
ax1.set_xticklabels(model_names)
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')

# Feature importance from Random Forest
rf = models['Random Forest']
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1][:10]  # Top 10

ax2.barh(range(10), importances[indices], alpha=0.8)
ax2.set_yticks(range(10))
ax2.set_yticklabels([feature_names[i] for i in indices])
ax2.set_xlabel('Importance')
ax2.set_title('Top 10 Feature Importances (Random Forest)')
ax2.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n📊 Results:\n")
print(f"{'Model':<20} {'Train Acc':<12} {'Test Acc':<12} {'Overfit Gap'}")
print("=" * 60)
for name in model_names:
    train = results[name]['train']
    test = results[name]['test']
    gap = train - test
    marker = '✅' if test == max(test_accs) else ''
    print(f"{name:<20} {train:>10.4f} {test:>10.4f} {gap:>10.4f} {marker}")

print("\n💡 Observations:")
print("  • Single tree: High variance (overfits)")
print("  • Bagging: Reduces variance significantly")
print("  • Random Forest: Best test performance (decorrelated trees)")
8.4 Boosting
Key Difference from Bagging/RF
Bagging/RF: build trees independently, in parallel
Boosting: build trees sequentially, each correcting the previous trees' errors
AdaBoost (Adaptive Boosting)
For Classification:
Start with equal weights for all samples
Train a weak learner (a shallow tree)
Increase the weights of misclassified samples
Repeat: the next tree focuses on the hard examples
Combine the trees with a weighted vote
Gradient Boosting
For Regression and Classification:
Initialize with a simple model (a constant)
Compute the residuals (errors)
Fit a tree to predict the residuals
Add the tree to the ensemble (scaled by the learning rate)
Repeat: each tree fits the residuals of the current ensemble
Update rule: $$f_m(x) = f_{m-1}(x) + \nu \cdot h_m(x)$$
where:
\(f_m\) = ensemble after m trees
\(\nu\) = learning rate (shrinkage)
\(h_m\) = new tree fit to the residuals
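The update rule can be written out directly for squared-error regression, where the residuals are exactly the negative gradient. A teaching sketch on synthetic data, not a production implementation:

```python
# Gradient boosting from scratch for squared-error regression:
# start from the mean, then repeatedly fit a shallow tree to the
# residuals and add nu times its prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)

nu, n_trees = 0.1, 100
f0 = y.mean()                      # initial constant model
pred = np.full_like(y, f0)
trees = []

for _ in range(n_trees):
    residuals = y - pred           # negative gradient of squared error
    h = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    pred += nu * h.predict(X)      # f_m = f_{m-1} + nu * h_m
    trees.append(h)

def boosted_predict(X_new):
    """Sum the initial constant and all shrunken tree contributions."""
    return f0 + nu * sum(t.predict(X_new) for t in trees)

print(np.mean((y - pred) ** 2))    # training MSE shrinks as trees are added
```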
Key Hyperparameters
n_estimators: number of boosting stages (50-500)
learning_rate: shrinkage (0.01-0.3)
Lower rate → more trees needed
Typical: 0.1
max_depth: tree depth (1-10)
Boosting works with shallow trees!
Often 3-5
subsample: fraction of samples used for each tree (0.5-1.0)
Regularization
Shrinkage (learning_rate): Slow learning prevents overfitting
Subsampling: Randomness helps generalization
Tree constraints: max_depth, min_samples_split
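One way to watch these regularizers at work: scikit-learn's `staged_predict` yields the ensemble's predictions after each boosting stage, so held-out error can be tracked as trees are added. A sketch on synthetic data:

```python
# Track held-out MSE stage by stage with staged_predict to see where
# a shrunken, subsampled ensemble stops improving.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gb = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                               max_depth=3, subsample=0.8,
                               random_state=0).fit(X_tr, y_tr)

# One test-set MSE per boosting stage
test_mse = [np.mean((y_te - p) ** 2) for p in gb.staged_predict(X_te)]
best_stage = int(np.argmin(test_mse)) + 1
print(best_stage, min(test_mse))
```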
# Gradient Boosting Example
from sklearn.ensemble import GradientBoostingRegressor

# Compare different learning rates and n_estimators
learning_rates = [0.01, 0.1, 0.3]
n_estimators_range = range(10, 201, 10)

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, lr in enumerate(learning_rates):
    train_scores = []
    test_scores = []
    for n_est in n_estimators_range:
        gb = GradientBoostingRegressor(
            n_estimators=n_est,
            learning_rate=lr,
            max_depth=3,
            random_state=42
        )
        gb.fit(X_train, y_train)
        train_scores.append(gb.score(X_train, y_train))
        test_scores.append(gb.score(X_test, y_test))

    axes[idx].plot(n_estimators_range, train_scores, label='Train', linewidth=2)
    axes[idx].plot(n_estimators_range, test_scores, label='Test', linewidth=2)
    axes[idx].set_xlabel('Number of Trees')
    axes[idx].set_ylabel('R²')
    axes[idx].set_title(f'Gradient Boosting\nLearning Rate = {lr}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

    best_n = n_estimators_range[np.argmax(test_scores)]
    best_score = max(test_scores)
    axes[idx].axvline(best_n, color='r', linestyle='--', alpha=0.5)
    axes[idx].text(best_n, 0.5, f'Best: {best_n}\nR²={best_score:.3f}',
                   fontsize=9, ha='left')

plt.tight_layout()
plt.show()

print("\n💡 Learning Rate Effects:")
print("  • lr = 0.01: Slow learning, needs many trees, less overfitting")
print("  • lr = 0.1: Good balance (common choice)")
print("  • lr = 0.3: Fast learning, fewer trees, more overfitting risk")
print("\n  Rule: Lower learning rate → more trees needed")
# Compare all ensemble methods
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

ensemble_models = {
    'Bagging': BaggingClassifier(
        n_estimators=100, random_state=42),
    'Random Forest': RandomForestClassifier(
        n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(
        n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
}

ensemble_results = {}
for name, model in ensemble_models.items():
    model.fit(X_train_bc, y_train_bc)
    train_acc = model.score(X_train_bc, y_train_bc)
    test_acc = model.score(X_test_bc, y_test_bc)
    # Cross-validation
    cv_scores = cross_val_score(model, X_bc, y_bc, cv=5)
    ensemble_results[name] = {
        'train': train_acc,
        'test': test_acc,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std()
    }

# Visualize comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Train vs Test accuracy
names = list(ensemble_results.keys())
train_acc_list = [ensemble_results[n]['train'] for n in names]
test_acc_list = [ensemble_results[n]['test'] for n in names]
x = np.arange(len(names))
width = 0.35

ax1.bar(x - width/2, train_acc_list, width, label='Train', alpha=0.8)
ax1.bar(x + width/2, test_acc_list, width, label='Test', alpha=0.8)
ax1.set_ylabel('Accuracy')
ax1.set_title('Ensemble Methods Comparison')
ax1.set_xticks(x)
ax1.set_xticklabels(names, rotation=15, ha='right')
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')
ax1.set_ylim([0.9, 1.0])

# CV scores with error bars
cv_means = [ensemble_results[n]['cv_mean'] for n in names]
cv_stds = [ensemble_results[n]['cv_std'] for n in names]

ax2.barh(range(len(names)), cv_means, xerr=cv_stds, alpha=0.8, capsize=5)
ax2.set_yticks(range(len(names)))
ax2.set_yticklabels(names)
ax2.set_xlabel('Cross-Validation Accuracy')
ax2.set_title('5-Fold CV Performance')
ax2.grid(True, alpha=0.3, axis='x')
ax2.set_xlim([0.92, 0.98])

plt.tight_layout()
plt.show()

# Print detailed results
print("\n📊 Detailed Comparison:\n")
print(f"{'Method':<20} {'Train Acc':<12} {'Test Acc':<12} {'CV Mean±Std':<20} {'Overfit'}")
print("=" * 85)
for name in names:
    r = ensemble_results[name]
    overfit = r['train'] - r['test']
    cv_str = f"{r['cv_mean']:.4f}±{r['cv_std']:.4f}"
    marker = '✅' if r['test'] == max(test_acc_list) else ''
    print(f"{name:<20} {r['train']:>10.4f} {r['test']:>10.4f} {cv_str:<20} {overfit:>7.4f} {marker}")
Key Takeaways
Method Comparison

| Method | Pros | Cons | Best For |
|---|---|---|---|
| Single Tree | Interpretable, fast | High variance, poor accuracy | Exploration, interpretability |
| Bagging | Reduces variance, stable | Less interpretable | General purpose |
| Random Forest | Excellent performance, robust | Black box, slower | Default choice for tabular data |
| AdaBoost | Good with weak learners | Sensitive to noise | Clean data with few outliers |
| Gradient Boosting | Often best performance | Prone to overfitting, needs tuning | Competitions, best accuracy |
When to Use Each
Decision Trees:
Need interpretability
Quick baseline
Exploratory analysis
Embedded in other methods
Random Forest:
Default choice for tabular data ✅
Need feature importance
Want robust out-of-the-box performance
Don't want to tune many hyperparameters
Gradient Boosting:
Squeezing out the last % of performance
Kaggle competitions
Have time for hyperparameter tuning
Structured/tabular data
Hyperparameter Guidelines
Random Forest:
RandomForestClassifier(
    n_estimators=100,      # More is better (diminishing returns)
    max_features='sqrt',   # sqrt(p) for classification
    max_depth=None,        # Grow full trees
    min_samples_split=2,   # Default usually good
    min_samples_leaf=1     # Default usually good
)
Gradient Boosting:
GradientBoostingClassifier(
    n_estimators=100,      # Start with 100
    learning_rate=0.1,     # Lower if overfitting
    max_depth=3,           # Shallow trees! (3-5)
    subsample=0.8,         # Stochastic gradient boosting
    min_samples_split=2
)
Tuning Strategy
Random Forest (Less Tuning Needed):
Set n_estimators=100-500
Try max_features in ['sqrt', 'log2', 0.5]
Adjust max_depth if overfitting
Gradient Boosting (More Tuning Needed):
Fix learning_rate=0.1, max_depth=3
Tune n_estimators with early stopping
Lower learning_rate and increase n_estimators together
Tune max_depth, min_samples_split
Add subsample for regularization
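scikit-learn's gradient boosting estimators have built-in early stopping via `n_iter_no_change` and `validation_fraction`, which automates the n_estimators tuning step above. A sketch on synthetic data:

```python
# Early stopping: hold out a validation fraction of the training data and
# stop boosting once the validation score stops improving.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=600, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=1000,        # an upper bound, not the final count
    learning_rate=0.1,
    max_depth=3,
    n_iter_no_change=10,      # stop after 10 stagnant validation rounds
    validation_fraction=0.2,
    random_state=0,
).fit(X, y)

print(gb.n_estimators_)       # actual number of stages used
```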
Common Pitfalls
❌ Using deep trees in boosting → overfitting
❌ Too high a learning rate in boosting → unstable
❌ Not using OOB error in RF → wasting free validation
❌ Comparing train accuracy → use test or CV!
❌ Not checking feature importance → missing insights
❌ Forgetting to set random_state → irreproducible results
Best Practices
✅ Start with Random Forest - usually an excellent baseline
✅ Use cross-validation - don't rely on a single train/test split
✅ Check feature importance - understand what the model learns
✅ Monitor train vs test - watch for overfitting
✅ Use OOB error - free validation for RF
✅ Ensemble different types - combine RF + GB for best results
Production Considerations
Random Forest:
✅ Parallelizable (fast prediction)
✅ Robust (less tuning in production)
❌ Large memory footprint (many trees)
Gradient Boosting:
✅ Smaller models (fewer trees)
✅ Often best accuracy
❌ Sequential prediction (slower)
❌ More sensitive to input changes
Modern Variants
Consider these for production:
XGBoost: optimized gradient boosting
LightGBM: fast, efficient, handles categorical features
CatBoost: great for categorical features
Next Chapter
Chapter 9: Support Vector Machines
Maximal Margin Classifier
Support Vector Classifiers
Support Vector Machines (Kernels)
Multi-class SVMs
Practice Exercises
Exercise 1: Tree Depth Analysis
Generate regression data with true polynomial relationship
Fit trees with depths 1-20
Plot train MSE, test MSE, and tree complexity
Find optimal depth via cross-validation
Exercise 2: Bagging Convergence
Fit bagging with n_estimators from 1 to 200
Plot test error vs number of trees
Identify when performance plateaus
Calculate OOB error and compare to test error
Exercise 3: Feature Importance
Using a real dataset (e.g., housing prices):
Fit Random Forest
Extract and visualize feature importances
Retrain with only top 5 features
Compare performance
Exercise 4: Boosting Hyperparameters
Grid search over:
learning_rate: [0.01, 0.05, 0.1, 0.2]
n_estimators: [50, 100, 200]
max_depth: [2, 3, 4, 5]
Find best combination and visualize trade-offs
Exercise 5: Ensemble Comparison
On a classification dataset:
Implement: Single Tree, Bagging, RF, AdaBoost, GB
Compare accuracy, training time, prediction time
Plot ROC curves for all methods
Analyze confusion matrices
Recommend best method with justification
Exercise 6: Overfitting Investigation
Create small dataset (n=100)
Fit RF and GB with varying complexity
Plot learning curves (train/test vs training size)
Identify overfitting signatures
Apply regularization to mitigate