Chapter 8: Tree-Based Methods

Overview

Tree-based methods partition the feature space into simple regions and make predictions from the mean (regression) or mode (classification) of the training observations in each region.

Methods Covered

1. Decision Trees

  • Single tree: Simple, interpretable

  • CART algorithm: Recursive binary splitting

  • Pruning: Control complexity

2. Bagging (Bootstrap Aggregating)

  • Build multiple trees on bootstrap samples

  • Average predictions (regression) or vote (classification)

  • Reduces variance

3. Random Forests

  • Bagging + random feature selection

  • Decorrelates trees

  • Often best out-of-the-box performance

4. Boosting

  • Sequential: Each tree corrects the previous trees' errors

  • AdaBoost: Reweight misclassified samples

  • Gradient Boosting: Fit residuals

  • XGBoost, LightGBM: Modern implementations

Key Advantages

✅ Interpretability (single trees)
✅ Handle non-linearity naturally
✅ No feature scaling needed
✅ Capture feature interactions automatically
✅ Missing value handling
✅ Built-in feature importance
✅ Work for classification and regression

Key Disadvantages

❌ High variance (single trees)
❌ Lack of smoothness (piecewise-constant predictions)
❌ Lower prediction accuracy (single trees)
❌ Reduced interpretability of ensembles (forests, boosting)

8.1 Decision Trees

The Algorithm: Recursive Binary Splitting

For Regression:

  1. Find best split: minimize RSS in resulting regions

  2. Split data into two regions

  3. Repeat for each region

  4. Stop when criterion met (min samples, max depth)

\[\text{RSS} = \sum_{j=1}^J \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2\]

where \(\hat{y}_{R_j}\) is the mean of the training observations in region \(R_j\)
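For a single feature, step 1 amounts to an exhaustive scan over thresholds. A minimal sketch (the helper name `best_split_1d` and the synthetic step data are ours, not from any library):

```python
import numpy as np

def best_split_1d(xs, ys):
    """Scan every candidate threshold on one feature and return the
    one that minimizes total RSS of the two resulting regions."""
    order = np.argsort(xs)
    xs, ys = xs[order], ys[order]
    best_rss, best_thresh = np.inf, None
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_rss = rss
            best_thresh = (xs[i - 1] + xs[i]) / 2  # midpoint between neighbors
    return best_thresh, best_rss

rng = np.random.default_rng(0)
xs = rng.uniform(0, 10, 50)
ys = np.where(xs < 5, 0.0, 1.0) + rng.normal(0, 0.1, 50)
thresh, rss = best_split_1d(xs, ys)
print(thresh)  # a threshold near the true breakpoint at 5
```

CART applies exactly this scan to every feature at every node and keeps the best (feature, threshold) pair.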

For Classification: Minimize Gini impurity or entropy:

\[\text{Gini} = \sum_{k=1}^K \hat{p}_{mk}(1 - \hat{p}_{mk})\]
\[\text{Entropy} = -\sum_{k=1}^K \hat{p}_{mk} \log(\hat{p}_{mk})\]

where \(\hat{p}_{mk}\) is the proportion of class \(k\) in region \(m\)
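Both impurity measures are straightforward to evaluate directly; a quick sketch (the helper names `gini` and `entropy` are ours, not library functions):

```python
import numpy as np

def gini(p):
    """Gini impurity for a vector of class proportions summing to 1."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def entropy(p):
    """Entropy of class proportions; zero-probability classes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

print(gini([0.5, 0.5]))     # 0.5: the maximum for two classes
print(gini([1.0, 0.0]))     # 0.0: a pure node
print(entropy([0.5, 0.5]))  # log(2), about 0.693
```

Both measures are maximized at uniform class proportions and are zero for pure nodes, which is why either works as a split criterion.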

Tree Pruning

Problem: Large trees overfit
Solution: Cost complexity pruning

Minimize:

\[\sum_{m=1}^{|T|} \sum_{i: x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|\]

where \(|T|\) is the number of terminal nodes and \(\alpha\) controls the tradeoff between fit and complexity
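In scikit-learn this is the `ccp_alpha` parameter. A minimal sketch on synthetic data (the seed and grid thinning are our choices) cross-validates over the pruning path to pick \(\alpha\):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X_syn = rng.uniform(0, 10, (200, 1))
y_syn = np.sin(X_syn).ravel() + rng.normal(0, 0.3, 200)

# cost_complexity_pruning_path returns the effective alphas at which
# successive subtrees get collapsed during pruning.
path = DecisionTreeRegressor(random_state=42).cost_complexity_pruning_path(X_syn, y_syn)
alphas = path.ccp_alphas[:-1]   # drop the last alpha, which prunes to the root
grid = alphas[::5]              # thin the grid to keep CV cheap

cv_means = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=42),
                            X_syn, y_syn, cv=5).mean() for a in grid]
best_alpha = grid[int(np.argmax(cv_means))]
print(best_alpha)
```

Larger \(\alpha\) values prune more aggressively; \(\alpha = 0\) keeps the full tree.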

# Regression Tree Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree

np.random.seed(42)
n = 100
X = np.linspace(0, 10, n).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.randn(n) * 0.3

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Compare different tree depths
depths = [2, 4, 6, 10]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

X_plot = np.linspace(0, 10, 300).reshape(-1, 1)

for idx, depth in enumerate(depths):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    y_pred = tree.predict(X_plot)
    
    train_score = tree.score(X_train, y_train)
    test_score = tree.score(X_test, y_test)
    
    axes[idx].scatter(X_train, y_train, alpha=0.5, label='Train')
    axes[idx].scatter(X_test, y_test, alpha=0.5, label='Test', color='orange')
    axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label='Tree')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')
    axes[idx].set_title(f'Depth={depth}\nTrain R²={train_score:.3f}, Test R²={test_score:.3f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("   • Depth 2: Underfits (too simple)")
print("   • Depth 4-6: Good balance")
print("   • Depth 10: Overfits (train R² high, test R² lower)")
print("   • Trees create piecewise constant predictions")

# Visualize tree structure
tree_viz = DecisionTreeRegressor(max_depth=3, random_state=42)
tree_viz.fit(X_train, y_train)

plt.figure(figsize=(16, 8))
plot_tree(tree_viz, filled=True, feature_names=['X'], 
         rounded=True, fontsize=10)
plt.title('Decision Tree Structure (max_depth=3)', fontsize=14)
plt.show()

print("\n📊 Tree Information:")
print(f"Number of leaves: {tree_viz.get_n_leaves()}")
print(f"Max depth: {tree_viz.get_depth()}")
print(f"Train R²: {tree_viz.score(X_train, y_train):.4f}")
print(f"Test R²: {tree_viz.score(X_test, y_test):.4f}")

# Classification Tree Example
X_class, y_class = make_classification(n_samples=200, n_features=2, 
                                       n_informative=2, n_redundant=0,
                                       n_clusters_per_class=1, random_state=42)

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_class, y_class, test_size=0.3, random_state=42)

# Fit tree
tree_class = DecisionTreeClassifier(max_depth=4, random_state=42)
tree_class.fit(X_train_c, y_train_c)

# Create decision boundary plot
h = 0.02
x_min, x_max = X_class[:, 0].min() - 1, X_class[:, 0].max() + 1
y_min, y_max = X_class[:, 1].min() - 1, X_class[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = tree_class.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
plt.scatter(X_train_c[:, 0], X_train_c[:, 1], c=y_train_c, 
           edgecolors='k', marker='o', s=100, cmap='RdYlBu', label='Train')
plt.scatter(X_test_c[:, 0], X_test_c[:, 1], c=y_test_c, 
           edgecolors='k', marker='s', s=100, cmap='RdYlBu', label='Test')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title(f'Decision Tree Classification (max_depth=4)\n'
         f'Train Acc: {tree_class.score(X_train_c, y_train_c):.3f}, '
         f'Test Acc: {tree_class.score(X_test_c, y_test_c):.3f}')
plt.legend()
plt.show()

print("\n💡 Note:")
print("   • Decision boundaries are axis-aligned (rectangular regions)")
print("   • Cannot capture diagonal or curved boundaries well")

8.2 Bagging (Bootstrap Aggregating)

The Problem

Decision trees have high variance: small changes in the data can produce large changes in the fitted tree

The Solution: Bagging

  1. Generate B bootstrap samples from training data

  2. Train a tree on each bootstrap sample

  3. Average predictions (regression) or vote (classification)

Why It Works

Variance reduction through averaging:

If we have B independent models, each with variance \(\sigma^2\):

\[\text{Var}(\text{average}) = \frac{\sigma^2}{B}\]

(Bootstrap trees are correlated rather than fully independent, so the actual reduction is smaller; this motivates random forests.)

Out-of-Bag (OOB) Error

  • Each bootstrap sample uses ~63% of the data

  • The remaining ~37% can be used for validation

  • OOB error ≈ cross-validation error

  • No need for a separate validation set!

Key Parameters

  • n_estimators: Number of trees (B)

  • max_features: Features to consider at each split

  • max_depth: Tree depth

# Bagging vs Single Tree
# Compare single tree vs bagging with different numbers of trees
from sklearn.ensemble import BaggingRegressor

single_tree = DecisionTreeRegressor(random_state=42)
single_tree.fit(X_train, y_train)

n_estimators_list = [5, 10, 50, 100]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, n_est in enumerate(n_estimators_list):
    bagging = BaggingRegressor(
        estimator=DecisionTreeRegressor(),
        n_estimators=n_est,
        random_state=42
    )
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_plot)
    
    axes[idx].scatter(X_train, y_train, alpha=0.3, label='Data')
    axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label=f'Bagging ({n_est} trees)')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')
    axes[idx].set_title(f'Bagging: {n_est} trees\n'
                       f'Test R²: {bagging.score(X_test, y_test):.3f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Comparison:\n")
print(f"Single Tree Test R²: {single_tree.score(X_test, y_test):.4f}")
for n_est in n_estimators_list:
    bagging = BaggingRegressor(
        estimator=DecisionTreeRegressor(),
        n_estimators=n_est,
        random_state=42
    )
    bagging.fit(X_train, y_train)
    print(f"Bagging ({n_est:3d} trees) Test R²: {bagging.score(X_test, y_test):.4f}")

print("\n💡 Key Insight: More trees → smoother predictions, better performance")

8.3 Random Forests

Problem with Bagging

Bagged trees are correlated (every tree considers all the same features)

  • If one strong predictor exists, most trees will split on it first

  • Trees become similar

  • Less variance reduction from averaging

Random Forest Solution

Decorrelate trees by randomly selecting a subset of features at each split

Algorithm:

  1. Generate B bootstrap samples

  2. For each tree:

    • At each split, randomly select m features from the p total

    • Choose the best split among these m features only

    • Typical choice: \(m = \sqrt{p}\) (classification) or \(m = p/3\) (regression)

  3. Average predictions (or take a majority vote)
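The role of m can be seen by sweeping `max_features`; a sketch on synthetic regression data (dataset parameters are our choices; with `max_features=None` the forest considers all p features and reduces to plain bagging):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_d, y_d = make_regression(n_samples=400, n_features=12, n_informative=4,
                           noise=10.0, random_state=0)
X_dtr, X_dte, y_dtr, y_dte = train_test_split(X_d, y_d, random_state=0)

# None -> all 12 features at every split (plain bagging);
# 'sqrt' and 1 force progressively more decorrelation between trees.
scores = {}
for m in [None, 'sqrt', 1]:
    rf = RandomForestRegressor(n_estimators=200, max_features=m, random_state=0)
    rf.fit(X_dtr, y_dtr)
    scores[str(m)] = rf.score(X_dte, y_dte)
print(scores)
```

The best m is data-dependent, which is why `max_features` is worth tuning.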

Feature Importance

Measure how much each feature decreases impurity (Gini or MSE)

  • Sum over all trees and splits

  • Normalize so the importances sum to 1

Advantages

✅ Excellent performance out-of-the-box
✅ Handles high-dimensional data
✅ Built-in feature importance
✅ OOB error for free
✅ Parallel training
✅ Relatively robust to overfitting

Typical Hyperparameters

  • n_estimators: 100-1000 (more is better, with diminishing returns)

  • max_features: sqrt(p) or p/3

  • max_depth: Often unlimited

  • min_samples_split: 2-10

  • min_samples_leaf: 1-5

# Random Forest Example
# Load breast cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

data = load_breast_cancer()
X_bc, y_bc = data.data, data.target
feature_names = data.feature_names

X_train_bc, X_test_bc, y_train_bc, y_test_bc = train_test_split(
    X_bc, y_bc, test_size=0.3, random_state=42)

# Compare models
models = {
    'Single Tree': DecisionTreeClassifier(random_state=42),
    'Bagging': BaggingClassifier(n_estimators=100, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

results = {}
for name, model in models.items():
    model.fit(X_train_bc, y_train_bc)
    train_acc = model.score(X_train_bc, y_train_bc)
    test_acc = model.score(X_test_bc, y_test_bc)
    results[name] = {'train': train_acc, 'test': test_acc}

# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
model_names = list(results.keys())
train_accs = [results[m]['train'] for m in model_names]
test_accs = [results[m]['test'] for m in model_names]

x = np.arange(len(model_names))
width = 0.35

ax1.bar(x - width/2, train_accs, width, label='Train', alpha=0.8)
ax1.bar(x + width/2, test_accs, width, label='Test', alpha=0.8)
ax1.set_ylabel('Accuracy')
ax1.set_title('Model Comparison on Breast Cancer Dataset')
ax1.set_xticks(x)
ax1.set_xticklabels(model_names)
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')

# Feature importance from Random Forest
rf = models['Random Forest']
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1][:10]  # Top 10

ax2.barh(range(10), importances[indices], alpha=0.8)
ax2.set_yticks(range(10))
ax2.set_yticklabels([feature_names[i] for i in indices])
ax2.set_xlabel('Importance')
ax2.set_title('Top 10 Feature Importances (Random Forest)')
ax2.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n📊 Results:\n")
print(f"{'Model':<20} {'Train Acc':<12} {'Test Acc':<12} {'Overfit Gap'}")
print("="*60)
for name in model_names:
    train = results[name]['train']
    test = results[name]['test']
    gap = train - test
    marker = '✅' if test == max(test_accs) else ''
    print(f"{name:<20} {train:>10.4f}   {test:>10.4f}   {gap:>10.4f}  {marker}")

print("\n💡 Observations:")
print("   • Single tree: High variance (overfits)")
print("   • Bagging: Reduces variance significantly")
print("   • Random Forest: Best test performance (decorrelated trees)")

8.4 Boosting

Key Difference from Bagging/RF

  • Bagging/RF: Build trees independently, in parallel

  • Boosting: Build trees sequentially, each correcting the previous trees' errors

AdaBoost (Adaptive Boosting)

For Classification:

  1. Start with equal weights for all samples

  2. Train a weak learner (shallow tree)

  3. Increase the weights of misclassified samples

  4. Repeat: the next tree focuses on the hard examples

  5. Combine learners with a weighted vote
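Steps 1-5 fit in a short from-scratch sketch (labels recoded to ±1 and decision stumps as the weak learners; this mirrors the textbook algorithm, not sklearn's internal implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_ab, y01 = make_classification(n_samples=300, random_state=0)
y_ab = np.where(y01 == 1, 1, -1)          # recode labels to {-1, +1}

n = len(y_ab)
w = np.full(n, 1.0 / n)                   # step 1: equal weights
stumps, alphas = [], []
for _ in range(50):
    stump = DecisionTreeClassifier(max_depth=1)        # step 2: weak learner
    stump.fit(X_ab, y_ab, sample_weight=w)
    pred = stump.predict(X_ab)
    err = w[pred != y_ab].sum()                        # weighted error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # learner weight
    w *= np.exp(-alpha * y_ab * pred)                  # step 3: upweight mistakes
    w /= w.sum()                                       # renormalize
    stumps.append(stump)
    alphas.append(alpha)

agg = sum(a * s.predict(X_ab) for a, s in zip(alphas, stumps))  # step 5
train_acc = float((np.sign(agg) == y_ab).mean())
print(train_acc)
```

Each round, misclassified points gain weight, so the next stump concentrates on them; the final prediction is the sign of the alpha-weighted vote.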

Gradient Boosting

For Regression and Classification:

  1. Initialize with a simple model (a constant)

  2. Compute residuals (errors)

  3. Fit a tree to predict the residuals

  4. Add the tree to the ensemble (scaled by the learning rate)

  5. Repeat: each tree fits the residuals of the current ensemble

Update rule:

\[f_m(x) = f_{m-1}(x) + \nu \cdot h_m(x)\]

where:

  • \(f_m\) = ensemble after m trees

  • \(\nu\) = learning rate (shrinkage)

  • \(h_m\) = new tree fit to the residuals
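The update rule translates almost line for line into code; a from-scratch sketch on synthetic sine data (constants are our choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_gb = rng.uniform(0, 10, (300, 1))
y_gb = np.sin(X_gb).ravel() + rng.normal(0, 0.2, 300)

nu = 0.1                                # learning rate (shrinkage)
f = np.full_like(y_gb, y_gb.mean())     # step 1: constant initial model
trees = []
for _ in range(100):
    resid = y_gb - f                    # step 2: residuals of current ensemble
    h = DecisionTreeRegressor(max_depth=3)
    h.fit(X_gb, resid)                  # step 3: fit a tree to the residuals
    f += nu * h.predict(X_gb)           # step 4: f_m = f_{m-1} + nu * h_m
    trees.append(h)

train_mse = float(np.mean((y_gb - f) ** 2))
print(train_mse)  # shrinks toward the noise variance (0.04 here)
```

Predicting a new point means summing `y_gb.mean() + nu * sum(h.predict(x))` over all stored trees, which is exactly what `GradientBoostingRegressor` does internally (with squared-error loss, the residuals are the negative gradients).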

Key Hyperparameters

  • n_estimators: Number of boosting stages (50-500)

  • learning_rate: Shrinkage (0.01-0.3)

    • Lower rate → need more trees

    • Typical: 0.1

  • max_depth: Tree depth (1-10)

    • Boosting works with shallow trees!

    • Often 3-5

  • subsample: Fraction of samples used for each tree (0.5-1.0)

Regularization

  • Shrinkage (learning_rate): Slow learning prevents overfitting

  • Subsampling: Randomness helps generalization

  • Tree constraints: max_depth, min_samples_split

# Gradient Boosting Example
# Compare different learning rates and n_estimators
from sklearn.ensemble import GradientBoostingRegressor

learning_rates = [0.01, 0.1, 0.3]
n_estimators_range = range(10, 201, 10)

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, lr in enumerate(learning_rates):
    train_scores = []
    test_scores = []
    
    for n_est in n_estimators_range:
        gb = GradientBoostingRegressor(
            n_estimators=n_est,
            learning_rate=lr,
            max_depth=3,
            random_state=42
        )
        gb.fit(X_train, y_train)
        train_scores.append(gb.score(X_train, y_train))
        test_scores.append(gb.score(X_test, y_test))
    
    axes[idx].plot(n_estimators_range, train_scores, label='Train', linewidth=2)
    axes[idx].plot(n_estimators_range, test_scores, label='Test', linewidth=2)
    axes[idx].set_xlabel('Number of Trees')
    axes[idx].set_ylabel('R²')
    axes[idx].set_title(f'Gradient Boosting\nLearning Rate = {lr}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)
    
    best_n = n_estimators_range[np.argmax(test_scores)]
    best_score = max(test_scores)
    axes[idx].axvline(best_n, color='r', linestyle='--', alpha=0.5)
    axes[idx].text(best_n, 0.5, f'Best: {best_n}\nR²={best_score:.3f}', 
                  fontsize=9, ha='left')

plt.tight_layout()
plt.show()

print("\n💡 Learning Rate Effects:")
print("   • lr = 0.01: Slow learning, needs many trees, less overfitting")
print("   • lr = 0.1:  Good balance (common choice)")
print("   • lr = 0.3:  Fast learning, fewer trees, more overfitting risk")
print("\n   Rule: Lower learning rate → more trees needed")

# Compare all ensemble methods
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

ensemble_models = {
    'Bagging': BaggingClassifier(
        n_estimators=100, random_state=42),
    'Random Forest': RandomForestClassifier(
        n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(
        n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
}

ensemble_results = {}
for name, model in ensemble_models.items():
    model.fit(X_train_bc, y_train_bc)
    train_acc = model.score(X_train_bc, y_train_bc)
    test_acc = model.score(X_test_bc, y_test_bc)
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_bc, y_bc, cv=5)
    ensemble_results[name] = {
        'train': train_acc,
        'test': test_acc,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std()
    }

# Visualize comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Train vs Test accuracy
names = list(ensemble_results.keys())
train_acc_list = [ensemble_results[n]['train'] for n in names]
test_acc_list = [ensemble_results[n]['test'] for n in names]

x = np.arange(len(names))
width = 0.35

ax1.bar(x - width/2, train_acc_list, width, label='Train', alpha=0.8)
ax1.bar(x + width/2, test_acc_list, width, label='Test', alpha=0.8)
ax1.set_ylabel('Accuracy')
ax1.set_title('Ensemble Methods Comparison')
ax1.set_xticks(x)
ax1.set_xticklabels(names, rotation=15, ha='right')
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')
ax1.set_ylim([0.9, 1.0])

# CV scores with error bars
cv_means = [ensemble_results[n]['cv_mean'] for n in names]
cv_stds = [ensemble_results[n]['cv_std'] for n in names]

ax2.barh(range(len(names)), cv_means, xerr=cv_stds, alpha=0.8, capsize=5)
ax2.set_yticks(range(len(names)))
ax2.set_yticklabels(names)
ax2.set_xlabel('Cross-Validation Accuracy')
ax2.set_title('5-Fold CV Performance')
ax2.grid(True, alpha=0.3, axis='x')
ax2.set_xlim([0.92, 0.98])

plt.tight_layout()
plt.show()

# Print detailed results
print("\n📊 Detailed Comparison:\n")
print(f"{'Method':<20} {'Train Acc':<12} {'Test Acc':<12} {'CV Mean±Std':<20} {'Overfit'}")
print("="*85)
for name in names:
    r = ensemble_results[name]
    overfit = r['train'] - r['test']
    cv_str = f"{r['cv_mean']:.4f}±{r['cv_std']:.4f}"
    marker = '✅' if r['test'] == max(test_acc_list) else ''
    print(f"{name:<20} {r['train']:>10.4f}   {r['test']:>10.4f}   {cv_str:<20} {overfit:>7.4f}  {marker}")

Key Takeaways

Method Comparison

| Method            | Pros                          | Cons                               | Best For                        |
|-------------------|-------------------------------|------------------------------------|---------------------------------|
| Single Tree       | Interpretable, fast           | High variance, poor accuracy       | Exploration, interpretability   |
| Bagging           | Reduces variance, stable      | Less interpretable                 | General purpose                 |
| Random Forest     | Excellent performance, robust | Black box, slower                  | Default choice for tabular data |
| AdaBoost          | Good with weak learners       | Sensitive to noise                 | Clean data (outliers handled)   |
| Gradient Boosting | Often best performance        | Prone to overfitting, needs tuning | Competitions, best accuracy     |

When to Use Each

Decision Trees:

  • Need interpretability

  • Quick baseline

  • Exploratory analysis

  • Embedded in other methods

Random Forest:

  • Default choice for tabular data ✅

  • Need feature importance

  • Want robust out-of-the-box performance

  • Don't want to tune many hyperparameters

Gradient Boosting:

  • Squeezing out the last % of performance

  • Kaggle competitions

  • Have time for hyperparameter tuning

  • Structured/tabular data

Hyperparameter Guidelines

Random Forest:

RandomForestClassifier(
    n_estimators=100,      # More is better (diminishing returns)
    max_features='sqrt',   # sqrt(p) for classification
    max_depth=None,        # Grow full trees
    min_samples_split=2,   # Default usually good
    min_samples_leaf=1     # Default usually good
)

Gradient Boosting:

GradientBoostingClassifier(
    n_estimators=100,      # Start with 100
    learning_rate=0.1,     # Lower if overfitting
    max_depth=3,           # Shallow trees! (3-5)
    subsample=0.8,         # Stochastic boosting
    min_samples_split=2
)

Tuning Strategy

Random Forest (Less Tuning Needed):

  1. Set n_estimators=100-500

  2. Try max_features in ['sqrt', 'log2', 0.5]

  3. Adjust max_depth if overfitting

Gradient Boosting (More Tuning Needed):

  1. Fix learning_rate=0.1, max_depth=3

  2. Tune n_estimators with early stopping

  3. Lower learning_rate and increase n_estimators together

  4. Tune max_depth, min_samples_split

  5. Add subsample for regularization
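Step 2's early stopping is built into scikit-learn via `n_iter_no_change`; a sketch on the breast cancer data (parameter values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_es, y_es = load_breast_cancer(return_X_y=True)
X_estr, X_este, y_estr, y_este = train_test_split(
    X_es, y_es, test_size=0.3, random_state=42)

# n_iter_no_change holds out validation_fraction of the training data and
# stops boosting once the validation score stalls for that many rounds,
# so n_estimators acts only as an upper bound.
gb = GradientBoostingClassifier(
    n_estimators=1000, learning_rate=0.1, max_depth=3,
    validation_fraction=0.2, n_iter_no_change=10, random_state=42)
gb.fit(X_estr, y_estr)
print(gb.n_estimators_, gb.score(X_este, y_este))
```

`gb.n_estimators_` reports how many stages were actually fit, which is typically far fewer than the ceiling.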

Common Pitfalls

❌ Using deep trees in boosting → overfitting
❌ Too high a learning rate in boosting → unstable
❌ Not using OOB error in RF → wasting free validation
❌ Comparing train accuracy → use test or CV!
❌ Not checking feature importance → missing insights
❌ Forgetting to set random_state → irreproducible results

Best Practices

✅ Start with Random Forest - usually an excellent baseline
✅ Use cross-validation - don't rely on a single train/test split
✅ Check feature importance - understand what the model learns
✅ Monitor train vs test - watch for overfitting
✅ Use OOB error - free validation for RF
✅ Ensemble different types - combine RF + GB for best results

Production Considerations

Random Forest:

  • ✅ Parallelizable (fast training and prediction)

  • ✅ Robust (less tuning in production)

  • ❌ Large memory footprint (many deep trees)

Gradient Boosting:

  • ✅ Smaller models (fewer, shallower trees)

  • ✅ Often best accuracy

  • ❌ Sequential training (slower to build)

  • ❌ More sensitive to input changes

Modern Variants

Consider these for production:

  • XGBoost: Optimized gradient boosting

  • LightGBM: Fast, efficient, handles categorical features

  • CatBoost: Great for categorical features

Next Chapter

Chapter 9: Support Vector Machines

  • Maximal Margin Classifier

  • Support Vector Classifiers

  • Support Vector Machines (Kernels)

  • Multi-class SVMs

Practice Exercises

Exercise 1: Tree Depth Analysis

  1. Generate regression data with true polynomial relationship

  2. Fit trees with depths 1-20

  3. Plot train MSE, test MSE, and tree complexity

  4. Find optimal depth via cross-validation

Exercise 2: Bagging Convergence

  1. Fit bagging with n_estimators from 1 to 200

  2. Plot test error vs number of trees

  3. Identify when performance plateaus

  4. Calculate OOB error and compare to test error

Exercise 3: Feature Importance

Using a real dataset (e.g., housing prices):

  1. Fit Random Forest

  2. Extract and visualize feature importances

  3. Retrain with only top 5 features

  4. Compare performance

Exercise 4: Boosting Hyperparameters

Grid search over:

  • learning_rate: [0.01, 0.05, 0.1, 0.2]

  • n_estimators: [50, 100, 200]

  • max_depth: [2, 3, 4, 5]

Find the best combination and visualize the trade-offs

Exercise 5: Ensemble Comparison

On a classification dataset:

  1. Implement: Single Tree, Bagging, RF, AdaBoost, GB

  2. Compare accuracy, training time, prediction time

  3. Plot ROC curves for all methods

  4. Analyze confusion matrices

  5. Recommend best method with justification

Exercise 6: Overfitting Investigation

  1. Create small dataset (n=100)

  2. Fit RF and GB with varying complexity

  3. Plot learning curves (train/test vs training size)

  4. Identify overfitting signatures

  5. Apply regularization to mitigate