Ensemble Methods: Bagging, Boosting & Stacking

Random forests, gradient boosting, XGBoost, and stacking: why combining models almost always wins.

1. Why Ensembles Work: The Wisdom of Crowds

A single model's expected error decomposes as bias² + variance + irreducible noise. Ensembles reduce the variance term by averaging out individual errors, provided the models are diverse (i.e., they make different mistakes).
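A quick numerical check of the variance-reduction claim (a minimal sketch; `sigma`, `n_models`, and the Gaussian errors are illustrative assumptions): averaging N independent errors of variance σ² yields variance σ²/N, while perfectly correlated errors gain nothing from averaging.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0          # per-model error std
n_models = 25
n_trials = 20_000

# Each column is one model's error; rows are independent trials.
independent = rng.normal(0, sigma, size=(n_trials, n_models))
avg_error_var = independent.mean(axis=1).var()

print(f"Single-model error variance:  {independent[:, 0].var():.3f}")
print(f"Averaged-ensemble variance:   {avg_error_var:.3f}  "
      f"(theory: sigma^2/N = {sigma**2 / n_models:.3f})")

# Perfectly correlated models: every model makes the same mistake,
# so averaging buys nothing.
shared = rng.normal(0, sigma, size=(n_trials, 1))
correlated = np.repeat(shared, n_models, axis=1)
print(f"Correlated-ensemble variance: {correlated.mean(axis=1).var():.3f}")
```

This is why diversity mechanisms (bootstrap sampling, feature subsetting) matter: they push the ensemble toward the independent case.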

Thought experiment: If 100 independent classifiers each have 70% accuracy, what accuracy does a majority vote achieve?

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

def majority_vote_accuracy(n_classifiers, individual_accuracy):
    """Probability that majority (>n/2) vote correctly when each has p accuracy."""
    majority = n_classifiers // 2 + 1
    return sum(binom.pmf(k, n_classifiers, individual_accuracy)
               for k in range(majority, n_classifiers + 1))

n_values = [1, 3, 5, 11, 21, 51, 101, 201]
accuracies = [0.6, 0.7, 0.8]

fig, ax = plt.subplots(figsize=(9, 5))
for p in accuracies:
    ensemble_accs = [majority_vote_accuracy(n, p) for n in n_values]
    ax.plot(n_values, ensemble_accs, 'o-', label=f'Individual accuracy = {p}')
    ax.axhline(p, linestyle='--', alpha=0.4)

ax.set_xlabel('Number of classifiers in ensemble')
ax.set_ylabel('Ensemble accuracy (majority vote)')
ax.set_title('Wisdom of Crowds: Ensemble accuracy improves with N')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"1 classifier at 70% accuracy:   {majority_vote_accuracy(1, 0.7):.4f}")
print(f"11 classifiers at 70% accuracy: {majority_vote_accuracy(11, 0.7):.4f}")
print(f"101 classifiers at 70% accuracy: {majority_vote_accuracy(101, 0.7):.4f}")

2. Bagging: Random Forest

Bagging = Bootstrap AGGregatING: train each tree on a random bootstrap sample of the data. Random Forest adds feature randomness on top: at each split, only a random subset of features is considered.
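It helps to see what a bootstrap sample actually contains. A quick check (the sample size and seed are arbitrary): sampling n points with replacement leaves each point out with probability (1 - 1/n)^n → 1/e, so roughly 63.2% of the data is unique per tree and the remaining ~36.8% is "out-of-bag" (the basis of `RandomForestClassifier`'s `oob_score` option).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# One bootstrap sample: n draws with replacement from n data points.
sample = rng.integers(0, n, size=n)
unique_fraction = np.unique(sample).size / n

# Each point is missed with prob (1 - 1/n)^n -> 1/e, so ~63.2% appear.
print(f"Unique samples in bootstrap: {unique_fraction:.3f} "
      f"(theory: {1 - np.exp(-1):.3f})")
```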

Key hyperparameters and their effects:

import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Effect of n_estimators (number of trees)
n_trees_range = [1, 5, 10, 25, 50, 100, 200]
test_scores = []

for n in n_trees_range:
    rf = RandomForestClassifier(n_estimators=n, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    test_scores.append(accuracy_score(y_test, rf.predict(X_test)))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(n_trees_range, test_scores, 'o-', color='green')
ax.set_xlabel('n_estimators (number of trees)')
ax.set_ylabel('Test Accuracy')
ax.set_title('Random Forest: Effect of n_estimators')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Key RF hyperparameters:")
print("  n_estimators  : More trees = more stable (diminishing returns after ~100-200)")
print("  max_depth     : None = fully grown trees (high variance), small = pruned (high bias)")
print("  max_features  : 'sqrt' is the default for classification; ~n_features/3 is a common choice for regression")
print("  min_samples_leaf: Increase to reduce overfitting on noisy data")
print("  bootstrap     : True (default) = bagging; False = pasting")

3. Gradient Boosting: sklearn GradientBoostingClassifier

Boosting trains trees sequentially. Each tree corrects the errors of the previous ensemble by fitting the residuals. The result: the trees are not diverse by design; they are specialists.
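The residual-fitting loop can be sketched by hand for squared-error regression (a toy example on synthetic data; the learning rate, depth, and stage count are arbitrary choices, not recommendations):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

learning_rate = 0.3
pred = np.full_like(y, y.mean())   # stage 0: constant prediction
trees = []

for stage in range(50):
    residuals = y - pred                       # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                     # each tree fits the remaining error
    pred += learning_rate * tree.predict(X)    # shrunken update
    trees.append(tree)

mse_start = np.mean((y - y.mean()) ** 2)
mse_final = np.mean((y - pred) ** 2)
print(f"Training MSE: {mse_start:.4f} -> {mse_final:.4f}")
```

For squared error the residual is exactly the negative gradient of the loss, which is where the name "gradient" boosting comes from; other losses substitute their own gradients.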

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

gbm = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,  # shrinkage: how much each tree contributes
    max_depth=3,        # shallow trees are conventional for GBM
    subsample=0.8,      # stochastic GBM: use 80% of data per tree
    random_state=42
)

gbm_scores = cross_val_score(gbm, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1)
print(f"GradientBoosting CV:  {gbm_scores.mean():.4f} ± {gbm_scores.std():.4f}")

# Fit to see training vs test evolution
gbm.fit(X_train, y_train)

# staged_predict_proba: get predictions after each boosting stage
train_errors = []
test_errors = []
for y_pred_staged in gbm.staged_predict(X_train):
    train_errors.append(1 - accuracy_score(y_train, y_pred_staged))
for y_pred_staged in gbm.staged_predict(X_test):
    test_errors.append(1 - accuracy_score(y_test, y_pred_staged))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(train_errors, label='Train error', color='blue')
ax.plot(test_errors, label='Test error', color='orange')
ax.set_xlabel('Boosting iterations')
ax.set_ylabel('Error rate')
ax.set_title('GradientBoosting: Training vs Test Error per Stage')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nBest test error at iteration: {np.argmin(test_errors) + 1}")
print("Tip: Use early_stopping_rounds with XGBoost/LightGBM to find optimal n_estimators.")
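sklearn's GradientBoostingClassifier offers the same early-stopping idea through `n_iter_no_change` and `validation_fraction`. A minimal sketch (the 500-stage cap and patience of 10 are arbitrary; the dataset mirrors the one built earlier in this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

gbm_es = GradientBoostingClassifier(
    n_estimators=500,          # generous cap; early stopping decides when to quit
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.2,   # held out internally from the training data
    n_iter_no_change=10,       # stop after 10 stages without improvement
    tol=1e-4,
    random_state=42,
)
gbm_es.fit(X_train, y_train)
print(f"Stopped after {gbm_es.n_estimators_} of 500 stages")
print(f"Test accuracy: {gbm_es.score(X_test, y_test):.4f}")
```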

4. XGBoost vs sklearn GBM: Speed Comparison

import time
from sklearn.datasets import make_classification

# Larger dataset to make timing differences visible
X_big, y_big = make_classification(n_samples=5000, n_features=30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_big, y_big, test_size=0.2, random_state=42)

# sklearn GBM
gbm_model = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
t0 = time.time()
gbm_model.fit(X_tr, y_tr)
t_gbm = time.time() - t0
gbm_acc = accuracy_score(y_te, gbm_model.predict(X_te))

# Try XGBoost
try:
    import xgboost as xgb
    xgb_model = xgb.XGBClassifier(
        n_estimators=100, max_depth=3,
        learning_rate=0.1,
        eval_metric='logloss',  # use_label_encoder was deprecated and later removed
        random_state=42,
        verbosity=0
    )
    t0 = time.time()
    xgb_model.fit(X_tr, y_tr)
    t_xgb = time.time() - t0
    xgb_acc = accuracy_score(y_te, xgb_model.predict(X_te))
    
    print(f"sklearn GBM: {t_gbm:.3f}s, accuracy={gbm_acc:.4f}")
    print(f"XGBoost:     {t_xgb:.3f}s, accuracy={xgb_acc:.4f}")
    print(f"Speedup:     {t_gbm/t_xgb:.1f}x faster")
    xgb_available = True
except ImportError:
    print("XGBoost not installed. Install with: pip install xgboost")
    print(f"\nsklearn GBM: {t_gbm:.3f}s, accuracy={gbm_acc:.4f}")
    print("\nXGBoost advantages (when installed):")
    print("  - 5-10x faster due to histogram-based split finding")
    print("  - Built-in L1/L2 regularization (reg_alpha, reg_lambda)")
    print("  - Native handling of missing values")
    print("  - early_stopping_rounds parameter")
    xgb_available = False

5. Feature Importance: RF vs GBM

import pandas as pd
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()
X_bc, y_bc = bc.data, bc.target
feature_names = bc.feature_names

rf_bc = RandomForestClassifier(n_estimators=100, random_state=42)
gbm_bc = GradientBoostingClassifier(n_estimators=100, random_state=42)
rf_bc.fit(X_bc, y_bc)
gbm_bc.fit(X_bc, y_bc)

# Top 10 features
top_n = 10
rf_imp  = pd.Series(rf_bc.feature_importances_,  index=feature_names).nlargest(top_n)
gbm_imp = pd.Series(gbm_bc.feature_importances_, index=feature_names).nlargest(top_n)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, imp, title, color in [
    (axes[0], rf_imp,  'Random Forest', 'steelblue'),
    (axes[1], gbm_imp, 'Gradient Boosting', 'darkorange')
]:
    ax.barh(range(top_n), imp.values[::-1], color=color, alpha=0.8)
    ax.set_yticks(range(top_n))
    ax.set_yticklabels(imp.index[::-1])
    ax.set_xlabel('Feature Importance (impurity-based)')
    ax.set_title(f'{title} Feature Importance')
    ax.grid(True, axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("Caution: Impurity-based importance is biased toward high-cardinality features.")
print("Prefer permutation importance for model-agnostic, unbiased estimates (see notebook 05).")
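Permutation importance on held-out data is a quick fix (a minimal sketch; the train/test split and `n_repeats` are arbitrary): shuffle one feature at a time and record the drop in accuracy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

bc = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(bc.data, bc.target, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the accuracy drop;
# features the model doesn't rely on score near zero.
result = permutation_importance(rf, X_te, y_te, n_repeats=10,
                                random_state=42, n_jobs=-1)

top5 = result.importances_mean.argsort()[::-1][:5]
for i in top5:
    print(f"{bc.feature_names[i]:25s} {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

Unlike impurity-based importance, this is computed on unseen data, so features that only help the model memorize the training set get no credit.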

6. Stacking with StackingClassifier

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Base models (level 0)
estimators = [
    ('rf',  RandomForestClassifier(n_estimators=50, random_state=42)),
    ('svm', make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
    ('knn', make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
]

# Meta-learner (level 1): learns how to combine base model predictions
stacking_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(random_state=42),
    cv=5,          # uses CV predictions (out-of-fold) to train meta-learner
    stack_method='predict_proba'  # pass probabilities, not just class labels
)

# Compare all methods
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM': make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)),
    'KNN': make_pipeline(StandardScaler(), KNeighborsClassifier()),
    'Stacking': stacking_clf
}

print("Model Comparison (5-fold CV on breast cancer dataset):")
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_bc, y_bc, cv=5, scoring='accuracy', n_jobs=-1)
    results[name] = scores
    print(f"  {name:25s}: {scores.mean():.4f} ± {scores.std():.4f}")

7. When to Use What

| Method           | Use When                                                  | Avoid When                                               |
|------------------|-----------------------------------------------------------|----------------------------------------------------------|
| Random Forest    | Fast prototyping, feature importance needed, noisy data   | Need maximum accuracy, very sparse data                  |
| GradientBoosting | Tabular data competitions, need best accuracy             | Training time is critical, high-dimensional sparse data  |
| XGBoost/LightGBM | Large tabular datasets, Kaggle competitions               | Tiny datasets (<100 rows), image/text (use deep learning)|
| Stacking         | Squeezing out the last 1-2% accuracy, diverse base models | Limited training data, need fast inference               |
| Voting           | Quick ensemble of already-tuned models                    | All models are the same type                             |

Practical rule of thumb: For tabular data, try GradientBoostingClassifier or XGBoost first. They almost always outperform Random Forest when tuned.

Exercises

  1. Diversity experiment: Build 3 identical Random Forest models (same hyperparameters, different random_state). Does stacking them improve over a single RF? Why or why not?

  2. Hyperparameter tuning: Use RandomizedSearchCV to tune a GradientBoostingClassifier on the breast cancer dataset. Key parameters to tune: n_estimators, learning_rate, max_depth, subsample, min_samples_leaf. Report best CV score.

  3. Bias-variance tradeoff: Plot test and train accuracy for RandomForestClassifier as max_depth varies from 1 to None. At what depth does the model transition from underfitting to overfitting?

  4. Feature importance stability: Run RandomForestClassifier 20 times with different random_state values. Plot the standard deviation of feature importance across runs. Which features have stable importance? Which are unstable?

  5. Stacking with OOF: Manually implement a 2-level stacking: (1) split data into 5 folds; (2) for each fold, train base models on the other 4 folds and predict on the current fold; (3) collect out-of-fold predictions; (4) train a meta-learner on these OOF predictions. Compare against sklearn’s StackingClassifier.