Ensemble Methods: Bagging, Boosting & Stacking
Random forests, gradient boosting, XGBoost, and stacking: why combining models almost always wins.
1. Why Ensembles Work: The Wisdom of Crowds
A single model's expected error decomposes as bias² + variance + irreducible noise. Ensembles reduce the variance term by averaging out individual errors, provided the models are diverse (they make different mistakes).
Thought experiment: If 100 independent classifiers each have 70% accuracy, what accuracy does a majority vote achieve?
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
def majority_vote_accuracy(n_classifiers, individual_accuracy):
"""Probability that majority (>n/2) vote correctly when each has p accuracy."""
majority = n_classifiers // 2 + 1
return sum(binom.pmf(k, n_classifiers, individual_accuracy)
for k in range(majority, n_classifiers + 1))
n_values = [1, 3, 5, 11, 21, 51, 101, 201]
accuracies = [0.6, 0.7, 0.8]
fig, ax = plt.subplots(figsize=(9, 5))
for p in accuracies:
ensemble_accs = [majority_vote_accuracy(n, p) for n in n_values]
ax.plot(n_values, ensemble_accs, 'o-', label=f'Individual accuracy = {p}')
ax.axhline(p, linestyle='--', alpha=0.4)
ax.set_xlabel('Number of classifiers in ensemble')
ax.set_ylabel('Ensemble accuracy (majority vote)')
ax.set_title('Wisdom of Crowds: Ensemble accuracy improves with N')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"1 classifier at 70% accuracy: {majority_vote_accuracy(1, 0.7):.4f}")
print(f"11 classifiers at 70% accuracy: {majority_vote_accuracy(11, 0.7):.4f}")
print(f"101 classifiers at 70% accuracy: {majority_vote_accuracy(101, 0.7):.4f}")
2. Bagging: Random Forest
Bagging = Bootstrap AGGregatING. Train each tree on a random bootstrap sample of data. At each split, consider only a random subset of features.
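As a quick sanity check on the bootstrap idea: a bootstrap sample of size n contains only about 63.2% of the unique rows (the limit of 1 − (1 − 1/n)ⁿ), which is what gives each tree a different view of the data. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# A bootstrap sample: n draws with replacement from n row indices
sample = rng.integers(0, n, size=n)
unique_frac = np.unique(sample).size / n

# In the limit, P(a row appears at least once) = 1 - (1 - 1/n)^n ≈ 1 - 1/e ≈ 0.632
print(f"Fraction of unique rows in bootstrap sample: {unique_frac:.3f}")
print(f"Theoretical limit 1 - 1/e:                   {1 - np.exp(-1):.3f}")
```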
Key hyperparameters and their effects:
import warnings
warnings.filterwarnings('ignore')
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score
X, y = make_classification(
n_samples=1000, n_features=20, n_informative=10,
n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Effect of n_estimators (number of trees)
n_trees_range = [1, 5, 10, 25, 50, 100, 200]
test_scores = []
for n in n_trees_range:
rf = RandomForestClassifier(n_estimators=n, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
test_scores.append(accuracy_score(y_test, rf.predict(X_test)))
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(n_trees_range, test_scores, 'o-', color='green')
ax.set_xlabel('n_estimators (number of trees)')
ax.set_ylabel('Test Accuracy')
ax.set_title('Random Forest: Effect of n_estimators')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Key RF hyperparameters:")
print(" n_estimators : More trees = more stable (diminishing returns after ~100-200)")
print(" max_depth : None = fully grown trees (high variance), small = pruned (high bias)")
print(" max_features : sqrt(n_features) default for clf, n_features/3 for regression")
print(" min_samples_leaf: Increase to reduce overfitting on noisy data")
print(" bootstrap : True (default) = bagging; False = pasting")
3. Gradient Boosting: sklearn GradientBoostingClassifier
Boosting trains trees sequentially: each new tree corrects the errors of the current ensemble by fitting its residuals. As a result, the trees are not diverse by design; they are specialists.
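To make "fitting the residuals" concrete, here is a minimal from-scratch sketch of gradient boosting for regression with squared loss, where the negative gradient is exactly the residual. This is illustrative only; real implementations add per-leaf shrinkage, subsampling, and regularization.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

X_r, y_r = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
prediction = np.full_like(y_r, y_r.mean())  # stage 0: constant model
trees = []
for _ in range(100):
    residuals = y_r - prediction            # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_r, residuals)                # each tree is a specialist on current errors
    prediction += learning_rate * tree.predict(X_r)
    trees.append(tree)

mse = np.mean((y_r - prediction) ** 2)
print(f"Training MSE after 100 stages: {mse:.2f} (variance of y: {np.var(y_r):.2f})")
```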
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
gbm = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1, # shrinkage: how much each tree contributes
max_depth=3, # shallow trees are conventional for GBM
subsample=0.8, # stochastic GBM: use 80% of data per tree
random_state=42
)
gbm_scores = cross_val_score(gbm, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1)
print(f"GradientBoosting CV: {gbm_scores.mean():.4f} ± {gbm_scores.std():.4f}")
# Fit to see training vs test evolution
gbm.fit(X_train, y_train)
# staged_predict: class predictions after each boosting stage
train_errors = []
test_errors = []
for y_pred_staged in gbm.staged_predict(X_train):
train_errors.append(1 - accuracy_score(y_train, y_pred_staged))
for y_pred_staged in gbm.staged_predict(X_test):
test_errors.append(1 - accuracy_score(y_test, y_pred_staged))
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(train_errors, label='Train error', color='blue')
ax.plot(test_errors, label='Test error', color='orange')
ax.set_xlabel('Boosting iterations')
ax.set_ylabel('Error rate')
ax.set_title('GradientBoosting: Training vs Test Error per Stage')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nBest test error at iteration: {np.argmin(test_errors) + 1}")
print("Tip: Use early stopping (n_iter_no_change in sklearn, early_stopping_rounds in XGBoost/LightGBM) to find the optimal n_estimators.")
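The sklearn version of early stopping can be sketched with n_iter_no_change and validation_fraction, on a fresh synthetic dataset generated here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_es, y_es = make_classification(n_samples=1000, n_features=20,
                                 n_informative=10, random_state=42)

# Hold out validation_fraction internally; stop when the validation score
# has not improved for n_iter_no_change consecutive stages
gbm_es = GradientBoostingClassifier(
    n_estimators=1000,          # generous upper bound
    learning_rate=0.1,
    n_iter_no_change=10,
    validation_fraction=0.2,
    random_state=42,
)
gbm_es.fit(X_es, y_es)
print(f"Stopped after {gbm_es.n_estimators_} of 1000 stages")
```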
4. XGBoost vs sklearn GBM: Speed Comparison
import time
from sklearn.datasets import make_classification
# A larger dataset so timing differences are visible
X_big, y_big = make_classification(n_samples=5000, n_features=30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_big, y_big, test_size=0.2, random_state=42)
# sklearn GBM
gbm_model = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
t0 = time.time()
gbm_model.fit(X_tr, y_tr)
t_gbm = time.time() - t0
gbm_acc = accuracy_score(y_te, gbm_model.predict(X_te))
# Try XGBoost
try:
import xgboost as xgb
xgb_model = xgb.XGBClassifier(
n_estimators=100, max_depth=3,
learning_rate=0.1,
eval_metric='logloss',
random_state=42,
verbosity=0
)
t0 = time.time()
xgb_model.fit(X_tr, y_tr)
t_xgb = time.time() - t0
xgb_acc = accuracy_score(y_te, xgb_model.predict(X_te))
print(f"sklearn GBM: {t_gbm:.3f}s, accuracy={gbm_acc:.4f}")
print(f"XGBoost: {t_xgb:.3f}s, accuracy={xgb_acc:.4f}")
print(f"Speedup: {t_gbm/t_xgb:.1f}x faster")
xgb_available = True
except ImportError:
print("XGBoost not installed. Install with: pip install xgboost")
print(f"\nsklearn GBM: {t_gbm:.3f}s, accuracy={gbm_acc:.4f}")
print("\nXGBoost advantages (when installed):")
print(" - 5-10x faster due to histogram-based split finding")
print(" - Built-in L1/L2 regularization (reg_alpha, reg_lambda)")
print(" - Native handling of missing values")
print(" - early_stopping_rounds parameter")
xgb_available = False
5. Feature Importance: RF vs GBM
import pandas as pd
from sklearn.datasets import load_breast_cancer
bc = load_breast_cancer()
X_bc, y_bc = bc.data, bc.target
feature_names = bc.feature_names
rf_bc = RandomForestClassifier(n_estimators=100, random_state=42)
gbm_bc = GradientBoostingClassifier(n_estimators=100, random_state=42)
rf_bc.fit(X_bc, y_bc)
gbm_bc.fit(X_bc, y_bc)
# Top 10 features
top_n = 10
rf_imp = pd.Series(rf_bc.feature_importances_, index=feature_names).nlargest(top_n)
gbm_imp = pd.Series(gbm_bc.feature_importances_, index=feature_names).nlargest(top_n)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, imp, title, color in [
(axes[0], rf_imp, 'Random Forest', 'steelblue'),
(axes[1], gbm_imp, 'Gradient Boosting', 'darkorange')
]:
ax.barh(range(top_n), imp.values[::-1], color=color, alpha=0.8)
ax.set_yticks(range(top_n))
ax.set_yticklabels(imp.index[::-1])
ax.set_xlabel('Feature Importance (impurity-based)')
ax.set_title(f'{title} Feature Importance')
ax.grid(True, axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
print("Caution: Impurity-based importance is biased toward high-cardinality features.")
print("Prefer permutation importance for model-agnostic, unbiased estimates (see notebook 05).")
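For comparison, a minimal permutation-importance sketch using sklearn.inspection.permutation_importance, refit here on a fresh train/validation split of the same dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_imp_tr, X_imp_val, y_imp_tr, y_imp_val = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42)

rf_imp_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_imp_model.fit(X_imp_tr, y_imp_tr)

# Shuffle each feature on held-out data; the resulting accuracy drop is
# that feature's importance, measured on data the model never trained on
result = permutation_importance(rf_imp_model, X_imp_val, y_imp_val,
                                n_repeats=10, random_state=42, n_jobs=-1)
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:25s} {result.importances_mean[i]:.4f} "
          f"± {result.importances_std[i]:.4f}")
```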
6. Stacking with StackingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Base models (level 0)
estimators = [
('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
('svm', make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
('knn', make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
]
# Meta-learner (level 1): learns how to combine base model predictions
stacking_clf = StackingClassifier(
estimators=estimators,
final_estimator=LogisticRegression(random_state=42),
cv=5, # uses CV predictions (out-of-fold) to train meta-learner
stack_method='predict_proba' # pass probabilities, not just class labels
)
# Compare all methods
models = {
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
'SVM': make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)),
'KNN': make_pipeline(StandardScaler(), KNeighborsClassifier()),
'Stacking': stacking_clf
}
print("Model Comparison (5-fold CV on breast cancer dataset):")
results = {}
for name, model in models.items():
scores = cross_val_score(model, X_bc, y_bc, cv=5, scoring='accuracy', n_jobs=-1)
results[name] = scores
print(f" {name:25s}: {scores.mean():.4f} ± {scores.std():.4f}")
7. When to Use What

| Method | Use When | Avoid When |
|---|---|---|
| Random Forest | Fast prototyping, feature importance needed, noisy data | Need maximum accuracy, very sparse data |
| GradientBoosting | Tabular data competitions, need best accuracy | Training time is critical, high-dimensional sparse data |
| XGBoost/LightGBM | Large tabular datasets, Kaggle competitions | Tiny datasets (<100 rows), image/text (use DL) |
| Stacking | Squeezing out the last 1-2% of accuracy, diverse base models | Limited training data, need fast inference |
| Voting | Quick ensemble of already-tuned models | When models are all the same type |
Practical rule of thumb: For tabular data, try GradientBoostingClassifier or XGBoost first. They almost always outperform Random Forest when tuned.
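The Voting row in the table can be illustrated with a short soft-voting sketch on the breast cancer dataset (the base model choices here are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_v, y_v = load_breast_cancer(return_X_y=True)

# Soft voting averages predicted probabilities across already-reasonable models;
# it needs no meta-learner training, unlike stacking
voting = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gbm', GradientBoostingClassifier(n_estimators=100, random_state=42)),
        ('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))),
    ],
    voting='soft',
)
voting_scores = cross_val_score(voting, X_v, y_v, cv=5, scoring='accuracy', n_jobs=-1)
print(f"Soft voting CV accuracy: {voting_scores.mean():.4f} ± {voting_scores.std():.4f}")
```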
Exercises
1. **Diversity experiment:** Build 3 identical Random Forest models (same hyperparameters, different `random_state`). Does stacking them improve over a single RF? Why or why not?
2. **Hyperparameter tuning:** Use `RandomizedSearchCV` to tune a `GradientBoostingClassifier` on the breast cancer dataset. Key parameters to tune: `n_estimators`, `learning_rate`, `max_depth`, `subsample`, `min_samples_leaf`. Report the best CV score.
3. **Bias-variance tradeoff:** Plot train and test accuracy for `RandomForestClassifier` as `max_depth` varies from 1 to None. At what depth does the model transition from underfitting to overfitting?
4. **Feature importance stability:** Run `RandomForestClassifier` 20 times with different `random_state` values. Plot the standard deviation of feature importance across runs. Which features have stable importance? Which are unstable?
5. **Stacking with OOF:** Manually implement 2-level stacking: (1) split the data into 5 folds; (2) for each fold, train the base models on the other 4 folds and predict on the current fold; (3) collect the out-of-fold predictions; (4) train a meta-learner on these OOF predictions. Compare against sklearn's `StackingClassifier`.