Chapter 5: Resampling Methods
Overview
Resampling = repeatedly drawing samples from a training set and refitting a model
Two main purposes:
Model Assessment: Estimate test error rate
Model Selection: Choose optimal model complexity/parameters
Why Resample?
Traditional approach: Single train/test split
Estimates test error
But estimate has high variance (depends on random split)
Uses only portion of data for training
Resampling provides:
More reliable error estimates
Better model selection
Uncertainty quantification
Better use of available data
Key Methods
1. Cross-Validation (CV)
Validation Set Approach: Single holdout set (50/50 split)
Leave-One-Out CV (LOOCV): n iterations, hold out 1 observation
k-Fold CV: k iterations, hold out n/k observations
2. Bootstrap
Sample with replacement from original data
Estimate uncertainty (standard errors, confidence intervals)
Good for complex statistics
Comparison Table

| Method | Splits | Training Size | Computation | Variance | Bias |
|---|---|---|---|---|---|
| Validation Set | 1 | 50% | Fast | High | High |
| LOOCV | n | (n-1)/n ≈ 100% | Slow | High | Low |
| k-Fold (k=5,10) | k | (k-1)/k ≈ 80-90% | Moderate | Medium | Medium |
| Bootstrap | B | ~63% unique | Moderate | Medium | Medium |
5.1 Validation Set Approach
The Method
Randomly split data: 50% training, 50% validation
Fit model on training set
Evaluate on validation set
Test error estimate = validation error
Formula
\(\text{MSE}_{\text{val}} = \frac{1}{n_{\text{val}}} \sum_{i \in \text{val}} (y_i - \hat{y}_i)^2\)
Pros
✓ Simple and fast
✓ Easy to implement
✓ Good for very large datasets
Cons
✗ High variance (depends on random split)
✗ Wastes data (only 50% used for training)
✗ Validation error can vary significantly
✗ Tends to overestimate test error (smaller training set)
When to Use
Very large datasets (n > 10,000)
Initial quick assessment
Computational constraints
Proof of concept
# Imports used throughout this chapter
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     KFold, LeaveOneOut)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate polynomial regression data
np.random.seed(42)
n = 200
X = np.random.uniform(-3, 3, n)
true_function = lambda x: 2 + x - 0.5*x**2  # Quadratic
y = true_function(X) + np.random.normal(0, 1.5, n)
X_2d = X.reshape(-1, 1)

print("Validation Set Approach Demo")
print(f"\nTrue function: f(x) = 2 + x - 0.5x²")
print(f"Data: n = {n} observations")
# Try multiple random splits to show variability
n_splits = 10
degrees = range(1, 11)
validation_errors = {split_idx: [] for split_idx in range(n_splits)}

for split_idx in range(n_splits):
    # Different random split each time
    X_train, X_val, y_train, y_val = train_test_split(
        X_2d, y, test_size=0.5, random_state=split_idx)
    for degree in degrees:
        # Fit polynomial
        poly = PolynomialFeatures(degree=degree)
        X_train_poly = poly.fit_transform(X_train)
        X_val_poly = poly.transform(X_val)
        model = LinearRegression()
        model.fit(X_train_poly, y_train)
        y_pred = model.predict(X_val_poly)
        mse = mean_squared_error(y_val, y_pred)
        validation_errors[split_idx].append(mse)
# Visualize variability across splits
plt.figure(figsize=(14, 5))

# Individual splits
plt.subplot(1, 2, 1)
for split_idx in range(n_splits):
    plt.plot(degrees, validation_errors[split_idx], 'o-',
             alpha=0.4, linewidth=1, label=f'Split {split_idx+1}' if split_idx < 3 else '')
plt.axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Validation MSE')
plt.title('Validation Set Approach: High Variability\n(10 different random 50/50 splits)')
plt.legend()
plt.grid(True, alpha=0.3)

# Average across splits with error bars
plt.subplot(1, 2, 2)
avg_errors = [np.mean([validation_errors[s][i] for s in range(n_splits)])
              for i in range(len(degrees))]
std_errors = [np.std([validation_errors[s][i] for s in range(n_splits)])
              for i in range(len(degrees))]
plt.errorbar(degrees, avg_errors, yerr=std_errors, fmt='o-',
             capsize=5, linewidth=2, markersize=8, color='blue')
plt.axvline(x=2, color='g', linestyle='--', linewidth=2, label='True degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Average Validation MSE')
plt.title('Mean Validation Error ± Std Dev\n(High variability shows instability)')
plt.grid(True, alpha=0.3)
plt.legend()

plt.tight_layout()
plt.show()
# Detailed statistics
print("\nValidation Errors by Polynomial Degree\n")
print(f"{'Degree':<8} {'Mean MSE':<12} {'Std MSE':<12} {'Min MSE':<12} {'Max MSE':<12} {'Range'}")
print("="*80)
for i, deg in enumerate(degrees):
    deg_errors = [validation_errors[s][i] for s in range(n_splits)]
    mean_mse = np.mean(deg_errors)
    std_mse = np.std(deg_errors)
    min_mse = np.min(deg_errors)
    max_mse = np.max(deg_errors)
    range_mse = max_mse - min_mse
    print(f"{deg:<8} {mean_mse:>10.3f} {std_mse:>10.3f} {min_mse:>10.3f} {max_mse:>10.3f} {range_mse:>10.3f}")
print("\nKey Observations:")
print(" • Different random splits suggest different 'best' models")
print(" • High standard deviation shows instability")
print(" • Some splits suggest degree 1, others degree 3-5")
print(" • True degree = 2, but hard to identify reliably")
print(" • Large range shows sensitivity to random split")
print("\nWarning: Can't trust a single validation set!")
5.2 Leave-One-Out Cross-Validation (LOOCV)
The Method
For each observation i = 1, …, n:
Fit model on all observations except i
Predict for observation i
Calculate error: \((y_i - \hat{y}_i)^2\)
Repeat for all n observations
LOOCV Estimate
\(\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\)
Pros
✓ Less bias than validation set (uses n-1 observations)
✓ No randomness (deterministic)
✓ Almost unbiased estimate of test error
✓ Uses maximum training data
Cons
✗ Computationally expensive (n model fits)
✗ Impractical for large n
✗ Training sets highly overlapping → high variance of the error estimate
Special Case: Linear Models
For linear/polynomial regression, LOOCV has a shortcut:
\(\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2\)
where \(h_i\) is the leverage of observation i (the i-th diagonal entry of the hat matrix). The model only needs to be fit once!
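The shortcut can be checked numerically. A minimal sketch, assuming simple linear regression with an intercept (variable names are illustrative): the single-fit leverage formula reproduces brute-force LOOCV exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 30)
X = np.column_stack([np.ones_like(x), x])   # design matrix: intercept + slope
y = 2 + x + rng.normal(0, 1, 30)

# Single OLS fit, then leverages from the hat matrix H = X (X'X)^{-1} X'
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

cv_shortcut = np.mean(((y - y_hat) / (1 - h)) ** 2)

# Brute-force LOOCV for comparison: n separate fits
errs = []
for i in range(len(y)):
    mask = np.arange(len(y)) != i
    b = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ b) ** 2)
cv_brute = np.mean(errs)

print(cv_shortcut, cv_brute)  # the two estimates agree
```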
When to Use
Small datasets (n < 1000)
Need deterministic results
Linear models (can use shortcut)
Want nearly unbiased estimate
# LOOCV demonstration
print("Leave-One-Out Cross-Validation (LOOCV)\n")
# Use smaller dataset for LOOCV (computational reasons)
n_small = 100
X_small = np.random.uniform(-3, 3, n_small)
y_small = true_function(X_small) + np.random.normal(0, 1.5, n_small)
X_small_2d = X_small.reshape(-1, 1)
print(f"Dataset: n = {n_small} observations")
print(f"Method: Fit {n_small} models (each with {n_small-1} training points)\n")
# LOOCV for different polynomial degrees (manual loop)
loocv = LeaveOneOut()
loocv_errors = []
for degree in degrees:
    mse_scores = []
    for train_idx, test_idx in loocv.split(X_small_2d):
        X_train, X_test = X_small_2d[train_idx], X_small_2d[test_idx]
        y_train, y_test = y_small[train_idx], y_small[test_idx]
        # Fit polynomial
        poly = PolynomialFeatures(degree=degree)
        X_train_poly = poly.fit_transform(X_train)
        X_test_poly = poly.transform(X_test)
        model = LinearRegression()
        model.fit(X_train_poly, y_train)
        y_pred = model.predict(X_test_poly)
        mse_scores.append((y_test[0] - y_pred[0])**2)
    loocv_errors.append(np.mean(mse_scores))
# Using sklearn's built-in cross_val_score (faster)
loocv_errors_sklearn = []
for degree in degrees:
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X_small_2d)
    model = LinearRegression()
    # negative MSE (sklearn convention)
    scores = cross_val_score(model, X_poly, y_small,
                             cv=LeaveOneOut(), scoring='neg_mean_squared_error')
    loocv_errors_sklearn.append(-scores.mean())

# Compare with validation set approach on the same small dataset
val_errors_small = []
X_train_vs, X_val_vs, y_train_vs, y_val_vs = train_test_split(
    X_small_2d, y_small, test_size=0.5, random_state=42)
for degree in degrees:
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train_vs)
    X_val_poly = poly.transform(X_val_vs)
    model = LinearRegression()
    model.fit(X_train_poly, y_train_vs)
    y_pred = model.predict(X_val_poly)
    val_errors_small.append(mean_squared_error(y_val_vs, y_pred))
# Visualize comparison
plt.figure(figsize=(14, 5))
plt.subplot(1, 2, 1)
plt.plot(degrees, val_errors_small, 'o-', linewidth=2, markersize=8,
label='Validation Set (50/50)', color='orange')
plt.plot(degrees, loocv_errors_sklearn, 's-', linewidth=2, markersize=8,
label='LOOCV (deterministic)', color='blue')
plt.axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Estimated Test MSE')
plt.title('LOOCV vs Validation Set Approach')
plt.legend()
plt.grid(True, alpha=0.3)
# Show LOOCV stability
plt.subplot(1, 2, 2)
plt.plot(degrees, loocv_errors_sklearn, 'o-', linewidth=2, markersize=8, color='blue')
plt.axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
best_degree = degrees[np.argmin(loocv_errors_sklearn)]
plt.axvline(x=best_degree, color='r', linestyle='--', linewidth=2,
label=f'LOOCV selects degree {best_degree}')
plt.xlabel('Polynomial Degree')
plt.ylabel('LOOCV MSE')
plt.title('LOOCV Error Curve\n(Smooth, deterministic, no randomness)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Print results
print("\nError Comparison\n")
print(f"{'Degree':<8} {'Validation Set':<18} {'LOOCV':<18} {'Difference'}")
print("="*65)
for i, deg in enumerate(degrees):
    diff = abs(val_errors_small[i] - loocv_errors_sklearn[i])
    marker = '✓' if deg == 2 else ''
    print(f"{deg:<8} {val_errors_small[i]:>15.3f} {loocv_errors_sklearn[i]:>15.3f} {diff:>10.3f} {marker}")

print(f"\nAdvantages of LOOCV:")
print(f" • No randomness: same result every time")
print(f" • Smooth error curve (easier to identify minimum)")
print(f" • Less bias: uses {n_small-1}/{n_small} = {(n_small-1)/n_small:.1%} of data for training")
print(f" • Selected degree {best_degree} (closer to true degree 2)")
print(f"\nDrawback: required {n_small} fits per degree = {n_small * len(degrees)} model fits in total!")
5.3 k-Fold Cross-Validation
The Method
Randomly divide data into k equal-sized folds
For each fold i = 1, …, k:
Use fold i as validation set
Use remaining k-1 folds as training set
Compute error on fold i: \(\text{MSE}_i\)
Average the k errors
k-Fold CV Estimate
\(\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^k \text{MSE}_i\)
Common Choices
k = 5: Good balance, common default
k = 10: Most popular, well-studied
k = n: Equivalent to LOOCV
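The claim that k = n reduces to LOOCV is easy to verify with scikit-learn; a small sketch on synthetic data (names are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, (30, 1))
y = 2 + X[:, 0] + rng.normal(0, 1, 30)

model = LinearRegression()
# k-fold with k = n (no shuffling): each fold is a single observation
mse_kfold_n = -cross_val_score(model, X, y, cv=KFold(n_splits=len(y)),
                               scoring='neg_mean_squared_error').mean()
mse_loocv = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring='neg_mean_squared_error').mean()
print(mse_kfold_n, mse_loocv)  # the two estimates coincide
```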
Bias-Variance Tradeoff
Small k (e.g., k=5):
More bias (smaller training sets)
Less variance (less overlap between training sets)
Faster computation
Large k (e.g., k=10, k=n):
Less bias (larger training sets)
More variance (more overlap between training sets)
Slower computation
Pros
✓ Better bias-variance tradeoff than validation set or LOOCV
✓ Computationally efficient (only k fits)
✓ More stable than LOOCV
✓ Works for any model type
✓ Industry standard
Cons
✗ Some randomness (depends on fold assignment)
✗ More complex than validation set
When to Use
Almost always! Default choice for most problems
Any dataset size (n > 100)
Model selection
Hyperparameter tuning
# k-Fold CV demonstration
print("k-Fold Cross-Validation\n")

# Compare different values of k
k_values = [3, 5, 10, 20, n_small]  # Last one is LOOCV
cv_results = {}
for k in k_values:
    cv_errors = []
    for degree in degrees:
        poly = PolynomialFeatures(degree=degree)
        X_poly = poly.fit_transform(X_small_2d)
        model = LinearRegression()
        if k == n_small:
            cv = LeaveOneOut()
        else:
            cv = KFold(n_splits=k, shuffle=True, random_state=42)
        scores = cross_val_score(model, X_poly, y_small,
                                 cv=cv, scoring='neg_mean_squared_error')
        cv_errors.append(-scores.mean())
    cv_results[k] = cv_errors
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

# Plot the first three k values individually (fourth panel is the comparison)
for idx, k in enumerate(k_values[:3]):
    axes[idx].plot(degrees, cv_results[k], 'o-', linewidth=2, markersize=8)
    axes[idx].axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
    best_deg = degrees[np.argmin(cv_results[k])]
    axes[idx].axvline(x=best_deg, color='r', linestyle='--', linewidth=2,
                      label=f'Selected: degree {best_deg}')
    axes[idx].set_xlabel('Polynomial Degree')
    axes[idx].set_ylabel('CV Error')
    axes[idx].set_title(f'{k}-Fold CV\n(Train on {(k-1)/k:.1%} of data each fold)')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

# Comparison plot in the fourth panel
for k, errors in cv_results.items():
    label = f'LOOCV (k={k})' if k == n_small else f'{k}-Fold CV'
    axes[3].plot(degrees, errors, 'o-', linewidth=2, markersize=6, label=label)
axes[3].axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
axes[3].set_xlabel('Polynomial Degree')
axes[3].set_ylabel('CV Error')
axes[3].set_title('Comparison of Different k Values')
axes[3].legend()
axes[3].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
# Print detailed comparison
print("\nk-Fold CV Results\n")
header = f"{'Degree':<8}"
for k in k_values:
    k_label = f'k={k}' if k != n_small else 'LOOCV'
    header += f" {k_label:<12}"
print(header)
print("="*80)
for i, deg in enumerate(degrees):
    row = f"{deg:<8}"
    for k in k_values:
        row += f" {cv_results[k][i]:>10.3f} "
    marker = '✓' if deg == 2 else ''
    print(row + marker)

print("\nSelected Model by Each Method:\n")
for k in k_values:
    best_degree = degrees[np.argmin(cv_results[k])]
    k_label = f'{k}-Fold' if k != n_small else 'LOOCV'
    min_error = min(cv_results[k])
    print(f"{k_label:<15} -> Degree {best_degree} (MSE = {min_error:.3f})")

print("\nKey Insights:")
print(f" • All methods select a similar optimal degree")
print(f" • 5-fold and 10-fold CV are most common in practice")
print(f" • k=10 is often considered the 'sweet spot'")
print(f" • Computational cost: k={k_values[0]} needs {k_values[0]} fits per model, LOOCV needs {n_small}")
5.4 Cross-Validation for Classification
Adaptation for Classification
Instead of MSE, use:
Misclassification rate: \(\text{Err}_i = \frac{1}{n_i} \sum_{j \in C_i} I(y_j \neq \hat{y}_j)\)
Accuracy: 1 - error rate
Other metrics: AUC, F1, etc.
k-Fold CV for Classification
Stratified k-Fold
Ensure each fold has the same class proportions as the original data
Important for imbalanced datasets
Recommended for classification
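A quick sketch of why stratification matters, using hypothetical 90/10 imbalanced labels: with StratifiedKFold every validation fold keeps the original positive rate, whereas a plain shuffle can leave a fold with too few positives.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced dataset: 90% class 0, 10% class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # feature values don't affect the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # every validation fold preserves the original 10% positive rate
    print(f"fold {fold}: positive rate in validation = {y[val_idx].mean():.0%}")
```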
# Generate classification data
np.random.seed(42)
n_class = 200

# Binary classification with two features
X_class = np.random.randn(n_class, 2)
# Decision boundary: x1 + x2 > 0 (plus noise)
y_class = (X_class[:, 0] + X_class[:, 1] + np.random.randn(n_class)*0.5 > 0).astype(int)

print("Cross-Validation for Classification\n")
print(f"Dataset: n = {n_class}, 2 features, binary classification")
print(f"Class balance: {np.mean(y_class):.1%} positive, {1-np.mean(y_class):.1%} negative\n")

# Compare models using CV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

models_clf = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree (depth=3)': DecisionTreeClassifier(max_depth=3, random_state=42),
    'Decision Tree (depth=10)': DecisionTreeClassifier(max_depth=10, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'SVM (linear)': SVC(kernel='linear'),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5)
}
cv_results_clf = {}
for name, model in models_clf.items():
    # 10-fold CV
    scores = cross_val_score(model, X_class, y_class, cv=10, scoring='accuracy')
    cv_results_clf[name] = {
        'mean': scores.mean(),
        'std': scores.std(),
        'scores': scores
    }

# Print results
print("10-Fold CV Results:\n")
print(f"{'Model':<30} {'Mean Accuracy':<18} {'Std Dev':<12} {'Min':<10} {'Max'}")
print("="*90)
for name, results in sorted(cv_results_clf.items(), key=lambda x: x[1]['mean'], reverse=True):
    print(f"{name:<30} {results['mean']:>15.4f} {results['std']:>10.4f} "
          f"{results['scores'].min():>8.4f} {results['scores'].max():>8.4f}")
# Visualize CV scores
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Bar plot with error bars
names = list(cv_results_clf.keys())
means = [cv_results_clf[n]['mean'] for n in names]
stds = [cv_results_clf[n]['std'] for n in names]
axes[0].barh(range(len(names)), means, xerr=stds, capsize=5, alpha=0.7)
axes[0].set_yticks(range(len(names)))
axes[0].set_yticklabels(names)
axes[0].set_xlabel('10-Fold CV Accuracy')
axes[0].set_title('Model Comparison Using Cross-Validation')
axes[0].grid(True, alpha=0.3, axis='x')
axes[0].axvline(x=0.5, color='r', linestyle='--', alpha=0.5, label='Random guess')
axes[0].legend()
# Box plot showing distribution
scores_list = [cv_results_clf[n]['scores'] for n in names]
bp = axes[1].boxplot(scores_list, labels=names, vert=False, patch_artist=True)
for patch in bp['boxes']:
    patch.set_facecolor('lightblue')
axes[1].set_xlabel('Accuracy')
axes[1].set_title('Distribution of CV Scores (10 folds)')
axes[1].grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()
print("\nInterpretation:")
best_model = max(cv_results_clf.items(), key=lambda x: x[1]['mean'])[0]
print(f" • Best model: {best_model}")
print(f" • Error bars show variability across folds")
print(f" • Small std -> consistent performance")
print(f" • Large std -> performance depends on data split")
print(f" • Use CV to select model architecture/hyperparameters")
5.5 The Bootstrap
The Concept
Bootstrap = sampling with replacement from the original dataset
The Method
Draw n observations with replacement from the original data → bootstrap sample \(Z^{*1}\)
Compute the statistic of interest on \(Z^{*1}\) → \(\hat{\alpha}^{*1}\)
Repeat B times → \(\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, \dots, \hat{\alpha}^{*B}\)
Estimate the standard error: \(\text{SE}_B(\hat{\alpha}) = \sqrt{\frac{1}{B-1} \sum_{r=1}^B (\hat{\alpha}^{*r} - \bar{\alpha}^*)^2}\)
Key Properties
Each bootstrap sample contains ~63% of the unique observations
P(observation appears in a bootstrap sample) = \(1 - (1 - 1/n)^n \approx 0.632\)
Observations can appear multiple times
~37% of observations are not selected (out-of-bag)
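The ~63% figure follows from P(selected) = 1 - (1 - 1/n)^n and can be checked with a short simulation (a sketch; the sample size and repeat count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Fraction of distinct original observations in each bootstrap sample
fracs = [len(np.unique(rng.integers(0, n, n))) / n for _ in range(2000)]

print(f"simulated mean unique fraction: {np.mean(fracs):.3f}")
print(f"theoretical value 1-(1-1/n)^n:  {1 - (1 - 1/n)**n:.3f}")
```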
Uses
Standard errors: Estimate SE of any statistic
Confidence intervals: Percentile or normal-based
Model assessment: Out-of-bag error estimation
Ensemble methods: Bagging, Random Forests
Pros
✓ Works for any statistic (median, ratio, etc.)
✓ No distributional assumptions
✓ Provides uncertainty quantification
✓ Foundation for powerful ensemble methods
Cons
✗ Computationally intensive (need B = 1000+ samples)
✗ Not ideal for test error estimation (biased)
✗ Can fail for small samples
When to Use
Estimating standard errors of complex statistics
No analytical formula for SE available
Building confidence intervals
Ensemble learning (bagging, random forests)
# Bootstrap demonstration
from sklearn.utils import resample  # sampling with replacement

print("Bootstrap Resampling\n")

# Generate sample data
np.random.seed(42)
n_boot = 50
data_boot = np.random.exponential(scale=2, size=n_boot)  # Right-skewed distribution
print(f"Original sample: n = {n_boot}")
print(f"Sample mean: {data_boot.mean():.3f}")
print(f"Sample median: {np.median(data_boot):.3f}")
print(f"Sample std: {data_boot.std():.3f}\n")
# Bootstrap resampling
B = 1000  # Number of bootstrap samples
boot_means = []
boot_medians = []
boot_stds = []
for b in range(B):
    # Sample with replacement
    boot_sample = resample(data_boot, replace=True, n_samples=n_boot, random_state=b)
    boot_means.append(boot_sample.mean())
    boot_medians.append(np.median(boot_sample))
    boot_stds.append(boot_sample.std())

boot_means = np.array(boot_means)
boot_medians = np.array(boot_medians)
boot_stds = np.array(boot_stds)
# Bootstrap standard errors
se_mean = boot_means.std()
se_median = boot_medians.std()
se_std = boot_stds.std()
# 95% Confidence intervals (percentile method)
ci_mean = np.percentile(boot_means, [2.5, 97.5])
ci_median = np.percentile(boot_medians, [2.5, 97.5])
print(f"Bootstrap Results (B = {B} samples):\n")
print(f"{'Statistic':<15} {'Estimate':<12} {'Boot SE':<12} {'95% CI'}")
print("="*65)
print(f"{'Mean':<15} {data_boot.mean():>10.3f} {se_mean:>10.3f} [{ci_mean[0]:.3f}, {ci_mean[1]:.3f}]")
print(f"{'Median':<15} {np.median(data_boot):>10.3f} {se_median:>10.3f} [{ci_median[0]:.3f}, {ci_median[1]:.3f}]")
print(f"{'Std Dev':<15} {data_boot.std():>10.3f} {se_std:>10.3f} N/A")
# Visualize bootstrap distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Original data
axes[0, 0].hist(data_boot, bins=15, edgecolor='black', alpha=0.7)
axes[0, 0].axvline(data_boot.mean(), color='r', linestyle='--', linewidth=2, label=f'Mean = {data_boot.mean():.2f}')
axes[0, 0].axvline(np.median(data_boot), color='g', linestyle='--', linewidth=2, label=f'Median = {np.median(data_boot):.2f}')
axes[0, 0].set_xlabel('Value')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Original Data Distribution\n(Right-skewed)')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Bootstrap distribution of mean
axes[0, 1].hist(boot_means, bins=30, edgecolor='black', alpha=0.7, density=True)
axes[0, 1].axvline(data_boot.mean(), color='r', linestyle='-', linewidth=2, label='Sample mean')
axes[0, 1].axvline(ci_mean[0], color='orange', linestyle='--', linewidth=2, label='95% CI')
axes[0, 1].axvline(ci_mean[1], color='orange', linestyle='--', linewidth=2)
axes[0, 1].set_xlabel('Bootstrap Mean')
axes[0, 1].set_ylabel('Density')
axes[0, 1].set_title(f'Bootstrap Distribution of Mean\n(SE = {se_mean:.3f})')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Bootstrap distribution of median
axes[1, 0].hist(boot_medians, bins=30, edgecolor='black', alpha=0.7, density=True, color='green')
axes[1, 0].axvline(np.median(data_boot), color='darkgreen', linestyle='-', linewidth=2, label='Sample median')
axes[1, 0].axvline(ci_median[0], color='orange', linestyle='--', linewidth=2, label='95% CI')
axes[1, 0].axvline(ci_median[1], color='orange', linestyle='--', linewidth=2)
axes[1, 0].set_xlabel('Bootstrap Median')
axes[1, 0].set_ylabel('Density')
axes[1, 0].set_title(f'Bootstrap Distribution of Median\n(SE = {se_median:.3f})')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# Convergence of SE estimates
B_values = range(50, B+1, 50)
se_convergence = [boot_means[:b].std() for b in B_values]
axes[1, 1].plot(B_values, se_convergence, 'o-', linewidth=2, markersize=6)
axes[1, 1].axhline(se_mean, color='r', linestyle='--', linewidth=2, label=f'Final SE = {se_mean:.3f}')
axes[1, 1].set_xlabel('Number of Bootstrap Samples (B)')
axes[1, 1].set_ylabel('Estimated SE of Mean')
axes[1, 1].set_title('Convergence of Bootstrap SE Estimate')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nKey Observations:")
print(f" • Bootstrap distributions are approximately normal (CLT)")
print(f" • Even though the original data is skewed, the sampling distributions are symmetric")
print(f" • SE estimates stabilize around B = 500-1000")
print(f" • 95% CI: we are 95% confident the true parameter is in this range")
print(f" • Bootstrap SE for the median: {se_median:.3f} (no simple analytical formula!)")
print("\nBootstrap is powerful for statistics with unknown SE formulas")
Key Takeaways
1. Method Comparison

| Method | # Fits | Training % | Bias | Variance | Best For |
|---|---|---|---|---|---|
| Validation Set | 1 | 50% | High | High | Large n, quick check |
| LOOCV | n | 99% | Low | High | Small n, linear models |
| 5-Fold CV | 5 | 80% | Medium | Medium | General use |
| 10-Fold CV | 10 | 90% | Low | Medium | Industry standard |
| Bootstrap | B | 63% | Medium | Medium | SE estimation |
2. Choosing k in k-Fold CV
Small k (k=3,5):
✓ Faster computation
✓ Lower variance
✗ Higher bias
→ Use for large datasets or slow models
Large k (k=10,20):
✓ Lower bias
✗ Higher variance
✗ Slower
→ Use for an accurate error estimate
LOOCV (k=n):
✓ Lowest bias
✗ Highest variance
✗ Very slow
→ Use for small datasets (n < 1000)
3. Common Practices
Model selection: Use CV to choose hyperparameters
Final evaluation: After selecting model, evaluate on held-out test set
Classification: Use stratified k-fold (preserves class proportions)
Time series: Use forward-chaining CV (respect temporal order)
Typical choice: 10-fold CV (well-studied, good bias-variance tradeoff)
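For the time-series case above, scikit-learn's TimeSeriesSplit implements forward-chaining: each split trains only on observations that precede the validation block. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # the validation block always comes strictly after the training block
    print("train:", train_idx.tolist(), "validate:", test_idx.tolist())
```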
4. Bootstrap vs Cross-Validation
Cross-Validation:
Purpose: Estimate test error
Best for: Model selection
Each observation appears in exactly one validation fold
Bootstrap:
Purpose: Estimate uncertainty (SE, CI)
Best for: Standard errors, confidence intervals
Observations can appear multiple times in a sample
~37% of the data is left out of each bootstrap sample
5. Practical Workflow
# Step 1: Hold out a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Step 2: Use CV on the training set for model selection
cv_scores = {}
for name, model in candidate_models.items():
    cv_scores[name] = cross_val_score(model, X_train, y_train, cv=10).mean()

# Step 3: Refit the best model on the full training set
best_model = candidate_models[max(cv_scores, key=cv_scores.get)]
best_model.fit(X_train, y_train)

# Step 4: Final evaluation on the untouched test set
test_score = best_model.score(X_test, y_test)
6. Common Pitfalls
✗ Using the test set for model selection (data leakage!)
✗ Not stratifying in classification
✗ Using LOOCV with very large n (too slow)
✗ Comparing models on different CV splits
✗ Using the bootstrap for test error estimation (biased)
✗ Too few bootstrap samples (B < 1000)
Next Chapter
Chapter 6: Linear Model Selection and Regularization
Subset Selection (best subset, forward, backward)
Shrinkage Methods (Ridge, Lasso)
Dimension Reduction (PCR, PLS)
Practice Exercises
Exercise 1: Conceptual Understanding
Why does validation set approach have high variance?
Why does LOOCV have lower bias than 5-fold CV?
Why does LOOCV have higher variance than 5-fold CV?
When would you prefer 5-fold over 10-fold CV?
Exercise 2: LOOCV Calculation
Given data: X = [1, 2, 3, 4, 5], Y = [2, 4, 5, 4, 5]
Perform LOOCV for simple linear regression
Calculate CV error
Compare with training error
Exercise 3: k-Fold Implementation
Implement 5-fold CV from scratch:
Divide data into 5 folds
For each fold, train and evaluate
Compute average error
Compare with sklearn's cross_val_score
Exercise 4: Bootstrap for Confidence Interval
Given sample: [3.2, 5.1, 4.8, 6.2, 3.9, 4.5, 5.8, 4.2]
Compute 95% CI for mean using bootstrap (B=1000)
Compute 95% CI for median
Compare bootstrap SE with analytical SE for mean
Exercise 5: Model Selection with CV
Generate polynomial data and:
Use 10-fold CV to select optimal degree (1-10)
Plot CV error vs degree
Compare with validation set approach
Which method is more stable?
Exercise 6: Classification CV
For a classification problem:
Compare regular k-fold vs stratified k-fold
Use imbalanced data (90% class 0, 10% class 1)
Show why stratification matters
Compute accuracy, precision, recall for each fold