Chapter 5: Resampling Methods

Overview

Resampling = repeatedly drawing samples from a training set and refitting a model

Two main purposes:

  1. Model Assessment: Estimate test error rate

  2. Model Selection: Choose optimal model complexity/parameters

Why Resample?

Traditional approach: Single train/test split

  • Estimates test error

  • But estimate has high variance (depends on random split)

  • Uses only portion of data for training

Resampling provides:

  • More reliable error estimates

  • Better model selection

  • Uncertainty quantification

  • Better use of available data

Key Methods

1. Cross-Validation (CV)

  • Validation Set Approach: Single holdout set (50/50 split)

  • Leave-One-Out CV (LOOCV): n iterations, hold out 1 observation

  • k-Fold CV: k iterations, hold out n/k observations

2. Bootstrap

  • Sample with replacement from original data

  • Estimate uncertainty (standard errors, confidence intervals)

  • Good for complex statistics

Comparison Table

Method            Splits   Training Size       Computation   Variance   Bias
Validation Set    1        50%                 Fast          High       High
LOOCV             n        (n-1)/n ≈ 100%      Slow          High       Low
k-Fold (k=5,10)   k        (k-1)/k ≈ 80-90%    Moderate      Medium     Medium
Bootstrap         B        ~63% unique         Moderate      Medium     Medium

5.1 Validation Set Approach

The Method

  1. Randomly split data: 50% training, 50% validation

  2. Fit model on training set

  3. Evaluate on validation set

  4. Test error estimate = validation error

Formula

\[\text{MSE}_{\text{val}} = \frac{1}{n_{\text{val}}} \sum_{i \in \text{val}} (y_i - \hat{f}(x_i))^2\]

Pros

  • ✅ Simple and fast

  • ✅ Easy to implement

  • ✅ Good for very large datasets

Cons

  • ❌ High variance (depends on random split)

  • ❌ Wastes data (only 50% used for training)

  • ❌ Validation error can vary significantly

  • ❌ Tends to overestimate test error (smaller training set)

When to Use

  • Very large datasets (n > 10,000)

  • Initial quick assessment

  • Computational constraints

  • Proof of concept

# Imports used throughout this chapter's code examples
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     KFold, LeaveOneOut)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample

# Generate polynomial regression data
np.random.seed(42)
n = 200
X = np.random.uniform(-3, 3, n)
true_function = lambda x: 2 + x - 0.5*x**2  # Quadratic
y = true_function(X) + np.random.normal(0, 1.5, n)

X_2d = X.reshape(-1, 1)

print("📊 Validation Set Approach Demo")
print(f"\nTrue function: f(x) = 2 + x - 0.5x²")
print(f"Data: n = {n} observations")

# Try multiple random splits to show variability
n_splits = 10
degrees = range(1, 11)
validation_errors = {split_idx: [] for split_idx in range(n_splits)}

for split_idx in range(n_splits):
    # Different random split each time
    X_train, X_val, y_train, y_val = train_test_split(
        X_2d, y, test_size=0.5, random_state=split_idx)
    
    for degree in degrees:
        # Fit polynomial
        poly = PolynomialFeatures(degree=degree)
        X_train_poly = poly.fit_transform(X_train)
        X_val_poly = poly.transform(X_val)
        
        model = LinearRegression()
        model.fit(X_train_poly, y_train)
        y_pred = model.predict(X_val_poly)
        
        mse = mean_squared_error(y_val, y_pred)
        validation_errors[split_idx].append(mse)

# Visualize variability across splits
plt.figure(figsize=(14, 5))

# Individual splits
plt.subplot(1, 2, 1)
for split_idx in range(n_splits):
    plt.plot(degrees, validation_errors[split_idx], 'o-', 
            alpha=0.4, linewidth=1, label=f'Split {split_idx+1}' if split_idx < 3 else '')
# Draw the reference line before calling legend() so its label appears in the legend
plt.axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Validation MSE')
plt.title('Validation Set Approach: High Variability\n(10 different random 50/50 splits)')
plt.legend()
plt.grid(True, alpha=0.3)

# Average across splits with error bars
plt.subplot(1, 2, 2)
avg_errors = [np.mean([validation_errors[s][i] for s in range(n_splits)]) 
              for i in range(len(degrees))]
std_errors = [np.std([validation_errors[s][i] for s in range(n_splits)]) 
              for i in range(len(degrees))]

plt.errorbar(degrees, avg_errors, yerr=std_errors, fmt='o-', 
            capsize=5, linewidth=2, markersize=8, color='blue')
plt.xlabel('Polynomial Degree')
plt.ylabel('Average Validation MSE')
plt.title('Mean Validation Error ± Std Dev\n(High variability shows instability)')
plt.grid(True, alpha=0.3)
plt.axvline(x=2, color='g', linestyle='--', linewidth=2, label='True degree')
plt.legend()

plt.tight_layout()
plt.show()

# Detailed statistics
print("\n📊 Validation Errors by Polynomial Degree\n")
print(f"{'Degree':<8} {'Mean MSE':<12} {'Std MSE':<12} {'Min MSE':<12} {'Max MSE':<12} {'Range'}")
print("="*80)
for i, deg in enumerate(degrees):
    deg_errors = [validation_errors[s][i] for s in range(n_splits)]
    mean_mse = np.mean(deg_errors)
    std_mse = np.std(deg_errors)
    min_mse = np.min(deg_errors)
    max_mse = np.max(deg_errors)
    range_mse = max_mse - min_mse
    print(f"{deg:<8} {mean_mse:>10.3f}   {std_mse:>10.3f}   {min_mse:>10.3f}   {max_mse:>10.3f}   {range_mse:>10.3f}")

print("\n💡 Key Observations:")
print("   • Different random splits suggest different 'best' models")
print("   • High standard deviation shows instability")
print("   • Some splits suggest degree 1, others degree 3-5")
print("   • True degree = 2, but hard to identify reliably")
print("   • Large range shows sensitivity to random split")
print("\n⚠️ Problem: Can't trust a single validation set!")

5.2 Leave-One-Out Cross-Validation (LOOCV)

The Method

For each observation i = 1, …, n:

  1. Fit model on all observations except i

  2. Predict for observation i

  3. Calculate error: \((y_i - \hat{y}_i)^2\)

  4. Repeat for all n observations

LOOCV Estimate

\[\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \text{MSE}_i = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\]

Pros

  • ✅ Less bias than validation set (uses n-1 observations)

  • ✅ No randomness (deterministic)

  • ✅ Almost unbiased estimate of test error

  • ✅ Uses maximum training data

Cons

  • ❌ Computationally expensive (n model fits)

  • ❌ Can have high variance (similar training sets)

  • ❌ Impractical for large n

  • ❌ Training sets highly correlated β†’ high variance

Special Case: Linear Models

For linear/polynomial regression, LOOCV has a shortcut:

\[\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2\]

where \(h_i\) is the leverage (the i-th diagonal element of the hat matrix) and \(\hat{y}_i\) is the fitted value from the model trained once on all n observations. The model only needs to be fit once!
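This identity is easy to verify numerically. A minimal sketch with plain NumPy (synthetic quadratic data; variable names here are illustrative): fit the model once, read the leverages off the hat matrix, and compare against the brute-force n-fit LOOCV.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
X = rng.uniform(-3, 3, n)
y = 2 + X - 0.5 * X**2 + rng.normal(0, 1.5, n)

# Design matrix for a degree-2 polynomial fit
Xd = np.column_stack([np.ones(n), X, X**2])

# Fit once; hat matrix H = Xd (Xd'Xd)^{-1} Xd'
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
H = Xd @ np.linalg.solve(Xd.T @ Xd, Xd.T)
h = np.diag(H)                 # leverages h_i
resid = y - Xd @ beta          # ordinary residuals from the single full fit

# Shortcut: CV_(n) = mean( ((y_i - yhat_i) / (1 - h_i))^2 )
cv_shortcut = np.mean((resid / (1 - h))**2)

# Brute force: n separate fits, each leaving one observation out
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b, *_ = np.linalg.lstsq(Xd[mask], y[mask], rcond=None)
    errs.append((y[i] - Xd[i] @ b)**2)
cv_brute = np.mean(errs)

print(cv_shortcut, cv_brute)   # agree up to floating-point error
```

For least-squares fits the identity is exact, so the two numbers match to machine precision.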

When to Use

  • Small datasets (n < 1000)

  • Need deterministic results

  • Linear models (can use shortcut)

  • Want nearly unbiased estimate

# LOOCV demonstration
print("📊 Leave-One-Out Cross-Validation (LOOCV)\n")

# Use smaller dataset for LOOCV (computational reasons)
n_small = 100
X_small = np.random.uniform(-3, 3, n_small)
y_small = true_function(X_small) + np.random.normal(0, 1.5, n_small)
X_small_2d = X_small.reshape(-1, 1)

print(f"Dataset: n = {n_small} observations")
print(f"Method: Fit {n_small} models (each with {n_small-1} training points)\n")

# LOOCV for different polynomial degrees
loocv = LeaveOneOut()
loocv_errors = []

for degree in degrees:
    mse_scores = []
    
    for train_idx, test_idx in loocv.split(X_small_2d):
        X_train, X_test = X_small_2d[train_idx], X_small_2d[test_idx]
        y_train, y_test = y_small[train_idx], y_small[test_idx]
        
        # Fit polynomial
        poly = PolynomialFeatures(degree=degree)
        X_train_poly = poly.fit_transform(X_train)
        X_test_poly = poly.transform(X_test)
        
        model = LinearRegression()
        model.fit(X_train_poly, y_train)
        y_pred = model.predict(X_test_poly)
        
        mse = (y_test[0] - y_pred[0])**2
        mse_scores.append(mse)
    
    loocv_mse = np.mean(mse_scores)
    loocv_errors.append(loocv_mse)

# Using sklearn's built-in cross_val_score (faster)
loocv_errors_sklearn = []
for degree in degrees:
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X_small_2d)
    model = LinearRegression()
    
    # negative MSE (sklearn convention)
    scores = cross_val_score(model, X_poly, y_small, 
                            cv=LeaveOneOut(), scoring='neg_mean_squared_error')
    loocv_errors_sklearn.append(-scores.mean())

# Compare with validation set (from before, using small dataset)
val_errors_small = []
X_train_vs, X_val_vs, y_train_vs, y_val_vs = train_test_split(
    X_small_2d, y_small, test_size=0.5, random_state=42)

for degree in degrees:
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train_vs)
    X_val_poly = poly.transform(X_val_vs)
    model = LinearRegression()
    model.fit(X_train_poly, y_train_vs)
    y_pred = model.predict(X_val_poly)
    val_errors_small.append(mean_squared_error(y_val_vs, y_pred))

# Visualize comparison
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
plt.plot(degrees, val_errors_small, 'o-', linewidth=2, markersize=8, 
        label='Validation Set (50/50)', color='orange')
plt.plot(degrees, loocv_errors_sklearn, 's-', linewidth=2, markersize=8, 
        label='LOOCV (deterministic)', color='blue')
plt.axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Estimated Test MSE')
plt.title('LOOCV vs Validation Set Approach')
plt.legend()
plt.grid(True, alpha=0.3)

# Show LOOCV stability
plt.subplot(1, 2, 2)
plt.plot(degrees, loocv_errors_sklearn, 'o-', linewidth=2, markersize=8, color='blue')
plt.axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
best_degree = degrees[np.argmin(loocv_errors_sklearn)]
plt.axvline(x=best_degree, color='r', linestyle='--', linewidth=2, 
           label=f'LOOCV selects degree {best_degree}')
plt.xlabel('Polynomial Degree')
plt.ylabel('LOOCV MSE')
plt.title('LOOCV Error Curve\n(Smooth, deterministic, no randomness)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print results
print("\n📊 Error Comparison\n")
print(f"{'Degree':<8} {'Validation Set':<18} {'LOOCV':<18} {'Difference'}")
print("="*65)
for i, deg in enumerate(degrees):
    diff = abs(val_errors_small[i] - loocv_errors_sklearn[i])
    marker = '✅' if deg == 2 else ''
    print(f"{deg:<8} {val_errors_small[i]:>15.3f}    {loocv_errors_sklearn[i]:>15.3f}    {diff:>10.3f}  {marker}")

print(f"\n💡 Advantages of LOOCV:")
print(f"   • No randomness: same result every time")
print(f"   • Smooth error curve (easier to identify minimum)")
print(f"   • Less bias: uses {n_small-1}/{n_small} = {(n_small-1)/n_small:.1%} of data for training")
print(f"   • Selected degree {best_degree} (closer to true degree 2)")
print(f"\n⚠️ Drawback: Required {n_small} × {len(degrees)} = {n_small * len(degrees)} model fits!")

5.3 k-Fold Cross-Validation

The Method

  1. Randomly divide data into k equal-sized folds

  2. For each fold i = 1, …, k:

    • Use fold i as validation set

    • Use remaining k-1 folds as training set

    • Compute error on fold i: \(\text{MSE}_i\)

  3. Average the k errors

k-Fold CV Estimate

\[\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^k \text{MSE}_i\]

Common Choices

  • k = 5: Good balance, common default

  • k = 10: Most popular, well-studied

  • k = n: Equivalent to LOOCV
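The k = n case can be checked directly: without shuffling, scikit-learn's KFold with n splits produces exactly the same one-observation test sets as LeaveOneOut. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(10).reshape(-1, 1)   # 10 observations

kf_tests = [int(test[0]) for _, test in KFold(n_splits=len(X)).split(X)]
loo_tests = [int(test[0]) for _, test in LeaveOneOut().split(X)]

print(kf_tests == loo_tests)  # True: n-fold CV (no shuffle) is LOOCV
```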

Bias-Variance Tradeoff

  • Small k (e.g., k=5):

    • More bias (smaller training sets)

    • Less variance (less overlap between folds)

    • Faster computation

  • Large k (e.g., k=10, k=n):

    • Less bias (larger training sets)

    • More variance (more overlap between folds)

    • Slower computation

Pros

  • ✅ Better bias-variance tradeoff than validation set or LOOCV

  • ✅ Computationally efficient (only k fits)

  • ✅ More stable than LOOCV

  • ✅ Works for any model type

  • ✅ Industry standard

Cons

  • ❌ Some randomness (depends on fold assignment)

  • ❌ More complex than validation set

When to Use

  • Almost always! Default choice for most problems

  • Any dataset size (n > 100)

  • Model selection

  • Hyperparameter tuning

# k-Fold CV demonstration
print("📊 k-Fold Cross-Validation\n")

# Compare different values of k
k_values = [3, 5, 10, 20, n_small]  # Last one is LOOCV
cv_results = {}

for k in k_values:
    cv_errors = []
    
    for degree in degrees:
        poly = PolynomialFeatures(degree=degree)
        X_poly = poly.fit_transform(X_small_2d)
        model = LinearRegression()
        
        if k == n_small:
            cv = LeaveOneOut()
        else:
            cv = KFold(n_splits=k, shuffle=True, random_state=42)
        
        scores = cross_val_score(model, X_poly, y_small, 
                                cv=cv, scoring='neg_mean_squared_error')
        cv_errors.append(-scores.mean())
    
    cv_results[k] = cv_errors

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

# Plot the first three k values individually (axes[3] is reserved for the comparison plot)
for idx, k in enumerate(k_values[:3]):
    axes[idx].plot(degrees, cv_results[k], 'o-', linewidth=2, markersize=8)
    axes[idx].axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
    best_deg = degrees[np.argmin(cv_results[k])]
    axes[idx].axvline(x=best_deg, color='r', linestyle='--', linewidth=2, 
                     label=f'Selected: degree {best_deg}')
    axes[idx].set_xlabel('Polynomial Degree')
    axes[idx].set_ylabel('CV Error')
    axes[idx].set_title(f'{k}-Fold CV\n(Train on {(k-1)/k:.1%} of data each fold)')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

# Comparison plot
for k, errors in cv_results.items():
    label = f'LOOCV (k={k})' if k == n_small else f'{k}-Fold CV'
    axes[3].plot(degrees, errors, 'o-', linewidth=2, markersize=6, label=label)

axes[3].axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
axes[3].set_xlabel('Polynomial Degree')
axes[3].set_ylabel('CV Error')
axes[3].set_title('Comparison of Different k Values')
axes[3].legend()
axes[3].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print detailed comparison
print("\n📊 k-Fold CV Results\n")
header = f"{'Degree':<8}"
for k in k_values:
    k_label = f'k={k}' if k != n_small else 'LOOCV'
    header += f" {k_label:<12}"
print(header)
print("="*80)

for i, deg in enumerate(degrees):
    row = f"{deg:<8}"
    for k in k_values:
        row += f" {cv_results[k][i]:>10.3f}  "
    marker = '✅' if deg == 2 else ''
    print(row + marker)

print("\n📈 Selected Model by Each Method:\n")
for k in k_values:
    best_degree = degrees[np.argmin(cv_results[k])]
    k_label = f'{k}-Fold' if k != n_small else 'LOOCV'
    min_error = min(cv_results[k])
    print(f"{k_label:<15} β†’ Degree {best_degree}  (MSE = {min_error:.3f})")

print("\n💡 Key Insights:")
print(f"   • All methods select similar optimal degree")
print(f"   • 5-fold and 10-fold CV are most common in practice")
print(f"   • k=10 often considered the 'sweet spot'")
print(f"   • Computational cost: k={k_values[0]} needs {k_values[0]} fits, LOOCV needs {n_small} fits")

5.4 Cross-Validation for Classification

Adaptation for Classification

Instead of MSE, use:

  • Misclassification rate: \(\text{Err}_i = \frac{1}{n_i} \sum_{j \in C_i} I(y_j \neq \hat{y}_j)\)

  • Accuracy: 1 - Error rate

  • Other metrics: AUC, F1, etc.

k-Fold CV for Classification

\[\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^k \text{Err}_i\]

Stratified k-Fold

  • Ensure each fold has same class proportions as original data

  • Important for imbalanced datasets

  • Recommended for classification
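A small sketch of why stratification matters, using synthetic imbalanced labels (10% positives): plain KFold can produce folds with varying class proportions, while StratifiedKFold keeps them fixed.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced labels: 90 negatives, 10 positives (10% positive overall)
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 2))  # feature values are irrelevant to the fold assignment

kf_fracs = [y[test].mean() for _, test in
            KFold(n_splits=5, shuffle=True, random_state=0).split(X, y)]
skf_fracs = [y[test].mean() for _, test in
             StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y)]

print("KFold positive fraction per fold:          ", kf_fracs)
print("StratifiedKFold positive fraction per fold:", skf_fracs)  # 0.10 in every fold
```

With only 2 positives expected per fold of 20, an unlucky plain-KFold split can leave a fold with no positives at all, which is exactly what stratification prevents.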

# Generate classification data
np.random.seed(42)
n_class = 200

# Binary classification
X_class = np.random.randn(n_class, 2)
# Decision boundary: x1 + x2 > 0
y_class = (X_class[:, 0] + X_class[:, 1] + np.random.randn(n_class)*0.5 > 0).astype(int)

print("📊 Cross-Validation for Classification\n")
print(f"Dataset: n = {n_class}, 2 features, binary classification")
print(f"Class balance: {np.mean(y_class):.1%} positive, {1-np.mean(y_class):.1%} negative\n")

# Compare models using CV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

models_clf = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree (depth=3)': DecisionTreeClassifier(max_depth=3, random_state=42),
    'Decision Tree (depth=10)': DecisionTreeClassifier(max_depth=10, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'SVM (linear)': SVC(kernel='linear'),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5)
}

cv_results_clf = {}

for name, model in models_clf.items():
    # 10-fold CV
    scores = cross_val_score(model, X_class, y_class, cv=10, scoring='accuracy')
    cv_results_clf[name] = {
        'mean': scores.mean(),
        'std': scores.std(),
        'scores': scores
    }

# Print results
print("10-Fold CV Results:\n")
print(f"{'Model':<30} {'Mean Accuracy':<18} {'Std Dev':<12} {'Min':<10} {'Max'}")
print("="*90)
for name, results in sorted(cv_results_clf.items(), key=lambda x: x[1]['mean'], reverse=True):
    print(f"{name:<30} {results['mean']:>15.4f}    {results['std']:>10.4f}   "
          f"{results['scores'].min():>8.4f}   {results['scores'].max():>8.4f}")

# Visualize CV scores
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot with error bars
names = list(cv_results_clf.keys())
means = [cv_results_clf[n]['mean'] for n in names]
stds = [cv_results_clf[n]['std'] for n in names]

axes[0].barh(range(len(names)), means, xerr=stds, capsize=5, alpha=0.7)
axes[0].set_yticks(range(len(names)))
axes[0].set_yticklabels(names)
axes[0].set_xlabel('10-Fold CV Accuracy')
axes[0].set_title('Model Comparison Using Cross-Validation')
axes[0].grid(True, alpha=0.3, axis='x')
axes[0].axvline(x=0.5, color='r', linestyle='--', alpha=0.5, label='Random guess')
axes[0].legend()

# Box plot showing distribution
scores_list = [cv_results_clf[n]['scores'] for n in names]
bp = axes[1].boxplot(scores_list, labels=names, vert=False, patch_artist=True)
for patch in bp['boxes']:
    patch.set_facecolor('lightblue')
axes[1].set_xlabel('Accuracy')
axes[1].set_title('Distribution of CV Scores (10 folds)')
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
best_model = max(cv_results_clf.items(), key=lambda x: x[1]['mean'])[0]
print(f"   • Best model: {best_model}")
print(f"   • Error bars show variability across folds")
print(f"   • Small std → consistent performance")
print(f"   • Large std → performance depends on data split")
print(f"   • Use CV to select model architecture/hyperparameters")

5.5 The Bootstrap

The Concept

Bootstrap = sampling with replacement from the original dataset

The Method

  1. Draw n observations with replacement from the original data → bootstrap sample B*₁

  2. Compute the statistic of interest on B*₁ → α̂*₁

  3. Repeat B times → α̂*₁, α̂*₂, …, α̂*_B

  4. Estimate standard error: \(\text{SE}_B(\hat{\alpha}) = \sqrt{\frac{1}{B-1} \sum_{r=1}^B (\hat{\alpha}^{*r} - \bar{\alpha}^*)^2}\)

Key Properties

  • Each bootstrap sample has ~63% unique observations

    • P(observation selected) = 1 - (1 - 1/n)ⁿ ≈ 0.632

  • Observations can appear multiple times

  • ~37% of observations not selected (out-of-bag)
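The ~63% figure is easy to verify empirically. A minimal simulation (drawing index sets with replacement, names illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n, B = 1000, 2000

# For each bootstrap sample, the fraction of distinct original indices it contains
unique_fracs = [np.unique(rng.integers(0, n, size=n)).size / n for _ in range(B)]

print(f"Empirical fraction unique:   {np.mean(unique_fracs):.4f}")
print(f"Theoretical 1 - (1 - 1/n)^n: {1 - (1 - 1/n)**n:.4f}")  # ≈ 0.632
```

The complement (~37%) is the out-of-bag set used later for bagging and random forests.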

Uses

  1. Standard errors: Estimate SE of any statistic

  2. Confidence intervals: Percentile or normal-based

  3. Model assessment: Out-of-bag error estimation

  4. Ensemble methods: Bagging, Random Forests

Pros

  • ✅ Works for any statistic (median, ratio, etc.)

  • ✅ No distributional assumptions

  • ✅ Provides uncertainty quantification

  • ✅ Foundation for powerful ensemble methods

Cons

  • ❌ Computationally intensive (need B=1000+ samples)

  • ❌ Not ideal for test error estimation (biased)

  • ❌ Can fail for small samples

When to Use

  • Estimating standard errors of complex statistics

  • No analytical formula for SE available

  • Building confidence intervals

  • Ensemble learning (bagging, random forests)

# Bootstrap demonstration
print("📊 Bootstrap Resampling\n")

# Generate sample data
np.random.seed(42)
n_boot = 50
data_boot = np.random.exponential(scale=2, size=n_boot)  # Right-skewed distribution

print(f"Original sample: n = {n_boot}")
print(f"Sample mean: {data_boot.mean():.3f}")
print(f"Sample median: {np.median(data_boot):.3f}")
print(f"Sample std: {data_boot.std():.3f}\n")

# Bootstrap resampling
B = 1000  # Number of bootstrap samples
boot_means = []
boot_medians = []
boot_stds = []

for b in range(B):
    # Sample with replacement
    boot_sample = resample(data_boot, replace=True, n_samples=n_boot, random_state=b)
    boot_means.append(boot_sample.mean())
    boot_medians.append(np.median(boot_sample))
    boot_stds.append(boot_sample.std())

boot_means = np.array(boot_means)
boot_medians = np.array(boot_medians)
boot_stds = np.array(boot_stds)

# Bootstrap standard errors
se_mean = boot_means.std()
se_median = boot_medians.std()
se_std = boot_stds.std()

# 95% Confidence intervals (percentile method)
ci_mean = np.percentile(boot_means, [2.5, 97.5])
ci_median = np.percentile(boot_medians, [2.5, 97.5])

print(f"Bootstrap Results (B = {B} samples):\n")
print(f"{'Statistic':<15} {'Estimate':<12} {'Boot SE':<12} {'95% CI'}")
print("="*65)
print(f"{'Mean':<15} {data_boot.mean():>10.3f}   {se_mean:>10.3f}   [{ci_mean[0]:.3f}, {ci_mean[1]:.3f}]")
print(f"{'Median':<15} {np.median(data_boot):>10.3f}   {se_median:>10.3f}   [{ci_median[0]:.3f}, {ci_median[1]:.3f}]")
print(f"{'Std Dev':<15} {data_boot.std():>10.3f}   {se_std:>10.3f}   N/A")

# Visualize bootstrap distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Original data
axes[0, 0].hist(data_boot, bins=15, edgecolor='black', alpha=0.7)
axes[0, 0].axvline(data_boot.mean(), color='r', linestyle='--', linewidth=2, label=f'Mean = {data_boot.mean():.2f}')
axes[0, 0].axvline(np.median(data_boot), color='g', linestyle='--', linewidth=2, label=f'Median = {np.median(data_boot):.2f}')
axes[0, 0].set_xlabel('Value')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Original Data Distribution\n(Right-skewed)')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Bootstrap distribution of mean
axes[0, 1].hist(boot_means, bins=30, edgecolor='black', alpha=0.7, density=True)
axes[0, 1].axvline(data_boot.mean(), color='r', linestyle='-', linewidth=2, label='Sample mean')
axes[0, 1].axvline(ci_mean[0], color='orange', linestyle='--', linewidth=2, label='95% CI')
axes[0, 1].axvline(ci_mean[1], color='orange', linestyle='--', linewidth=2)
axes[0, 1].set_xlabel('Bootstrap Mean')
axes[0, 1].set_ylabel('Density')
axes[0, 1].set_title(f'Bootstrap Distribution of Mean\n(SE = {se_mean:.3f})')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Bootstrap distribution of median
axes[1, 0].hist(boot_medians, bins=30, edgecolor='black', alpha=0.7, density=True, color='green')
axes[1, 0].axvline(np.median(data_boot), color='darkgreen', linestyle='-', linewidth=2, label='Sample median')
axes[1, 0].axvline(ci_median[0], color='orange', linestyle='--', linewidth=2, label='95% CI')
axes[1, 0].axvline(ci_median[1], color='orange', linestyle='--', linewidth=2)
axes[1, 0].set_xlabel('Bootstrap Median')
axes[1, 0].set_ylabel('Density')
axes[1, 0].set_title(f'Bootstrap Distribution of Median\n(SE = {se_median:.3f})')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Convergence of SE estimates
B_values = range(50, B+1, 50)
se_convergence = [boot_means[:b].std() for b in B_values]
axes[1, 1].plot(B_values, se_convergence, 'o-', linewidth=2, markersize=6)
axes[1, 1].axhline(se_mean, color='r', linestyle='--', linewidth=2, label=f'Final SE = {se_mean:.3f}')
axes[1, 1].set_xlabel('Number of Bootstrap Samples (B)')
axes[1, 1].set_ylabel('Estimated SE of Mean')
axes[1, 1].set_title('Convergence of Bootstrap SE Estimate')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Observations:")
print(f"   • Bootstrap distributions are approximately normal (CLT)")
print(f"   • Even though original data is skewed, sampling distributions are symmetric")
print(f"   • SE estimates stabilize around B = 500-1000")
print(f"   • 95% CI: We're 95% confident true parameter is in this range")
print(f"   • Bootstrap SE for median: {se_median:.3f} (no analytical formula!)")
print(f"\n✅ Bootstrap is powerful for statistics with unknown SE formulas")

Key Takeaways

1. Method Comparison

Method            # Fits   Training %   Bias     Variance   Best For
Validation Set    1        50%          High     High       Large n, quick check
LOOCV             n        99%          Low      High       Small n, linear models
5-Fold CV         5        80%          Medium   Medium     General use
10-Fold CV        10       90%          Low      Medium     Industry standard
Bootstrap         B        63%          Medium   Medium     SE estimation

2. Choosing k in k-Fold CV

Small k (k=3,5):
  ✅ Faster computation
  ✅ Lower variance
  ❌ Higher bias
  → Use for large datasets or slow models

Large k (k=10,20):
  ✅ Lower bias
  ❌ Higher variance
  ❌ Slower
  → Use for accurate error estimate

LOOCV (k=n):
  ✅ Lowest bias
  ❌ Highest variance
  ❌ Very slow
  → Use for small datasets (n < 1000)

3. Common Practices

  • Model selection: Use CV to choose hyperparameters

  • Final evaluation: After selecting model, evaluate on held-out test set

  • Classification: Use stratified k-fold (preserves class proportions)

  • Time series: Use forward-chaining CV (respect temporal order)

  • Typical choice: 10-fold CV (well-studied, good bias-variance tradeoff)
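For the time-series case, scikit-learn's TimeSeriesSplit implements forward-chaining folds: each training set contains only observations that precede its validation fold. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

for fold, (train, test) in enumerate(TimeSeriesSplit(n_splits=3).split(X), start=1):
    # Every index in `train` precedes every index in `test`
    print(f"Fold {fold}: train={list(train)} test={list(test)}")
```

Unlike KFold, the folds here are never shuffled, so no future observation leaks into a training set.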

4. Bootstrap vs Cross-Validation

Cross-Validation:

  • Purpose: Estimate test error

  • Best for: Model selection

  • Each observation appears in the validation set exactly once

Bootstrap:

  • Purpose: Estimate uncertainty (SE, CI)

  • Best for: Standard errors, confidence intervals

  • Observations can appear multiple times

  • ~37% of data not used in each bootstrap sample

5. Practical Workflow

# Step 1: Hold out a final test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Step 2: Use CV on the training set for model selection
# (candidate_models is a list of unfitted estimators)
best_model, best_score = None, -np.inf
for model in candidate_models:
    cv_score = cross_val_score(model, X_train, y_train, cv=10).mean()
    if cv_score > best_score:
        best_model, best_score = model, cv_score

# Step 3: Refit the selected model on the full training set
best_model.fit(X_train, y_train)

# Step 4: Final evaluation on the held-out test set
test_score = best_model.score(X_test, y_test)

6. Common Pitfalls

  • ❌ Using test set for model selection (data leakage!)

  • ❌ Not stratifying in classification

  • ❌ Using LOOCV with very large n (too slow)

  • ❌ Comparing models on different CV splits

  • ❌ Using bootstrap for test error estimation (biased)

  • ❌ Too few bootstrap samples (B < 1000)

Next Chapter

Chapter 6: Linear Model Selection and Regularization

  • Subset Selection (best subset, forward, backward)

  • Shrinkage Methods (Ridge, Lasso)

  • Dimension Reduction (PCR, PLS)

Practice Exercises

Exercise 1: Conceptual Understanding

  1. Why does validation set approach have high variance?

  2. Why does LOOCV have lower bias than 5-fold CV?

  3. Why does LOOCV have higher variance than 5-fold CV?

  4. When would you prefer 5-fold over 10-fold CV?

Exercise 2: LOOCV Calculation

Given data: X = [1, 2, 3, 4, 5], Y = [2, 4, 5, 4, 5]

  1. Perform LOOCV for simple linear regression

  2. Calculate CV error

  3. Compare with training error

Exercise 3: k-Fold Implementation

Implement 5-fold CV from scratch:

  1. Divide data into 5 folds

  2. For each fold, train and evaluate

  3. Compute average error

  4. Compare with sklearn's cross_val_score

Exercise 4: Bootstrap for Confidence Interval

Given sample: [3.2, 5.1, 4.8, 6.2, 3.9, 4.5, 5.8, 4.2]

  1. Compute 95% CI for mean using bootstrap (B=1000)

  2. Compute 95% CI for median

  3. Compare bootstrap SE with analytical SE for mean

Exercise 5: Model Selection with CV

Generate polynomial data and:

  1. Use 10-fold CV to select optimal degree (1-10)

  2. Plot CV error vs degree

  3. Compare with validation set approach

  4. Which method is more stable?

Exercise 6: Classification CV

For a classification problem:

  1. Compare regular k-fold vs stratified k-fold

  2. Use imbalanced data (90% class 0, 10% class 1)

  3. Show why stratification matters

  4. Compute accuracy, precision, recall for each fold