Chapter 5: Resampling Methods
Overview
Resampling = repeatedly drawing samples from a training set and refitting a model
Two main purposes:
Model Assessment: Estimate test error rate
Model Selection: Choose optimal model complexity/parameters
Why Resample?
Traditional approach: Single train/test split
Estimates test error
But estimate has high variance (depends on random split)
Uses only portion of data for training
Resampling provides:
More reliable error estimates
Better model selection
Uncertainty quantification
Better use of available data
Key Methods
1. Cross-Validation (CV)
Validation Set Approach: Single holdout set (50/50 split)
Leave-One-Out CV (LOOCV): n iterations, hold out 1 observation
k-Fold CV: k iterations, hold out n/k observations
2. Bootstrap
Sample with replacement from original data
Estimate uncertainty (standard errors, confidence intervals)
Good for complex statistics
Comparison Table

| Method | Splits | Training Size | Computation | Variance | Bias |
|---|---|---|---|---|---|
| Validation Set | 1 | 50% | Fast | High | High |
| LOOCV | n | (n-1)/n ≈ 100% | Slow | High | Low |
| k-Fold (k=5,10) | k | (k-1)/k ≈ 80-90% | Moderate | Medium | Medium |
| Bootstrap | B | ~63% unique | Moderate | Medium | Medium |
5.1 Validation Set Approach
The Method
Randomly split data: 50% training, 50% validation
Fit model on training set
Evaluate on validation set
Test error estimate = validation error
Formula
\(\text{MSE}_{\text{val}} = \frac{1}{n_{\text{val}}} \sum_{i \in \text{val}} (y_i - \hat{y}_i)^2\)
Pros
✓ Simple and fast
✓ Easy to implement
✓ Good for very large datasets
Cons
✗ High variance (depends on random split)
✗ Wastes data (only 50% used for training)
✗ Validation error can vary significantly
✗ Tends to overestimate test error (smaller training set)
When to Use
Very large datasets (n > 10,000)
Initial quick assessment
Computational constraints
Proof of concept
# Imports used throughout this chapter
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     KFold, LeaveOneOut)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate polynomial regression data
np.random.seed(42)
n = 200
X = np.random.uniform(-3, 3, n)
true_function = lambda x: 2 + x - 0.5*x**2  # Quadratic
y = true_function(X) + np.random.normal(0, 1.5, n)
X_2d = X.reshape(-1, 1)

print("Validation Set Approach Demo")
print(f"\nTrue function: f(x) = 2 + x - 0.5x²")
print(f"Data: n = {n} observations")
# Try multiple random splits to show variability
n_splits = 10
degrees = range(1, 11)
validation_errors = {split_idx: [] for split_idx in range(n_splits)}

for split_idx in range(n_splits):
    # Different random split each time
    X_train, X_val, y_train, y_val = train_test_split(
        X_2d, y, test_size=0.5, random_state=split_idx)
    for degree in degrees:
        # Fit polynomial
        poly = PolynomialFeatures(degree=degree)
        X_train_poly = poly.fit_transform(X_train)
        X_val_poly = poly.transform(X_val)
        model = LinearRegression()
        model.fit(X_train_poly, y_train)
        y_pred = model.predict(X_val_poly)
        mse = mean_squared_error(y_val, y_pred)
        validation_errors[split_idx].append(mse)
# Visualize variability across splits
plt.figure(figsize=(14, 5))

# Individual splits
plt.subplot(1, 2, 1)
for split_idx in range(n_splits):
    plt.plot(degrees, validation_errors[split_idx], 'o-',
             alpha=0.4, linewidth=1, label=f'Split {split_idx+1}' if split_idx < 3 else '')
plt.axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Validation MSE')
plt.title('Validation Set Approach: High Variability\n(10 different random 50/50 splits)')
plt.legend()
plt.grid(True, alpha=0.3)

# Average across splits with error bars
plt.subplot(1, 2, 2)
avg_errors = [np.mean([validation_errors[s][i] for s in range(n_splits)])
              for i in range(len(degrees))]
std_errors = [np.std([validation_errors[s][i] for s in range(n_splits)])
              for i in range(len(degrees))]
plt.errorbar(degrees, avg_errors, yerr=std_errors, fmt='o-',
             capsize=5, linewidth=2, markersize=8, color='blue')
plt.axvline(x=2, color='g', linestyle='--', linewidth=2, label='True degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Average Validation MSE')
plt.title('Mean Validation Error ± Std Dev\n(High variability shows instability)')
plt.grid(True, alpha=0.3)
plt.legend()

plt.tight_layout()
plt.show()
# Detailed statistics
print("\nValidation Errors by Polynomial Degree\n")
print(f"{'Degree':<8} {'Mean MSE':<12} {'Std MSE':<12} {'Min MSE':<12} {'Max MSE':<12} {'Range'}")
print("="*80)
for i, deg in enumerate(degrees):
    deg_errors = [validation_errors[s][i] for s in range(n_splits)]
    mean_mse = np.mean(deg_errors)
    std_mse = np.std(deg_errors)
    min_mse = np.min(deg_errors)
    max_mse = np.max(deg_errors)
    range_mse = max_mse - min_mse
    print(f"{deg:<8} {mean_mse:>10.3f} {std_mse:>10.3f} {min_mse:>10.3f} {max_mse:>10.3f} {range_mse:>10.3f}")
print("\nKey Observations:")
print(" • Different random splits suggest different 'best' models")
print(" • High standard deviation shows instability")
print(" • Some splits suggest degree 1, others degree 3-5")
print(" • True degree = 2, but hard to identify reliably")
print(" • Large range shows sensitivity to random split")
print("\nWarning: Can't trust a single validation set!")
5.2 Leave-One-Out Cross-Validation (LOOCV)
The Method
For each observation i = 1, …, n:
Fit model on all observations except i
Predict for observation i
Calculate error: \((y_i - \hat{y}_i)^2\)
Repeat for all n observations
LOOCV Estimate
\(\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\)
Pros
✓ Less bias than validation set (uses n-1 observations)
✓ No randomness (deterministic)
✓ Almost unbiased estimate of test error
✓ Uses maximum training data
Cons
✗ Computationally expensive (n model fits)
✗ Impractical for large n
✗ Training sets highly overlapping → high variance of the error estimate
Special Case: Linear Models
For linear/polynomial regression, LOOCV has a shortcut:
\(\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2\)
where \(h_i\) is the leverage of observation i (the i-th diagonal entry of the hat matrix). The model only needs to be fit once!
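The shortcut can be checked numerically. A minimal sketch, assuming simple linear regression with an intercept (variable names are illustrative): the single-fit leverage formula reproduces brute-force LOOCV exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 30)
X = np.column_stack([np.ones_like(x), x])   # design matrix: intercept + slope
y = 2 + x + rng.normal(0, 1, 30)

# Single OLS fit, then leverages from the hat matrix H = X (X'X)^{-1} X'
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

cv_shortcut = np.mean(((y - y_hat) / (1 - h)) ** 2)

# Brute-force LOOCV for comparison: n separate fits
errs = []
for i in range(len(y)):
    mask = np.arange(len(y)) != i
    b = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ b) ** 2)
cv_brute = np.mean(errs)

print(cv_shortcut, cv_brute)  # the two estimates agree
```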
When to Use
Small datasets (n < 1000)
Need deterministic results
Linear models (can use shortcut)
Want nearly unbiased estimate
# LOOCV demonstration
print("Leave-One-Out Cross-Validation (LOOCV)\n")
# Use smaller dataset for LOOCV (computational reasons)
n_small = 100
X_small = np.random.uniform(-3, 3, n_small)
y_small = true_function(X_small) + np.random.normal(0, 1.5, n_small)
X_small_2d = X_small.reshape(-1, 1)
print(f"Dataset: n = {n_small} observations")
print(f"Method: Fit {n_small} models (each with {n_small-1} training points)\n")
# LOOCV for different polynomial degrees (manual loop)
loocv = LeaveOneOut()
loocv_errors = []
for degree in degrees:
    mse_scores = []
    for train_idx, test_idx in loocv.split(X_small_2d):
        X_train, X_test = X_small_2d[train_idx], X_small_2d[test_idx]
        y_train, y_test = y_small[train_idx], y_small[test_idx]
        # Fit polynomial
        poly = PolynomialFeatures(degree=degree)
        X_train_poly = poly.fit_transform(X_train)
        X_test_poly = poly.transform(X_test)
        model = LinearRegression()
        model.fit(X_train_poly, y_train)
        y_pred = model.predict(X_test_poly)
        mse_scores.append((y_test[0] - y_pred[0])**2)
    loocv_errors.append(np.mean(mse_scores))
# Using sklearn's built-in cross_val_score (faster)
loocv_errors_sklearn = []
for degree in degrees:
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X_small_2d)
    model = LinearRegression()
    # negative MSE (sklearn convention)
    scores = cross_val_score(model, X_poly, y_small,
                             cv=LeaveOneOut(), scoring='neg_mean_squared_error')
    loocv_errors_sklearn.append(-scores.mean())

# Compare with validation set approach on the same small dataset
val_errors_small = []
X_train_vs, X_val_vs, y_train_vs, y_val_vs = train_test_split(
    X_small_2d, y_small, test_size=0.5, random_state=42)
for degree in degrees:
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train_vs)
    X_val_poly = poly.transform(X_val_vs)
    model = LinearRegression()
    model.fit(X_train_poly, y_train_vs)
    y_pred = model.predict(X_val_poly)
    val_errors_small.append(mean_squared_error(y_val_vs, y_pred))
# Visualize comparison
plt.figure(figsize=(14, 5))
plt.subplot(1, 2, 1)
plt.plot(degrees, val_errors_small, 'o-', linewidth=2, markersize=8,
label='Validation Set (50/50)', color='orange')
plt.plot(degrees, loocv_errors_sklearn, 's-', linewidth=2, markersize=8,
label='LOOCV (deterministic)', color='blue')
plt.axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Estimated Test MSE')
plt.title('LOOCV vs Validation Set Approach')
plt.legend()
plt.grid(True, alpha=0.3)
# Show LOOCV stability
plt.subplot(1, 2, 2)
plt.plot(degrees, loocv_errors_sklearn, 'o-', linewidth=2, markersize=8, color='blue')
plt.axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
best_degree = degrees[np.argmin(loocv_errors_sklearn)]
plt.axvline(x=best_degree, color='r', linestyle='--', linewidth=2,
label=f'LOOCV selects degree {best_degree}')
plt.xlabel('Polynomial Degree')
plt.ylabel('LOOCV MSE')
plt.title('LOOCV Error Curve\n(Smooth, deterministic, no randomness)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Print results
print("\nError Comparison\n")
print(f"{'Degree':<8} {'Validation Set':<18} {'LOOCV':<18} {'Difference'}")
print("="*65)
for i, deg in enumerate(degrees):
    diff = abs(val_errors_small[i] - loocv_errors_sklearn[i])
    marker = '✓' if deg == 2 else ''
    print(f"{deg:<8} {val_errors_small[i]:>15.3f} {loocv_errors_sklearn[i]:>15.3f} {diff:>10.3f} {marker}")

print(f"\nAdvantages of LOOCV:")
print(f" • No randomness: same result every time")
print(f" • Smooth error curve (easier to identify minimum)")
print(f" • Less bias: uses {n_small-1}/{n_small} = {(n_small-1)/n_small:.1%} of data for training")
print(f" • Selected degree {best_degree} (closer to true degree 2)")
print(f"\nDrawback: required {n_small} fits per degree = {n_small * len(degrees)} model fits in total!")
5.3 k-Fold Cross-Validation
The Method
Randomly divide data into k equal-sized folds
For each fold i = 1, …, k:
Use fold i as validation set
Use remaining k-1 folds as training set
Compute error on fold i: \(\text{MSE}_i\)
Average the k errors
k-Fold CV Estimate
\(\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^k \text{MSE}_i\)
Common Choices
k = 5: Good balance, common default
k = 10: Most popular, well-studied
k = n: Equivalent to LOOCV
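The claim that k = n reduces to LOOCV is easy to verify with scikit-learn; a small sketch on synthetic data (names are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, (30, 1))
y = 2 + X[:, 0] + rng.normal(0, 1, 30)

model = LinearRegression()
# k-fold with k = n (no shuffling): each fold is a single observation
mse_kfold_n = -cross_val_score(model, X, y, cv=KFold(n_splits=len(y)),
                               scoring='neg_mean_squared_error').mean()
mse_loocv = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring='neg_mean_squared_error').mean()
print(mse_kfold_n, mse_loocv)  # the two estimates coincide
```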
Bias-Variance Tradeoff
Small k (e.g., k=5):
More bias (smaller training sets)
Less variance (less overlap between training sets)
Faster computation
Large k (e.g., k=10, k=n):
Less bias (larger training sets)
More variance (more overlap between training sets)
Slower computation
Pros
✓ Better bias-variance tradeoff than validation set or LOOCV
✓ Computationally efficient (only k fits)
✓ More stable than LOOCV
✓ Works for any model type
✓ Industry standard
Cons
✗ Some randomness (depends on fold assignment)
✗ More complex than validation set
When to Use
Almost always! Default choice for most problems
Any dataset size (n > 100)
Model selection
Hyperparameter tuning
# k-Fold CV demonstration
print("k-Fold Cross-Validation\n")

# Compare different values of k
k_values = [3, 5, 10, 20, n_small]  # Last one is LOOCV
cv_results = {}
for k in k_values:
    cv_errors = []
    for degree in degrees:
        poly = PolynomialFeatures(degree=degree)
        X_poly = poly.fit_transform(X_small_2d)
        model = LinearRegression()
        if k == n_small:
            cv = LeaveOneOut()
        else:
            cv = KFold(n_splits=k, shuffle=True, random_state=42)
        scores = cross_val_score(model, X_poly, y_small,
                                 cv=cv, scoring='neg_mean_squared_error')
        cv_errors.append(-scores.mean())
    cv_results[k] = cv_errors
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

# Plot the first three k values individually (fourth panel is the comparison)
for idx, k in enumerate(k_values[:3]):
    axes[idx].plot(degrees, cv_results[k], 'o-', linewidth=2, markersize=8)
    axes[idx].axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
    best_deg = degrees[np.argmin(cv_results[k])]
    axes[idx].axvline(x=best_deg, color='r', linestyle='--', linewidth=2,
                      label=f'Selected: degree {best_deg}')
    axes[idx].set_xlabel('Polynomial Degree')
    axes[idx].set_ylabel('CV Error')
    axes[idx].set_title(f'{k}-Fold CV\n(Train on {(k-1)/k:.1%} of data each fold)')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

# Comparison plot in the fourth panel
for k, errors in cv_results.items():
    label = f'LOOCV (k={k})' if k == n_small else f'{k}-Fold CV'
    axes[3].plot(degrees, errors, 'o-', linewidth=2, markersize=6, label=label)
axes[3].axvline(x=2, color='g', linestyle='--', linewidth=2, alpha=0.7, label='True degree')
axes[3].set_xlabel('Polynomial Degree')
axes[3].set_ylabel('CV Error')
axes[3].set_title('Comparison of Different k Values')
axes[3].legend()
axes[3].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
# Print detailed comparison
print("\nk-Fold CV Results\n")
header = f"{'Degree':<8}"
for k in k_values:
    k_label = f'k={k}' if k != n_small else 'LOOCV'
    header += f" {k_label:<12}"
print(header)
print("="*80)
for i, deg in enumerate(degrees):
    row = f"{deg:<8}"
    for k in k_values:
        row += f" {cv_results[k][i]:>10.3f} "
    marker = '✓' if deg == 2 else ''
    print(row + marker)

print("\nSelected Model by Each Method:\n")
for k in k_values:
    best_degree = degrees[np.argmin(cv_results[k])]
    k_label = f'{k}-Fold' if k != n_small else 'LOOCV'
    min_error = min(cv_results[k])
    print(f"{k_label:<15} -> Degree {best_degree} (MSE = {min_error:.3f})")

print("\nKey Insights:")
print(f" • All methods select a similar optimal degree")
print(f" • 5-fold and 10-fold CV are most common in practice")
print(f" • k=10 is often considered the 'sweet spot'")
print(f" • Computational cost: k={k_values[0]} needs {k_values[0]} fits per model, LOOCV needs {n_small}")
5.4 Cross-Validation for Classification
Adaptation for Classification
Instead of MSE, use:
Misclassification rate: \(\text{Err}_i = \frac{1}{n_i} \sum_{j \in C_i} I(y_j \neq \hat{y}_j)\)
Accuracy: 1 - error rate
Other metrics: AUC, F1, etc.
k-Fold CV for Classification
Stratified k-Fold
Ensure each fold has the same class proportions as the original data
Important for imbalanced datasets
Recommended for classification
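A quick sketch of why stratification matters, using hypothetical 90/10 imbalanced labels: with StratifiedKFold every validation fold keeps the original positive rate, whereas a plain shuffle can leave a fold with too few positives.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced dataset: 90% class 0, 10% class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # feature values don't affect the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # every validation fold preserves the original 10% positive rate
    print(f"fold {fold}: positive rate in validation = {y[val_idx].mean():.0%}")
```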
# Generate classification data
np.random.seed(42)
n_class = 200

# Binary classification with two features
X_class = np.random.randn(n_class, 2)
# Decision boundary: x1 + x2 > 0 (plus noise)
y_class = (X_class[:, 0] + X_class[:, 1] + np.random.randn(n_class)*0.5 > 0).astype(int)

print("Cross-Validation for Classification\n")
print(f"Dataset: n = {n_class}, 2 features, binary classification")
print(f"Class balance: {np.mean(y_class):.1%} positive, {1-np.mean(y_class):.1%} negative\n")

# Compare models using CV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

models_clf = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree (depth=3)': DecisionTreeClassifier(max_depth=3, random_state=42),
    'Decision Tree (depth=10)': DecisionTreeClassifier(max_depth=10, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'SVM (linear)': SVC(kernel='linear'),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5)
}
cv_results_clf = {}
for name, model in models_clf.items():
    # 10-fold CV
    scores = cross_val_score(model, X_class, y_class, cv=10, scoring='accuracy')
    cv_results_clf[name] = {
        'mean': scores.mean(),
        'std': scores.std(),
        'scores': scores
    }

# Print results
print("10-Fold CV Results:\n")
print(f"{'Model':<30} {'Mean Accuracy':<18} {'Std Dev':<12} {'Min':<10} {'Max'}")
print("="*90)
for name, results in sorted(cv_results_clf.items(), key=lambda x: x[1]['mean'], reverse=True):
    print(f"{name:<30} {results['mean']:>15.4f} {results['std']:>10.4f} "
          f"{results['scores'].min():>8.4f} {results['scores'].max():>8.4f}")
# Visualize CV scores
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Bar plot with error bars
names = list(cv_results_clf.keys())
means = [cv_results_clf[n]['mean'] for n in names]
stds = [cv_results_clf[n]['std'] for n in names]
axes[0].barh(range(len(names)), means, xerr=stds, capsize=5, alpha=0.7)
axes[0].set_yticks(range(len(names)))
axes[0].set_yticklabels(names)
axes[0].set_xlabel('10-Fold CV Accuracy')
axes[0].set_title('Model Comparison Using Cross-Validation')
axes[0].grid(True, alpha=0.3, axis='x')
axes[0].axvline(x=0.5, color='r', linestyle='--', alpha=0.5, label='Random guess')
axes[0].legend()
# Box plot showing distribution
scores_list = [cv_results_clf[n]['scores'] for n in names]
bp = axes[1].boxplot(scores_list, labels=names, vert=False, patch_artist=True)
for patch in bp['boxes']:
    patch.set_facecolor('lightblue')
axes[1].set_xlabel('Accuracy')
axes[1].set_title('Distribution of CV Scores (10 folds)')
axes[1].grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()
print("\nInterpretation:")
best_model = max(cv_results_clf.items(), key=lambda x: x[1]['mean'])[0]
print(f" • Best model: {best_model}")
print(f" • Error bars show variability across folds")
print(f" • Small std -> consistent performance")
print(f" • Large std -> performance depends on data split")
print(f" • Use CV to select model architecture/hyperparameters")
5.5 The Bootstrap
The Concept
Bootstrap = sampling with replacement from the original dataset
The Method
Draw n observations with replacement from the original data → bootstrap sample \(Z^{*1}\)
Compute the statistic of interest on \(Z^{*1}\) → \(\hat{\alpha}^{*1}\)
Repeat B times → \(\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, \dots, \hat{\alpha}^{*B}\)
Estimate the standard error: \(\text{SE}_B(\hat{\alpha}) = \sqrt{\frac{1}{B-1} \sum_{r=1}^B (\hat{\alpha}^{*r} - \bar{\alpha}^*)^2}\)
Key Properties
Each bootstrap sample contains ~63% of the unique observations
P(observation appears in a bootstrap sample) = \(1 - (1 - 1/n)^n \approx 0.632\)
Observations can appear multiple times
~37% of observations are not selected (out-of-bag)
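The ~63% figure follows from P(selected) = 1 - (1 - 1/n)^n and can be checked with a short simulation (a sketch; the sample size and repeat count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Fraction of distinct original observations in each bootstrap sample
fracs = [len(np.unique(rng.integers(0, n, n))) / n for _ in range(2000)]

print(f"simulated mean unique fraction: {np.mean(fracs):.3f}")
print(f"theoretical value 1-(1-1/n)^n:  {1 - (1 - 1/n)**n:.3f}")
```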
Uses
Standard errors: Estimate SE of any statistic
Confidence intervals: Percentile or normal-based
Model assessment: Out-of-bag error estimation
Ensemble methods: Bagging, Random Forests
Pros
✓ Works for any statistic (median, ratio, etc.)
✓ No distributional assumptions
✓ Provides uncertainty quantification
✓ Foundation for powerful ensemble methods
Cons
✗ Computationally intensive (need B = 1000+ samples)
✗ Not ideal for test error estimation (biased)
✗ Can fail for small samples
When to Use
Estimating standard errors of complex statistics
No analytical formula for SE available
Building confidence intervals
Ensemble learning (bagging, random forests)
# Bootstrap demonstration
from sklearn.utils import resample  # sampling with replacement

print("Bootstrap Resampling\n")

# Generate sample data
np.random.seed(42)
n_boot = 50
data_boot = np.random.exponential(scale=2, size=n_boot)  # Right-skewed distribution
print(f"Original sample: n = {n_boot}")
print(f"Sample mean: {data_boot.mean():.3f}")
print(f"Sample median: {np.median(data_boot):.3f}")
print(f"Sample std: {data_boot.std():.3f}\n")
# Bootstrap resampling
B = 1000  # Number of bootstrap samples
boot_means = []
boot_medians = []
boot_stds = []
for b in range(B):
    # Sample with replacement
    boot_sample = resample(data_boot, replace=True, n_samples=n_boot, random_state=b)
    boot_means.append(boot_sample.mean())
    boot_medians.append(np.median(boot_sample))
    boot_stds.append(boot_sample.std())

boot_means = np.array(boot_means)
boot_medians = np.array(boot_medians)
boot_stds = np.array(boot_stds)
# Bootstrap standard errors
se_mean = boot_means.std()
se_median = boot_medians.std()
se_std = boot_stds.std()
# 95% Confidence intervals (percentile method)
ci_mean = np.percentile(boot_means, [2.5, 97.5])
ci_median = np.percentile(boot_medians, [2.5, 97.5])
print(f"Bootstrap Results (B = {B} samples):\n")
print(f"{'Statistic':<15} {'Estimate':<12} {'Boot SE':<12} {'95% CI'}")
print("="*65)
print(f"{'Mean':<15} {data_boot.mean():>10.3f} {se_mean:>10.3f} [{ci_mean[0]:.3f}, {ci_mean[1]:.3f}]")
print(f"{'Median':<15} {np.median(data_boot):>10.3f} {se_median:>10.3f} [{ci_median[0]:.3f}, {ci_median[1]:.3f}]")
print(f"{'Std Dev':<15} {data_boot.std():>10.3f} {se_std:>10.3f} N/A")
# Visualize bootstrap distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Original data
axes[0, 0].hist(data_boot, bins=15, edgecolor='black', alpha=0.7)
axes[0, 0].axvline(data_boot.mean(), color='r', linestyle='--', linewidth=2, label=f'Mean = {data_boot.mean():.2f}')
axes[0, 0].axvline(np.median(data_boot), color='g', linestyle='--', linewidth=2, label=f'Median = {np.median(data_boot):.2f}')
axes[0, 0].set_xlabel('Value')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Original Data Distribution\n(Right-skewed)')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Bootstrap distribution of mean
axes[0, 1].hist(boot_means, bins=30, edgecolor='black', alpha=0.7, density=True)
axes[0, 1].axvline(data_boot.mean(), color='r', linestyle='-', linewidth=2, label='Sample mean')
axes[0, 1].axvline(ci_mean[0], color='orange', linestyle='--', linewidth=2, label='95% CI')
axes[0, 1].axvline(ci_mean[1], color='orange', linestyle='--', linewidth=2)
axes[0, 1].set_xlabel('Bootstrap Mean')
axes[0, 1].set_ylabel('Density')
axes[0, 1].set_title(f'Bootstrap Distribution of Mean\n(SE = {se_mean:.3f})')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Bootstrap distribution of median
axes[1, 0].hist(boot_medians, bins=30, edgecolor='black', alpha=0.7, density=True, color='green')
axes[1, 0].axvline(np.median(data_boot), color='darkgreen', linestyle='-', linewidth=2, label='Sample median')
axes[1, 0].axvline(ci_median[0], color='orange', linestyle='--', linewidth=2, label='95% CI')
axes[1, 0].axvline(ci_median[1], color='orange', linestyle='--', linewidth=2)
axes[1, 0].set_xlabel('Bootstrap Median')
axes[1, 0].set_ylabel('Density')
axes[1, 0].set_title(f'Bootstrap Distribution of Median\n(SE = {se_median:.3f})')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# Convergence of SE estimates
B_values = range(50, B+1, 50)
se_convergence = [boot_means[:b].std() for b in B_values]
axes[1, 1].plot(B_values, se_convergence, 'o-', linewidth=2, markersize=6)
axes[1, 1].axhline(se_mean, color='r', linestyle='--', linewidth=2, label=f'Final SE = {se_mean:.3f}')
axes[1, 1].set_xlabel('Number of Bootstrap Samples (B)')
axes[1, 1].set_ylabel('Estimated SE of Mean')
axes[1, 1].set_title('Convergence of Bootstrap SE Estimate')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nKey Observations:")
print(f" • Bootstrap distributions are approximately normal (CLT)")
print(f" • Even though the original data is skewed, the sampling distributions are symmetric")
print(f" • SE estimates stabilize around B = 500-1000")
print(f" • 95% CI: we are 95% confident the true parameter is in this range")
print(f" • Bootstrap SE for the median: {se_median:.3f} (no simple analytical formula!)")
print("\nBootstrap is powerful for statistics with unknown SE formulas")
Key Takeaways
1. Method Comparison

| Method | # Fits | Training % | Bias | Variance | Best For |
|---|---|---|---|---|---|
| Validation Set | 1 | 50% | High | High | Large n, quick check |
| LOOCV | n | 99% | Low | High | Small n, linear models |
| 5-Fold CV | 5 | 80% | Medium | Medium | General use |
| 10-Fold CV | 10 | 90% | Low | Medium | Industry standard |
| Bootstrap | B | 63% | Medium | Medium | SE estimation |
2. Choosing k in k-Fold CV
Small k (k=3,5):
✓ Faster computation
✓ Lower variance
✗ Higher bias
→ Use for large datasets or slow models
Large k (k=10,20):
✓ Lower bias
✗ Higher variance
✗ Slower
→ Use for an accurate error estimate
LOOCV (k=n):
✓ Lowest bias
✗ Highest variance
✗ Very slow
→ Use for small datasets (n < 1000)
3. Common Practices
Model selection: Use CV to choose hyperparameters
Final evaluation: After selecting model, evaluate on held-out test set
Classification: Use stratified k-fold (preserves class proportions)
Time series: Use forward-chaining CV (respect temporal order)
Typical choice: 10-fold CV (well-studied, good bias-variance tradeoff)
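For the time-series case above, scikit-learn's TimeSeriesSplit implements forward-chaining: each split trains only on observations that precede the validation block. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # the validation block always comes strictly after the training block
    print("train:", train_idx.tolist(), "validate:", test_idx.tolist())
```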
4. Bootstrap vs Cross-Validation
Cross-Validation:
Purpose: Estimate test error
Best for: Model selection
Each observation appears in exactly one validation fold
Bootstrap:
Purpose: Estimate uncertainty (SE, CI)
Best for: Standard errors, confidence intervals
Observations can appear multiple times in a sample
~37% of the data is left out of each bootstrap sample
5. Practical Workflow
# Step 1: Hold out a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Step 2: Use CV on the training set for model selection
cv_scores = {}
for name, model in candidate_models.items():
    cv_scores[name] = cross_val_score(model, X_train, y_train, cv=10).mean()

# Step 3: Refit the best model on the full training set
best_model = candidate_models[max(cv_scores, key=cv_scores.get)]
best_model.fit(X_train, y_train)

# Step 4: Final evaluation on the untouched test set
test_score = best_model.score(X_test, y_test)
6. Common Pitfalls
✗ Using the test set for model selection (data leakage!)
✗ Not stratifying in classification
✗ Using LOOCV with very large n (too slow)
✗ Comparing models on different CV splits
✗ Using the bootstrap for test error estimation (biased)
✗ Too few bootstrap samples (B < 1000)
Next Chapter
Chapter 6: Linear Model Selection and Regularization
Subset Selection (best subset, forward, backward)
Shrinkage Methods (Ridge, Lasso)
Dimension Reduction (PCR, PLS)
Practice Exercises
Exercise 1: Conceptual Understanding
Why does validation set approach have high variance?
Why does LOOCV have lower bias than 5-fold CV?
Why does LOOCV have higher variance than 5-fold CV?
When would you prefer 5-fold over 10-fold CV?
Exercise 2: LOOCV Calculation
Given data: X = [1, 2, 3, 4, 5], Y = [2, 4, 5, 4, 5]
Perform LOOCV for simple linear regression
Calculate CV error
Compare with training error
Exercise 3: k-Fold Implementation
Implement 5-fold CV from scratch:
Divide data into 5 folds
For each fold, train and evaluate
Compute average error
Compare with sklearn's cross_val_score
Exercise 4: Bootstrap for Confidence Interval
Given sample: [3.2, 5.1, 4.8, 6.2, 3.9, 4.5, 5.8, 4.2]
Compute 95% CI for mean using bootstrap (B=1000)
Compute 95% CI for median
Compare bootstrap SE with analytical SE for mean
Exercise 5: Model Selection with CV
Generate polynomial data and:
Use 10-fold CV to select optimal degree (1-10)
Plot CV error vs degree
Compare with validation set approach
Which method is more stable?
Exercise 6: Classification CV
For a classification problem:
Compare regular k-fold vs stratified k-fold
Use imbalanced data (90% class 0, 10% class 1)
Show why stratification matters
Compute accuracy, precision, recall for each fold