Chapter 6: Linear Model Selection and Regularization

Overview

Problem: Standard linear regression can suffer from:

  1. Prediction accuracy: High variance when p ≈ n or p > n

  2. Model interpretability: Too many predictors make model hard to understand

Solutions: This chapter covers alternatives to least squares

Three Classes of Methods

1. Subset Selection

Identify a subset of the p predictors believed to be related to the response, then fit least squares on that subset

  • Best Subset: Try all 2ᵖ models

  • Stepwise Selection: Forward/Backward sequential addition/removal
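The stepwise idea can be sketched with scikit-learn's SequentialFeatureSelector, which greedily adds (or removes) one feature at a time based on cross-validated score. This is an illustrative sketch on synthetic data, not an example from the text:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
# Only the first two features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(100) * 0.5

# Forward stepwise: start empty, add the feature that most improves CV score
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2,
                                direction='forward', cv=5)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```

With `direction='backward'` the same class starts from all p features and drops them one at a time. Forward stepwise fits on the order of p² models instead of best subset's 2ᵖ.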

2. Shrinkage (Regularization)

Fit model with all p predictors, but shrink coefficients toward zero

  • Ridge Regression: L2 penalty, shrinks coefficients

  • Lasso: L1 penalty, shrinks coefficients AND performs variable selection

  • Elastic Net: Combination of L1 and L2

3. Dimension Reduction

Project p predictors into M-dimensional subspace (M < p)

  • Principal Components Regression (PCR)

  • Partial Least Squares (PLS)
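PCR can be sketched as a PCA-then-OLS pipeline: project the p predictors onto their first M principal components and regress on those. A minimal illustration on synthetic data (the dataset and M = 5 are assumptions for the sketch):

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

# PCR: project p = 20 predictors onto M = 5 components, then fit least squares
pcr = Pipeline([('pca', PCA(n_components=5)), ('ols', LinearRegression())])
pcr.fit(X, y)
print(pcr.score(X, y))  # training R² of the reduced-dimension model
```

In practice M is chosen by cross-validation, just like λ for the shrinkage methods below.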

Why Regularization?

Bias-Variance Tradeoff:

  • Least squares: Unbiased but can have high variance

  • Regularization: Introduces a small bias but reduces variance

  • Result: Often lower test MSE!

\[\text{Test MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]
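The decomposition above can be checked numerically with a toy shrinkage estimator (an illustrative sketch, not from the text): estimating a mean μ by c·x̄ with c < 1 adds bias but cuts variance, and for moderate c the total MSE drops below that of the unbiased estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 2.0, 5.0, 10
samples = rng.normal(mu, sigma, size=(100_000, n))
xbar = samples.mean(axis=1)          # unbiased estimator, Var = σ²/n = 2.5

for c in [1.0, 0.8, 0.6]:            # c = 1 is the unbiased (OLS-like) case
    est = c * xbar
    bias2 = (est.mean() - mu) ** 2   # squared bias grows as c shrinks
    var = est.var()                  # variance shrinks by factor c²
    mse = ((est - mu) ** 2).mean()   # ≈ bias² + variance
    print(f"c={c}: bias²={bias2:.3f}, var={var:.3f}, MSE={mse:.3f}")
```

Here c = 0.8 gives bias² ≈ 0.16 and variance ≈ 1.6, so its MSE beats the unbiased estimator's 2.5. Ridge and Lasso exploit exactly this tradeoff.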

6.1 Ridge Regression

The Problem with OLS

When predictors are correlated or p ≈ n:

  • Coefficient estimates have high variance

  • Small changes in data → large changes in coefficients

  • Poor test set performance

Ridge Solution

Minimize the modified loss function:

\[\text{RSS} + \lambda \sum_{j=1}^p \beta_j^2 = \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^p \beta_j^2\]

λ (lambda) = tuning parameter

  • λ = 0: Ridge = OLS

  • λ → ∞: All βⱼ → 0

  • λ > 0: Shrinks coefficients toward zero

Key Properties

  • Never sets coefficients exactly to zero (all predictors kept)

  • ✅ Reduces variance at cost of small bias

  • ✅ Works well when all predictors are somewhat useful

  • ✅ Handles multicollinearity well

  • ⚠️ Must standardize predictors (different scales → different penalties)

L2 Penalty (Euclidean norm)

\[||\beta||_2^2 = \sum_{j=1}^p \beta_j^2\]

Choosing λ

  • Use cross-validation

  • Try many values: λ ∈ {0.01, 0.1, 1, 10, 100, …}

  • Select λ with minimum CV error

# Imports used throughout this chapter's demos
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

# Generate data with correlated predictors
np.random.seed(42)
n = 100
p = 20

# Create correlated features
X_base = np.random.randn(n, 5)
X_corr = np.zeros((n, p))
for i in range(p):
    # Each feature is combination of base features + noise
    weights = np.random.randn(5)
    X_corr[:, i] = X_base @ weights + np.random.randn(n) * 0.1

# True model: only first 5 features matter
true_coef = np.zeros(p)
true_coef[:5] = [3, -2, 1.5, -1, 0.5]
y = X_corr @ true_coef + np.random.randn(n) * 2

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_corr, y, test_size=0.3, random_state=42)

# Standardize (crucial for Ridge!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("📊 Ridge Regression Demo\n")
print(f"Data: n = {n}, p = {p}")
print(f"True model: Only first 5 features have non-zero coefficients")
print(f"Problem: High correlation between features\n")

# OLS (λ=0)
ols = LinearRegression()
ols.fit(X_train_scaled, y_train)
y_pred_ols_train = ols.predict(X_train_scaled)
y_pred_ols_test = ols.predict(X_test_scaled)
mse_ols_train = mean_squared_error(y_train, y_pred_ols_train)
mse_ols_test = mean_squared_error(y_test, y_pred_ols_test)

print("Ordinary Least Squares (λ=0):")
print(f"  Train MSE: {mse_ols_train:.3f}")
print(f"  Test MSE:  {mse_ols_test:.3f}")
print(f"  Overfitting gap: {mse_ols_test - mse_ols_train:.3f}\n")

# Ridge with different λ values
lambdas = np.logspace(-2, 4, 100)  # 0.01 to 10000
train_mse_ridge = []
test_mse_ridge = []
coef_paths = []

for lam in lambdas:
    ridge = Ridge(alpha=lam)  # sklearn uses 'alpha' for λ
    ridge.fit(X_train_scaled, y_train)
    
    train_mse_ridge.append(mean_squared_error(y_train, ridge.predict(X_train_scaled)))
    test_mse_ridge.append(mean_squared_error(y_test, ridge.predict(X_test_scaled)))
    coef_paths.append(ridge.coef_.copy())

coef_paths = np.array(coef_paths)
best_lambda_idx = np.argmin(test_mse_ridge)  # for illustration only; in practice select λ by CV, not the test set
best_lambda = lambdas[best_lambda_idx]

print(f"Best λ (from test set): {best_lambda:.2f}")
print(f"  Train MSE: {train_mse_ridge[best_lambda_idx]:.3f}")
print(f"  Test MSE:  {test_mse_ridge[best_lambda_idx]:.3f}")
print(f"  Improvement over OLS: {mse_ols_test - test_mse_ridge[best_lambda_idx]:.3f}")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Plot 1: Train vs Test MSE
axes[0].semilogx(lambdas, train_mse_ridge, label='Training MSE', linewidth=2)
axes[0].semilogx(lambdas, test_mse_ridge, label='Test MSE', linewidth=2)
axes[0].axvline(best_lambda, color='r', linestyle='--', linewidth=2, 
               label=f'Best λ = {best_lambda:.2f}')
axes[0].axhline(mse_ols_test, color='orange', linestyle='--', alpha=0.7, label='OLS Test MSE')
axes[0].set_xlabel('λ (Regularization Strength)')
axes[0].set_ylabel('MSE')
axes[0].set_title('Ridge Regression: MSE vs λ')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Coefficient paths
for i in range(p):
    if i < 5:
        axes[1].semilogx(lambdas, coef_paths[:, i], linewidth=2, label=f'β{i+1} (true≠0)')
    else:
        axes[1].semilogx(lambdas, coef_paths[:, i], linewidth=1, alpha=0.3, color='gray')

axes[1].axvline(best_lambda, color='r', linestyle='--', linewidth=2, alpha=0.7)
axes[1].axhline(0, color='k', linestyle='-', linewidth=0.5)
axes[1].set_xlabel('λ (Regularization Strength)')
axes[1].set_ylabel('Coefficient Value')
axes[1].set_title('Ridge Coefficient Paths\n(Shrink toward 0, never exactly 0)')
axes[1].legend(loc='upper right', fontsize=8)
axes[1].grid(True, alpha=0.3)

# Plot 3: Compare coefficients at best λ
ridge_best = Ridge(alpha=best_lambda)
ridge_best.fit(X_train_scaled, y_train)

x_pos = np.arange(p)
width = 0.35
axes[2].bar(x_pos - width/2, true_coef, width, label='True', alpha=0.7)
axes[2].bar(x_pos + width/2, ridge_best.coef_, width, label=f'Ridge (λ={best_lambda:.2f})', alpha=0.7)
axes[2].axhline(0, color='k', linestyle='-', linewidth=0.5)
axes[2].set_xlabel('Feature Index')
axes[2].set_ylabel('Coefficient Value')
axes[2].set_title(f'True vs Ridge Coefficients (λ={best_lambda:.2f})')
axes[2].legend()
axes[2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n💡 Key Observations:")
print("   • As λ increases, all coefficients shrink toward 0")
print("   • But coefficients never become exactly 0 (L2 penalty)")
print("   • Test MSE decreases then increases (bias-variance tradeoff)")
print("   • Ridge reduces overfitting (smaller gap between train/test)")
print("   • All 20 features retained (no variable selection)")

6.2 The Lasso

Limitation of Ridge

Ridge includes all p predictors in final model

  • Doesn’t perform variable selection

  • Less interpretable when p is large

Lasso Solution

Least Absolute Shrinkage and Selection Operator: minimize

\[\sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^p |\beta_j|\]

L1 Penalty (Manhattan norm)

\[||\beta||_1 = \sum_{j=1}^p |\beta_j|\]

Key Difference from Ridge

Lasso uses absolute value instead of squared values

  • Forces some coefficients to be exactly zero

  • Performs automatic variable selection

  • More interpretable (sparse models)

When to Use

  • Lasso: When you believe many features are irrelevant

    • Sparse true model

    • Want interpretability

    • Variable selection needed

  • Ridge: When you believe all features are somewhat useful

    • All predictors contribute

    • Multicollinearity present

    • Don’t need variable selection

Geometric Interpretation

Ridge constraint: \(\sum \beta_j^2 \leq s\) (circle/sphere)

Lasso constraint: \(\sum |\beta_j| \leq s\) (diamond/cross-polytope)

Lasso’s corners → sparse solutions!
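The "exact zero" behavior is easiest to see in the orthonormal-design case, where both estimators have closed forms: ridge shrinks each OLS coefficient proportionally, while lasso applies soft thresholding. A minimal sketch (the penalty scaling here follows the standard orthonormal-case formulas):

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    # Orthonormal design: uniform proportional shrinkage, never exactly zero
    return beta_ols / (1 + lam)

def lasso_shrink(beta_ols, lam):
    # Soft thresholding: coefficients smaller than the threshold become exactly 0
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

beta_ols = np.array([3.0, -2.0, 0.4, -0.1])
lam = 1.0
print(ridge_shrink(beta_ols, lam))  # all four still non-zero, just smaller
print(lasso_shrink(beta_ols, lam))  # the two small coefficients are exactly 0
```

This is the algebraic counterpart of the geometric picture: the diamond's corners sit on the axes, so the constrained optimum often lands where some βⱼ = 0.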

# Lasso demonstration (same data as Ridge)
print("📊 Lasso Regression\n")

# Lasso with different λ values
train_mse_lasso = []
test_mse_lasso = []
coef_paths_lasso = []
n_nonzero = []

for lam in lambdas:
    lasso = Lasso(alpha=lam, max_iter=10000)
    lasso.fit(X_train_scaled, y_train)
    
    train_mse_lasso.append(mean_squared_error(y_train, lasso.predict(X_train_scaled)))
    test_mse_lasso.append(mean_squared_error(y_test, lasso.predict(X_test_scaled)))
    coef_paths_lasso.append(lasso.coef_.copy())
    n_nonzero.append(np.sum(np.abs(lasso.coef_) > 1e-5))  # Count non-zero coefficients

coef_paths_lasso = np.array(coef_paths_lasso)
best_lambda_lasso_idx = np.argmin(test_mse_lasso)
best_lambda_lasso = lambdas[best_lambda_lasso_idx]

# Fit best Lasso model
lasso_best = Lasso(alpha=best_lambda_lasso, max_iter=10000)
lasso_best.fit(X_train_scaled, y_train)
n_selected = np.sum(np.abs(lasso_best.coef_) > 1e-5)

print(f"Best λ (from test set): {best_lambda_lasso:.2f}")
print(f"  Train MSE: {train_mse_lasso[best_lambda_lasso_idx]:.3f}")
print(f"  Test MSE:  {test_mse_lasso[best_lambda_lasso_idx]:.3f}")
print(f"  Features selected: {n_selected} / {p}")
print(f"  Improvement over OLS: {mse_ols_test - test_mse_lasso[best_lambda_lasso_idx]:.3f}\n")

# Compare Ridge vs Lasso
print("📊 Ridge vs Lasso Comparison:\n")
print(f"{'Method':<15} {'Test MSE':<12} {'# Features':<15} {'Interpretability'}")
print("="*65)
print(f"{'OLS':<15} {mse_ols_test:>10.3f}   {p:>12}     Low (all features)")
print(f"{'Ridge':<15} {test_mse_ridge[best_lambda_idx]:>10.3f}   {p:>12}     Low (all features)")
print(f"{'Lasso':<15} {test_mse_lasso[best_lambda_lasso_idx]:>10.3f}   {n_selected:>12}     High (sparse) ✅")

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Lasso coefficient paths
for i in range(p):
    if i < 5:
        axes[0, 0].semilogx(lambdas, coef_paths_lasso[:, i], linewidth=2, label=f'β{i+1} (true≠0)')
    else:
        axes[0, 0].semilogx(lambdas, coef_paths_lasso[:, i], linewidth=1, alpha=0.3, color='gray')

axes[0, 0].axvline(best_lambda_lasso, color='r', linestyle='--', linewidth=2, alpha=0.7)
axes[0, 0].axhline(0, color='k', linestyle='-', linewidth=0.5)
axes[0, 0].set_xlabel('λ (Regularization Strength)')
axes[0, 0].set_ylabel('Coefficient Value')
axes[0, 0].set_title('Lasso Coefficient Paths\n(Many coefficients → exactly 0)')
axes[0, 0].legend(loc='upper right', fontsize=8)
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Number of non-zero coefficients
axes[0, 1].semilogx(lambdas, n_nonzero, linewidth=2, color='purple')
axes[0, 1].axvline(best_lambda_lasso, color='r', linestyle='--', linewidth=2, 
                  label=f'Best λ: {n_selected} features')
axes[0, 1].set_xlabel('λ (Regularization Strength)')
axes[0, 1].set_ylabel('Number of Non-Zero Coefficients')
axes[0, 1].set_title('Lasso Variable Selection\n(Automatic feature selection)')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Ridge vs Lasso MSE
axes[1, 0].semilogx(lambdas, test_mse_ridge, label='Ridge', linewidth=2)
axes[1, 0].semilogx(lambdas, test_mse_lasso, label='Lasso', linewidth=2)
axes[1, 0].axhline(mse_ols_test, color='orange', linestyle='--', alpha=0.7, label='OLS')
axes[1, 0].axvline(best_lambda, color='blue', linestyle='--', alpha=0.5, label='Best Ridge λ')
axes[1, 0].axvline(best_lambda_lasso, color='red', linestyle='--', alpha=0.5, label='Best Lasso λ')
axes[1, 0].set_xlabel('λ (Regularization Strength)')
axes[1, 0].set_ylabel('Test MSE')
axes[1, 0].set_title('Ridge vs Lasso: Test MSE')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Coefficient comparison
x_pos = np.arange(p)
width = 0.25
axes[1, 1].bar(x_pos - width, true_coef, width, label='True', alpha=0.7)
axes[1, 1].bar(x_pos, ridge_best.coef_, width, label='Ridge', alpha=0.7)
axes[1, 1].bar(x_pos + width, lasso_best.coef_, width, label='Lasso', alpha=0.7)
axes[1, 1].axhline(0, color='k', linestyle='-', linewidth=0.5)
axes[1, 1].set_xlabel('Feature Index')
axes[1, 1].set_ylabel('Coefficient Value')
axes[1, 1].set_title('True vs Ridge vs Lasso Coefficients')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Show which features Lasso selected
print("\n🎯 Features Selected by Lasso:\n")
selected_features = np.where(np.abs(lasso_best.coef_) > 1e-5)[0]
print(f"{'Feature':<12} {'True Coef':<12} {'Lasso Coef':<15} {'Ridge Coef'}")
print("="*55)
for idx in selected_features:
    marker = '✅' if idx < 5 else ''
    print(f"Feature {idx:<5} {true_coef[idx]:>10.3f}   {lasso_best.coef_[idx]:>12.3f}   "
          f"{ridge_best.coef_[idx]:>12.3f}  {marker}")

print("\n💡 Key Insights:")
print("   • Lasso sets many coefficients to exactly 0 (sparse solution)")
print("   • Ridge shrinks all coefficients but keeps all features")
print("   • Lasso performs automatic variable selection")
print(f"   • Lasso selected {n_selected}/{p} features, including {sum(selected_features < 5)}/5 true features")
print("   • More interpretable model with similar performance")

6.3 Elastic Net

Combining Ridge and Lasso

Elastic Net uses both L1 and L2 penalties:

\[\text{Loss} = \text{RSS} + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2\]

Or equivalently:

\[\text{Loss} = \text{RSS} + \lambda \left[\alpha ||\beta||_1 + (1-\alpha) ||\beta||_2^2\right]\]

where α ∈ [0,1] controls the mix:

  • α = 1: Pure Lasso

  • α = 0: Pure Ridge

  • α ∈ (0,1): Elastic Net

Advantages

  • ✅ Variable selection like Lasso

  • ✅ Handles correlated predictors better than Lasso

  • ✅ Can select more than n features (Lasso limited to n)

  • ✅ More stable than Lasso when features are highly correlated

When to Use

  • Grouped variables (correlated features)

  • p >> n situations

  • Want variable selection + stability

  • Typical choice: α = 0.5 (equal mix)

# Elastic Net demonstration
print("📊 Elastic Net Regression\n")

# Try different α values (l1_ratio in sklearn)
alpha_values = [0.1, 0.5, 0.7, 0.9]  # Mix of L1/L2
lambda_en = best_lambda_lasso  # Use same λ for fair comparison

results_en = {}

for alpha_val in alpha_values:
    en = ElasticNet(alpha=lambda_en, l1_ratio=alpha_val, max_iter=10000)
    en.fit(X_train_scaled, y_train)
    
    test_mse_en = mean_squared_error(y_test, en.predict(X_test_scaled))
    n_selected_en = np.sum(np.abs(en.coef_) > 1e-5)  # separate name: don't clobber Lasso's n_selected
    
    results_en[alpha_val] = {
        'model': en,
        'test_mse': test_mse_en,
        'n_features': n_selected_en,
        'coef': en.coef_
    }

print(f"Elastic Net Results (λ = {lambda_en:.2f}):\n")
print(f"{'α (L1 ratio)':<15} {'Test MSE':<12} {'# Features':<15} {'Description'}")
print("="*70)
print(f"{'0.0 (Ridge)':<15} {test_mse_ridge[best_lambda_idx]:>10.3f}   {p:>12}     Pure L2")
for alpha_val in alpha_values:
    desc = f"{int(alpha_val*100)}% L1, {int((1-alpha_val)*100)}% L2"
    print(f"{alpha_val:<15.1f} {results_en[alpha_val]['test_mse']:>10.3f}   "
          f"{results_en[alpha_val]['n_features']:>12}     {desc}")
print(f"{'1.0 (Lasso)':<15} {test_mse_lasso[best_lambda_lasso_idx]:>10.3f}   {n_selected:>12}     Pure L1")

# Visualize Elastic Net with different α
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

for idx, alpha_val in enumerate(alpha_values):
    ax = axes[idx // 2, idx % 2]
    
    # Bar plot of coefficients
    x_pos = np.arange(p)
    ax.bar(x_pos, results_en[alpha_val]['coef'], alpha=0.7, 
          color=plt.cm.viridis(alpha_val))
    ax.axhline(0, color='k', linestyle='-', linewidth=0.5)
    ax.set_xlabel('Feature Index')
    ax.set_ylabel('Coefficient Value')
    ax.set_title(f'Elastic Net: α = {alpha_val:.1f}\n'
                f'MSE = {results_en[alpha_val]["test_mse"]:.3f}, '
                f'{results_en[alpha_val]["n_features"]} features selected')
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Compare all methods
print("\n📊 Final Comparison of All Methods:\n")
methods_comparison = [
    ('OLS', mse_ols_test, p, 'None'),
    ('Ridge', test_mse_ridge[best_lambda_idx], p, f'L2, λ={best_lambda:.2f}'),
    ('Lasso', test_mse_lasso[best_lambda_lasso_idx], n_selected, f'L1, λ={best_lambda_lasso:.2f}'),
    ('Elastic Net (α=0.5)', results_en[0.5]['test_mse'], results_en[0.5]['n_features'], 
     f'L1+L2, λ={lambda_en:.2f}')
]

print(f"{'Method':<20} {'Test MSE':<12} {'Features':<12} {'Regularization'}")
print("="*70)
for method, mse, n_feat, reg in sorted(methods_comparison, key=lambda x: x[1]):
    marker = '✅' if mse == min([x[1] for x in methods_comparison]) else ''
    print(f"{method:<20} {mse:>10.3f}   {n_feat:>10}   {reg:<20} {marker}")

print("\n💡 Summary:")
print("   • Ridge: Best when all features matter, handles multicollinearity")
print("   • Lasso: Best for sparse models, automatic variable selection")
print("   • Elastic Net: Best of both worlds, more stable than Lasso")
print("   • All regularization methods beat OLS on test error")

6.4 Selecting the Tuning Parameter

The Problem

How to choose λ (and α for Elastic Net)?

Solution: Cross-Validation

  1. Try grid of λ values

  2. For each λ:

    • Perform k-fold CV

    • Compute average CV error

  3. Select λ with minimum CV error

  4. One-standard-error rule: Choose simplest model within 1 SE of minimum
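The steps above, including the one-standard-error rule, can be sketched as follows (synthetic data; for Lasso, a larger α means a simpler model, so the rule picks the largest α within one SE of the minimum):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
alphas = np.logspace(-2, 2, 30)

means, ses = [], []
for a in alphas:
    # 10-fold CV error for this candidate λ (sklearn returns negated MSE)
    scores = -cross_val_score(Lasso(alpha=a, max_iter=10000), X, y,
                              cv=10, scoring='neg_mean_squared_error')
    means.append(scores.mean())
    ses.append(scores.std() / np.sqrt(len(scores)))
means, ses = np.array(means), np.array(ses)

best = means.argmin()
threshold = means[best] + ses[best]
# One-SE rule: largest α (most regularization) whose CV error is within 1 SE
one_se = alphas[means <= threshold].max()
print(f"min-CV alpha: {alphas[best]:.3f}, one-SE alpha: {one_se:.3f}")
```

The one-SE model is sparser than the minimum-CV model but statistically indistinguishable from it, which is why the rule favors parsimony.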

sklearn GridSearchCV

Automates this process:

  • Tries all parameter combinations

  • Performs CV for each

  • Returns best parameters

# Cross-validation for λ selection
print("📊 Selecting λ using Cross-Validation\n")

# Ridge CV
ridge_cv = GridSearchCV(Ridge(), 
                       param_grid={'alpha': lambdas},
                       cv=10,
                       scoring='neg_mean_squared_error',
                       return_train_score=True)
ridge_cv.fit(X_train_scaled, y_train)

# Lasso CV
lasso_cv = GridSearchCV(Lasso(max_iter=10000),
                       param_grid={'alpha': lambdas},
                       cv=10,
                       scoring='neg_mean_squared_error',
                       return_train_score=True)
lasso_cv.fit(X_train_scaled, y_train)

print("Ridge CV Results:")
print(f"  Best λ: {ridge_cv.best_params_['alpha']:.4f}")
print(f"  Best CV Score: {-ridge_cv.best_score_:.4f}")
print(f"  Test MSE: {mean_squared_error(y_test, ridge_cv.predict(X_test_scaled)):.4f}\n")

print("Lasso CV Results:")
print(f"  Best λ: {lasso_cv.best_params_['alpha']:.4f}")
print(f"  Best CV Score: {-lasso_cv.best_score_:.4f}")
print(f"  Test MSE: {mean_squared_error(y_test, lasso_cv.predict(X_test_scaled)):.4f}")
print(f"  Features selected: {np.sum(np.abs(lasso_cv.best_estimator_.coef_) > 1e-5)}\n")

# Visualize CV results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Ridge CV curve
cv_results_ridge = ridge_cv.cv_results_
axes[0].semilogx(lambdas, -cv_results_ridge['mean_test_score'], 'o-', 
                linewidth=2, label='CV Error', markersize=5)
axes[0].fill_between(lambdas,
                     -cv_results_ridge['mean_test_score'] - cv_results_ridge['std_test_score'],
                     -cv_results_ridge['mean_test_score'] + cv_results_ridge['std_test_score'],
                     alpha=0.2)
axes[0].axvline(ridge_cv.best_params_['alpha'], color='r', linestyle='--', 
               linewidth=2, label=f"Best λ = {ridge_cv.best_params_['alpha']:.4f}")
axes[0].set_xlabel('λ')
axes[0].set_ylabel('CV MSE')
axes[0].set_title('Ridge: 10-Fold CV Error ± 1 SE')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Lasso CV curve
cv_results_lasso = lasso_cv.cv_results_
axes[1].semilogx(lambdas, -cv_results_lasso['mean_test_score'], 'o-', 
                linewidth=2, label='CV Error', markersize=5, color='green')
axes[1].fill_between(lambdas,
                     -cv_results_lasso['mean_test_score'] - cv_results_lasso['std_test_score'],
                     -cv_results_lasso['mean_test_score'] + cv_results_lasso['std_test_score'],
                     alpha=0.2, color='green')
axes[1].axvline(lasso_cv.best_params_['alpha'], color='r', linestyle='--', 
               linewidth=2, label=f"Best λ = {lasso_cv.best_params_['alpha']:.4f}")
axes[1].set_xlabel('λ')
axes[1].set_ylabel('CV MSE')
axes[1].set_title('Lasso: 10-Fold CV Error ± 1 SE')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Best Practices:")
print("   • Always use CV to select λ (never use test set!)")
print("   • Try wide range of λ values (log scale)")
print("   • Shaded region shows ± 1 standard error")
print("   • One-SE rule: Choose simplest model within 1 SE of minimum")
print("   • 10-fold CV is standard choice")

Key Takeaways

1. Method Selection Guide

Situation                      Best Method            Why
======================================================================
All features useful            Ridge                  Shrinks all, keeps all
Many irrelevant features       Lasso                  Variable selection
Grouped correlated features    Elastic Net            Stability + selection
p > n                          Lasso or Elastic Net   Can handle high-D
Need interpretability          Lasso                  Sparse model
Multicollinearity              Ridge                  Handles correlation

2. Ridge vs Lasso

Ridge (L2):

  • ✅ Handles multicollinearity

  • ✅ Stable solutions

  • ✅ Works when all features contribute

  • ❌ No variable selection

  • ❌ All features in final model

Lasso (L1):

  • ✅ Automatic variable selection

  • ✅ Sparse, interpretable models

  • ✅ Works in high dimensions

  • ❌ Can be unstable with correlated features

  • ❌ Selects at most n features

Elastic Net:

  • ✅ Best of both worlds

  • ✅ More stable than Lasso

  • ✅ Variable selection

  • ❌ Two parameters to tune

3. Important Concepts

Bias-Variance Tradeoff:

λ = 0 (OLS):    Low bias, High variance
λ optimal:      Small bias, Lower variance
λ → ∞:          High bias, Low variance

Standardization:

  • Must standardize features before Ridge/Lasso

  • Penalty is scale-dependent

  • Use StandardScaler

λ Selection:

  • Use cross-validation

  • Try log-spaced grid: [0.001, 0.01, 0.1, 1, 10, 100]

  • One-SE rule for parsimony

4. Practical Workflow

# 1. Split, then standardize (fit the scaler on training data only)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Set up CV grid search
model = Lasso(max_iter=10000)  # or Ridge, ElasticNet
params = {'alpha': np.logspace(-2, 2, 50)}
cv = GridSearchCV(model, params, cv=10)

# 3. Fit and select best λ
cv.fit(X_train_scaled, y_train)
best_model = cv.best_estimator_

# 4. Evaluate on test set
test_score = best_model.score(X_test_scaled, y_test)

5. When Regularization Helps Most

  • p ≈ n (many features)

  • Highly correlated predictors

  • Noisy features

  • Overfitting observed

  • Need interpretability (Lasso)

6. Common Mistakes

  • ❌ Forgetting to standardize

  • ❌ Using test set to select λ

  • ❌ Not trying wide enough range of λ

  • ❌ Standardizing test set independently

  • ❌ Comparing unstandardized coefficients
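The two standardization mistakes above are avoided by putting the scaler inside a Pipeline, so each CV fold re-fits it on that fold's training portion only, and the test set is transformed with the training fit. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('scale', StandardScaler()),  # re-fitted inside every CV fold
                 ('ridge', Ridge())])
# Double-underscore syntax targets the named pipeline step's parameter
grid = GridSearchCV(pipe, {'ridge__alpha': np.logspace(-2, 2, 20)}, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```

Calling `grid.predict`/`grid.score` on raw test data applies the training-fitted scaler automatically, so no information leaks from the test set into λ selection.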

Next Chapter

Chapter 7: Moving Beyond Linearity

  • Polynomial Regression

  • Step Functions

  • Regression Splines

  • Smoothing Splines

  • Local Regression

  • Generalized Additive Models (GAMs)

Practice Exercises

Exercise 1: Ridge vs Lasso Intuition

  1. Why does Lasso set coefficients to exactly 0, but Ridge doesn’t?

  2. Draw the constraint regions for Ridge and Lasso in 2D

  3. Explain why Lasso’s corners lead to sparse solutions

Exercise 2: Implement Ridge from Scratch

# Ridge has closed-form solution:
# β_ridge = (X'X + λI)^(-1) X'y

  1. Implement Ridge regression

  2. Compare with sklearn

  3. Verify coefficients shrink as λ increases

Exercise 3: λ Selection

Generate data and:

  1. Fit Lasso with λ ∈ [0.01, 100]

  2. Plot CV error vs λ

  3. Plot number of non-zero coefficients vs λ

  4. Identify best λ and corresponding features

Exercise 4: Standardization Impact

  1. Fit Ridge/Lasso on unstandardized data

  2. Fit Ridge/Lasso on standardized data

  3. Compare coefficients

  4. Explain the differences

Exercise 5: Elastic Net Tuning

Use GridSearchCV to find best (λ, α) for Elastic Net:

param_grid = {
    'alpha': np.logspace(-2, 2, 20),
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}

  1. Which combination performs best?

  2. Is it closer to Ridge or Lasso?

  3. How many features selected?

Exercise 6: High-Dimensional Case

Generate data with p > n:

  1. Try OLS (will it work?)

  2. Fit Ridge, Lasso, Elastic Net

  3. Which works best?

  4. Explain why