Chapter 6: Linear Model Selection and Regularization

Overview

Problem: Standard linear regression can suffer from:

  1. Prediction accuracy: High variance when p ≈ n or p > n

  2. Model interpretability: Too many predictors make model hard to understand

Solutions: This chapter covers alternatives to least squares

Three Classes of Methods

1. Subset Selection

Identify a subset of the p predictors believed to be related to the response, then fit least squares on that subset

  • Best Subset: Try all 2ᵖ models

  • Stepwise Selection: Forward/Backward sequential addition/removal
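The stepwise idea can be sketched with scikit-learn's SequentialFeatureSelector, which greedily adds (or removes) one feature at a time based on cross-validated score. This is an illustrative sketch on synthetic data, not an example from the text:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
# Only the first two features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(100) * 0.5

# Forward stepwise: start empty, add the feature that most improves CV score
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2,
                                direction='forward', cv=5)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```

With `direction='backward'` the same class starts from all p features and drops them one at a time. Forward stepwise fits on the order of p² models instead of best subset's 2ᵖ.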

2. Shrinkage (Regularization)

Fit model with all p predictors, but shrink coefficients toward zero

  • Ridge Regression: L2 penalty, shrinks coefficients

  • Lasso: L1 penalty, shrinks coefficients AND performs variable selection

  • Elastic Net: Combination of L1 and L2

3. Dimension Reduction

Project p predictors into M-dimensional subspace (M < p)

  • Principal Components Regression (PCR)

  • Partial Least Squares (PLS)
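PCR can be sketched as a PCA-then-OLS pipeline: project the p predictors onto their first M principal components and regress on those. A minimal illustration on synthetic data (the dataset and M = 5 are assumptions for the sketch):

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

# PCR: project p = 20 predictors onto M = 5 components, then fit least squares
pcr = Pipeline([('pca', PCA(n_components=5)), ('ols', LinearRegression())])
pcr.fit(X, y)
print(pcr.score(X, y))  # training R² of the reduced-dimension model
```

In practice M is chosen by cross-validation, just like λ for the shrinkage methods below.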

Why Regularization?

Bias-Variance Tradeoff:

  • Least squares: Unbiased but can have high variance

  • Regularization: Introduces a small bias but reduces variance

  • Result: Often lower test MSE!

\[\text{Test MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]
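The decomposition above can be checked numerically with a toy shrinkage estimator (an illustrative sketch, not from the text): estimating a mean μ by c·x̄ with c < 1 adds bias but cuts variance, and for moderate c the total MSE drops below that of the unbiased estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 2.0, 5.0, 10
samples = rng.normal(mu, sigma, size=(100_000, n))
xbar = samples.mean(axis=1)          # unbiased estimator, Var = σ²/n = 2.5

for c in [1.0, 0.8, 0.6]:            # c = 1 is the unbiased (OLS-like) case
    est = c * xbar
    bias2 = (est.mean() - mu) ** 2   # squared bias grows as c shrinks
    var = est.var()                  # variance shrinks by factor c²
    mse = ((est - mu) ** 2).mean()   # ≈ bias² + variance
    print(f"c={c}: bias²={bias2:.3f}, var={var:.3f}, MSE={mse:.3f}")
```

Here c = 0.8 gives bias² ≈ 0.16 and variance ≈ 1.6, so its MSE beats the unbiased estimator's 2.5. Ridge and Lasso exploit exactly this tradeoff.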

6.1 Ridge Regression

The Problem with OLS

When predictors are correlated or p ≈ n:

  • Coefficient estimates have high variance

  • Small changes in data → large changes in coefficients

  • Poor test set performance

Ridge Solution

Minimize the modified loss function:

\[\text{RSS} + \lambda \sum_{j=1}^p \beta_j^2 = \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^p \beta_j^2\]

λ (lambda) = tuning parameter

  • λ = 0: Ridge = OLS

  • λ → ∞: All βⱼ → 0

  • λ > 0: Shrinks coefficients toward zero

Key Properties

  • Never sets coefficients exactly to zero (all predictors kept)

  • ✅ Reduces variance at cost of small bias

  • ✅ Works well when all predictors are somewhat useful

  • ✅ Handles multicollinearity well

  • ⚠️ Must standardize predictors (different scales → different penalties)

L2 Penalty (Euclidean norm)

\[||\beta||_2^2 = \sum_{j=1}^p \beta_j^2\]

Choosing λ

  • Use cross-validation

  • Try many values: λ ∈ {0.01, 0.1, 1, 10, 100, …}

  • Select λ with minimum CV error

# Imports used throughout this chapter's demos
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

# Generate data with correlated predictors
np.random.seed(42)
n = 100
p = 20

# Create correlated features
X_base = np.random.randn(n, 5)
X_corr = np.zeros((n, p))
for i in range(p):
    # Each feature is combination of base features + noise
    weights = np.random.randn(5)
    X_corr[:, i] = X_base @ weights + np.random.randn(n) * 0.1

# True model: only first 5 features matter
true_coef = np.zeros(p)
true_coef[:5] = [3, -2, 1.5, -1, 0.5]
y = X_corr @ true_coef + np.random.randn(n) * 2

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_corr, y, test_size=0.3, random_state=42)

# Standardize (crucial for Ridge!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("📊 Ridge Regression Demo\n")
print(f"Data: n = {n}, p = {p}")
print(f"True model: Only first 5 features have non-zero coefficients")
print(f"Problem: High correlation between features\n")

# OLS (λ=0)
ols = LinearRegression()
ols.fit(X_train_scaled, y_train)
y_pred_ols_train = ols.predict(X_train_scaled)
y_pred_ols_test = ols.predict(X_test_scaled)
mse_ols_train = mean_squared_error(y_train, y_pred_ols_train)
mse_ols_test = mean_squared_error(y_test, y_pred_ols_test)

print("Ordinary Least Squares (λ=0):")
print(f"  Train MSE: {mse_ols_train:.3f}")
print(f"  Test MSE:  {mse_ols_test:.3f}")
print(f"  Overfitting gap: {mse_ols_test - mse_ols_train:.3f}\n")

# Ridge with different λ values
lambdas = np.logspace(-2, 4, 100)  # 0.01 to 10000
train_mse_ridge = []
test_mse_ridge = []
coef_paths = []

for lam in lambdas:
    ridge = Ridge(alpha=lam)  # sklearn uses 'alpha' for λ
    ridge.fit(X_train_scaled, y_train)
    
    train_mse_ridge.append(mean_squared_error(y_train, ridge.predict(X_train_scaled)))
    test_mse_ridge.append(mean_squared_error(y_test, ridge.predict(X_test_scaled)))
    coef_paths.append(ridge.coef_.copy())

coef_paths = np.array(coef_paths)
best_lambda_idx = np.argmin(test_mse_ridge)  # for illustration only; in practice select λ by CV, not the test set
best_lambda = lambdas[best_lambda_idx]

print(f"Best λ (from test set): {best_lambda:.2f}")
print(f"  Train MSE: {train_mse_ridge[best_lambda_idx]:.3f}")
print(f"  Test MSE:  {test_mse_ridge[best_lambda_idx]:.3f}")
print(f"  Improvement over OLS: {mse_ols_test - test_mse_ridge[best_lambda_idx]:.3f}")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Plot 1: Train vs Test MSE
axes[0].semilogx(lambdas, train_mse_ridge, label='Training MSE', linewidth=2)
axes[0].semilogx(lambdas, test_mse_ridge, label='Test MSE', linewidth=2)
axes[0].axvline(best_lambda, color='r', linestyle='--', linewidth=2, 
               label=f'Best λ = {best_lambda:.2f}')
axes[0].axhline(mse_ols_test, color='orange', linestyle='--', alpha=0.7, label='OLS Test MSE')
axes[0].set_xlabel('λ (Regularization Strength)')
axes[0].set_ylabel('MSE')
axes[0].set_title('Ridge Regression: MSE vs λ')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Coefficient paths
for i in range(p):
    if i < 5:
        axes[1].semilogx(lambdas, coef_paths[:, i], linewidth=2, label=f'β{i+1} (true≠0)')
    else:
        axes[1].semilogx(lambdas, coef_paths[:, i], linewidth=1, alpha=0.3, color='gray')

axes[1].axvline(best_lambda, color='r', linestyle='--', linewidth=2, alpha=0.7)
axes[1].axhline(0, color='k', linestyle='-', linewidth=0.5)
axes[1].set_xlabel('λ (Regularization Strength)')
axes[1].set_ylabel('Coefficient Value')
axes[1].set_title('Ridge Coefficient Paths\n(Shrink toward 0, never exactly 0)')
axes[1].legend(loc='upper right', fontsize=8)
axes[1].grid(True, alpha=0.3)

# Plot 3: Compare coefficients at best λ
ridge_best = Ridge(alpha=best_lambda)
ridge_best.fit(X_train_scaled, y_train)

x_pos = np.arange(p)
width = 0.35
axes[2].bar(x_pos - width/2, true_coef, width, label='True', alpha=0.7)
axes[2].bar(x_pos + width/2, ridge_best.coef_, width, label=f'Ridge (λ={best_lambda:.2f})', alpha=0.7)
axes[2].axhline(0, color='k', linestyle='-', linewidth=0.5)
axes[2].set_xlabel('Feature Index')
axes[2].set_ylabel('Coefficient Value')
axes[2].set_title(f'True vs Ridge Coefficients (λ={best_lambda:.2f})')
axes[2].legend()
axes[2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n💡 Key Observations:")
print("   • As λ increases, all coefficients shrink toward 0")
print("   • But coefficients never become exactly 0 (L2 penalty)")
print("   • Test MSE decreases then increases (bias-variance tradeoff)")
print("   • Ridge reduces overfitting (smaller gap between train/test)")
print("   • All 20 features retained (no variable selection)")

6.2 The Lasso

Limitation of Ridge

Ridge includes all p predictors in final model

  • Doesn’t perform variable selection

  • Less interpretable when p is large

Lasso Solution

Least Absolute Shrinkage and Selection Operator: minimize

\[\sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^p |\beta_j|\]

L1 Penalty (Manhattan norm)

\[||\beta||_1 = \sum_{j=1}^p |\beta_j|\]

Key Difference from Ridge

Lasso uses absolute value instead of squared values

  • Forces some coefficients to be exactly zero

  • Performs automatic variable selection

  • More interpretable (sparse models)

When to Use

  • Lasso: When you believe many features are irrelevant

    • Sparse true model

    • Want interpretability

    • Variable selection needed

  • Ridge: When you believe all features are somewhat useful

    • All predictors contribute

    • Multicollinearity present

    • Don’t need variable selection

Geometric Interpretation

Ridge constraint: \(\sum \beta_j^2 \leq s\) (circle/sphere)

Lasso constraint: \(\sum |\beta_j| \leq s\) (diamond/cross-polytope)

Lasso’s corners → sparse solutions!
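The "exact zero" behavior is easiest to see in the orthonormal-design case, where both estimators have closed forms: ridge shrinks each OLS coefficient proportionally, while lasso applies soft thresholding. A minimal sketch (the penalty scaling here follows the standard orthonormal-case formulas):

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    # Orthonormal design: uniform proportional shrinkage, never exactly zero
    return beta_ols / (1 + lam)

def lasso_shrink(beta_ols, lam):
    # Soft thresholding: coefficients smaller than the threshold become exactly 0
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

beta_ols = np.array([3.0, -2.0, 0.4, -0.1])
lam = 1.0
print(ridge_shrink(beta_ols, lam))  # all four still non-zero, just smaller
print(lasso_shrink(beta_ols, lam))  # the two small coefficients are exactly 0
```

This is the algebraic counterpart of the geometric picture: the diamond's corners sit on the axes, so the constrained optimum often lands where some βⱼ = 0.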

# Lasso demonstration (same data as Ridge)
print("📊 Lasso Regression\n")

# Lasso with different λ values
train_mse_lasso = []
test_mse_lasso = []
coef_paths_lasso = []
n_nonzero = []

for lam in lambdas:
    lasso = Lasso(alpha=lam, max_iter=10000)
    lasso.fit(X_train_scaled, y_train)
    
    train_mse_lasso.append(mean_squared_error(y_train, lasso.predict(X_train_scaled)))
    test_mse_lasso.append(mean_squared_error(y_test, lasso.predict(X_test_scaled)))
    coef_paths_lasso.append(lasso.coef_.copy())
    n_nonzero.append(np.sum(np.abs(lasso.coef_) > 1e-5))  # Count non-zero coefficients

coef_paths_lasso = np.array(coef_paths_lasso)
best_lambda_lasso_idx = np.argmin(test_mse_lasso)
best_lambda_lasso = lambdas[best_lambda_lasso_idx]

# Fit best Lasso model
lasso_best = Lasso(alpha=best_lambda_lasso, max_iter=10000)
lasso_best.fit(X_train_scaled, y_train)
n_selected = np.sum(np.abs(lasso_best.coef_) > 1e-5)

print(f"Best λ (from test set): {best_lambda_lasso:.2f}")
print(f"  Train MSE: {train_mse_lasso[best_lambda_lasso_idx]:.3f}")
print(f"  Test MSE:  {test_mse_lasso[best_lambda_lasso_idx]:.3f}")
print(f"  Features selected: {n_selected} / {p}")
print(f"  Improvement over OLS: {mse_ols_test - test_mse_lasso[best_lambda_lasso_idx]:.3f}\n")

# Compare Ridge vs Lasso
print("📊 Ridge vs Lasso Comparison:\n")
print(f"{'Method':<15} {'Test MSE':<12} {'# Features':<15} {'Interpretability'}")
print("="*65)
print(f"{'OLS':<15} {mse_ols_test:>10.3f}   {p:>12}     Low (all features)")
print(f"{'Ridge':<15} {test_mse_ridge[best_lambda_idx]:>10.3f}   {p:>12}     Low (all features)")
print(f"{'Lasso':<15} {test_mse_lasso[best_lambda_lasso_idx]:>10.3f}   {n_selected:>12}     High (sparse) ✅")

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Lasso coefficient paths
for i in range(p):
    if i < 5:
        axes[0, 0].semilogx(lambdas, coef_paths_lasso[:, i], linewidth=2, label=f'β{i+1} (true≠0)')
    else:
        axes[0, 0].semilogx(lambdas, coef_paths_lasso[:, i], linewidth=1, alpha=0.3, color='gray')

axes[0, 0].axvline(best_lambda_lasso, color='r', linestyle='--', linewidth=2, alpha=0.7)
axes[0, 0].axhline(0, color='k', linestyle='-', linewidth=0.5)
axes[0, 0].set_xlabel('λ (Regularization Strength)')
axes[0, 0].set_ylabel('Coefficient Value')
axes[0, 0].set_title('Lasso Coefficient Paths\n(Many coefficients → exactly 0)')
axes[0, 0].legend(loc='upper right', fontsize=8)
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Number of non-zero coefficients
axes[0, 1].semilogx(lambdas, n_nonzero, linewidth=2, color='purple')
axes[0, 1].axvline(best_lambda_lasso, color='r', linestyle='--', linewidth=2, 
                  label=f'Best λ: {n_selected} features')
axes[0, 1].set_xlabel('λ (Regularization Strength)')
axes[0, 1].set_ylabel('Number of Non-Zero Coefficients')
axes[0, 1].set_title('Lasso Variable Selection\n(Automatic feature selection)')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Ridge vs Lasso MSE
axes[1, 0].semilogx(lambdas, test_mse_ridge, label='Ridge', linewidth=2)
axes[1, 0].semilogx(lambdas, test_mse_lasso, label='Lasso', linewidth=2)
axes[1, 0].axhline(mse_ols_test, color='orange', linestyle='--', alpha=0.7, label='OLS')
axes[1, 0].axvline(best_lambda, color='blue', linestyle='--', alpha=0.5, label='Best Ridge λ')
axes[1, 0].axvline(best_lambda_lasso, color='red', linestyle='--', alpha=0.5, label='Best Lasso λ')
axes[1, 0].set_xlabel('λ (Regularization Strength)')
axes[1, 0].set_ylabel('Test MSE')
axes[1, 0].set_title('Ridge vs Lasso: Test MSE')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Coefficient comparison
x_pos = np.arange(p)
width = 0.25
axes[1, 1].bar(x_pos - width, true_coef, width, label='True', alpha=0.7)
axes[1, 1].bar(x_pos, ridge_best.coef_, width, label='Ridge', alpha=0.7)
axes[1, 1].bar(x_pos + width, lasso_best.coef_, width, label='Lasso', alpha=0.7)
axes[1, 1].axhline(0, color='k', linestyle='-', linewidth=0.5)
axes[1, 1].set_xlabel('Feature Index')
axes[1, 1].set_ylabel('Coefficient Value')
axes[1, 1].set_title('True vs Ridge vs Lasso Coefficients')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Show which features Lasso selected
print("\n🎯 Features Selected by Lasso:\n")
selected_features = np.where(np.abs(lasso_best.coef_) > 1e-5)[0]
print(f"{'Feature':<12} {'True Coef':<12} {'Lasso Coef':<15} {'Ridge Coef'}")
print("="*55)
for idx in selected_features:
    marker = '✅' if idx < 5 else ''
    print(f"Feature {idx:<5} {true_coef[idx]:>10.3f}   {lasso_best.coef_[idx]:>12.3f}   "
          f"{ridge_best.coef_[idx]:>12.3f}  {marker}")

print("\n💡 Key Insights:")
print("   • Lasso sets many coefficients to exactly 0 (sparse solution)")
print("   • Ridge shrinks all coefficients but keeps all features")
print("   • Lasso performs automatic variable selection")
print(f"   • Lasso selected {n_selected}/{p} features, including {sum(selected_features < 5)}/5 true features")
print("   • More interpretable model with similar performance")

6.3 Elastic Net

Combining Ridge and Lasso

Elastic Net uses both L1 and L2 penalties:

\[\text{Loss} = \text{RSS} + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2\]

Or equivalently:

\[\text{Loss} = \text{RSS} + \lambda \left[\alpha ||\beta||_1 + (1-\alpha) ||\beta||_2^2\right]\]

where α ∈ [0,1] controls the mix:

  • α = 1: Pure Lasso

  • α = 0: Pure Ridge

  • α ∈ (0,1): Elastic Net

Advantages

  • ✅ Variable selection like Lasso

  • ✅ Handles correlated predictors better than Lasso

  • ✅ Can select more than n features (Lasso limited to n)

  • ✅ More stable than Lasso when features are highly correlated

When to Use

  • Grouped variables (correlated features)

  • p >> n situations

  • Want variable selection + stability

  • Typical choice: α = 0.5 (equal mix)

# Elastic Net demonstration
print("📊 Elastic Net Regression\n")

# Try different α values (l1_ratio in sklearn)
alpha_values = [0.1, 0.5, 0.7, 0.9]  # Mix of L1/L2
lambda_en = best_lambda_lasso  # Use same λ for fair comparison

results_en = {}

for alpha_val in alpha_values:
    en = ElasticNet(alpha=lambda_en, l1_ratio=alpha_val, max_iter=10000)
    en.fit(X_train_scaled, y_train)
    
    test_mse_en = mean_squared_error(y_test, en.predict(X_test_scaled))
    n_selected_en = np.sum(np.abs(en.coef_) > 1e-5)  # separate name: don't clobber Lasso's n_selected
    
    results_en[alpha_val] = {
        'model': en,
        'test_mse': test_mse_en,
        'n_features': n_selected_en,
        'coef': en.coef_
    }

print(f"Elastic Net Results (λ = {lambda_en:.2f}):\n")
print(f"{'α (L1 ratio)':<15} {'Test MSE':<12} {'# Features':<15} {'Description'}")
print("="*70)
print(f"{'0.0 (Ridge)':<15} {test_mse_ridge[best_lambda_idx]:>10.3f}   {p:>12}     Pure L2")
for alpha_val in alpha_values:
    desc = f"{int(alpha_val*100)}% L1, {int((1-alpha_val)*100)}% L2"
    print(f"{alpha_val:<15.1f} {results_en[alpha_val]['test_mse']:>10.3f}   "
          f"{results_en[alpha_val]['n_features']:>12}     {desc}")
print(f"{'1.0 (Lasso)':<15} {test_mse_lasso[best_lambda_lasso_idx]:>10.3f}   {n_selected:>12}     Pure L1")

# Visualize Elastic Net with different α
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

for idx, alpha_val in enumerate(alpha_values):
    ax = axes[idx // 2, idx % 2]
    
    # Bar plot of coefficients
    x_pos = np.arange(p)
    ax.bar(x_pos, results_en[alpha_val]['coef'], alpha=0.7, 
          color=plt.cm.viridis(alpha_val))
    ax.axhline(0, color='k', linestyle='-', linewidth=0.5)
    ax.set_xlabel('Feature Index')
    ax.set_ylabel('Coefficient Value')
    ax.set_title(f'Elastic Net: α = {alpha_val:.1f}\n'
                f'MSE = {results_en[alpha_val]["test_mse"]:.3f}, '
                f'{results_en[alpha_val]["n_features"]} features selected')
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Compare all methods
print("\n📊 Final Comparison of All Methods:\n")
methods_comparison = [
    ('OLS', mse_ols_test, p, 'None'),
    ('Ridge', test_mse_ridge[best_lambda_idx], p, f'L2, λ={best_lambda:.2f}'),
    ('Lasso', test_mse_lasso[best_lambda_lasso_idx], n_selected, f'L1, λ={best_lambda_lasso:.2f}'),
    ('Elastic Net (α=0.5)', results_en[0.5]['test_mse'], results_en[0.5]['n_features'], 
     f'L1+L2, λ={lambda_en:.2f}')
]

print(f"{'Method':<20} {'Test MSE':<12} {'Features':<12} {'Regularization'}")
print("="*70)
for method, mse, n_feat, reg in sorted(methods_comparison, key=lambda x: x[1]):
    marker = '✅' if mse == min([x[1] for x in methods_comparison]) else ''
    print(f"{method:<20} {mse:>10.3f}   {n_feat:>10}   {reg:<20} {marker}")

print("\n💡 Summary:")
print("   • Ridge: Best when all features matter, handles multicollinearity")
print("   • Lasso: Best for sparse models, automatic variable selection")
print("   • Elastic Net: Best of both worlds, more stable than Lasso")
print("   • All regularization methods beat OLS on test error")

6.4 Selecting the Tuning Parameter

The Problem

How to choose λ (and α for Elastic Net)?

Solution: Cross-Validation

  1. Try grid of λ values

  2. For each λ:

    • Perform k-fold CV

    • Compute average CV error

  3. Select λ with minimum CV error

  4. One-standard-error rule: Choose simplest model within 1 SE of minimum
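The steps above, including the one-standard-error rule, can be sketched as follows (synthetic data; for Lasso, a larger α means a simpler model, so the rule picks the largest α within one SE of the minimum):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
alphas = np.logspace(-2, 2, 30)

means, ses = [], []
for a in alphas:
    # 10-fold CV error for this candidate λ (sklearn returns negated MSE)
    scores = -cross_val_score(Lasso(alpha=a, max_iter=10000), X, y,
                              cv=10, scoring='neg_mean_squared_error')
    means.append(scores.mean())
    ses.append(scores.std() / np.sqrt(len(scores)))
means, ses = np.array(means), np.array(ses)

best = means.argmin()
threshold = means[best] + ses[best]
# One-SE rule: largest α (most regularization) whose CV error is within 1 SE
one_se = alphas[means <= threshold].max()
print(f"min-CV alpha: {alphas[best]:.3f}, one-SE alpha: {one_se:.3f}")
```

The one-SE model is sparser than the minimum-CV model but statistically indistinguishable from it, which is why the rule favors parsimony.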

sklearn GridSearchCV

Automates this process:

  • Tries all parameter combinations

  • Performs CV for each

  • Returns best parameters

# Cross-validation for λ selection
print("📊 Selecting λ using Cross-Validation\n")

# Ridge CV
ridge_cv = GridSearchCV(Ridge(), 
                       param_grid={'alpha': lambdas},
                       cv=10,
                       scoring='neg_mean_squared_error',
                       return_train_score=True)
ridge_cv.fit(X_train_scaled, y_train)

# Lasso CV
lasso_cv = GridSearchCV(Lasso(max_iter=10000),
                       param_grid={'alpha': lambdas},
                       cv=10,
                       scoring='neg_mean_squared_error',
                       return_train_score=True)
lasso_cv.fit(X_train_scaled, y_train)

print("Ridge CV Results:")
print(f"  Best λ: {ridge_cv.best_params_['alpha']:.4f}")
print(f"  Best CV Score: {-ridge_cv.best_score_:.4f}")
print(f"  Test MSE: {mean_squared_error(y_test, ridge_cv.predict(X_test_scaled)):.4f}\n")

print("Lasso CV Results:")
print(f"  Best λ: {lasso_cv.best_params_['alpha']:.4f}")
print(f"  Best CV Score: {-lasso_cv.best_score_:.4f}")
print(f"  Test MSE: {mean_squared_error(y_test, lasso_cv.predict(X_test_scaled)):.4f}")
print(f"  Features selected: {np.sum(np.abs(lasso_cv.best_estimator_.coef_) > 1e-5)}\n")

# Visualize CV results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Ridge CV curve
cv_results_ridge = ridge_cv.cv_results_
axes[0].semilogx(lambdas, -cv_results_ridge['mean_test_score'], 'o-', 
                linewidth=2, label='CV Error', markersize=5)
axes[0].fill_between(lambdas,
                     -cv_results_ridge['mean_test_score'] - cv_results_ridge['std_test_score'],
                     -cv_results_ridge['mean_test_score'] + cv_results_ridge['std_test_score'],
                     alpha=0.2)
axes[0].axvline(ridge_cv.best_params_['alpha'], color='r', linestyle='--', 
               linewidth=2, label=f"Best λ = {ridge_cv.best_params_['alpha']:.4f}")
axes[0].set_xlabel('λ')
axes[0].set_ylabel('CV MSE')
axes[0].set_title('Ridge: 10-Fold CV Error ± 1 SE')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Lasso CV curve
cv_results_lasso = lasso_cv.cv_results_
axes[1].semilogx(lambdas, -cv_results_lasso['mean_test_score'], 'o-', 
                linewidth=2, label='CV Error', markersize=5, color='green')
axes[1].fill_between(lambdas,
                     -cv_results_lasso['mean_test_score'] - cv_results_lasso['std_test_score'],
                     -cv_results_lasso['mean_test_score'] + cv_results_lasso['std_test_score'],
                     alpha=0.2, color='green')
axes[1].axvline(lasso_cv.best_params_['alpha'], color='r', linestyle='--', 
               linewidth=2, label=f"Best λ = {lasso_cv.best_params_['alpha']:.4f}")
axes[1].set_xlabel('λ')
axes[1].set_ylabel('CV MSE')
axes[1].set_title('Lasso: 10-Fold CV Error ± 1 SE')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Best Practices:")
print("   • Always use CV to select λ (never use test set!)")
print("   • Try wide range of λ values (log scale)")
print("   • Shaded region shows ± 1 standard error")
print("   • One-SE rule: Choose simplest model within 1 SE of minimum")
print("   • 10-fold CV is standard choice")

Key Takeaways

1. Method Selection Guide

Situation                      Best Method            Why
======================================================================
All features useful            Ridge                  Shrinks all, keeps all
Many irrelevant features       Lasso                  Variable selection
Grouped correlated features    Elastic Net            Stability + selection
p > n                          Lasso or Elastic Net   Can handle high-D
Need interpretability          Lasso                  Sparse model
Multicollinearity              Ridge                  Handles correlation

2. Ridge vs Lasso

Ridge (L2):

  • ✅ Handles multicollinearity

  • ✅ Stable solutions

  • ✅ Works when all features contribute

  • ❌ No variable selection

  • ❌ All features in final model

Lasso (L1):

  • ✅ Automatic variable selection

  • ✅ Sparse, interpretable models

  • ✅ Works in high dimensions

  • ❌ Can be unstable with correlated features

  • ❌ Selects at most n features

Elastic Net:

  • ✅ Best of both worlds

  • ✅ More stable than Lasso

  • ✅ Variable selection

  • ❌ Two parameters to tune

3. Important Concepts

Bias-Variance Tradeoff:

λ = 0 (OLS):    Low bias, High variance
λ optimal:      Small bias, Lower variance
λ → ∞:          High bias, Low variance

Standardization:

  • Must standardize features before Ridge/Lasso

  • Penalty is scale-dependent

  • Use StandardScaler

λ Selection:

  • Use cross-validation

  • Try log-spaced grid: [0.001, 0.01, 0.1, 1, 10, 100]

  • One-SE rule for parsimony

4. Practical Workflow

# 1. Split, then standardize (fit the scaler on training data only)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Set up CV grid search
model = Lasso(max_iter=10000)  # or Ridge, ElasticNet
params = {'alpha': np.logspace(-2, 2, 50)}
cv = GridSearchCV(model, params, cv=10)

# 3. Fit and select best λ
cv.fit(X_train_scaled, y_train)
best_model = cv.best_estimator_

# 4. Evaluate on test set
test_score = best_model.score(X_test_scaled, y_test)

5. When Regularization Helps Most

  • p ≈ n (many features)

  • Highly correlated predictors

  • Noisy features

  • Overfitting observed

  • Need interpretability (Lasso)

6. Common Mistakes

  • ❌ Forgetting to standardize

  • ❌ Using test set to select λ

  • ❌ Not trying wide enough range of λ

  • ❌ Standardizing test set independently

  • ❌ Comparing unstandardized coefficients
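The two standardization mistakes above are avoided by putting the scaler inside a Pipeline, so each CV fold re-fits it on that fold's training portion only, and the test set is transformed with the training fit. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('scale', StandardScaler()),  # re-fitted inside every CV fold
                 ('ridge', Ridge())])
# Double-underscore syntax targets the named pipeline step's parameter
grid = GridSearchCV(pipe, {'ridge__alpha': np.logspace(-2, 2, 20)}, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```

Calling `grid.predict`/`grid.score` on raw test data applies the training-fitted scaler automatically, so no information leaks from the test set into λ selection.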

Next Chapter

Chapter 7: Moving Beyond Linearity

  • Polynomial Regression

  • Step Functions

  • Regression Splines

  • Smoothing Splines

  • Local Regression

  • Generalized Additive Models (GAMs)

Practice Exercises

Exercise 1: Ridge vs Lasso Intuition

  1. Why does Lasso set coefficients to exactly 0, but Ridge doesn’t?

  2. Draw the constraint regions for Ridge and Lasso in 2D

  3. Explain why Lasso’s corners lead to sparse solutions

Exercise 2: Implement Ridge from Scratch

# Ridge has closed-form solution:
# β_ridge = (X'X + λI)^(-1) X'y

  1. Implement Ridge regression

  2. Compare with sklearn

  3. Verify coefficients shrink as λ increases

Exercise 3: λ Selection

Generate data and:

  1. Fit Lasso with λ ∈ [0.01, 100]

  2. Plot CV error vs λ

  3. Plot number of non-zero coefficients vs λ

  4. Identify best λ and corresponding features

Exercise 4: Standardization Impact

  1. Fit Ridge/Lasso on unstandardized data

  2. Fit Ridge/Lasso on standardized data

  3. Compare coefficients

  4. Explain the differences

Exercise 5: Elastic Net Tuning

Use GridSearchCV to find best (λ, α) for Elastic Net:

param_grid = {
    'alpha': np.logspace(-2, 2, 20),
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}

  1. Which combination performs best?

  2. Is it closer to Ridge or Lasso?

  3. How many features selected?

Exercise 6: High-Dimensional Case

Generate data with p > n:

  1. Try OLS (will it work?)

  2. Fit Ridge, Lasso, Elastic Net

  3. Which works best?

  4. Explain why