Lecture 8: Bias, Variance, and Regularization

📹 Watch Lecture

From Andrew Ng’s CS229 Lecture 8

“Bias and variance is one of those concepts that’s easy to understand but hard to master. I’ve had PhD students that worked with me for several years, and their understanding of bias and variance continues to deepen.” - Andrew Ng

The Fundamental Trade-off

From the lecture: “When you train a learning algorithm, it almost never works the first time. My standard workflow is to train something quick and dirty, then understand if the algorithm has a problem of high bias or high variance, and use that insight to improve it.”

Definitions

High Bias (Underfitting):

  • Model is too simple

  • “Strong preconceptions that don’t match reality”

  • Example: Fitting linear function to curved data

  • “This algorithm had a very strong bias that the relationship is linear, and this bias turns out not to be true”

High Variance (Overfitting):

  • Model is too complex

  • Fits noise in training data

  • “If a friend collects a slightly different dataset, this algorithm will fit some totally other varying function”

  • “Very high variability in predictions”

Just Right:

  • Captures true pattern

  • Generalizes well to new data

  • Balance between bias and variance

Visual Examples from Lecture

Housing Price Prediction:

Underfit (High Bias):       Just Right:                Overfit (High Variance):
θ₀ + θ₁x                    θ₀ + θ₁x + θ₂x²            θ₀ + θ₁x + ... + θₖxᵏ (large k)
(straight line)             (quadratic curve)          (passes through all points)

Classification:

Underfit:                   Just Right:            Overfit:
Linear boundary             Smooth curve           Wild zigzag boundary
(too simple)                                       (fits noise)

Workflow for Improving Models

From lecture: “We have a menu of tools for reducing bias or reducing variance”

High Bias → Need more model capacity:

  • Add more features

  • Add polynomial features

  • Decrease regularization (λ)

  • Use more complex model

High Variance → Need regularization:

  • Get more training data

  • Reduce number of features (feature selection)

  • Increase regularization (λ)

  • Use simpler model

Setup

We import core numerical and ML libraries including scikit-learn’s Ridge, Lasso, and ElasticNet estimators, along with PolynomialFeatures for demonstrating overfitting. Regularization is fundamentally about adding a penalty term \(\lambda\|\theta\|\) to the loss function to constrain model complexity, and we will compare how L1 (Lasso), L2 (Ridge), and combined (Elastic Net) penalties affect learned weights. Visualization tools are critical here for seeing how regularization paths – coefficient values plotted as a function of \(\lambda\) – reveal the shrinkage and sparsity behavior of each method.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded successfully!")

3.1 Demonstrating Overfitting

Let’s create a simple example where we fit polynomials of increasing degree to data.

# Generate simple dataset: y = sin(2πx) + noise
np.random.seed(42)
n_samples = 20
X_simple = np.sort(np.random.rand(n_samples))
y_simple = np.sin(2 * np.pi * X_simple) + np.random.randn(n_samples) * 0.1

# True function (for visualization)
X_true = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * X_true)

# Fit polynomials of different degrees
degrees = [1, 3, 9, 15]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, degree in enumerate(degrees):
    # Create polynomial features
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly = poly.fit_transform(X_simple.reshape(-1, 1))
    X_true_poly = poly.transform(X_true.reshape(-1, 1))
    
    # Fit model
    model = LinearRegression()
    model.fit(X_poly, y_simple)
    y_pred = model.predict(X_true_poly)
    
    # Calculate training error
    train_pred = model.predict(X_poly)
    train_mse = mean_squared_error(y_simple, train_pred)
    
    # Plot
    axes[idx].scatter(X_simple, y_simple, color='red', s=50, 
                     alpha=0.7, label='Training data', zorder=3)
    axes[idx].plot(X_true, y_true, 'g--', linewidth=2, 
                  label='True function', alpha=0.7)
    axes[idx].plot(X_true, y_pred, 'b-', linewidth=2, 
                  label=f'Degree {degree} fit')
    axes[idx].set_xlabel('x', fontsize=11)
    axes[idx].set_ylabel('y', fontsize=11)
    axes[idx].set_title(f'Polynomial Degree {degree}\nTrain MSE: {train_mse:.4f}', 
                       fontsize=12, fontweight='bold')
    axes[idx].legend(fontsize=9)
    axes[idx].grid(True, alpha=0.3)
    axes[idx].set_ylim(-1.5, 1.5)
    
    # Add annotation
    if degree == 1:
        axes[idx].text(0.5, -1.3, 'UNDERFITTING', ha='center', 
                      fontsize=10, color='red', fontweight='bold')
    elif degree == 3:
        axes[idx].text(0.5, -1.3, 'GOOD FIT', ha='center', 
                      fontsize=10, color='green', fontweight='bold')
    elif degree >= 9:
        axes[idx].text(0.5, -1.3, 'OVERFITTING', ha='center', 
                      fontsize=10, color='red', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nObservations:")
print("- Degree 1: High bias (underfitting) - too simple")
print("- Degree 3: Good balance - captures true pattern")
print("- Degree 9: High variance (overfitting) - fits noise")
print("- Degree 15: Severe overfitting - wild oscillations")

3.2 Ridge Regression (L2 Regularization)

Ridge regression adds L2 penalty to prevent large weights:

\[\min_{\theta} \left\{ \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \right\}\]

Closed-form solution: \(\theta = (X^T X + \lambda I)^{-1} X^T y\)
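The closed-form solution can be checked numerically against scikit-learn. A minimal sketch on synthetic data (the coefficient vector and noise level are made up for illustration); note that scikit-learn's Ridge penalizes only the weights, not the intercept, so we fit with fit_intercept=False to get an exact match:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

lam = 1.0
# Closed form: theta = (X^T X + lam*I)^{-1} X^T y
# (np.linalg.solve is preferred over explicitly inverting the matrix)
theta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# sklearn's Ridge solves the same linear system when the intercept is disabled
ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.allclose(theta_closed, ridge.coef_))  # True
```

The λI term also makes \(X^T X + \lambda I\) invertible even when \(X^T X\) is singular, which is why Ridge remains well-defined with correlated or redundant features.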

# Use high-degree polynomial with regularization
degree = 15
poly = PolynomialFeatures(degree=degree, include_bias=False)
X_poly_train = poly.fit_transform(X_simple.reshape(-1, 1))
X_poly_true = poly.transform(X_true.reshape(-1, 1))

# Try different regularization strengths
lambdas = [0, 1e-5, 1e-3, 1e-1, 1.0]

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

for idx, lam in enumerate(lambdas):
    # Ridge regression
    ridge = Ridge(alpha=lam)
    ridge.fit(X_poly_train, y_simple)
    y_pred_ridge = ridge.predict(X_poly_true)
    
    # Calculate metrics
    train_pred = ridge.predict(X_poly_train)
    train_mse = mean_squared_error(y_simple, train_pred)
    coef_norm = np.linalg.norm(ridge.coef_)
    
    # Plot
    axes[idx].scatter(X_simple, y_simple, color='red', s=50, 
                     alpha=0.7, label='Data', zorder=3)
    axes[idx].plot(X_true, y_true, 'g--', linewidth=2, 
                  label='True', alpha=0.7)
    axes[idx].plot(X_true, y_pred_ridge, 'b-', linewidth=2, 
                  label=f'Ridge (λ={lam})')
    axes[idx].set_xlabel('x', fontsize=11)
    axes[idx].set_ylabel('y', fontsize=11)
    axes[idx].set_title(f'λ = {lam}\nMSE: {train_mse:.4f}, ||θ||: {coef_norm:.2f}', 
                       fontsize=12, fontweight='bold')
    axes[idx].legend(fontsize=9)
    axes[idx].grid(True, alpha=0.3)
    axes[idx].set_ylim(-1.5, 1.5)

# Remove extra subplot
fig.delaxes(axes[5])

plt.tight_layout()
plt.show()

print("\nEffect of λ (regularization strength):")
print("- λ = 0: No regularization → overfitting")
print("- λ small: Slight regularization → better generalization")
print("- λ large: Strong regularization → underfitting")
print("- As λ ↑, coefficient magnitudes ↓")

3.3 Lasso Regression (L1 Regularization)

Lasso uses L1 penalty, which can drive some coefficients exactly to zero:

\[\min_{\theta} \left\{ \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{m} \sum_{j=1}^{n} |\theta_j| \right\}\]

Key property: Performs feature selection by setting some coefficients to 0.
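The mechanism behind this zeroing is the soft-thresholding operator \(S(z, \gamma) = \mathrm{sign}(z)\max(|z| - \gamma, 0)\): for normalized features, the Lasso coordinate-descent update reduces to applying it to each coefficient, clipping small values to exactly zero, whereas the corresponding Ridge update only rescales. A minimal sketch:

```python
import numpy as np

def soft_threshold(z, gamma):
    """Proximal operator of gamma*|.|: shrinks z toward zero and
    clips it to exactly zero whenever |z| <= gamma."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
print(soft_threshold(z, 0.5))  # [-1.5  0.   0.   0.3  2.5]

# Ridge-style L2 shrinkage for comparison: multiplicative, never exactly zero
print(z / (1 + 0.5))
```

The `max(..., 0)` clip is what produces exact zeros; L2's division by \(1 + \lambda\) can only make coefficients smaller, never zero.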

# Compare Ridge vs Lasso coefficient behavior
lambdas_range = np.logspace(-4, 2, 50)

ridge_coefs = []
lasso_coefs = []

for lam in lambdas_range:
    # Ridge
    ridge = Ridge(alpha=lam)
    ridge.fit(X_poly_train, y_simple)
    ridge_coefs.append(ridge.coef_)
    
    # Lasso
    lasso = Lasso(alpha=lam, max_iter=10000)
    lasso.fit(X_poly_train, y_simple)
    lasso_coefs.append(lasso.coef_)

ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)

# Plot regularization paths
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Ridge path
for i in range(ridge_coefs.shape[1]):
    axes[0].plot(lambdas_range, ridge_coefs[:, i], linewidth=1.5, alpha=0.7)
axes[0].set_xscale('log')
axes[0].set_xlabel('Regularization Parameter (λ)', fontsize=12)
axes[0].set_ylabel('Coefficient Value', fontsize=12)
axes[0].set_title('Ridge Regularization Path (L2)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].axhline(y=0, color='black', linestyle='--', linewidth=1)

# Lasso path
for i in range(lasso_coefs.shape[1]):
    axes[1].plot(lambdas_range, lasso_coefs[:, i], linewidth=1.5, alpha=0.7)
axes[1].set_xscale('log')
axes[1].set_xlabel('Regularization Parameter (λ)', fontsize=12)
axes[1].set_ylabel('Coefficient Value', fontsize=12)
axes[1].set_title('Lasso Regularization Path (L1)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
axes[1].axhline(y=0, color='black', linestyle='--', linewidth=1)

plt.tight_layout()
plt.show()

print("\nKey Differences:")
print("- Ridge: Coefficients shrink smoothly but never exactly zero")
print("- Lasso: Coefficients become exactly zero → feature selection")
print("- Lasso path has 'corners' where features drop out")

3.4 Real Dataset: Diabetes Prediction

What: Applying regularized regression to clinical data

We use the scikit-learn diabetes dataset (10 baseline features from 442 patients) to compare Ridge and Lasso regression on a real prediction task. The features include age, sex, BMI, blood pressure, and six blood serum measurements – a moderate-dimensional setting where regularization can meaningfully improve generalization.

Why: Regularization shines on real-world data

Real datasets often have correlated features, noise, and limited samples relative to the number of features. Without regularization, ordinary least squares can overfit by assigning large weights to noise-correlated features. Ridge regression stabilizes the solution by shrinking all coefficients, while Lasso can identify the most predictive features by driving irrelevant coefficients to exactly zero. Comparing both methods on the same data reveals these complementary strengths in practice.

# Load diabetes dataset
diabetes = load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target

print("Diabetes Dataset")
print(f"Samples: {X_diabetes.shape[0]}")
print(f"Features: {X_diabetes.shape[1]}")
print(f"\nFeatures: {diabetes.feature_names}")
print(f"\nTarget: Disease progression one year after baseline")

# Split data
X_train_diab, X_test_diab, y_train_diab, y_test_diab = train_test_split(
    X_diabetes, y_diabetes, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_diab)
X_test_scaled = scaler.transform(X_test_diab)

print(f"\nTraining set: {X_train_scaled.shape}")
print(f"Test set: {X_test_scaled.shape}")

# Compare different models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge (λ=0.1)': Ridge(alpha=0.1),
    'Ridge (λ=1.0)': Ridge(alpha=1.0),
    'Ridge (λ=10)': Ridge(alpha=10.0),
    'Lasso (λ=0.1)': Lasso(alpha=0.1),
    'Lasso (λ=1.0)': Lasso(alpha=1.0),
    'Elastic Net': ElasticNet(alpha=0.1, l1_ratio=0.5)
}

results = []

for name, model in models.items():
    # Train
    model.fit(X_train_scaled, y_train_diab)
    
    # Predict
    y_train_pred = model.predict(X_train_scaled)
    y_test_pred = model.predict(X_test_scaled)
    
    # Metrics
    train_mse = mean_squared_error(y_train_diab, y_train_pred)
    test_mse = mean_squared_error(y_test_diab, y_test_pred)
    train_r2 = r2_score(y_train_diab, y_train_pred)
    test_r2 = r2_score(y_test_diab, y_test_pred)
    
    # Coefficient info
    if hasattr(model, 'coef_'):
        coef_norm = np.linalg.norm(model.coef_)
        n_nonzero = np.sum(np.abs(model.coef_) > 1e-5)
    else:
        coef_norm = 0
        n_nonzero = 0
    
    results.append({
        'Model': name,
        'Train MSE': train_mse,
        'Test MSE': test_mse,
        'Train R²': train_r2,
        'Test R²': test_r2,
        '||θ||': coef_norm,
        'Non-zero': n_nonzero
    })

results_df = pd.DataFrame(results)
print("\n" + "="*80)
print("MODEL COMPARISON")
print("="*80)
print(results_df.to_string(index=False))

print("\n" + "="*80)
print("INSIGHTS")
print("="*80)
print("1. Linear regression: Lowest train MSE but may overfit")
print("2. Ridge: Shrinks coefficients, often better test performance")
print("3. Lasso: Performs feature selection (fewer non-zero coefficients)")
print("4. Elastic Net: Combines benefits of Ridge and Lasso")
print("5. Regularization reduces ||θ|| → better generalization")

3.5 Cross-Validation for Hyperparameter Tuning

What: Selecting the optimal regularization strength

The regularization parameter \(\lambda\) (or alpha in scikit-learn) controls the tradeoff between fitting the training data and keeping the model simple. We use \(k\)-fold cross-validation to evaluate multiple \(\lambda\) values: for each candidate, the data is split into \(k\) folds, the model is trained on \(k-1\) folds and evaluated on the held-out fold, and the average validation error identifies the best \(\lambda\).

Why: Avoiding the overfitting-underfitting dilemma

Too little regularization (\(\lambda \approx 0\)) leads to overfitting; too much (\(\lambda \to \infty\)) shrinks all coefficients toward zero, underfitting the data. Cross-validation provides an unbiased estimate of out-of-sample performance for each \(\lambda\), and scikit-learn’s RidgeCV and LassoCV implement efficient leave-one-out or \(k\)-fold search. The resulting validation curve – plotting test error vs. \(\lambda\) – is a standard diagnostic for choosing regularization strength in any penalized model.
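Before the explicit grid search below, note that the dedicated RidgeCV and LassoCV estimators mentioned above package this search into a single fit. A minimal sketch on the diabetes data (the alpha grid is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV, LassoCV

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# RidgeCV defaults to an efficient leave-one-out CV over the supplied alphas
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X, y)

# LassoCV builds its own alpha path and uses k-fold CV
lasso_cv = LassoCV(cv=5, random_state=42).fit(X, y)

print(f"RidgeCV chose alpha = {ridge_cv.alpha_:.4f}")
print(f"LassoCV chose alpha = {lasso_cv.alpha_:.4f}")
```

GridSearchCV (used below) is more general but these specialized estimators are faster for linear models.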

from sklearn.model_selection import GridSearchCV

# Ridge grid search
ridge_params = {'alpha': np.logspace(-3, 3, 50)}
ridge_grid = GridSearchCV(Ridge(), ridge_params, cv=5, 
                         scoring='neg_mean_squared_error')
ridge_grid.fit(X_train_scaled, y_train_diab)

# Lasso grid search
lasso_params = {'alpha': np.logspace(-3, 1, 50)}
lasso_grid = GridSearchCV(Lasso(), lasso_params, cv=5, 
                         scoring='neg_mean_squared_error')
lasso_grid.fit(X_train_scaled, y_train_diab)

print(f"Best Ridge λ: {ridge_grid.best_params_['alpha']:.4f}")
print(f"Best Ridge CV Score: {-ridge_grid.best_score_:.2f}")
print(f"\nBest Lasso λ: {lasso_grid.best_params_['alpha']:.4f}")
print(f"Best Lasso CV Score: {-lasso_grid.best_score_:.2f}")

# Plot CV scores vs lambda
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Ridge
ridge_alphas = ridge_params['alpha']
ridge_scores = -ridge_grid.cv_results_['mean_test_score']
axes[0].semilogx(ridge_alphas, ridge_scores, linewidth=2, marker='o', markersize=4)
axes[0].axvline(ridge_grid.best_params_['alpha'], color='r', 
               linestyle='--', linewidth=2, label=f"Best λ = {ridge_grid.best_params_['alpha']:.3f}")
axes[0].set_xlabel('Regularization Parameter (λ)', fontsize=12)
axes[0].set_ylabel('CV Mean Squared Error', fontsize=12)
axes[0].set_title('Ridge Regression: Cross-Validation', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Lasso
lasso_alphas = lasso_params['alpha']
lasso_scores = -lasso_grid.cv_results_['mean_test_score']
axes[1].semilogx(lasso_alphas, lasso_scores, linewidth=2, marker='s', markersize=4, color='orange')
axes[1].axvline(lasso_grid.best_params_['alpha'], color='r', 
               linestyle='--', linewidth=2, label=f"Best λ = {lasso_grid.best_params_['alpha']:.3f}")
axes[1].set_xlabel('Regularization Parameter (λ)', fontsize=12)
axes[1].set_ylabel('CV Mean Squared Error', fontsize=12)
axes[1].set_title('Lasso Regression: Cross-Validation', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

3.6 Feature Importance with Lasso

What: Using L1 regularization as an automatic feature selector

Lasso regression penalizes the absolute value of coefficients: \(\lambda \sum_j |\theta_j|\). Unlike Ridge (L2), this L1 penalty can drive coefficients to exactly zero, effectively removing features from the model. By examining which coefficients survive as \(\lambda\) increases, we obtain a ranking of feature importance – the features with non-zero coefficients at higher \(\lambda\) values are the most predictive.

Why: Interpretability and dimensionality reduction

In many real-world applications – medical diagnosis, financial modeling, scientific discovery – knowing which features matter is as important as making accurate predictions. Lasso provides built-in feature selection: the non-zero coefficients at the optimal \(\lambda\) identify the minimal set of predictive variables. This is especially valuable when the number of features \(n\) approaches or exceeds the number of samples \(m\), where ordinary least squares fails entirely but Lasso can still produce a sparse, interpretable model.
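The \(n > m\) case can be sketched directly: with more features than samples, ordinary least squares is underdetermined, but Lasso can still recover a sparse signal. A minimal example (dimensions, coefficient values, and alpha are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, n = 50, 100                 # fewer samples than features: OLS underdetermined
X = rng.normal(size=(m, n))
true_theta = np.zeros(n)
true_theta[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]   # only 5 relevant features
y = X @ true_theta + 0.1 * rng.normal(size=m)

lasso = Lasso(alpha=0.1, max_iter=50000).fit(X, y)
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-3)
print(f"Lasso kept {len(selected)} of {n} features: {selected}")
```

With a high signal-to-noise ratio the five true features survive the penalty while most irrelevant ones are driven to exactly zero.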

# Train Lasso with optimal lambda
lasso_best = lasso_grid.best_estimator_

# Get feature importances (absolute coefficient values)
feature_importance = pd.DataFrame({
    'Feature': diabetes.feature_names,
    'Coefficient': lasso_best.coef_,
    'Abs_Coefficient': np.abs(lasso_best.coef_)
}).sort_values('Abs_Coefficient', ascending=False)

print("\nFeature Importance (Lasso):")
print(feature_importance.to_string(index=False))

# Visualize
plt.figure(figsize=(10, 6))
colors = ['red' if abs(c) < 1e-5 else 'blue' for c in feature_importance['Coefficient']]
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'], color=colors, alpha=0.7)
plt.xlabel('Coefficient Value', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Lasso Feature Coefficients\n(Red = Zero, Blue = Non-zero)', 
         fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='-', linewidth=1)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

n_selected = np.sum(np.abs(lasso_best.coef_) > 1e-5)
print(f"\nLasso selected {n_selected} out of {len(diabetes.feature_names)} features")

3.7 Learning Curves

Learning curves help diagnose bias vs variance:

  • High bias: Both curves plateau at high error

  • High variance: Large gap between train and validation curves

def plot_learning_curves(model, X, y, title):
    """
    Plot learning curves for a model.
    """
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, 
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )
    
    train_scores_mean = -np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    val_scores_mean = -np.mean(val_scores, axis=1)
    val_scores_std = np.std(val_scores, axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_scores_mean, 'o-', linewidth=2, 
             label='Training error', markersize=6)
    plt.fill_between(train_sizes, 
                     train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, 
                     alpha=0.2)
    
    plt.plot(train_sizes, val_scores_mean, 's-', linewidth=2, 
             label='Validation error', markersize=6)
    plt.fill_between(train_sizes, 
                     val_scores_mean - val_scores_std,
                     val_scores_mean + val_scores_std, 
                     alpha=0.2)
    
    plt.xlabel('Training Set Size', fontsize=12)
    plt.ylabel('Mean Squared Error', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

# Plot for different models
print("Generating learning curves...\n")

plot_learning_curves(LinearRegression(), X_train_scaled, y_train_diab,
                    'Learning Curve: Linear Regression (No Regularization)')

plot_learning_curves(Ridge(alpha=1.0), X_train_scaled, y_train_diab,
                    'Learning Curve: Ridge Regression (λ=1.0)')

plot_learning_curves(Lasso(alpha=0.5), X_train_scaled, y_train_diab,
                    'Learning Curve: Lasso Regression (λ=0.5)')

3.8 Elastic Net: Best of Both Worlds

Elastic Net combines L1 and L2:

  • Gets feature selection from L1

  • Gets stability from L2

  • Works well with correlated features

# Grid search for Elastic Net
enet_params = {
    'alpha': np.logspace(-3, 1, 20),
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}

enet_grid = GridSearchCV(ElasticNet(), enet_params, cv=5, 
                        scoring='neg_mean_squared_error')
enet_grid.fit(X_train_scaled, y_train_diab)

print(f"Best Elastic Net Parameters:")
print(f"  α (overall strength): {enet_grid.best_params_['alpha']:.4f}")
print(f"  l1_ratio (L1 vs L2 mix): {enet_grid.best_params_['l1_ratio']:.2f}")
print(f"  CV Score: {-enet_grid.best_score_:.2f}")

# Visualize alpha vs l1_ratio heatmap
results_grid = enet_grid.cv_results_
scores = -results_grid['mean_test_score']

# Reshape for heatmap
n_alphas = len(enet_params['alpha'])
n_l1_ratios = len(enet_params['l1_ratio'])
scores_matrix = scores.reshape(n_alphas, n_l1_ratios)

plt.figure(figsize=(10, 8))
sns.heatmap(scores_matrix, 
            xticklabels=[f"{r:.1f}" for r in enet_params['l1_ratio']],
            yticklabels=[f"{a:.3f}" for a in enet_params['alpha']],
            cmap='RdYlGn_r', annot=False, fmt='.1f', cbar_kws={'label': 'MSE'})
plt.xlabel('l1_ratio (0=Ridge, 1=Lasso)', fontsize=12)
plt.ylabel('alpha (λ)', fontsize=12)
plt.title('Elastic Net: Hyperparameter Grid Search\n(Greener = Lower MSE = Better)', 
         fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Test best model
enet_best = enet_grid.best_estimator_
y_test_pred_enet = enet_best.predict(X_test_scaled)
test_mse_enet = mean_squared_error(y_test_diab, y_test_pred_enet)
test_r2_enet = r2_score(y_test_diab, y_test_pred_enet)

print(f"\nTest Performance:")
print(f"  MSE: {test_mse_enet:.2f}")
print(f"  R²: {test_r2_enet:.4f}")

Key Takeaways

1. Overfitting vs Underfitting

  • Underfitting (high bias): Model too simple, poor performance on all data

  • Overfitting (high variance): Model too complex, great on training, poor on test

  • Just right: Balances complexity and generalization

2. Ridge (L2) Regularization

  • Shrinks coefficients toward zero

  • Never exactly zero → keeps all features

  • Closed-form solution available

  • Works well with correlated features

3. Lasso (L1) Regularization

  • Drives some coefficients exactly to zero

  • Performs automatic feature selection

  • Useful for high-dimensional data

  • No closed-form solution (use coordinate descent)

4. Elastic Net

  • Combines L1 and L2

  • Parameter l1_ratio controls mix (0=Ridge, 1=Lasso)

  • More stable than Lasso with correlated features

  • Two hyperparameters to tune

5. Cross-Validation

  • Essential for choosing λ

  • K-fold CV (typically k=5 or 10)

  • Grid search for hyperparameter tuning

  • Use validation set to prevent overfitting

6. Learning Curves

  • Diagnose bias/variance

  • High bias: Both curves plateau high

  • High variance: Large gap between curves

  • More data helps with high variance

7. When to Use Which

  • Ridge: Many features, all potentially useful

  • Lasso: Need feature selection, sparse models

  • Elastic Net: Correlated features + want selection

  • No regularization: m >> n (many samples, few features) and low noise

Practice Exercises

  1. Polynomial Overfitting: Generate data from degree-3 polynomial. Fit polynomials of degree 1-20. Plot train vs test error. Where does overfitting start?

  2. Regularization from Scratch: Implement Ridge regression using gradient descent. Compare with closed-form solution.

  3. Feature Selection: Create dataset with 100 features (only 5 relevant). Use Lasso to identify important features. What λ gives best selection?

  4. Correlated Features: Generate data with highly correlated features. Compare Ridge, Lasso, Elastic Net. Which works best?

  5. Cross-Validation: Implement k-fold cross-validation from scratch. Compare with sklearn implementation.

  6. Learning Curves: Plot learning curves for polynomial regression with degrees [1, 3, 9]. Diagnose bias/variance for each.

  7. Early Stopping: Implement early stopping for gradient descent. Monitor validation error and stop when it increases.

  8. Bayesian Interpretation: Show that Ridge regression is equivalent to MAP estimation with Gaussian prior. Derive relationship between λ and prior variance.

  9. Regularization Path: Implement coordinate descent to compute full Lasso path efficiently.

  10. Real Data: Apply all three methods to California Housing dataset. Create comprehensive comparison report.

References

  1. CS229 Lecture Notes: Andrew Ng’s notes on regularization

  2. The Elements of Statistical Learning: Hastie et al., Chapter 3

  3. An Introduction to Statistical Learning: James et al., Chapter 6

  4. Pattern Recognition and Machine Learning: Bishop, Chapter 3

  5. Original Lasso Paper: Tibshirani (1996)

Next: Lecture 4: Generative Models