Lecture 9: Learning Theory


From Andrew Ng’s CS229 Lecture 9 (Friday Section)

“This is important in the sense that it deepens your understanding of how machine learning works under the covers. What are the assumptions we’re making and why do things generalize.” - TA Anand

Core Assumptions

Learning theory operates under two fundamental assumptions:

Assumption 1: Data Distribution

  • There exists a distribution D from which (x, y) pairs are sampled

  • Training examples are sampled from D

  • Test examples are also sampled from the same distribution D

  • Without this assumption, theory becomes much harder

Assumption 2: Independent Sampling

  • All samples are drawn independently from D

  • This is the i.i.d. (independent and identically distributed) assumption

The Learning Process

Training Set S = {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᵐ⁾, y⁽ᵐ⁾)}
        ↓
Learning Algorithm A (deterministic)
        ↓
Hypothesis ĥ (or θ̂)

Key Insights from lecture:

  • S is a random variable (random sample from D)

  • A is deterministic (the algorithm itself is fixed)

  • ĥ (or θ̂) is a random variable (output depends on random input S)

“When you feed a random variable through a deterministic function, you get a random variable”

  • θ̂ has a sampling distribution

  • There exists an unknown true parameter θ* (not random, just unknown)

  • θ̂ is our estimator (statistical terminology)
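The sampling-distribution picture can be simulated directly: draw many training sets from the same D, run a fixed deterministic algorithm (here, ordinary least squares) on each, and look at the cloud of θ̂ values. A minimal sketch — the distribution, θ*, and the noise level below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = np.array([2.0, -1.0])   # hypothetical "true" parameter θ* (unknown in practice)
m, n_trials = 50, 2000

estimates = np.empty((n_trials, 2))
for t in range(n_trials):
    X = rng.normal(size=(m, 2))                            # sample S: m i.i.d. draws from D
    y = X @ theta_star + rng.normal(scale=0.5, size=m)
    estimates[t] = np.linalg.lstsq(X, y, rcond=None)[0]    # deterministic algorithm A

# θ̂ is a random variable: its sampling distribution has a mean and a spread
print("mean of θ̂:", estimates.mean(axis=0))   # close to θ*  → low bias
print("std  of θ̂:", estimates.std(axis=0))    # dispersion   → variance
```

The mean of the cloud sits near θ* (the first moment, bias) while the standard deviation measures its spread (the second moment, variance), matching the quote above.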

Bias and Variance: The Parameter View

From the lecture: Instead of looking at data space (x, y plots), look at parameter space (θ₁, θ₂):

Imagine running the experiment many times:

  1. Sample m examples from D → S₁

  2. Run algorithm A → get θ̂₁

  3. Repeat with new sample S₂ → get θ̂₂

  4. Repeat many times → cloud of points in parameter space

Four Algorithm Types:

| Algorithm | Bias | Variance | Parameter Cloud                  |
|-----------|------|----------|----------------------------------|
| A         | Low  | Low      | Centered on θ*, tight cluster ✓  |
| B         | Low  | High     | Centered on θ*, very spread out  |
| C         | High | Low      | Off-center, tight cluster        |
| D         | High | High     | Off-center, very spread out      |

Definitions:

  • Bias: Is the sampling distribution centered around true θ*? (first moment - mean)

  • Variance: How dispersed is the sampling distribution? (second moment - variance)

“Bias and variance are basically just properties of the first and second moments of your sampling distribution”

Effects of Training Data Size

As m (training size) increases:

  • Variance decreases (more data → more stable estimates)

  • Bias stays roughly the same (algorithm’s assumptions don’t change)

With regularization:

  • Variance decreases (constraints reduce dispersion)

  • May increase bias slightly (stronger assumptions)
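Both effects can be checked with a quick simulation. The sketch below uses a hypothetical one-parameter ridge estimator θ̂ = (x·y)/(x·x + λ) on a fixed design, resampling only the noise; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 3.0   # invented true parameter

def estimate(x, lam, n_trials=2000):
    """Ridge estimate θ̂ = (x·y)/(x·x + λ) on a fixed design x, fresh noise each trial."""
    out = np.empty(n_trials)
    for t in range(n_trials):
        y = theta_star * x + rng.normal(size=x.size)
        out[t] = (x @ y) / (x @ x + lam)
    return out

x_small, x_large = rng.normal(size=20), rng.normal(size=200)
small = estimate(x_small, lam=0.0)    # m = 20, no regularization
large = estimate(x_large, lam=0.0)    # m = 200: variance shrinks, bias unchanged
ridge = estimate(x_small, lam=10.0)   # m = 20 + ridge: variance shrinks, bias appears

print(f"m=20,  λ=0 : mean={small.mean():.3f}  var={small.var():.4f}")
print(f"m=200, λ=0 : mean={large.mean():.3f}  var={large.var():.4f}")
print(f"m=20,  λ=10: mean={ridge.mean():.3f}  var={ridge.var():.4f}  (shrunk toward 0)")
```

More data and regularization both reduce the variance, but only regularization moves the mean of θ̂ away from θ* — exactly the bias it introduces.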

Empirical Risk Minimization (ERM)

Training Error (Empirical Risk):

ε̂(h) = (1/m) Σᵢ 1{h(x⁽ⁱ⁾) ≠ y⁽ⁱ⁾}

Generalization Error (True Risk):

ε(h) = P_{(x,y)~D}[h(x) ≠ y]

Goal: Minimize ε(h), but we only have access to ε̂(h)
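A toy sketch of the gap between the two quantities — the distribution D, the 10% label noise, and the threshold hypothesis below are all invented; a very large held-out sample stands in for the unobservable true risk:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    """Toy D: x ~ Uniform(0,1), y = 1{x > 0.5} flipped with 10% label noise."""
    x = rng.random(n)
    y = (x > 0.5).astype(int) ^ (rng.random(n) < 0.1).astype(int)
    return x, y

h = lambda x: (x > 0.5).astype(int)   # a fixed hypothesis (here, the Bayes classifier)

x_tr, y_tr = sample(100)        # small training sample S
x_te, y_te = sample(100_000)    # huge sample ≈ drawing fresh (x, y) from D

emp_risk  = np.mean(h(x_tr) != y_tr)   # ε̂(h): empirical risk on S
true_risk = np.mean(h(x_te) != y_te)   # ≈ ε(h): near the 10% noise rate
print(f"empirical risk ε̂(h): {emp_risk:.3f}   true risk ε(h) ≈ {true_risk:.3f}")
```

The empirical risk fluctuates from sample to sample around ε(h); learning theory is largely about bounding that fluctuation.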

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import learning_curve
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded!")

11.1 Bias-Variance Tradeoff

What: Decomposing prediction error into bias and variance

The expected prediction error of any model can be decomposed as \(\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}\). Bias measures how far the average prediction (over many training sets) is from the truth – it captures the error from simplifying assumptions in the model. Variance measures how much predictions fluctuate across different training sets – it captures sensitivity to the particular data sample used for training.

Why: The central tension in all of machine learning

Every model selection decision you make is implicitly navigating this tradeoff. A degree-1 polynomial has high bias (it cannot represent curves) but low variance (the fit is stable across datasets). A degree-15 polynomial has low bias (it can represent any shape) but high variance (small changes in data produce wildly different fits). Understanding this decomposition is the key to diagnosing whether your model needs more complexity (reduce bias) or more regularization/data (reduce variance). As Andrew Ng emphasizes, this understanding should guide every debugging decision in an ML project.

# True function
def true_function(x):
    return np.sin(2 * np.pi * x)

# Generate data
n_samples = 50
X = np.sort(np.random.rand(n_samples))
y_true = true_function(X)
y = y_true + np.random.randn(n_samples) * 0.3

X_test = np.linspace(0, 1, 200)
y_test_true = true_function(X_test)

# Different model complexities
degrees = [1, 3, 9, 15]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, degree in enumerate(degrees):
    # Fit polynomial
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=0.001))
    model.fit(X.reshape(-1, 1), y)
    y_pred = model.predict(X_test.reshape(-1, 1))
    
    # Plot
    axes[idx].scatter(X, y, alpha=0.5, s=30, label='Data')
    axes[idx].plot(X_test, y_test_true, 'g-', linewidth=2, label='True Function')
    axes[idx].plot(X_test, y_pred, 'r-', linewidth=2, label=f'Degree {degree}')
    
    # Compute MSE
    mse = np.mean((y_pred - y_test_true)**2)
    
    if degree == 1:
        axes[idx].set_title(f'High Bias (Underfitting)\nMSE: {mse:.3f}', 
                           fontsize=12, fontweight='bold')
    elif degree == 15:
        axes[idx].set_title(f'High Variance (Overfitting)\nMSE: {mse:.3f}', 
                           fontsize=12, fontweight='bold')
    else:
        axes[idx].set_title(f'Degree {degree} Polynomial\nMSE: {mse:.3f}', 
                           fontsize=12, fontweight='bold')
    
    axes[idx].legend(fontsize=9)
    axes[idx].grid(True, alpha=0.3)
    axes[idx].set_ylim(-2, 2)

plt.suptitle('Bias-Variance Tradeoff', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
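The decomposition can also be estimated numerically by refitting on many fresh datasets and splitting the error at a single test point into bias² and variance. A sketch using plain `np.polyfit` rather than the pipeline above; the test point \(x_0\) and noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)   # same true function as above
x0, sigma = 0.3, 0.3                  # single test point and noise level (arbitrary)
n_datasets, m = 500, 30

preds = {1: [], 9: []}                # degree-1 vs degree-9 polynomial fits
for _ in range(n_datasets):
    X = rng.random(m)                                 # fresh training set each time
    y = f(X) + rng.normal(scale=sigma, size=m)
    for deg in preds:
        coefs = np.polyfit(X, y, deg)                 # plain least-squares polynomial
        preds[deg].append(np.polyval(coefs, x0))

for deg, p in preds.items():
    p = np.array(p)
    bias2 = (p.mean() - f(x0)) ** 2   # (average prediction − truth)²
    var = p.var()                     # spread of predictions across datasets
    print(f"degree {deg}: bias² = {bias2:.4f}   variance = {var:.4f}")
```

The degree-1 fit shows large bias² and small variance; the degree-9 fit the reverse — the same tradeoff the plots above display qualitatively.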

11.2 Learning Curves

What: Plotting training and validation error as a function of dataset size

A learning curve shows how model performance changes as the number of training examples \(m\) increases. We plot two curves: training error (which starts low and increases) and validation/test error (which starts high and decreases). The gap between them, and where they converge, reveals whether the model suffers from high bias or high variance.

Why: The most important diagnostic tool in applied ML

Learning curves directly answer practical questions: “Will collecting more data help?” (yes, if high variance – the validation curve is still declining), “Should I use a more complex model?” (yes, if high bias – both curves have plateaued at unacceptable error). Andrew Ng considers this diagnostic the single most useful tool in his ML workflow. Theory predicts the gap between training and validation error shrinks at roughly \(O(1/\sqrt{m})\), giving you a rough estimate of how much more data you would need to reach a target performance.

from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve

# Generate dataset
X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)

# Compare simple vs complex model
models = [
    ('Simple (Ridge α=10)', Ridge(alpha=10)),
    ('Complex (Ridge α=0.01)', Ridge(alpha=0.01))
]

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

for idx, (name, model) in enumerate(models):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X_reg, y_reg, cv=5, 
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='neg_mean_squared_error'
    )
    
    train_scores_mean = -np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    val_scores_mean = -np.mean(val_scores, axis=1)
    val_scores_std = np.std(val_scores, axis=1)
    
    axes[idx].plot(train_sizes, train_scores_mean, 'o-', linewidth=2, label='Training Error')
    axes[idx].fill_between(train_sizes, 
                           train_scores_mean - train_scores_std,
                           train_scores_mean + train_scores_std, alpha=0.2)
    
    axes[idx].plot(train_sizes, val_scores_mean, 's-', linewidth=2, label='Validation Error')
    axes[idx].fill_between(train_sizes,
                           val_scores_mean - val_scores_std,
                           val_scores_mean + val_scores_std, alpha=0.2)
    
    axes[idx].set_xlabel('Training Set Size', fontsize=12)
    axes[idx].set_ylabel('MSE', fontsize=12)
    axes[idx].set_title(name, fontsize=13, fontweight='bold')
    axes[idx].legend(fontsize=11)
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nLearning Curve Diagnosis:")
print("- High bias: Both errors plateau at high values")
print("- High variance: Large gap between train and validation errors")
print("- More data helps high variance, not high bias")

11.3 VC Dimension Illustration

What: Measuring model complexity through shattering capacity

The Vapnik-Chervonenkis (VC) dimension of a hypothesis class \(\mathcal{H}\) is the size of the largest set of points that \(\mathcal{H}\) can shatter – meaning it can realize every possible labeling of those points. For a linear classifier in \(d\) dimensions, the VC dimension is \(d+1\): it can shatter any \(d+1\) points in general position, but no set of \(d+2\) points can be shattered – some labeling is always unachievable (the classic XOR configuration in 2D).

Why: Connecting model capacity to generalization

The VC dimension provides the theoretical foundation for generalization bounds: the gap between training error and true error is bounded by \(O\!\left(\sqrt{\frac{d_{\text{VC}} \log m}{m}}\right)\), where \(m\) is the training set size. Higher VC dimension means the model can fit more patterns, but also requires more data to generalize reliably. This formalizes the intuition behind regularization – by constraining the hypothesis class, we reduce effective VC dimension and tighten the generalization bound.

# Illustrate VC dimension for linear classifiers in 2D
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# 2 points - always separable
points_2 = np.array([[0, 0], [1, 1]])
axes[0].scatter(points_2[:1, 0], points_2[:1, 1], c='blue', s=200, marker='o', edgecolors='black')
axes[0].scatter(points_2[1:, 0], points_2[1:, 1], c='red', s=200, marker='s', edgecolors='black')
axes[0].plot([-.5, 1.5], [1.5, -.5], 'k-', linewidth=2)
axes[0].set_title('2 Points: Always Separable', fontsize=12, fontweight='bold')
axes[0].set_xlim(-0.5, 1.5)
axes[0].set_ylim(-0.5, 1.5)
axes[0].grid(True, alpha=0.3)

# 3 points - separable
points_3 = np.array([[0, 0], [1, 0], [0.5, 1]])
colors_3 = ['blue', 'red', 'red']
markers_3 = ['o', 's', 's']
for i in range(3):
    axes[1].scatter(points_3[i, 0], points_3[i, 1], c=colors_3[i], 
                   s=200, marker=markers_3[i], edgecolors='black')
axes[1].plot([-.2, .3], [.5, -.5], 'k-', linewidth=2)
axes[1].set_title('3 Points: Separable', fontsize=12, fontweight='bold')
axes[1].set_xlim(-0.3, 1.3)
axes[1].set_ylim(-0.3, 1.3)
axes[1].grid(True, alpha=0.3)

# 4 points - XOR not separable
points_4 = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
colors_4 = ['blue', 'red', 'red', 'blue']
markers_4 = ['o', 's', 's', 'o']
for i in range(4):
    axes[2].scatter(points_4[i, 0], points_4[i, 1], c=colors_4[i], 
                   s=200, marker=markers_4[i], edgecolors='black')
axes[2].text(0.5, 0.5, 'NOT\nLINEARLY\nSEPARABLE', 
            ha='center', va='center', fontsize=11, fontweight='bold', color='red')
axes[2].set_title('4 Points (XOR): Not Separable', fontsize=12, fontweight='bold')
axes[2].set_xlim(-0.3, 1.3)
axes[2].set_ylim(-0.3, 1.3)
axes[2].grid(True, alpha=0.3)

plt.suptitle('VC Dimension of Linear Classifiers in 2D = 3', 
            fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nVC Dimension:")
print("- Maximum number of points that can be shattered (all labelings separable)")
print("- Linear classifier in d dimensions: VC dim = d + 1")
print("- Higher VC dim → more capacity → risk of overfitting")
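Shattering can also be checked programmatically: enumerate all 2ⁿ labelings of a point set and test each for strict linear separability with a small feasibility LP. A sketch — the `separable` and `shattered` helpers are ad-hoc, not from any library:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(points, labels):
    """Feasibility LP: does (w, b) exist with y_i (w·x_i + b) >= 1 for every point?"""
    y = np.where(np.asarray(labels), 1.0, -1.0)
    A = -y[:, None] * np.hstack([points, np.ones((len(points), 1))])
    res = linprog(c=[0, 0, 0], A_ub=A, b_ub=-np.ones(len(points)),
                  bounds=[(None, None)] * 3)
    return res.status == 0   # feasible ⇒ this labeling is realizable by a line

def shattered(points):
    """A set is shattered iff ALL 2^n labelings are linearly separable."""
    return all(separable(points, lab) for lab in product([0, 1], repeat=len(points)))

tri    = np.array([[0, 0], [1, 0], [0.5, 1]])        # 3 points in general position
square = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])  # contains the XOR labeling

print("3 points shattered:", shattered(tri))     # all 8 labelings separable → VC dim ≥ 3
print("4 points shattered:", shattered(square))  # the XOR labeling fails   → VC dim < 4
```

The LP asks for a margin-1 separator, which exists exactly when the labeling is strictly separable (any strict separator can be rescaled to margin 1).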

11.4 Generalization Bounds

What: Quantifying the gap between training and true error

Generalization bounds give a mathematical guarantee of the form: with probability at least \(1 - \delta\), the true error \(\varepsilon(h)\) is at most the training error \(\hat{\varepsilon}(h)\) plus a complexity penalty that depends on the VC dimension \(d\) and sample size \(m\). We visualize how this bound shrinks as \(m\) grows and how it increases with model complexity.

Why: The theoretical justification for Occam’s Razor in ML

These bounds explain why simpler models generalize better when data is limited: a model with VC dimension 20 needs roughly 10x more data than a model with VC dimension 2 to achieve the same generalization guarantee. While VC bounds are often loose in practice (the actual generalization gap is much smaller than the bound predicts), they provide the correct qualitative guidance: more complex models require more data. This principle underpins every regularization technique and model selection criterion in machine learning.

# Visualize generalization bound
m = np.logspace(2, 4, 100)  # Training set sizes
d_values = [2, 5, 10, 20]  # VC dimensions

plt.figure(figsize=(12, 7))

for d in d_values:
    # Simplified VC bound: O(sqrt(d log m / m))
    bound = np.sqrt((d * np.log(m)) / m)
    plt.plot(m, bound, linewidth=2, label=f'VC dim = {d}')

plt.xscale('log')
plt.xlabel('Training Set Size (m)', fontsize=12)
plt.ylabel('Generalization Error Bound', fontsize=12)
plt.title('VC Generalization Bound vs Training Set Size', 
         fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nKey Insights:")
print("- Error bound decreases as O(1/√m)")
print("- Higher complexity (VC dim) requires more data")
print("- Theoretical worst-case bound (often loose in practice)")
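The same simplified bound can be inverted to ask how much data a target gap requires. A back-of-envelope sketch — `m_required` is a hypothetical helper using the same \(O(\sqrt{d \log m / m})\) form as the plot:

```python
import numpy as np

def m_required(d, eps, m_max=10**7):
    """Smallest m with sqrt(d * log(m) / m) <= eps, found by a coarse scan."""
    m = np.arange(10, m_max, 10)
    ok = np.sqrt(d * np.log(m) / m) <= eps
    return int(m[ok][0]) if ok.any() else None

for d in (2, 5, 10, 20):
    print(f"VC dim {d:2d}: m ≈ {m_required(d, eps=0.05):,} for a 0.05 gap")
```

Under this crude form, moving from VC dimension 2 to 20 multiplies the required data by roughly a factor of ten, consistent with the rule of thumb above.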

Key Takeaways

1. Bias-Variance Decomposition

  • Bias: Error from wrong assumptions (underfitting)

  • Variance: Error from sensitivity to training data (overfitting)

  • Tradeoff: Can’t minimize both simultaneously

2. Learning Curves

  • High bias: Train/val errors converge at high value

    • Solution: More features, complex model

  • High variance: Large gap between train/val errors

    • Solution: More data, regularization, simpler model

3. VC Dimension

  • Measure of model complexity

  • Linear classifier in d dimensions: VC = d+1

  • Higher VC → can fit more patterns → needs more data

4. Generalization Bounds

  • Training error ≠ test error

  • Gap depends on: model complexity, training size

  • Regularization reduces effective complexity

5. Practical Guidelines

  • Start simple, increase complexity if needed

  • Use cross-validation

  • Regularization is almost always helpful

  • More data helps variance, not bias

References

  1. CS229 Lecture Notes on Learning Theory

  2. “The Elements of Statistical Learning” - Hastie et al.

  3. “Understanding Machine Learning” - Shalev-Shwartz & Ben-David

Next: Lecture 12: ML Strategy