1. The Learning Problem

Formal Setup

Given:

  • Input space \(\mathcal{X}\) (e.g., images, text, numerical features)

  • Output space \(\mathcal{Y}\) (e.g., labels, real values)

  • Unknown distribution \(\mathcal{D}\) over \(\mathcal{X} \times \mathcal{Y}\)

  • Training set \(S = \{(x_1, y_1), ..., (x_m, y_m)\}\) sampled i.i.d. from \(\mathcal{D}\)

Goal:

  • Find a hypothesis \(h: \mathcal{X} \rightarrow \mathcal{Y}\) that generalizes well to unseen data

Key Questions

  1. When can we learn? Under what conditions is learning possible?

  2. How much data do we need? Sample complexity

  3. What can we learn? Expressiveness vs. generalization

  4. How well can we learn? Approximation and estimation errors

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 4)

# Seed for reproducibility
np.random.seed(42)

2. Generalization: The Core Challenge

True Risk vs. Empirical Risk

True Risk (Generalization Error): \(L_{\mathcal{D}}(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]\)

Empirical Risk (Training Error): \(L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i)\)

Key Insight: We can only observe \(L_S(h)\), but we care about \(L_{\mathcal{D}}(h)\)!
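This gap between the two quantities can be seen numerically. A minimal sketch (the toy threshold problem and hypothesis below are my own illustration, not from the text): the empirical risk is computed on a small training sample, while the true risk is approximated with a very large i.i.d. sample from the same \(\mathcal{D}\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy D: x ~ Uniform(0, 1), y = 1{x > 0.5} with 10% label noise
def sample(m):
    x = rng.uniform(0, 1, m)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(size=m) < 0.1
    return x, np.where(flip, 1 - y, y)

# A fixed (slightly misplaced) threshold hypothesis
h = lambda x: (x > 0.4).astype(int)

x_tr, y_tr = sample(30)
L_S = np.mean(h(x_tr) != y_tr)        # observable: empirical risk

x_big, y_big = sample(1_000_000)
L_D = np.mean(h(x_big) != y_big)      # approximation of the unobservable true risk

# Analytically for this toy D: L_D(h) = 0.4*0.1 + 0.1*0.9 + 0.5*0.1 = 0.18
print(f"L_S(h) = {L_S:.3f}, L_D(h) ≈ {L_D:.3f}")
```

Rerunning with a fresh 30-point sample changes \(L_S(h)\) noticeably while \(L_{\mathcal{D}}(h)\) stays put, which is exactly why the bounds below control \(|L_{\mathcal{D}}(h) - L_S(h)|\).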

Advanced Statistical Learning Theory

1. PAC Learning Framework

Probably Approximately Correct (PAC) Learning (Valiant, 1984):

A hypothesis class \(\mathcal{H}\) is PAC-learnable if there exists an algorithm \(A\) and a sample complexity function \(m_{\mathcal{H}}(\epsilon, \delta)\) such that:

For any \(\epsilon, \delta \in (0, 1)\), any distribution \(\mathcal{D}\) over \(\mathcal{X}\), and any target concept \(c \in \mathcal{H}\):

\[P_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S)) \leq \epsilon\right] \geq 1 - \delta\]

when \(m \geq m_{\mathcal{H}}(\epsilon, \delta)\)

Interpretation:

  • Probably: With probability at least \(1-\delta\) (confidence)

  • Approximately Correct: Error at most \(\epsilon\) (accuracy)

  • Sample Complexity: \(m_{\mathcal{H}}(\epsilon, \delta)\) tells us how many samples we need

Key Assumptions:

  1. Realizability: True concept \(c \in \mathcal{H}\) (strong assumption)

  2. i.i.d. sampling: Training examples drawn independently from \(\mathcal{D}\)

Fundamental Theorem (Realizable Case):

For a finite hypothesis class \(|\mathcal{H}| < \infty\), any consistent learner (one achieving zero training error) PAC-learns \(\mathcal{H}\) with:

\[m_{\mathcal{H}}(\epsilon, \delta) \leq \frac{1}{\epsilon}\left(\log|\mathcal{H}| + \log\frac{1}{\delta}\right)\]

Proof sketch:

  • Bad hypothesis: \(h \in \mathcal{H}\) with \(L_{\mathcal{D}}(h) > \epsilon\)

  • Probability \(h\) is consistent with \(m\) random samples: \((1-\epsilon)^m\)

  • By union bound over all bad hypotheses: \(P(\text{failure}) \leq |\mathcal{H}|(1-\epsilon)^m\)

  • Using \((1-\epsilon)^m \leq e^{-\epsilon m}\): Set \(|\mathcal{H}|e^{-\epsilon m} \leq \delta\)

  • Solve for \(m\): \(m \geq \frac{1}{\epsilon}(\log|\mathcal{H}| + \log\frac{1}{\delta})\)
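Plugging numbers into this bound is straightforward. A small sketch (the conjunction example and its class size \(3^n + 1\) are standard; the specific numbers are only illustrative):

```python
import numpy as np

def pac_m_realizable(H_size, eps, delta):
    # m >= (1/eps) * (ln|H| + ln(1/delta)), from the union-bound argument above
    return int(np.ceil((np.log(H_size) + np.log(1 / delta)) / eps))

# Boolean conjunctions over n = 10 literals: |H| = 3^10 + 1
m = pac_m_realizable(3**10 + 1, eps=0.1, delta=0.05)
print(m)  # → 140
```

Because the dependence on \(|\mathcal{H}|\) is only logarithmic, even a class of roughly 59,000 hypotheses needs a modest sample.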

Agnostic PAC Learning:

Relaxes realizability assumption. For any \(\epsilon, \delta\):

\[P_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S)) \leq \min_{h \in \mathcal{H}}L_{\mathcal{D}}(h) + \epsilon\right] \geq 1 - \delta\]

Sample complexity becomes: \(m_{\mathcal{H}}(\epsilon, \delta) = O\left(\frac{\log|\mathcal{H}|}{\epsilon^2}\right)\)

2. VC Dimension Theory

Limitation of finite-class bounds: many useful hypothesis classes are infinite (e.g., linear classifiers in \(\mathbb{R}^d\)).

Vapnik-Chervonenkis (VC) Dimension:

Measures the capacity or expressiveness of a hypothesis class.

Definition: The VC dimension of \(\mathcal{H}\), denoted \(\text{VCdim}(\mathcal{H})\), is the maximum size of a set \(C \subseteq \mathcal{X}\) that can be shattered by \(\mathcal{H}\).

Shattering: A set \(C = \{x_1, ..., x_d\}\) is shattered by \(\mathcal{H}\) if for every labeling \(y \in \{0,1\}^d\), there exists \(h \in \mathcal{H}\) such that \(h(x_i) = y_i\) for all \(i\).

In other words: \(\mathcal{H}\) can realize all \(2^d\) possible dichotomies on \(C\).

Examples:

  1. Linear classifiers in \(\mathbb{R}^2\):

    • Hypothesis: \(h(x) = \text{sign}(w^T x + b)\)

    • \(\text{VCdim} = 3\)

    • Can shatter 3 points in general position (not collinear)

    • Cannot shatter any set of 4 points (e.g., the XOR labeling fails)

  2. Linear classifiers in \(\mathbb{R}^d\):

    • \(\text{VCdim} = d + 1\)

  3. Axis-aligned rectangles in \(\mathbb{R}^2\):

    • \(\text{VCdim} = 4\) (four points in a diamond pattern can be shattered; no 5 points can)

  4. Polynomial classifiers of degree \(k\) in \(\mathbb{R}^d\):

    • \(\text{VCdim} = O(d^k)\)

Growth Function:

Number of distinct classifications on \(m\) points:

\[\Pi_{\mathcal{H}}(m) = \max_{C \subseteq \mathcal{X}, |C|=m} |\{(h(x_1), ..., h(x_m)) : h \in \mathcal{H}\}|\]

Sauer’s Lemma (1972):

If \(\text{VCdim}(\mathcal{H}) = d < \infty\), then:

\[\Pi_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i} \leq \left(\frac{em}{d}\right)^d\]

Implication: Even infinite hypothesis classes have polynomial growth if VC dimension is finite!
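Sauer's lemma can be tabulated directly. A small sketch (using \(d = 3\), the VC dimension of 2-D linear classifiers from the examples above):

```python
import numpy as np
from math import comb

def growth_bound(m, d):
    # Sauer's lemma: Pi_H(m) <= sum_{i=0}^{d} C(m, i)
    return sum(comb(m, i) for i in range(d + 1))

d = 3
print(f"{'m':>4} {'2^m':>10} {'Sauer bound':>12} {'(em/d)^d':>12}")
for m in [3, 5, 10, 20, 50]:
    print(f"{m:>4} {2**m:>10} {growth_bound(m, d):>12} {(np.e * m / d)**d:>12.1f}")
```

For \(m \leq d\) the bound equals \(2^m\) (all dichotomies are possible); beyond that it grows only polynomially while \(2^m\) keeps growing exponentially.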

VC Bound (Fundamental Theorem of Statistical Learning):

For any \(\epsilon, \delta \in (0,1)\), with probability at least \(1-\delta\):

\[L_{\mathcal{D}}(h) \leq L_S(h) + \sqrt{\frac{d\log(em/d) + \log(1/\delta)}{m}}\]

where \(d = \text{VCdim}(\mathcal{H})\).

Sample Complexity:

For agnostic PAC learning with VC dimension \(d\):

\[m_{\mathcal{H}}(\epsilon, \delta) = O\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)\]

Key Insight: Sample complexity depends on VC dimension, not the size of \(\mathcal{H}\)!

3. Rademacher Complexity

Motivation: VC dimension is distribution-independent. Can we get tighter bounds by considering the actual distribution?

Empirical Rademacher Complexity:

Measures how well \(\mathcal{H}\) can fit random noise on sample \(S\):

\[\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i h(x_i)\right]\]

where \(\sigma_i \in \{-1, +1\}\) are independent Rademacher random variables (uniform).

Intuition:

  • If \(\mathcal{H}\) can perfectly correlate with random labels \(\sigma\), it has high capacity

  • Measures average correlation between hypotheses and random noise

Rademacher Complexity:

\[\mathfrak{R}_m(\mathcal{H}) = \mathbb{E}_{S \sim \mathcal{D}^m}[\hat{\mathfrak{R}}_S(\mathcal{H})]\]

Rademacher Generalization Bound:

With probability at least \(1-\delta\):

\[L_{\mathcal{D}}(h) \leq L_S(h) + 2\mathfrak{R}_m(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2m}}\]

Comparison with VC bounds:

  • Data-dependent (can be tighter)

  • Easier to compute for specific function classes

  • Connects to margin-based bounds in SVMs

Examples:

  1. Linear classifiers with bounded norm:

    • \(\mathcal{H} = \{x \mapsto \text{sign}(w^T x) : \|w\|_2 \leq \Lambda\}\)

    • Assuming \(\|x\|_2 \leq R\): \(\mathfrak{R}_m(\mathcal{H}) = O\left(\frac{\Lambda R}{\sqrt{m}}\right)\)

  2. Neural networks:

    • Bounds in terms of spectral norms of weight matrices

    • Explains why overparameterized networks can still generalize

Massart’s Finite Class Lemma:

For finite \(\mathcal{H}\) with \(|\mathcal{H}| = N\):

\[\mathfrak{R}_m(\mathcal{H}) \leq \sqrt{\frac{2\log N}{m}}\]
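Massart's bound is easy to sanity-check by Monte Carlo. A sketch (the finite "class" here is just \(N\) random \(\pm 1\) prediction vectors, my own construction for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, N = 100, 50

# A finite class: N fixed ±1 prediction vectors on m sample points
H = rng.choice([-1, 1], size=(N, m))

# Monte Carlo estimate of E_sigma[ sup_h (1/m) sum_i sigma_i h(x_i) ]
trials = 2000
sigma = rng.choice([-1, 1], size=(trials, m))
rad_hat = np.mean(np.max(sigma @ H.T, axis=1)) / m

massart = np.sqrt(2 * np.log(N) / m)
print(f"estimate = {rad_hat:.3f} <= Massart bound = {massart:.3f}")
```

The estimate typically lands well below the bound, since Massart's lemma is worst-case over all classes of size \(N\).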

Contraction Lemma:

If \(\phi: \mathbb{R} \to \mathbb{R}\) is \(L\)-Lipschitz and \(\phi(0) = 0\):

\[\mathfrak{R}_m(\phi \circ \mathcal{H}) \leq L \cdot \mathfrak{R}_m(\mathcal{H})\]

Useful for analyzing composed function classes (e.g., neural networks with Lipschitz activations).

4. Bias-Variance Decomposition

Classical Decomposition (Squared Loss):

For regression with \(y = f(x) + \epsilon\) where \(\mathbb{E}[\epsilon] = 0, \text{Var}(\epsilon) = \sigma^2\):

\[\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(f(x) - \mathbb{E}[\hat{f}(x)])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible}}\]

Proof:

Let \(\bar{f}(x) = \mathbb{E}_S[\hat{f}(x)]\) (average predictor over all training sets).

\[\begin{aligned} \mathbb{E}[(y - \hat{f}(x))^2] &= \mathbb{E}[(y - f(x) + f(x) - \bar{f}(x) + \bar{f}(x) - \hat{f}(x))^2] \\ &= \mathbb{E}[(y-f(x))^2] + (f(x)-\bar{f}(x))^2 + \mathbb{E}[(\bar{f}(x)-\hat{f}(x))^2] \\ &= \sigma^2 + \text{Bias}^2 + \text{Variance} \end{aligned}\]

(Cross terms vanish because \(\mathbb{E}[y - f(x)] = 0\), the noise is independent of the training set, and \(\mathbb{E}_S[\hat{f}(x)] = \bar{f}(x)\))

Interpretation:

  1. Bias: \(f(x) - \bar{f}(x)\)

    • Systematic error from wrong model assumptions

    • Measures how far the average prediction deviates from truth

    • High bias → underfitting

  2. Variance: \(\mathbb{E}[(\hat{f}(x) - \bar{f}(x))^2]\)

    • Error from sensitivity to training data

    • Measures how much predictions vary across different training sets

    • High variance → overfitting

  3. Irreducible Error: \(\sigma^2\)

    • Inherent noise in the problem

    • Cannot be reduced by any learning algorithm

Model Complexity Trade-off:

  • Simple models (low capacity): Low variance, high bias

  • Complex models (high capacity): Low bias, high variance

  • Optimal model: Balances bias and variance

Connection to Hypothesis Class Size:

  • Larger \(\mathcal{H}\) → Lower bias (can approximate more functions)

  • Larger \(\mathcal{H}\) → Higher variance (more overfitting risk)

  • Sample complexity increases with \(\text{VCdim}(\mathcal{H})\)

Beyond Squared Loss:

Bias-variance decomposition exists for other losses but is more complex:

  • 0-1 loss: decompositions exist (e.g., Domingos, 2000), but bias and variance no longer combine additively

  • Cross-entropy: no clean decomposition, but a conceptually similar trade-off holds
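The three terms can be estimated empirically by refitting on many independent training sets and examining predictions at one query point. A sketch (polynomial regression on a sine target; the setup mirrors the experiment later in this notebook, but all specifics here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # ground truth
sigma2 = 0.3 ** 2                     # noise variance (irreducible error)
x0 = 0.25                             # query point

def fit_and_predict(degree, n=30):
    # Draw a fresh training set and return the fitted model's prediction at x0
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, np.sqrt(sigma2), n)
    return np.polyval(np.polyfit(x, y, degree), x0)

for degree in [1, 3, 12]:
    preds = np.array([fit_and_predict(degree) for _ in range(500)])
    bias2 = (f(x0) - preds.mean()) ** 2
    var = preds.var()
    print(f"deg {degree:2d}: bias^2 = {bias2:.4f}  variance = {var:.4f}  "
          f"expected error ≈ {bias2 + var + sigma2:.4f}")
```

Degree 1 shows large bias² and small variance; degree 12 the reverse; the irreducible \(\sigma^2 = 0.09\) floors both.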

5. Uniform Convergence

Goal: Bound the difference between empirical and true risk uniformly over all \(h \in \mathcal{H}\).

Uniform Convergence Property:

\(\mathcal{H}\) has uniform convergence if for any \(\epsilon, \delta\), there exists \(m\) such that:

\[P_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)| \leq \epsilon\right] \geq 1 - \delta\]

Fundamental Theorem:

A hypothesis class \(\mathcal{H}\) is PAC-learnable if and only if it has the uniform convergence property.

Hoeffding’s Inequality (Single Hypothesis):

For a fixed \(h\), with probability at least \(1-\delta\):

\[|L_{\mathcal{D}}(h) - L_S(h)| \leq \sqrt{\frac{\log(2/\delta)}{2m}}\]

Union Bound Approach (Finite \(\mathcal{H}\)):

Apply Hoeffding to each \(h \in \mathcal{H}\) and use union bound:

\[P\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)| > \epsilon\right] \leq 2|\mathcal{H}|e^{-2m\epsilon^2}\]

Setting RHS \(= \delta\) and solving:

\[\epsilon = \sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2m}}\]
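The resulting width shrinks like \(1/\sqrt{m}\). A quick sketch (the class size \(10^6\) is an arbitrary illustrative choice):

```python
import numpy as np

def uc_epsilon(H_size, m, delta):
    # epsilon = sqrt( log(2|H|/delta) / (2m) )
    return np.sqrt(np.log(2 * H_size / delta) / (2 * m))

for m in [100, 1_000, 10_000]:
    print(f"m = {m:>6}: eps = {uc_epsilon(10**6, m, 0.05):.4f}")
```

Quadrupling the sample halves the uniform-convergence width, matching the \(1/\sqrt{m}\) rate.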

For Infinite \(\mathcal{H}\):

Replace \(\log|\mathcal{H}|\) with VC dimension or Rademacher complexity:

\[\epsilon = O\left(\sqrt{\frac{\text{VCdim}(\mathcal{H}) + \log(1/\delta)}{m}}\right)\]

6. Structural Risk Minimization (SRM)

Problem: Trading off model complexity vs. empirical fit.

SRM Principle (Vapnik, 1974):

Choose hypothesis that minimizes:

\[\hat{h} = \arg\min_{h \in \mathcal{H}} \left[L_S(h) + \sqrt{\frac{d_h\log(m/d_h) + \log(1/\delta)}{m}}\right]\]

where \(d_h\) is complexity measure (VC dimension, number of parameters, norm).

Interpretation:

  • First term: Empirical risk (fit to data)

  • Second term: Capacity penalty (complexity regularization)

  • Balances underfitting vs. overfitting

Nested Hypothesis Classes:

\[\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots \subset \mathcal{H}_k\]

with increasing complexity: \(\text{VCdim}(\mathcal{H}_1) < \text{VCdim}(\mathcal{H}_2) < \cdots\)

Select \(\mathcal{H}_i\) that minimizes SRM objective.

Occam’s Razor:

Prefer simpler hypotheses when they explain the data equally well.

Connection to Regularization:

  • Ridge regression: \(\min_w \|Xw - y\|^2 + \lambda\|w\|^2\)

  • Lasso: \(\min_w \|Xw - y\|^2 + \lambda\|w\|_1\)

  • \(\lambda\) controls complexity penalty

Cross-Validation vs. SRM:

  • CV: Data-driven model selection (empirical)

  • SRM: Theory-driven with generalization guarantees
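As a toy illustration of SRM-style selection (my own sketch: parameter count stands in for VC dimension, and the penalty reuses the form above), score polynomial fits by training error plus a capacity penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
m, delta = 60, 0.05
x = rng.uniform(0, 1, m)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, m)

def srm_objective(degree):
    coefs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    d = degree + 1                      # parameter count as a crude capacity proxy
    penalty = np.sqrt((d * np.log(m / d) + np.log(1 / delta)) / m)
    return train_mse, train_mse + penalty

for deg in [1, 3, 5, 9, 15]:
    tr, obj = srm_objective(deg)
    print(f"degree {deg:2d}: train MSE = {tr:.3f}, SRM objective = {obj:.3f}")
```

Training error alone always prefers degree 15; the penalized objective typically bottoms out at a small degree instead.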

7. No Free Lunch Theorems

Ugly Duckling Theorem:

Without prior assumptions about which features matter, all pairs of objects are equally similar, so no classification is privileged a priori.

No Free Lunch (Wolpert, 1996):

For any learning algorithm \(A\), there exists a distribution \(\mathcal{D}\) where \(A\) performs poorly.

Averaged over all possible distributions:

\[\mathbb{E}_{\mathcal{D}}[L_{\mathcal{D}}(A(S))] = \mathbb{E}_{\mathcal{D}}[L_{\mathcal{D}}(\text{random})]\]

Implication:

  • No universally best learning algorithm

  • Must make inductive bias assumptions about the problem

  • Success depends on match between bias and true distribution
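The averaging claim can be verified exhaustively on a tiny domain. A sketch (the 4-point domain and majority-vote learner are arbitrary illustrative choices; any other deterministic learner gives the same average):

```python
import numpy as np
from itertools import product

X = [0, 1, 2, 3]                  # tiny input domain
train_idx, test_idx = [0, 1], [2, 3]

def learner(train_labels):
    # A concrete algorithm: predict the majority training label everywhere
    return int(np.mean(train_labels) >= 0.5)

# Average off-training-set accuracy over ALL 2^4 possible target functions
accs = []
for target in product([0, 1], repeat=len(X)):
    pred = learner([target[i] for i in train_idx])
    accs.append(np.mean([pred == target[i] for i in test_idx]))

print(f"mean off-training-set accuracy = {np.mean(accs):.2f}")  # → 0.50
```

Only restricting the set of plausible targets, i.e., an inductive bias, can lift this average above chance.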

Practical Takeaway:

Choose hypothesis class \(\mathcal{H}\) and learning algorithm based on:

  • Domain knowledge

  • Problem structure

  • Available data

8. Summary Table

  • Finite \(|\mathcal{H}|\): measures class size; sample complexity \(O(\frac{\log|\mathcal{H}|}{\epsilon^2})\); simple, limited expressiveness

  • VC Dimension: measures shattering capacity; sample complexity \(O(\frac{d + \log(1/\delta)}{\epsilon^2})\); distribution-independent

  • Rademacher: measures noise fitting; sample complexity \(O(\frac{\mathfrak{R}_m^2}{\epsilon^2})\); data-dependent, tighter

  • PAC Learning: measures \((\epsilon, \delta)\) guarantees; sample complexity class-dependent; formal guarantees

  • Bias-Variance: error decomposition; no sample-complexity statement; diagnostic tool

When to use:

  • PAC/VC: Formal analysis, worst-case bounds

  • Rademacher: Margin-based methods (SVM), neural nets

  • Bias-Variance: Model selection, diagnosis

  • SRM: Practical regularization strategy

# Generate synthetic data: True function + noise
def true_function(x):
    """Ground truth: sine wave"""
    return np.sin(2 * np.pi * x)

def generate_data(n_samples=100, noise_std=0.3):
    """Generate noisy samples from true function"""
    x = np.random.uniform(0, 1, n_samples)
    y = true_function(x) + np.random.normal(0, noise_std, n_samples)
    return x, y

# Generate datasets
n_train = 50
n_test = 1000

x_train, y_train = generate_data(n_train)
x_test, y_test = generate_data(n_test, noise_std=0.1)  # Less noise for test

# Sort for plotting
x_plot = np.linspace(0, 1, 500)
y_true = true_function(x_plot)

print(f"Training samples: {n_train}")
print(f"Test samples: {n_test}")

3. The Bias-Variance Tradeoff

Decomposition of Error

For squared loss, the expected test error can be decomposed:

\[\mathbb{E}[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]
  • Bias: Error from wrong assumptions (underfitting)

  • Variance: Error from sensitivity to training data (overfitting)

  • Irreducible Error: Noise in the data

def train_polynomial_models(x_train, y_train, x_test, y_test, degrees):
    """Train polynomial models of different complexities"""
    results = []
    
    for degree in degrees:
        # Create polynomial features
        poly = PolynomialFeatures(degree=degree)
        X_train_poly = poly.fit_transform(x_train.reshape(-1, 1))
        X_test_poly = poly.transform(x_test.reshape(-1, 1))
        
        # Train model
        model = LinearRegression()
        model.fit(X_train_poly, y_train)
        
        # Predictions
        y_train_pred = model.predict(X_train_poly)
        y_test_pred = model.predict(X_test_poly)
        
        # Errors
        train_error = mean_squared_error(y_train, y_train_pred)
        test_error = mean_squared_error(y_test, y_test_pred)
        
        results.append({
            'degree': degree,
            'model': model,
            'poly': poly,
            'train_error': train_error,
            'test_error': test_error
        })
    
    return results

# Train models of varying complexity
degrees = [1, 2, 3, 5, 9, 15]
results = train_polynomial_models(x_train, y_train, x_test, y_test, degrees)

# Display results
print("\nModel Complexity Analysis:")
print("="*60)
print(f"{'Degree':<10} {'Train Error':<15} {'Test Error':<15} {'Gap':<10}")
print("="*60)
for r in results:
    gap = r['test_error'] - r['train_error']
    print(f"{r['degree']:<10} {r['train_error']:<15.4f} {r['test_error']:<15.4f} {gap:<10.4f}")
"""
Learning Theory Implementations: VC Dimension, Rademacher Complexity, and Sample Complexity
"""

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from itertools import product

# ============================================================================
# 1. VC Dimension: Shattering Demonstration
# ============================================================================

def check_shattering_2d(points, hypothesis_class='linear'):
    """
    Check if a set of points can be shattered by linear classifiers in 2D.
    
    Args:
        points: Array of shape (n, 2)
        hypothesis_class: 'linear' for half-planes
    
    Returns:
        can_shatter: bool
        successful_labelings: list of labelings that can be realized
    """
    n = len(points)
    total_labelings = 2**n
    successful = []
    
    # Try all possible labelings
    for labeling in product([0, 1], repeat=n):
        labels = np.array(labeling)
        
        if hypothesis_class == 'linear':
            # Try to find separating hyperplane
            # For 2D: w1*x + w2*y + b = 0
            # We'll use a simple perceptron-like check
            
            # Create augmented points (add bias term)
            X_aug = np.column_stack([points, np.ones(n)])
            y = 2 * labels - 1  # Convert to {-1, +1}
            
            # Try to find w such that y_i * (w^T x_i) > 0 for all i
            # using a perceptron (finds a separator iff one exists, given enough iterations)
            separable = check_linear_separability(X_aug, y)
            
            if separable:
                successful.append(labeling)
    
    can_shatter = (len(successful) == total_labelings)
    return can_shatter, successful

def check_linear_separability(X, y, max_iter=1000):
    """Perceptron-based separability check (heuristic: finite max_iter can miss barely separable sets)"""
    n, d = X.shape
    w = np.zeros(d)
    
    for _ in range(max_iter):
        mistakes = 0
        for i in range(n):
            if y[i] * np.dot(w, X[i]) <= 0:
                w += y[i] * X[i]
                mistakes += 1
        if mistakes == 0:
            return True
    return False

print("=" * 70)
print("1. VC Dimension: Shattering Linear Classifiers in 2D")
print("=" * 70)

# Test Case 1: 3 points (general position) - Should shatter
points_3 = np.array([[0, 0], [1, 0], [0.5, 1]])
can_shatter_3, successful_3 = check_shattering_2d(points_3)

print(f"\n3 points (general position):")
print(f"  Total possible labelings: {2**3}")
print(f"  Realizable labelings: {len(successful_3)}")
print(f"  Can shatter: {can_shatter_3}")
print(f"  VC dimension ≥ 3: {can_shatter_3}")

# Test Case 2: 4 points (XOR configuration) - Cannot shatter
points_4_xor = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
can_shatter_4, successful_4 = check_shattering_2d(points_4_xor)

print(f"\n4 points (XOR configuration):")
print(f"  Total possible labelings: {2**4}")
print(f"  Realizable labelings: {len(successful_4)}")
print(f"  Can shatter: {can_shatter_4}")
print("  VC dimension = 3 (no set of 4 points can be shattered)")

# Visualize shattering
fig, axes = plt.subplots(2, 4, figsize=(16, 8))

# 3 points - show the first 4 of the 8 labelings (the grid has 4 columns per row)
for idx, labeling in enumerate(successful_3[:4]):
    ax = axes[0, idx]
    colors = ['red' if l == 0 else 'blue' for l in labeling]
    ax.scatter(points_3[:, 0], points_3[:, 1], c=colors, s=200, edgecolor='black', linewidth=2)
    ax.set_xlim(-0.5, 1.5)
    ax.set_ylim(-0.5, 1.5)
    ax.set_title(f"Labeling: {labeling}", fontsize=10)
    ax.grid(True, alpha=0.3)

# 4 points - show three separable labelings plus the XOR pattern
sample_labelings_4 = [successful_4[i] for i in range(min(3, len(successful_4)))]
# Add XOR pattern which is NOT linearly separable
xor_pattern = (0, 1, 1, 0)
sample_labelings_4.append(xor_pattern)

for idx, labeling in enumerate(sample_labelings_4):
    ax = axes[1, idx]
    colors = ['red' if l == 0 else 'blue' for l in labeling]
    ax.scatter(points_4_xor[:, 0], points_4_xor[:, 1], c=colors, s=200, edgecolor='black', linewidth=2)
    ax.set_xlim(-0.5, 1.5)
    ax.set_ylim(-0.5, 1.5)
    is_xor = (labeling == xor_pattern)
    title = f"{'XOR (NOT separable)' if is_xor else f'Labeling: {labeling}'}"
    ax.set_title(title, fontsize=10, color='red' if is_xor else 'black')
    ax.grid(True, alpha=0.3)

plt.suptitle('VC Dimension Demonstration: Linear Classifiers in 2D', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('vc_dimension_shattering.png', dpi=150, bbox_inches='tight')
print("\nSaved: vc_dimension_shattering.png")
plt.show()

# ============================================================================
# 2. Empirical Rademacher Complexity
# ============================================================================

def empirical_rademacher_complexity(X, hypothesis_class, n_trials=1000):
    """
    Estimate empirical Rademacher complexity by Monte Carlo sampling.
    
    Args:
        X: Training samples (m, d)
        hypothesis_class: Function that returns predictions for all hypotheses
        n_trials: Number of random sign patterns to try
    
    Returns:
        rademacher_complexity: Estimated R_S(H)
    """
    m = len(X)
    correlations = []
    
    for _ in range(n_trials):
        # Sample Rademacher variables
        sigma = np.random.choice([-1, 1], size=m)
        
        # Compute supremum over hypothesis class
        # For linear classifiers: sup_w sigma^T X w subject to ||w|| <= 1
        # This equals ||X^T sigma||_2 (dual norm)
        
        if hypothesis_class == 'linear_bounded':
            # For ||w||_2 <= 1
            correlation = np.linalg.norm(X.T @ sigma) / m
        elif hypothesis_class == 'linear_l1':
            # For ||w||_1 <= 1
            correlation = np.max(np.abs(X.T @ sigma)) / m
        
        correlations.append(correlation)
    
    return np.mean(correlations)

print("\n" + "=" * 70)
print("2. Empirical Rademacher Complexity")
print("=" * 70)

# Generate random data
m_values = [10, 20, 50, 100, 200, 500]
d = 5
R = 1.0  # Data radius

rademacher_estimates = []

for m in m_values:
    # Generate random points in d-dimensional ball
    X = np.random.randn(m, d)
    X = X / np.linalg.norm(X, axis=1, keepdims=True) * R  # Normalize to radius R
    
    rad_complexity = empirical_rademacher_complexity(X, 'linear_bounded', n_trials=500)
    rademacher_estimates.append(rad_complexity)
    
    # Theoretical bound: O(R / sqrt(m))
    theoretical = R / np.sqrt(m)
    
    print(f"m = {m:4d}: Empirical = {rad_complexity:.4f}, Theoretical O(1/√m) = {theoretical:.4f}")

# Plot Rademacher complexity vs sample size
plt.figure(figsize=(10, 6))
plt.plot(m_values, rademacher_estimates, 'o-', label='Empirical R(H)', markersize=8, linewidth=2)
plt.plot(m_values, [R / np.sqrt(m) for m in m_values], '--', 
         label=f'Theoretical O(R/√m), R={R}', linewidth=2, alpha=0.7)
plt.xlabel('Sample Size (m)', fontsize=12)
plt.ylabel('Rademacher Complexity', fontsize=12)
plt.title('Rademacher Complexity vs Sample Size\nLinear Classifiers with ||w||₂ ≤ 1', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xscale('log')
plt.yscale('log')
plt.savefig('rademacher_complexity.png', dpi=150, bbox_inches='tight')
print("\nSaved: rademacher_complexity.png")
plt.show()

# ============================================================================
# 3. PAC Learning: Sample Complexity Bounds
# ============================================================================

def pac_sample_complexity(epsilon, delta, vc_dim):
    """
    Compute PAC sample complexity for agnostic learning.
    
    m >= O((d + log(1/delta)) / epsilon^2)
    
    Using constant factor from VC theory: m >= c * (d*log(1/epsilon) + log(1/delta)) / epsilon^2
    """
    c = 4  # Constant from theory
    m = c * (vc_dim * np.log(1/epsilon) + np.log(1/delta)) / (epsilon**2)
    return int(np.ceil(m))

print("\n" + "=" * 70)
print("3. PAC Learning: Sample Complexity Bounds")
print("=" * 70)

# Different hypothesis classes
hypothesis_classes = {
    'Linear (R^2)': 3,      # VC dim = d+1 = 3
    'Linear (R^10)': 11,    # VC dim = 11
    'Linear (R^100)': 101,  # VC dim = 101
    'Polynomial (deg 2, R^2)': 6,  # Approx
    'Polynomial (deg 3, R^2)': 10, # Approx
}

epsilon = 0.1  # 10% error
delta = 0.05   # 95% confidence

print(f"\nSample complexity for epsilon={epsilon}, delta={delta}:\n")
print(f"{'Hypothesis Class':<25} {'VC Dimension':>12} {'Sample Complexity':>20}")
print("-" * 60)

for name, vc_dim in hypothesis_classes.items():
    m = pac_sample_complexity(epsilon, delta, vc_dim)
    print(f"{name:<25} {vc_dim:>12} {m:>20,}")

# Plot sample complexity vs VC dimension
vc_dims = np.arange(1, 201, 5)
sample_complexities = [pac_sample_complexity(epsilon, delta, d) for d in vc_dims]

plt.figure(figsize=(12, 6))
plt.plot(vc_dims, sample_complexities, linewidth=2)
plt.xlabel('VC Dimension (d)', fontsize=12)
plt.ylabel('Sample Complexity (m)', fontsize=12)
plt.title(f'PAC Sample Complexity vs VC Dimension\n(ε={epsilon}, δ={delta})', fontsize=14)
plt.grid(True, alpha=0.3)
plt.savefig('pac_sample_complexity.png', dpi=150, bbox_inches='tight')
print("\nSaved: pac_sample_complexity.png")
plt.show()

# ============================================================================
# 4. Learning Curves: Empirical Validation
# ============================================================================

print("\n" + "=" * 70)
print("4. Learning Curves: Generalization Gap")
print("=" * 70)

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Generate synthetic classification data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, 
                           n_redundant=5, random_state=42)

# Different model complexities
models = {
    'Low Complexity (C=0.01)': LogisticRegression(C=0.01, max_iter=1000),
    'Medium Complexity (C=1)': LogisticRegression(C=1.0, max_iter=1000),
    'High Complexity (C=100)': LogisticRegression(C=100, max_iter=1000),
    'Very High (Deep Tree)': DecisionTreeClassifier(max_depth=None),
}

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, (name, model) in enumerate(models.items()):
    # Compute learning curves
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy',
        random_state=42
    )
    
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)
    
    ax = axes[idx]
    ax.plot(train_sizes, train_mean, 'o-', label='Training accuracy', linewidth=2, markersize=6)
    ax.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.2)
    
    ax.plot(train_sizes, val_mean, 's-', label='Validation accuracy', linewidth=2, markersize=6)
    ax.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.2)
    
    # Highlight generalization gap
    gap = train_mean - val_mean
    ax.plot(train_sizes, gap, 'r--', label='Generalization gap', linewidth=1.5, alpha=0.7)
    
    ax.set_xlabel('Training Set Size', fontsize=11)
    ax.set_ylabel('Accuracy', fontsize=11)
    ax.set_title(name, fontsize=12, fontweight='bold')
    ax.legend(loc='lower right', fontsize=9)
    ax.grid(True, alpha=0.3)
    ax.set_ylim([0.5, 1.05])

plt.suptitle('Learning Curves: Bias-Variance Tradeoff Across Model Complexities', 
             fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.savefig('learning_curves_comparison.png', dpi=150, bbox_inches='tight')
print("\nSaved: learning_curves_comparison.png")
plt.show()

# ============================================================================
# 5. Generalization Bounds Visualization
# ============================================================================

print("\n" + "=" * 70)
print("5. Generalization Bounds: Theory vs Practice")
print("=" * 70)

def generalization_bound_vc(m, d, delta):
    """VC-based generalization bound"""
    if m <= d:
        return np.inf
    return np.sqrt((d * np.log(np.e * m / d) + np.log(1/delta)) / m)

def generalization_bound_rademacher(m, R, Lambda, delta):
    """Rademacher-based bound for linear classifiers"""
    return 2 * R * Lambda / np.sqrt(m) + np.sqrt(np.log(1/delta) / (2*m))

# Parameters
m_range = np.logspace(1, 4, 100).astype(int)
d = 20
delta = 0.05
R = 1.0
Lambda = 1.0

vc_bounds = [generalization_bound_vc(m, d, delta) for m in m_range]
rad_bounds = [generalization_bound_rademacher(m, R, Lambda, delta) for m in m_range]

plt.figure(figsize=(12, 6))
plt.plot(m_range, vc_bounds, label=f'VC Bound (d={d})', linewidth=2)
plt.plot(m_range, rad_bounds, label=f'Rademacher Bound (R={R}, Λ={Lambda})', linewidth=2)
plt.axhline(y=0.1, color='red', linestyle='--', alpha=0.5, label='ε = 0.1 (target error)')
plt.xlabel('Sample Size (m)', fontsize=12)
plt.ylabel('Generalization Bound (ε)', fontsize=12)
plt.title(f'Generalization Bounds vs Sample Size (δ={delta})', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xscale('log')
plt.yscale('log')
plt.xlim([10, 10000])
plt.ylim([0.01, 10])
plt.savefig('generalization_bounds.png', dpi=150, bbox_inches='tight')
print("\nSaved: generalization_bounds.png")
plt.show()

print("\n" + "=" * 70)
print("Summary: Statistical Learning Theory")
print("=" * 70)
print(f"""
Key Results Demonstrated:
1. VC Dimension: Linear classifiers in R^2 have VC dim = 3
   - Can shatter any 3 points in general position
   - Cannot shatter any set of 4 points (e.g., XOR)

2. Rademacher Complexity: Decreases as O(1/√m)
   - Empirical estimates match theoretical predictions
   - Provides data-dependent bounds

3. PAC Sample Complexity: O((d log(1/ε) + log(1/δ)) / ε²)
   - Grows linearly with VC dimension
   - For d=100, ε=0.1, δ=0.05: need ~93,000 samples

4. Learning Curves: Show bias-variance tradeoff
   - Low complexity: High bias, low variance (underfitting)
   - High complexity: Low bias, high variance (overfitting)
   - Generalization gap decreases with more data

5. Generalization Bounds: Theory provides worst-case guarantees
   - Actual performance often much better in practice
   - Rademacher bounds tighter than VC bounds for specific cases
""")
# Visualize bias-variance tradeoff
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.ravel()

for idx, result in enumerate(results):
    ax = axes[idx]
    
    # Get predictions for plotting
    X_plot_poly = result['poly'].transform(x_plot.reshape(-1, 1))
    y_pred = result['model'].predict(X_plot_poly)
    
    # Plot
    ax.scatter(x_train, y_train, alpha=0.6, label='Training data')
    ax.plot(x_plot, y_true, 'g-', linewidth=2, label='True function')
    ax.plot(x_plot, y_pred, 'r-', linewidth=2, label=f'Degree {result["degree"]}')
    
    ax.set_title(f'Polynomial Degree {result["degree"]}\n'
                 f'Train: {result["train_error"]:.3f}, Test: {result["test_error"]:.3f}')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend(fontsize=8)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Observations

  • Degree 1-2 (Underfitting): High bias, low variance

    • Cannot capture true pattern

    • Similar train and test error

  • Degree 3-5 (Good fit): Balanced bias and variance

    • Captures true pattern well

    • Small gap between train and test error

  • Degree 9-15 (Overfitting): Low bias, high variance

    • Fits training data perfectly

    • Large gap between train and test error

    • Poor generalization

# Plot learning curves
train_errors = [r['train_error'] for r in results]
test_errors = [r['test_error'] for r in results]

plt.figure(figsize=(10, 6))
plt.plot(degrees, train_errors, 'o-', linewidth=2, markersize=8, label='Training Error')
plt.plot(degrees, test_errors, 's-', linewidth=2, markersize=8, label='Test Error')
plt.axhline(y=min(test_errors), color='g', linestyle='--', alpha=0.5, label='Best Test Error')
plt.xlabel('Model Complexity (Polynomial Degree)', fontsize=12)
plt.ylabel('Mean Squared Error', fontsize=12)
plt.title('Bias-Variance Tradeoff: Error vs Model Complexity', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

# Find optimal complexity
optimal_idx = np.argmin(test_errors)
print(f"\n✓ Optimal polynomial degree: {degrees[optimal_idx]}")
print(f"  Test error: {test_errors[optimal_idx]:.4f}")

4. Sample Complexity

Question: How many samples do we need to learn well?

Learning Curves

Plot error vs. training set size to understand:

  • How much data is needed

  • Whether we’re data-limited or model-limited

def compute_learning_curves(degree, train_sizes):
    """Compute learning curves for a given model complexity"""
    train_errors = []
    test_errors = []
    
    for size in train_sizes:
        # Sample training data
        x_tr, y_tr = generate_data(size)
        
        # Train model
        poly = PolynomialFeatures(degree=degree)
        X_train_poly = poly.fit_transform(x_tr.reshape(-1, 1))
        X_test_poly = poly.transform(x_test.reshape(-1, 1))
        
        model = LinearRegression()
        model.fit(X_train_poly, y_tr)
        
        # Compute errors
        train_errors.append(mean_squared_error(y_tr, model.predict(X_train_poly)))
        test_errors.append(mean_squared_error(y_test, model.predict(X_test_poly)))
    
    return np.array(train_errors), np.array(test_errors)

# Compute learning curves for different complexities
train_sizes = [10, 20, 30, 50, 75, 100, 150, 200]
degrees_to_test = [1, 3, 9]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, degree in enumerate(degrees_to_test):
    train_errs, test_errs = compute_learning_curves(degree, train_sizes)
    
    ax = axes[idx]
    ax.plot(train_sizes, train_errs, 'o-', label='Training Error', linewidth=2)
    ax.plot(train_sizes, test_errs, 's-', label='Test Error', linewidth=2)
    ax.set_xlabel('Training Set Size', fontsize=11)
    ax.set_ylabel('MSE', fontsize=11)
    ax.set_title(f'Degree {degree}', fontsize=12, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_yscale('log')

plt.tight_layout()
plt.show()

Interpretation

Degree 1 (High Bias):

  • Errors converge quickly

  • Large gap remains (cannot fit data well)

  • More data won’t help much → Need better model

Degree 3 (Balanced):

  • Errors decrease with more data

  • Gap closes gradually

  • More data helps → Good model

Degree 9 (High Variance):

  • Large initial gap

  • Needs lots of data to converge

  • More data helps but slowly → Model too complex for available data

5. PAC Learning Framework

Probably Approximately Correct (PAC) Learning

Definition: A hypothesis class \(\mathcal{H}\) is PAC-learnable if there exists an algorithm \(A\) such that for any:

  • Target concept \(c \in \mathcal{H}\)

  • Distribution \(\mathcal{D}\) over \(\mathcal{X}\)

  • Error tolerance \(\epsilon > 0\)

  • Confidence parameter \(\delta \in (0, 1)\) (the allowed failure probability)

Given \(m \geq m_{\mathcal{H}}(\epsilon, \delta)\) samples, algorithm \(A\) outputs \(h\) such that:

\[P[L_{\mathcal{D}}(h) \leq \epsilon] \geq 1 - \delta\]

With probability at least \(1-\delta\), the error is at most \(\epsilon\).

Key Parameters

  • \(\epsilon\): Accuracy parameter (how close to the best achievable error)

  • \(\delta\): Failure probability (we succeed with confidence \(1 - \delta\))

  • \(m\): Sample complexity (how much data is needed)
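For a finite hypothesis class in the realizable setting, the three parameters are linked by a closed-form bound, \(m \geq \frac{1}{\epsilon}\left(\ln |\mathcal{H}| + \ln \frac{1}{\delta}\right)\) (proved in the exercises below). A quick sketch of how \(m\) scales:

```python
import math

def finite_class_samples(h_size, epsilon, delta):
    """Realizable-case PAC sample complexity for a finite hypothesis class:
    m >= (ln|H| + ln(1/delta)) / epsilon samples suffice."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

# Sample complexity grows only logarithmically with |H| ...
print(finite_class_samples(10**3, 0.1, 0.05))   # → 100
print(finite_class_samples(10**6, 0.1, 0.05))   # → 169
# ... but linearly with 1/epsilon
print(finite_class_samples(10**3, 0.01, 0.05))  # → 991
```

The logarithmic dependence on \(|\mathcal{H}|\) is what makes even very large finite classes learnable from modest amounts of data.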

# Simulate PAC learning guarantees
def simulate_pac_learning(n_trials=1000, sample_sizes=[10, 50, 100, 500], epsilon=0.1):
    """
    Simulate PAC learning: For each sample size, check how often
    we achieve epsilon-accuracy
    """
    degree = 3  # Use balanced model
    results = {}
    
    for m in sample_sizes:
        successes = 0
        test_errors = []
        
        for trial in range(n_trials):
            # Generate training data
            x_tr, y_tr = generate_data(m)
            
            # Train model
            poly = PolynomialFeatures(degree=degree)
            X_train_poly = poly.fit_transform(x_tr.reshape(-1, 1))
            X_test_poly = poly.transform(x_test.reshape(-1, 1))
            
            model = LinearRegression()
            model.fit(X_train_poly, y_tr)
            
            # Compute test error
            test_error = mean_squared_error(y_test, model.predict(X_test_poly))
            test_errors.append(test_error)
            
            if test_error <= epsilon:
                successes += 1
        
        confidence = successes / n_trials
        results[m] = {
            'confidence': confidence,
            'mean_error': np.mean(test_errors),
            'std_error': np.std(test_errors)
        }
    
    return results

# Run simulation
epsilon = 0.1
pac_results = simulate_pac_learning(epsilon=epsilon)

print(f"\nPAC Learning Simulation (ε = {epsilon})")
print("="*70)
print(f"{'Sample Size':<15} {'P[error ≤ ε]':<20} {'Mean Error':<15} {'Std Error':<15}")
print("="*70)
for m, res in pac_results.items():
    print(f"{m:<15} {res['confidence']:<20.3f} {res['mean_error']:<15.4f} {res['std_error']:<15.4f}")
# Visualize PAC guarantees
sample_sizes = list(pac_results.keys())
confidences = [pac_results[m]['confidence'] for m in sample_sizes]
mean_errors = [pac_results[m]['mean_error'] for m in sample_sizes]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Confidence plot
ax1.plot(sample_sizes, confidences, 'o-', linewidth=2, markersize=8)
ax1.axhline(y=0.95, color='r', linestyle='--', label='95% confidence')
ax1.set_xlabel('Training Set Size', fontsize=12)
ax1.set_ylabel(f'P[Error ≤ {epsilon}]', fontsize=12)
ax1.set_title('PAC Confidence vs Sample Size', fontsize=13, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.legend()

# Error plot
ax2.plot(sample_sizes, mean_errors, 's-', linewidth=2, markersize=8, color='green')
ax2.axhline(y=epsilon, color='r', linestyle='--', label=f'ε = {epsilon}')
ax2.set_xlabel('Training Set Size', fontsize=12)
ax2.set_ylabel('Mean Test Error', fontsize=12)
ax2.set_title('Mean Error vs Sample Size', fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.legend()
ax2.set_yscale('log')

plt.tight_layout()
plt.show()

6. Summary

Key Concepts

Learning Problem: Find hypothesis that generalizes from training to test data

Generalization Gap: Difference between training and test error

  • Underfitting: High bias (model too simple)

  • Overfitting: High variance (model too complex)

Bias-Variance Tradeoff (decomposition of expected squared error): $\(\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}\)$

Sample Complexity: How much data needed to learn?

  • More complex models need more data

  • Learning curves reveal data vs. model limitations

PAC Learning: Formal framework for understanding learnability

  • Provides probabilistic guarantees

  • Sample complexity bounds
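The bias-variance decomposition summarized above can be checked numerically. The sketch below uses an assumed quadratic target, Gaussian noise, and degree-1 least-squares fits evaluated at a single test point; all of these choices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2          # assumed true function (illustrative)
sigma = 0.5                 # noise standard deviation
x0, n, trials = 0.8, 30, 4000

preds, sq_errors = [], []
for _ in range(trials):
    # Fresh training set each trial
    x = rng.uniform(-1, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    a, b = np.polyfit(x, y, 1)           # degree-1 fit: pred = a*x + b
    pred = a * x0 + b
    preds.append(pred)
    y0 = f(x0) + rng.normal(0, sigma)    # fresh noisy test label at x0
    sq_errors.append((y0 - pred) ** 2)

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2      # squared bias at x0
variance = preds.var()                   # variance of the predictor at x0
noise = sigma**2                         # irreducible noise
total = np.mean(sq_errors)               # measured expected squared error

print(f"bias^2={bias2:.3f} variance={variance:.3f} noise={noise:.3f}")
print(f"sum={bias2 + variance + noise:.3f} vs measured total={total:.3f}")
```

Up to Monte Carlo error, the sum of the three components matches the directly measured expected squared error.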

Next Steps

  • Concentration Inequalities: Formalize generalization bounds

  • VC Dimension: Measure hypothesis class complexity

  • Rademacher Complexity: Alternative complexity measure

  • Regularization: Control complexity explicitly

Practical Takeaways

  1. Monitor both train and test error to detect overfitting

  2. Use validation set to select model complexity

  3. Plot learning curves to diagnose bias/variance issues

  4. Get more data when variance-limited

  5. Use better features when bias-limited

7. Exercises

Exercise 1: Implement Cross-Validation

Implement k-fold cross-validation to select optimal polynomial degree.

Exercise 2: Regularization

Add L2 regularization (Ridge) and observe its effect on overfitting.

Exercise 3: Feature Engineering

Create custom features to reduce bias for the sine wave problem.

Exercise 4: Theory

Prove that for a finite hypothesis class \(\mathcal{H}\) in the realizable setting, $\(m \geq \frac{1}{\epsilon}\left(\log |\mathcal{H}| + \log \frac{1}{\delta}\right)\)$ samples suffice for PAC learning.

Exercise 5: Experimentation

Generate data from different distributions and observe how it affects learning.

References

  1. Shalev-Shwartz & Ben-David - “Understanding Machine Learning: From Theory to Algorithms” (2014)

  2. Mohri, Rostamizadeh & Talwalkar - “Foundations of Machine Learning” (2nd ed., 2018)

  3. Abu-Mostafa, Magdon-Ismail & Lin - “Learning From Data” (2012)

  4. Vapnik - “Statistical Learning Theory” (1998)

Next Notebook: 02_concentration_inequalities.ipynb