Chapter 2: Statistical Learning

Overview

This chapter introduces the fundamental concepts of statistical learning:

  • What is statistical learning? Understanding the relationship between inputs and outputs

  • Why estimate f? Prediction vs. Inference

  • How to estimate f? Parametric vs. Non-parametric methods

  • Model accuracy: trade-offs between bias and variance

  • Supervised vs. Unsupervised learning

  • Regression vs. Classification

Key Concepts

The Statistical Learning Framework

Given:

  • Predictors/Features: X = (X₁, X₂, …, Xₚ)

  • Response/Output: Y

We assume there exists a relationship:

Y = f(X) + ε

Where:

  • f is the unknown function we want to estimate

  • ε is a random error term with mean zero; its variance is the irreducible error

Goals:

  1. Prediction: Estimate Y given new X values

  2. Inference: Understand the relationship between X and Y

# Generate synthetic data to demonstrate concepts
# True function: f(x) = 2 + 3*x + 5*x^2
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n = 100

X_train = np.random.uniform(-2, 2, n)
X_test = np.random.uniform(-2, 2, 50)

# True function
def true_function(x):
    return 2 + 3*x + 5*x**2

# Add noise
epsilon_std = 3
y_train = true_function(X_train) + np.random.normal(0, epsilon_std, n)
y_test = true_function(X_test) + np.random.normal(0, epsilon_std, 50)

# Visualize the data
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
x_smooth = np.linspace(-2, 2, 200)
plt.scatter(X_train, y_train, alpha=0.5, label='Training data')
plt.plot(x_smooth, true_function(x_smooth), 'r-', linewidth=2, label='True f(x)')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Statistical Learning: Y = f(X) + ε')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
# Show the irreducible error: vertical segments are the noise ε = Y - f(X)
predictions = true_function(X_train)
plt.scatter(X_train, y_train, alpha=0.5, label='Observed Y')
plt.scatter(X_train, predictions, alpha=0.5, color='orange', label='f(X) - no noise')
for i in range(min(20, n)):
    plt.plot([X_train[i], X_train[i]], [y_train[i], predictions[i]], 
             'k-', alpha=0.2, linewidth=0.5)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Irreducible Error (noise ε)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Training samples: {n}")
print(f"Test samples: {len(X_test)}")
print(f"Noise std deviation (irreducible error): {epsilon_std}")

2.1 Prediction vs. Inference

Prediction

Goal: Accurately predict Y for new X values

Example: Predict house price given features (location, size, etc.)

  • Don't care about the exact form of f(X); treat it as a black box

  • Only care about prediction accuracy

Inference

Goal: Understand the relationship between X and Y

Questions we want to answer:

  • Which predictors are most important?

  • What is the relationship between Y and each Xⱼ?

  • Is the relationship linear, non-linear, or more complex?

Example: How does advertising budget affect sales?

  • Need interpretable model

  • Care about the form of f(X) (a short sketch follows)
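
To make the contrast concrete, here is a minimal inference sketch: fit an interpretable linear model and read its coefficients as estimated effects. The advertising data below is synthetic, invented purely for illustration (the true effects are 0.05 per unit of TV budget and 0.10 per unit of radio budget).

# Inference sketch: recover effect sizes from an interpretable model
# (data-generating process invented for illustration)
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(1)
tv = np.random.uniform(0, 100, 200)    # TV advertising budget
radio = np.random.uniform(0, 50, 200)  # radio advertising budget
sales = 5 + 0.05 * tv + 0.1 * radio + np.random.normal(0, 1, 200)

adv_model = LinearRegression().fit(np.column_stack([tv, radio]), sales)
print(f"Intercept:    {adv_model.intercept_:.2f}")  # close to 5
print(f"TV effect:    {adv_model.coef_[0]:.3f}")    # close to 0.05
print(f"Radio effect: {adv_model.coef_[1]:.3f}")    # close to 0.10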

# Demonstration: Parametric vs. Non-parametric Methods
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Reshape for sklearn
X_train_2d = X_train.reshape(-1, 1)
X_test_2d = X_test.reshape(-1, 1)

# 1. PARAMETRIC METHOD: Linear Regression (assumes linear form)
linear_model = LinearRegression()
linear_model.fit(X_train_2d, y_train)
y_pred_linear = linear_model.predict(X_test_2d)

# 2. PARAMETRIC METHOD: Polynomial Regression (assumes polynomial form)
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2)
X_train_poly = poly_features.fit_transform(X_train_2d)
X_test_poly = poly_features.transform(X_test_2d)

poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
y_pred_poly = poly_model.predict(X_test_poly)

# 3. NON-PARAMETRIC METHOD: K-Nearest Neighbors
knn_model = KNeighborsRegressor(n_neighbors=10)
knn_model.fit(X_train_2d, y_train)
y_pred_knn = knn_model.predict(X_test_2d)

# Visualization
x_plot = np.linspace(-2, 2, 200).reshape(-1, 1)
x_plot_poly = poly_features.transform(x_plot)

plt.figure(figsize=(15, 5))

# Linear model
plt.subplot(1, 3, 1)
plt.scatter(X_train, y_train, alpha=0.3, label='Training data')
plt.plot(x_plot, linear_model.predict(x_plot), 'r-', linewidth=2, label='Linear fit')
plt.plot(x_smooth, true_function(x_smooth), 'g--', linewidth=2, label='True f(x)', alpha=0.7)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Parametric: Linear Regression\n(Assumes linear relationship)')
plt.legend()
plt.grid(True, alpha=0.3)

# Polynomial model
plt.subplot(1, 3, 2)
plt.scatter(X_train, y_train, alpha=0.3, label='Training data')
plt.plot(x_plot, poly_model.predict(x_plot_poly), 'r-', linewidth=2, label='Polynomial fit (deg=2)')
plt.plot(x_smooth, true_function(x_smooth), 'g--', linewidth=2, label='True f(x)', alpha=0.7)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Parametric: Polynomial Regression\n(Assumes polynomial relationship)')
plt.legend()
plt.grid(True, alpha=0.3)

# KNN model
plt.subplot(1, 3, 3)
plt.scatter(X_train, y_train, alpha=0.3, label='Training data')
plt.plot(x_plot, knn_model.predict(x_plot), 'r-', linewidth=2, label='KNN fit (k=10)')
plt.plot(x_smooth, true_function(x_smooth), 'g--', linewidth=2, label='True f(x)', alpha=0.7)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Non-parametric: K-Nearest Neighbors\n(No assumed form)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate MSE for each method
mse_linear = mean_squared_error(y_test, y_pred_linear)
mse_poly = mean_squared_error(y_test, y_pred_poly)
mse_knn = mean_squared_error(y_test, y_pred_knn)

print("\nπŸ“Š Test Set Performance (MSE):")
print(f"Linear Regression:      {mse_linear:.2f}")
print(f"Polynomial Regression:  {mse_poly:.2f} βœ… Best (matches true form)")
print(f"KNN (k=10):            {mse_knn:.2f}")

2.2 The Bias-Variance Trade-Off

A fundamental concept in statistical learning:

Expected Test MSE = Bias² + Variance + Irreducible Error

Bias

  • Error from approximating a real-life problem with a simpler model

  • High bias → model is too simple (underfitting)

  • Example: Using linear model for non-linear relationship

Variance

  • Amount by which the estimate would change with different training data

  • High variance → model is too complex (overfitting)

  • Example: High-degree polynomial fitting noise

The Trade-Off

  • Simple models: High bias, low variance

  • Complex models: Low bias, high variance

  • Goal: Find the sweet spot that minimizes test error (estimated empirically in the sketch below)
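
The decomposition can be checked empirically. The sketch below is not part of the original demo: it reuses true_function and epsilon_std from above, refits polynomials of several degrees on many fresh training sets, and estimates the bias² and variance of the prediction at a single point x₀ = 1 (where the true value is 10).

# Empirical bias/variance sketch at a fixed test point x0 = 1.0
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
x0 = np.array([[1.0]])
f_x0 = true_function(1.0)  # true value at x0

for degree in (1, 2, 10):
    poly = PolynomialFeatures(degree=degree)
    preds = []
    for _ in range(500):  # 500 independent training sets
        X_sim = np.random.uniform(-2, 2, 100).reshape(-1, 1)
        y_sim = true_function(X_sim.ravel()) + np.random.normal(0, epsilon_std, 100)
        fit = LinearRegression().fit(poly.fit_transform(X_sim), y_sim)
        preds.append(fit.predict(poly.transform(x0))[0])
    preds = np.array(preds)
    print(f"degree {degree:2d}: bias² = {(preds.mean() - f_x0)**2:7.3f}, "
          f"variance = {preds.var():7.3f}")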

# Demonstrate bias-variance tradeoff with polynomial models of varying complexity

degrees = [1, 2, 3, 5, 10, 15]
train_errors = []
test_errors = []

plt.figure(figsize=(15, 10))

for idx, degree in enumerate(degrees, 1):
    # Fit polynomial model
    poly = PolynomialFeatures(degree=degree)
    X_train_p = poly.fit_transform(X_train_2d)
    X_test_p = poly.transform(X_test_2d)
    
    model = LinearRegression()
    model.fit(X_train_p, y_train)
    
    # Predictions
    y_train_pred = model.predict(X_train_p)
    y_test_pred = model.predict(X_test_p)
    
    # Errors
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    train_errors.append(train_mse)
    test_errors.append(test_mse)
    
    # Plot
    plt.subplot(2, 3, idx)
    x_plot_p = poly.transform(x_plot)
    plt.scatter(X_train, y_train, alpha=0.3, s=20, label='Training')
    plt.plot(x_plot, model.predict(x_plot_p), 'r-', linewidth=2, label=f'Degree {degree} fit')
    plt.plot(x_smooth, true_function(x_smooth), 'g--', linewidth=2, label='True f(x)', alpha=0.7)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title(f'Degree {degree}\nTrain MSE: {train_mse:.2f}, Test MSE: {test_mse:.2f}')
    plt.legend(fontsize=8)
    plt.grid(True, alpha=0.3)
    plt.ylim(-10, 30)

plt.tight_layout()
plt.show()

# Plot bias-variance tradeoff curve
plt.figure(figsize=(10, 6))
plt.plot(degrees, train_errors, 'o-', linewidth=2, markersize=8, label='Training MSE')
plt.plot(degrees, test_errors, 's-', linewidth=2, markersize=8, label='Test MSE')
plt.axhline(y=epsilon_std**2, color='r', linestyle='--', label=f'Irreducible Error (~{epsilon_std**2:.0f})')
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Mean Squared Error')
plt.title('Bias-Variance Trade-Off')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(degrees)

# Annotate regions
plt.text(1.5, max(test_errors)*0.8, 'HIGH BIAS\nLOW VARIANCE\n(Underfitting)', 
         ha='center', fontsize=10, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.text(12, max(test_errors)*0.8, 'LOW BIAS\nHIGH VARIANCE\n(Overfitting)', 
         ha='center', fontsize=10, bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.5))

plt.show()

best_degree = degrees[np.argmin(test_errors)]
print(f"\nπŸ’‘ Best model complexity: Degree {best_degree}")
print(f"   Minimum test MSE: {min(test_errors):.2f}")
print(f"   True underlying function is degree 2 βœ…")
# Demonstrate model flexibility with KNN (varying k)

k_values = [1, 3, 5, 10, 20, 50]
train_errors_knn = []
test_errors_knn = []

plt.figure(figsize=(15, 10))

for idx, k in enumerate(k_values, 1):
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train_2d, y_train)
    
    y_train_pred = model.predict(X_train_2d)
    y_test_pred = model.predict(X_test_2d)
    
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    train_errors_knn.append(train_mse)
    test_errors_knn.append(test_mse)
    
    plt.subplot(2, 3, idx)
    plt.scatter(X_train, y_train, alpha=0.3, s=20, label='Training')
    plt.plot(x_plot, model.predict(x_plot), 'r-', linewidth=2, label=f'KNN (k={k})')
    plt.plot(x_smooth, true_function(x_smooth), 'g--', linewidth=2, label='True f(x)', alpha=0.7)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title(f'K = {k}\nTrain MSE: {train_mse:.2f}, Test MSE: {test_mse:.2f}')
    plt.legend(fontsize=8)
    plt.grid(True, alpha=0.3)
    plt.ylim(-10, 30)

plt.tight_layout()
plt.show()

# Plot flexibility curve (note: for KNN, smaller k = more flexible)
flexibility = [1/k for k in k_values]  # 1/k represents flexibility

plt.figure(figsize=(10, 6))
plt.plot(flexibility, train_errors_knn, 'o-', linewidth=2, markersize=8, label='Training MSE')
plt.plot(flexibility, test_errors_knn, 's-', linewidth=2, markersize=8, label='Test MSE')
plt.axhline(y=epsilon_std**2, color='r', linestyle='--', label=f'Irreducible Error (~{epsilon_std**2:.0f})')
plt.xlabel('Model Flexibility (1/k)')
plt.ylabel('Mean Squared Error')
plt.title('Bias-Variance Trade-Off for KNN')
plt.legend()
plt.grid(True, alpha=0.3)

# Add k labels on top axis
ax = plt.gca()
ax2 = ax.twiny()
ax2.set_xlim(ax.get_xlim())
ax2.set_xticks(flexibility)
ax2.set_xticklabels([f'k={k}' for k in k_values])
ax2.set_xlabel('Number of Neighbors (k)')

plt.show()

best_k = k_values[np.argmin(test_errors_knn)]
print(f"\nπŸ’‘ Best K value: {best_k}")
print(f"   Minimum test MSE: {min(test_errors_knn):.2f}")
print(f"   Smaller k = more flexible, larger k = less flexible")

2.3 Classification Setting

Key Differences from Regression

  • Regression: Predict quantitative output (continuous)

  • Classification: Predict qualitative output (categorical)

Performance Metrics

Training Error Rate:

Error Rate = (1/n) × Σ I(yᵢ ≠ ŷᵢ), where I(·) equals 1 for a misclassified observation and 0 otherwise
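
In code this is just the mean of the 0/1 losses; a quick check on made-up labels:

# Error rate = fraction of misclassified observations (toy labels)
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 0])
print(np.mean(y_true != y_hat))  # 0.4: two of the five predictions are wrong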

Bayes Classifier (optimal, but unknown):

  • Assigns each observation to most likely class given its predictors

  • Produces lowest possible test error rate (Bayes error rate)

K-Nearest Neighbors Classifier:

  • Estimates the conditional probability P(Y = j | X = x₀)

  • Classification: majority vote among the k nearest neighbors (see the sketch below)
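
As a minimal sketch (toy data invented here), sklearn's predict_proba exposes exactly this estimate: the class fractions among the k nearest neighbors.

# KNN's conditional probability estimate = class fractions among the
# k nearest neighbors (toy one-dimensional data for illustration)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_toy = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])
y_toy = np.array([0, 0, 1, 1, 1])

knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
print(knn_toy.predict_proba([[0.9]]))  # [[0.333 0.667]]: 1 of 3 neighbors is class 0
print(knn_toy.predict([[0.9]]))        # [1]: majority vote picks class 1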

# Classification example with synthetic 2D data
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap

# Generate classification data
np.random.seed(42)
n_class = 200

# Class 1: centered at (-1, -1)
X_class1 = np.random.randn(n_class//2, 2) * 0.6 + np.array([-1, -1])
y_class1 = np.zeros(n_class//2)

# Class 2: centered at (1, 1)
X_class2 = np.random.randn(n_class//2, 2) * 0.6 + np.array([1, 1])
y_class2 = np.ones(n_class//2)

X_class = np.vstack([X_class1, X_class2])
y_class = np.hstack([y_class1, y_class2])

# Split data
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_class, y_class, test_size=0.3, random_state=42
)

# Create mesh for decision boundary
h = 0.02
x_min, x_max = X_class[:, 0].min() - 1, X_class[:, 0].max() + 1
y_min, y_max = X_class[:, 1].min() - 1, X_class[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Try different K values
k_values_class = [1, 5, 15, 50]
cmap_light = ListedColormap(['#FFAAAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#0000FF'])

fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, k in enumerate(k_values_class):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_c, y_train_c)
    
    # Predict on mesh
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap=cmap_light)
    axes[idx].scatter(X_train_c[:, 0], X_train_c[:, 1], c=y_train_c, 
                     cmap=cmap_bold, alpha=0.6, edgecolor='k', s=50)
    
    # Calculate accuracy
    train_acc = accuracy_score(y_train_c, knn.predict(X_train_c))
    test_acc = accuracy_score(y_test_c, knn.predict(X_test_c))
    
    axes[idx].set_xlabel('X₁')
    axes[idx].set_ylabel('X₂')
    axes[idx].set_title(f'KNN Classification (k={k})\nTrain Acc: {train_acc:.3f}, Test Acc: {test_acc:.3f}')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nπŸ“Š Classification Accuracy Summary:")
for k in k_values_class:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_c, y_train_c)
    train_acc = accuracy_score(y_train_c, knn.predict(X_train_c))
    test_acc = accuracy_score(y_test_c, knn.predict(X_test_c))
    print(f"k={k:2d}:  Train: {train_acc:.3f}  Test: {test_acc:.3f}")

print("\nπŸ’‘ Key Observations:")
print("   β€’ k=1: High flexibility β†’ Overfitting (perfect train, lower test)")
print("   β€’ k=50: Low flexibility β†’ Underfitting (boundary too smooth)")
print("   β€’ Middle values (k=5, 15) balance bias-variance well")

2.4 Supervised vs. Unsupervised Learning

Supervised Learning

Have labeled data: Both X (predictors) and Y (response)

Goal: Learn a function f: X → Y

Examples:

  • Linear regression

  • Logistic regression

  • K-Nearest Neighbors

  • Support Vector Machines

  • Neural Networks

Applications: Prediction, classification

Unsupervised Learning

Have unlabeled data: Only X (no response Y)

Goal: Discover patterns, structure, or relationships in data

Examples:

  • Clustering (K-means, hierarchical)

  • Principal Component Analysis (PCA)

  • Matrix completion

Applications: Customer segmentation, anomaly detection, dimensionality reduction

# Demonstrate unsupervised learning with clustering
from sklearn.cluster import KMeans

# Generate data with hidden structure (3 clusters)
np.random.seed(42)
cluster_centers = np.array([[0, 0], [4, 4], [0, 4]])
n_samples = 150
n_clusters = 3

X_cluster = []
for center in cluster_centers:
    points = np.random.randn(n_samples//n_clusters, 2) * 0.7 + center
    X_cluster.append(points)
X_cluster = np.vstack(X_cluster)

# Try different numbers of clusters
fig, axes = plt.subplots(1, 4, figsize=(18, 4))

# Raw data (no labels)
axes[0].scatter(X_cluster[:, 0], X_cluster[:, 1], alpha=0.6, s=50)
axes[0].set_title('Unlabeled Data\n(Unsupervised Learning)')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].grid(True, alpha=0.3)

# Try k=2, 3, 4 clusters
for idx, k in enumerate([2, 3, 4], 1):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_cluster)
    centers = kmeans.cluster_centers_
    
    axes[idx].scatter(X_cluster[:, 0], X_cluster[:, 1], c=labels, 
                     cmap='viridis', alpha=0.6, s=50)
    axes[idx].scatter(centers[:, 0], centers[:, 1], c='red', 
                     marker='X', s=200, edgecolor='black', linewidth=2,
                     label='Centroids')
    axes[idx].set_title(f'K-Means Clustering\n(k={k} clusters)')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("πŸ’‘ Unsupervised Learning Insights:")
print("   β€’ No labeled responses - algorithm finds patterns on its own")
print("   β€’ k=3 correctly identifies the true structure")
print("   β€’ Choosing correct k is a challenge (often use elbow method, silhouette score)")

Key Takeaways

1. Statistical Learning Framework

  • Goal: Estimate f in Y = f(X) + ε

  • Two purposes: Prediction (accuracy) vs Inference (interpretability)

2. Methods

  • Parametric: Assume functional form (faster, less flexible, may not match reality)

  • Non-parametric: No assumptions (flexible, needs more data, can overfit)

3. Bias-Variance Trade-Off

Test MSE = Bias² + Variance + Irreducible Error

  • Simple models: High bias, low variance

  • Complex models: Low bias, high variance

  • Goal: Minimize test error (not training error!)

4. Problem Types

  • Regression: Quantitative output (price, temperature)

  • Classification: Qualitative output (spam/not spam, disease type)

5. Learning Paradigms

  • Supervised: Have Y labels → predict new Y

  • Unsupervised: No Y labels → find structure

6. Model Assessment

  • Always evaluate on test data (not training data)

  • Training error can be misleading (overfitting)

  • Cross-validation helps estimate test error without touching the test set (sketched below)
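
As a quick illustration of that last point (a sketch, reusing X_train_poly and y_train from the earlier cells), 5-fold cross-validation estimates the degree-2 model's test MSE from the training data alone:

# 5-fold cross-validation estimate of test MSE for the degree-2 model
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(LinearRegression(), X_train_poly, y_train,
                            scoring='neg_mean_squared_error', cv=5)
print(f"5-fold CV estimate of test MSE: {-cv_scores.mean():.2f}")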

Next Steps

The following chapters dive deeper into specific methods:

  • Chapter 3: Linear Regression (parametric, regression)

  • Chapter 4: Classification (logistic regression, LDA, QDA, KNN)

  • Chapter 5: Resampling (cross-validation, bootstrap)

  • Chapter 6: Regularization (Ridge, Lasso)

  • Chapters 7-10: Advanced methods (splines, trees, SVM, deep learning)

  • Chapters 11-13: Special topics (survival, unsupervised, testing)

Practice Exercises

Exercise 1: Conceptual

Explain the difference between parametric and non-parametric methods. What are the advantages and disadvantages of each?

Exercise 2: Bias-Variance

Generate a new dataset and experiment with polynomial degrees 1 through 20. Plot the training and test MSE. Identify the point of overfitting.

Exercise 3: KNN Flexibility

For the classification example, try k values from 1 to 100. Plot test accuracy vs. k. What is the optimal k?

Exercise 4: Clustering

Generate data with 5 clusters and use K-means with k=3, 4, 5, 6, 7. Use the elbow method to determine the optimal number of clusters.

Exercise 5: Real Data

Download a real dataset (e.g., from sklearn.datasets) and:

  1. Split into train/test

  2. Fit multiple models with different complexity

  3. Plot bias-variance tradeoff

  4. Select the best model based on test performance