Chapter 2: Statistical Learning
Overview
This chapter introduces the fundamental concepts of statistical learning:
What is statistical learning? Understanding the relationship between inputs and outputs
Why estimate f? Prediction vs. Inference
How to estimate f? Parametric vs. Non-parametric methods
Model accuracy: Trade-offs between bias and variance
Supervised vs. Unsupervised learning
Regression vs. Classification
Key Concepts
The Statistical Learning Framework
Given:
Predictors/Features: X = (X₁, X₂, …, Xₚ)
Response/Output: Y
We assume there exists a relationship:
Y = f(X) + ε
Where:
f is the unknown function we want to estimate
ε is the irreducible error (mean zero)
Goals:
Prediction: Estimate Y given new X values
Inference: Understand the relationship between X and Y
# Imports used throughout this chapter
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Generate synthetic data to demonstrate concepts
# True function: f(x) = 2 + 3*x + 5*x^2
np.random.seed(42)
n = 100
X_train = np.random.uniform(-2, 2, n)
X_test = np.random.uniform(-2, 2, 50)
# True function
def true_function(x):
    return 2 + 3*x + 5*x**2
# Add noise
epsilon_std = 3
y_train = true_function(X_train) + np.random.normal(0, epsilon_std, n)
y_test = true_function(X_test) + np.random.normal(0, epsilon_std, 50)
# Visualize the data
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
x_smooth = np.linspace(-2, 2, 200)
plt.scatter(X_train, y_train, alpha=0.5, label='Training data')
plt.plot(x_smooth, true_function(x_smooth), 'r-', linewidth=2, label='True f(x)')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Statistical Learning: Y = f(X) + ε')
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
# Visualize the irreducible error: vertical lines show the noise ε
predictions = true_function(X_train)
plt.scatter(X_train, y_train, alpha=0.5, label='Observed Y')
plt.scatter(X_train, predictions, alpha=0.5, color='orange', label='f(X) - no noise')
for i in range(min(20, n)):
    plt.plot([X_train[i], X_train[i]], [y_train[i], predictions[i]],
             'k-', alpha=0.2, linewidth=0.5)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Irreducible Error (noise ε)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Training samples: {n}")
print(f"Test samples: {len(X_test)}")
print(f"Noise std deviation (irreducible error): {epsilon_std}")
2.1 Prediction vs. Inference
Prediction
Goal: Accurately predict Y for new X values
Example: Predict house price given features (location, size, etc.)
Don't care about the form of f(X)
Only care about accuracy
Inference
Goal: Understand the relationship between X and Y
Questions we want to answer:
Which predictors are most important?
What is the relationship between Y and each Xⱼ?
Is the relationship linear, non-linear, or more complex?
Example: How does advertising budget affect sales?
Need interpretable model
Care about the form of f(X)
# Demonstration: Parametric vs. Non-parametric Methods
# Reshape for sklearn
X_train_2d = X_train.reshape(-1, 1)
X_test_2d = X_test.reshape(-1, 1)
# 1. PARAMETRIC METHOD: Linear Regression (assumes linear form)
linear_model = LinearRegression()
linear_model.fit(X_train_2d, y_train)
y_pred_linear = linear_model.predict(X_test_2d)
# 2. PARAMETRIC METHOD: Polynomial Regression (assumes polynomial form)
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2)
X_train_poly = poly_features.fit_transform(X_train_2d)
X_test_poly = poly_features.transform(X_test_2d)
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
y_pred_poly = poly_model.predict(X_test_poly)
# 3. NON-PARAMETRIC METHOD: K-Nearest Neighbors
knn_model = KNeighborsRegressor(n_neighbors=10)
knn_model.fit(X_train_2d, y_train)
y_pred_knn = knn_model.predict(X_test_2d)
# Visualization
x_plot = np.linspace(-2, 2, 200).reshape(-1, 1)
x_plot_poly = poly_features.transform(x_plot)
plt.figure(figsize=(15, 5))
# Linear model
plt.subplot(1, 3, 1)
plt.scatter(X_train, y_train, alpha=0.3, label='Training data')
plt.plot(x_plot, linear_model.predict(x_plot), 'r-', linewidth=2, label='Linear fit')
plt.plot(x_smooth, true_function(x_smooth), 'g--', linewidth=2, label='True f(x)', alpha=0.7)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Parametric: Linear Regression\n(Assumes linear relationship)')
plt.legend()
plt.grid(True, alpha=0.3)
# Polynomial model
plt.subplot(1, 3, 2)
plt.scatter(X_train, y_train, alpha=0.3, label='Training data')
plt.plot(x_plot, poly_model.predict(x_plot_poly), 'r-', linewidth=2, label='Polynomial fit (deg=2)')
plt.plot(x_smooth, true_function(x_smooth), 'g--', linewidth=2, label='True f(x)', alpha=0.7)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Parametric: Polynomial Regression\n(Assumes polynomial relationship)')
plt.legend()
plt.grid(True, alpha=0.3)
# KNN model
plt.subplot(1, 3, 3)
plt.scatter(X_train, y_train, alpha=0.3, label='Training data')
plt.plot(x_plot, knn_model.predict(x_plot), 'r-', linewidth=2, label='KNN fit (k=10)')
plt.plot(x_smooth, true_function(x_smooth), 'g--', linewidth=2, label='True f(x)', alpha=0.7)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Non-parametric: K-Nearest Neighbors\n(No assumed form)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Calculate MSE for each method
mse_linear = mean_squared_error(y_test, y_pred_linear)
mse_poly = mean_squared_error(y_test, y_pred_poly)
mse_knn = mean_squared_error(y_test, y_pred_knn)
print("\nTest Set Performance (MSE):")
print(f"Linear Regression:     {mse_linear:.2f}")
print(f"Polynomial Regression: {mse_poly:.2f} ✅ Best (matches true form)")
print(f"KNN (k=10):            {mse_knn:.2f}")
2.2 The Bias-Variance Trade-Off
A fundamental concept in statistical learning:
Test MSE = Bias² + Variance + Irreducible Error
Bias
Error from approximating a real-life problem with a simpler model
High bias → model is too simple (underfitting)
Example: Using a linear model for a non-linear relationship
Variance
Amount by which the estimate would change with different training data
High variance → model is too complex (overfitting)
Example: A high-degree polynomial fitting noise
The Trade-Off
Simple models: High bias, low variance
Complex models: Low bias, high variance
Goal: Find the sweet spot that minimizes test error
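The decomposition above can be checked empirically: refit a model of each complexity on many fresh training sets and measure, at a single test point, how far the average prediction misses the truth (bias²) and how much the predictions scatter around their mean (variance). A minimal, self-contained sketch using this notebook's simulation setup (true f(x) = 2 + 3x + 5x², noise sd 3); the test point x₀ = 0.5 and the simulation counts are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def f(x):
    # True function, matching the simulation in this chapter
    return 2 + 3*x + 5*x**2

x0 = np.array([[0.5]])   # point at which we decompose the error
sigma = 3.0              # irreducible noise std

def avg_prediction(degree, n_sims=500, n=100):
    """Refit a degree-`degree` polynomial on many training sets;
    return the predictions at x0 across simulations."""
    preds = np.empty(n_sims)
    poly = PolynomialFeatures(degree=degree)
    for i in range(n_sims):
        X = rng.uniform(-2, 2, n).reshape(-1, 1)
        y = f(X.ravel()) + rng.normal(0, sigma, n)
        model = LinearRegression().fit(poly.fit_transform(X), y)
        preds[i] = model.predict(poly.transform(x0))[0]
    return preds

for degree in (1, 2, 10):
    preds = avg_prediction(degree)
    bias_sq = (preds.mean() - f(0.5))**2   # squared bias at x0
    variance = preds.var()                  # variance of the fit at x0
    print(f"degree {degree:2d}: bias^2={bias_sq:7.3f}  variance={variance:7.3f}")
```

Degree 1 shows large bias² (it cannot represent the quadratic), degree 2 shows near-zero bias, and degree 10 trades a little bias for noticeably higher variance.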
# Demonstrate bias-variance tradeoff with polynomial models of varying complexity
degrees = [1, 2, 3, 5, 10, 15]
train_errors = []
test_errors = []
plt.figure(figsize=(15, 10))
for idx, degree in enumerate(degrees, 1):
    # Fit polynomial model
    poly = PolynomialFeatures(degree=degree)
    X_train_p = poly.fit_transform(X_train_2d)
    X_test_p = poly.transform(X_test_2d)
    model = LinearRegression()
    model.fit(X_train_p, y_train)
    # Predictions
    y_train_pred = model.predict(X_train_p)
    y_test_pred = model.predict(X_test_p)
    # Errors
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    train_errors.append(train_mse)
    test_errors.append(test_mse)
    # Plot
    plt.subplot(2, 3, idx)
    x_plot_p = poly.transform(x_plot)
    plt.scatter(X_train, y_train, alpha=0.3, s=20, label='Training')
    plt.plot(x_plot, model.predict(x_plot_p), 'r-', linewidth=2, label=f'Degree {degree} fit')
    plt.plot(x_smooth, true_function(x_smooth), 'g--', linewidth=2, label='True f(x)', alpha=0.7)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title(f'Degree {degree}\nTrain MSE: {train_mse:.2f}, Test MSE: {test_mse:.2f}')
    plt.legend(fontsize=8)
    plt.grid(True, alpha=0.3)
    plt.ylim(-10, 30)
plt.tight_layout()
plt.show()
# Plot bias-variance tradeoff curve
plt.figure(figsize=(10, 6))
plt.plot(degrees, train_errors, 'o-', linewidth=2, markersize=8, label='Training MSE')
plt.plot(degrees, test_errors, 's-', linewidth=2, markersize=8, label='Test MSE')
plt.axhline(y=epsilon_std**2, color='r', linestyle='--', label=f'Irreducible Error (~{epsilon_std**2:.0f})')
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Mean Squared Error')
plt.title('Bias-Variance Trade-Off')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(degrees)
# Annotate regions
plt.text(1.5, max(test_errors)*0.8, 'HIGH BIAS\nLOW VARIANCE\n(Underfitting)',
ha='center', fontsize=10, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.text(12, max(test_errors)*0.8, 'LOW BIAS\nHIGH VARIANCE\n(Overfitting)',
ha='center', fontsize=10, bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.5))
plt.show()
best_degree = degrees[np.argmin(test_errors)]
print(f"\nBest model complexity: Degree {best_degree}")
print(f"  Minimum test MSE: {min(test_errors):.2f}")
print(f"  True underlying function is degree 2 ✅")
# Demonstrate model flexibility with KNN (varying k)
k_values = [1, 3, 5, 10, 20, 50]
train_errors_knn = []
test_errors_knn = []
plt.figure(figsize=(15, 10))
for idx, k in enumerate(k_values, 1):
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train_2d, y_train)
    y_train_pred = model.predict(X_train_2d)
    y_test_pred = model.predict(X_test_2d)
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    train_errors_knn.append(train_mse)
    test_errors_knn.append(test_mse)
    plt.subplot(2, 3, idx)
    plt.scatter(X_train, y_train, alpha=0.3, s=20, label='Training')
    plt.plot(x_plot, model.predict(x_plot), 'r-', linewidth=2, label=f'KNN (k={k})')
    plt.plot(x_smooth, true_function(x_smooth), 'g--', linewidth=2, label='True f(x)', alpha=0.7)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title(f'K = {k}\nTrain MSE: {train_mse:.2f}, Test MSE: {test_mse:.2f}')
    plt.legend(fontsize=8)
    plt.grid(True, alpha=0.3)
    plt.ylim(-10, 30)
plt.tight_layout()
plt.show()
# Plot flexibility curve (note: for KNN, smaller k = more flexible)
flexibility = [1/k for k in k_values] # 1/k represents flexibility
plt.figure(figsize=(10, 6))
plt.plot(flexibility, train_errors_knn, 'o-', linewidth=2, markersize=8, label='Training MSE')
plt.plot(flexibility, test_errors_knn, 's-', linewidth=2, markersize=8, label='Test MSE')
plt.axhline(y=epsilon_std**2, color='r', linestyle='--', label=f'Irreducible Error (~{epsilon_std**2:.0f})')
plt.xlabel('Model Flexibility (1/k)')
plt.ylabel('Mean Squared Error')
plt.title('Bias-Variance Trade-Off for KNN')
plt.legend()
plt.grid(True, alpha=0.3)
# Add k labels on top axis
ax = plt.gca()
ax2 = ax.twiny()
ax2.set_xlim(ax.get_xlim())
ax2.set_xticks(flexibility)
ax2.set_xticklabels([f'k={k}' for k in k_values])
ax2.set_xlabel('Number of Neighbors (k)')
plt.show()
best_k = k_values[np.argmin(test_errors_knn)]
print(f"\nBest K value: {best_k}")
print(f"  Minimum test MSE: {min(test_errors_knn):.2f}")
print(f"  Smaller k = more flexible; larger k = less flexible")
2.3 Classification Setting
Key Differences from Regression
Regression: Predict quantitative output (continuous)
Classification: Predict qualitative output (categorical)
Performance Metrics
Training Error Rate:
Error Rate = (1/n) × Σ I(yᵢ ≠ ŷᵢ)
Bayes Classifier (optimal, but unknown):
Assigns each observation to the most likely class given its predictors
Produces the lowest possible test error rate (the Bayes error rate)
K-Nearest Neighbors Classifier:
Estimates the conditional probability P(Y = j | X = x₀)
Classification: majority vote among the k nearest neighbors
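That neighbor-fraction estimate of P(Y = j | X = x₀) is exactly what scikit-learn exposes through `predict_proba`. A small sketch on hypothetical one-dimensional two-class data (the data, the query point, and k = 10 are illustrative choices):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical 1-D data: class 0 centered at -1, class 1 centered at +1
X_demo = np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)]).reshape(-1, 1)
y_demo = np.repeat([0, 1], 100)

knn_demo = KNeighborsClassifier(n_neighbors=10).fit(X_demo, y_demo)
x0 = np.array([[0.2]])

# predict_proba returns the fraction of the k nearest neighbors in each
# class, i.e. the KNN estimate of P(Y = j | X = x0)
proba = knn_demo.predict_proba(x0)[0]
print(proba, proba.sum())       # the two fractions sum to 1
print(knn_demo.predict(x0))     # majority vote = most probable class
```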
# Classification example with synthetic 2D data
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from matplotlib.colors import ListedColormap
# Generate classification data
np.random.seed(42)
n_class = 200
# Class 1: centered at (-1, -1)
X_class1 = np.random.randn(n_class//2, 2) * 0.6 + np.array([-1, -1])
y_class1 = np.zeros(n_class//2)
# Class 2: centered at (1, 1)
X_class2 = np.random.randn(n_class//2, 2) * 0.6 + np.array([1, 1])
y_class2 = np.ones(n_class//2)
X_class = np.vstack([X_class1, X_class2])
y_class = np.hstack([y_class1, y_class2])
# Split data
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
X_class, y_class, test_size=0.3, random_state=42
)
# Create mesh for decision boundary
h = 0.02
x_min, x_max = X_class[:, 0].min() - 1, X_class[:, 0].max() + 1
y_min, y_max = X_class[:, 1].min() - 1, X_class[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# Try different K values
k_values_class = [1, 5, 15, 50]
cmap_light = ListedColormap(['#FFAAAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#0000FF'])
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()
for idx, k in enumerate(k_values_class):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_c, y_train_c)
    # Predict on mesh
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap=cmap_light)
    axes[idx].scatter(X_train_c[:, 0], X_train_c[:, 1], c=y_train_c,
                      cmap=cmap_bold, alpha=0.6, edgecolor='k', s=50)
    # Calculate accuracy
    train_acc = accuracy_score(y_train_c, knn.predict(X_train_c))
    test_acc = accuracy_score(y_test_c, knn.predict(X_test_c))
    axes[idx].set_xlabel('X₁')
    axes[idx].set_ylabel('X₂')
    axes[idx].set_title(f'KNN Classification (k={k})\nTrain Acc: {train_acc:.3f}, Test Acc: {test_acc:.3f}')
    axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nClassification Accuracy Summary:")
for k in k_values_class:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_c, y_train_c)
    train_acc = accuracy_score(y_train_c, knn.predict(X_train_c))
    test_acc = accuracy_score(y_test_c, knn.predict(X_test_c))
    print(f"k={k:2d}: Train: {train_acc:.3f}  Test: {test_acc:.3f}")
print("\nKey Observations:")
print("  • k=1: High flexibility → overfitting (perfect train accuracy, lower test accuracy)")
print("  • k=50: Low flexibility → underfitting (boundary too smooth)")
print("  • Middle values (k=5, 15) balance bias and variance well")
2.4 Supervised vs. Unsupervised Learning
Supervised Learning
Have labeled data: Both X (predictors) and Y (response)
Goal: Learn a function f: X → Y
Examples:
Linear regression
Logistic regression
K-Nearest Neighbors
Support Vector Machines
Neural Networks
Applications: Prediction, classification
Unsupervised Learning
Have unlabeled data: Only X (no response Y)
Goal: Discover patterns, structure, or relationships in data
Examples:
Clustering (K-means, hierarchical)
Principal Component Analysis (PCA)
Matrix completion
Applications: Customer segmentation, anomaly detection, dimensionality reduction
# Demonstrate unsupervised learning with clustering
from sklearn.cluster import KMeans
# Generate data with hidden structure (3 clusters)
np.random.seed(42)
cluster_centers = np.array([[0, 0], [4, 4], [0, 4]])
n_samples = 150
n_clusters = 3
X_cluster = []
for center in cluster_centers:
    points = np.random.randn(n_samples//n_clusters, 2) * 0.7 + center
    X_cluster.append(points)
X_cluster = np.vstack(X_cluster)
# Try different numbers of clusters
fig, axes = plt.subplots(1, 4, figsize=(18, 4))
# Raw data (no labels)
axes[0].scatter(X_cluster[:, 0], X_cluster[:, 1], alpha=0.6, s=50)
axes[0].set_title('Unlabeled Data\n(Unsupervised Learning)')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].grid(True, alpha=0.3)
# Try k=2, 3, 4 clusters
for idx, k in enumerate([2, 3, 4], 1):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_cluster)
    centers = kmeans.cluster_centers_
    axes[idx].scatter(X_cluster[:, 0], X_cluster[:, 1], c=labels,
                      cmap='viridis', alpha=0.6, s=50)
    axes[idx].scatter(centers[:, 0], centers[:, 1], c='red',
                      marker='X', s=200, edgecolor='black', linewidth=2,
                      label='Centroids')
    axes[idx].set_title(f'K-Means Clustering\n(k={k} clusters)')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Unsupervised Learning Insights:")
print("  • No labeled responses: the algorithm finds patterns on its own")
print("  • k=3 correctly identifies the true structure")
print("  • Choosing the right k is a challenge (often via the elbow method or silhouette score)")
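Those two heuristics fit in a few lines: the elbow method tracks the within-cluster sum of squares (`inertia_`) as k grows and looks for where it stops dropping sharply, while `silhouette_score` gives a single quality number per k. A self-contained sketch on data generated like the three blobs above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Three well-separated Gaussian blobs, as in the demo above
centers_sil = np.array([[0, 0], [4, 4], [0, 4]])
X_sil = np.vstack([rng.normal(c, 0.7, size=(50, 2)) for c in centers_sil])

for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_sil)
    sil = silhouette_score(X_sil, km.labels_)
    print(f"k={k}: inertia={km.inertia_:8.1f}  silhouette={sil:.3f}")
# Elbow method: pick the k where inertia stops dropping sharply;
# the silhouette score typically peaks at the true k (here 3).
```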
Key Takeaways
1. Statistical Learning Framework
Goal: Estimate f in Y = f(X) + ε
Two purposes: Prediction (accuracy) vs. Inference (interpretability)
2. Methods
Parametric: Assume a functional form (faster, less flexible, may not match reality)
Non-parametric: No assumptions about form (flexible, needs more data, can overfit)
3. Bias-Variance Trade-Off
Test MSE = Bias² + Variance + Irreducible Error
Simple models: High bias, low variance
Complex models: Low bias, high variance
Goal: Minimize test error (not training error!)
4. Problem Types
Regression: Quantitative output (price, temperature)
Classification: Qualitative output (spam/not spam, disease type)
5. Learning Paradigms
Supervised: Have Y labels → predict new Y
Unsupervised: No Y labels → find structure
6. Model Assessment
Always evaluate on test data (not training data)
Training error can be misleading (overfitting)
Cross-validation helps estimate test error
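The cross-validation point can be made concrete with `cross_val_score`: each fold is held out exactly once, so the averaged score estimates test error without reserving a separate test set. A sketch on the same kind of synthetic quadratic data used throughout this chapter:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X_cv = rng.uniform(-2, 2, 100).reshape(-1, 1)
y_cv = 2 + 3*X_cv.ravel() + 5*X_cv.ravel()**2 + rng.normal(0, 3, 100)

for degree in (1, 2, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # 5-fold CV: every observation is held out exactly once,
    # so the averaged (negated) score estimates the test MSE
    scores = cross_val_score(model, X_cv, y_cv, cv=5,
                             scoring='neg_mean_squared_error')
    print(f"degree {degree:2d}: CV-estimated test MSE = {-scores.mean():.2f}")
```

Degree 2 should come out near the irreducible error (σ² = 9), while degree 1 pays a large bias penalty, all without ever fitting on the held-out fold.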
Next Steps
The following chapters dive deeper into specific methods:
Chapter 3: Linear Regression (parametric, regression)
Chapter 4: Classification (logistic regression, LDA, QDA, KNN)
Chapter 5: Resampling (cross-validation, bootstrap)
Chapter 6: Regularization (Ridge, Lasso)
Chapters 7-10: Advanced methods (splines, trees, SVM, deep learning)
Chapters 11-13: Special topics (survival, unsupervised, testing)
Practice Exercises
Exercise 1: Conceptual
Explain the difference between parametric and non-parametric methods. What are the advantages and disadvantages of each?
Exercise 2: Bias-Variance
Generate a new dataset and experiment with polynomial degrees 1 through 20. Plot the training and test MSE. Identify the point of overfitting.
Exercise 3: KNN Flexibility
For the classification example, try k values from 1 to 100. Plot test accuracy vs. k. What is the optimal k?
Exercise 4: Clustering
Generate data with 5 clusters and use K-means with k=3, 4, 5, 6, 7. Use the elbow method to determine the optimal number of clusters.
Exercise 5: Real Data
Download a real dataset (e.g., from sklearn.datasets) and:
Split into train/test
Fit multiple models with different complexity
Plot bias-variance tradeoff
Select the best model based on test performance
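A possible starting point for Exercise 5 (the diabetes dataset and the KNN model family are just one choice; any sklearn dataset and any model with a complexity knob will do):

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Load a real dataset and split into train/test
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Vary flexibility via k and compare train vs. test MSE
for k in (1, 5, 20, 50):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(X_tr))
    te = mean_squared_error(y_te, model.predict(X_te))
    print(f"k={k:2d}: train MSE={tr:8.1f}  test MSE={te:8.1f}")
```

As in the synthetic examples, pick the model by test MSE: the training MSE at k=1 is deceptively small while its test MSE reveals the overfitting.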