Chapter 1: Introduction to Statistical Learning
What is Statistical Learning?
Statistical learning refers to a set of tools for understanding data. These tools can be broadly classified into:
Supervised Learning
Build a model to predict or estimate an output based on inputs
We have labeled data (known outcomes)
Examples: regression, classification
Unsupervised Learning
Find patterns and structure in data
No labeled outcomes
Examples: clustering, dimensionality reduction
Why Statistical Learning?
In many situations we want to:
Predict future outcomes
Will this customer churn?
What will house prices be next year?
Is this email spam?
Understand relationships
How does advertising affect sales?
Which genes are associated with disease?
What factors influence customer satisfaction?
Discover patterns
Customer segmentation
Anomaly detection
Topic modeling
The Learning Framework
Notation
Input variables: \(X = (X_1, X_2, \ldots, X_p)\)
Also called: features, predictors, independent variables
Output variable: \(Y\)
Also called: response, target, dependent variable
Relationship: \(Y = f(X) + \epsilon\)
\(f\): systematic information that \(X\) provides about \(Y\)
\(\epsilon\): random error (irreducible)
Goal
Estimate \(f\) using observed data to:
Predict \(Y\) for new \(X\) values
Infer which \(X_j\) are important
Understand the relationship between \(X\) and \(Y\)
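Both goals can be illustrated with simulated data where the true \(f\) is known. A minimal numpy sketch, assuming a hypothetical linear \(f(x) = 2x + 1\) with Gaussian noise (numbers invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 10, size=n)
f_true = lambda x: 2.0 * x + 1.0    # the systematic part (unknown in practice)
eps = rng.normal(0, 2.0, size=n)    # irreducible error, Var(eps) = 4
Y = f_true(X) + eps

# Estimate f by least squares: Y ≈ b0 + b1 * X
A = np.column_stack([np.ones(n), X])
b0, b1 = np.linalg.lstsq(A, Y, rcond=None)[0]
print(f"estimated f: {b0:.2f} + {b1:.2f} * X")   # close to 1 + 2x

# Even with f estimated well, residual variance stays near Var(eps)
resid = Y - (b0 + b1 * X)
print(f"residual variance: {resid.var():.2f}")
```

No matter how well \(f\) is estimated, expected test error cannot drop below the variance of \(\epsilon\); that floor is the irreducible error.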
Real-World Applications
| Domain | Problem | Type | Methods |
|---|---|---|---|
| Healthcare | Disease diagnosis | Classification | Logistic, SVM, Neural Nets |
| Finance | Stock price prediction | Regression | Time series, Random Forest |
| Marketing | Customer segmentation | Clustering | K-Means, Hierarchical |
| E-commerce | Product recommendation | Collaborative filtering | Matrix factorization |
| Manufacturing | Quality control | Classification | Decision trees, SVM |
| Social Media | Sentiment analysis | NLP + Classification | Naive Bayes, Deep learning |
| Genomics | Gene expression | Multiple testing | ANOVA, FDR control |
| Insurance | Risk assessment | Regression | GLM, GAM |
Course Overview
This book covers:
Foundations (Chapters 2-5)
Statistical learning framework
Linear regression
Classification methods
Model validation
Advanced Supervised (Chapters 6-10)
Regularization
Non-linear methods
Tree-based methods
Support vector machines
Deep learning
Specialized Topics (Chapters 11-13)
Survival analysis
Unsupervised learning
Multiple testing
1.1 Real Data Examples
Let's explore several real datasets to understand different types of learning problems.
# Example 1: Regression - California Housing
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

print("Example 1: REGRESSION PROBLEM")
print("="*70)
# Load data
housing = fetch_california_housing(as_frame=True)
X = housing.data
y = housing.target
print("\nCalifornia Housing Dataset")
print(f" Samples: {len(X):,}")
print(f" Features: {X.shape[1]}")
print(f" Target: Median house value (in $100,000s)")
print(f"\n Features: {list(X.columns)}")
# Show sample
print("\nFirst 3 samples:")
display(pd.concat([X.head(3), y.head(3).rename('MedianValue')], axis=1))
# Basic statistics
print("\nTarget Statistics:")
print(f" Mean: ${y.mean()*100:.2f}k")
print(f" Median: ${y.median()*100:.2f}k")
print(f" Std: ${y.std()*100:.2f}k")
print(f" Range: ${y.min()*100:.2f}k - ${y.max()*100:.2f}k")
# Simple model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("\nSimple Linear Regression:")
print(f"  R² score: {score:.4f}")
print(f"  Interpretation: Model explains {score*100:.2f}% of variance")
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Target distribution
axes[0].hist(y, bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Median House Value ($100k)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of House Prices')
axes[0].grid(True, alpha=0.3)
# Feature correlation
corr = X.corrwith(y).sort_values(ascending=False)
axes[1].barh(corr.index, corr.values, color=['green' if v > 0 else 'red' for v in corr.values], alpha=0.7)
axes[1].set_xlabel('Correlation with Price')
axes[1].set_title('Feature Importance (Correlation)')
axes[1].grid(True, alpha=0.3, axis='x')
# Predictions vs Actual
y_pred = model.predict(X_test)
axes[2].scatter(y_test, y_pred, alpha=0.3)
axes[2].plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
axes[2].set_xlabel('Actual Price ($100k)')
axes[2].set_ylabel('Predicted Price ($100k)')
axes[2].set_title(f'Predictions vs Actual (R²={score:.3f})')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nKey Insight: MedInc (median income) is the strongest predictor!")
# Example 2: Classification - Breast Cancer
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
import seaborn as sns

print("\n" + "="*70)
print("Example 2: CLASSIFICATION PROBLEM")
print("="*70)
# Load data
cancer = load_breast_cancer(as_frame=True)
X = cancer.data
y = cancer.target
print("\nBreast Cancer Dataset")
print(f" Samples: {len(X):,}")
print(f" Features: {X.shape[1]}")
print(f" Classes: {cancer.target_names}")
print(f" Class distribution:")
print(f" Malignant: {(y==0).sum()} ({(y==0).sum()/len(y)*100:.1f}%)")
print(f" Benign: {(y==1).sum()} ({(y==1).sum()/len(y)*100:.1f}%)")
# Sample features
print(f"\n Sample features: {list(X.columns[:5])} ... (30 total)")
# Simple model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("\nLogistic Regression:")
print(f" Accuracy: {score:.4f} ({score*100:.2f}%)")
# Predictions
y_pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(f"")
print(f" Predicted")
print(f" Malignant Benign")
print(f" Actual Malignant {cm[0,0]:3d} {cm[0,1]:3d}")
print(f" Benign {cm[1,0]:3d} {cm[1,1]:3d}")
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Class distribution
class_counts = pd.Series(y).value_counts().sort_index()
axes[0].bar(['Malignant', 'Benign'], class_counts.values,
color=['red', 'green'], alpha=0.7, edgecolor='black')
axes[0].set_ylabel('Count')
axes[0].set_title('Class Distribution')
axes[0].grid(True, alpha=0.3, axis='y')
# Feature importance (coefficients)
coef = pd.Series(model.coef_[0], index=X.columns).sort_values()
top_features = pd.concat([coef.head(5), coef.tail(5)])
colors = ['red' if x < 0 else 'green' for x in top_features.values]
axes[1].barh(range(len(top_features)), top_features.values, color=colors, alpha=0.7)
axes[1].set_yticks(range(len(top_features)))
axes[1].set_yticklabels(top_features.index, fontsize=8)
axes[1].set_xlabel('Coefficient Value')
axes[1].set_title('Top 10 Most Important Features')
axes[1].grid(True, alpha=0.3, axis='x')
# Confusion matrix heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Malignant', 'Benign'],
yticklabels=['Malignant', 'Benign'],
ax=axes[2], cbar=False)
axes[2].set_ylabel('Actual')
axes[2].set_xlabel('Predicted')
axes[2].set_title(f'Confusion Matrix (Acc={score:.3f})')
plt.tight_layout()
plt.show()
print("\nKey Insight: Model achieves high accuracy, with worst radius as a key predictor!")
# Example 3: Unsupervised Learning - Iris Clustering
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

print("\n" + "="*70)
print("Example 3: UNSUPERVISED LEARNING (Clustering)")
print("="*70)
# Load data
iris = load_iris(as_frame=True)
X = iris.data
y_true = iris.target  # Not used for clustering, only for evaluating the result
print("\nIris Dataset")
print(f" Samples: {len(X)}")
print(f" Features: {X.shape[1]} - {list(X.columns)}")
print(f" Species: {list(iris.target_names)}")
print(f"\n Task: Discover natural groups WITHOUT using species labels")
# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X)
# Evaluate how well clusters match true species
from sklearn.metrics import adjusted_rand_score, silhouette_score
ari = adjusted_rand_score(y_true, clusters)
silhouette = silhouette_score(X, clusters)
print("\nK-Means Clustering (k=3):")
print(f" Adjusted Rand Index: {ari:.4f}")
print(f" (1.0 = perfect match with true species)")
print(f" Silhouette Score: {silhouette:.4f}")
print(f" (Higher = better separated clusters)")
# Cluster sizes
print(f"\n Cluster sizes:")
for i in range(3):
print(f" Cluster {i}: {(clusters==i).sum()} samples")
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# True species (for comparison)
for i, species in enumerate(iris.target_names):
mask = y_true == i
axes[0].scatter(X.loc[mask, 'petal length (cm)'],
X.loc[mask, 'petal width (cm)'],
label=species, alpha=0.6, s=100)
axes[0].set_xlabel('Petal Length (cm)')
axes[0].set_ylabel('Petal Width (cm)')
axes[0].set_title('True Species Labels (Ground Truth)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Discovered clusters
scatter = axes[1].scatter(X['petal length (cm)'],
X['petal width (cm)'],
c=clusters, cmap='viridis', alpha=0.6, s=100)
# Plot centroids
centroids = kmeans.cluster_centers_
axes[1].scatter(centroids[:, 2], centroids[:, 3],
c='red', marker='X', s=300, edgecolor='black', linewidth=2,
label='Centroids')
axes[1].set_xlabel('Petal Length (cm)')
axes[1].set_ylabel('Petal Width (cm)')
axes[1].set_title(f'K-Means Clusters (ARI={ari:.3f})')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.colorbar(scatter, ax=axes[1], label='Cluster')
plt.tight_layout()
plt.show()
print("\nKey Insight: Unsupervised clustering discovers species groups without labels!")
1.2 Types of Learning Problems
Summary of Examples
| Example | Type | Input (X) | Output (Y) | Goal | Method |
|---|---|---|---|---|---|
| Housing | Regression | House features | Price | Predict | Linear Regression |
| Cancer | Classification | Cell measurements | Malignant/Benign | Classify | Logistic Regression |
| Iris | Clustering | Flower measurements | (none) | Group | K-Means |
Regression vs Classification
Regression (quantitative Y):
Predict numerical values
Examples: price, temperature, sales
Metrics: MSE, R², MAE
Classification (qualitative Y):
Predict categories/classes
Examples: spam/not spam, disease type
Metrics: accuracy, precision, recall, AUC
Supervised vs Unsupervised
Supervised (have Y):
Regression
Classification
Goal: predict Y from X
Unsupervised (no Y):
Clustering
Dimensionality reduction (PCA)
Goal: find structure in X
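The clustering side of unsupervised learning is demonstrated in the Iris example above; the dimensionality-reduction side can be sketched with a numpy-only PCA via the SVD (in practice one would typically use sklearn.decomposition.PCA; the synthetic data here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# 300 samples in 4 dimensions, but almost all variance lies along one direction
latent = rng.normal(size=(300, 1))
X = latent @ np.array([[2.0, 1.0, -1.0, 0.5]]) + rng.normal(0, 0.1, size=(300, 4))

# PCA by hand: center the data, then take the SVD
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)      # variance explained per component
print("explained ratios:", np.round(explained, 3))

# Project onto the first principal component: 4-D -> 1-D with little loss
Z = Xc @ Vt[0]
print("reduced shape:", Z.shape)
```

Here the first component captures nearly all the variance, so the 4-D data can be summarized by a single coordinate per sample: structure found in X without any Y.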
1.3 The Machine Learning Workflow
Standard Process
1. Problem Definition
   └─> What are we trying to predict/understand?
2. Data Collection
   └─> Gather relevant data
3. Exploratory Data Analysis (EDA)
   └─> Understand patterns, distributions, correlations
4. Data Preparation
   ├─> Handle missing values
   ├─> Encode categorical variables
   ├─> Scale/normalize features
   └─> Split train/test sets
5. Model Selection
   └─> Choose appropriate algorithm(s)
6. Training
   └─> Fit model on training data
7. Evaluation
   ├─> Assess performance on test data
   └─> Compare multiple models
8. Tuning
   └─> Optimize hyperparameters
9. Final Model
   └─> Retrain on all available data
10. Deployment
    └─> Use model in production
Key Principles
1. Train-Test Split
Always evaluate on unseen data
Typical split: 70-80% train, 20-30% test
Never "peek" at test data during training
2. Cross-Validation
More reliable than single train-test split
k-fold CV: divide data into k parts
Each part serves as test set once
3. Bias-Variance Trade-off
Simple models: high bias, low variance
Complex models: low bias, high variance
Goal: balance both
4. Overfitting vs Underfitting
Overfitting: model too complex, fits training noise
Underfitting: model too simple, misses patterns
Solution: regularization, cross-validation
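The k-fold idea in principle 2 can be written out without library helpers. A minimal numpy-only sketch of 5-fold CV for a straight-line fit (sklearn's cross_val_score wraps the same bookkeeping):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=100)
y = 2 * X + 1 + rng.normal(0, 2, size=100)

def kfold_mse(X, y, k=5):
    """Each of the k folds serves as the test set exactly once."""
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit a line on the training folds only
        b1, b0 = np.polyfit(X[train_idx], y[train_idx], 1)
        pred = b0 + b1 * X[test_idx]
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return np.mean(errors)

print(f"5-fold CV MSE: {kfold_mse(X, y):.2f}")   # near Var(eps) = 4
```

Averaging over k held-out folds gives a more stable error estimate than any single train-test split, which is why CV is preferred for model selection.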
# Demonstration: Train-Test Split and Overfitting
import numpy as np
import matplotlib.pyplot as plt

print("Demonstrating Overfitting vs Proper Fitting")
print("="*70)
# Generate synthetic data
np.random.seed(42)
X_train_demo = np.linspace(0, 10, 30)
y_train_demo = 2 * X_train_demo + 1 + np.random.randn(30) * 2
X_test_demo = np.linspace(0, 10, 100)
y_test_true = 2 * X_test_demo + 1
# Fit polynomials of different degrees
degrees = [1, 3, 15]
colors = ['green', 'blue', 'red']
labels = ['Degree 1 (Underfitting)', 'Degree 3 (Good Fit)', 'Degree 15 (Overfitting)']
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
for i, (deg, color, label) in enumerate(zip(degrees, colors, labels)):
# Fit polynomial
coeffs = np.polyfit(X_train_demo, y_train_demo, deg)
poly = np.poly1d(coeffs)
y_pred = poly(X_test_demo)
# Training error
train_error = np.mean((y_train_demo - poly(X_train_demo))**2)
# Test error (on true function)
test_error = np.mean((y_test_true - y_pred)**2)
# Plot
axes[i].scatter(X_train_demo, y_train_demo, alpha=0.5, label='Training data')
axes[i].plot(X_test_demo, y_test_true, 'k--', alpha=0.3, label='True function')
axes[i].plot(X_test_demo, y_pred, color=color, linewidth=2, label=f'Degree {deg}')
axes[i].set_xlabel('X')
axes[i].set_ylabel('Y')
axes[i].set_title(f'{label}\nTrain MSE: {train_error:.2f} | Test MSE: {test_error:.2f}')
axes[i].legend()
axes[i].grid(True, alpha=0.3)
axes[i].set_ylim([-5, 25])
plt.tight_layout()
plt.show()
print("\nObservations:")
print("  • Degree 1: Too simple, high error on both train and test (UNDERFITTING)")
print("  • Degree 3: Good balance, captures trend without noise (GOOD FIT)")
print("  • Degree 15: Fits training perfectly but wild on test (OVERFITTING)")
print("\nGoal: Minimize TEST error, not training error!")
1.4 Model Assessment
Regression Metrics
Mean Squared Error (MSE): \(MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2\)
Root Mean Squared Error (RMSE): \(RMSE = \sqrt{MSE}\)
Same units as Y
Easier to interpret
R² (Coefficient of Determination): \(R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\)
At most 1; can be negative when a model fits worse than simply predicting the mean
Proportion of variance explained
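On a toy vector of actual and predicted values (numbers invented for illustration), the three regression metrics reduce to a few numpy lines:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])    # each prediction off by 0.5

mse = np.mean((y_true - y_hat) ** 2)       # average squared error
rmse = np.sqrt(mse)                        # back in the units of Y
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                   # fraction of variance explained

print(f"MSE={mse:.2f}, RMSE={rmse:.2f}, R²={r2:.2f}")
# MSE=0.25, RMSE=0.50, R²=0.95
```

Note how RMSE (0.5) recovers the typical error in the original units, while R² compares the model's errors to those of the trivial "always predict the mean" baseline.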
Classification Metrics
Accuracy: \(Accuracy = \frac{\text{Correct predictions}}{\text{Total predictions}}\)
Confusion Matrix:
Predicted
Neg Pos
Actual Neg TN FP
Pos FN TP
Precision: \(\frac{TP}{TP + FP}\) (of predicted positives, how many correct?)
Recall: \(\frac{TP}{TP + FN}\) (of actual positives, how many found?)
F1-Score: \(2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)
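Plugging hypothetical counts into the confusion-matrix layout above shows how each metric follows directly from TP, FP, FN, and TN:

```python
# Hypothetical confusion-matrix counts, for illustration only
TP, FP, FN, TN = 40, 10, 5, 45
total = TP + FP + FN + TN

accuracy = (TP + TN) / total        # fraction of all predictions that are correct
precision = TP / (TP + FP)          # of predicted positives, how many correct?
recall = TP / (TP + FN)             # of actual positives, how many found?
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"recall={recall:.3f}, f1={f1:.3f}")
# accuracy=0.850, precision=0.800, recall=0.889, f1=0.842
```

Accuracy alone hides the asymmetry: this model misses 5 of 45 actual positives (recall 0.889) while raising 10 false alarms (precision 0.800), and F1 balances the two.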
# Comprehensive Metrics Demonstration
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, precision_score, recall_score, f1_score

print("Model Evaluation Metrics")
print("="*70)
print("="*70)
# Regression metrics (California Housing)
housing = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)
y_pred_reg = reg_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred_reg)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_reg)
print("\nREGRESSION METRICS (Housing Prices):")
print(f"  MSE:  {mse:.4f}")
print(f"  RMSE: {rmse:.4f} (in $100k units ≈ ${rmse*100:.2f}k error)")
print(f"  R²:   {r2:.4f} ({r2*100:.2f}% variance explained)")
# Classification metrics (Breast Cancer)
cancer = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, test_size=0.2, random_state=42
)
clf_model = LogisticRegression(max_iter=10000)
clf_model.fit(X_train, y_train)
y_pred_clf = clf_model.predict(X_test)
acc = accuracy_score(y_test, y_pred_clf)
prec = precision_score(y_test, y_pred_clf)
rec = recall_score(y_test, y_pred_clf)
f1 = f1_score(y_test, y_pred_clf)
print("\nCLASSIFICATION METRICS (Cancer Detection):")
print(f" Accuracy: {acc:.4f} ({acc*100:.2f}% correct)")
print(f" Precision: {prec:.4f} (of predicted benign, {prec*100:.2f}% actually benign)")
print(f" Recall: {rec:.4f} (found {rec*100:.2f}% of all benign cases)")
print(f" F1-Score: {f1:.4f} (harmonic mean of precision & recall)")
# Visualize metrics comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Regression: Actual vs Predicted
axes[0].scatter(y_test, y_pred_reg, alpha=0.3)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual Price')
axes[0].set_ylabel('Predicted Price')
axes[0].set_title(f'Regression: R²={r2:.3f}, RMSE={rmse:.3f}')
axes[0].grid(True, alpha=0.3)
# Classification: Metrics comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
values = [acc, prec, rec, f1]
bars = axes[1].bar(metrics, values, alpha=0.7, color=['blue', 'green', 'orange', 'red'], edgecolor='black')
axes[1].set_ylabel('Score')
axes[1].set_ylim([0.9, 1.0])
axes[1].set_title('Classification Metrics')
axes[1].grid(True, alpha=0.3, axis='y')
# Add value labels on bars
for bar, val in zip(bars, values):
height = bar.get_height()
axes[1].text(bar.get_x() + bar.get_width()/2., height + 0.005,
f'{val:.3f}', ha='center', va='bottom', fontweight='bold')
plt.tight_layout()
plt.show()
print("\nChoosing Metrics:")
print("  • Regression: use R² for overall fit, RMSE for error magnitude")
print("  • Classification: accuracy for balanced classes, precision/recall when classes are imbalanced")
print("  • Medical screening: prioritize recall (miss as few disease cases as possible)")
print("  • Spam filtering: prioritize precision (avoid false alarms)")
Key Takeaways
What is Statistical Learning?
Set of tools for understanding data
Estimate function \(f\) where \(Y = f(X) + \epsilon\)
Goals: predict, infer, understand
Types of Problems
Supervised Learning:
Regression (quantitative Y)
Classification (qualitative Y)
Unsupervised Learning:
Clustering
Dimensionality reduction
Critical Concepts
Train-Test Split: Always evaluate on unseen data
Bias-Variance Trade-off: Balance simplicity and complexity
Overfitting: Model fits training noise, fails on new data
Cross-Validation: More reliable than single split
Appropriate Metrics: Choose based on problem type
The Path Forward
Chapters 2-5: Foundations
Deep dive into statistical learning theory
Linear regression and classification
Model validation techniques
Chapters 6-10: Advanced Supervised Learning
Regularization methods
Non-linear approaches
Ensemble methods
Neural networks
Chapters 11-13: Specialized Topics
Time-to-event analysis
Unsupervised methods
Statistical inference with multiple tests
Best Practices
✅ Always split data before any analysis
✅ Use cross-validation for model selection
✅ Evaluate on appropriate metrics
✅ Check for overfitting
✅ Understand your data first (EDA)
✅ Start simple, then increase complexity
✅ Document your workflow
✅ Validate assumptions
Ready to Begin!
You now have the foundation to dive into statistical learning. The journey ahead will cover:
Theory: Mathematical foundations
Practice: Hands-on implementations
Applications: Real-world problems
Let's get started with Chapter 2: Statistical Learning!
Practice Exercises
Exercise 1: Dataset Exploration
Load the diabetes dataset from scikit-learn:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
How many samples and features?
What is the target variable?
Is this regression or classification?
Create visualizations of target distribution
Find which feature correlates most with target
Exercise 2: Train-Test Split
Using the diabetes dataset:
Split into 70% train, 30% test
Fit a LinearRegression model
Calculate MSE on both train and test
Calculate RΒ² on both train and test
Is the model overfitting or underfitting? Why?
Exercise 3: Classification Practice
Load the wine dataset:
from sklearn.datasets import load_wine
wine = load_wine(as_frame=True)
How many classes?
Split into train/test (80/20)
Train LogisticRegression and DecisionTreeClassifier
Compare accuracy, precision, recall for both
Which model performs better?
Exercise 4: Overfitting Investigation
Generate synthetic data:
X = np.linspace(0, 10, 50)
y = np.sin(X) + np.random.randn(50) * 0.2
Fit polynomials of degree 1, 5, 10, 20
Calculate training error for each
Generate new test data and calculate test error
Plot training vs test error by degree
What degree minimizes test error?
Exercise 5: Clustering Analysis
Using the Iris dataset:
Apply K-Means with k=2, 3, 4, 5
Calculate silhouette score for each k
Which k gives best score?
Visualize clusters for best k
Compare with true species labels
Exercise 6: Metrics Understanding
Given predictions and actual values:
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
Manually calculate confusion matrix
Calculate accuracy
Calculate precision
Calculate recall
Calculate F1-score
Verify with sklearn functions
Exercise 7: Real-World Application
Choose a dataset from UCI ML Repository or Kaggle:
Load and explore the data
Identify the problem type
Perform EDA (visualizations, statistics)
Apply appropriate model
Evaluate with proper metrics
Document your findings
Exercise 8: Workflow Practice
Complete end-to-end pipeline:
Load digits dataset (load_digits)
Split train/test
Try 3 different classifiers
Use cross-validation for each
Select best model
Final evaluation on test set
Create confusion matrix heatmap
Report findings professionally