Chapter 1: Introduction to Statistical Learning
What is Statistical Learning?
Statistical learning refers to a set of tools for understanding data. These tools can be broadly classified into:
Supervised Learning
Build a model to predict or estimate an output based on inputs
We have labeled data (known outcomes)
Examples: regression, classification
Unsupervised Learning
Find patterns and structure in data
No labeled outcomes
Examples: clustering, dimensionality reduction
Why Statistical Learning?
In many situations we want to:
Predict future outcomes
Will this customer churn?
What will house prices be next year?
Is this email spam?
Understand relationships
How does advertising affect sales?
Which genes are associated with disease?
What factors influence customer satisfaction?
Discover patterns
Customer segmentation
Anomaly detection
Topic modeling
The Learning Framework
Notation
Input variables: \(X = (X_1, X_2, \ldots, X_p)\)
Also called: features, predictors, independent variables
Output variable: \(Y\)
Also called: response, target, dependent variable
Relationship: \(Y = f(X) + \epsilon\)
\(f\): systematic information that \(X\) provides about \(Y\)
\(\epsilon\): random error (irreducible)
Goal
Estimate \(f\) using observed data to:
Predict \(Y\) for new \(X\) values
Infer which \(X_j\) are important
Understand the relationship between \(X\) and \(Y\)
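Both goals can be illustrated with simulated data where the true \(f\) is known. A minimal numpy sketch, assuming a hypothetical linear \(f(x) = 2x + 1\) with Gaussian noise (numbers invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 10, size=n)
f_true = lambda x: 2.0 * x + 1.0    # the systematic part (unknown in practice)
eps = rng.normal(0, 2.0, size=n)    # irreducible error, Var(eps) = 4
Y = f_true(X) + eps

# Estimate f by least squares: Y ≈ b0 + b1 * X
A = np.column_stack([np.ones(n), X])
b0, b1 = np.linalg.lstsq(A, Y, rcond=None)[0]
print(f"estimated f: {b0:.2f} + {b1:.2f} * X")   # close to 1 + 2x

# Even with f estimated well, residual variance stays near Var(eps)
resid = Y - (b0 + b1 * X)
print(f"residual variance: {resid.var():.2f}")
```

No matter how well \(f\) is estimated, expected test error cannot drop below the variance of \(\epsilon\); that floor is the irreducible error.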
Real-World Applications
| Domain | Problem | Type | Methods |
|---|---|---|---|
| Healthcare | Disease diagnosis | Classification | Logistic, SVM, Neural Nets |
| Finance | Stock price prediction | Regression | Time series, Random Forest |
| Marketing | Customer segmentation | Clustering | K-Means, Hierarchical |
| E-commerce | Product recommendation | Collaborative filtering | Matrix factorization |
| Manufacturing | Quality control | Classification | Decision trees, SVM |
| Social Media | Sentiment analysis | NLP + Classification | Naive Bayes, Deep learning |
| Genomics | Gene expression | Multiple testing | ANOVA, FDR control |
| Insurance | Risk assessment | Regression | GLM, GAM |
Course Overview
This book covers:
Foundations (Chapters 2-5)
Statistical learning framework
Linear regression
Classification methods
Model validation
Advanced Supervised (Chapters 6-10)
Regularization
Non-linear methods
Tree-based methods
Support vector machines
Deep learning
Specialized Topics (Chapters 11-13)
Survival analysis
Unsupervised learning
Multiple testing
1.1 Real Data Examples
Let's explore several real datasets to understand different types of learning problems.
# Example 1: Regression - California Housing
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

print("Example 1: REGRESSION PROBLEM")
print("="*70)
# Load data
housing = fetch_california_housing(as_frame=True)
X = housing.data
y = housing.target
print("\nCalifornia Housing Dataset")
print(f" Samples: {len(X):,}")
print(f" Features: {X.shape[1]}")
print(f" Target: Median house value (in $100,000s)")
print(f"\n Features: {list(X.columns)}")
# Show sample
print("\nFirst 3 samples:")
display(pd.concat([X.head(3), y.head(3).rename('MedianValue')], axis=1))
# Basic statistics
print("\nTarget Statistics:")
print(f" Mean: ${y.mean()*100:.2f}k")
print(f" Median: ${y.median()*100:.2f}k")
print(f" Std: ${y.std()*100:.2f}k")
print(f" Range: ${y.min()*100:.2f}k - ${y.max()*100:.2f}k")
# Simple model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("\nSimple Linear Regression:")
print(f"  R² score: {score:.4f}")
print(f"  Interpretation: Model explains {score*100:.2f}% of variance")
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Target distribution
axes[0].hist(y, bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Median House Value ($100k)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of House Prices')
axes[0].grid(True, alpha=0.3)
# Feature correlation
corr = X.corrwith(y).sort_values(ascending=False)
axes[1].barh(corr.index, corr.values, color=['green' if v > 0 else 'red' for v in corr.values], alpha=0.7)
axes[1].set_xlabel('Correlation with Price')
axes[1].set_title('Feature Importance (Correlation)')
axes[1].grid(True, alpha=0.3, axis='x')
# Predictions vs Actual
y_pred = model.predict(X_test)
axes[2].scatter(y_test, y_pred, alpha=0.3)
axes[2].plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
axes[2].set_xlabel('Actual Price ($100k)')
axes[2].set_ylabel('Predicted Price ($100k)')
axes[2].set_title(f'Predictions vs Actual (R²={score:.3f})')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nKey Insight: MedInc (median income) is the strongest predictor!")
# Example 2: Classification - Breast Cancer
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
import seaborn as sns

print("\n" + "="*70)
print("Example 2: CLASSIFICATION PROBLEM")
print("="*70)
# Load data
cancer = load_breast_cancer(as_frame=True)
X = cancer.data
y = cancer.target
print("\nBreast Cancer Dataset")
print(f" Samples: {len(X):,}")
print(f" Features: {X.shape[1]}")
print(f" Classes: {cancer.target_names}")
print(f" Class distribution:")
print(f" Malignant: {(y==0).sum()} ({(y==0).sum()/len(y)*100:.1f}%)")
print(f" Benign: {(y==1).sum()} ({(y==1).sum()/len(y)*100:.1f}%)")
# Sample features
print(f"\n Sample features: {list(X.columns[:5])} ... (30 total)")
# Simple model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("\nLogistic Regression:")
print(f" Accuracy: {score:.4f} ({score*100:.2f}%)")
# Predictions
y_pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(f"")
print(f" Predicted")
print(f" Malignant Benign")
print(f" Actual Malignant {cm[0,0]:3d} {cm[0,1]:3d}")
print(f" Benign {cm[1,0]:3d} {cm[1,1]:3d}")
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Class distribution
class_counts = pd.Series(y).value_counts().sort_index()
axes[0].bar(['Malignant', 'Benign'], class_counts.values,
color=['red', 'green'], alpha=0.7, edgecolor='black')
axes[0].set_ylabel('Count')
axes[0].set_title('Class Distribution')
axes[0].grid(True, alpha=0.3, axis='y')
# Feature importance (coefficients)
coef = pd.Series(model.coef_[0], index=X.columns).sort_values()
top_features = pd.concat([coef.head(5), coef.tail(5)])
colors = ['red' if x < 0 else 'green' for x in top_features.values]
axes[1].barh(range(len(top_features)), top_features.values, color=colors, alpha=0.7)
axes[1].set_yticks(range(len(top_features)))
axes[1].set_yticklabels(top_features.index, fontsize=8)
axes[1].set_xlabel('Coefficient Value')
axes[1].set_title('Top 10 Most Important Features')
axes[1].grid(True, alpha=0.3, axis='x')
# Confusion matrix heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Malignant', 'Benign'],
yticklabels=['Malignant', 'Benign'],
ax=axes[2], cbar=False)
axes[2].set_ylabel('Actual')
axes[2].set_xlabel('Predicted')
axes[2].set_title(f'Confusion Matrix (Acc={score:.3f})')
plt.tight_layout()
plt.show()
print("\nKey Insight: Model achieves high accuracy, with worst radius as a key predictor!")
# Example 3: Unsupervised Learning - Iris Clustering
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

print("\n" + "="*70)
print("Example 3: UNSUPERVISED LEARNING (Clustering)")
print("="*70)
# Load data
iris = load_iris(as_frame=True)
X = iris.data
y_true = iris.target  # Not used for clustering, only for evaluating the result
print("\nIris Dataset")
print(f" Samples: {len(X)}")
print(f" Features: {X.shape[1]} - {list(X.columns)}")
print(f" Species: {list(iris.target_names)}")
print(f"\n Task: Discover natural groups WITHOUT using species labels")
# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X)
# Evaluate how well clusters match true species
from sklearn.metrics import adjusted_rand_score, silhouette_score
ari = adjusted_rand_score(y_true, clusters)
silhouette = silhouette_score(X, clusters)
print("\nK-Means Clustering (k=3):")
print(f" Adjusted Rand Index: {ari:.4f}")
print(f" (1.0 = perfect match with true species)")
print(f" Silhouette Score: {silhouette:.4f}")
print(f" (Higher = better separated clusters)")
# Cluster sizes
print(f"\n Cluster sizes:")
for i in range(3):
print(f" Cluster {i}: {(clusters==i).sum()} samples")
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# True species (for comparison)
for i, species in enumerate(iris.target_names):
mask = y_true == i
axes[0].scatter(X.loc[mask, 'petal length (cm)'],
X.loc[mask, 'petal width (cm)'],
label=species, alpha=0.6, s=100)
axes[0].set_xlabel('Petal Length (cm)')
axes[0].set_ylabel('Petal Width (cm)')
axes[0].set_title('True Species Labels (Ground Truth)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Discovered clusters
scatter = axes[1].scatter(X['petal length (cm)'],
X['petal width (cm)'],
c=clusters, cmap='viridis', alpha=0.6, s=100)
# Plot centroids
centroids = kmeans.cluster_centers_
axes[1].scatter(centroids[:, 2], centroids[:, 3],
c='red', marker='X', s=300, edgecolor='black', linewidth=2,
label='Centroids')
axes[1].set_xlabel('Petal Length (cm)')
axes[1].set_ylabel('Petal Width (cm)')
axes[1].set_title(f'K-Means Clusters (ARI={ari:.3f})')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.colorbar(scatter, ax=axes[1], label='Cluster')
plt.tight_layout()
plt.show()
print("\nKey Insight: Unsupervised clustering discovers species groups without labels!")
1.2 Types of Learning Problems
Summary of Examples
| Example | Type | Input (X) | Output (Y) | Goal | Method |
|---|---|---|---|---|---|
| Housing | Regression | House features | Price | Predict | Linear Regression |
| Cancer | Classification | Cell measurements | Malignant/Benign | Classify | Logistic Regression |
| Iris | Clustering | Flower measurements | (none) | Group | K-Means |
Regression vs Classification
Regression (quantitative Y):
Predict numerical values
Examples: price, temperature, sales
Metrics: MSE, R², MAE
Classification (qualitative Y):
Predict categories/classes
Examples: spam/not spam, disease type
Metrics: accuracy, precision, recall, AUC
Supervised vs Unsupervised
Supervised (have Y):
Regression
Classification
Goal: predict Y from X
Unsupervised (no Y):
Clustering
Dimensionality reduction (PCA)
Goal: find structure in X
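The clustering side of unsupervised learning is demonstrated in the Iris example above; the dimensionality-reduction side can be sketched with a numpy-only PCA via the SVD (in practice one would typically use sklearn.decomposition.PCA; the synthetic data here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# 300 samples in 4 dimensions, but almost all variance lies along one direction
latent = rng.normal(size=(300, 1))
X = latent @ np.array([[2.0, 1.0, -1.0, 0.5]]) + rng.normal(0, 0.1, size=(300, 4))

# PCA by hand: center the data, then take the SVD
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)      # variance explained per component
print("explained ratios:", np.round(explained, 3))

# Project onto the first principal component: 4-D -> 1-D with little loss
Z = Xc @ Vt[0]
print("reduced shape:", Z.shape)
```

Here the first component captures nearly all the variance, so the 4-D data can be summarized by a single coordinate per sample: structure found in X without any Y.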
1.3 The Machine Learning Workflow
Standard Process
1. Problem Definition
   └─> What are we trying to predict/understand?
2. Data Collection
   └─> Gather relevant data
3. Exploratory Data Analysis (EDA)
   └─> Understand patterns, distributions, correlations
4. Data Preparation
   ├─> Handle missing values
   ├─> Encode categorical variables
   ├─> Scale/normalize features
   └─> Split train/test sets
5. Model Selection
   └─> Choose appropriate algorithm(s)
6. Training
   └─> Fit model on training data
7. Evaluation
   ├─> Assess performance on test data
   └─> Compare multiple models
8. Tuning
   └─> Optimize hyperparameters
9. Final Model
   └─> Retrain on all available data
10. Deployment
    └─> Use model in production
Key Principles
1. Train-Test Split
Always evaluate on unseen data
Typical split: 70-80% train, 20-30% test
Never "peek" at test data during training
2. Cross-Validation
More reliable than single train-test split
k-fold CV: divide data into k parts
Each part serves as test set once
3. Bias-Variance Trade-off
Simple models: high bias, low variance
Complex models: low bias, high variance
Goal: balance both
4. Overfitting vs Underfitting
Overfitting: model too complex, fits training noise
Underfitting: model too simple, misses patterns
Solution: regularization, cross-validation
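The k-fold idea in principle 2 can be written out without library helpers. A minimal numpy-only sketch of 5-fold CV for a straight-line fit (sklearn's cross_val_score wraps the same bookkeeping):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=100)
y = 2 * X + 1 + rng.normal(0, 2, size=100)

def kfold_mse(X, y, k=5):
    """Each of the k folds serves as the test set exactly once."""
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit a line on the training folds only
        b1, b0 = np.polyfit(X[train_idx], y[train_idx], 1)
        pred = b0 + b1 * X[test_idx]
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return np.mean(errors)

print(f"5-fold CV MSE: {kfold_mse(X, y):.2f}")   # near Var(eps) = 4
```

Averaging over k held-out folds gives a more stable error estimate than any single train-test split, which is why CV is preferred for model selection.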
# Demonstration: Train-Test Split and Overfitting
import numpy as np
import matplotlib.pyplot as plt

print("Demonstrating Overfitting vs Proper Fitting")
print("="*70)
# Generate synthetic data
np.random.seed(42)
X_train_demo = np.linspace(0, 10, 30)
y_train_demo = 2 * X_train_demo + 1 + np.random.randn(30) * 2
X_test_demo = np.linspace(0, 10, 100)
y_test_true = 2 * X_test_demo + 1
# Fit polynomials of different degrees
degrees = [1, 3, 15]
colors = ['green', 'blue', 'red']
labels = ['Degree 1 (Underfitting)', 'Degree 3 (Good Fit)', 'Degree 15 (Overfitting)']
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
for i, (deg, color, label) in enumerate(zip(degrees, colors, labels)):
# Fit polynomial
coeffs = np.polyfit(X_train_demo, y_train_demo, deg)
poly = np.poly1d(coeffs)
y_pred = poly(X_test_demo)
# Training error
train_error = np.mean((y_train_demo - poly(X_train_demo))**2)
# Test error (on true function)
test_error = np.mean((y_test_true - y_pred)**2)
# Plot
axes[i].scatter(X_train_demo, y_train_demo, alpha=0.5, label='Training data')
axes[i].plot(X_test_demo, y_test_true, 'k--', alpha=0.3, label='True function')
axes[i].plot(X_test_demo, y_pred, color=color, linewidth=2, label=f'Degree {deg}')
axes[i].set_xlabel('X')
axes[i].set_ylabel('Y')
axes[i].set_title(f'{label}\nTrain MSE: {train_error:.2f} | Test MSE: {test_error:.2f}')
axes[i].legend()
axes[i].grid(True, alpha=0.3)
axes[i].set_ylim([-5, 25])
plt.tight_layout()
plt.show()
print("\nObservations:")
print("  • Degree 1: Too simple, high error on both train and test (UNDERFITTING)")
print("  • Degree 3: Good balance, captures trend without noise (GOOD FIT)")
print("  • Degree 15: Fits training perfectly but wild on test (OVERFITTING)")
print("\nGoal: Minimize TEST error, not training error!")
1.4 Model Assessment
Regression Metrics
Mean Squared Error (MSE): \(MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2\)
Root Mean Squared Error (RMSE): \(RMSE = \sqrt{MSE}\)
Same units as Y
Easier to interpret
R² (Coefficient of Determination): \(R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\)
At most 1; can be negative when a model fits worse than simply predicting the mean
Proportion of variance explained
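On a toy vector of actual and predicted values (numbers invented for illustration), the three regression metrics reduce to a few numpy lines:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])    # each prediction off by 0.5

mse = np.mean((y_true - y_hat) ** 2)       # average squared error
rmse = np.sqrt(mse)                        # back in the units of Y
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                   # fraction of variance explained

print(f"MSE={mse:.2f}, RMSE={rmse:.2f}, R²={r2:.2f}")
# MSE=0.25, RMSE=0.50, R²=0.95
```

Note how RMSE (0.5) recovers the typical error in the original units, while R² compares the model's errors to those of the trivial "always predict the mean" baseline.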
Classification Metrics
Accuracy: \(Accuracy = \frac{\text{Correct predictions}}{\text{Total predictions}}\)
Confusion Matrix:
Predicted
Neg Pos
Actual Neg TN FP
Pos FN TP
Precision: \(\frac{TP}{TP + FP}\) (of predicted positives, how many correct?)
Recall: \(\frac{TP}{TP + FN}\) (of actual positives, how many found?)
F1-Score: \(2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)
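Plugging hypothetical counts into the confusion-matrix layout above shows how each metric follows directly from TP, FP, FN, and TN:

```python
# Hypothetical confusion-matrix counts, for illustration only
TP, FP, FN, TN = 40, 10, 5, 45
total = TP + FP + FN + TN

accuracy = (TP + TN) / total        # fraction of all predictions that are correct
precision = TP / (TP + FP)          # of predicted positives, how many correct?
recall = TP / (TP + FN)             # of actual positives, how many found?
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"recall={recall:.3f}, f1={f1:.3f}")
# accuracy=0.850, precision=0.800, recall=0.889, f1=0.842
```

Accuracy alone hides the asymmetry: this model misses 5 of 45 actual positives (recall 0.889) while raising 10 false alarms (precision 0.800), and F1 balances the two.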
# Comprehensive Metrics Demonstration
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, precision_score, recall_score, f1_score

print("Model Evaluation Metrics")
print("="*70)
print("="*70)
# Regression metrics (California Housing)
housing = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)
y_pred_reg = reg_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred_reg)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_reg)
print("\nREGRESSION METRICS (Housing Prices):")
print(f"  MSE:  {mse:.4f}")
print(f"  RMSE: {rmse:.4f} (in $100k units ≈ ${rmse*100:.2f}k error)")
print(f"  R²:   {r2:.4f} ({r2*100:.2f}% variance explained)")
# Classification metrics (Breast Cancer)
cancer = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, test_size=0.2, random_state=42
)
clf_model = LogisticRegression(max_iter=10000)
clf_model.fit(X_train, y_train)
y_pred_clf = clf_model.predict(X_test)
acc = accuracy_score(y_test, y_pred_clf)
prec = precision_score(y_test, y_pred_clf)
rec = recall_score(y_test, y_pred_clf)
f1 = f1_score(y_test, y_pred_clf)
print("\nCLASSIFICATION METRICS (Cancer Detection):")
print(f" Accuracy: {acc:.4f} ({acc*100:.2f}% correct)")
print(f" Precision: {prec:.4f} (of predicted benign, {prec*100:.2f}% actually benign)")
print(f" Recall: {rec:.4f} (found {rec*100:.2f}% of all benign cases)")
print(f" F1-Score: {f1:.4f} (harmonic mean of precision & recall)")
# Visualize metrics comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Regression: Actual vs Predicted
axes[0].scatter(y_test, y_pred_reg, alpha=0.3)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual Price')
axes[0].set_ylabel('Predicted Price')
axes[0].set_title(f'Regression: R²={r2:.3f}, RMSE={rmse:.3f}')
axes[0].grid(True, alpha=0.3)
# Classification: Metrics comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
values = [acc, prec, rec, f1]
bars = axes[1].bar(metrics, values, alpha=0.7, color=['blue', 'green', 'orange', 'red'], edgecolor='black')
axes[1].set_ylabel('Score')
axes[1].set_ylim([0.9, 1.0])
axes[1].set_title('Classification Metrics')
axes[1].grid(True, alpha=0.3, axis='y')
# Add value labels on bars
for bar, val in zip(bars, values):
height = bar.get_height()
axes[1].text(bar.get_x() + bar.get_width()/2., height + 0.005,
f'{val:.3f}', ha='center', va='bottom', fontweight='bold')
plt.tight_layout()
plt.show()
print("\nChoosing Metrics:")
print("  • Regression: use R² for overall fit, RMSE for error magnitude")
print("  • Classification: accuracy for balanced classes, precision/recall when classes are imbalanced")
print("  • Medical screening: prioritize recall (miss as few disease cases as possible)")
print("  • Spam filtering: prioritize precision (avoid false alarms)")
Key Takeaways
What is Statistical Learning?
Set of tools for understanding data
Estimate function \(f\) where \(Y = f(X) + \epsilon\)
Goals: predict, infer, understand
Types of Problems
Supervised Learning:
Regression (quantitative Y)
Classification (qualitative Y)
Unsupervised Learning:
Clustering
Dimensionality reduction
Critical Concepts
Train-Test Split: Always evaluate on unseen data
Bias-Variance Trade-off: Balance simplicity and complexity
Overfitting: Model fits training noise, fails on new data
Cross-Validation: More reliable than single split
Appropriate Metrics: Choose based on problem type
The Path Forward
Chapters 2-5: Foundations
Deep dive into statistical learning theory
Linear regression and classification
Model validation techniques
Chapters 6-10: Advanced Supervised Learning
Regularization methods
Non-linear approaches
Ensemble methods
Neural networks
Chapters 11-13: Specialized Topics
Time-to-event analysis
Unsupervised methods
Statistical inference with multiple tests
Best Practices
✅ Always split data before any analysis
✅ Use cross-validation for model selection
✅ Evaluate on appropriate metrics
✅ Check for overfitting
✅ Understand your data first (EDA)
✅ Start simple, then increase complexity
✅ Document your workflow
✅ Validate assumptions
Ready to Begin!
You now have the foundation to dive into statistical learning. The journey ahead will cover:
Theory: Mathematical foundations
Practice: Hands-on implementations
Applications: Real-world problems
Let's get started with Chapter 2: Statistical Learning!
Practice Exercises
Exercise 1: Dataset Exploration
Load the diabetes dataset from scikit-learn:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
How many samples and features?
What is the target variable?
Is this regression or classification?
Create visualizations of target distribution
Find which feature correlates most with target
Exercise 2: Train-Test Split
Using the diabetes dataset:
Split into 70% train, 30% test
Fit a LinearRegression model
Calculate MSE on both train and test
Calculate RΒ² on both train and test
Is the model overfitting or underfitting? Why?
Exercise 3: Classification Practice
Load the wine dataset:
from sklearn.datasets import load_wine
wine = load_wine(as_frame=True)
How many classes?
Split into train/test (80/20)
Train LogisticRegression and DecisionTreeClassifier
Compare accuracy, precision, recall for both
Which model performs better?
Exercise 4: Overfitting Investigation
Generate synthetic data:
X = np.linspace(0, 10, 50)
y = np.sin(X) + np.random.randn(50) * 0.2
Fit polynomials of degree 1, 5, 10, 20
Calculate training error for each
Generate new test data and calculate test error
Plot training vs test error by degree
What degree minimizes test error?
Exercise 5: Clustering Analysis
Using the Iris dataset:
Apply K-Means with k=2, 3, 4, 5
Calculate silhouette score for each k
Which k gives best score?
Visualize clusters for best k
Compare with true species labels
Exercise 6: Metrics Understanding
Given predictions and actual values:
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
Manually calculate confusion matrix
Calculate accuracy
Calculate precision
Calculate recall
Calculate F1-score
Verify with sklearn functions
Exercise 7: Real-World Application
Choose a dataset from UCI ML Repository or Kaggle:
Load and explore the data
Identify the problem type
Perform EDA (visualizations, statistics)
Apply appropriate model
Evaluate with proper metrics
Document your findings
Exercise 8: Workflow Practice
Complete end-to-end pipeline:
Load digits dataset (load_digits)
Split train/test
Try 3 different classifiers
Use cross-validation for each
Select best model
Final evaluation on test set
Create confusion matrix heatmap
Report findings professionally