Introduction to Statistical Learning with Python (ISLP)¶
A comprehensive collection of Jupyter notebooks covering foundational concepts and advanced techniques in statistical learning and machine learning, based on “An Introduction to Statistical Learning” and adapted for Python.
📚 Overview¶
This series provides hands-on implementations of statistical learning methods with Python, featuring:
13 comprehensive chapters covering fundamental to advanced topics
Theory + Practice: Mathematical formulations with executable code
Real datasets: Practical examples using classic ML datasets
Visualizations: Professional plots for understanding concepts
Exercises: Practice problems for each chapter
100+ additional exercises: See PRACTICE_EXERCISES.md
🗂️ Chapter Guide¶
Chapter 1: Introduction¶
File: 01_introduction.ipynb (NEW!)
Topics:
What is statistical learning?
Supervised vs unsupervised learning
Real-world applications
The ML workflow
Model assessment metrics
Overfitting vs underfitting
Demonstrations:
California Housing (regression example)
Breast Cancer (classification example)
Iris Clustering (unsupervised example)
Train-test split and overfitting visualization
Comprehensive metrics comparison
Key Concepts:
The learning framework: Y = f(X) + ε
Bias-variance trade-off
Train-test split importance
Regression vs classification metrics
Supervised vs unsupervised paradigms
Practice: 8 comprehensive exercises covering all intro concepts
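A minimal sketch of the chapter's central lesson, assuming scikit-learn and its built-in Breast Cancer data: an unconstrained model fits the training set almost perfectly but generalizes worse than a constrained one.

```python
# Train-test split and overfitting: compare an unconstrained decision tree
# with a depth-limited one on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

for depth in [None, 3]:  # None = grow the tree until it memorizes the data
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.3f}, "
          f"test={tree.score(X_test, y_test):.3f}")
```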
Chapter 2: Statistical Learning¶
File: 02_statistical_learning.ipynb (25KB)
Topics:
Supervised vs unsupervised learning
Regression vs classification
Bias-variance trade-off
Training vs test error
Model assessment and selection
Key Concepts:
Reducible vs irreducible error
Overfitting and underfitting
Cross-validation basics
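To make the train-versus-test distinction concrete, here is a small sketch on synthetic data generated from the Y = f(X) + ε framework; the flexibility knob is KNN's k (an illustrative choice, not the notebook's exact example).

```python
# Training vs. test error as flexibility varies: small k is flexible
# (low bias, high variance); large k is rigid (high bias, low variance).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # Y = f(X) + eps
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for k in [1, 10, 50]:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:2d}: train R2={knn.score(X_train, y_train):.2f}, "
          f"test R2={knn.score(X_test, y_test):.2f}")
```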
Chapter 3: Linear Regression¶
File: 03_linear_regression.ipynb (40KB)
Topics:
Simple linear regression
Multiple linear regression
Least squares estimation
Hypothesis testing (t-tests, F-tests)
R² and adjusted R²
Residual analysis
Demonstrations:
Boston Housing dataset
Advertising dataset
Confidence vs prediction intervals
Diagnostic plots
Key Formulas:
β̂ = (X'X)⁻¹X'y
RSS = Σ(yᵢ - ŷᵢ)²
R² = 1 - RSS/TSS
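These formulas translate directly to NumPy; a minimal sketch on simulated data (np.linalg.solve is used instead of an explicit matrix inverse for numerical stability):

```python
# Least squares via the normal equations, plus RSS and R-squared.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept + one feature
y = X @ np.array([2.0, 3.0]) + rng.normal(size=100)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solves (X'X) beta = X'y
y_hat = X @ beta_hat
rss = np.sum((y - y_hat) ** 2)                # RSS = sum of squared residuals
tss = np.sum((y - y.mean()) ** 2)
print("beta_hat:", beta_hat, "| R2:", 1 - rss / tss)
```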
Chapter 4: Classification¶
File: 04_classification.ipynb (32KB)
Topics:
Logistic regression
Linear Discriminant Analysis (LDA)
Quadratic Discriminant Analysis (QDA)
Naive Bayes
K-Nearest Neighbors (KNN)
Demonstrations:
Binary classification (Default dataset)
Multi-class classification (Iris)
Decision boundaries
Confusion matrices
ROC curves and AUC
Metrics:
Accuracy, precision, recall, F1-score
Sensitivity and specificity
Classification error rate
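A compact sketch of the chapter's pipeline, assuming the built-in Breast Cancer data as a stand-in for the Default dataset: fit a logistic regression, then compute the metrics listed above.

```python
# Logistic regression with a confusion matrix, precision/recall/F1, and AUC.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```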
Chapter 5: Resampling Methods¶
File: 05_resampling_methods.ipynb (37KB)
Topics:
Cross-validation (LOOCV, k-fold)
Bootstrap
Model selection
Uncertainty estimation
Demonstrations:
k-fold CV for polynomial degree selection
LOOCV vs k-fold comparison
Bootstrap confidence intervals
Bootstrap standard errors
Applications:
Estimating test error
Model comparison
Parameter uncertainty
Sample size effects
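Both resampling ideas fit in a few lines; a sketch using the built-in Diabetes data (an illustrative choice):

```python
# k-fold cross-validation for test-error estimation, and a bootstrap
# standard error for a simple statistic.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
cv_mse = -cross_val_score(LinearRegression(), X, y,
                          cv=10, scoring="neg_mean_squared_error")
print("10-fold CV estimate of test MSE:", cv_mse.mean())

rng = np.random.default_rng(42)
boot_means = [rng.choice(y, size=len(y), replace=True).mean()
              for _ in range(1000)]  # resample with replacement, re-estimate
print("Bootstrap SE of the mean:", np.std(boot_means))
```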
Chapter 6: Linear Model Selection and Regularization¶
File: 06_regularization.ipynb (46KB)
Topics:
Subset selection (best subset, forward, backward)
Ridge regression (L2 penalty)
Lasso regression (L1 penalty)
Elastic Net
Principal Component Regression (PCR)
Key Concepts:
Regularization path
Cross-validation for λ selection
Feature selection vs shrinkage
Multicollinearity handling
Formulas:
Ridge: minimize RSS + λΣβⱼ²
Lasso: minimize RSS + λΣ|βⱼ|
Elastic Net: minimize RSS + λ₁Σ|βⱼ| + λ₂Σβⱼ²
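In scikit-learn the penalty weight λ is called alpha, and the cross-validated estimators choose it automatically; a minimal sketch on the built-in Diabetes data:

```python
# Ridge and Lasso with the penalty weight selected by cross-validation.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)  # penalties assume comparable feature scales

ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X, y)
lasso = LassoCV(cv=10, random_state=42).fit(X, y)
print("Ridge alpha:", ridge.alpha_)
print("Lasso alpha:", lasso.alpha_,
      "| nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))  # L1 selects features
```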
Chapter 7: Moving Beyond Linearity¶
File: 07_beyond_linearity.ipynb (35KB)
Topics:
Polynomial regression
Step functions
Regression splines (B-splines)
Smoothing splines
Generalized Additive Models (GAMs)
Demonstrations:
Polynomial degree selection via CV
Spline knot placement
GAMs with multiple predictors
Method comparison
Use Cases:
Non-linear relationships
Flexible modeling
Interpretable non-linearity
Smooth curve fitting
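Polynomials and splines both drop into scikit-learn pipelines; a sketch on synthetic data (SplineTransformer requires scikit-learn ≥ 1.0):

```python
# Polynomial regression vs. a B-spline basis, each fit as a pipeline.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 10, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

poly = make_pipeline(PolynomialFeatures(degree=4), LinearRegression()).fit(X, y)
spline = make_pipeline(SplineTransformer(degree=3, n_knots=6),
                       LinearRegression()).fit(X, y)
print("Polynomial R2:", round(poly.score(X, y), 3))
print("B-spline R2:  ", round(spline.score(X, y), 3))
```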
Chapter 8: Tree-Based Methods¶
File: 08_tree_based_methods.ipynb (40-45KB)
Topics:
Decision trees (CART)
Bagging (Bootstrap Aggregation)
Random Forests
Boosting (AdaBoost, Gradient Boosting)
Feature importance
Demonstrations:
Tree pruning and depth control
Out-of-bag (OOB) error
Feature importance visualization
Ensemble comparison
Breast Cancer dataset
Key Algorithms:
DecisionTreeClassifier/Regressor
BaggingClassifier
RandomForestClassifier
AdaBoostClassifier
GradientBoostingClassifier
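A short sketch of two ideas from the list above: out-of-bag error (a free validation estimate from bootstrap sampling) and impurity-based feature importances.

```python
# Random forest with OOB accuracy and a top-five feature-importance ranking.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=42)
rf.fit(data.data, data.target)
print("OOB accuracy:", round(rf.oob_score_, 3))  # scored on samples left out of each bootstrap

ranked = sorted(zip(rf.feature_importances_, data.feature_names), reverse=True)
for importance, name in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```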
Chapter 9: Support Vector Machines¶
File: 09_support_vector_machines.ipynb (40-45KB)
Topics:
Maximal margin classifier
Support Vector Classifier (soft margin)
Kernel methods (Linear, Polynomial, RBF)
Multi-class SVMs
Support Vector Regression (SVR)
Demonstrations:
C parameter tuning (margin control)
Kernel comparison
Gamma parameter effects (RBF)
Hyperparameter grid search
Breast Cancer classification
Key Concepts:
Margin maximization
Support vectors
Kernel trick
Slack variables (ξ)
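Tuning C and gamma jointly is the usual workflow; a minimal sketch with an RBF kernel (features are scaled first, since SVMs are distance-based):

```python
# Grid search over C (margin softness) and gamma (RBF kernel width).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10, 100],
                           "svc__gamma": ["scale", 0.01, 0.1]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, "CV accuracy:", round(grid.best_score_, 3))
```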
Chapter 10: Deep Learning¶
File: 10_deep_learning.ipynb (35-40KB)
Topics:
Neural network fundamentals
Activation functions (ReLU, sigmoid, tanh, softmax)
Single vs deep networks
Backpropagation
Regularization (L2, dropout, early stopping)
MLPClassifier and MLPRegressor
Demonstrations:
California Housing (regression)
MNIST digit classification
Hidden layer comparison
Learning curves
Regularization effects
Architecture:
Input → Hidden₁ → Hidden₂ → ... → Output
Each layer: z = Wx + b, a = σ(z)
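A small sketch of that architecture with MLPClassifier, using scikit-learn's built-in 8×8 digits as a lightweight stand-in for MNIST:

```python
# Two hidden layers with ReLU, L2 regularization via alpha, early stopping.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)  # 1,797 images, 64 features each
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                    alpha=1e-4, early_stopping=True, max_iter=500,
                    random_state=42).fit(X_train, y_train)
print("Test accuracy:", round(mlp.score(X_test, y_test), 3))
```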
Chapter 11: Survival Analysis¶
File: 11_survival_analysis.ipynb (45-50KB)
Topics:
Survival functions
Censoring (right, left, interval)
Kaplan-Meier estimator
Log-Rank test
Cox Proportional Hazards model
Hazard ratios
Demonstrations:
Rossi recidivism dataset
Survival curves by group
Median survival time
Hazard ratio interpretation
Proportional hazards assumption
Key Library: lifelines
Applications:
Time-to-event analysis
Medical studies
Customer churn
Equipment failure
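A minimal lifelines sketch on the Rossi data, where the week column holds durations and arrest is the event indicator:

```python
# Kaplan-Meier survival curve and a Cox proportional hazards model.
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.datasets import load_rossi

rossi = load_rossi()

km = KaplanMeierFitter().fit(rossi["week"], event_observed=rossi["arrest"])
print("Median survival time:", km.median_survival_time_)

cph = CoxPHFitter().fit(rossi, duration_col="week", event_col="arrest")
cph.print_summary()  # the exp(coef) column gives hazard ratios
```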
Chapter 12: Unsupervised Learning¶
File: 12_unsupervised_learning.ipynb (40-45KB)
Topics:
Principal Component Analysis (PCA)
K-Means clustering
Hierarchical clustering
DBSCAN
Dimensionality reduction
Demonstrations:
Iris dataset (PCA: 4D → 2D)
Scree plots and variance explained
Elbow method for K selection
Silhouette analysis
Dendrogram visualization
MNIST digits clustering
Linkage Methods:
Complete
Average
Single
Ward
Validation:
Silhouette score
Davies-Bouldin index
Inertia (within-cluster SS)
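PCA, K-Means, and silhouette validation chain together in a few lines; a sketch on Iris:

```python
# Reduce Iris from 4D to 2D with PCA, then compare K-Means choices of k.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print("Variance explained by 2 PCs:", round(pca.explained_variance_ratio_.sum(), 3))

for k in [2, 3, 4]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X2)
    print(f"k={k}: silhouette={silhouette_score(X2, labels):.3f}")
```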
Chapter 13: Multiple Testing¶
File: 13_multiple_testing.ipynb (35-40KB)
Topics:
Multiple testing problem
Family-Wise Error Rate (FWER)
False Discovery Rate (FDR)
Bonferroni correction
Holm’s method
Benjamini-Hochberg procedure
Benjamini-Yekutieli procedure
Demonstrations:
Simulation of Type I error inflation
FWER vs FDR comparison
Threshold visualization
Power analysis
Method selection guidelines
Key Library: statsmodels.stats.multitest
Applications:
Genomics (thousands of tests)
Neuroimaging
A/B testing
Clinical trials
Decision Guide:
Small m (< 20): Bonferroni or Holm
Large m (≥ 100): Benjamini-Hochberg (FDR)
Confirmatory studies: FWER control
Exploratory studies: FDR control
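The decision guide above is easy to test in simulation; a sketch with 900 true nulls and 100 real effects (illustrative numbers):

```python
# Compare FWER-controlling (Bonferroni, Holm) and FDR-controlling (BH)
# corrections on simulated p-values.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
null_p = rng.uniform(size=900)                         # true nulls
effect_p = stats.norm.sf(rng.normal(loc=3, size=100))  # real effects
pvals = np.concatenate([null_p, effect_p])

for method in ["bonferroni", "holm", "fdr_bh"]:
    reject, _, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method}: {reject.sum()} rejections")
```

The FDR methods typically recover far more of the real effects at the same alpha, which is why they are preferred when the number of tests is large.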
🚀 Getting Started¶
Prerequisites¶
Python Version: 3.8+
Required Libraries:
pip install numpy pandas matplotlib seaborn scikit-learn scipy statsmodels lifelines
Or use the requirements file:
pip install -r requirements.txt
Installation¶
Clone the repository:
git clone https://github.com/PavanMudigonda/aiml.git
cd aiml/2-maths/islp-book
Create virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
Install dependencies:
pip install -r requirements.txt
Launch Jupyter:
jupyter notebook
📖 Learning Path¶
Beginner Track (Fundamentals)¶
Chapter 1: Introduction → Overview and motivation
Chapter 2: Statistical Learning → Understand core concepts
Chapter 3: Linear Regression → First predictive model
Chapter 4: Classification → Categorical outcomes
Chapter 5: Resampling → Model validation
Time: 2-4 weeks
Intermediate Track (Advanced Methods)¶
Chapter 6: Regularization → Handle complex models
Chapter 7: Beyond Linearity → Non-linear relationships
Chapter 8: Tree Methods → Ensemble learning
Chapter 9: SVMs → Powerful classification
Time: 3-4 weeks
Advanced Track (Specialized Topics)¶
Chapter 10: Deep Learning → Neural networks
Chapter 11: Survival Analysis → Time-to-event
Chapter 12: Unsupervised → Clustering and PCA
Chapter 13: Multiple Testing → Statistical inference
Time: 3-4 weeks
Total Program: 8-12 weeks for comprehensive mastery
🎯 How to Use These Notebooks¶
For Self-Study¶
Read theory sections (markdown cells) carefully
Run code cells sequentially (Shift+Enter)
Modify parameters to see effects
Complete exercises at the end
Compare your solutions with demonstrations
For Teaching¶
Use as lecture supplements
Live coding demonstrations
Student projects and assignments
Flipped classroom materials
For Reference¶
Quick lookup of methods
Code snippets for projects
Visualization templates
Best practices
📊 Datasets Used¶
| Dataset | Used In | Description |
|---|---|---|
| Boston Housing | Ch 3, 6, 7 | House prices with 13 features |
| Advertising | Ch 3 | Sales vs TV/Radio/Newspaper |
| Default | Ch 4 | Credit card default prediction |
| Iris | Ch 4, 12 | 3 species, 4 features |
| Breast Cancer | Ch 8, 9 | Binary classification, 30 features |
| California Housing | Ch 10 | Regression, 8 features |
| MNIST Digits | Ch 10, 12 | 64 features (8×8 pixels) |
| Rossi | Ch 11 | Recidivism survival data |
| Synthetic | Multiple | Generated for demonstrations |
Most datasets are built into scikit-learn (Rossi ships with lifelines) or easily accessible. Note that load_boston was removed in scikit-learn 1.2, so the Boston Housing data must now be fetched from OpenML or another source.
🔑 Key Libraries Reference¶
Core¶
import numpy as np # Numerical computing
import pandas as pd # Data manipulation
import matplotlib.pyplot as plt # Plotting
import seaborn as sns # Statistical visualization
Machine Learning¶
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.svm import SVC, SVR
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc
Statistical¶
from scipy import stats # Statistical tests
from statsmodels.stats.multitest import multipletests # Multiple testing
from lifelines import KaplanMeierFitter, CoxPHFitter # Survival analysis
🎓 Concepts Covered¶
Fundamental Concepts¶
Bias-variance trade-off
Overfitting and regularization
Cross-validation
Feature engineering
Model selection
Regression Techniques¶
Linear regression
Polynomial regression
Ridge and Lasso
Splines and GAMs
SVR
Classification Methods¶
Logistic regression
LDA/QDA
Decision trees
Random forests
Boosting
SVMs
Neural networks
Unsupervised Learning¶
PCA
K-Means
Hierarchical clustering
Dimensionality reduction
Statistical Inference¶
Hypothesis testing
Confidence intervals
Bootstrap
Multiple testing correction
Survival analysis
💡 Best Practices¶
Code Style¶
Set random seeds for reproducibility: np.random.seed(42)
Use train-test splits: train_test_split(test_size=0.2)
Scale features when needed: StandardScaler()
Validate with cross-validation
Visualization¶
Clear labels and titles
Appropriate color schemes
Grid lines for readability
Legend placement
Figure size optimization
Model Development¶
Explore data (EDA)
Split data (train/test)
Preprocess (scaling, encoding)
Train model
Validate (cross-validation)
Tune hyperparameters
Test (final evaluation)
Interpret results
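The steps above compress into a pipeline plus a grid search; a minimal sketch (dataset and parameter grid are illustrative):

```python
# Split, preprocess, tune with cross-validation, then evaluate once on test data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)                 # tuning only ever sees training data
print("Test accuracy:", round(grid.score(X_test, y_test), 3))
```

Wrapping the scaler inside the pipeline ensures it is refit on each CV fold, avoiding leakage from the held-out data.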
🔍 Quick Method Lookup¶
Choose Regression Method¶
Linear relationships: Linear Regression
Multicollinearity: Ridge or Lasso
Feature selection: Lasso
Non-linear: Polynomial, Splines, GAMs
Complex patterns: Random Forest, Gradient Boosting
Small sample: Ridge
Choose Classification Method¶
Linear boundary: Logistic Regression, LDA
Non-linear boundary: QDA, KNN, SVM (RBF)
Interpretability: Logistic Regression, Decision Tree
High accuracy: Random Forest, Gradient Boosting, SVM
Large dataset: Logistic Regression, Neural Network
Small dataset: LDA, Naive Bayes
Choose Unsupervised Method¶
Dimensionality reduction: PCA
Clustering (spherical): K-Means
Clustering (arbitrary shape): Hierarchical, DBSCAN
Visualization: PCA + scatter plot
📝 Practice Exercises¶
In-Chapter Exercises¶
Each chapter includes 5-8 practice exercises covering:
Conceptual: Understanding theory
Applied: Using methods on new datasets
Computational: Implementing from scratch
Analysis: Interpreting results
Additional Practice¶
PRACTICE_EXERCISES.md includes 100+ extra problems:
3 additional problems per chapter (39 total)
8 comprehensive projects
5 challenge problems
Solutions and hints
Progress tracker
Difficulty Levels:
Beginner: Chapters 1-5 (25 exercises)
Intermediate: Chapters 6-10 (25 exercises)
Advanced: Chapters 11-13 + Projects (50+ exercises)
Recommended Approach:
Complete in-chapter exercises first
Attempt additional exercises in PRACTICE_EXERCISES.md
Work on 2-3 projects
Try 1 challenge problem
Participate in Kaggle competition
🤝 Contributing¶
Contributions welcome! Areas for improvement:
Additional datasets
More exercises
Alternative implementations
Error corrections
Clarifications
Extended examples
Process:
Fork repository
Create feature branch
Make changes
Test notebooks (run all cells)
Submit pull request
📚 Additional Resources¶
Books¶
ISLR (original): James, Witten, Hastie, Tibshirani
ESL: Elements of Statistical Learning (advanced)
Python Data Science Handbook: Jake VanderPlas
Hands-On Machine Learning: Aurélien Géron
Online Courses¶
Stanford CS229 (Machine Learning)
Fast.ai (Practical Deep Learning)
Coursera Machine Learning Specialization
Documentation¶
scikit-learn: https://scikit-learn.org/stable/
statsmodels: https://www.statsmodels.org/
lifelines: https://lifelines.readthedocs.io/
⚖️ License¶
MIT License - feel free to use for learning and teaching.
📧 Contact¶
Repository: PavanMudigonda/aiml
Issues: Report bugs or suggest improvements via GitHub Issues
🎉 Acknowledgments¶
Based on “An Introduction to Statistical Learning” by James, Witten, Hastie, and Tibshirani
scikit-learn team for excellent ML library
Python community for data science ecosystem
Contributors and users of this repository
📈 Progress Tracker¶
Track your learning progress:
Core Chapters¶
Chapter 1: Introduction
Chapter 2: Statistical Learning
Chapter 3: Linear Regression
Chapter 4: Classification
Chapter 5: Resampling Methods
Chapter 6: Regularization
Chapter 7: Beyond Linearity
Chapter 8: Tree-Based Methods
Chapter 9: Support Vector Machines
Chapter 10: Deep Learning
Chapter 11: Survival Analysis
Chapter 12: Unsupervised Learning
Chapter 13: Multiple Testing
Core Progress: 0/13 chapters
Additional Practice¶
Complete all in-chapter exercises (60+ problems)
Complete additional exercises from PRACTICE_EXERCISES.md (39 problems)
Complete 2-3 projects (8 available)
Complete 1+ challenge problem (5 available)
Participate in Kaggle competition
Overall Mastery: Track your journey to becoming an ISLP expert!
🏆 Learning Goals¶
After completing this series, you will be able to:
✅ Understand fundamental statistical learning concepts
✅ Implement regression and classification models
✅ Apply regularization techniques
✅ Use ensemble methods effectively
✅ Work with neural networks
✅ Perform survival analysis
✅ Apply unsupervised learning methods
✅ Handle multiple testing problems
✅ Choose appropriate methods for different problems
✅ Interpret and validate model results
✅ Communicate findings effectively
Happy Learning! 🚀