Chapter 1: Introduction to Statistical Learning

What is Statistical Learning?

Statistical learning refers to a set of tools for understanding data. These tools can be broadly classified into:

Supervised Learning

  • Build a model to predict or estimate an output based on inputs

  • We have labeled data (known outcomes)

  • Examples: regression, classification

Unsupervised Learning

  • Find patterns and structure in data

  • No labeled outcomes

  • Examples: clustering, dimensionality reduction

Why Statistical Learning?ΒΆ

In many situations we want to:

  1. Predict future outcomes

    • Will this customer churn?

    • What will house prices be next year?

    • Is this email spam?

  2. Understand relationships

    • How does advertising affect sales?

    • Which genes are associated with disease?

    • What factors influence customer satisfaction?

  3. Discover patterns

    • Customer segmentation

    • Anomaly detection

    • Topic modeling

The Learning Framework

Notation

  • Input variables: \(X = (X_1, X_2, \ldots, X_p)\)

    • Also called: features, predictors, independent variables

  • Output variable: \(Y\)

    • Also called: response, target, dependent variable

  • Relationship: \(Y = f(X) + \epsilon\)

    • \(f\): systematic information that \(X\) provides about \(Y\)

    • \(\epsilon\): random error (irreducible)

Goal

Estimate \(f\) using observed data to:

  1. Predict \(Y\) for new \(X\) values

  2. Infer which \(X_j\) are important

  3. Understand the relationship between \(X\) and \(Y\)

Real-World Applications

| Domain | Problem | Type | Methods |
|---|---|---|---|
| Healthcare | Disease diagnosis | Classification | Logistic, SVM, Neural Nets |
| Finance | Stock price prediction | Regression | Time series, Random Forest |
| Marketing | Customer segmentation | Clustering | K-Means, Hierarchical |
| E-commerce | Product recommendation | Collaborative filtering | Matrix factorization |
| Manufacturing | Quality control | Classification | Decision trees, SVM |
| Social Media | Sentiment analysis | NLP + Classification | Naive Bayes, Deep learning |
| Genomics | Gene expression | Multiple testing | ANOVA, FDR control |
| Insurance | Risk assessment | Regression | GLM, GAM |

Course Overview

This book covers:

Foundations (Chapters 2-5)

  • Statistical learning framework

  • Linear regression

  • Classification methods

  • Model validation

Advanced Supervised (Chapters 6-10)

  • Regularization

  • Non-linear methods

  • Tree-based methods

  • Support vector machines

  • Deep learning

Specialized Topics (Chapters 11-13)

  • Survival analysis

  • Unsupervised learning

  • Multiple testing

1.1 Real Data Examples

Let’s explore several real datasets to understand different types of learning problems.

# Imports used throughout this chapter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing, load_breast_cancer, load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

# Example 1: Regression - California Housing
print("📊 Example 1: REGRESSION PROBLEM")
print("="*70)

# Load data
housing = fetch_california_housing(as_frame=True)
X = housing.data
y = housing.target

print(f"\n🏠 California Housing Dataset")
print(f"   Samples: {len(X):,}")
print(f"   Features: {X.shape[1]}")
print(f"   Target: Median house value (in $100,000s)")
print(f"\n   Features: {list(X.columns)}")

# Show sample
print(f"\n📋 First 3 samples:")
display(pd.concat([X.head(3), y.head(3).rename('MedianValue')], axis=1))

# Basic statistics
print(f"\n📈 Target Statistics:")
print(f"   Mean: ${y.mean()*100:.2f}k")
print(f"   Median: ${y.median()*100:.2f}k")
print(f"   Std: ${y.std()*100:.2f}k")
print(f"   Range: ${y.min()*100:.2f}k - ${y.max()*100:.2f}k")

# Simple model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)

print(f"\n🤖 Simple Linear Regression:")
print(f"   R² score: {score:.4f}")
print(f"   Interpretation: Model explains {score*100:.2f}% of variance")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Target distribution
axes[0].hist(y, bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Median House Value ($100k)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of House Prices')
axes[0].grid(True, alpha=0.3)

# Feature correlation
corr = X.corrwith(y).sort_values(ascending=False)
axes[1].barh(corr.index, corr.values, color=['green' if v > 0 else 'red' for v in corr.values], alpha=0.7)
axes[1].set_xlabel('Correlation with Price')
axes[1].set_title('Feature Importance (Correlation)')
axes[1].grid(True, alpha=0.3, axis='x')

# Predictions vs Actual
y_pred = model.predict(X_test)
axes[2].scatter(y_test, y_pred, alpha=0.3)
axes[2].plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
axes[2].set_xlabel('Actual Price ($100k)')
axes[2].set_ylabel('Predicted Price ($100k)')
axes[2].set_title(f'Predictions vs Actual (R²={score:.3f})')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insight: MedInc (median income) is the strongest predictor!")
# Example 2: Classification - Breast Cancer
print("\n" + "="*70)
print("📊 Example 2: CLASSIFICATION PROBLEM")
print("="*70)

# Load data
cancer = load_breast_cancer(as_frame=True)
X = cancer.data
y = cancer.target

print(f"\n🔬 Breast Cancer Dataset")
print(f"   Samples: {len(X):,}")
print(f"   Features: {X.shape[1]}")
print(f"   Classes: {cancer.target_names}")
print(f"   Class distribution:")
print(f"      Malignant: {(y==0).sum()} ({(y==0).sum()/len(y)*100:.1f}%)")
print(f"      Benign: {(y==1).sum()} ({(y==1).sum()/len(y)*100:.1f}%)")

# Sample features
print(f"\n   Sample features: {list(X.columns[:5])} ... (30 total)")

# Simple model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)

print(f"\n🤖 Logistic Regression:")
print(f"   Accuracy: {score:.4f} ({score*100:.2f}%)")

# Predictions
y_pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report

print(f"\n📊 Confusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(f"")
print(f"              Predicted")
print(f"              Malignant  Benign")
print(f"   Actual Malignant    {cm[0,0]:3d}      {cm[0,1]:3d}")
print(f"          Benign       {cm[1,0]:3d}      {cm[1,1]:3d}")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Class distribution
class_counts = pd.Series(y).value_counts().sort_index()
axes[0].bar(['Malignant', 'Benign'], class_counts.values, 
           color=['red', 'green'], alpha=0.7, edgecolor='black')
axes[0].set_ylabel('Count')
axes[0].set_title('Class Distribution')
axes[0].grid(True, alpha=0.3, axis='y')

# Feature importance (coefficients)
coef = pd.Series(model.coef_[0], index=X.columns).sort_values()
top_features = pd.concat([coef.head(5), coef.tail(5)])
colors = ['red' if x < 0 else 'green' for x in top_features.values]
axes[1].barh(range(len(top_features)), top_features.values, color=colors, alpha=0.7)
axes[1].set_yticks(range(len(top_features)))
axes[1].set_yticklabels(top_features.index, fontsize=8)
axes[1].set_xlabel('Coefficient Value')
axes[1].set_title('Top 10 Most Important Features')
axes[1].grid(True, alpha=0.3, axis='x')

# Confusion matrix heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
           xticklabels=['Malignant', 'Benign'],
           yticklabels=['Malignant', 'Benign'],
           ax=axes[2], cbar=False)
axes[2].set_ylabel('Actual')
axes[2].set_xlabel('Predicted')
axes[2].set_title(f'Confusion Matrix (Acc={score:.3f})')

plt.tight_layout()
plt.show()

print("\n💡 Key Insight: Model achieves high accuracy, with worst radius as a key predictor!")
# Example 3: Unsupervised Learning - Iris Clustering
print("\n" + "="*70)
print("📊 Example 3: UNSUPERVISED LEARNING (Clustering)")
print("="*70)

# Load data
iris = load_iris(as_frame=True)
X = iris.data
y_true = iris.target  # We won't use this for clustering, just for evaluation

print(f"\n🌸 Iris Dataset")
print(f"   Samples: {len(X)}")
print(f"   Features: {X.shape[1]} - {list(X.columns)}")
print(f"   Species: {list(iris.target_names)}")
print(f"\n   Task: Discover natural groups WITHOUT using species labels")

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X)

# Evaluate how well clusters match true species
from sklearn.metrics import adjusted_rand_score, silhouette_score
ari = adjusted_rand_score(y_true, clusters)
silhouette = silhouette_score(X, clusters)

print(f"\n🤖 K-Means Clustering (k=3):")
print(f"   Adjusted Rand Index: {ari:.4f}")
print(f"      (1.0 = perfect match with true species)")
print(f"   Silhouette Score: {silhouette:.4f}")
print(f"      (Higher = better separated clusters)")

# Cluster sizes
print(f"\n   Cluster sizes:")
for i in range(3):
    print(f"      Cluster {i}: {(clusters==i).sum()} samples")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# True species (for comparison)
for i, species in enumerate(iris.target_names):
    mask = y_true == i
    axes[0].scatter(X.loc[mask, 'petal length (cm)'], 
                   X.loc[mask, 'petal width (cm)'],
                   label=species, alpha=0.6, s=100)
axes[0].set_xlabel('Petal Length (cm)')
axes[0].set_ylabel('Petal Width (cm)')
axes[0].set_title('True Species Labels (Ground Truth)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Discovered clusters
scatter = axes[1].scatter(X['petal length (cm)'], 
                         X['petal width (cm)'],
                         c=clusters, cmap='viridis', alpha=0.6, s=100)
# Plot centroids
centroids = kmeans.cluster_centers_
axes[1].scatter(centroids[:, 2], centroids[:, 3], 
               c='red', marker='X', s=300, edgecolor='black', linewidth=2,
               label='Centroids')
axes[1].set_xlabel('Petal Length (cm)')
axes[1].set_ylabel('Petal Width (cm)')
axes[1].set_title(f'K-Means Clusters (ARI={ari:.3f})')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.colorbar(scatter, ax=axes[1], label='Cluster')

plt.tight_layout()
plt.show()

print("\n💡 Key Insight: Unsupervised clustering discovers species groups without labels!")

1.2 Types of Learning Problems

Summary of Examples

| Example | Type | Input (X) | Output (Y) | Goal | Method |
|---|---|---|---|---|---|
| Housing | Regression | House features | Price | Predict | Linear Regression |
| Cancer | Classification | Cell measurements | Malignant/Benign | Classify | Logistic Regression |
| Iris | Clustering | Flower measurements | (none) | Group | K-Means |

Regression vs Classification

Regression (quantitative Y):

  • Predict numerical values

  • Examples: price, temperature, sales

  • Metrics: MSE, R², MAE

Classification (qualitative Y):

  • Predict categories/classes

  • Examples: spam/not spam, disease type

  • Metrics: accuracy, precision, recall, AUC

Supervised vs Unsupervised

Supervised (have Y):

  • Regression

  • Classification

  • Goal: predict Y from X

Unsupervised (no Y):

  • Clustering

  • Dimensionality reduction (PCA)

  • Goal: find structure in X

1.3 The Machine Learning Workflow

Standard Process

1. Problem Definition
   └─> What are we trying to predict/understand?
   
2. Data Collection
   └─> Gather relevant data
   
3. Exploratory Data Analysis (EDA)
   └─> Understand patterns, distributions, correlations
   
4. Data Preparation
   ├─> Handle missing values
   ├─> Encode categorical variables
   ├─> Scale/normalize features
   └─> Split train/test sets
   
5. Model Selection
   └─> Choose appropriate algorithm(s)
   
6. Training
   └─> Fit model on training data
   
7. Evaluation
   ├─> Assess performance on test data
   └─> Compare multiple models
   
8. Tuning
   └─> Optimize hyperparameters
   
9. Final Model
   └─> Retrain on all available data
   
10. Deployment
    └─> Use model in production
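Steps 4-8 of this process can be compressed into a few lines with scikit-learn's Pipeline and GridSearchCV. The dataset, the scaler, and the parameter grid below are illustrative choices, not part of the workflow itself:

```python
# A compact sketch of workflow steps 4-8 (dataset, scaler, and
# hyperparameter grid are illustrative assumptions).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Step 4: split; scaling happens inside the pipeline, avoiding test-set leakage
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),                  # step 4: scale features
    ("clf", LogisticRegression(max_iter=1000)),   # step 5: choose a model
])

# Steps 6-8: fit, evaluate with cross-validation, tune regularization C
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print(f"Best C: {grid.best_params_['clf__C']}")
print(f"Test accuracy: {grid.score(X_test, y_test):.3f}")
```

Wrapping the scaler and model in one Pipeline means every cross-validation fold refits the scaler on its own training portion, which is exactly the "never peek at test data" principle below.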

Key Principles

1. Train-Test Split

  • Always evaluate on unseen data

  • Typical split: 70-80% train, 20-30% test

  • Never “peek” at test data during training

2. Cross-Validation

  • More reliable than a single train-test split

  • k-fold CV: divide data into k parts

  • Each part serves as the test set once

3. Bias-Variance Trade-off

  • Simple models: high bias, low variance

  • Complex models: low bias, high variance

  • Goal: balance both

4. Overfitting vs Underfitting

  • Overfitting: model too complex, fits training noise

  • Underfitting: model too simple, misses patterns

  • Solution: regularization, cross-validation

# Demonstration: Train-Test Split and Overfitting
print("📊 Demonstrating Overfitting vs Proper Fitting")
print("="*70)

# Generate synthetic data
np.random.seed(42)
X_train_demo = np.linspace(0, 10, 30)
y_train_demo = 2 * X_train_demo + 1 + np.random.randn(30) * 2

X_test_demo = np.linspace(0, 10, 100)
y_test_true = 2 * X_test_demo + 1

# Fit polynomials of different degrees
degrees = [1, 3, 15]
colors = ['green', 'blue', 'red']
labels = ['Degree 1 (Underfitting)', 'Degree 3 (Good Fit)', 'Degree 15 (Overfitting)']

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for i, (deg, color, label) in enumerate(zip(degrees, colors, labels)):
    # Fit polynomial
    coeffs = np.polyfit(X_train_demo, y_train_demo, deg)
    poly = np.poly1d(coeffs)
    y_pred = poly(X_test_demo)
    
    # Training error
    train_error = np.mean((y_train_demo - poly(X_train_demo))**2)
    # Test error (on true function)
    test_error = np.mean((y_test_true - y_pred)**2)
    
    # Plot
    axes[i].scatter(X_train_demo, y_train_demo, alpha=0.5, label='Training data')
    axes[i].plot(X_test_demo, y_test_true, 'k--', alpha=0.3, label='True function')
    axes[i].plot(X_test_demo, y_pred, color=color, linewidth=2, label=f'Degree {deg}')
    axes[i].set_xlabel('X')
    axes[i].set_ylabel('Y')
    axes[i].set_title(f'{label}\nTrain MSE: {train_error:.2f} | Test MSE: {test_error:.2f}')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)
    axes[i].set_ylim([-5, 25])

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("   • Degree 1: Too simple, high error on both train and test (UNDERFITTING)")
print("   • Degree 3: Good balance, captures trend without noise (GOOD FIT)")
print("   • Degree 15: Fits training perfectly but wild on test (OVERFITTING)")
print("\n🎯 Goal: Minimize TEST error, not training error!")

1.4 Model Assessment

Regression Metrics

Mean Squared Error (MSE): \(MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2\)

Root Mean Squared Error (RMSE): \(RMSE = \sqrt{MSE}\)

  • Same units as Y

  • Easier to interpret

R² (Coefficient of Determination): \(R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\)

  • At most 1; negative when the model fits worse than predicting the mean

  • Proportion of variance explained

Classification Metrics

Accuracy: \(Accuracy = \frac{\text{Correct predictions}}{\text{Total predictions}}\)

Confusion Matrix:

              Predicted
              Neg   Pos
Actual  Neg   TN    FP
        Pos   FN    TP

Precision: \(\frac{TP}{TP + FP}\) (of predicted positives, how many correct?)

Recall: \(\frac{TP}{TP + FN}\) (of actual positives, how many found?)

F1-Score: \(2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)
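These formulas are easy to verify by hand from the confusion-matrix counts; the toy label vectors below are invented for illustration and cross-checked against scikit-learn:

```python
# Computing precision, recall, and F1 from raw confusion-matrix counts,
# then cross-checking against scikit-learn (toy labels invented here).
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)   # of predicted positives, fraction correct
recall = tp / (tp + fn)      # of actual positives, fraction found
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}")
```

Hand-counting TP, FP, and FN once like this makes the trade-off concrete: moving a prediction from 0 to 1 can only raise recall while risking precision.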

# Comprehensive Metrics Demonstration
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, precision_score, recall_score, f1_score

print("📊 Model Evaluation Metrics")
print("="*70)

# Regression metrics (California Housing)
housing = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

reg_model = LinearRegression()
reg_model.fit(X_train, y_train)
y_pred_reg = reg_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred_reg)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_reg)

print("\n🔢 REGRESSION METRICS (Housing Prices):")
print(f"   MSE:  {mse:.4f}")
print(f"   RMSE: {rmse:.4f} (in $100k units → ${rmse*100:.2f}k error)")
print(f"   R²:   {r2:.4f} ({r2*100:.2f}% variance explained)")

# Classification metrics (Breast Cancer)
cancer = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

clf_model = LogisticRegression(max_iter=10000)
clf_model.fit(X_train, y_train)
y_pred_clf = clf_model.predict(X_test)

acc = accuracy_score(y_test, y_pred_clf)
prec = precision_score(y_test, y_pred_clf)
rec = recall_score(y_test, y_pred_clf)
f1 = f1_score(y_test, y_pred_clf)

print("\n🎯 CLASSIFICATION METRICS (Cancer Detection):")
print(f"   Accuracy:  {acc:.4f} ({acc*100:.2f}% correct)")
print(f"   Precision: {prec:.4f} (of predicted benign, {prec*100:.2f}% actually benign)")
print(f"   Recall:    {rec:.4f} (found {rec*100:.2f}% of all benign cases)")
print(f"   F1-Score:  {f1:.4f} (harmonic mean of precision & recall)")

# Visualize metrics comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Regression: Actual vs Predicted
axes[0].scatter(y_test, y_pred_reg, alpha=0.3)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual Price')
axes[0].set_ylabel('Predicted Price')
axes[0].set_title(f'Regression: R²={r2:.3f}, RMSE={rmse:.3f}')
axes[0].grid(True, alpha=0.3)

# Classification: Metrics comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
values = [acc, prec, rec, f1]
bars = axes[1].bar(metrics, values, alpha=0.7, color=['blue', 'green', 'orange', 'red'], edgecolor='black')
axes[1].set_ylabel('Score')
axes[1].set_ylim([0.9, 1.0])
axes[1].set_title('Classification Metrics')
axes[1].grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, val in zip(bars, values):
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2., height + 0.005,
                f'{val:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("\n💡 Choosing Metrics:")
print("   • Regression: Use R² for overall fit, RMSE for error magnitude")
print("   • Classification: Accuracy for balanced classes, Precision/Recall for imbalance")
print("   • Medical: Prioritize Recall (find all diseases)")
print("   • Spam: Prioritize Precision (avoid false alarms)")

Key Takeaways

What is Statistical Learning?

  • Set of tools for understanding data

  • Estimate function \(f\) where \(Y = f(X) + \epsilon\)

  • Goals: predict, infer, understand

Types of Problems

Supervised Learning:

  • Regression (quantitative Y)

  • Classification (qualitative Y)

Unsupervised Learning:

  • Clustering

  • Dimensionality reduction

Critical Concepts

  1. Train-Test Split: Always evaluate on unseen data

  2. Bias-Variance Trade-off: Balance simplicity and complexity

  3. Overfitting: Model fits training noise, fails on new data

  4. Cross-Validation: More reliable than single split

  5. Appropriate Metrics: Choose based on problem type

The Path Forward

Chapters 2-5: Foundations

  • Deep dive into statistical learning theory

  • Linear regression and classification

  • Model validation techniques

Chapters 6-10: Advanced Supervised Learning

  • Regularization methods

  • Non-linear approaches

  • Ensemble methods

  • Neural networks

Chapters 11-13: Specialized Topics

  • Time-to-event analysis

  • Unsupervised methods

  • Statistical inference with multiple tests

Best Practices

✅ Always split data before any analysis
✅ Use cross-validation for model selection
✅ Evaluate on appropriate metrics
✅ Check for overfitting
✅ Understand your data first (EDA)
✅ Start simple, then increase complexity
✅ Document your workflow
✅ Validate assumptions

Ready to Begin!

You now have the foundation to dive into statistical learning. The journey ahead will cover:

  • Theory: Mathematical foundations

  • Practice: Hands-on implementations

  • Applications: Real-world problems

Let’s get started with Chapter 2: Statistical Learning! πŸš€

Practice Exercises

Exercise 1: Dataset Exploration

Load the diabetes dataset from scikit-learn:

from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
  1. How many samples and features?

  2. What is the target variable?

  3. Is this regression or classification?

  4. Create visualizations of target distribution

  5. Find which feature correlates most with target

Exercise 2: Train-Test Split

Using the diabetes dataset:

  1. Split into 70% train, 30% test

  2. Fit a LinearRegression model

  3. Calculate MSE on both train and test

  4. Calculate R² on both train and test

  5. Is the model overfitting or underfitting? Why?

Exercise 3: Classification Practice

Load the wine dataset:

from sklearn.datasets import load_wine
wine = load_wine(as_frame=True)
  1. How many classes?

  2. Split into train/test (80/20)

  3. Train LogisticRegression and DecisionTreeClassifier

  4. Compare accuracy, precision, recall for both

  5. Which model performs better?

Exercise 4: Overfitting Investigation

Generate synthetic data:

X = np.linspace(0, 10, 50)
y = np.sin(X) + np.random.randn(50) * 0.2
  1. Fit polynomials of degree 1, 5, 10, 20

  2. Calculate training error for each

  3. Generate new test data and calculate test error

  4. Plot training vs test error by degree

  5. What degree minimizes test error?

Exercise 5: Clustering Analysis

Using the Iris dataset:

  1. Apply K-Means with k=2, 3, 4, 5

  2. Calculate silhouette score for each k

  3. Which k gives best score?

  4. Visualize clusters for best k

  5. Compare with true species labels

Exercise 6: Metrics Understanding

Given predictions and actual values:

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
  1. Manually calculate confusion matrix

  2. Calculate accuracy

  3. Calculate precision

  4. Calculate recall

  5. Calculate F1-score

  6. Verify with sklearn functions

Exercise 7: Real-World Application

Choose a dataset from UCI ML Repository or Kaggle:

  1. Load and explore the data

  2. Identify the problem type

  3. Perform EDA (visualizations, statistics)

  4. Apply appropriate model

  5. Evaluate with proper metrics

  6. Document your findings

Exercise 8: Workflow Practice

Complete end-to-end pipeline:

  1. Load digits dataset (load_digits)

  2. Split train/test

  3. Try 3 different classifiers

  4. Use cross-validation for each

  5. Select best model

  6. Final evaluation on test set

  7. Create confusion matrix heatmap

  8. Report findings professionally