Phase 16: Debugging & Troubleshooting

Learn systematic approaches to debug, diagnose, and optimize machine learning models and pipelines.

🎯 Learning Objectives

By the end of this phase, you will:

  • ✅ Diagnose common ML model failures and performance issues

  • ✅ Use debugging tools and techniques for ML workflows

  • ✅ Profile and optimize model performance (speed, memory)

  • ✅ Analyze and fix data-related problems

  • ✅ Implement error handling and monitoring

  • ✅ Debug deep learning models effectively

  • ✅ Troubleshoot deployment and production issues

📚 What You’ll Learn

1. Model Debugging Fundamentals

  • The ML debugging workflow

  • Common failure modes and symptoms

  • Debugging checklist and best practices

  • Logging and instrumentation

2. Data Issues Diagnosis

  • Data quality problems (missing, duplicates, outliers)

  • Label errors and class imbalance

  • Distribution shift detection

  • Feature correlation issues
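The quality checks above can be bundled into a single helper. A minimal sketch, assuming pandas is available; the `data_quality_report` helper, its 90% imbalance threshold, and the toy DataFrame are illustrative, not part of the course materials:

```python
import pandas as pd
import numpy as np

def data_quality_report(df: pd.DataFrame, label_col: str) -> dict:
    """Summarize common data issues: nulls, duplicates, class imbalance."""
    report = {
        "n_rows": len(df),
        "null_counts": df.isnull().sum().to_dict(),
        "n_duplicates": int(df.duplicated().sum()),
        "label_distribution": df[label_col].value_counts(normalize=True).to_dict(),
    }
    # Flag imbalance if the majority class exceeds 90% of rows (illustrative cutoff)
    report["imbalanced"] = bool(max(report["label_distribution"].values()) > 0.9)
    return report

# Tiny synthetic frame: one missing value, a 3:1 label split
df = pd.DataFrame({"feature": [1.0, 2.0, np.nan, 2.0], "label": [0, 0, 0, 1]})
print(data_quality_report(df, "label"))
```

Running the same report on train and validation splits separately also gives a first hint of distribution shift.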

3. Performance Optimization

  • Profiling CPU and memory usage

  • Identifying bottlenecks

  • Optimization techniques (vectorization, caching)

  • GPU utilization monitoring
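As a concrete instance of the vectorization technique listed above, here is the same column-wise standardization written as nested Python loops and as one NumPy broadcast expression; both helper functions are illustrative sketches:

```python
import numpy as np

def standardize_loop(X):
    # Pure-Python loops: one interpreter step per element
    out = np.empty_like(X, dtype=float)
    means = X.mean(axis=0)
    stds = X.std(axis=0)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            out[i, j] = (X[i, j] - means[j]) / stds[j]
    return out

def standardize_vectorized(X):
    # Broadcasting applies the per-column mean/std to every row at once
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.random.default_rng(0).normal(size=(1000, 20))
assert np.allclose(standardize_loop(X), standardize_vectorized(X))
```

The two produce identical results; on arrays of this size the vectorized version is typically orders of magnitude faster because the loop runs in C instead of Python.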

4. Model-Specific Debugging

  • Gradient vanishing/exploding

  • Overfitting vs underfitting diagnosis

  • Learning curve analysis

  • Activation and weight monitoring
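The overfitting/underfitting diagnosis from loss curves can be roughed out numerically. The `diagnose` helper and its thresholds below are illustrative heuristics, not canonical rules:

```python
import numpy as np

def diagnose(train_loss, val_loss, gap_ratio=0.2, high_loss=1.0):
    """Crude heuristic: classify a run from its final train/val losses."""
    train_final, val_final = train_loss[-1], val_loss[-1]
    if train_final > high_loss and val_final > high_loss:
        return "underfitting"          # both losses stay high
    if (val_final - train_final) > gap_ratio * max(train_final, 1e-8):
        return "overfitting"           # large train/val gap
    return "ok"

train = np.linspace(2.0, 0.1, 20)      # train loss keeps falling
val = np.concatenate([np.linspace(2.0, 0.5, 10),
                      np.linspace(0.5, 0.9, 10)])  # val loss turns back up
print(diagnose(train, val))            # large gap -> "overfitting"
```

In practice you would plot the full curves (as in the Visualization technique below) rather than trust only the final values.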

5. Error Analysis Framework

  • Systematic error categorization

  • Confusion matrix deep dive

  • Per-class error analysis

  • Failure case collection
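Per-class error analysis can be sketched with nothing but NumPy; the `confusion_and_per_class` helper below is a hypothetical minimal version of what the notebook covers (in practice scikit-learn's `confusion_matrix` and `classification_report` do this for you):

```python
import numpy as np

def confusion_and_per_class(y_true, y_pred, n_classes):
    """Build a confusion matrix and per-class recall from raw predictions."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                  # rows = actual class, cols = predicted
    # Per-class recall: diagonal over row sums (guard against empty classes)
    row_sums = cm.sum(axis=1)
    recall = np.divide(np.diag(cm), row_sums,
                       out=np.zeros(n_classes), where=row_sums > 0)
    return cm, recall

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm, recall = confusion_and_per_class(y_true, y_pred, 3)
print(cm)
print(recall)   # class 0: 0.5, class 1: 1.0, class 2: 0.5
```

The low-recall classes are where you collect failure cases for closer inspection.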

πŸ—‚οΈ Module StructureΒΆ

NotebooksΒΆ

  1. 01_debugging_workflow.ipynb

    • ML debugging methodology

    • Sanity checks and baseline models

    • Common pitfalls checklist

    • Debugging tools overview

    • Duration: 60-90 minutes

  2. 02_data_issues.ipynb

    • Data quality checks (null values, duplicates, outliers)

    • Label noise detection

    • Distribution shift analysis

    • Feature validation

    • Duration: 60-90 minutes

  3. 03_performance_profiling.ipynb

    • CPU profiling with cProfile and line_profiler

    • Memory profiling with memory_profiler

    • Bottleneck identification

    • Optimization strategies

    • Duration: 90-120 minutes

  4. 04_model_debugging.ipynb

    • Learning curves and convergence

    • Gradient monitoring

    • Weight initialization issues

    • Overfitting/underfitting diagnosis

    • Duration: 90-120 minutes

  5. 05_error_analysis.ipynb

    • Systematic error categorization

    • Per-class performance analysis

    • Failure case analysis

    • Improvement strategies

    • Duration: 60-90 minutes

Supporting Materials

🛠️ Tools & Libraries

Profiling Tools

import cProfile          # CPU profiling (standard library)
import line_profiler     # Line-by-line profiling
import memory_profiler   # Memory usage tracking
# py-spy is a sampling profiler run from the command line (py-spy top), not imported

Debugging Tools

import pdb              # Python debugger
import logging          # Structured logging
import warnings         # Warning management
import traceback        # Stack trace analysis

Visualization Tools

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import tensorboard      # For deep learning

Analysis Libraries

import pandas as pd
import numpy as np
import sklearn           # scikit-learn is imported as sklearn
from scipy import stats

📋 Prerequisites

Before starting this phase, you should:

  • ✅ Complete Phase 15 (Model Evaluation & Metrics)

  • ✅ Understand model training basics

  • ✅ Be familiar with Python debugging

  • ✅ Know NumPy and pandas fundamentals

🎓 Learning Path

Week 1: Fundamentals & Data Issues

  • Day 1-2: Complete Notebook 1 (Debugging Workflow)

  • Day 3-4: Complete Notebook 2 (Data Issues)

  • Day 5: Practice Challenges 1-2

Week 2: Performance & Model Debugging

  • Day 1-2: Complete Notebook 3 (Performance Profiling)

  • Day 3-4: Complete Notebook 4 (Model Debugging)

  • Day 5: Practice Challenges 3-4

Week 3: Error Analysis & Integration

  • Day 1-2: Complete Notebook 5 (Error Analysis)

  • Day 3-4: Complete Assignment

  • Day 5: Practice Challenges 5-7, Final Review

🚀 Quick Start

Installation

# Install profiling tools
pip install line-profiler memory-profiler py-spy

# Install debugging utilities
pip install ipdb pytest-sugar

# Install visualization tools
pip install matplotlib seaborn plotly tensorboard

# Install ML libraries
pip install scikit-learn pandas numpy scipy

Jupyter Extensions

# Install useful Jupyter extensions
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

# Enable profiling extension
pip install jupyter-resource-usage

🎯 Common Debugging Scenarios

Scenario 1: Model Not Learning

Symptoms: Loss not decreasing, performance near random

Checklist:

  • ✅ Check learning rate (too high/low?)

  • ✅ Verify data loading (labels shuffled?)

  • ✅ Inspect gradients (vanishing/exploding?)

  • ✅ Test with a smaller dataset first
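The last item is often phrased as "overfit a tiny batch first". A minimal sketch with a hand-rolled NumPy logistic regression (all names and numbers here are illustrative): if even this cannot drive the loss down on 8 trivially separable samples, suspect the data pipeline or learning rate before blaming the model:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(8, 3))
y = (X[:, 0] > 0).astype(float)        # trivially learnable labels

w, b, lr = np.zeros(3), 0.0, 0.5

def loss(w, b):
    # Binary cross-entropy of a logistic model on the tiny batch
    p = 1 / (1 + np.exp(-(X @ w + b)))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

initial = loss(w, b)
for _ in range(200):                   # a model this small should memorize fast
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= lr * X.T @ (p - y) / len(y)   # full-batch gradient step
    b -= lr * np.mean(p - y)
print(f"loss {initial:.3f} -> {loss(w, b):.3f}")   # should drop sharply
```

If the loss plateaus here, sweeping `lr` over a few orders of magnitude is the usual next move.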

Scenario 2: Poor Validation Performance

Symptoms: Train accuracy high, validation accuracy low

Checklist:

  • ✅ Overfitting? Add regularization

  • ✅ Data leakage? Check feature engineering

  • ✅ Distribution shift? Analyze the train/val split

  • ✅ Insufficient data? Try augmentation
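For the distribution-shift item, one cheap check is a per-feature standardized mean difference between the train and validation sets. The `shift_scores` helper and its synthetic data are illustrative (a two-sample KS test via `scipy.stats.ks_2samp` is a stronger alternative if SciPy is available):

```python
import numpy as np

def shift_scores(X_train, X_val):
    """Absolute mean difference per feature, in units of pooled std."""
    mu_t, mu_v = X_train.mean(axis=0), X_val.mean(axis=0)
    pooled_std = np.sqrt((X_train.var(axis=0) + X_val.var(axis=0)) / 2) + 1e-12
    return np.abs(mu_t - mu_v) / pooled_std

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 3))
X_val = rng.normal(0, 1, size=(200, 3))
X_val[:, 2] += 2.0                     # inject a shift into feature 2
scores = shift_scores(X_train, X_val)
print(scores)                          # feature 2 stands out
```

Scores well above ~0.5 on a feature are worth investigating; here the shifted feature scores near 2.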

Scenario 3: Slow Training

Symptoms: Each epoch takes far longer than expected

Checklist:

  • ✅ Profile code to find bottlenecks

  • ✅ Check data loading (I/O bound?)

  • ✅ Optimize preprocessing (use vectorization)

  • ✅ Use GPU acceleration if available
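For the profiling item, the standard-library `cProfile` module is enough to surface a hotspot; the deliberately quadratic `slow_preprocess` function below is a stand-in for your own pipeline code:

```python
import cProfile
import io
import pstats

def slow_preprocess(n):
    # Deliberately quadratic: the kind of hotspot profiling surfaces
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_preprocess(300)
profiler.disable()

# Sort by cumulative time and print the top entries
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

`slow_preprocess` will top the listing; `line_profiler` can then pinpoint which line inside it dominates.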

Scenario 4: Memory Errors

Symptoms: Out-of-memory crashes

Checklist:

  • ✅ Profile memory usage

  • ✅ Reduce batch size

  • ✅ Use gradient accumulation

  • ✅ Clear caches between iterations
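For memory profiling, the standard-library `tracemalloc` module gives a quick peak-usage number without extra installs (`memory_profiler` adds line-level detail); the allocation below is synthetic:

```python
import tracemalloc

tracemalloc.start()
big = [bytes(1024) for _ in range(10_000)]    # roughly 10 MB of allocations
current, peak = tracemalloc.get_traced_memory()
del big                                       # release before stopping the tracer
tracemalloc.stop()
print(f"peak traced memory: {peak / 1e6:.1f} MB")
```

Wrapping suspect pipeline stages this way shows which one drives the peak, and therefore where reducing batch size will help most.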

📊 Debugging Workflow Diagram

┌──────────────────────────────────────────────────────────────┐
│                    PROBLEM DETECTED                          │
│          (Poor performance, errors, slow speed)              │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                  1. REPRODUCE BUG                            │
│   • Create minimal reproducible example                      │
│   • Isolate the problem                                      │
│   • Document symptoms                                        │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                  2. GATHER DATA                              │
│   • Check logs and error messages                            │
│   • Profile performance                                      │
│   • Inspect intermediate outputs                             │
│   • Visualize data and predictions                           │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                  3. HYPOTHESIZE CAUSE                        │
│   • Review checklist of common issues                        │
│   • Form testable hypothesis                                 │
│   • Prioritize most likely causes                            │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                  4. TEST HYPOTHESIS                          │
│   • Make targeted change                                     │
│   • Re-run experiment                                        │
│   • Compare before/after                                     │
└────────────────────────┬─────────────────────────────────────┘
                         │
                 ┌───────┴────────┐
                 │                │
         Fixed? YES              NO
                 │                │
                 ▼                ▼
     ┌────────────────┐   ┌──────────────┐
     │   5. VERIFY    │   │  Try Next    │
     │   • Test edge  │   │  Hypothesis  │
     │     cases      │   └──────┬───────┘
     │   • Document   │          │
     │     solution   │          │
     └────────────────┘          │
                                 └──────► Back to Step 3

πŸ” Key Debugging TechniquesΒΆ

1. Sanity ChecksΒΆ

# Check data shapes
print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")

# Verify label distribution
print(y_train.value_counts())

# Test on single batch
model.fit(X_train[:32], y_train[:32])

2. Baseline Models

# Always start with simple baseline
from sklearn.dummy import DummyClassifier
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_test, y_test):.3f}")

3. Logging

import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info(f"Training started with {len(X_train)} samples")
logger.warning(f"Found {null_count} missing values")

4. Visualization

# Plot learning curves
plt.plot(history['train_loss'], label='Train')
plt.plot(history['val_loss'], label='Validation')
plt.legend()
plt.show()

# Inspect predictions
pd.DataFrame({'Actual': y_test, 'Predicted': y_pred}).head(20)

⚠️ Common Pitfalls

📈 Success Metrics

After completing this phase, you should be able to:

✅ Systematically debug ML models using a structured approach
✅ Identify and fix data quality issues
✅ Profile and optimize code for 2-10x speedups
✅ Diagnose overfitting, underfitting, and convergence problems
✅ Perform thorough error analysis
✅ Use debugging tools (pdb, profilers, loggers)
✅ Handle common production issues

🎯 Assessment

  • Pre-Quiz: Baseline knowledge check (10 questions)

  • Notebooks: 5 interactive debugging exercises

  • Assignment: Debug and optimize a broken ML pipeline

  • Challenges: 7 real-world debugging scenarios

  • Post-Quiz: Comprehensive assessment (18 questions)

Passing Criteria: 70% on post-quiz, complete assignment

📚 Additional Resources

Documentation

Articles

  • “Debugging Machine Learning Models” - Papers with Code

  • “A Recipe for Training Neural Networks” - Andrej Karpathy

  • “Troubleshooting Deep Neural Networks” - Josh Tobin

Tools

  • TensorBoard: Visualization for deep learning

  • Weights & Biases: Experiment tracking

  • Neptune.ai: ML metadata store

🚀 Next Steps

After completing Phase 16:

  • Phase 17: Low-Code AI Tools (Gradio, Streamlit)

  • Phase 18: Production Deployment

  • Advanced: MLOps and Model Monitoring

💡 Tips for Success

  1. Be Systematic: Follow the debugging workflow, don’t guess randomly

  2. Start Simple: Test with minimal data first, then scale up

  3. Document Everything: Keep notes on what you tried and results

  4. Use Version Control: Commit working code before experimenting

  5. Ask for Help: Share reproducible examples when stuck

Happy Debugging! Remember: Every bug is an opportunity to learn! 🐛→🦋