Phase 17: Debugging & Troubleshooting
Learn systematic approaches to debug, diagnose, and optimize machine learning models and pipelines.
Learning Objectives
By the end of this phase, you will:
- Diagnose common ML model failures and performance issues
- Use debugging tools and techniques for ML workflows
- Profile and optimize model performance (speed, memory)
- Analyze and fix data-related problems
- Implement error handling and monitoring
- Debug deep learning models effectively
- Troubleshoot deployment and production issues
What You'll Learn
1. Model Debugging Fundamentals
The ML debugging workflow
Common failure modes and symptoms
Debugging checklist and best practices
Logging and instrumentation
2. Data Issues Diagnosis
Data quality problems (missing, duplicates, outliers)
Label errors and class imbalance
Distribution shift detection
Feature correlation issues
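One item on this list, distribution shift detection, can be sketched with SciPy's two-sample Kolmogorov-Smirnov test (the synthetic data below is purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)  # training-time feature
prod_feature = rng.normal(loc=0.5, scale=1.0, size=1000)   # shifted "production" feature

# A small p-value suggests the two samples come from different
# distributions, i.e. the feature may have drifted.
statistic, p_value = stats.ks_2samp(train_feature, prod_feature)
print(f"KS statistic={statistic:.3f}, p-value={p_value:.2e}")
```

In practice you would run this per feature, comparing the training set against recent production data.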
3. Performance Optimization
Profiling CPU and memory usage
Identifying bottlenecks
Optimization techniques (vectorization, caching)
GPU utilization monitoring
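To make the vectorization point concrete, here is a minimal timing comparison of a Python loop against its NumPy equivalent (exact speedups vary by machine):

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

start = time.perf_counter()
total_loop = 0.0
for v in x:                      # interpreted loop: one Python-level op per element
    total_loop += v * v
loop_time = time.perf_counter() - start

start = time.perf_counter()
total_vec = float(np.dot(x, x))  # single call into optimized C code
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.5f}s  "
      f"speedup: {loop_time / vec_time:.0f}x")
```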
4. Model-Specific Debugging
Gradient vanishing/exploding
Overfitting vs underfitting diagnosis
Learning curve analysis
Activation and weight monitoring
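As one example of learning-curve analysis from this list, the sketch below uses scikit-learn's `learning_curve` on synthetic data; an unconstrained decision tree shows the classic overfitting signature of perfect train scores with a persistent validation gap:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

# Perfect train accuracy plus a gap to validation accuracy => overfitting;
# both curves low and close together would instead suggest underfitting.
gap = train_mean[-1] - val_mean[-1]
print(f"train={train_mean[-1]:.3f}  val={val_mean[-1]:.3f}  gap={gap:.3f}")
```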
5. Error Analysis Framework
Systematic error categorization
Confusion matrix deep dive
Per-class error analysis
Failure case collection
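A small per-class error analysis can be sketched with scikit-learn's built-in digits dataset (chosen here purely for illustration):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision/recall shows which classes drive the errors
print(classification_report(y_test, y_pred))

# Off-diagonal confusion-matrix cells reveal the most common mix-ups
cm = confusion_matrix(y_test, y_pred)
errors = cm.copy()
np.fill_diagonal(errors, 0)
true_cls, pred_cls = np.unravel_index(errors.argmax(), errors.shape)
print(f"most frequent confusion: true {true_cls} predicted as {pred_cls} "
      f"({errors[true_cls, pred_cls]} times)")
```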
Module Structure
Notebooks
Notebook 1: Debugging Workflow
ML debugging methodology
Sanity checks and baseline models
Common pitfalls checklist
Debugging tools overview
Duration: 60-90 minutes
Notebook 2: Data Issues
Data quality checks (null values, duplicates, outliers)
Label noise detection
Distribution shift analysis
Feature validation
Duration: 60-90 minutes
Notebook 3: Performance Profiling (03_performance_profiling.ipynb)
CPU profiling with cProfile and line_profiler
Memory profiling with memory_profiler
Bottleneck identification
Optimization strategies
Duration: 90-120 minutes
Notebook 4: Model Debugging
Learning curves and convergence
Gradient monitoring
Weight initialization issues
Overfitting/underfitting diagnosis
Duration: 90-120 minutes
Notebook 5: Error Analysis
Systematic error categorization
Per-class performance analysis
Failure case analysis
Improvement strategies
Duration: 60-90 minutes
Supporting Materials
assignment.md - Comprehensive debugging project
challenges.md - 7 progressive debugging challenges
pre-quiz.md - Baseline knowledge assessment
post-quiz.md - Final knowledge verification
Tools & Libraries
Profiling Tools
import cProfile          # CPU profiling
import line_profiler     # Line-by-line profiling
import memory_profiler   # Memory usage tracking
# py-spy is a sampling profiler; it is a command-line tool, not an importable module
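As a quick illustration of the first tool, cProfile can wrap any suspect function and report where the time goes (the profiled function below is a deliberately slow toy):

```python
import cProfile
import io
import pstats

def slow_feature_engineering(n):
    # Deliberately quadratic, to give the profiler something to flag
    return [sum(range(i)) for i in range(n)]

profiler = cProfile.Profile()
profiler.enable()
slow_feature_engineering(2000)
profiler.disable()

# Report the five most expensive calls by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```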
Debugging Tools
import pdb # Python debugger
import logging # Structured logging
import warnings # Warning management
import traceback # Stack trace analysis
Visualization Tools
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import tensorboard # For deep learning
Analysis Libraries
import pandas as pd
import numpy as np
import sklearn  # installed via pip as scikit-learn
from scipy import stats
Prerequisites
Before starting this phase, you should:
- Complete Phase 15 (Model Evaluation & Metrics)
- Understand model training basics
- Be familiar with Python debugging
- Know NumPy and pandas fundamentals
Learning Path
Week 1: Fundamentals & Data Issues
Day 1-2: Complete Notebook 1 (Debugging Workflow)
Day 3-4: Complete Notebook 2 (Data Issues)
Day 5: Practice Challenges 1-2
Week 2: Performance & Model Debugging
Day 1-2: Complete Notebook 3 (Performance Profiling)
Day 3-4: Complete Notebook 4 (Model Debugging)
Day 5: Practice Challenges 3-4
Week 3: Error Analysis & Integration
Day 1-2: Complete Notebook 5 (Error Analysis)
Day 3-4: Complete Assignment
Day 5: Practice Challenges 5-7, Final Review
Quick Start
Installation
# Install profiling tools
pip install line-profiler memory-profiler py-spy
# Install debugging utilities
pip install ipdb pytest-sugar
# Install visualization tools
pip install matplotlib seaborn plotly tensorboard
# Install ML libraries
pip install scikit-learn pandas numpy scipy
Jupyter Extensions
# Install useful Jupyter extensions
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
# Enable profiling extension
pip install jupyter-resource-usage
Common Debugging Scenarios
Scenario 1: Model Not Learning
Symptoms: loss not decreasing, performance no better than random
Checklist:
- Check learning rate (too high/low?)
- Verify data loading (labels shuffled?)
- Inspect gradients (vanishing/exploding?)
- Test with a smaller dataset first
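The last item on this checklist can be automated: a healthy pipeline should fit a tiny subset almost perfectly, and failure there points at the data or labels rather than model capacity. A minimal probe with synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Probe: can the model memorize just 32 samples?
tiny_X, tiny_y = X[:32], y[:32]
probe = DecisionTreeClassifier(random_state=0).fit(tiny_X, tiny_y)
score = probe.score(tiny_X, tiny_y)
print(f"tiny-subset accuracy: {score:.3f}")  # anything far below 1.0 is a red flag
```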
Scenario 2: Poor Validation Performance
Symptoms: train accuracy high, validation accuracy low
Checklist:
- Overfitting? Add regularization
- Data leakage? Check feature engineering
- Distribution shift? Analyze the train/val split
- Insufficient data? Try augmentation
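One quick check for the data-leakage item: look for identical rows shared between splits. A toy sketch with hypothetical feature frames:

```python
import pandas as pd

# Hypothetical splits; in practice these come from your own pipeline
train = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0], "f2": [0.1, 0.2, 0.3, 0.4]})
test = pd.DataFrame({"f1": [3.0, 5.0], "f2": [0.3, 0.5]})

# Inner-joining on all common columns surfaces rows present in both splits
overlap = train.merge(test, how="inner")
print(f"{len(overlap)} row(s) leaked from train into test")
```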
Scenario 3: Slow Training
Symptoms: training takes far longer than expected
Checklist:
- Profile code to find bottlenecks
- Check data loading (I/O bound?)
- Optimize preprocessing (use vectorization)
- Use GPU acceleration if available
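Before optimizing anything, it helps to know which stage is slow. A tiny context-manager sketch for per-stage wall-clock timing:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage):
    # Log wall-clock time spent in each pipeline stage
    start = time.perf_counter()
    yield
    print(f"{stage}: {time.perf_counter() - start:.3f}s")

with timed("data loading"):
    data = [i * 2 for i in range(100_000)]

with timed("preprocessing"):
    total = sum(data)
```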
Scenario 4: Memory Errors
Symptoms: out-of-memory crashes
Checklist:
- Profile memory usage
- Reduce batch size
- Use gradient accumulation
- Clear cache between iterations
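One way to act on the batch-size item is to process data in chunks rather than materializing one giant result array; a minimal sketch:

```python
import numpy as np

def batched_apply(X, fn, batch_size=256):
    # Apply fn to manageable slices so peak memory stays around one batch
    parts = [fn(X[i:i + batch_size]) for i in range(0, len(X), batch_size)]
    return np.concatenate(parts)

X = np.random.rand(1000, 4)
row_sums = batched_apply(X, lambda batch: batch.sum(axis=1))
print(row_sums.shape)
```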
Debugging Workflow Diagram
┌──────────────────────────────────────────────────────────┐
│                     PROBLEM DETECTED                     │
│          (Poor performance, errors, slow speed)          │
└────────────────────────────┬─────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────┐
│                     1. REPRODUCE BUG                     │
│   • Create minimal reproducible example                  │
│   • Isolate the problem                                  │
│   • Document symptoms                                    │
└────────────────────────────┬─────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────┐
│                      2. GATHER DATA                      │
│   • Check logs and error messages                        │
│   • Profile performance                                  │
│   • Inspect intermediate outputs                         │
│   • Visualize data and predictions                       │
└────────────────────────────┬─────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────┐
│                   3. HYPOTHESIZE CAUSE                   │
│   • Review checklist of common issues                    │
│   • Form testable hypothesis                             │
│   • Prioritize most likely causes                        │
└────────────────────────────┬─────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────┐
│                    4. TEST HYPOTHESIS                    │
│   • Make targeted change                                 │
│   • Re-run experiment                                    │
│   • Compare before/after                                 │
└────────────────────────────┬─────────────────────────────┘
                             │
                          Fixed?
                    ┌────────┴─────────┐
                   YES                NO
                    │                  │
                    ▼                  ▼
            ┌───────────────┐  ┌───────────────┐
            │ 5. VERIFY     │  │ Try next      │
            │ • Test edge   │  │ hypothesis    │
            │   cases       │  └───────┬───────┘
            │ • Document    │          │
            │   solution    │          └──► Back to Step 3
            └───────────────┘
Key Debugging Techniques
1. Sanity Checks
# Check data shapes
print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
# Verify label distribution
print(y_train.value_counts())
# Test on single batch
model.fit(X_train[:32], y_train[:32])
2. Baseline Models
# Always start with simple baseline
from sklearn.dummy import DummyClassifier
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_test, y_test):.3f}")
3. Logging
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info(f"Training started with {len(X_train)} samples")
logger.warning(f"Found {null_count} missing values")
4. Visualization
# Plot learning curves
plt.plot(history['train_loss'], label='Train')
plt.plot(history['val_loss'], label='Validation')
plt.legend()
plt.show()
# Inspect predictions
pd.DataFrame({'Actual': y_test, 'Predicted': y_pred}).head(20)
Success Metrics
After completing this phase, you should be able to:
- Systematically debug ML models using a structured approach
- Identify and fix data quality issues
- Profile and optimize code for a 2-10x speedup
- Diagnose overfitting, underfitting, and convergence problems
- Perform thorough error analysis
- Use debugging tools (pdb, profilers, loggers)
- Handle common production issues
Assessment
Pre-Quiz: Baseline knowledge check (10 questions)
Notebooks: 5 interactive debugging exercises
Assignment: Debug and optimize a broken ML pipeline
Challenges: 7 real-world debugging scenarios
Post-Quiz: Comprehensive assessment (18 questions)
Passing Criteria: 70% on post-quiz, complete assignment
Additional Resources
Articles
"Debugging Machine Learning Models" - Papers with Code
"A Recipe for Training Neural Networks" - Andrej Karpathy
"Troubleshooting Deep Neural Networks" - Josh Tobin
Tools
TensorBoard: Visualization for deep learning
Weights & Biases: Experiment tracking
Neptune.ai: ML metadata store
Next Steps
After completing Phase 17:
Phase 18: Low-Code AI Tools (Gradio, Streamlit)
Phase 19: Production Deployment
Advanced: MLOps and Model Monitoring
Tips for Success
Be Systematic: Follow the debugging workflow; don't guess randomly
Start Simple: Test with minimal data first, then scale up
Document Everything: Keep notes on what you tried and results
Use Version Control: Commit working code before experimenting
Ask for Help: Share reproducible examples when stuck
Happy Debugging! Remember: every bug is an opportunity to learn!