# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits, make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
confusion_matrix, classification_report,
accuracy_score, precision_recall_fscore_support
)
import warnings
# Silence library warnings so notebook output stays readable.
warnings.filterwarnings('ignore')

# Set random seed for reproducible splits and model training.
np.random.seed(42)

# Configure plotting defaults.
plt.style.use('default')
sns.set_palette('husl')

# BUGFIX: the original string was mojibake-damaged and split across two
# lines ('print("β' / '…')  — restore the intended emoji + one-line call.
print("✅ Libraries imported successfully!")
Confusion Matrix Deep Dive: Beyond Aggregate Accuracy
The confusion matrix is the foundation of classification error analysis, revealing structured error patterns that a single accuracy number hides. For \(K\) classes, the \(K \times K\) matrix has correct predictions on the diagonal and errors off-diagonal. Row-normalized confusion matrices (dividing each row by its sum) show recall per class: what fraction of each true class was correctly identified. Column-normalized matrices show precision: what fraction of each predicted class was actually correct.
Why this matters: a model with 97% overall accuracy on digit recognition might systematically confuse 8s and 1s (both have vertical strokes), or 3s and 5s (similar top halves). These systematic confusions point to specific feature engineering opportunities — adding stroke-direction features or contour-based descriptors. In production, the confusion matrix also reveals asymmetric costs: confusing a malignant tumor (class 1) with benign (class 0) is far more dangerous than the reverse, and the matrix makes this visible at a glance.
# Load digit recognition dataset (8x8 grayscale digit images, 10 classes).
digits = load_digits()
X, y = digits.data, digits.target

# Split data: 70% train / 30% test, fixed seed for reproducibility.
# NOTE(review): split is not stratified — class proportions in the test
# set may drift slightly from the full dataset; confirm this is intended.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train model: 100-tree random forest on raw pixel intensities.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions on the held-out test set.
y_pred = model.predict(X_test)

# Overall accuracy — the single aggregate number the rest of the
# notebook drills beneath.
accuracy = accuracy_score(y_test, y_pred)
print(f"Overall Accuracy: {accuracy:.3f}")
print(f"\nDataset: {len(X)} samples, {X.shape[1]} features, {len(np.unique(y))} classes")
# Compute and visualize the confusion matrix (rows = true, cols = predicted).
cm = confusion_matrix(y_test, y_pred)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left panel: raw error counts.
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            cbar_kws={'label': 'Count'})
axes[0].set_xlabel('Predicted Label', fontsize=12)
axes[0].set_ylabel('True Label', fontsize=12)
axes[0].set_title('Confusion Matrix (Counts)', fontsize=13, fontweight='bold')

# Right panel: row-normalized, so each row shows per-class recall.
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=axes[1],
            cbar_kws={'label': 'Percentage'})
axes[1].set_xlabel('Predicted Label', fontsize=12)
axes[1].set_ylabel('True Label', fontsize=12)
axes[1].set_title('Confusion Matrix (Normalized)', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

# BUGFIX: restored mojibake-damaged emoji/bullet characters in the output.
print("\n💡 Interpretation:")
print(" • Diagonal = Correct predictions")
print(" • Off-diagonal = Errors")
print(" • Look for systematic confusion patterns")
def find_most_confused_pairs(cm, top_n=5):
    """
    Find the most commonly confused class pairs in a confusion matrix.

    Parameters
    ----------
    cm : array-like of shape (n_classes, n_classes)
        Confusion matrix with true classes on rows, predictions on columns.
    top_n : int, default=5
        Number of off-diagonal pairs to return, largest error count first.

    Returns
    -------
    list of dict
        Each dict holds 'true_class', 'predicted_class', 'count', and
        'percentage' (the share of the true class's samples lost to this
        confusion).
    """
    cm = np.asarray(cm)
    n_classes = cm.shape[0]
    # Hoisted out of the pair loop: total samples per true class.
    row_totals = cm.sum(axis=1)
    confusion_pairs = [
        {
            'true_class': i,
            'predicted_class': j,
            'count': cm[i, j],
            # BUGFIX: guard against an empty row (class absent from y_true),
            # which previously divided by zero.
            'percentage': 100 * cm[i, j] / row_totals[i] if row_totals[i] else 0.0,
        }
        for i in range(n_classes)
        for j in range(n_classes)
        if i != j  # exclude correct predictions on the diagonal
    ]
    # Sort by raw error count, most frequent confusion first.
    confusion_pairs.sort(key=lambda p: p['count'], reverse=True)
    return confusion_pairs[:top_n]
# Report the ten worst true/predicted pairs from the confusion matrix.
# `confused_pairs` is reused later to pick failure-case examples.
confused_pairs = find_most_confused_pairs(cm, top_n=10)

print("\nTop 10 Most Confused Class Pairs:\n")
print(f"{'True':<6} {'Pred':<6} {'Count':<8} {'Percent'}")
print("=" * 40)
for pair in confused_pairs:
    # 'percentage' = share of the true class's samples lost to this confusion.
    print(f"{pair['true_class']:<6} {pair['predicted_class']:<6} "
          f"{pair['count']:<8} {pair['percentage']:.2f}%")
Per-Class Error Analysis: Finding the Weakest Links
Aggregate metrics like macro-F1 average over all classes equally, masking the fact that one or two classes may drag down overall performance. Per-class analysis with classification_report decomposes performance into precision, recall, and F1 for each class individually, revealing exactly where the model struggles. A class with high precision but low recall is being under-predicted (the model is conservative), while low precision with high recall indicates over-prediction (too many false alarms).
Prioritization strategy: sort classes by F1-score and focus improvement efforts on the bottom 20%. Common remedies include collecting more training examples for underperforming classes, applying class-weighted loss functions (\(w_c = \frac{n}{K \cdot n_c}\) for class \(c\)), or using oversampling techniques like SMOTE. The bar chart visualization below highlights underperformers in red, making it immediately clear which classes need attention during the next iteration of model development.
# Detailed per-class metrics as a DataFrame (one row per class).
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()

# Filter only class metrics.
# NOTE(review): relies on 'accuracy', 'macro avg', 'weighted avg' being the
# last three rows of the transposed report — true for current scikit-learn
# output ordering, but dropping by label would be more robust.
class_metrics = report_df[:-3]  # Exclude accuracy, macro avg, weighted avg
class_metrics = class_metrics.sort_values('f1-score')  # worst class first

print("Per-Class Performance (sorted by F1-score):\n")
print(class_metrics.round(3))
# Visualize precision / recall / F1 per class, one panel per metric.
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
metrics = ['precision', 'recall', 'f1-score']
colors = ['blue', 'green', 'orange']

for ax, metric, color in zip(axes, metrics, colors):
    classes = class_metrics.index.astype(str)
    values = class_metrics[metric]
    bars = ax.bar(classes, values, color=color, alpha=0.7, edgecolor='black')

    # Highlight worst performers: recolor bars below the 30th percentile red.
    worst_threshold = values.quantile(0.3)
    for i, (cls, val) in enumerate(zip(classes, values)):
        if val < worst_threshold:
            bars[i].set_color('red')
            bars[i].set_alpha(0.9)

    ax.set_xlabel('Class', fontsize=11)
    ax.set_ylabel(metric.capitalize(), fontsize=11)
    ax.set_title(f'{metric.capitalize()} by Class', fontsize=12, fontweight='bold')
    ax.set_ylim([0, 1.05])
    # Dashed reference line at the mean of this metric across classes.
    ax.axhline(y=values.mean(), color='red', linestyle='--',
               label=f'Mean={values.mean():.3f}')
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()
# Identify the three classes with the lowest F1 — the priority targets
# for the next model iteration.
worst_f1 = class_metrics.nsmallest(3, 'f1-score')

# BUGFIX: restored mojibake-damaged warning emoji in the header string.
print("\n⚠️ Classes Needing Attention (lowest F1-score):")
for idx, row in worst_f1.iterrows():
    print(f" Class {idx}: F1={row['f1-score']:.3f}, "
          f"Precision={row['precision']:.3f}, Recall={row['recall']:.3f}")
Failure Case Analysis: Learning from Mistakes
Examining individual misclassified examples provides qualitative insight that no aggregate metric can capture. By visualizing the actual inputs the model gets wrong — and comparing them to the class the model thought they belonged to — you can identify whether errors stem from genuine ambiguity in the data (an 8 that truly looks like a 1), labeling errors in the dataset, or systematic model blind spots (the model never learned to recognize a particular visual pattern).
Structured failure analysis involves: (1) identifying the most common error pairs from the confusion matrix, (2) pulling specific examples for each pair, (3) examining prediction confidence alongside the visual input, and (4) categorizing root causes (ambiguous input, label noise, missing feature, distribution gap). This process often reveals that 60-80% of errors cluster into 2-3 failure modes, each with a targeted fix — for example, data augmentation for rotation-sensitive errors or additional training data for underrepresented subgroups.
# Find misclassified examples.
errors = y_test != y_pred  # boolean mask over the test set, reused later
error_indices = np.where(errors)[0]
print(f"Total errors: {len(error_indices)} / {len(y_test)} ({100*len(error_indices)/len(y_test):.2f}%)")

# Tabulate the (true, predicted) label pair for every error.
error_analysis = pd.DataFrame({
    'true_label': y_test[error_indices],
    'predicted_label': y_pred[error_indices]
})

# Most common error types: count occurrences of each (true, predicted) pair.
error_types = error_analysis.groupby(['true_label', 'predicted_label']).size()
error_types = error_types.sort_values(ascending=False).head(10)

print("\nMost Common Error Types:\n")
print(error_types)
# Visualize failure cases for the single most confused (true, predicted)
# pair found earlier (e.g. 8s predicted as 1s).
most_confused = confused_pairs[0]
true_class = most_confused['true_class']
pred_class = most_confused['predicted_class']

# Find test samples whose true label is `true_class` but were predicted
# as `pred_class`.
confusion_mask = (y_test == true_class) & (y_pred == pred_class)
confusion_indices = np.where(confusion_mask)[0][:9]  # Get up to 9 examples

if len(confusion_indices) > 0:
    fig, axes = plt.subplots(3, 3, figsize=(10, 10))
    axes = axes.flatten()
    for i, idx in enumerate(confusion_indices):
        if i < 9:
            # Model's probability for the (wrong) predicted class.
            # NOTE(review): this scalar `confidence` is shadowed by the
            # array of the same name in the confidence-analysis cell below.
            proba = model.predict_proba(X_test[idx].reshape(1, -1))[0]
            confidence = proba[pred_class]
            # Plot the digit as its original 8x8 image.
            axes[i].imshow(X_test[idx].reshape(8, 8), cmap='gray')
            axes[i].set_title(f'True: {true_class}, Pred: {pred_class}\nConf: {confidence:.2f}',
                              fontsize=10)
            axes[i].axis('off')
    # Hide unused subplots when fewer than 9 examples exist.
    for i in range(len(confusion_indices), 9):
        axes[i].axis('off')
    plt.suptitle(f'Failure Cases: {true_class} misclassified as {pred_class}',
                 fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    # BUGFIX: restored mojibake-damaged emoji and arrow in the output.
    print(f"\n🔍 Analyzing {true_class} → {pred_class} confusion")
    print(f" These digits might look similar or have ambiguous features")
Confidence Analysis: Does the Model Know What It Does Not Know?
A well-calibrated model's predicted probability should match its actual accuracy: when it says "90% confident," it should be correct 90% of the time. Calibration curves (reliability diagrams) plot predicted confidence against observed accuracy across binned confidence ranges. Points above the diagonal indicate underconfidence (model is better than it thinks), while points below indicate overconfidence (the dangerous case, where the model is wrong but certain).
High-confidence errors are the most dangerous failure mode in production. A model that is uncertain when wrong can be filtered by a confidence threshold — requests below 0.7 confidence get routed to a human reviewer. But a model that is 95% confident on wrong predictions offers no safety net. Random Forests tend to produce poorly calibrated probabilities because individual trees vote discretely; Platt scaling or isotonic regression (via scikit-learn's CalibratedClassifierCV) post-processes the raw probabilities to improve calibration without changing the underlying model's predictions.
# Analyze prediction confidence: the max class probability per test sample.
y_pred_proba = model.predict_proba(X_test)
confidence = np.max(y_pred_proba, axis=1)

# Separate confidence values for correct vs incorrect predictions
# (`errors` is the boolean error mask computed in the earlier cell).
correct_confidence = confidence[~errors]
incorrect_confidence = confidence[errors]

print("Confidence Statistics:\n")
print(f"Correct predictions:")
print(f" Mean: {correct_confidence.mean():.3f}")
print(f" Median: {np.median(correct_confidence):.3f}")
print(f" Min: {correct_confidence.min():.3f}")
print(f"\nIncorrect predictions:")
print(f" Mean: {incorrect_confidence.mean():.3f}")
print(f" Median: {np.median(incorrect_confidence):.3f}")
print(f" Max: {incorrect_confidence.max():.3f}")
# Visualize confidence distributions and a simple reliability diagram.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: overlaid confidence histograms for correct vs incorrect.
axes[0].hist(correct_confidence, bins=20, alpha=0.6, label='Correct',
             color='green', edgecolor='black')
axes[0].hist(incorrect_confidence, bins=20, alpha=0.6, label='Incorrect',
             color='red', edgecolor='black')
axes[0].set_xlabel('Confidence', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Confidence Distribution', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Right panel: accuracy within each of 10 equal-width confidence bins.
confidence_bins = np.linspace(0, 1, 11)
bin_centers = (confidence_bins[:-1] + confidence_bins[1:]) / 2
bin_accuracies = []
last_bin = len(confidence_bins) - 2
for i in range(len(confidence_bins) - 1):
    if i == last_bin:
        # BUGFIX: make the final bin right-inclusive. The original used a
        # strict `<` upper bound everywhere, so samples with confidence
        # exactly 1.0 (common for random forests with unanimous votes)
        # fell into no bin and were silently dropped.
        mask = (confidence >= confidence_bins[i]) & (confidence <= confidence_bins[i+1])
    else:
        mask = (confidence >= confidence_bins[i]) & (confidence < confidence_bins[i+1])
    if mask.sum() > 0:
        bin_acc = (~errors[mask]).mean()
        bin_accuracies.append(bin_acc)
    else:
        # Empty bin: plot as 0 rather than leaving a gap.
        bin_accuracies.append(0)

axes[1].plot(bin_centers, bin_accuracies, 'o-', linewidth=2, markersize=8, color='blue')
axes[1].plot([0, 1], [0, 1], 'r--', label='Perfect calibration', alpha=0.5)
axes[1].set_xlabel('Confidence', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Calibration Curve', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim([0, 1])
axes[1].set_ylim([0, 1])
plt.tight_layout()
plt.show()
# Summarize: is the model overconfident when it is wrong?
# BUGFIX: both output strings were mojibake-damaged, and the '✅' emoji was
# split across two lines, breaking the string literal — restored both.
print("\n💡 Observations:")
if incorrect_confidence.mean() > 0.7:
    # Mean confidence above 0.7 on errors: confidence thresholds cannot
    # be used as a safety filter for this model.
    print(" ⚠️ Model is overconfident in its errors!")
else:
    print(" ✅ Low confidence on errors - model knows when it's uncertain")
Error Analysis Report: Structured Communication of Findings
A systematic error analysis report transforms debugging insights into actionable recommendations for the team. The report below follows a five-section structure used in production ML teams: (1) overall performance baseline, (2) per-class breakdown to identify weak spots, (3) most confused class pairs to guide data collection, (4) confidence analysis to assess deployment risk, and (5) prioritized recommendations with specific next steps.
Why structured reporting matters: ML debugging is iterative, and without documentation, teams repeat the same investigations. Each error analysis report becomes a historical record: "In v2.3, we discovered that class 8 confusion was caused by rotation-variant features. Adding rotation augmentation in v2.4 reduced this error by 40%." The generate_error_analysis_report function below automates this process, producing a standardized report that can be run after every model retraining to track whether known failure modes have been addressed.
def generate_error_analysis_report(y_true, y_pred, y_pred_proba=None):
    """
    Generate a comprehensive, printable error analysis report.

    Sections: (1) overall performance baseline, (2) per-class metrics,
    (3) most confused class pairs, (4) confidence analysis (only when
    probabilities are supplied), and (5) prioritized recommendations.

    Parameters
    ----------
    y_true : array-like of shape (n_samples,)
        Ground-truth class labels.
    y_pred : array-like of shape (n_samples,)
        Predicted class labels.
    y_pred_proba : array-like of shape (n_samples, n_classes), optional
        Predicted class probabilities; enables the confidence section.
    """
    print("=" * 80)
    print("ERROR ANALYSIS REPORT")
    print("=" * 80)

    # 1. Overall performance baseline.
    print("\n1. OVERALL PERFORMANCE")
    print("-" * 40)
    accuracy = accuracy_score(y_true, y_pred)
    print(f"Accuracy: {accuracy:.3f} ({100*accuracy:.1f}%)")
    print(f"Total samples: {len(y_true)}")
    print(f"Correct: {(y_true == y_pred).sum()}")
    print(f"Errors: {(y_true != y_pred).sum()}")

    # 2. Per-class precision / recall / F1.
    print("\n2. PER-CLASS PERFORMANCE")
    print("-" * 40)
    report = classification_report(y_true, y_pred)
    print(report)

    # 3. Most confused (true, predicted) pairs — guides data collection.
    print("\n3. MOST CONFUSED CLASS PAIRS")
    print("-" * 40)
    cm = confusion_matrix(y_true, y_pred)
    confused_pairs = find_most_confused_pairs(cm, top_n=5)
    for i, pair in enumerate(confused_pairs, 1):
        # BUGFIX: restored mojibake-damaged '→' arrow in the output string.
        print(f"{i}. {pair['true_class']} → {pair['predicted_class']}: "
              f"{pair['count']} errors ({pair['percentage']:.1f}%)")

    # 4. Confidence analysis (optional).
    if y_pred_proba is not None:
        print("\n4. CONFIDENCE ANALYSIS")
        print("-" * 40)
        confidence = np.max(y_pred_proba, axis=1)
        errors = y_true != y_pred
        print(f"Average confidence (correct): {confidence[~errors].mean():.3f}")
        print(f"Average confidence (incorrect): {confidence[errors].mean():.3f}")
        # High-confidence errors are the riskiest failure mode in production.
        high_conf_errors = (errors) & (confidence > 0.8)
        print(f"High-confidence errors (>0.8): {high_conf_errors.sum()}")

    # 5. Prioritized recommendations for the three weakest classes.
    print("\n5. RECOMMENDATIONS")
    print("-" * 40)
    report_dict = classification_report(y_true, y_pred, output_dict=True)
    class_metrics = {k: v for k, v in report_dict.items()
                     if k not in ['accuracy', 'macro avg', 'weighted avg']}
    # Three worst-performing classes by F1-score.
    worst_classes = sorted(class_metrics.items(),
                           key=lambda x: x[1]['f1-score'])[:3]
    print("Priority improvements:")
    for i, (cls, metrics) in enumerate(worst_classes, 1):
        print(f"{i}. Class {cls}: F1={metrics['f1-score']:.3f}")
        # BUGFIX: restored mojibake-damaged '→' arrows in the three branches.
        if metrics['precision'] < metrics['recall']:
            print(f" → Improve precision (reduce false positives)")
        elif metrics['recall'] < metrics['precision']:
            print(f" → Improve recall (reduce false negatives)")
        else:
            # precision == recall gives no directional signal; default to
            # collecting more data for the class.
            print(f" → Collect more training data for this class")

    print("\n" + "=" * 80)
    print("END OF REPORT")
    print("=" * 80)
# Generate the report for the digits model using the held-out test
# predictions and probabilities computed in earlier cells.
generate_error_analysis_report(y_test, y_pred, y_pred_proba)
🎯 Key Takeaways
Systematic Approach
Start with overall metrics
Drill down to per-class analysis
Examine specific failure cases
Prioritize improvements
Confusion Matrix Insights
Reveals systematic errors
Shows which classes are confused
Guides data collection efforts
Per-Class Analysis
Identify underperforming classes
Separate precision vs recall issues
Focus resources on problem areas
Confidence Calibration
Well-calibrated models know when they're uncertain
High-confidence errors are dangerous
Use confidence for production filtering
Actionable Recommendations
Low precision → Reduce false positives
Low recall → Reduce false negatives
Class confusion → Collect more diverse data
Overall low performance → Try different model
📋 Error Analysis Checklist
✅ Compute confusion matrix
✅ Calculate per-class metrics
✅ Identify most confused pairs
✅ Examine failure cases visually
✅ Analyze prediction confidence
✅ Categorize error types
✅ Prioritize improvements
✅ Document findings
🎉 Congratulations!
You've completed Phase 16: Debugging & Troubleshooting!
You now know how to:
Debug ML workflows systematically
Diagnose data quality issues
Profile and optimize code
Debug model-specific problems
Perform comprehensive error analysis
Next: Complete the assignment and challenges to solidify your debugging skills! 🚀