Lecture 13: Advice for Applying Machine Learning


From Andrew Ng’s CS229 Lecture 13 (Autumn 2018)

“What I want to do today is share with you advice for applying machine learning… principles for helping you become efficient at how you apply all of these things to solve whatever application problem you might want to work on.” - Andrew Ng

The Challenge

“A lot of today’s material is actually not that mathematical. There’s also some of the hardest material in this class to understand.”

Why hard?

  • Easy to agree with principles intellectually

  • Hard to apply when “in the hot seat” making decisions

  • Example: “Should we collect more data for our class project now or not?”

Three Key Ideas

1. Diagnostics for Debugging Learning Algorithms

  • Your algorithm almost never works the first time

  • Need systematic debugging workflow

2. Error Analysis & Ablative Analysis

  • Error analysis: Understand what’s NOT working

  • Ablative analysis: Understand what IS working

3. Philosophy for Getting Started

  • How to begin a machine learning project

  • Systematic engineering discipline vs “black art”

The Reality of ML Development

“When you implement a learning algorithm for the first time, it almost never works, right? At least not the first time.”

Andrew’s Surprise:

“There was a weekend about a year ago where I implemented Softmax regression on my laptop, and it worked the first time. And even to this day, I still remember that feeling of surprise… I went in to try to find the bug, and there wasn’t a bug. But it’s so rare. I still remember it over a year later.”

From Black Art to Engineering

Old approach:

  • Go to someone with 30 years experience

  • They give magical advice that somehow works

  • “Black magic” based on intuition

New approach:

“Turn that black magic, that art into much more refined, so that you can much more systematically make these decisions yourself”

Focus: Building Applications that Work

Note:

  • Today’s material focused on building stuff that works

  • Some advice won’t apply to novel ML research

  • Not the best approach for writing research papers

  • Goal: Help you build applications successfully

The Debugging Workflow

Typical scenario:

  1. Have an idea for ML application

  2. Implement something

  3. It won’t work as well as you hoped

  4. Key question: What do you do next?

“When I work on machine learning algorithm, that’s actually most of my workflow. We usually have something implemented… it doesn’t work well… what do you do?”

Coming Up

  • Diagnostics: Systematic debugging

  • Error analysis: Finding what’s broken

  • Ablative analysis: Understanding what works

  • Decision frameworks: Data collection, feature engineering, model selection

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded!")

13.1 The Bad Debugging Strategy: Random Approach

Motivating Example: Spam Classifier

Scenario:

  • Building anti-spam classifier

  • Using 100 carefully chosen words as features

  • Implemented Bayesian logistic regression (with regularization)

  • Result: 20% test error (unacceptably high - 1 in 5 mistakes!)

What Many Teams Do (The Wrong Way)

“What many teams would do is say one of these ideas, kind of at random. It depends on what they happen to read the night before, right, about something.”

Random options people try:

  1. 📊 Get more training examples

  2. ⬇️ Try smaller set of features

  3. ⬆️ Try larger set of features

  4. ✉️ Add email header features

  5. 🔄 Run gradient descent for more iterations

  6. ⚙️ Switch from GD to Newton’s method

  7. 🎛️ Try different value for λ (regularization)

  8. 🔀 Switch to completely different algorithm (SVM, neural net)

“Someone will pick one of these ideas, kind of at random… or the most opinionated person will pick one of these things at random and do that.”

Problem: Could spend days/weeks on wrong approach!

The Better Way: Systematic Analysis

“If you actually sit down and brainstorm a list of things you could try, and then try to evaluate the different options, you’re already ahead of many teams.”

Steps:

  1. ✅ Brainstorm list of possible improvements

  2. ✅ Run diagnostics to identify the problem

  3. ✅ Pick approach that fixes the diagnosed problem

  4. ✅ Implement and measure impact

13.2 Bias vs Variance Diagnostic

“The most common diagnostic I end up using in pretty much every single machine learning project is a bias versus variance diagnostic.”

Quick Review

High Bias (Underfitting):

  • Model too simple

  • Doesn’t fit training data well

  • Example: Fitting straight line to curved data

High Variance (Overfitting):

  • Model too complex

  • Fits training data too well

  • Example: 10th degree polynomial wiggling through 5 points
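The straight-line vs. high-degree-polynomial contrast can be shown in a few lines (a minimal sketch on synthetic data; a degree-4 fit is used because it exactly interpolates 5 points):

```python
import numpy as np

# Five noisy points sampled from a quadratic trend (synthetic data)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 5)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.05, size=5)

sse = {}
for degree in (0, 4):
    coeffs = np.polyfit(x, y, degree)               # least-squares polynomial fit
    sse[degree] = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training SSE = {sse[degree]:.4f}")

# degree 0 (a constant) cannot follow the trend: high bias.
# degree 4 passes exactly through all 5 points: near-zero training error,
# but it wiggles wildly between them: high variance.
```

Near-zero training error for the flexible model is exactly the symptom that makes training error alone a misleading measure of quality.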

“I’ve had former PhD students that learned about bias and variance when they’re doing their PhD and sometimes even a couple of years after they’ve graduated… they actually tell me that their understanding of bias and variance continue to deepen for many years.”

Learning Curves: The Diagnostic Tool

Setup:

  • X-axis: Number of training examples (m)

  • Y-axis: Error (training error and dev/test error)

  • Horizontal line: Desired performance level

Two curves to plot:

  1. Training Error (blue):

    • Starts near 0 (easy to fit small dataset)

    • Increases as m increases (harder to fit more data)

  2. Dev/Test Error (green):

    • Starts high

    • Decreases as m increases (more data → better generalization)
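The two curves described above can be generated with scikit-learn's `learning_curve` (a sketch; the logistic regression model, the digits dataset, and the 3% target line are illustrative choices, not from the lecture):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Train/validation scores at increasing training-set sizes (5-fold CV)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=5000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

train_err = 1 - train_scores.mean(axis=1)   # blue curve: rises as m grows
val_err = 1 - val_scores.mean(axis=1)       # green curve: falls as m grows

plt.plot(sizes, train_err, 'b-o', label='Training error')
plt.plot(sizes, val_err, 'g-o', label='Dev (CV) error')
plt.axhline(0.03, ls='--', c='gray', label='Desired performance')
plt.xlabel('Training examples (m)')
plt.ylabel('Error')
plt.legend()
plt.show()
```

Reading the plot against the diagnoses below: a wide, persistent gap between the two curves signals variance; two high curves hugging each other signal bias.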

High Variance Diagnosis

Signals:

  1. ⚠️ Large gap between training and dev error

    • Training error much lower than dev error

    • Model memorizing training data, not generalizing

  2. 📉 Dev error still decreasing

    • Green curve hasn’t flattened

    • More data could help

“If you see a learning curve like this, this is a strong sign that you have a variance problem.”

Visual pattern:

  • Training error: Low, flat

  • Dev error: High, still declining

  • Gap: Large and persistent

High Bias Diagnosis

Signals:

  1. Not even doing well on training set

    • Training error above desired performance

    • Model can’t fit data even when it has seen it!

  2. 📏 Small gap between training and dev error

    • Both errors are similar

    • Both are unacceptably high

“Even on the training set you’re not achieving your desired level of performance. It’s like this algorithm has seen these examples and even for examples it’s seen, it’s not doing as well as you were hoping. So clearly the algorithm’s not fitting the data well enough.”

Key insight:

“No matter how much more data you get, no matter how far you extrapolate to the right, the error is never going to come down to your desired level of performance.”

Visual pattern:

  • Training error: High, near dev error

  • Dev error: High, close to training error

  • Gap: Small

  • Both: Never reach desired performance

Theoretical Foundation

Dev error decay rate:

“Learning theory suggests that in most cases, the green curve should decay as 1 over square root of m.”

Bayes Error:

  • Irreducible error due to noise in data

  • Best possible performance anyone could achieve

  • Example: Blurry medical images where diagnosis is ambiguous

“Learning algorithm’s error don’t always decay to zero… as M increases, it will decay at roughly a rate of 1/√m toward that baseline error, which is called Bayes error.”
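The 1/√m decay toward Bayes error can be sketched numerically (a toy calculation; the 5% Bayes error and the constant 0.8 are illustrative assumptions, not values from the lecture):

```python
import numpy as np

bayes_error = 0.05           # assumed irreducible (Bayes) error
c = 0.8                      # assumed constant in the c/sqrt(m) term

m = np.array([100, 400, 1600, 6400])
dev_error = bayes_error + c / np.sqrt(m)

for mi, e in zip(m, dev_error):
    print(f"m = {mi:5d}: expected dev error = {e:.3f}")

# Each 4x increase in m only halves the gap to Bayes error:
# diminishing returns from data alone, and the curve never drops below 0.05
```

This is why extrapolating the dev-error curve matters: if it is flattening well above your target, more data will not get you there.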

# Load digits
digits = load_digits()
X = digits.data
y = digits.target

# Split: 60% train, 20% dev, 20% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_dev, X_test, y_dev, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(f"Train: {len(X_train)} samples")
print(f"Dev: {len(X_dev)} samples")
print(f"Test: {len(X_test)} samples")

# Train baseline model
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)

# Evaluate
train_acc = clf.score(X_train, y_train)
dev_acc = clf.score(X_dev, y_dev)
test_acc = clf.score(X_test, y_test)

print(f"\nBaseline Performance:")
print(f"Train accuracy: {train_acc:.3f}")
print(f"Dev accuracy: {dev_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")

# Error analysis on dev set
y_pred_dev = clf.predict(X_dev)
errors = y_pred_dev != y_dev
error_indices = np.where(errors)[0]

print(f"\nTotal errors on dev set: {np.sum(errors)}")

# Visualize errors
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    if i < len(error_indices):
        idx = error_indices[i]
        ax.imshow(X_dev[idx].reshape(8, 8), cmap='gray')
        ax.set_title(f'True: {y_dev[idx]}\nPred: {y_pred_dev[idx]}', 
                    fontsize=9, color='red')
    ax.axis('off')

plt.suptitle('Error Analysis: Misclassified Digits', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

13.3 Confusion Matrix Analysis

What: Understanding which classes your model confuses

A confusion matrix is a \(K \times K\) table where entry \((i, j)\) counts how many examples of true class \(i\) were predicted as class \(j\). Diagonal entries are correct predictions; off-diagonal entries reveal systematic misclassification patterns. For digit recognition, common confusions include 8 vs 3 (similar shapes) and 4 vs 9 (similar top halves).

Why: Error analysis is more valuable than hyperparameter tuning

As Andrew Ng emphasizes, most teams jump straight to trying different algorithms or tuning hyperparameters when their model underperforms. A far more productive first step is examining what the model gets wrong. The confusion matrix reveals whether errors are concentrated on a few class pairs (suggesting targeted improvements like collecting more examples of confused classes) or spread uniformly (suggesting a fundamental model capacity issue). This analysis turns the “black art” of ML debugging into a systematic engineering discipline.

# Confusion matrix
cm = confusion_matrix(y_dev, y_pred_dev)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar_kws={'label': 'Count'})
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('True', fontsize=12)
plt.title('Confusion Matrix - Dev Set', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Find most confused pairs
off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)
confused_pairs = []

for i in range(10):
    for j in range(10):
        if off_diagonal[i, j] > 0:
            confused_pairs.append((i, j, off_diagonal[i, j]))

confused_pairs.sort(key=lambda x: x[2], reverse=True)

print("\nMost Confused Digit Pairs:")
for true_label, pred_label, count in confused_pairs[:5]:
    print(f"  {true_label} → {pred_label}: {count} errors")

13.4 Diagnostic Decisions

What: A systematic decision tree for ML debugging

Based on training error and the train-dev gap, we can automatically classify the model’s problem as high bias, high variance, or a good fit, and recommend targeted fixes. High training error means the model is too simple (add features, reduce regularization, use a more expressive architecture). A large train-dev gap means the model is overfitting (get more data, add regularization, simplify the model).

Why: Replacing intuition with a repeatable process

The debugging decision tree encodes the core lessons of bias-variance analysis into a programmatic workflow. Rather than relying on intuition or experience, even a beginner can follow this diagnostic to identify the bottleneck and choose the right intervention. In industry, this kind of systematic approach is what distinguishes teams that ship reliable ML systems from those that spend weeks trying random ideas.

# Decision tree for debugging
print("ML Debugging Decision Tree:\n")

train_error = 1 - train_acc
dev_error = 1 - dev_acc
gap = dev_error - train_error

print(f"Train error: {train_error:.3f}")
print(f"Dev error: {dev_error:.3f}")
print(f"Gap: {gap:.3f}\n")

# Diagnosis
if train_error > 0.1:  # High train error
    print("DIAGNOSIS: High Bias (Underfitting)")
    print("RECOMMENDATIONS:")
    print("  1. Try more complex model")
    print("  2. Add more features")
    print("  3. Train longer")
    print("  4. Reduce regularization")
elif gap > 0.05:  # Large train-dev gap
    print("DIAGNOSIS: High Variance (Overfitting)")
    print("RECOMMENDATIONS:")
    print("  1. Get more training data")
    print("  2. Regularization (L2, dropout)")
    print("  3. Reduce model complexity")
    print("  4. Data augmentation")
else:
    print("DIAGNOSIS: Good Fit")
    print("RECOMMENDATIONS:")
    print("  1. Error analysis for remaining errors")
    print("  2. Collect specific data for hard cases")
    print("  3. Ensemble methods")

13.5 Iterative Improvement

What: Demonstrating the build-diagnose-improve loop

We start with a baseline Random Forest classifier on the digits dataset and iteratively improve it: first by increasing the number of estimators (reducing variance), then by tuning max_depth (controlling complexity), and finally by adjusting min_samples_split (further regularization). At each step, we track both training and dev accuracy to monitor the bias-variance tradeoff.

Why: The standard ML development workflow

Real ML development is never a single shot – it is an iterative cycle of building a baseline, diagnosing its weaknesses via learning curves and error analysis, applying a targeted fix, and measuring the impact. Tracking train and dev accuracy at each iteration ensures you can distinguish genuine improvements from overfitting. This disciplined approach, which Andrew Ng calls “going from black art to engineering,” is what separates successful ML projects from failed ones.

# Simulate iterative improvement
iterations = ['Baseline', '+ More Trees', '+ Max Depth', '+ Min Samples']
train_accs = [train_acc]
dev_accs = [dev_acc]

# Iteration 1: More trees
clf1 = RandomForestClassifier(n_estimators=100, random_state=42)
clf1.fit(X_train, y_train)
train_accs.append(clf1.score(X_train, y_train))
dev_accs.append(clf1.score(X_dev, y_dev))

# Iteration 2: Tune max_depth
clf2 = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=42)
clf2.fit(X_train, y_train)
train_accs.append(clf2.score(X_train, y_train))
dev_accs.append(clf2.score(X_dev, y_dev))

# Iteration 3: Tune min_samples_split
clf3 = RandomForestClassifier(n_estimators=100, max_depth=15, min_samples_split=5, random_state=42)
clf3.fit(X_train, y_train)
train_accs.append(clf3.score(X_train, y_train))
dev_accs.append(clf3.score(X_dev, y_dev))

# Plot progress
x = np.arange(len(iterations))
width = 0.35

fig, ax = plt.subplots(figsize=(12, 7))
ax.bar(x - width/2, train_accs, width, label='Train Accuracy', alpha=0.8)
ax.bar(x + width/2, dev_accs, width, label='Dev Accuracy', alpha=0.8)

ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('Iterative Model Improvement', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(iterations, rotation=15, ha='right')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\nProgress Summary:")
for i, iter_name in enumerate(iterations):
    print(f"{iter_name}: Train={train_accs[i]:.3f}, Dev={dev_accs[i]:.3f}, Gap={train_accs[i]-dev_accs[i]:.3f}")

13.6 Final Evaluation

What: Using the test set exactly once

After all hyperparameter tuning and model selection is done on the dev set, we evaluate the best model on the held-out test set a single time. The test set provides an unbiased estimate of how the model will perform on truly unseen data.

Why: The sacred rule of ML evaluation

If you evaluate on the test set multiple times and use the results to guide further development, the test set becomes a de facto dev set and your reported performance will be optimistically biased. This is why Andrew Ng calls the test set “sacred” – it must be touched only once, at the very end. In production ML systems, this discipline is enforced through proper data pipelines and evaluation protocols. Any reported metric that was obtained by peeking at the test set is unreliable.

# Use best model on test set (ONLY ONCE!)
best_model = clf3
test_acc_final = best_model.score(X_test, y_test)
y_pred_test = best_model.predict(X_test)

print("FINAL TEST SET EVALUATION:")
print(f"Test Accuracy: {test_acc_final:.3f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred_test))

print("\n" + "="*60)
print("IMPORTANT: Test set used only once for final evaluation!")
print("Never tune hyperparameters on test set.")
print("="*60)

Key Takeaways

1. Data Splits

  • Train: Fit model parameters

  • Dev/Validation: Tune hyperparameters, select model

  • Test: Final evaluation ONLY (never touch during development)

  • Typical split: 60/20/20 or 80/10/10

2. Error Analysis

  • Manually examine dev set errors

  • Categorize error types

  • Prioritize by frequency

  • Find patterns (e.g., 8s confused with 3s)
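The categorize-and-prioritize step often amounts to tallying hand-labeled error categories and sorting by count (a minimal sketch with `collections.Counter`; the categories and counts here are hypothetical):

```python
from collections import Counter

# Hypothetical hand-labeled categories for 20 dev-set errors
error_categories = (
    ['8 vs 3'] * 7 + ['4 vs 9'] * 5 + ['1 vs 7'] * 3 +
    ['blurry image'] * 3 + ['mislabeled'] * 2
)

# Prioritize by frequency: the biggest bucket bounds the biggest possible win
for category, count in Counter(error_categories).most_common():
    share = count / len(error_categories)
    print(f"{category:14s} {count:2d} errors ({share:.0%})")
```

The share column is the point: even a perfect fix for a 10% bucket moves overall error by at most 10% of its current value, so start at the top of the list.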

3. Debugging Checklist

  1. High train error? → Bigger model, more features

  2. Large train-dev gap? → More data, regularization

  3. Large dev-test gap? → Better dev set

  4. High dev error? → Error analysis

4. Iterative Process

  1. Start with simple baseline

  2. Diagnose bottleneck (bias vs variance)

  3. Try focused improvement

  4. Evaluate on dev set

  5. Repeat

5. Common Mistakes

  • Tuning on test set

  • Optimizing wrong metric

  • Not doing error analysis

  • Premature optimization

  • Ignoring train-dev gap

6. Best Practices

  • Single number evaluation metric

  • Establish baseline quickly

  • Iterate rapidly

  • Document experiments

  • Test set is sacred
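One common choice for a single-number evaluation metric on a multi-class problem is macro-averaged F1, which collapses per-class precision and recall into one figure you can rank experiments by (a sketch with `sklearn.metrics.f1_score`; the labels below are toy data):

```python
from sklearn.metrics import f1_score

# Toy dev-set labels vs predictions (illustrative only)
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0, 2]

# 'macro' averages the per-class F1 scores, weighting every class equally
macro_f1 = f1_score(y_true, y_pred, average='macro')
print(f"Macro F1: {macro_f1:.3f}")
```

With a single number, "is experiment B better than experiment A?" has an unambiguous answer, which is what makes rapid iteration possible.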

References

  1. Andrew Ng’s “Machine Learning Yearning”

  2. “Deep Learning” - Goodfellow et al., Chapter 11

  3. CS229 Lecture Notes

Next: Lecture 14: Recommender Systems