Lecture 5 & 6: Generative Learning Algorithms

📹 Watch Lecture 5 | Watch Lecture 6

From Andrew Ng’s CS229 Lectures 5-6

Discriminative vs Generative Learning

From Lecture 5:

“All of the learning algorithms we’ve been learning about so far are called discriminative learning algorithms. Today, I’d like to share with you how generative learning algorithms work.” - Andrew Ng

Discriminative Algorithms (e.g., Logistic Regression):

  • “Look at two classes and try to find the separation”

  • Search for a decision boundary that separates positive and negative examples

  • Directly model P(y|x) or decision boundary

  • Example: “Use gradient descent to search for a line that separates the positive-negative examples”

Generative Algorithms (e.g., GDA, Naive Bayes):

  • “Rather than looking at two classes and trying to find the separation, instead the algorithm looks at the classes one at a time”

  • First: “Look at all malignant tumors and try to build a model for what malignant tumors look like”

  • Second: “Look at all benign tumors and try to build a model for what benign tumors look like”

  • Then: Use Bayes’ rule to classify new examples

  • Model P(x|y) and P(y), then compute P(y|x)

Bayes’ Rule Framework

From the lecture:

\[P(y=1|x) = \frac{P(x|y=1) P(y=1)}{P(x)}\]

Where:

  • \(P(x|y)\) = “What do features look like given the class?” (generative model)

  • \(P(y)\) = Class prior (before seeing features)

  • \(P(x)\) = Evidence (can ignore for classification)

Key Insight: “If you learn P(x|y) and P(y), you can plug them into Bayes’ rule to calculate P(y=1|x)”
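The plug-in step above is just arithmetic. A tiny numeric sketch with made-up tumor numbers (the densities and prior below are hypothetical, purely for illustration):

```python
# Hypothetical numbers: density of the features under each class model,
# plus a class prior (all values invented for this illustration)
p_x_given_y1 = 0.30   # P(x | y=1): "malignant" model evaluated at x
p_x_given_y0 = 0.05   # P(x | y=0): "benign" model evaluated at x
p_y1 = 0.2            # class prior P(y=1)

# Bayes' rule: P(y=1|x) = P(x|y=1)P(y=1) / P(x),
# where P(x) = P(x|y=1)P(y=1) + P(x|y=0)P(y=0)
evidence = p_x_given_y1 * p_y1 + p_x_given_y0 * (1 - p_y1)
posterior = p_x_given_y1 * p_y1 / evidence
print(f"P(y=1|x) = {posterior:.3f}")  # 0.06 / 0.10 = 0.600
```

Note that the evidence \(P(x)\) only rescales both classes by the same constant, which is why it can be ignored when all you need is the argmax.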

Two Main Algorithms Today

  1. Gaussian Discriminant Analysis (GDA)

    • For continuous-valued features

    • “Used for things like tumor classification”

    • Assumes features are Gaussian distributed

  2. Naive Bayes

    • For discrete features

    • “Used for building email spam classifiers”

    • “Twitter sentiment analysis or something”

Setup

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import multivariate_normal
from sklearn.datasets import make_classification, fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded successfully!")

4.1 The Multivariate Gaussian Distribution

From Lecture 5:

“How many of you are familiar with the multivariate Gaussian? Like half of you? The Gaussian is this familiar bell-shaped curve. A multivariate Gaussian is the generalization of this familiar bell-shaped curve to vector-valued random variables.” - Andrew Ng

Definition: If \(z \sim \mathcal{N}(\mu, \Sigma)\), then:

\[p(z) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(z-\mu)^T\Sigma^{-1}(z-\mu)\right)\]

Where:

  • \(z \in \mathbb{R}^n\) is the random vector

  • \(\mu \in \mathbb{R}^n\) is the mean vector: \(\mathbb{E}[z] = \mu\)

  • \(\Sigma \in \mathbb{R}^{n \times n}\) is the covariance matrix: \(\text{Cov}(z) = \mathbb{E}[(z-\mu)(z-\mu)^T] = \Sigma\)

From the lecture: “This is one of those formulas that almost no one memorizes when starting machine learning. Just look it up every time you need it!”
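Since the formula isn't worth memorizing, it is worth verifying once. A quick sketch evaluating the density formula above by hand and checking it against `scipy.stats.multivariate_normal` (the 2-D mean, covariance, and query point are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary 2-D example: mean vector, covariance matrix, and a query point
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
z = np.array([0.5, 0.0])

# Density from the formula: (2π)^(-n/2) |Σ|^(-1/2) exp(-½ (z-μ)ᵀ Σ⁻¹ (z-μ))
n = len(mu)
diff = z - mu
p_formula = (np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)
             / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))))

# Same density from scipy's implementation
p_scipy = multivariate_normal(mu, Sigma).pdf(z)
print(p_formula, p_scipy)  # agree to floating-point precision
```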

Understanding the Covariance Matrix

Diagonal elements: Control variance in each dimension

  • Large values → wide spread

  • Small values → narrow spread

Off-diagonal elements: Control correlation between dimensions

  • Zero → uncorrelated

  • Positive → positive correlation

  • Negative → negative correlation

From lecture visualization examples:

from scipy.stats import multivariate_normal

print("📊 Multivariate Gaussian Distribution Visualization")
print("="*70)
print("From CS229 Lecture 5: Understanding μ and Σ parameters\n")

# Create grid for visualization
x1 = np.linspace(-5, 5, 100)
x2 = np.linspace(-5, 5, 100)
X1, X2 = np.meshgrid(x1, x2)
pos = np.dstack((X1, X2))

# Different covariance matrices from lecture examples
covariances = [
    ("Standard Gaussian\nΣ = I", np.array([[1, 0], [0, 1]])),
    ("Smaller variance\nΣ = 0.6I", np.array([[0.6, 0], [0, 0.6]])),
    ("Larger variance\nΣ = 2I", np.array([[2, 0], [0, 2]])),
    ("Positive correlation\nΣ = [[1, 0.8], [0.8, 1]]", np.array([[1, 0.8], [0.8, 1]])),
    ("Negative correlation\nΣ = [[1, -0.8], [-0.8, 1]]", np.array([[1, -0.8], [-0.8, 1]])),
    ("Different variances\nΣ = [[2, 0], [0, 0.5]]", np.array([[2, 0], [0, 0.5]]))
]

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

mu = np.array([0, 0])  # Mean at origin

for idx, (title, sigma) in enumerate(covariances):
    # Create multivariate Gaussian
    rv = multivariate_normal(mu, sigma)
    Z = rv.pdf(pos)
    
    # Plot contours
    axes[idx].contour(X1, X2, Z, levels=8, cmap='viridis', alpha=0.6)
    axes[idx].contourf(X1, X2, Z, levels=20, cmap='viridis', alpha=0.3)
    
    # Mark center
    axes[idx].plot(0, 0, 'r*', markersize=15, label='μ = [0, 0]')
    
    axes[idx].set_xlabel('z₁', fontsize=11)
    axes[idx].set_ylabel('z₂', fontsize=11)
    axes[idx].set_title(title, fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)
    axes[idx].set_xlim([-5, 5])
    axes[idx].set_ylim([-5, 5])
    axes[idx].set_aspect('equal')

plt.suptitle('Effect of Covariance Matrix Σ on Gaussian Distribution', 
             fontsize=14, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

print("\n💡 Key Observations from Andrew Ng's lecture:")
print("   1. Identity covariance → Circular/round contours")
print("   2. Larger diagonal → Wider spread")
print("   3. Smaller diagonal → Narrower, taller distribution")
print("   4. Positive off-diagonal → Ellipse tilted positive")
print("   5. Negative off-diagonal → Ellipse tilted negative")
print("   6. Different diagonal values → Ellipse stretched in one direction")

4.2 Gaussian Discriminant Analysis (GDA)

Model:

  • \(y \sim \text{Bernoulli}(\phi)\)

  • \(x|y=0 \sim \mathcal{N}(\mu_0, \Sigma)\)

  • \(x|y=1 \sim \mathcal{N}(\mu_1, \Sigma)\)

Parameters to learn: \(\phi, \mu_0, \mu_1, \Sigma\)

Maximum Likelihood Estimates:

\[\phi = \frac{1}{m}\sum_{i=1}^m \mathbb{1}\{y^{(i)} = 1\}\]

\[\mu_0 = \frac{\sum_{i=1}^m \mathbb{1}\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^m \mathbb{1}\{y^{(i)} = 0\}}\]

\[\mu_1 = \frac{\sum_{i=1}^m \mathbb{1}\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^m \mathbb{1}\{y^{(i)} = 1\}}\]

\[\Sigma = \frac{1}{m}\sum_{i=1}^m (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T\]

class GDA:
    """
    Gaussian Discriminant Analysis classifier.
    """
    def __init__(self):
        self.phi = None
        self.mu0 = None
        self.mu1 = None
        self.sigma = None
    
    def fit(self, X, y):
        """
        Fit GDA model using maximum likelihood.
        """
        m, n = X.shape
        
        # Separate data by class
        X0 = X[y == 0]
        X1 = X[y == 1]
        
        # Estimate parameters
        self.phi = np.mean(y)  # P(y=1)
        self.mu0 = X0.mean(axis=0)
        self.mu1 = X1.mean(axis=0)
        
        # Shared covariance
        self.sigma = np.zeros((n, n))
        for i in range(m):
            mu = self.mu1 if y[i] == 1 else self.mu0
            diff = (X[i] - mu).reshape(-1, 1)
            self.sigma += diff @ diff.T
        self.sigma /= m
        
        return self
    
    def predict_proba(self, X):
        """
        Compute P(y=1|x) using Bayes' rule.
        """
        # P(x|y=0) and P(x|y=1)
        rv0 = multivariate_normal(self.mu0, self.sigma)
        rv1 = multivariate_normal(self.mu1, self.sigma)
        
        p_x_y0 = rv0.pdf(X)
        p_x_y1 = rv1.pdf(X)
        
        # P(y=0) and P(y=1)
        p_y0 = 1 - self.phi
        p_y1 = self.phi
        
        # Bayes' rule: P(y=1|x) = P(x|y=1)P(y=1) / P(x)
        numerator = p_x_y1 * p_y1
        denominator = p_x_y0 * p_y0 + p_x_y1 * p_y1
        
        return numerator / denominator
    
    def predict(self, X):
        """
        Predict class labels.
        """
        proba = self.predict_proba(X)
        return (proba >= 0.5).astype(int)

# Generate a simple 2-feature dataset to train on (X and y were not
# defined earlier in this notebook; make_classification is imported above)
X, y = make_classification(n_samples=500, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           random_state=42)

# Train GDA on our data
gda = GDA()
gda.fit(X, y)

# Make predictions
y_pred_gda = gda.predict(X)
accuracy_gda = accuracy_score(y, y_pred_gda)

print("GDA Model Parameters:")
print(f"φ (P(y=1)): {gda.phi:.4f}")
print(f"μ₀: {gda.mu0}")
print(f"μ₁: {gda.mu1}")
print(f"Σ shape: {gda.sigma.shape}")
print(f"\nTraining Accuracy: {accuracy_gda:.4f}")

4.3 GDA vs Logistic Regression

Key insight: If GDA assumptions are correct (data is Gaussian), GDA is more efficient (needs less data). However, logistic regression is more robust to violations of assumptions.

from sklearn.linear_model import LogisticRegression

# Compare GDA with Logistic Regression
# Vary training set size
train_sizes = [20, 50, 100, 200, 400]
gda_scores = []
lr_scores = []

for size in train_sizes:
    # Sample subset
    indices = np.random.choice(len(X), size=size, replace=False)
    X_train_sub = X[indices]
    y_train_sub = y[indices]
    
    # Create test set (rest of data)
    test_indices = np.setdiff1d(np.arange(len(X)), indices)
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    # Train GDA
    gda_temp = GDA()
    gda_temp.fit(X_train_sub, y_train_sub)
    gda_acc = accuracy_score(y_test, gda_temp.predict(X_test))
    gda_scores.append(gda_acc)
    
    # Train Logistic Regression
    lr = LogisticRegression()
    lr.fit(X_train_sub, y_train_sub)
    lr_acc = accuracy_score(y_test, lr.predict(X_test))
    lr_scores.append(lr_acc)

# Plot comparison
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, gda_scores, 'o-', linewidth=2, markersize=8, 
         label='GDA', color='blue')
plt.plot(train_sizes, lr_scores, 's-', linewidth=2, markersize=8, 
         label='Logistic Regression', color='red')
plt.xlabel('Training Set Size', fontsize=12)
plt.ylabel('Test Accuracy', fontsize=12)
plt.title('GDA vs Logistic Regression: Sample Efficiency', 
          fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nKey Observations:")
print("1. With Gaussian data, GDA often better with small samples")
print("2. Both converge to similar performance with lots of data")
print("3. Logistic regression more robust to non-Gaussian data")

4.4 Naive Bayes Classifier

Naive Bayes assumption: Features are conditionally independent given the class:

\[P(x_1, ..., x_n | y) = \prod_{i=1}^n P(x_i | y)\]

This is “naive” because features are rarely truly independent, but it works surprisingly well in practice!

Classification rule:

\[\hat{y} = \arg\max_y\; P(y) \prod_{i=1}^n P(x_i | y)\]
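The rule is usually evaluated in log space to avoid underflow from multiplying many small probabilities. A hand-computed sketch with a 3-word vocabulary (all per-word probabilities and priors below are invented for illustration):

```python
import numpy as np

# Hypothetical per-word probabilities P(word | class) for a 3-word
# vocabulary, plus priors P(ham)=0.6, P(spam)=0.4 (all numbers made up)
p_word_given_spam = np.array([0.5, 0.3, 0.2])
p_word_given_ham  = np.array([0.1, 0.3, 0.6])
log_prior = np.log(np.array([0.6, 0.4]))  # [ham, spam]

x = np.array([2, 1, 0])  # word counts in a new document

# log P(y) + Σ_i x_i · log P(word_i | y), compared across the two classes
scores = log_prior + np.array([x @ np.log(p_word_given_ham),
                               x @ np.log(p_word_given_spam)])
print("predicted class:", ["ham", "spam"][int(np.argmax(scores))])  # spam
```

The `MultinomialNaiveBayes` class implemented later in this notebook does exactly this computation, vectorized over documents.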

# Simple example: Spam detection with word features
# Create synthetic email dataset
np.random.seed(42)

# Vocabulary
spam_words = ['free', 'money', 'winner', 'click', 'prize']
ham_words = ['meeting', 'project', 'report', 'schedule', 'team']
common_words = ['the', 'is', 'and', 'to', 'of']

vocab = spam_words + ham_words + common_words
n_vocab = len(vocab)

# Generate emails
n_emails = 1000
n_spam = 400
n_ham = 600

def generate_email(is_spam, n_words=20):
    """Generate synthetic email as word counts."""
    counts = np.zeros(n_vocab)
    
    for _ in range(n_words):
        if is_spam:
            # Spam: more likely to use spam words
            word_probs = [0.3] * 5 + [0.05] * 5 + [0.1] * 5
        else:
            # Ham: more likely to use ham words  
            word_probs = [0.05] * 5 + [0.3] * 5 + [0.1] * 5
        
        word_probs = np.array(word_probs)
        word_probs /= word_probs.sum()
        
        word_idx = np.random.choice(n_vocab, p=word_probs)
        counts[word_idx] += 1
    
    return counts

# Generate dataset
X_emails = []
y_emails = []

for _ in range(n_spam):
    X_emails.append(generate_email(is_spam=True))
    y_emails.append(1)

for _ in range(n_ham):
    X_emails.append(generate_email(is_spam=False))
    y_emails.append(0)

X_emails = np.array(X_emails)
y_emails = np.array(y_emails)

# Split data
X_train_email, X_test_email, y_train_email, y_test_email = train_test_split(
    X_emails, y_emails, test_size=0.2, random_state=42, stratify=y_emails
)

print(f"Email Dataset:")
print(f"  Vocabulary size: {n_vocab} words")
print(f"  Training emails: {len(X_train_email)}")
print(f"  Test emails: {len(X_test_email)}")
print(f"  Spam ratio: {y_emails.mean():.2%}")

4.5 Multinomial Naive Bayes

For text/count data, we use Multinomial distribution:

\[P(x|y) = \frac{(\sum_i x_i)!}{\prod_i x_i!} \prod_i \phi_{i|y}^{x_i}\]

Maximum Likelihood Estimates (with Laplace smoothing):

\[\phi_{i|y=1} = \frac{\sum_{j:y^{(j)}=1} x_i^{(j)} + \alpha}{\sum_{j:y^{(j)}=1} \sum_{k} x_k^{(j)} + \alpha n}\]

class MultinomialNaiveBayes:
    """
    Multinomial Naive Bayes for count data (e.g., text).
    """
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing parameter
        self.class_log_prior = None
        self.feature_log_prob = None
        self.classes = None
    
    def fit(self, X, y):
        """
        Fit Multinomial Naive Bayes.
        """
        self.classes = np.unique(y)
        n_classes = len(self.classes)
        n_features = X.shape[1]
        
        # Class priors: P(y)
        self.class_log_prior = np.zeros(n_classes)
        for i, c in enumerate(self.classes):
            self.class_log_prior[i] = np.log(np.mean(y == c))
        
        # Feature probabilities: P(x_i|y) with Laplace smoothing
        self.feature_log_prob = np.zeros((n_classes, n_features))
        
        for i, c in enumerate(self.classes):
            X_c = X[y == c]
            # Count of each word in class c
            word_counts = X_c.sum(axis=0) + self.alpha
            # Total words in class c
            total_count = word_counts.sum()
            # Log probability
            self.feature_log_prob[i] = np.log(word_counts / total_count)
        
        return self
    
    def predict_log_proba(self, X):
        """
        Compute log P(y|x) for each class.
        """
        # log P(y|x) ∝ log P(y) + Σ x_i log P(x_i|y)
        log_proba = X @ self.feature_log_prob.T + self.class_log_prior
        return log_proba
    
    def predict(self, X):
        """
        Predict class with highest posterior probability.
        """
        log_proba = self.predict_log_proba(X)
        return self.classes[np.argmax(log_proba, axis=1)]

# Train Multinomial Naive Bayes
mnb = MultinomialNaiveBayes(alpha=1.0)
mnb.fit(X_train_email, y_train_email)

# Predictions
y_pred_train = mnb.predict(X_train_email)
y_pred_test = mnb.predict(X_test_email)

train_acc = accuracy_score(y_train_email, y_pred_train)
test_acc = accuracy_score(y_test_email, y_pred_test)

print("Multinomial Naive Bayes Results:")
print(f"Training Accuracy: {train_acc:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")

# Show most informative words
log_prob_diff = mnb.feature_log_prob[1] - mnb.feature_log_prob[0]
top_spam_indices = np.argsort(log_prob_diff)[-5:]
top_ham_indices = np.argsort(log_prob_diff)[:5]

print("\nMost indicative of SPAM:")
for idx in top_spam_indices:
    print(f"  {vocab[idx]}")

print("\nMost indicative of HAM:")
for idx in top_ham_indices:
    print(f"  {vocab[idx]}")

4.6 Laplace Smoothing

Problem: If a word never appears in training data for class \(y\), then \(P(word|y) = 0\), which makes \(P(y|x) = 0\) for any document containing that word!

Solution: Laplace (add-α) smoothing:

\[\phi_{i|y} = \frac{\text{count}(i, y) + \alpha}{\sum_j \text{count}(j, y) + \alpha n}\]

Common choice: \(\alpha = 1\) (add-one smoothing)
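A quick numeric sketch of the failure and the fix, using toy counts (the counts are made up; imagine a word that never appeared in the ham class during training):

```python
import numpy as np

# Toy word counts for one class: the third word was never seen (count 0)
counts = np.array([10, 5, 0])
n_vocab = len(counts)

# Unsmoothed MLE: the unseen word gets probability exactly 0, so any
# document containing it gets posterior 0 for this class
phi_mle = counts / counts.sum()

# Laplace smoothing with α = 1: (count(i) + α) / (total + α·n)
alpha = 1.0
phi_smooth = (counts + alpha) / (counts.sum() + alpha * n_vocab)

print(phi_mle)     # [0.667 0.333 0.   ]  ← hard zero
print(phi_smooth)  # [0.611 0.333 0.056] ← no zeros, still sums to 1
```

Smoothing is equivalent to pretending each vocabulary word was seen α extra times, which shrinks the estimates slightly toward uniform.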

# Compare different smoothing values
alphas = [0.01, 0.1, 1.0, 10.0]
train_accs = []
test_accs = []

for alpha in alphas:
    mnb_temp = MultinomialNaiveBayes(alpha=alpha)
    mnb_temp.fit(X_train_email, y_train_email)
    
    train_acc = accuracy_score(y_train_email, mnb_temp.predict(X_train_email))
    test_acc = accuracy_score(y_test_email, mnb_temp.predict(X_test_email))
    
    train_accs.append(train_acc)
    test_accs.append(test_acc)

# Plot
plt.figure(figsize=(10, 6))
plt.semilogx(alphas, train_accs, 'o-', linewidth=2, markersize=8, 
             label='Training Accuracy')
plt.semilogx(alphas, test_accs, 's-', linewidth=2, markersize=8, 
             label='Test Accuracy')
plt.xlabel('Smoothing Parameter (α)', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Effect of Laplace Smoothing', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nSmoothing Analysis:")
for i, alpha in enumerate(alphas):
    print(f"α = {alpha:6.2f}: Train = {train_accs[i]:.4f}, Test = {test_accs[i]:.4f}")

4.7 Real Dataset: 20 Newsgroups

Let’s apply Naive Bayes to real text classification!

# Load subset of 20 newsgroups (4 categories for speed)
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

print("Loading 20 Newsgroups dataset...")
newsgroups_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),
    random_state=42
)

newsgroups_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),
    random_state=42
)

print(f"\nDataset loaded:")
print(f"  Categories: {categories}")
print(f"  Training documents: {len(newsgroups_train.data)}")
print(f"  Test documents: {len(newsgroups_test.data)}")

# Vectorize text
print("\nVectorizing text...")
vectorizer = CountVectorizer(max_features=5000, stop_words='english')
X_train_news = vectorizer.fit_transform(newsgroups_train.data)
X_test_news = vectorizer.transform(newsgroups_test.data)
y_train_news = newsgroups_train.target
y_test_news = newsgroups_test.target

print(f"  Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"  Feature matrix shape: {X_train_news.shape}")
# Train Multinomial Naive Bayes
print("Training Multinomial Naive Bayes...")
nb_sklearn = MultinomialNB(alpha=1.0)
nb_sklearn.fit(X_train_news, y_train_news)

# Predictions
y_pred_train_news = nb_sklearn.predict(X_train_news)
y_pred_test_news = nb_sklearn.predict(X_test_news)

train_acc_news = accuracy_score(y_train_news, y_pred_train_news)
test_acc_news = accuracy_score(y_test_news, y_pred_test_news)

print("\n" + "="*60)
print("20 NEWSGROUPS CLASSIFICATION RESULTS")
print("="*60)
print(f"Training Accuracy: {train_acc_news:.4f}")
print(f"Test Accuracy: {test_acc_news:.4f}")

# Classification report
print("\nClassification Report:")
print(classification_report(y_test_news, y_pred_test_news, 
                          target_names=categories))

# Confusion matrix
cm = confusion_matrix(y_test_news, y_pred_test_news)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=categories,
            yticklabels=categories)
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.title('Confusion Matrix - 20 Newsgroups Classification', 
          fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
# Show most informative words for each category
feature_names = vectorizer.get_feature_names_out()
n_top_words = 10

print("\nMost Informative Words per Category:")
print("="*60)

for i, category in enumerate(categories):
    # Get log probabilities for this class
    log_prob = nb_sklearn.feature_log_prob_[i]
    top_indices = np.argsort(log_prob)[-n_top_words:]
    top_words = [feature_names[idx] for idx in top_indices[::-1]]
    
    print(f"\n{category}:")
    print(f"  {', '.join(top_words)}")

4.8 Event Models: Multinomial vs Bernoulli

Multinomial: Counts matter

  • \(P(x|y)\) based on word counts

  • Good for longer documents

Bernoulli: Presence/absence only

  • \(P(x|y)\) based on word presence (binary)

  • Good for short documents

# Compare Multinomial vs Bernoulli Naive Bayes
from sklearn.naive_bayes import BernoulliNB

# Convert to binary (presence/absence)
X_train_binary = (X_train_news > 0).astype(int)
X_test_binary = (X_test_news > 0).astype(int)

# Train Bernoulli Naive Bayes
bnb = BernoulliNB(alpha=1.0)
bnb.fit(X_train_binary, y_train_news)

# Predictions
y_pred_bnb = bnb.predict(X_test_binary)
acc_bnb = accuracy_score(y_test_news, y_pred_bnb)

print("Model Comparison on 20 Newsgroups:")
print(f"  Multinomial NB: {test_acc_news:.4f}")
print(f"  Bernoulli NB:   {acc_bnb:.4f}")
print("\nFor longer documents, Multinomial often performs better!")

Key Takeaways

1. Discriminative vs Generative

  • Discriminative: Model \(P(y|x)\) directly (e.g., logistic regression)

  • Generative: Model \(P(x|y)\) and \(P(y)\), use Bayes’ rule (e.g., GDA, Naive Bayes)

  • Trade-off: Generative more sample-efficient but makes stronger assumptions

2. Gaussian Discriminant Analysis

  • Assumes features are Gaussian given class

  • Learns \(\mu_0, \mu_1, \Sigma\) from data

  • If GDA's assumptions hold, \(P(y=1|x)\) takes exactly the logistic form (the converse does not hold)

  • More efficient than logistic regression with Gaussian data

3. Naive Bayes

  • Assumes conditional independence: \(P(x|y) = \prod_i P(x_i|y)\)

  • “Naive” but works well in practice!

  • Very fast to train and predict

  • Excellent for text classification

4. Laplace Smoothing

  • Prevents zero probabilities for unseen features

  • Add-one smoothing (\(\alpha=1\)) is common default

  • Can tune \(\alpha\) using cross-validation

5. Event Models

  • Multinomial: Use when counts matter (longer docs)

  • Bernoulli: Use for presence/absence (shorter docs, binary features)

  • Gaussian: Use for continuous features

6. When to Use Generative Models

  • Small training sets (more sample-efficient)

  • Need to generate samples

  • Have strong domain knowledge (can specify good model)

  • Fast inference required

7. Practical Considerations

  • Naive Bayes excellent baseline for text

  • Very fast: \(O(nd)\) training, \(O(d)\) prediction

  • Works well even when independence assumption violated

  • Feature selection via \(P(\text{feature} \mid \text{class})\) ratios

Practice Exercises

  1. GDA Derivation: Derive the maximum likelihood estimates for GDA parameters. Show that maximizing log-likelihood gives the formulas presented.

  2. QDA: Implement Quadratic Discriminant Analysis (allow different \(\Sigma\) for each class). Compare with GDA on synthetic data.

  3. Non-Gaussian Data: Generate data from non-Gaussian distributions. Compare GDA vs Logistic Regression. When does each fail?

  4. Naive Bayes from Scratch: Implement Gaussian Naive Bayes from scratch. Test on iris dataset. Compare with sklearn.

  5. Feature Independence: Create dataset with correlated features. Train Naive Bayes. Does it still work? Why?

  6. Spam Filter: Build complete spam filter using Naive Bayes. Implement:

    • Text preprocessing

    • Feature extraction

    • Model training

    • Performance evaluation

  7. Smoothing Study: Systematically vary α from 0.001 to 100. Plot test accuracy. Find optimal value using cross-validation.

  8. Multinomial vs Bernoulli: Create datasets where each performs better. When should you use each?

  9. Semi-Supervised Learning: Use Naive Bayes with EM algorithm to leverage unlabeled data. Compare with supervised-only.

  10. Feature Ranking: Implement feature selection for Naive Bayes using mutual information or chi-square test.

References

  1. CS229 Lecture Notes: Andrew Ng’s notes on generative learning

  2. Pattern Recognition and Machine Learning: Bishop, Chapter 4

  3. The Elements of Statistical Learning: Hastie et al., Chapter 4.3

  4. “Naive Bayes at Forty”: Lewis (1998)

  5. “On Discriminative vs. Generative classifiers”: Ng & Jordan (2002)

Next: Lecture 7: Support Vector Machines