Lecture 5 & 6: Generative Learning Algorithms¶
📹 Watch Lecture 5 | Watch Lecture 6
From Andrew Ng’s CS229 Lectures 5-6¶
Discriminative vs Generative Learning¶
From Lecture 5:
“All of the learning algorithms we’ve been learning about so far are called discriminative learning algorithms. Today, I’d like to share with you how generative learning algorithms work.” - Andrew Ng
Discriminative Algorithms (e.g., Logistic Regression):
“Look at two classes and try to find the separation”
Search for a decision boundary that separates positive and negative examples
Directly model P(y|x) or decision boundary
Example: “Use gradient descent to search for a line that separates the positive-negative examples”
Generative Algorithms (e.g., GDA, Naive Bayes):
“Rather than looking at two classes and trying to find the separation, instead the algorithm looks at the classes one at a time”
First: “Look at all malignant tumors and try to build a model for what malignant tumors look like”
Second: “Look at all benign tumors and try to build a model for what benign tumors look like”
Then: Use Bayes’ rule to classify new examples
Model P(x|y) and P(y), then compute P(y|x)
Bayes’ Rule Framework¶
From the lecture:
$$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}$$
Where:
\(P(x|y)\) = “What do features look like given the class?” (generative model)
\(P(y)\) = Class prior (before seeing features)
\(P(x)\) = Evidence (a normalizing constant, identical for every class, so it can be ignored when classifying)
Key Insight: “If you learn P(x|y) and P(y), you can plug them into Bayes’ rule to calculate P(y=1|x)”
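As a quick numerical illustration (made-up numbers, not from the lecture): if the malignant model gives \(P(x|y=1) = 0.2\), the benign model gives \(P(x|y=0) = 0.05\), and the prior is \(P(y=1) = 0.3\), then
$$P(y=1 \mid x) = \frac{0.2 \times 0.3}{0.2 \times 0.3 + 0.05 \times 0.7} = \frac{0.06}{0.095} \approx 0.63,$$
so this example would be classified as malignant.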
Two Main Algorithms Today¶
Gaussian Discriminant Analysis (GDA)
For continuous-valued features
“Used for things like tumor classification”
Assumes features are Gaussian distributed
Naive Bayes
For discrete features
“Used for building email spam classifiers”
“Twitter sentiment analysis or something”
Setup¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import multivariate_normal
from sklearn.datasets import make_classification, fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import warnings
warnings.filterwarnings('ignore')
# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)
print("Libraries loaded successfully!")
4.1 The Multivariate Gaussian Distribution¶
From Lecture 5:¶
“How many of you are familiar with the multivariate Gaussian? Like half of you? The Gaussian is this familiar bell-shaped curve. A multivariate Gaussian is the generalization of this familiar bell-shaped curve to vector-valued random variables.” - Andrew Ng
Definition: If \(z \sim \mathcal{N}(\mu, \Sigma)\), then its density is:
$$p(z) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(z-\mu)^T \Sigma^{-1} (z-\mu)\right)$$
Where:
\(z \in \mathbb{R}^n\) is the random vector
\(\mu \in \mathbb{R}^n\) is the mean vector: \(\mathbb{E}[z] = \mu\)
\(\Sigma \in \mathbb{R}^{n \times n}\) is the covariance matrix: \(\text{Cov}(z) = \mathbb{E}[(z-\mu)(z-\mu)^T] = \Sigma\)
From the lecture: “This is one of those formulas that almost no one memorizes when starting machine learning. Just look it up every time you need it!”
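Since the formula is meant to be looked up rather than memorized, here is a quick sanity check (a minimal sketch, not from the lecture) comparing a hand-coded evaluation against scipy.stats.multivariate_normal:
# Sanity check: evaluate the multivariate Gaussian density by hand and via SciPy
mu_check = np.array([0.0, 0.0])
Sigma_check = np.array([[1.0, 0.8], [0.8, 1.0]])
z_check = np.array([0.5, -0.3])
n_dims = len(mu_check)
diff = z_check - mu_check
manual_pdf = np.exp(-0.5 * diff @ np.linalg.inv(Sigma_check) @ diff) / \
    np.sqrt((2 * np.pi) ** n_dims * np.linalg.det(Sigma_check))
scipy_pdf = multivariate_normal(mu_check, Sigma_check).pdf(z_check)
print(f"manual: {manual_pdf:.6f}, scipy: {scipy_pdf:.6f}")  # the two values should match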
Understanding the Covariance Matrix¶
Diagonal elements: Control variance in each dimension
Large values → wide spread
Small values → narrow spread
Off-diagonal elements: Control correlation between dimensions
Zero → uncorrelated
Positive → positive correlation
Negative → negative correlation
From lecture visualization examples:
from scipy.stats import multivariate_normal
print("📊 Multivariate Gaussian Distribution Visualization")
print("="*70)
print("From CS229 Lecture 5: Understanding μ and Σ parameters\n")
# Create grid for visualization
x1 = np.linspace(-5, 5, 100)
x2 = np.linspace(-5, 5, 100)
X1, X2 = np.meshgrid(x1, x2)
pos = np.dstack((X1, X2))
# Different covariance matrices from lecture examples
covariances = [
("Standard Gaussian\nΣ = I", np.array([[1, 0], [0, 1]])),
("Smaller variance\nΣ = 0.6I", np.array([[0.6, 0], [0, 0.6]])),
("Larger variance\nΣ = 2I", np.array([[2, 0], [0, 2]])),
("Positive correlation\nΣ = [[1, 0.8], [0.8, 1]]", np.array([[1, 0.8], [0.8, 1]])),
("Negative correlation\nΣ = [[1, -0.8], [-0.8, 1]]", np.array([[1, -0.8], [-0.8, 1]])),
("Different variances\nΣ = [[2, 0], [0, 0.5]]", np.array([[2, 0], [0, 0.5]]))
]
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()
mu = np.array([0, 0]) # Mean at origin
for idx, (title, sigma) in enumerate(covariances):
# Create multivariate Gaussian
rv = multivariate_normal(mu, sigma)
Z = rv.pdf(pos)
# Plot contours
axes[idx].contour(X1, X2, Z, levels=8, cmap='viridis', alpha=0.6)
axes[idx].contourf(X1, X2, Z, levels=20, cmap='viridis', alpha=0.3)
# Mark center
axes[idx].plot(0, 0, 'r*', markersize=15, label='μ = [0, 0]')
axes[idx].set_xlabel('z₁', fontsize=11)
axes[idx].set_ylabel('z₂', fontsize=11)
axes[idx].set_title(title, fontsize=12, fontweight='bold')
axes[idx].grid(True, alpha=0.3)
axes[idx].set_xlim([-5, 5])
axes[idx].set_ylim([-5, 5])
axes[idx].set_aspect('equal')
plt.suptitle('Effect of Covariance Matrix Σ on Gaussian Distribution',
fontsize=14, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()
print("\n💡 Key Observations from Andrew Ng's lecture:")
print(" 1. Identity covariance → Circular/round contours")
print(" 2. Larger diagonal → Wider spread")
print(" 3. Smaller diagonal → Narrower, taller distribution")
print(" 4. Positive off-diagonal → Ellipse tilted positive")
print(" 5. Negative off-diagonal → Ellipse tilted negative")
print(" 6. Different diagonal values → Ellipse stretched in one direction")
4.2 Gaussian Discriminant Analysis (GDA)¶
Model:
\(y \sim \text{Bernoulli}(\phi)\)
\(x|y=0 \sim \mathcal{N}(\mu_0, \Sigma)\)
\(x|y=1 \sim \mathcal{N}(\mu_1, \Sigma)\)
Parameters to learn: \(\phi, \mu_0, \mu_1, \Sigma\)
Maximum Likelihood Estimates:
$$\phi = \frac{1}{m}\sum_{i=1}^m \mathbb{1}\{y^{(i)} = 1\}$$
$$\mu_0 = \frac{\sum_{i=1}^m \mathbb{1}\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^m \mathbb{1}\{y^{(i)} = 0\}}$$
$$\mu_1 = \frac{\sum_{i=1}^m \mathbb{1}\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^m \mathbb{1}\{y^{(i)} = 1\}}$$
$$\Sigma = \frac{1}{m}\sum_{i=1}^m (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$$
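The training code below uses a dataset X, y that is not created in this section (it presumably comes from an earlier cell). If you run this section on its own, the following sketch — an assumption, not part of the lecture — draws a two-class 2D dataset from two Gaussians with a shared covariance, matching the GDA model:
# Assumption: define X, y only if an earlier cell has not already done so
if 'X' not in globals():
    rng = np.random.default_rng(42)
    cov = np.array([[1.0, 0.3], [0.3, 1.0]])                     # shared covariance, as GDA assumes
    X0 = rng.multivariate_normal([-1.5, -1.0], cov, size=300)    # class 0 samples
    X1 = rng.multivariate_normal([1.5, 1.0], cov, size=300)      # class 1 samples
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(300, dtype=int), np.ones(300, dtype=int)])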
class GDA:
"""
Gaussian Discriminant Analysis classifier.
"""
def __init__(self):
self.phi = None
self.mu0 = None
self.mu1 = None
self.sigma = None
def fit(self, X, y):
"""
Fit GDA model using maximum likelihood.
"""
m, n = X.shape
# Separate data by class
X0 = X[y == 0]
X1 = X[y == 1]
# Estimate parameters
self.phi = np.mean(y) # P(y=1)
self.mu0 = X0.mean(axis=0)
self.mu1 = X1.mean(axis=0)
# Shared covariance
self.sigma = np.zeros((n, n))
for i in range(m):
mu = self.mu1 if y[i] == 1 else self.mu0
diff = (X[i] - mu).reshape(-1, 1)
self.sigma += diff @ diff.T
self.sigma /= m
return self
def predict_proba(self, X):
"""
Compute P(y=1|x) using Bayes' rule.
"""
# P(x|y=0) and P(x|y=1)
rv0 = multivariate_normal(self.mu0, self.sigma)
rv1 = multivariate_normal(self.mu1, self.sigma)
p_x_y0 = rv0.pdf(X)
p_x_y1 = rv1.pdf(X)
# P(y=0) and P(y=1)
p_y0 = 1 - self.phi
p_y1 = self.phi
# Bayes' rule: P(y=1|x) = P(x|y=1)P(y=1) / P(x)
numerator = p_x_y1 * p_y1
denominator = p_x_y0 * p_y0 + p_x_y1 * p_y1
return numerator / denominator
def predict(self, X):
"""
Predict class labels.
"""
proba = self.predict_proba(X)
return (proba >= 0.5).astype(int)
# Train GDA on our data
gda = GDA()
gda.fit(X, y)
# Make predictions
y_pred_gda = gda.predict(X)
accuracy_gda = accuracy_score(y, y_pred_gda)
print("GDA Model Parameters:")
print(f"φ (P(y=1)): {gda.phi:.4f}")
print(f"μ₀: {gda.mu0}")
print(f"μ₁: {gda.mu1}")
print(f"Σ shape: {gda.sigma.shape}")
print(f"\nTraining Accuracy: {accuracy_gda:.4f}")
4.3 GDA vs Logistic Regression¶
Key insight: If the GDA modeling assumptions hold (the class-conditional densities really are Gaussian with a shared covariance), GDA is more sample-efficient: it reaches a given accuracy with fewer training examples. Logistic regression makes weaker assumptions, so it is more robust when those assumptions are violated.
from sklearn.linear_model import LogisticRegression
# Compare GDA with Logistic Regression
# Vary training set size
train_sizes = [20, 50, 100, 200, 400]
gda_scores = []
lr_scores = []
for size in train_sizes:
# Sample subset
indices = np.random.choice(len(X), size=size, replace=False)
X_train_sub = X[indices]
y_train_sub = y[indices]
# Create test set (rest of data)
test_indices = np.setdiff1d(np.arange(len(X)), indices)
X_test = X[test_indices]
y_test = y[test_indices]
# Train GDA
gda_temp = GDA()
gda_temp.fit(X_train_sub, y_train_sub)
gda_acc = accuracy_score(y_test, gda_temp.predict(X_test))
gda_scores.append(gda_acc)
# Train Logistic Regression
lr = LogisticRegression()
lr.fit(X_train_sub, y_train_sub)
lr_acc = accuracy_score(y_test, lr.predict(X_test))
lr_scores.append(lr_acc)
# Plot comparison
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, gda_scores, 'o-', linewidth=2, markersize=8,
label='GDA', color='blue')
plt.plot(train_sizes, lr_scores, 's-', linewidth=2, markersize=8,
label='Logistic Regression', color='red')
plt.xlabel('Training Set Size', fontsize=12)
plt.ylabel('Test Accuracy', fontsize=12)
plt.title('GDA vs Logistic Regression: Sample Efficiency',
fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nKey Observations:")
print("1. With Gaussian data, GDA often better with small samples")
print("2. Both converge to similar performance with lots of data")
print("3. Logistic regression more robust to non-Gaussian data")
4.4 Naive Bayes Classifier¶
Naive Bayes assumption: Features are conditionally independent given the class:
$$P(x_1, \ldots, x_n \mid y) = \prod_{i=1}^n P(x_i \mid y)$$
This is “naive” because features are rarely truly independent, but it works surprisingly well in practice!
Classification rule:
$$\hat{y} = \arg\max_y \; P(y) \prod_{i=1}^n P(x_i \mid y)$$
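One practical note (an implementation detail, not a lecture quote): multiplying many small probabilities underflows in floating point, which is why the implementation in section 4.5 below sums log-probabilities instead. A tiny demonstration:
# Why Naive Bayes implementations work in log space
probs = np.full(1000, 1e-4)      # 1000 features, each with P(x_i|y) = 1e-4
print(np.prod(probs))            # 0.0 -- the product underflows float64
print(np.sum(np.log(probs)))     # about -9210.3 -- finite and still comparable across classes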
# Simple example: Spam detection with word features
# Create synthetic email dataset
np.random.seed(42)
# Vocabulary
spam_words = ['free', 'money', 'winner', 'click', 'prize']
ham_words = ['meeting', 'project', 'report', 'schedule', 'team']
common_words = ['the', 'is', 'and', 'to', 'of']
vocab = spam_words + ham_words + common_words
n_vocab = len(vocab)
# Generate emails
n_emails = 1000
n_spam = 400
n_ham = 600
def generate_email(is_spam, n_words=20):
"""Generate synthetic email as word counts."""
counts = np.zeros(n_vocab)
for _ in range(n_words):
if is_spam:
# Spam: more likely to use spam words
word_probs = [0.3] * 5 + [0.05] * 5 + [0.1] * 5
else:
# Ham: more likely to use ham words
word_probs = [0.05] * 5 + [0.3] * 5 + [0.1] * 5
word_probs = np.array(word_probs)
word_probs /= word_probs.sum()
word_idx = np.random.choice(n_vocab, p=word_probs)
counts[word_idx] += 1
return counts
# Generate dataset
X_emails = []
y_emails = []
for _ in range(n_spam):
X_emails.append(generate_email(is_spam=True))
y_emails.append(1)
for _ in range(n_ham):
X_emails.append(generate_email(is_spam=False))
y_emails.append(0)
X_emails = np.array(X_emails)
y_emails = np.array(y_emails)
# Split data
X_train_email, X_test_email, y_train_email, y_test_email = train_test_split(
X_emails, y_emails, test_size=0.2, random_state=42, stratify=y_emails
)
print(f"Email Dataset:")
print(f" Vocabulary size: {n_vocab} words")
print(f" Training emails: {len(X_train_email)}")
print(f" Test emails: {len(X_test_email)}")
print(f" Spam ratio: {y_emails.mean():.2%}")
4.5 Multinomial Naive Bayes¶
For text/count data we use the multinomial event model: each word occurrence is a draw from a class-conditional distribution over the vocabulary, so for a document with word-count vector \(x\):
$$P(x \mid y) \propto \prod_{i=1}^n \phi_{i|y}^{\,x_i}$$
Maximum Likelihood Estimates (with Laplace smoothing):
$$\phi_{i|y=1} = \frac{\sum_{j:y^{(j)}=1} x_i^{(j)} + \alpha}{\sum_{j:y^{(j)}=1} \sum_{k} x_k^{(j)} + \alpha n}$$
class MultinomialNaiveBayes:
"""
Multinomial Naive Bayes for count data (e.g., text).
"""
def __init__(self, alpha=1.0):
self.alpha = alpha # Laplace smoothing parameter
self.class_log_prior = None
self.feature_log_prob = None
self.classes = None
def fit(self, X, y):
"""
Fit Multinomial Naive Bayes.
"""
self.classes = np.unique(y)
n_classes = len(self.classes)
n_features = X.shape[1]
# Class priors: P(y)
self.class_log_prior = np.zeros(n_classes)
for i, c in enumerate(self.classes):
self.class_log_prior[i] = np.log(np.mean(y == c))
# Feature probabilities: P(x_i|y) with Laplace smoothing
self.feature_log_prob = np.zeros((n_classes, n_features))
for i, c in enumerate(self.classes):
X_c = X[y == c]
# Count of each word in class c
word_counts = X_c.sum(axis=0) + self.alpha
# Total words in class c
total_count = word_counts.sum()
# Log probability
self.feature_log_prob[i] = np.log(word_counts / total_count)
return self
def predict_log_proba(self, X):
"""
Compute log P(y|x) for each class.
"""
# log P(y|x) ∝ log P(y) + Σ x_i log P(x_i|y)
log_proba = X @ self.feature_log_prob.T + self.class_log_prior
return log_proba
def predict(self, X):
"""
Predict class with highest posterior probability.
"""
log_proba = self.predict_log_proba(X)
return self.classes[np.argmax(log_proba, axis=1)]
# Train Multinomial Naive Bayes
mnb = MultinomialNaiveBayes(alpha=1.0)
mnb.fit(X_train_email, y_train_email)
# Predictions
y_pred_train = mnb.predict(X_train_email)
y_pred_test = mnb.predict(X_test_email)
train_acc = accuracy_score(y_train_email, y_pred_train)
test_acc = accuracy_score(y_test_email, y_pred_test)
print("Multinomial Naive Bayes Results:")
print(f"Training Accuracy: {train_acc:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")
# Show most informative words
log_prob_diff = mnb.feature_log_prob[1] - mnb.feature_log_prob[0]
top_spam_indices = np.argsort(log_prob_diff)[-5:]
top_ham_indices = np.argsort(log_prob_diff)[:5]
print("\nMost indicative of SPAM:")
for idx in top_spam_indices:
print(f" {vocab[idx]}")
print("\nMost indicative of HAM:")
for idx in top_ham_indices:
print(f" {vocab[idx]}")
4.6 Laplace Smoothing¶
Problem: If a word never appears in training data for class \(y\), then \(P(word|y) = 0\), which makes \(P(y|x) = 0\) for any document containing that word!
Solution: Laplace (add-\(\alpha\)) smoothing:
$$\phi_{i|y} = \frac{\text{count}(i, y) + \alpha}{\sum_j \text{count}(j, y) + \alpha n}$$
Common choice: \(\alpha = 1\) (add-one smoothing)
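For example (made-up counts): if “prize” appears 0 times among 985 total word occurrences in the ham training emails and the vocabulary has \(n = 15\) words, add-one smoothing gives
$$\phi_{\text{prize}|y=0} = \frac{0 + 1}{985 + 1 \cdot 15} = \frac{1}{1000} = 0.001$$
instead of 0, so a single occurrence of “prize” no longer forces \(P(y=0|x)\) to zero.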
# Compare different smoothing values
alphas = [0.01, 0.1, 1.0, 10.0]
train_accs = []
test_accs = []
for alpha in alphas:
mnb_temp = MultinomialNaiveBayes(alpha=alpha)
mnb_temp.fit(X_train_email, y_train_email)
train_acc = accuracy_score(y_train_email, mnb_temp.predict(X_train_email))
test_acc = accuracy_score(y_test_email, mnb_temp.predict(X_test_email))
train_accs.append(train_acc)
test_accs.append(test_acc)
# Plot
plt.figure(figsize=(10, 6))
plt.semilogx(alphas, train_accs, 'o-', linewidth=2, markersize=8,
label='Training Accuracy')
plt.semilogx(alphas, test_accs, 's-', linewidth=2, markersize=8,
label='Test Accuracy')
plt.xlabel('Smoothing Parameter (α)', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Effect of Laplace Smoothing', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nSmoothing Analysis:")
for i, alpha in enumerate(alphas):
print(f"α = {alpha:6.2f}: Train = {train_accs[i]:.4f}, Test = {test_accs[i]:.4f}")
4.7 Real Dataset: 20 Newsgroups¶
Let’s apply Naive Bayes to real text classification!
# Load subset of 20 newsgroups (4 categories for speed)
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
print("Loading 20 Newsgroups dataset...")
newsgroups_train = fetch_20newsgroups(
subset='train',
categories=categories,
remove=('headers', 'footers', 'quotes'),
random_state=42
)
newsgroups_test = fetch_20newsgroups(
subset='test',
categories=categories,
remove=('headers', 'footers', 'quotes'),
random_state=42
)
print(f"\nDataset loaded:")
print(f" Categories: {categories}")
print(f" Training documents: {len(newsgroups_train.data)}")
print(f" Test documents: {len(newsgroups_test.data)}")
# Vectorize text
print("\nVectorizing text...")
vectorizer = CountVectorizer(max_features=5000, stop_words='english')
X_train_news = vectorizer.fit_transform(newsgroups_train.data)
X_test_news = vectorizer.transform(newsgroups_test.data)
y_train_news = newsgroups_train.target
y_test_news = newsgroups_test.target
print(f" Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f" Feature matrix shape: {X_train_news.shape}")
# Train Multinomial Naive Bayes
print("Training Multinomial Naive Bayes...")
nb_sklearn = MultinomialNB(alpha=1.0)
nb_sklearn.fit(X_train_news, y_train_news)
# Predictions
y_pred_train_news = nb_sklearn.predict(X_train_news)
y_pred_test_news = nb_sklearn.predict(X_test_news)
train_acc_news = accuracy_score(y_train_news, y_pred_train_news)
test_acc_news = accuracy_score(y_test_news, y_pred_test_news)
print("\n" + "="*60)
print("20 NEWSGROUPS CLASSIFICATION RESULTS")
print("="*60)
print(f"Training Accuracy: {train_acc_news:.4f}")
print(f"Test Accuracy: {test_acc_news:.4f}")
# Classification report
print("\nClassification Report:")
print(classification_report(y_test_news, y_pred_test_news,
target_names=categories))
# Confusion matrix
cm = confusion_matrix(y_test_news, y_pred_test_news)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=categories,
yticklabels=categories)
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.title('Confusion Matrix - 20 Newsgroups Classification',
fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
# Show most informative words for each category
feature_names = vectorizer.get_feature_names_out()
n_top_words = 10
print("\nMost Informative Words per Category:")
print("="*60)
for i, category in enumerate(categories):
# Get log probabilities for this class
log_prob = nb_sklearn.feature_log_prob_[i]
top_indices = np.argsort(log_prob)[-n_top_words:]
top_words = [feature_names[idx] for idx in top_indices[::-1]]
print(f"\n{category}:")
print(f" {', '.join(top_words)}")
4.8 Event Models: Multinomial vs Bernoulli¶
Multinomial: Counts matter
\(P(x|y)\) based on word counts
Good for longer documents
Bernoulli: Presence/absence only
\(P(x|y)\) based on word presence (binary)
Good for short documents
# Compare Multinomial vs Bernoulli Naive Bayes
from sklearn.naive_bayes import BernoulliNB
# Convert to binary (presence/absence)
X_train_binary = (X_train_news > 0).astype(int)
X_test_binary = (X_test_news > 0).astype(int)
# Train Bernoulli Naive Bayes
bnb = BernoulliNB(alpha=1.0)
bnb.fit(X_train_binary, y_train_news)
# Predictions
y_pred_bnb = bnb.predict(X_test_binary)
acc_bnb = accuracy_score(y_test_news, y_pred_bnb)
print("Model Comparison on 20 Newsgroups:")
print(f" Multinomial NB: {test_acc_news:.4f}")
print(f" Bernoulli NB: {acc_bnb:.4f}")
print("\nFor longer documents, Multinomial often performs better!")
Key Takeaways¶
1. Discriminative vs Generative¶
Discriminative: Model \(P(y|x)\) directly (e.g., logistic regression)
Generative: Model \(P(x|y)\) and \(P(y)\), use Bayes’ rule (e.g., GDA, Naive Bayes)
Trade-off: Generative more sample-efficient but makes stronger assumptions
2. Gaussian Discriminant Analysis¶
Assumes features are Gaussian given class
Learns \(\phi, \mu_0, \mu_1, \Sigma\) by maximum likelihood
Implies a logistic form for \(P(y=1|x)\); the converse does not hold, so logistic regression makes strictly weaker assumptions
More sample-efficient than logistic regression when the data really is Gaussian
3. Naive Bayes¶
Assumes conditional independence: \(P(x|y) = \prod_i P(x_i|y)\)
“Naive” but works well in practice!
Very fast to train and predict
Excellent for text classification
4. Laplace Smoothing¶
Prevents zero probabilities for unseen features
Add-one smoothing (\(\alpha=1\)) is common default
Can tune \(\alpha\) using cross-validation (see the sketch below)
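A minimal sketch of tuning \(\alpha\) by cross-validation (assuming the 20 Newsgroups matrices X_train_news, y_train_news from section 4.7 are still in memory):
# Sketch: pick α for sklearn's MultinomialNB by 5-fold cross-validation
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(MultinomialNB(), param_grid={'alpha': [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train_news, y_train_news)
print(grid.best_params_, grid.best_score_)   # best α and its mean CV accuracy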
5. Event Models¶
Multinomial: Use when counts matter (longer docs)
Bernoulli: Use for presence/absence (shorter docs, binary features)
Gaussian: Use for continuous features (see the short sklearn sketch below)
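A quick sketch (toy data, an assumption rather than lecture material) matching each sklearn Naive Bayes variant to its feature type:
# Toy sketch: the three sklearn event models on matching feature types
rng = np.random.default_rng(0)
y_toy = np.array([0, 0, 1, 1])
X_cont = rng.normal(loc=y_toy[:, None], scale=0.5, size=(4, 3))   # continuous features
X_counts = rng.poisson(lam=3, size=(4, 3))                        # word counts
X_binary = (X_counts > 2).astype(int)                             # presence/absence
print(GaussianNB().fit(X_cont, y_toy).predict(X_cont))            # Gaussian NB
print(MultinomialNB().fit(X_counts, y_toy).predict(X_counts))     # Multinomial NB
print(BernoulliNB().fit(X_binary, y_toy).predict(X_binary))       # Bernoulli NB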
6. When to Use Generative Models¶
Small training sets (more sample-efficient)
Need to generate samples
Have strong domain knowledge (can specify good model)
Fast inference required
7. Practical Considerations¶
Naive Bayes excellent baseline for text
Very fast: training is a single counting pass over the data, roughly \(O(m \cdot n)\) for \(m\) documents and \(n\) features; prediction is \(O(n)\) per class
Works well even when independence assumption violated
Feature selection via \(P(feature|class)\) ratios
Practice Exercises¶
GDA Derivation: Derive the maximum likelihood estimates for GDA parameters. Show that maximizing log-likelihood gives the formulas presented.
QDA: Implement Quadratic Discriminant Analysis (allow different \(\Sigma\) for each class). Compare with GDA on synthetic data.
Non-Gaussian Data: Generate data from non-Gaussian distributions. Compare GDA vs Logistic Regression. When does each fail?
Naive Bayes from Scratch: Implement Gaussian Naive Bayes from scratch. Test on iris dataset. Compare with sklearn.
Feature Independence: Create dataset with correlated features. Train Naive Bayes. Does it still work? Why?
Spam Filter: Build complete spam filter using Naive Bayes. Implement:
Text preprocessing
Feature extraction
Model training
Performance evaluation
Smoothing Study: Systematically vary α from 0.001 to 100. Plot test accuracy. Find optimal value using cross-validation.
Multinomial vs Bernoulli: Create datasets where each performs better. When should you use each?
Semi-Supervised Learning: Use Naive Bayes with EM algorithm to leverage unlabeled data. Compare with supervised-only.
Feature Ranking: Implement feature selection for Naive Bayes using mutual information or chi-square test.
References¶
CS229 Lecture Notes: Andrew Ng’s notes on generative learning
Pattern Recognition and Machine Learning: Bishop, Chapter 4
The Elements of Statistical Learning: Hastie et al., Chapter 4.3
“Naive Bayes at Forty”: Lewis (1998)
“On Discriminative vs. Generative classifiers”: Ng & Jordan (2002)