import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.datasets import make_classification, make_blobs, fetch_20newsgroups
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Global plotting and reproducibility setup for the whole notebook.
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)  # default figure size for all plots
np.random.seed(42)  # fix NumPy's global RNG so data generation is repeatable
1. Naive Bayes Classifier
Theory
The Naive Bayes classifier assumes features are conditionally independent given the class:
Classification rule: \(\hat{y} = \arg\max_c \left[ \log p(y=c) + \sum_{j=1}^{D} \log p(x_j \mid y=c) \right]\)
1.1 Gaussian Naive Bayes
class GaussianNaiveBayes:
    """Gaussian Naive Bayes classifier implemented from scratch.

    Each feature is modelled as an independent 1-D Gaussian per class.
    All prediction work happens directly in the log domain: the original
    version computed ``log(exp(...))`` via ``_gaussian_pdf``, which
    underflows to 0 (and then ``log(0) -> -inf`` with a runtime warning)
    for points far from a class mean. Computing the log-density
    analytically avoids that entirely.
    """

    def __init__(self):
        self.classes = None  # sorted unique class labels (set in fit)
        self.priors = None   # shape (n_classes,): empirical p(y=c)
        self.means = None    # shape (n_classes, n_features)
        self.vars = None     # shape (n_classes, n_features)

    def fit(self, X, y):
        """Estimate per-class priors, means and variances (MLE).

        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
        y : ndarray of shape (n_samples,)
        """
        n_samples, n_features = X.shape
        self.classes = np.unique(y)
        n_classes = len(self.classes)

        self.means = np.zeros((n_classes, n_features))
        self.vars = np.zeros((n_classes, n_features))
        self.priors = np.zeros(n_classes)

        for idx, c in enumerate(self.classes):
            X_c = X[y == c]
            self.means[idx, :] = X_c.mean(axis=0)
            self.vars[idx, :] = X_c.var(axis=0)
            self.priors[idx] = X_c.shape[0] / n_samples

    def _log_gaussian_pdf(self, x, mean, var):
        """Log of the Gaussian pdf, computed analytically for stability.

        log N(x; mean, var) = -0.5*log(2*pi*var) - (x - mean)^2 / (2*var)
        """
        eps = 1e-6  # guards against zero variance
        var = var + eps
        return -0.5 * np.log(2.0 * np.pi * var) - ((x - mean) ** 2) / (2.0 * var)

    def _gaussian_pdf(self, x, mean, var):
        """Gaussian probability density (kept for backward compatibility)."""
        return np.exp(self._log_gaussian_pdf(x, mean, var))

    def predict_log_proba(self, X):
        """Return the unnormalized log posterior for each class.

        Returns
        -------
        ndarray of shape (n_samples, n_classes)
            log p(y=c) + sum_j log p(x_j | y=c) for every sample/class pair.
        """
        n_samples = X.shape[0]
        n_classes = len(self.classes)
        log_probs = np.zeros((n_samples, n_classes))
        for idx, c in enumerate(self.classes):
            prior = np.log(self.priors[idx])
            # Sum of per-feature log likelihoods (naive independence assumption).
            likelihood = np.sum(
                self._log_gaussian_pdf(X, self.means[idx, :], self.vars[idx, :]),
                axis=1)
            log_probs[:, idx] = prior + likelihood
        return log_probs

    def predict(self, X):
        """Predict the class label with the highest log posterior."""
        log_probs = self.predict_log_proba(X)
        return self.classes[np.argmax(log_probs, axis=1)]
# Generate a well-separated 2-D synthetic binary classification dataset.
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=1.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the from-scratch implementation.
gnb_custom = GaussianNaiveBayes()
gnb_custom.fit(X_train, y_train)
y_pred_custom = gnb_custom.predict(X_test)

# Train sklearn's reference implementation for comparison.
gnb_sklearn = GaussianNB()
gnb_sklearn.fit(X_train, y_train)
y_pred_sklearn = gnb_sklearn.predict(X_test)

# Both accuracies should agree closely if the custom model is correct.
print("Gaussian Naive Bayes Performance:")
print(f"Custom implementation: {accuracy_score(y_test, y_pred_custom):.4f}")
print(f"Sklearn implementation: {accuracy_score(y_test, y_pred_sklearn):.4f}")
# Visualize decision boundary
def plot_decision_boundary(model, X, y, title):
    """Draw a classifier's decision regions over a 2-D feature space.

    A dense grid covering the data (with a one-unit margin) is classified
    point by point; predictions are rendered as filled contours with the
    actual samples scattered on top of the current matplotlib axes.
    """
    step = 0.02  # grid resolution
    f1_lo, f1_hi = X[:, 0].min() - 1, X[:, 0].max() + 1
    f2_lo, f2_hi = X[:, 1].min() - 1, X[:, 1].max() + 1
    grid_x, grid_y = np.meshgrid(np.arange(f1_lo, f1_hi, step),
                                 np.arange(f2_lo, f2_hi, step))
    grid_points = np.c_[grid_x.ravel(), grid_y.ravel()]
    region = model.predict(grid_points).reshape(grid_x.shape)
    plt.contourf(grid_x, grid_y, region, alpha=0.3, cmap='RdYlBu')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='black', s=50)
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
# Side-by-side comparison of the custom and sklearn decision boundaries.
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plot_decision_boundary(gnb_custom, X_test, y_test, 'Custom Gaussian NB')
plt.subplot(1, 2, 2)
plot_decision_boundary(gnb_sklearn, X_test, y_test, 'Sklearn Gaussian NB')
plt.tight_layout()
plt.show()
1.2 Multinomial Naive Bayes for Text Classification
What: This code trains a Multinomial Naive Bayes classifier on the 20 Newsgroups dataset, a standard NLP benchmark with four topic categories. CountVectorizer converts raw text into word-count feature vectors, and MultinomialNB fits class-conditional word frequency distributions with Laplace smoothing (\(\alpha = 1.0\)).
Why: Multinomial Naive Bayes models each document as a bag of words drawn from a class-specific multinomial distribution: \(p(\mathbf{x}|y=c) = \frac{n!}{\prod_j x_j!} \prod_j \theta_{cj}^{x_j}\), where \(\theta_{cj}\) is the probability of word \(j\) in class \(c\). Despite the independence assumption ignoring word order and co-occurrence, this model remains a strong baseline for text classification because the high dimensionality of text features makes independence violations less harmful.
How: The Laplace smoothing parameter \(\alpha\) adds pseudo-counts to prevent zero probabilities for unseen words: \(\hat{\theta}_{cj} = \frac{x_{cj} + \alpha}{N_c + \alpha V}\) where \(V\) is the vocabulary size. The top-word analysis uses feature_log_prob_ to reveal which words most strongly indicate each class, providing interpretable model explanations.
Connection: Multinomial Naive Bayes powers production spam filters, sentiment analyzers, and document routers. Its speed and interpretability make it ideal for rapid prototyping before investing in transformer-based models, and it serves as a calibration baseline in NLP pipelines.
# Text classification example with the 20 Newsgroups dataset.
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories,
                                      remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,
                                     remove=('headers', 'footers', 'quotes'))

# Convert raw documents to bag-of-words count vectors.
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X_train_text = vectorizer.fit_transform(newsgroups_train.data)
X_test_text = vectorizer.transform(newsgroups_test.data)

# Train Multinomial Naive Bayes; alpha is the Laplace smoothing parameter.
mnb = MultinomialNB(alpha=1.0)
mnb.fit(X_train_text, newsgroups_train.target)
y_pred_text = mnb.predict(X_test_text)

print("\nMultinomial Naive Bayes for Text Classification:")
print(f"Accuracy: {accuracy_score(newsgroups_test.target, y_pred_text):.4f}")
print("\nClassification Report:")
print(classification_report(newsgroups_test.target, y_pred_text,
                            target_names=newsgroups_test.target_names))

# Show the highest-log-probability words for each class.
# BUG FIX: rows of feature_log_prob_ follow the fitted class order, i.e.
# the alphabetically sorted target_names -- NOT the order of the
# `categories` list above. Iterating over `categories` mislabeled three
# of the four classes, so iterate target_names instead.
feature_names = vectorizer.get_feature_names_out()
n_top = 10
print("\nTop words per class:")
for i, category in enumerate(newsgroups_train.target_names):
    top_indices = np.argsort(mnb.feature_log_prob_[i])[-n_top:]
    top_features = [feature_names[j] for j in top_indices]
    print(f"\n{category}:")
    print(", ".join(top_features))
2. Gaussian Discriminant Analysis (GDA)
Theory
GDA models each class as a multivariate Gaussian:
Linear Discriminant Analysis (LDA): Shared covariance \(\boldsymbol{\Sigma}_c = \boldsymbol{\Sigma}\)
Quadratic Discriminant Analysis (QDA): Class-specific covariance \(\boldsymbol{\Sigma}_c\)
# Generate two Gaussian classes with different covariance structures.
n_samples = 300

# Class 0: isotropic unit covariance centered at the origin.
mean0 = [0, 0]
cov0 = [[1.0, 0], [0, 1.0]]
X0 = np.random.multivariate_normal(mean0, cov0, n_samples)
y0 = np.zeros(n_samples)

# Class 1: correlated, larger covariance centered at [3, 3].
mean1 = [3, 3]
cov1 = [[1.5, 0.5], [0.5, 1.5]]
X1 = np.random.multivariate_normal(mean1, cov1, n_samples)
y1 = np.ones(n_samples)

X_gda = np.vstack([X0, X1])
y_gda = np.hstack([y0, y1])
X_train_gda, X_test_gda, y_train_gda, y_test_gda = train_test_split(
    X_gda, y_gda, test_size=0.3, random_state=42)

# Fit LDA (shared covariance -> linear boundary) and
# QDA (per-class covariance -> quadratic boundary).
lda = LinearDiscriminantAnalysis()
qda = QuadraticDiscriminantAnalysis()
lda.fit(X_train_gda, y_train_gda)
qda.fit(X_train_gda, y_train_gda)
y_pred_lda = lda.predict(X_test_gda)
y_pred_qda = qda.predict(X_test_gda)

print("\nGaussian Discriminant Analysis:")
print(f"LDA Accuracy: {accuracy_score(y_test_gda, y_pred_lda):.4f}")
print(f"QDA Accuracy: {accuracy_score(y_test_gda, y_pred_qda):.4f}")
# Visualize the raw data and both decision boundaries in a 1x3 grid.
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Panel 1: original data with class means marked by crosses.
axes[0].scatter(X0[:, 0], X0[:, 1], alpha=0.6, label='Class 0', s=30)
axes[0].scatter(X1[:, 0], X1[:, 1], alpha=0.6, label='Class 1', s=30)
axes[0].scatter(mean0[0], mean0[1], c='blue', marker='x', s=200, linewidths=3)
axes[0].scatter(mean1[0], mean1[1], c='orange', marker='x', s=200, linewidths=3)
axes[0].set_title('Original Data with Class Means')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Panel 2: LDA boundary.
# BUG FIX: plt.sca must be called BEFORE plot_decision_boundary, because
# the helper draws on the *current* axes. The original called sca() after
# plotting, which sent both boundary plots to the wrong subplots.
plt.sca(axes[1])
plot_decision_boundary(lda, X_test_gda, y_test_gda,
                       f'LDA (Acc: {accuracy_score(y_test_gda, y_pred_lda):.3f})')

# Panel 3: QDA boundary.
plt.sca(axes[2])
plot_decision_boundary(qda, X_test_gda, y_test_gda,
                       f'QDA (Acc: {accuracy_score(y_test_gda, y_pred_qda):.3f})')

plt.tight_layout()
plt.show()
2.1 LDA: Linear Decision Boundary
For LDA with shared covariance, the decision boundary is linear:
class LinearDiscriminantAnalysisCustom:
    """Linear Discriminant Analysis implemented from scratch.

    Models each class as a Gaussian with its own mean but a single pooled
    (shared) covariance matrix, which makes the decision boundary linear.
    Prediction is fully vectorized, replacing the original per-sample
    Python loop with one matrix product over all samples and classes.
    """

    def fit(self, X, y):
        """Estimate class priors, class means, and the pooled covariance.

        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
        y : ndarray of shape (n_samples,)
        """
        self.classes = np.unique(y)
        n_features = X.shape[1]

        # Empirical priors and per-class means (MLE).
        self.priors = {}
        self.means = {}
        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = len(X_c) / len(X)
            self.means[c] = np.mean(X_c, axis=0)

        # Pooled within-class scatter, normalized by (N - K) for an
        # unbiased estimate of the shared covariance.
        self.covariance = np.zeros((n_features, n_features))
        for c in self.classes:
            X_c = X[y == c]
            X_c_centered = X_c - self.means[c]
            self.covariance += (X_c_centered.T @ X_c_centered)
        self.covariance /= (len(X) - len(self.classes))

        # Precompute the inverse used by the discriminant functions.
        self.cov_inv = np.linalg.inv(self.covariance)

    def predict(self, X):
        """Assign each sample the class with the largest linear discriminant.

        delta_c(x) = x^T Sigma^-1 mu_c - 0.5 mu_c^T Sigma^-1 mu_c + log p(c)

        Returns
        -------
        ndarray of shape (n_samples,) of predicted class labels.
        """
        X = np.asarray(X)
        # Stack class means in self.classes order: shape (n_classes, n_features).
        M = np.stack([self.means[c] for c in self.classes])
        W = self.cov_inv @ M.T  # (n_features, n_classes)
        # Per-class constant term: -0.5 mu_c^T Sigma^-1 mu_c + log prior.
        bias = (-0.5 * np.einsum('cf,fc->c', M, W)
                + np.log([self.priors[c] for c in self.classes]))
        scores = X @ W + bias  # (n_samples, n_classes)
        return self.classes[np.argmax(scores, axis=1)]
# Compare the from-scratch LDA against sklearn's implementation.
lda_custom = LinearDiscriminantAnalysisCustom()
lda_custom.fit(X_train_gda, y_train_gda)
y_pred_custom = lda_custom.predict(X_test_gda)
print(f"\nCustom LDA Accuracy: {accuracy_score(y_test_gda, y_pred_custom):.4f}")

# Visualize both decision boundaries side by side.
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plot_decision_boundary(lda_custom, X_test_gda, y_test_gda, 'Custom LDA')
plt.subplot(1, 2, 2)
plot_decision_boundary(lda, X_test_gda, y_test_gda, 'Sklearn LDA')
plt.tight_layout()
plt.show()
3. Generative vs Discriminative Models
Comparison
Generative Models (Naive Bayes, LDA, QDA):
Model \(p(\mathbf{x}|y)\) and \(p(y)\)
Can generate new samples
Can handle missing data
More assumptions, potentially less accurate
Discriminative Models (Logistic Regression, SVM):
Model \(p(y|\mathbf{x})\) directly
Often more accurate for classification
Fewer assumptions about data distribution
from sklearn.linear_model import LogisticRegression

# Compare generative vs discriminative sample efficiency by training on
# progressively larger datasets and tracking held-out accuracy.
sample_sizes = [50, 100, 200, 500, 1000, 2000]
gnb_scores = []
lda_scores = []
lr_scores = []
for n in sample_sizes:
    X, y = make_classification(n_samples=n, n_features=10, n_informative=8,
                               n_redundant=2, random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
    # Generative models.
    gnb = GaussianNB().fit(X_tr, y_tr)
    lda_model = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
    # Discriminative model.
    lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    gnb_scores.append(accuracy_score(y_te, gnb.predict(X_te)))
    lda_scores.append(accuracy_score(y_te, lda_model.predict(X_te)))
    lr_scores.append(accuracy_score(y_te, lr.predict(X_te)))

# Plot learning curves: accuracy as a function of training-set size.
plt.figure(figsize=(12, 6))
plt.plot(sample_sizes, gnb_scores, 'o-', label='Naive Bayes (Generative)', linewidth=2)
plt.plot(sample_sizes, lda_scores, 's-', label='LDA (Generative)', linewidth=2)
plt.plot(sample_sizes, lr_scores, '^-', label='Logistic Regression (Discriminative)', linewidth=2)
plt.xlabel('Training Sample Size')
plt.ylabel('Test Accuracy')
plt.title('Generative vs Discriminative Models: Sample Efficiency')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nKey Observations:")
print("- Generative models often perform better with small sample sizes")
print("- Discriminative models typically achieve higher asymptotic accuracy")
print("- Trade-off between assumptions and flexibility")
4. Spam Detection with Naive Bayes
What: This code builds an end-to-end spam detection pipeline: generating synthetic email data from spam- and ham-specific word distributions, vectorizing with CountVectorizer, training MultinomialNB, and evaluating with a confusion matrix and log-odds analysis of the most discriminative words.
Why: Spam detection is a canonical binary classification problem where generative models shine. The log-odds ratio \(\log \frac{p(w|spam)}{p(w|ham)}\) for each word \(w\) directly quantifies how much evidence that word provides for or against the spam hypothesis. This interpretability is critical for understanding false positives (legitimate emails flagged as spam), which have high business cost.
How: The confusion matrix reveals the trade-off between precision and recall: false negatives (spam reaching the inbox) degrade user experience, while false positives (ham sent to spam) can cause missed communications. The log-odds visualization at the end ranks features by their discriminative power, implementing a form of naive feature importance that maps directly to the Naive Bayes decision rule: \(\hat{y} = \arg\max_c \left[\log p(c) + \sum_j x_j \log \theta_{cj}\right]\).
Connection: Production email systems at scale (Gmail, Outlook) evolved from Naive Bayes foundations. The log-odds feature analysis demonstrated here is a precursor to more sophisticated explainability methods like LIME and SHAP used in modern ML systems.
# Simulate a spam detection dataset from two small vocabularies.
np.random.seed(42)

# Words characteristic of spam messages.
spam_words = ['free', 'win', 'money', 'click', 'offer', 'prize', 'buy', 'cheap']
# Words characteristic of legitimate (ham) messages.
ham_words = ['meeting', 'project', 'report', 'schedule', 'team', 'work', 'deadline']
def generate_message(word_list, n_words=20, noise=0.1):
    """Build a space-joined message of ``n_words`` tokens from ``word_list``.

    With probability ``noise`` each token is instead drawn from the combined
    spam + ham vocabulary, slightly blurring the class boundary.
    """
    tokens = [
        np.random.choice(spam_words + ham_words) if np.random.random() < noise
        else np.random.choice(word_list)
        for _ in range(n_words)
    ]
    return ' '.join(tokens)
# Generate a balanced dataset of alternating spam and ham messages.
n_messages = 1000
messages = []
labels = []
for _ in range(n_messages // 2):
    messages.append(generate_message(spam_words))
    labels.append(1)  # Spam
    messages.append(generate_message(ham_words))
    labels.append(0)  # Ham

# Split data into train and test sets.
X_train_spam, X_test_spam, y_train_spam, y_test_spam = train_test_split(
    messages, labels, test_size=0.3, random_state=42)

# Vectorize messages into word-count features.
vectorizer_spam = CountVectorizer()
X_train_spam_vec = vectorizer_spam.fit_transform(X_train_spam)
X_test_spam_vec = vectorizer_spam.transform(X_test_spam)

# Train Multinomial Naive Bayes with Laplace smoothing (alpha=1).
spam_classifier = MultinomialNB(alpha=1.0)
spam_classifier.fit(X_train_spam_vec, y_train_spam)
y_pred_spam = spam_classifier.predict(X_test_spam_vec)

print("Spam Detection with Naive Bayes:")
print(f"Accuracy: {accuracy_score(y_test_spam, y_pred_spam):.4f}")
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test_spam, y_pred_spam)
print(cm)

# Visualize the confusion matrix as an annotated heatmap.
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.title('Spam Detection Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
# Rank words by log-odds ratio: log p(w|spam) - log p(w|ham).
# Positive values are evidence for spam, negative values for ham.
feature_names = vectorizer_spam.get_feature_names_out()
log_prob_spam = spam_classifier.feature_log_prob_[1]
log_prob_ham = spam_classifier.feature_log_prob_[0]

log_odds = log_prob_spam - log_prob_ham
top_spam_indices = np.argsort(log_odds)[-10:]  # 10 largest (most spam-like)
top_ham_indices = np.argsort(log_odds)[:10]    # 10 smallest (most ham-like)

print("\nTop Spam Indicators:")
for idx in reversed(top_spam_indices):
    print(f"{feature_names[idx]:15s}: {log_odds[idx]:.3f}")
print("\nTop Ham Indicators:")
for idx in top_ham_indices:
    print(f"{feature_names[idx]:15s}: {log_odds[idx]:.3f}")
5. Effect of the Naive Assumption
What: This code systematically measures how feature correlation impacts Naive Bayes accuracy compared to QDA (which models the full covariance). By generating 2D Gaussian data with correlations ranging from 0 to 0.9, we observe whether violating the conditional independence assumption \(p(x_1, x_2|y) = p(x_1|y)p(x_2|y)\) degrades Naive Bayes performance.
Why: The naive independence assumption is almost always wrong in practice — real features are correlated. Understanding when this violation matters is essential for deciding whether to use Naive Bayes or invest in more complex models. Theoretically, Naive Bayes can still achieve optimal classification even with correlated features, because correct class ranking (not calibrated probabilities) is what matters for the \(\arg\max\) decision. The bias-variance trade-off also plays a role: the restricted model family reduces variance, which can offset the bias from ignoring correlations.
Connection: This experiment addresses a fundamental question in ML model selection: simpler models with wrong assumptions can outperform complex models with correct assumptions when data is limited. In practice, Naive Bayes remains competitive in high-dimensional text classification and medical diagnosis despite pervasive feature correlations, demonstrating that model robustness often matters more than correctness of assumptions.
# Compare Naive Bayes vs full-covariance QDA as feature correlation grows.
correlations = [0.0, 0.3, 0.5, 0.7, 0.9]
nb_accuracies = []
qda_accuracies = []
for corr in correlations:
    # Two Gaussian classes sharing the same correlated covariance.
    cov = [[1.0, corr], [corr, 1.0]]
    X0 = np.random.multivariate_normal([0, 0], cov, 300)
    X1 = np.random.multivariate_normal([2, 2], cov, 300)
    X = np.vstack([X0, X1])
    y = np.hstack([np.zeros(300), np.ones(300)])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
    # Naive Bayes (assumes feature independence).
    nb = GaussianNB()
    nb.fit(X_tr, y_tr)
    nb_accuracies.append(accuracy_score(y_te, nb.predict(X_te)))
    # QDA (models the full covariance matrix).
    qda_model = QuadraticDiscriminantAnalysis()
    qda_model.fit(X_tr, y_tr)
    qda_accuracies.append(accuracy_score(y_te, qda_model.predict(X_te)))

# Plot test accuracy as a function of feature correlation.
plt.figure(figsize=(10, 6))
plt.plot(correlations, nb_accuracies, 'o-', label='Naive Bayes', linewidth=2, markersize=8)
plt.plot(correlations, qda_accuracies, 's-', label='QDA (Full Covariance)', linewidth=2, markersize=8)
plt.xlabel('Feature Correlation')
plt.ylabel('Test Accuracy')
plt.title('Impact of Feature Correlation on Naive Bayes')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nObservation:")
print("Naive Bayes remains surprisingly robust even with correlated features,")
print("though QDA performs slightly better as correlation increases.")
Summary
In this notebook, we covered:
Naive Bayes Classifiers: Gaussian, Multinomial, and Bernoulli variants
Gaussian Discriminant Analysis: LDA and QDA for classification
Generative vs Discriminative: Trade-offs and sample efficiency
Practical Applications: Text classification and spam detection
Naive Independence Assumption: Its impact and robustness
Key Takeaways
Generative models learn the joint distribution \(p(\mathbf{x}, y)\)
Naive Bayes is simple, fast, and works well even with violated assumptions
LDA assumes shared covariance (linear boundary), QDA allows class-specific covariance (quadratic boundary)
Generative models often perform better with limited training data
Choice between generative and discriminative depends on data characteristics and available samples
Exercises
Implement Bernoulli Naive Bayes from scratch
Derive the decision boundary for 2-class LDA
Compare Naive Bayes with different smoothing parameters
Implement feature selection for Naive Bayes using mutual information
Explore the effect of class imbalance on generative classifiers