Chapter 4: Classification

Overview

Predicting a qualitative (categorical) response rather than a quantitative one.

Examples:

  • Email: spam or not spam

  • Medical: disease or no disease

  • Customer: will default or not

  • Transaction: fraudulent or legitimate

Key Methods:

  1. Logistic Regression: Models P(Y=1|X) using logistic function

  2. Linear Discriminant Analysis (LDA): Assumes Gaussian distributions, linear boundary

  3. Quadratic Discriminant Analysis (QDA): Assumes Gaussian distributions, quadratic boundary

  4. Naive Bayes: Assumes feature independence

  5. K-Nearest Neighbors (KNN): Non-parametric, local averaging

Why Not Linear Regression?

For binary outcomes (0/1), linear regression:

  • Can predict values < 0 or > 1

  • Offers no probabilistic interpretation of its output

  • Imposes an ordering on the categories in the multi-class case

Classification methods provide:

  • Probabilities: P(Y = k | X)

  • Proper handling of categorical outcomes

  • Better decision boundaries

4.1 Logistic Regression

The Logistic Function

Instead of modeling Y directly, model the probability:

\[P(Y=1|X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}\]

Log-Odds (Logit)

\[\log\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right) = \beta_0 + \beta_1 X\]

Key Properties:

  • Output always between 0 and 1

  • S-shaped curve

  • Linear in log-odds

Interpretation

  • \(\beta_1 > 0\): Increasing X increases probability

  • \(\beta_1 < 0\): Increasing X decreases probability

  • \(e^{\beta_1}\): Odds ratio (multiplicative effect on odds)
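The odds-ratio interpretation is easy to verify numerically. A minimal sketch (the coefficients β₀ = -3 and β₁ = 0.5 are made up for illustration, not fitted values):

```python
import math

def logistic(z):
    """Inverse of the logit: maps log-odds z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

beta0, beta1 = -3.0, 0.5      # illustrative coefficients

# Probability at X = 4: log-odds = -3 + 0.5*4 = -1
p = logistic(beta0 + beta1 * 4)              # ≈ 0.269

# A one-unit increase in X multiplies the odds by e^beta1
odds_before = p / (1 - p)
p_next = logistic(beta0 + beta1 * 5)
odds_after = p_next / (1 - p_next)
print(odds_after / odds_before, math.exp(beta1))   # both ≈ 1.6487
```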

Maximum Likelihood Estimation

Coefficients are estimated by maximizing the likelihood function:

\[L(\beta_0, \beta_1) = \prod_{i:y_i=1} p(x_i) \prod_{i:y_i=0} (1-p(x_i))\]

# Imports used throughout this chapter's code cells
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report,
                             roc_curve, roc_auc_score)

# Generate binary classification data
np.random.seed(42)
n = 500

# Feature: credit score (300-850)
credit_score = np.random.uniform(300, 850, n)

# True model: P(default) = logistic(8 - 0.015*score)
# Higher credit score → lower default probability
true_beta0 = 8
true_beta1 = -0.015
log_odds = true_beta0 + true_beta1 * credit_score
prob_default = 1 / (1 + np.exp(-log_odds))

# Generate binary outcomes
default = np.random.binomial(1, prob_default)

df_default = pd.DataFrame({
    'CreditScore': credit_score,
    'Default': default
})

print("📊 Credit Default Dataset")
print(f"\nTotal observations: {n}")
print(f"Default rate: {default.mean():.2%}")
print(f"\nClass distribution:")
print(df_default['Default'].value_counts())
print(f"\nFirst few rows:")
print(df_default.head(10))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw data
axes[0].scatter(credit_score[default==0], default[default==0], 
               alpha=0.3, label='No Default', s=30)
axes[0].scatter(credit_score[default==1], default[default==1], 
               alpha=0.3, label='Default', s=30)
axes[0].set_xlabel('Credit Score')
axes[0].set_ylabel('Default (0=No, 1=Yes)')
axes[0].set_title('Raw Classification Data')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# True probability curve
score_range = np.linspace(300, 850, 200)
true_prob = 1 / (1 + np.exp(-(true_beta0 + true_beta1 * score_range)))

axes[1].plot(score_range, true_prob, 'r-', linewidth=3, label='True P(Default)')
axes[1].scatter(credit_score, default, alpha=0.2, s=20)
axes[1].axhline(y=0.5, color='k', linestyle='--', alpha=0.5, label='Decision boundary')
axes[1].set_xlabel('Credit Score')
axes[1].set_ylabel('P(Default | Credit Score)')
axes[1].set_title('Logistic Function (True Model)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
# Fit logistic regression
X = df_default[['CreditScore']].values
y = df_default['Default'].values

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit model (large C ≈ effectively unregularized; sklearn's default L2 penalty
# would bias the estimates away from the true coefficients we compare below)
logistic_model = LogisticRegression(C=1e6)
logistic_model.fit(X_train, y_train)

beta0_hat = logistic_model.intercept_[0]
beta1_hat = logistic_model.coef_[0][0]

print("📈 Logistic Regression Results\n")
print(f"{'Parameter':<20} {'True Value':<15} {'Estimated':<15} {'Difference'}")
print("="*65)
print(f"{'β₀ (Intercept)':<20} {true_beta0:>12.4f}   {beta0_hat:>12.4f}   {abs(beta0_hat-true_beta0):>10.4f}")
print(f"{'β₁ (CreditScore)':<20} {true_beta1:>12.6f}   {beta1_hat:>12.6f}   {abs(beta1_hat-true_beta1):>10.6f}")

# Predictions
y_pred_prob = logistic_model.predict_proba(X_test)[:, 1]
y_pred = logistic_model.predict(X_test)

# Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"\n📊 Model Performance on Test Set:")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}  (Of predicted defaults, how many were correct?)")
print(f"Recall:    {recall:.4f}  (Of actual defaults, how many did we catch?)")
print(f"F1-Score:  {f1:.4f}  (Harmonic mean of precision and recall)")

# Visualize fitted model
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
score_plot = np.linspace(300, 850, 200).reshape(-1, 1)
prob_plot = logistic_model.predict_proba(score_plot)[:, 1]
true_prob_plot = 1 / (1 + np.exp(-(true_beta0 + true_beta1 * score_plot.flatten())))

plt.plot(score_plot, true_prob_plot, 'g--', linewidth=2, label='True P(Default)', alpha=0.7)
plt.plot(score_plot, prob_plot, 'r-', linewidth=2, label='Estimated P(Default)')
plt.scatter(X_test, y_test, alpha=0.3, s=30, label='Test data')
plt.axhline(y=0.5, color='k', linestyle='--', alpha=0.5)
plt.xlabel('Credit Score')
plt.ylabel('P(Default)')
plt.title('Fitted Logistic Regression')
plt.legend()
plt.grid(True, alpha=0.3)

# Decision boundary
plt.subplot(1, 2, 2)
decision_boundary = -beta0_hat / beta1_hat
true_boundary = -true_beta0 / true_beta1

plt.scatter(X_test[y_pred==0], y_test[y_pred==0], c='blue', alpha=0.5, 
           s=50, label='Predicted: No Default')
plt.scatter(X_test[y_pred==1], y_test[y_pred==1], c='red', alpha=0.5, 
           s=50, label='Predicted: Default')
plt.axvline(x=decision_boundary, color='r', linestyle='-', linewidth=2, 
           label=f'Decision boundary ({decision_boundary:.0f})')
plt.axvline(x=true_boundary, color='g', linestyle='--', linewidth=2, 
           label=f'True boundary ({true_boundary:.0f})')
plt.xlabel('Credit Score')
plt.ylabel('Default')
plt.title('Classification with Decision Boundary')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n💡 Interpretation:")
print(f"   • Decision boundary at score = {decision_boundary:.0f}")
print(f"   • Below {decision_boundary:.0f}: Predict default")
print(f"   • Above {decision_boundary:.0f}: Predict no default")
print(f"   • Odds ratio = e^({beta1_hat:.6f}) = {np.exp(beta1_hat):.6f}")
print(f"   • 10-point score increase multiplies odds by {np.exp(10*beta1_hat):.4f}")

4.2 Classification Metrics

Confusion Matrix

                 Predicted
              Negative  Positive
Actual Neg       TN        FP
       Pos       FN        TP

Key Metrics

  • Accuracy = (TP + TN) / Total

  • Precision = TP / (TP + FP) – Of predicted positives, how many correct?

  • Recall (Sensitivity) = TP / (TP + FN) – Of actual positives, how many caught?

  • Specificity = TN / (TN + FP) – Of actual negatives, how many caught?

  • F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

ROC Curve

  • Plot: Recall (TPR) vs False Positive Rate (FPR)

  • FPR = FP / (FP + TN) = 1 - Specificity

  • AUC (Area Under Curve): Overall performance measure

    • AUC = 1.0: Perfect classifier

    • AUC = 0.5: Random guessing

    • AUC < 0.5: Worse than random
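An equivalent reading of AUC: the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counted as 1/2). A small check of that rank interpretation on toy scores (the scores below are arbitrary, not from the model in this chapter):

```python
from itertools import product

y_true = [0, 0, 0, 0, 1, 1, 1]
scores = [0.1, 0.3, 0.35, 0.8, 0.4, 0.7, 0.9]   # arbitrary classifier scores

pos = [s for s, y in zip(scores, y_true) if y == 1]
neg = [s for s, y in zip(scores, y_true) if y == 0]

# AUC = P(random positive ranked above random negative); ties count 1/2
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p, n in product(pos, neg))
auc = wins / (len(pos) * len(neg))
print(auc)   # 10 of 12 pairs ranked correctly → 0.8333...
```

This is the same number `roc_auc_score` reports for these inputs.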

# Confusion matrix and detailed metrics
cm = confusion_matrix(y_test, y_pred)

print("📊 Confusion Matrix\n")
print("              Predicted")
print("              No Default  Default")
print(f"Actual No      {cm[0,0]:>6}      {cm[0,1]:>6}   (TN, FP)")
print(f"      Yes      {cm[1,0]:>6}      {cm[1,1]:>6}   (FN, TP)")

TN, FP, FN, TP = cm[0,0], cm[0,1], cm[1,0], cm[1,1]

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP) if (TP + FP) > 0 else 0
recall = TP / (TP + FN) if (TP + FN) > 0 else 0
specificity = TN / (TN + FP) if (TN + FP) > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print(f"\n📈 Detailed Metrics:")
print(f"\nAccuracy:    {accuracy:.4f}  = ({TP}+{TN}) / {TP+TN+FP+FN}")
print(f"Precision:   {precision:.4f}  = {TP} / ({TP}+{FP})  [Of predicted defaults, {precision:.1%} were correct]")
print(f"Recall:      {recall:.4f}  = {TP} / ({TP}+{FN})  [Of actual defaults, caught {recall:.1%}]")
print(f"Specificity: {specificity:.4f}  = {TN} / ({TN}+{FP})  [Of actual non-defaults, {specificity:.1%} correct]")
print(f"F1-Score:    {f1:.4f}")

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
roc_auc = roc_auc_score(y_test, y_pred_prob)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion matrix heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
           xticklabels=['No Default', 'Default'],
           yticklabels=['No Default', 'Default'])
axes[0].set_ylabel('Actual')
axes[0].set_xlabel('Predicted')
axes[0].set_title(f'Confusion Matrix\nAccuracy: {accuracy:.2%}')

# ROC curve
axes[1].plot(fpr, tpr, linewidth=2, label=f'Logistic Regression (AUC = {roc_auc:.3f})')
axes[1].plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier (AUC = 0.5)')
axes[1].set_xlabel('False Positive Rate (1 - Specificity)')
axes[1].set_ylabel('True Positive Rate (Recall)')
axes[1].set_title('ROC Curve')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n💡 ROC-AUC Interpretation:")
print(f"   • AUC = {roc_auc:.3f} ({roc_auc*100:.1f}%)")
if roc_auc > 0.9:
    print(f"   • Excellent discrimination")
elif roc_auc > 0.8:
    print(f"   • Good discrimination")
elif roc_auc > 0.7:
    print(f"   • Fair discrimination")
else:
    print(f"   • Poor discrimination")
if roc_auc > 0.7:
    print(f"   • Model is much better than random guessing ✅")

4.3 Linear Discriminant Analysis (LDA)

Bayes' Theorem Approach

\[P(Y=k|X=x) = \frac{P(X=x|Y=k) \cdot P(Y=k)}{P(X=x)} = \frac{\pi_k \cdot f_k(x)}{\sum_{l=1}^K \pi_l \cdot f_l(x)}\]

where:

  • \(\pi_k\) = P(Y=k) – prior probability of class k

  • \(f_k(x)\) – density of X in class k

LDA Assumptions

  1. Normal distributions: \(f_k(x) \sim N(\mu_k, \sigma^2)\)

  2. Common variance: Same \(\sigma^2\) for all classes

Decision Rule

Assign to the class k that maximizes:

\[\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)\]

This gives a linear decision boundary between classes.
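Plugging numbers into δ_k makes the linearity concrete: with equal priors, the two discriminants cross exactly at the midpoint of the class means. A one-predictor sketch (the means, variance, and priors are invented for illustration):

```python
import math

mu = [2.0, 6.0]          # class means (illustrative)
sigma2 = 1.5             # shared variance (the LDA assumption)
pi = [0.5, 0.5]          # equal priors

def delta(x, k):
    """LDA discriminant score for class k at point x."""
    return x * mu[k] / sigma2 - mu[k] ** 2 / (2 * sigma2) + math.log(pi[k])

# With equal priors the boundary sits at the midpoint of the means
boundary = (mu[0] + mu[1]) / 2
print(delta(boundary, 0), delta(boundary, 1))   # equal scores at x = 4
print(delta(3.0, 0) > delta(3.0, 1))            # x = 3 → classified as class 0
```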

When to Use LDA

  • Classes well-separated

  • Normal distribution reasonable

  • Small sample sizes (more stable than logistic)

  • Multi-class problems (>2 classes)

# Generate 2D data for LDA visualization
np.random.seed(42)
n_per_class = 200

# Class 0: mean=(2, 2)
# Class 1: mean=(5, 5)
# Common covariance
mean0 = np.array([2, 2])
mean1 = np.array([5, 5])
cov = np.array([[1, 0.5], [0.5, 1]])  # Common covariance

X0 = np.random.multivariate_normal(mean0, cov, n_per_class)
X1 = np.random.multivariate_normal(mean1, cov, n_per_class)

X_lda = np.vstack([X0, X1])
y_lda = np.hstack([np.zeros(n_per_class), np.ones(n_per_class)])

# Split data
X_train_lda, X_test_lda, y_train_lda, y_test_lda = train_test_split(
    X_lda, y_lda, test_size=0.3, random_state=42)

# Fit LDA
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train_lda, y_train_lda)

# Fit Logistic Regression for comparison
logistic_2d = LogisticRegression()
logistic_2d.fit(X_train_lda, y_train_lda)

# Predictions
y_pred_lda = lda_model.predict(X_test_lda)
y_pred_logistic = logistic_2d.predict(X_test_lda)

acc_lda = accuracy_score(y_test_lda, y_pred_lda)
acc_logistic = accuracy_score(y_test_lda, y_pred_logistic)

print("📊 LDA vs Logistic Regression\n")
print(f"LDA Test Accuracy:      {acc_lda:.4f}")
print(f"Logistic Test Accuracy: {acc_logistic:.4f}")

# Visualize decision boundaries
def plot_decision_boundary(model, X, y, ax, title):
    h = 0.1
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    ax.scatter(X[y==0, 0], X[y==0, 1], c='blue', s=30, alpha=0.6, label='Class 0')
    ax.scatter(X[y==1, 0], X[y==1, 1], c='red', s=30, alpha=0.6, label='Class 1')
    ax.set_xlabel('X₁')
    ax.set_ylabel('X₂')
    ax.set_title(title)
    ax.legend()
    ax.grid(True, alpha=0.3)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

plot_decision_boundary(lda_model, X_test_lda, y_test_lda, axes[0], 
                      f'LDA Decision Boundary\nAccuracy: {acc_lda:.2%}')
plot_decision_boundary(logistic_2d, X_test_lda, y_test_lda, axes[1], 
                      f'Logistic Regression Boundary\nAccuracy: {acc_logistic:.2%}')

plt.tight_layout()
plt.show()

print(f"\n💡 Observations:")
print(f"   • Both produce linear boundaries")
print(f"   • LDA assumes Gaussian distributions with equal variance")
print(f"   • Logistic makes no distributional assumptions")
print(f"   • Similar performance when LDA assumptions met ✅")

4.4 Quadratic Discriminant Analysis (QDA)

Difference from LDA

  • LDA: Assumes common covariance matrix → Linear boundary

  • QDA: Allows different covariance per class → Quadratic boundary

QDA Decision Function

\[\delta_k(x) = -\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k) - \frac{1}{2}\log|\Sigma_k| + \log(\pi_k)\]

Quadratic in x → Can model more complex boundaries

Bias-Variance Tradeoff

  • LDA: Lower variance, higher bias (fewer parameters)

  • QDA: Higher variance, lower bias (more parameters)

When to Use QDA

  • Decision boundary is clearly non-linear

  • Large training set (many parameters to estimate)

  • Classes have different variances

# Generate data with different covariances (QDA advantageous)
np.random.seed(42)
n_per_class = 200

mean0 = np.array([2, 2])
mean1 = np.array([5, 5])

# DIFFERENT covariances
cov0 = np.array([[1, 0], [0, 1]])  # Circular
cov1 = np.array([[3, 2], [2, 3]])  # Elliptical

X0_qda = np.random.multivariate_normal(mean0, cov0, n_per_class)
X1_qda = np.random.multivariate_normal(mean1, cov1, n_per_class)

X_qda = np.vstack([X0_qda, X1_qda])
y_qda = np.hstack([np.zeros(n_per_class), np.ones(n_per_class)])

# Split
X_train_qda, X_test_qda, y_train_qda, y_test_qda = train_test_split(
    X_qda, y_qda, test_size=0.3, random_state=42)

# Fit models
lda_qda = LinearDiscriminantAnalysis()
qda_model = QuadraticDiscriminantAnalysis()

lda_qda.fit(X_train_qda, y_train_qda)
qda_model.fit(X_train_qda, y_train_qda)

# Predictions
acc_lda_qda = accuracy_score(y_test_qda, lda_qda.predict(X_test_qda))
acc_qda = accuracy_score(y_test_qda, qda_model.predict(X_test_qda))

print("📊 LDA vs QDA on Non-Equal Covariance Data\n")
print(f"LDA Test Accuracy: {acc_lda_qda:.4f}  (assumes equal covariance)")
print(f"QDA Test Accuracy: {acc_qda:.4f}  (allows different covariances) ✅")
print(f"\nImprovement: {(acc_qda - acc_lda_qda)*100:.1f} percentage points")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

plot_decision_boundary(lda_qda, X_test_qda, y_test_qda, axes[0], 
                      f'LDA (Linear Boundary)\nAccuracy: {acc_lda_qda:.2%}')
plot_decision_boundary(qda_model, X_test_qda, y_test_qda, axes[1], 
                      f'QDA (Quadratic Boundary)\nAccuracy: {acc_qda:.2%}')

plt.tight_layout()
plt.show()

print(f"\n💡 Key Insights:")
print(f"   • QDA captures curved boundary between classes")
print(f"   • LDA forced to use straight line (suboptimal)")
print(f"   • QDA more flexible but needs more data")
print(f"   • Use QDA when classes have different spreads/shapes")

4.5 Naive Bayes

The Assumption

Features are conditionally independent given the class:

\[P(X_1, X_2, \ldots, X_p | Y=k) = \prod_{j=1}^p P(X_j | Y=k)\]

Classification Rule

\[P(Y=k|X) \propto P(Y=k) \prod_{j=1}^p P(X_j|Y=k)\]

Pros

  • Very fast (simple calculations)

  • Works well with high-dimensional data

  • Performs surprisingly well even when assumption violated

  • Good for text classification

Cons

  • Independence assumption often unrealistic

  • Can't model feature interactions

  • Probability estimates can be poor
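For continuous features, a common choice models each P(X_j | Y=k) as a univariate Gaussian, so a class score is a log prior plus a sum of per-feature log densities. A from-scratch sketch of that Gaussian variant on a made-up two-feature dataset (essentially what sklearn's `GaussianNB` computes, minus variance smoothing):

```python
import math

# Tiny two-feature training set (made up for illustration)
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],   # class 0
     [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]]   # class 1
y = [0, 0, 0, 1, 1, 1]

def fit(X, y):
    """Per class: prior, and per-feature mean/variance."""
    params = {}
    for k in set(y):
        rows = [x for x, yi in zip(X, y) if yi == k]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        varis = [sum((v - m) ** 2 for v in col) / len(rows)
                 for col, m in zip(zip(*rows), means)]
        params[k] = (len(rows) / len(y), means, varis)
    return params

def log_gauss(x, m, v):
    """Log density of N(m, v) at x."""
    return -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)

def predict(params, x):
    """argmax_k [ log P(Y=k) + sum_j log P(x_j | Y=k) ] — the product rule in logs."""
    def score(k):
        prior, means, varis = params[k]
        return math.log(prior) + sum(log_gauss(xj, m, v)
                                     for xj, m, v in zip(x, means, varis))
    return max(params, key=score)

params = fit(X, y)
print(predict(params, [1.1, 2.1]), predict(params, [3.1, 0.4]))   # 0 1
```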

# Compare all methods on the QDA dataset
models = {
    'Logistic Regression': LogisticRegression(),
    'LDA': LinearDiscriminantAnalysis(),
    'QDA': QuadraticDiscriminantAnalysis(),
    'Naive Bayes': GaussianNB(),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5)
}

results = {}

for name, model in models.items():
    model.fit(X_train_qda, y_train_qda)
    y_pred = model.predict(X_test_qda)
    acc = accuracy_score(y_test_qda, y_pred)
    results[name] = acc

print("📊 Comparison of All Classification Methods\n")
print(f"{'Method':<25} {'Test Accuracy'}")
print("="*45)
for name, acc in sorted(results.items(), key=lambda x: x[1], reverse=True):
    print(f"{name:<25} {acc:>12.4f} {'✅' if acc == max(results.values()) else ''}")

# Visualize all decision boundaries
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

for idx, (name, model) in enumerate(models.items()):
    plot_decision_boundary(model, X_test_qda, y_test_qda, axes[idx], 
                          f'{name}\nAccuracy: {results[name]:.2%}')

# Summary in last subplot
axes[5].axis('off')
summary = "💡 Method Selection Guide:\n\n"
summary += "Logistic: General purpose\n"
summary += "LDA: Equal variance, linear\n"
summary += "QDA: Different variance, curved\n"
summary += "Naive Bayes: Fast, high-D\n"
summary += "KNN: Non-parametric, local\n\n"
best_name = max(results, key=results.get)
summary += f"Best here: {best_name} ({results[best_name]:.2%})"
axes[5].text(0.1, 0.5, summary, fontsize=11, verticalalignment='center',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.show()

4.6 Multi-Class Classification

Extending to K > 2 Classes

Logistic Regression: One-vs-Rest or Softmax (Multinomial)

  • Softmax: \(P(Y=k|X) = \frac{e^{\beta_k^T X}}{\sum_{l=1}^K e^{\beta_l^T X}}\)

LDA/QDA: Natural extension

  • Compute discriminant function for each class

  • Assign to class with highest score

KNN: Naturally handles multi-class

  • Majority vote among k neighbors
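The softmax step is a few lines; subtracting the max logit before exponentiating keeps it numerically stable (the logits below are arbitrary):

```python
import math

def softmax(logits):
    """Convert K per-class scores to probabilities that sum to 1."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))   # highest logit gets the highest probability
```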

# Generate 3-class data
np.random.seed(42)
n_per_class = 150

# Three classes with different centers
mean_A = np.array([0, 0])
mean_B = np.array([4, 0])
mean_C = np.array([2, 3.5])

cov_shared = np.array([[0.8, 0], [0, 0.8]])

X_A = np.random.multivariate_normal(mean_A, cov_shared, n_per_class)
X_B = np.random.multivariate_normal(mean_B, cov_shared, n_per_class)
X_C = np.random.multivariate_normal(mean_C, cov_shared, n_per_class)

X_multi = np.vstack([X_A, X_B, X_C])
y_multi = np.hstack([np.zeros(n_per_class), 
                     np.ones(n_per_class), 
                     2*np.ones(n_per_class)])

# Split
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X_multi, y_multi, test_size=0.3, random_state=42)

# Train LDA for multi-class
lda_multi = LinearDiscriminantAnalysis()
lda_multi.fit(X_train_multi, y_train_multi)

# Predictions
y_pred_multi = lda_multi.predict(X_test_multi)
acc_multi = accuracy_score(y_test_multi, y_pred_multi)

print("📊 Multi-Class Classification (3 Classes)\n")
print(f"Test Accuracy: {acc_multi:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test_multi, y_pred_multi, 
                          target_names=['Class A', 'Class B', 'Class C']))

# Confusion matrix
cm_multi = confusion_matrix(y_test_multi, y_pred_multi)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Decision boundaries
h = 0.1
x_min, x_max = X_multi[:, 0].min() - 1, X_multi[:, 0].max() + 1
y_min, y_max = X_multi[:, 1].min() - 1, X_multi[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = lda_multi.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

axes[0].contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
axes[0].scatter(X_test_multi[y_test_multi==0, 0], X_test_multi[y_test_multi==0, 1], 
               c='red', s=50, alpha=0.6, edgecolors='k', label='Class A')
axes[0].scatter(X_test_multi[y_test_multi==1, 0], X_test_multi[y_test_multi==1, 1], 
               c='blue', s=50, alpha=0.6, edgecolors='k', label='Class B')
axes[0].scatter(X_test_multi[y_test_multi==2, 0], X_test_multi[y_test_multi==2, 1], 
               c='green', s=50, alpha=0.6, edgecolors='k', label='Class C')
axes[0].set_xlabel('X₁')
axes[0].set_ylabel('X₂')
axes[0].set_title(f'Multi-Class LDA\nAccuracy: {acc_multi:.2%}')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Confusion matrix
sns.heatmap(cm_multi, annot=True, fmt='d', cmap='Blues', ax=axes[1],
           xticklabels=['Class A', 'Class B', 'Class C'],
           yticklabels=['Class A', 'Class B', 'Class C'])
axes[1].set_ylabel('Actual')
axes[1].set_xlabel('Predicted')
axes[1].set_title('Confusion Matrix (3 Classes)')

plt.tight_layout()
plt.show()

print(f"\n💡 Multi-Class Notes:")
print(f"   • LDA naturally extends to K > 2 classes")
print(f"   • Creates K-1 linear discriminants")
print(f"   • Confusion matrix shows per-class performance")
print(f"   • Can identify which classes are confused")

Key Takeaways

1. Method Comparison

Method        Boundary    Assumptions           Best For
Logistic      Linear      None                  General purpose, interpretable
LDA           Linear      Normal, equal Σ       Well-separated, small n
QDA           Quadratic   Normal, different Σ   Curved boundary, larger n
Naive Bayes   Any         Independence          High-D, text, fast
KNN           Any         None                  Non-parametric, irregular

2. Key Metrics

  • Accuracy: Overall correctness

  • Precision: Minimize false positives

  • Recall: Minimize false negatives

  • F1-Score: Balance precision/recall

  • ROC-AUC: Overall discrimination ability

3. Decision Framework

Linear boundary?
  Yes → Logistic or LDA
  No → QDA or KNN

Small sample?
  Yes → LDA (more stable)
  No → Logistic or QDA

Need probabilities?
  Yes → Logistic, LDA, QDA
  No → KNN acceptable

High dimensions?
  Yes → Naive Bayes, Logistic with regularization

4. Practical Tips

  • Always plot data first (2D/3D if possible)

  • Check class balance (imbalanced → adjust metrics/threshold)

  • Use cross-validation for model selection

  • Consider cost of errors (medical: high recall; spam: high precision)

  • ROC curve for threshold selection

  • Ensemble methods often beat single classifier

5. Common Pitfalls

  • Using accuracy with imbalanced data

  • Ignoring class prior probabilities

  • Not standardizing features (for KNN, LDA)

  • Overfitting with QDA on small samples

  • Trusting Naive Bayes probability estimates

Next Chapter

Chapter 5: Resampling Methods

  • Cross-Validation (validation set, LOOCV, k-fold)

  • Bootstrap

  • Model selection and assessment

Practice Exercises

Exercise 1: Logistic Regression Interpretation

Given logistic model: \(\log(\text{odds}) = -5 + 0.02 \times \text{Age}\)

  1. What is P(Y=1) for Age=30?

  2. At what age is P(Y=1) = 0.5?

  3. Interpret the coefficient 0.02

Exercise 2: Metrics Calculation

Given confusion matrix:

         Pred-  Pred+
Actual-   80     20
Actual+   10     90

Calculate: Accuracy, Precision, Recall, Specificity, F1

Exercise 3: Method Selection

For each scenario, choose the best method and explain:

  1. Email spam detection (10,000 word features)

  2. Medical diagnosis (n=50, 5 features, classes overlap)

  3. Customer churn (n=100,000, 20 features, non-linear)

Exercise 4: LDA vs QDA

When would you prefer LDA over QDA? Consider:

  • Sample size

  • Number of features

  • Decision boundary shape

Exercise 5: Implementation

Create synthetic 3-class data and compare:

  1. Logistic Regression (one-vs-rest)

  2. LDA

  3. KNN (try k=1, 5, 10)

Which performs best? Why?