Lecture 3 & 4: Logistic Regression & Classification

📹 Watch Lecture 3 | Watch Lecture 4

From Andrew Ng’s CS229 Lectures 3-4

Why Not Use Linear Regression for Classification?

“It’s tempting to just fit a straight line to this data and threshold at 0.5… But it turns out this is not a good idea for classification problems.” - Andrew Ng

The Problem: Adding a single outlier can drastically shift the decision boundary in linear regression, even when the pattern is obvious.

Solution: Logistic Regression

  • Uses the sigmoid function to map outputs to (0, 1)

  • Probabilistic interpretation

  • Stable decision boundaries

  • “Probably by far the most commonly used classification algorithm” - Andrew Ng

The Logistic/Sigmoid Function

From Lecture: “We want the hypothesis to output values between 0 and 1… so we define a function g(z) called the sigmoid or logistic function.”

\[g(z) = \frac{1}{1 + e^{-z}}\]

Properties:

  • Output range: (0, 1)

  • g(0) = 0.5

  • g(z) → 1 as z → ∞

  • g(z) → 0 as z → -∞

  • Derivative: g’(z) = g(z)(1 - g(z))

Hypothesis: \[h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}\]

Interpretation: h_θ(x) = estimated probability that y = 1 given x
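For example, if \(\theta^T x = 2\) for a particular input, then

\[h_\theta(x) = \frac{1}{1 + e^{-2}} \approx 0.88,\]

i.e., an estimated 88% probability that y = 1. Symmetrically, \(\theta^T x = -2\) gives \(h_\theta(x) \approx 0.12\).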

Setup

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer, load_iris, make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, roc_auc_score
)

# Set style and random seed
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded successfully!")

2.1 Understanding the Sigmoid Function

The sigmoid function maps any real number to the range (0, 1), making it perfect for probability estimation:

\[\sigma(z) = \frac{1}{1 + e^{-z}}\]

Key Properties:

  1. \(\sigma(0) = 0.5\)

  2. \(\sigma(z) \to 1\) as \(z \to \infty\)

  3. \(\sigma(z) \to 0\) as \(z \to -\infty\)

  4. \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) (derivative)

  5. \(1 - \sigma(z) = \sigma(-z)\) (symmetry)

def sigmoid(z):
    """
    Compute sigmoid function.
    
    Args:
        z: Input values (scalar or array)
    
    Returns:
        Sigmoid of z
    """
    return 1 / (1 + np.exp(-z))

# Visualize sigmoid function
z = np.linspace(-10, 10, 200)
sigmoid_values = sigmoid(z)
sigmoid_derivative = sigmoid_values * (1 - sigmoid_values)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Sigmoid function
axes[0].plot(z, sigmoid_values, linewidth=2, label='σ(z)')
axes[0].axhline(y=0.5, color='r', linestyle='--', alpha=0.5, label='Decision threshold')
axes[0].axvline(x=0, color='r', linestyle='--', alpha=0.5)
axes[0].grid(True, alpha=0.3)
axes[0].set_xlabel('z', fontsize=12)
axes[0].set_ylabel('σ(z)', fontsize=12)
axes[0].set_title('Sigmoid Function', fontsize=14, fontweight='bold')
axes[0].legend()

# Sigmoid derivative
axes[1].plot(z, sigmoid_derivative, linewidth=2, color='orange', label="σ'(z)")
axes[1].grid(True, alpha=0.3)
axes[1].set_xlabel('z', fontsize=12)
axes[1].set_ylabel("σ'(z)", fontsize=12)
axes[1].set_title('Sigmoid Derivative', fontsize=14, fontweight='bold')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"σ(0) = {sigmoid(0):.4f}")
print(f"σ(5) = {sigmoid(5):.4f}")
print(f"σ(-5) = {sigmoid(-5):.4f}")
print(f"\nMaximum derivative at z=0: {sigmoid_derivative[100]:.4f}")

2.2 Binary Classification: Simple Example

Let’s start with a simple 2D classification problem to understand decision boundaries.

# Generate synthetic 2D data
np.random.seed(42)
n_samples = 200

# Class 0: centered at (-2, -2)
X0 = np.random.randn(n_samples // 2, 2) + np.array([-2, -2])
y0 = np.zeros(n_samples // 2)

# Class 1: centered at (2, 2)
X1 = np.random.randn(n_samples // 2, 2) + np.array([2, 2])
y1 = np.ones(n_samples // 2)

# Combine
X = np.vstack([X0, X1])
y = np.hstack([y0, y1])

# Visualize data
plt.figure(figsize=(8, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', marker='o', label='Class 0', alpha=0.6, s=50)
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', marker='s', label='Class 1', alpha=0.6, s=50)
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Binary Classification Data', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Dataset shape: {X.shape}")
print(f"Class distribution: {np.bincount(y.astype(int))}")

2.3 Implementing Logistic Regression from Scratch

What and Why

Logistic regression models the probability of a binary outcome using the sigmoid function \(\sigma(z) = \frac{1}{1+e^{-z}}\), which maps any real-valued input to the interval \((0,1)\). Unlike linear regression, which predicts unbounded continuous values, logistic regression outputs values interpretable as probabilities, making it the foundation of binary classification in ML.

How It Works

The model computes \(p(y=1|x) = \sigma(w^T x + b)\) and is trained by minimizing the binary cross-entropy loss \(L = -\frac{1}{n}\sum_{i=1}^n [y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)]\) via gradient descent. The gradient has the elegant form \(\nabla_w L = \frac{1}{n} X^T(\hat{y} - y)\), identical in structure to the linear regression gradient but with predictions passed through the sigmoid. The code implements this update loop, tracking the loss at each iteration to verify convergence.

Connection to ML

Logistic regression is the building block of neural network output layers (softmax generalizes sigmoid to multi-class), and understanding its gradient is essential for grasping backpropagation.

def compute_cost_logistic(X, y, theta):
    """
    Compute cost for logistic regression (cross-entropy).
    
    Args:
        X: Feature matrix (m x n)
        y: Target vector (m,)
        theta: Parameters (n,)
    
    Returns:
        cost: Cross-entropy cost
    """
    m = len(y)
    h = sigmoid(X @ theta)
    
    # Avoid log(0) by adding small epsilon
    epsilon = 1e-10
    h = np.clip(h, epsilon, 1 - epsilon)
    
    cost = -(1/m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    return cost

def gradient_descent_logistic(X, y, theta, alpha, num_iters):
    """
    Perform gradient descent for logistic regression.
    
    Args:
        X: Feature matrix (m x n)
        y: Target vector (m,)
        theta: Initial parameters (n,)
        alpha: Learning rate
        num_iters: Number of iterations
    
    Returns:
        theta: Optimized parameters
        cost_history: Cost at each iteration
    """
    m = len(y)
    cost_history = []
    
    for i in range(num_iters):
        h = sigmoid(X @ theta)
        gradient = (1/m) * X.T @ (h - y)
        theta = theta - alpha * gradient
        
        cost = compute_cost_logistic(X, y, theta)
        cost_history.append(cost)
        
        if i % 100 == 0:
            print(f"Iteration {i}: Cost = {cost:.4f}")
    
    return theta, cost_history

def predict_logistic(X, theta, threshold=0.5):
    """
    Make predictions using logistic regression.
    
    Args:
        X: Feature matrix
        theta: Parameters
        threshold: Decision threshold (default 0.5)
    
    Returns:
        predictions: Binary predictions (0 or 1)
    """
    probabilities = sigmoid(X @ theta)
    return (probabilities >= threshold).astype(int)

# Add intercept term
X_with_intercept = np.column_stack([np.ones(len(X)), X])

# Initialize parameters
theta_init = np.zeros(X_with_intercept.shape[1])

# Train
print("Training Logistic Regression...\n")
theta_opt, cost_history = gradient_descent_logistic(
    X_with_intercept, y, theta_init, alpha=0.1, num_iters=1000
)

print(f"\nOptimized parameters: {theta_opt}")

# Make predictions
y_pred = predict_logistic(X_with_intercept, theta_opt)
accuracy = np.mean(y_pred == y)
print(f"Training accuracy: {accuracy:.4f}")

2.4 Visualizing Decision Boundary

What and Why

The decision boundary is the set of points where \(p(y=1|x) = 0.5\), which for logistic regression corresponds to the hyperplane \(w^T x + b = 0\). Visualizing this boundary alongside the data reveals how well the model separates classes and highlights regions of uncertainty.

How It Works

The code evaluates the sigmoid function on a dense 2D grid of points, then uses a contour plot at the 0.5 threshold to draw the linear decision boundary. Points on one side have \(\sigma(w^T x + b) > 0.5\) (predicted class 1) and on the other side \(\sigma(w^T x + b) < 0.5\) (predicted class 0). The color gradient shows the smooth probability transition between classes.

Connection to ML

Decision boundary visualization is a fundamental diagnostic tool in ML. While logistic regression produces linear boundaries, kernel methods and neural networks can learn nonlinear boundaries – comparing these visualizations helps build intuition about model expressiveness.

def plot_decision_boundary(X, y, theta, title='Decision Boundary'):
    """
    Plot decision boundary for 2D features.
    """
    # Create mesh
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 100),
                           np.linspace(x2_min, x2_max, 100))
    
    # Predict on mesh
    X_mesh = np.column_stack([np.ones(xx1.ravel().shape), xx1.ravel(), xx2.ravel()])
    Z = sigmoid(X_mesh @ theta).reshape(xx1.shape)
    
    # Plot
    fig, ax = plt.subplots(figsize=(10, 8))
    
    # Contour plot showing probabilities
    contour = ax.contourf(xx1, xx2, Z, levels=20, cmap='RdYlBu_r', alpha=0.6)
    plt.colorbar(contour, label='P(y=1|x)')
    
    # Decision boundary (where probability = 0.5)
    ax.contour(xx1, xx2, Z, levels=[0.5], colors='black', linewidths=3)
    
    # Scatter plot
    ax.scatter(X[y==0, 0], X[y==0, 1], c='blue', marker='o', 
               label='Class 0', s=50, edgecolors='k', alpha=0.7)
    ax.scatter(X[y==1, 0], X[y==1, 1], c='red', marker='s', 
               label='Class 1', s=50, edgecolors='k', alpha=0.7)
    
    ax.set_xlabel('Feature 1', fontsize=12)
    ax.set_ylabel('Feature 2', fontsize=12)
    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Plot decision boundary
plot_decision_boundary(X, y, theta_opt, 'Logistic Regression Decision Boundary')

# Plot cost history
plt.figure(figsize=(10, 5))
plt.plot(cost_history, linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Cost', fontsize=12)
plt.title('Cost Function Convergence', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

print(f"Initial cost: {cost_history[0]:.4f}")
print(f"Final cost: {cost_history[-1]:.4f}")

2.5 Real Dataset: Breast Cancer Classification

Now let’s apply logistic regression to a real medical dataset: predicting whether a breast tumor is malignant or benign.

# Load breast cancer dataset
data = load_breast_cancer()
X_cancer = data.data
y_cancer = data.target

print("Breast Cancer Dataset")
print(f"Samples: {X_cancer.shape[0]}")
print(f"Features: {X_cancer.shape[1]}")
print(f"\nFeature names: {data.feature_names[:5]}... (showing first 5)")
print(f"\nClass distribution:")
print(f"  Malignant (0): {np.sum(y_cancer == 0)}")
print(f"  Benign (1): {np.sum(y_cancer == 1)}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

# Feature scaling is crucial!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Add intercept
X_train_final = np.column_stack([np.ones(len(X_train_scaled)), X_train_scaled])
X_test_final = np.column_stack([np.ones(len(X_test_scaled)), X_test_scaled])

print(f"\nTraining set: {X_train_final.shape}")
print(f"Test set: {X_test_final.shape}")
# Train logistic regression
theta_init_cancer = np.zeros(X_train_final.shape[1])

print("Training on Breast Cancer Dataset...\n")
theta_cancer, cost_history_cancer = gradient_descent_logistic(
    X_train_final, y_train, theta_init_cancer, alpha=0.5, num_iters=2000
)

# Predictions
y_train_pred = predict_logistic(X_train_final, theta_cancer)
y_test_pred = predict_logistic(X_test_final, theta_cancer)

# Evaluation
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)

print("\n" + "="*50)
print("RESULTS")
print("="*50)
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Test Precision: {test_precision:.4f}")
print(f"Test Recall: {test_recall:.4f}")
print(f"Test F1-Score: {test_f1:.4f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_test_pred)
print(f"\nConfusion Matrix:")
print(cm)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Malignant', 'Benign'],
            yticklabels=['Malignant', 'Benign'])
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.title('Confusion Matrix - Breast Cancer Classification', fontsize=14, fontweight='bold')
plt.show()

2.6 ROC Curve and AUC

The ROC (Receiver Operating Characteristic) curve shows the trade-off between true positive rate and false positive rate at different classification thresholds.

# Get probability predictions
y_test_proba = sigmoid(X_test_final @ theta_cancer)

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_test_proba)
auc = roc_auc_score(y_test, y_test_proba)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve - Breast Cancer Classification', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

print(f"AUC Score: {auc:.4f}")
print(f"\nInterpretation: AUC = {auc:.3f} indicates {'excellent' if auc > 0.9 else 'good' if auc > 0.8 else 'fair'} classification performance.")

2.7 Multi-class Classification: One-vs-All

For problems with more than 2 classes, we train \(K\) binary classifiers (one for each class):

  • Classifier \(k\): predict class \(k\) vs all other classes

  • Final prediction: choose class with highest probability

class OneVsAllClassifier:
    """
    One-vs-All (OvA) multi-class classifier.
    """
    def __init__(self, alpha=0.1, num_iters=1000):
        self.alpha = alpha
        self.num_iters = num_iters
        self.thetas = None
        self.num_classes = None
    
    def fit(self, X, y):
        """
        Train K binary classifiers.
        """
        m, n = X.shape
        self.num_classes = len(np.unique(y))
        self.thetas = []
        
        # Add intercept
        X_with_intercept = np.column_stack([np.ones(m), X])
        
        print(f"Training {self.num_classes} binary classifiers...\n")
        
        for k in range(self.num_classes):
            # Binary labels: 1 if class k, 0 otherwise
            y_binary = (y == k).astype(int)
            
            # Initialize and train
            theta_init = np.zeros(n + 1)
            theta, _ = gradient_descent_logistic(
                X_with_intercept, y_binary, theta_init, 
                self.alpha, self.num_iters
            )
            self.thetas.append(theta)
            print(f"Classifier for class {k} trained.")
        
        print("\nAll classifiers trained!")
    
    def predict_proba(self, X):
        """
        Predict probabilities for each class.
        """
        m = X.shape[0]
        X_with_intercept = np.column_stack([np.ones(m), X])
        
        probas = np.zeros((m, self.num_classes))
        for k in range(self.num_classes):
            probas[:, k] = sigmoid(X_with_intercept @ self.thetas[k])
        
        return probas
    
    def predict(self, X):
        """
        Predict class with highest probability.
        """
        probas = self.predict_proba(X)
        return np.argmax(probas, axis=1)

# Load iris dataset (3 classes)
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

print("Iris Dataset")
print(f"Samples: {X_iris.shape[0]}")
print(f"Features: {X_iris.shape[1]}")
print(f"Classes: {len(np.unique(y_iris))} ({iris.target_names})")

# Split and scale
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(X_train_iris)
X_test_iris_scaled = scaler_iris.transform(X_test_iris)

# Train OvA classifier
ova_clf = OneVsAllClassifier(alpha=0.5, num_iters=1000)
ova_clf.fit(X_train_iris_scaled, y_train_iris)
# Predictions
y_train_pred_iris = ova_clf.predict(X_train_iris_scaled)
y_test_pred_iris = ova_clf.predict(X_test_iris_scaled)

# Evaluation
train_acc_iris = accuracy_score(y_train_iris, y_train_pred_iris)
test_acc_iris = accuracy_score(y_test_iris, y_test_pred_iris)

print("\n" + "="*50)
print("MULTI-CLASS CLASSIFICATION RESULTS")
print("="*50)
print(f"Training Accuracy: {train_acc_iris:.4f}")
print(f"Test Accuracy: {test_acc_iris:.4f}")

# Classification report
print("\nClassification Report:")
print(classification_report(y_test_iris, y_test_pred_iris, 
                          target_names=iris.target_names))

# Confusion matrix
cm_iris = confusion_matrix(y_test_iris, y_test_pred_iris)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_iris, annot=True, fmt='d', cmap='Greens',
            xticklabels=iris.target_names,
            yticklabels=iris.target_names)
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.title('Confusion Matrix - Iris Classification (One-vs-All)', 
          fontsize=14, fontweight='bold')
plt.show()

2.8 Regularized Logistic Regression

To prevent overfitting, we add a regularization term:

\[J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2\]

Note: We don’t regularize \(\theta_0\) (intercept).

def compute_cost_regularized(X, y, theta, lambda_reg):
    """
    Compute regularized cost for logistic regression.
    """
    m = len(y)
    h = sigmoid(X @ theta)
    
    epsilon = 1e-10
    h = np.clip(h, epsilon, 1 - epsilon)
    
    # Cross-entropy cost
    cost = -(1/m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    
    # Regularization term (excluding theta[0])
    reg_term = (lambda_reg / (2 * m)) * np.sum(theta[1:] ** 2)
    
    return cost + reg_term

def gradient_descent_regularized(X, y, theta, alpha, lambda_reg, num_iters):
    """
    Gradient descent with L2 regularization.
    """
    m = len(y)
    cost_history = []
    
    for i in range(num_iters):
        h = sigmoid(X @ theta)
        gradient = (1/m) * X.T @ (h - y)
        
        # Add regularization to gradient (excluding theta[0])
        gradient[1:] += (lambda_reg / m) * theta[1:]
        
        theta = theta - alpha * gradient
        
        cost = compute_cost_regularized(X, y, theta, lambda_reg)
        cost_history.append(cost)
    
    return theta, cost_history

# Compare different regularization strengths
lambdas = [0, 0.01, 0.1, 1.0, 10.0]
results = []

print("Training with different regularization strengths...\n")

for lam in lambdas:
    theta_init = np.zeros(X_train_final.shape[1])
    theta_reg, _ = gradient_descent_regularized(
        X_train_final, y_train, theta_init, 
        alpha=0.5, lambda_reg=lam, num_iters=1000
    )
    
    y_test_pred_reg = predict_logistic(X_test_final, theta_reg)
    test_acc_reg = accuracy_score(y_test, y_test_pred_reg)
    
    results.append({
        'lambda': lam,
        'accuracy': test_acc_reg,
        'theta_norm': np.linalg.norm(theta_reg[1:])
    })
    
    print(f"λ = {lam:6.2f}: Test Accuracy = {test_acc_reg:.4f}, ||θ|| = {np.linalg.norm(theta_reg[1:]):.4f}")

# Visualize effect of regularization
results_df = pd.DataFrame(results)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].semilogx(results_df['lambda'].replace(0, 1e-10), results_df['accuracy'], 
                 marker='o', linewidth=2, markersize=8)
axes[0].set_xlabel('Regularization Parameter (λ)', fontsize=12)
axes[0].set_ylabel('Test Accuracy', fontsize=12)
axes[0].set_title('Regularization vs Accuracy', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

axes[1].loglog(results_df['lambda'].replace(0, 1e-10), results_df['theta_norm'], 
               marker='s', linewidth=2, markersize=8, color='orange')
axes[1].set_xlabel('Regularization Parameter (λ)', fontsize=12)
axes[1].set_ylabel('Parameter Norm (||θ||)', fontsize=12)
axes[1].set_title('Regularization Effect on Parameters', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
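For comparison, sklearn's `LogisticRegression` expresses L2 strength through an inverse parameter `C`; under the cost above the correspondence is roughly `C = 1/λ` (rescaling the objective by a positive constant does not change the minimizer, and sklearn likewise leaves the intercept unpenalized). A small sketch on a synthetic dataset, assuming the standard sklearn API:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

# Weak regularization (lambda = 1 -> C = 1) vs strong (lambda = 100 -> C = 0.01)
clf_weak = LogisticRegression(C=1.0, solver='lbfgs', max_iter=1000).fit(X, y)
clf_strong = LogisticRegression(C=0.01, solver='lbfgs', max_iter=1000).fit(X, y)

# Larger lambda (smaller C) shrinks the weight vector, as in the plot above
print(np.linalg.norm(clf_weak.coef_), np.linalg.norm(clf_strong.coef_))
```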

Key Takeaways

1. Sigmoid Function

  • Maps any real number to (0, 1)

  • Interprets output as probability: \(P(y=1|x; \theta)\)

  • Derivative has nice form: \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\)

2. Decision Boundary

  • Linear decision boundary: \(\theta^T x = 0\)

  • Can create non-linear boundaries with polynomial features

  • Visualizing helps understand model behavior

3. Cross-Entropy Loss

  • Penalizes confident wrong predictions heavily

  • Convex function → guaranteed to find global minimum

  • Gradient has same form as linear regression (but different \(h_\theta\))

4. Multi-class Classification

  • One-vs-All: Train K binary classifiers

  • Predict class with highest probability

  • Works well in practice, easy to implement

5. Regularization

  • Prevents overfitting by penalizing large weights

  • L2 regularization shrinks parameters toward zero

  • Choose λ using cross-validation

6. Evaluation Metrics

  • Accuracy: Not sufficient for imbalanced data

  • Precision, Recall, F1: Better for imbalanced classes

  • ROC-AUC: Threshold-independent metric

  • Confusion matrix: Shows detailed error breakdown

7. Practical Considerations

  • Feature scaling is crucial for gradient descent

  • Start with learning rate α ≈ 0.1-1.0

  • Monitor cost function to ensure convergence

  • Use stratified split for imbalanced data

2.9 Newton’s Method for Logistic Regression

From CS229 Lecture 3:

“Gradient ascent is a good algorithm… but it takes a baby step, takes a baby step, takes a lot of iterations. There’s another algorithm called Newton’s method which allows you to take much bigger jumps. You might need only 10 iterations instead of 100 or 1000 iterations with gradient ascent.” - Andrew Ng

The Idea:

  • Gradient ascent: First-order method (uses gradient)

  • Newton’s method: Second-order method (uses Hessian)

  • Much faster convergence (typically 5-20 iterations)

  • Each iteration is more expensive (must compute Hessian)

Newton’s Method for Finding Roots

Problem: Find θ where f(θ) = 0

Algorithm (1D case): \[\theta^{(t+1)} = \theta^{(t)} - \frac{f(\theta^{(t)})}{f'(\theta^{(t)})}\]

Geometric Intuition:

  1. Start at current θ

  2. Draw tangent line to f at θ

  3. Find where tangent crosses x-axis

  4. That’s your next θ
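The tangent-line recipe above fits in a few lines of code. As an illustrative toy example (not from the lecture), finding the positive root of \(f(\theta) = \theta^2 - 2\), i.e. \(\sqrt{2}\):

```python
def newton_root(f, f_prime, theta0, num_iters=10):
    """Newton's method for root finding in 1D."""
    theta = theta0
    for _ in range(num_iters):
        theta = theta - f(theta) / f_prime(theta)  # jump to tangent's x-intercept
    return theta

# f(theta) = theta^2 - 2 has root sqrt(2); f'(theta) = 2*theta
root = newton_root(lambda t: t**2 - 2, lambda t: 2 * t, theta0=1.0, num_iters=6)
print(root)  # ≈ 1.41421356 after only a handful of iterations
```

Note the quadratic convergence: the number of correct digits roughly doubles each iteration, which is exactly the behavior exploited for optimization below.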

Newton’s Method for Optimization

Goal: Maximize ℓ(θ) (log-likelihood)

Insight: At maximum, ℓ’(θ) = 0

Set: f(θ) = ℓ’(θ), then solve f(θ) = 0

Update Rule (1D): \[\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}\]

Multi-dimensional (vectorized): \[\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)\]

Where:

  • \(\nabla_\theta \ell(\theta)\) = gradient (n × 1 vector)

  • \(H\) = Hessian matrix (n × n matrix)

  • \(H_{ij} = \frac{\partial^2 \ell}{\partial \theta_i \partial \theta_j}\)

For Logistic Regression:

Gradient: \[\nabla_\theta \ell(\theta) = X^T(y - h_\theta(X))\]

Hessian: \[H = -X^T D X\]

where \(D\) is the diagonal matrix with \(D_{ii} = h_\theta(x^{(i)})(1 - h_\theta(x^{(i)}))\)

Update: \[\theta := \theta - H^{-1} \nabla_\theta \ell(\theta) = \theta + (X^T D X)^{-1} X^T\big(y - h_\theta(X)\big)\]

def newtons_method_logistic(X, y, max_iter=20, tol=1e-6):
    """
    Newton's Method for Logistic Regression
    
    From CS229: Much faster convergence than gradient descent,
    but each iteration requires computing and inverting Hessian.
    
    Args:
        X: Feature matrix with intercept (m x n)
        y: Target vector (m,)
        max_iter: Maximum iterations (default 20)
        tol: Convergence tolerance
    
    Returns:
        theta: Optimized parameters
        log_likelihood_history: Log-likelihood at each iteration
    """
    m, n = X.shape
    theta = np.zeros(n)
    log_likelihood_history = []
    
    print("🔄 Newton's Method for Logistic Regression")
    print("="*70)
    
    for iteration in range(max_iter):
        # Compute hypothesis
        h = sigmoid(X @ theta)
        
        # Compute log-likelihood
        epsilon = 1e-10
        h_clipped = np.clip(h, epsilon, 1 - epsilon)
        log_likelihood = np.sum(y * np.log(h_clipped) + (1 - y) * np.log(1 - h_clipped))
        log_likelihood_history.append(log_likelihood)
        
        # Compute gradient
        gradient = X.T @ (h - y)
        
        # Hessian of the NEGATIVE log-likelihood: H = X^T D X
        # where D is diagonal with D_ii = h_i(1 - h_i)
        D = np.diag(h * (1 - h))
        H = X.T @ D @ X
        
        # Newton's update: θ := θ - H^(-1) X^T(h - y).
        # gradient above is X^T(h - y), the gradient of the negative
        # log-likelihood, so we subtract the solve result.
        try:
            theta_update = np.linalg.solve(H, gradient)
            theta = theta - theta_update
            
            # Check convergence
            update_norm = np.linalg.norm(theta_update)
            
            if iteration % 5 == 0 or iteration == max_iter - 1:
                print(f"Iter {iteration:2d}: Log-likelihood = {log_likelihood:10.4f}, "
                      f"||Δθ|| = {update_norm:.6f}")
            
            if update_norm < tol:
                print(f"\n✅ Converged after {iteration + 1} iterations!")
                break
                
        except np.linalg.LinAlgError:
            print(f"⚠️  Warning: Hessian is singular at iteration {iteration}")
            break
    
    return theta, np.array(log_likelihood_history)

# Generate sample data for demonstration
np.random.seed(42)
X_demo = np.random.randn(100, 2)
true_theta = np.array([0.5, 1.5, -1.0])
X_demo_b = np.c_[np.ones(100), X_demo]
probabilities = sigmoid(X_demo_b @ true_theta)
y_demo = (probabilities > 0.5).astype(int)

print("\n📊 Dataset for Newton's Method Demo")
print(f"Samples: {len(X_demo)}")
print(f"Features: {X_demo.shape[1]}")
print(f"True θ: {true_theta}")
print(f"Class distribution: {np.bincount(y_demo)}\n")

# Train with Newton's Method
theta_newton, ll_history_newton = newtons_method_logistic(X_demo_b, y_demo, max_iter=20)

print(f"\n🎯 Final Parameters (Newton's Method):")
print(f"   θ = {theta_newton}")
print(f"   True θ = {true_theta}")
print(f"   Difference: {np.linalg.norm(theta_newton - true_theta):.6f}")

# Compare with Gradient Ascent
print("\n" + "="*70)
print("📊 Comparison: Gradient Ascent vs Newton's Method")
print("="*70)

def gradient_ascent_comparison(X, y, alpha=0.1, max_iter=100):
    """Gradient ascent for comparison"""
    m, n = X.shape
    theta = np.zeros(n)
    ll_history = []
    
    for i in range(max_iter):
        h = sigmoid(X @ theta)
        epsilon = 1e-10
        h_clipped = np.clip(h, epsilon, 1 - epsilon)
        ll = np.sum(y * np.log(h_clipped) + (1 - y) * np.log(1 - h_clipped))
        ll_history.append(ll)
        
        gradient = X.T @ (y - h)  # Note: y - h for ascent
        theta = theta + alpha * gradient
    
    return theta, np.array(ll_history)

theta_gd, ll_history_gd = gradient_ascent_comparison(X_demo_b, y_demo, alpha=0.1, max_iter=100)

print(f"\n🔄 Gradient Ascent (100 iterations):")
print(f"   Final log-likelihood: {ll_history_gd[-1]:.4f}")
print(f"   θ = {theta_gd}")

print(f"\n⚡ Newton's Method ({len(ll_history_newton)} iterations):")
print(f"   Final log-likelihood: {ll_history_newton[-1]:.4f}")
print(f"   θ = {theta_newton}")

print(f"\n💡 Speedup: Newton's method converged in {len(ll_history_newton)} iterations")
print(f"   vs {100} iterations for gradient ascent (>5x faster!)")

# Visualize convergence
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Log-likelihood comparison
axes[0].plot(ll_history_gd, label='Gradient Ascent (α=0.1)', linewidth=2, alpha=0.8)
axes[0].plot(ll_history_newton, 'o-', label="Newton's Method", linewidth=2, 
             markersize=8, alpha=0.8)
axes[0].set_xlabel('Iteration', fontsize=12)
axes[0].set_ylabel('Log-Likelihood', fontsize=12)
axes[0].set_title('Convergence Comparison', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Log scale to see Newton's convergence better
axes[1].semilogy(np.abs(ll_history_gd - ll_history_gd[-1]) + 1e-10, 
                label='Gradient Ascent', linewidth=2, alpha=0.8)
axes[1].semilogy(np.arange(len(ll_history_newton)),
                np.abs(ll_history_newton - ll_history_newton[-1]) + 1e-10, 
                'o-', label="Newton's Method", linewidth=2, markersize=8, alpha=0.8)
axes[1].set_xlabel('Iteration', fontsize=12)
axes[1].set_ylabel('Distance from Optimum (log scale)', fontsize=12)
axes[1].set_title('Convergence Rate (Exponential for Newton)', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

When to Use Newton’s Method vs Gradient Ascent

From the lecture: “Each iteration is more expensive, but you need far fewer iterations.”

| Aspect | Gradient Ascent | Newton's Method |
|---|---|---|
| Iterations needed | 100-1000+ | 5-20 typically |
| Per-iteration cost | O(mn) | O(mn² + n³) (build and invert Hessian) |
| Total cost | O(mn × iterations) | O((mn² + n³) × iterations) |
| When to use | Large n (n > 10,000) | Small to medium n (n < 10,000) |
| Convergence rate | Linear | Quadratic |
| Scalability | Very scalable | Limited by Hessian size |

Practical Guidelines:

Use Newton’s Method when:

  • Number of features n < 10,000

  • Need fast convergence (few iterations)

  • Can afford matrix operations

  • Working with well-conditioned problems

Use Gradient Ascent when:

  • Large number of features (n > 10,000)

  • Very large datasets (can use mini-batches)

  • Limited memory

  • Online learning scenarios

From Andrew Ng: “For logistic regression with n around 100-1000 features, Newton’s method is often the preferred choice because it converges so much faster.”

Advanced: Quasi-Newton Methods

BFGS and L-BFGS:

  • Approximate Hessian instead of computing exactly

  • O(n²) storage instead of O(n³) computation

  • Good middle ground between gradient descent and Newton’s method

  • Used in sklearn’s LogisticRegression with solver='lbfgs'
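A minimal sketch of that sklearn usage (`lbfgs` is the default solver in recent versions; `max_iter` is raised as a precaution, and scaling is applied as recommended throughout this lecture):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale features, then fit logistic regression with the L-BFGS solver
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(solver='lbfgs', max_iter=1000))
clf.fit(X, y)
print(f"Training accuracy: {clf.score(X, y):.4f}")
```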

Practice Exercises

  1. Sigmoid Properties: Prove mathematically that \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) and \(1 - \sigma(z) = \sigma(-z)\).

  2. Newton’s Method: Implement Newton’s method for logistic regression: \(\theta := \theta - H^{-1} \nabla J(\theta)\) where \(H\) is the Hessian. Compare convergence speed with gradient descent.

  3. Non-linear Boundaries: Add polynomial features (e.g., \(x_1^2, x_2^2, x_1 x_2\)) to create non-linear decision boundaries. Visualize the result.

  4. Imbalanced Classes: Create a dataset with 95% class 0, 5% class 1. Train standard and weighted logistic regression. Compare F1-scores.

  5. Softmax Regression: Implement softmax regression (multi-class logistic regression) from scratch. Test on MNIST digits.

  6. Feature Selection: Use L1 regularization (Lasso) to perform feature selection. Which features are most important?

  7. Calibration: Plot calibration curve for your classifier. Is it well-calibrated?

  8. Learning Curves: Plot training and validation accuracy vs training set size. Diagnose bias/variance.

  9. Threshold Tuning: Find optimal classification threshold to maximize F1-score on validation set.

  10. Comparison: Compare your implementation with sklearn.linear_model.LogisticRegression. Are results similar?

References

  1. CS229 Lecture Notes: Andrew Ng’s lecture notes on logistic regression

  2. Pattern Recognition and Machine Learning: Bishop, Chapter 4

  3. The Elements of Statistical Learning: Hastie et al., Chapter 4

  4. scikit-learn Documentation: Logistic Regression implementation details

Next: Lecture 3: Regularization