Lecture 6-7: Support Vector Machines

📹 Watch Lecture 6 | Watch Lecture 7

From Andrew Ng’s CS229 Lectures 6-7

“Support vector machine is one of my favorite algorithms because it’s very turnkey. If you have a classification problem, you just run it and it more or less works.” - Andrew Ng

The Optimal Margin Classifier

Key Idea: Among all decision boundaries that separate the data, choose the one with the largest margin (distance to nearest point).

Why Maximum Margin?

  • Better generalization

  • More robust to noise

  • Unique solution

The Representer Theorem

From Lecture 7:

“We’ll work in potentially very high-dimensional, like 100,000 dimensional, or a million dimensional, or 100 billion dimensional, or even infinite-dimensional feature spaces.” - Andrew Ng

Key Insight: The optimal w can be represented as: \(w = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}\)

Implications:

  • w is a linear combination of training examples

  • Even in infinite dimensions, only need to store m coefficients (α’s)

  • This enables the kernel trick!
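The implication can be checked directly: prediction needs only kernel evaluations against the training set, never an explicit w. A small sketch (variable names are ours, not from the lecture):

```python
import numpy as np

def predict(alpha, y_train, X_train, b, K, x):
    """Kernelized decision function: f(x) = sum_i alpha_i y_i K(x_i, x) + b.
    w = sum_i alpha_i y_i phi(x_i) is never formed explicitly."""
    return sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y_train, X_train)) + b

# With a linear kernel this reduces to w^T x + b:
K_lin = lambda u, v: u @ v
X_train = np.array([[1.0, 0.0], [0.0, 1.0]])
y_train = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
w = (alpha * y_train) @ X_train          # explicit w, for comparison only
x = np.array([2.0, 3.0])
f_kernel = predict(alpha, y_train, X_train, 0.0, K_lin, x)
print(np.isclose(f_kernel, w @ x))       # the two forms agree
```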

The Kernel Trick

From the lecture: “Kernels are the mechanism for working in these incredibly high dimensional feature spaces”

Basic idea:

  1. Want to work in high-dimensional feature space φ(x)

  2. But computing φ(x) and storing w is expensive

  3. Solution: Never compute φ(x) explicitly!

  4. Instead, use kernel function K(x, z) = ⟨φ(x), φ(z)⟩

Example: Polynomial kernel of degree d

  • Feature space: All monomials up to degree d

  • Dimension: Exponential in d!

  • But K(x, z) = (xᵀz + 1)^d computable in O(n) time
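This is easy to verify for d = 2 and 2-D inputs, where the explicit feature map has six monomial features but the kernel needs only one inner product:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input (6 features)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = (x @ z + 1) ** 2        # kernel: O(n) in the input dimension
rhs = phi(x) @ phi(z)         # explicit map: dimension grows as O(n^d)
print(np.isclose(lhs, rhs))   # True
```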

Common Kernels

Linear: K(x, z) = xᵀz

  • Equivalent to no feature mapping

  • Good for high-dimensional data with linear patterns

Polynomial: K(x, z) = (xᵀz + c)^d

  • Degree d polynomial features

  • c is constant term

RBF (Gaussian): K(x, z) = exp(-γ||x - z||²)

  • Infinite dimensional feature space!

  • Most popular kernel

  • γ controls width of Gaussian
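For reference, the three kernels above are one-liners in NumPy (a minimal sketch; function names are ours):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def poly_kernel(x, z, c=1.0, d=3):
    return (x @ z + c) ** d

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(linear_kernel(x, z))           # 2.0
print(poly_kernel(x, z, c=1, d=2))   # 9.0
print(rbf_kernel(x, x))              # 1.0 -- a point always has K = 1 with itself
```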

Setup

We import NumPy, Matplotlib, and scikit-learn’s SVM implementation (SVC). Support vector machines involve solving a quadratic programming problem to find the maximum-margin hyperplane, and scikit-learn provides efficient solvers via libsvm. We also load visualization utilities for plotting decision boundaries and kernel transformations. Understanding SVMs requires seeing how different kernels and hyperparameters (\(C\), \(\gamma\)) reshape the decision boundary, so rich plotting support is essential for this notebook.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_circles, make_moons, load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from scipy.optimize import minimize
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded successfully!")

5.1 Understanding Margins

The margin is the distance from the decision boundary to the nearest data point. SVMs find the boundary that maximizes this margin.

Intuition: Larger margin → more confidence → better generalization

# Generate linearly separable data
np.random.seed(42)
n_samples = 100

# Class 0: centered at (-1.5, -1.5)
X0 = np.random.randn(n_samples//2, 2) * 0.5 + np.array([-1.5, -1.5])
y0 = np.zeros(n_samples//2)

# Class 1: centered at (1.5, 1.5)
X1 = np.random.randn(n_samples//2, 2) * 0.5 + np.array([1.5, 1.5])
y1 = np.ones(n_samples//2)

X_linearsep = np.vstack([X0, X1])
y_linearsep = np.hstack([y0, y1])
y_linearsep_signed = 2 * y_linearsep - 1  # Convert to {-1, +1}

# Train SVM
svm = SVC(kernel='linear', C=1000)  # Large C for hard margin
svm.fit(X_linearsep, y_linearsep)

# Get support vectors
sv = svm.support_vectors_

# Plot
fig, ax = plt.subplots(figsize=(12, 8))

# Create mesh
x_min, x_max = X_linearsep[:, 0].min() - 1, X_linearsep[:, 0].max() + 1
y_min, y_max = X_linearsep[:, 1].min() - 1, X_linearsep[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))

# Decision function
Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot decision boundary and margins
ax.contour(xx, yy, Z, levels=[-1, 0, 1], linestyles=['--', '-', '--'],
          colors=['blue', 'black', 'red'], linewidths=[2, 3, 2])

# Scatter plot
ax.scatter(X0[:, 0], X0[:, 1], c='blue', marker='o', s=100, 
          alpha=0.6, edgecolors='k', label='Class 0')
ax.scatter(X1[:, 0], X1[:, 1], c='red', marker='s', s=100, 
          alpha=0.6, edgecolors='k', label='Class 1')

# Highlight support vectors
ax.scatter(sv[:, 0], sv[:, 1], s=300, linewidth=2, 
          facecolors='none', edgecolors='green', 
          label='Support Vectors')

ax.set_xlabel('Feature 1', fontsize=12)
ax.set_ylabel('Feature 2', fontsize=12)
ax.set_title('SVM: Maximum Margin Classifier\n(Black = Decision Boundary, Dashed = Margin)', 
            fontsize=14, fontweight='bold')
ax.legend(fontsize=11, loc='upper left')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Number of support vectors: {len(sv)}")
print(f"Total training points: {len(X_linearsep)}")
print(f"Support vector ratio: {len(sv)/len(X_linearsep):.2%}")
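For a linear SVM the margin width is \(2/||w||\), and w can be read off the fitted model. A self-contained sketch (data regenerated here with local names so the snippet runs on its own):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian blobs, as above
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 0.5, (50, 2)), rng.normal(1.5, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel='linear', C=1000).fit(X, y)   # large C ~ hard margin
w = clf.coef_[0]
margin = 2 / np.linalg.norm(w)                  # distance between the two dashed lines
print(f"||w|| = {np.linalg.norm(w):.3f}, margin width = {margin:.3f}")
```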

5.2 Linear SVM Implementation

Let’s implement a simple linear SVM using the dual formulation and quadratic programming.

from scipy.optimize import minimize

class LinearSVM:
    """
    Linear SVM using dual formulation.
    """
    def __init__(self, C=1.0):
        self.C = C
        self.alpha = None
        self.w = None
        self.b = None
        self.support_vectors = None
    
    def fit(self, X, y):
        """
        Train SVM using dual formulation.
        y should be in {-1, +1}
        """
        m, n = X.shape
        
        # Gram matrix: K[i,j] = <x^(i), x^(j)>
        K = X @ X.T
        
        # Dual objective: -[sum(alpha_i) - 0.5 * sum_ij(alpha_i * alpha_j * y_i * y_j * K_ij)]
        def objective(alpha):
            return 0.5 * np.sum((alpha * y) @ K * (alpha * y)) - np.sum(alpha)
        
        # Gradient
        def gradient(alpha):
            return (alpha * y) @ K * y - 1
        
        # Constraints: sum(alpha_i * y_i) = 0
        constraints = {'type': 'eq', 'fun': lambda alpha: np.dot(alpha, y)}
        
        # Bounds: 0 <= alpha_i <= C
        bounds = [(0, self.C) for _ in range(m)]
        
        # Initial guess
        alpha0 = np.zeros(m)
        
        # Solve
        result = minimize(objective, alpha0, method='SLSQP', 
                        jac=gradient, bounds=bounds, constraints=constraints)
        
        self.alpha = result.x
        
        # Support vectors (alpha > threshold)
        sv_idx = self.alpha > 1e-5
        self.support_vectors = X[sv_idx]
        
        # Compute w
        self.w = np.sum((self.alpha * y)[:, np.newaxis] * X, axis=0)
        
        # Compute b from free support vectors (0 < alpha < C), which lie
        # exactly on the margin: y_i (w^T x_i + b) = 1  =>  b = y_i - w^T x_i
        free_idx = (self.alpha > 1e-5) & (self.alpha < self.C - 1e-5)
        if not np.any(free_idx):
            free_idx = sv_idx  # all alphas at the bound: fall back to all SVs
        self.b = np.mean(y[free_idx] - X[free_idx] @ self.w)
        
        return self
    
    def decision_function(self, X):
        """
        Compute decision function: w^T x + b
        """
        return X @ self.w + self.b
    
    def predict(self, X):
        """
        Predict class labels.
        """
        return np.sign(self.decision_function(X))

# Train our implementation
my_svm = LinearSVM(C=1.0)
my_svm.fit(X_linearsep, y_linearsep_signed)

# Predictions
y_pred_mine = my_svm.predict(X_linearsep)
accuracy_mine = np.mean(y_pred_mine == y_linearsep_signed)

print("Custom Linear SVM:")
print(f"  Accuracy: {accuracy_mine:.4f}")
print(f"  Number of support vectors: {len(my_svm.support_vectors)}")
print(f"  w: {my_svm.w}")
print(f"  b: {my_svm.b:.4f}")
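As a sanity check, the coefficients above should roughly match scikit-learn's linear SVC at the same C. A standalone sketch on similar synthetic data (regenerated here so the snippet runs on its own):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian blobs, like X_linearsep above
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-1.5, 0.5, (50, 2)), rng.normal(1.5, 0.5, (50, 2))])
y = np.array([-1.0] * 50 + [1.0] * 50)

sk = SVC(kernel='linear', C=1.0).fit(X, y)
print("sklearn w:", sk.coef_[0], " b:", sk.intercept_[0])
# Trained on the same arrays, my_svm.w / my_svm.b should agree to a few decimals.
```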

5.3 Kernel Functions

For non-linearly separable data, we use the kernel trick: \(K(x, z) = \langle \phi(x), \phi(z) \rangle\)

Common Kernels:

  1. Linear: \(K(x, z) = x^T z\)

  2. Polynomial: \(K(x, z) = (x^T z + c)^d\)

  3. RBF (Gaussian): \(K(x, z) = \exp(-\gamma ||x - z||^2)\)

  4. Sigmoid: \(K(x, z) = \tanh(\kappa x^T z + \theta)\)

# Generate non-linearly separable data (concentric circles)
np.random.seed(42)
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=42)
y_circles_signed = 2 * y_circles - 1

# Test different kernels
kernels = ['linear', 'poly', 'rbf']
kernel_names = ['Linear', 'Polynomial (degree=3)', 'RBF (Gaussian)']

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (kernel, name) in enumerate(zip(kernels, kernel_names)):
    # Train SVM
    if kernel == 'poly':
        svm_kernel = SVC(kernel=kernel, degree=3, C=1.0)
    else:
        svm_kernel = SVC(kernel=kernel, C=1.0, gamma='auto')
    
    svm_kernel.fit(X_circles, y_circles)
    
    # Create mesh
    x_min, x_max = X_circles[:, 0].min() - 0.5, X_circles[:, 0].max() + 0.5
    y_min, y_max = X_circles[:, 1].min() - 0.5, X_circles[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
    # Decision function
    Z = svm_kernel.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    axes[idx].contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)
    axes[idx].contour(xx, yy, Z, levels=[0], colors='black', linewidths=3)
    
    # Scatter
    axes[idx].scatter(X_circles[y_circles==0, 0], X_circles[y_circles==0, 1],
                     c='blue', marker='o', s=50, edgecolors='k', alpha=0.7)
    axes[idx].scatter(X_circles[y_circles==1, 0], X_circles[y_circles==1, 1],
                     c='red', marker='s', s=50, edgecolors='k', alpha=0.7)
    
    # Support vectors
    sv = svm_kernel.support_vectors_
    axes[idx].scatter(sv[:, 0], sv[:, 1], s=200, linewidth=2,
                     facecolors='none', edgecolors='green')
    
    # Accuracy
    acc = accuracy_score(y_circles, svm_kernel.predict(X_circles))
    
    axes[idx].set_xlabel('Feature 1', fontsize=11)
    axes[idx].set_ylabel('Feature 2', fontsize=11)
    axes[idx].set_title(f'{name}\nAccuracy: {acc:.3f}, SVs: {len(sv)}',
                       fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nObservations:")
print("- Linear kernel fails (data not linearly separable)")
print("- Polynomial and RBF kernels succeed")
print("- RBF most commonly used in practice")

5.4 RBF Kernel Deep Dive

The RBF (Radial Basis Function) kernel is: \(K(x, z) = \exp\left(-\gamma ||x - z||^2\right)\)

where \(\gamma = \frac{1}{2\sigma^2}\) controls the “reach” of each training example:

  • Small γ (large σ): Smooth boundary, far-reaching influence

  • Large γ (small σ): Complex boundary, local influence
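The locality claim is easy to verify numerically: at a fixed distance d between points, the kernel value \(e^{-\gamma d^2}\) drops sharply as γ grows. A quick check:

```python
import numpy as np

# K(x, z) = exp(-gamma * ||x - z||^2): each example's influence decays with distance
for gamma in (0.1, 1.0, 10.0):
    dists = (0.5, 1.0, 2.0)
    vals = [np.exp(-gamma * d**2) for d in dists]
    print(f"gamma={gamma:>4}: " +
          "  ".join(f"d={d}: {v:.2e}" for d, v in zip(dists, vals)))
```

With γ = 0.1, a point two units away still contributes K ≈ 0.67; with γ = 10 its contribution is essentially zero, so the boundary is shaped only by nearby points.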

# Generate moons dataset
X_moons, y_moons = make_moons(n_samples=200, noise=0.15, random_state=42)

# Try different gamma values
gammas = [0.1, 1.0, 10.0]

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, gamma in enumerate(gammas):
    # Train SVM with RBF kernel
    svm_rbf = SVC(kernel='rbf', gamma=gamma, C=1.0)
    svm_rbf.fit(X_moons, y_moons)
    
    # Create mesh
    x_min, x_max = X_moons[:, 0].min() - 0.5, X_moons[:, 0].max() + 0.5
    y_min, y_max = X_moons[:, 1].min() - 0.5, X_moons[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
    Z = svm_rbf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    axes[idx].contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)
    axes[idx].contour(xx, yy, Z, levels=[0], colors='black', linewidths=3)
    
    axes[idx].scatter(X_moons[y_moons==0, 0], X_moons[y_moons==0, 1],
                     c='blue', marker='o', s=50, edgecolors='k', alpha=0.7)
    axes[idx].scatter(X_moons[y_moons==1, 0], X_moons[y_moons==1, 1],
                     c='red', marker='s', s=50, edgecolors='k', alpha=0.7)
    
    sv = svm_rbf.support_vectors_
    axes[idx].scatter(sv[:, 0], sv[:, 1], s=200, linewidth=2,
                     facecolors='none', edgecolors='green')
    
    acc = accuracy_score(y_moons, svm_rbf.predict(X_moons))
    
    axes[idx].set_xlabel('Feature 1', fontsize=11)
    axes[idx].set_ylabel('Feature 2', fontsize=11)
    axes[idx].set_title(f'γ = {gamma}\nAcc: {acc:.3f}, SVs: {len(sv)}',
                       fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nGamma (γ) Effect:")
print("- γ = 0.1: Smooth, possibly underfitting")
print("- γ = 1.0: Good balance")
print("- γ = 10.0: Complex, possibly overfitting")

5.5 Soft Margin SVM (C Parameter)

For non-separable data, we allow some mistakes with slack variables \(\xi_i\):

\[\min_{w,b,\xi} \frac{1}{2}||w||^2 + C\sum_{i=1}^m \xi_i\]
\[\text{s.t. } y^{(i)}(w^T x^{(i)} + b) \geq 1 - \xi_i, \quad \xi_i \geq 0\]

C parameter controls trade-off:

  • Large C: Approaches a hard margin (violations penalized heavily; risk of overfitting)

  • Small C: Softer margin (violations tolerated; smoother boundary, risk of underfitting)
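Eliminating the slack variables (at the optimum, \(\xi_i = \max(0,\ 1 - y^{(i)}(w^T x^{(i)} + b))\)) gives the equivalent unconstrained hinge-loss form:

\[\min_{w,b} \frac{1}{2}||w||^2 + C\sum_{i=1}^m \max\left(0,\ 1 - y^{(i)}(w^T x^{(i)} + b)\right)\]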

# Generate slightly overlapping data
np.random.seed(42)
X_overlap, y_overlap = make_classification(n_samples=200, n_features=2, 
                                           n_redundant=0, n_informative=2,
                                           n_clusters_per_class=1, 
                                           class_sep=0.8, random_state=42)

# Try different C values
C_values = [0.01, 1.0, 100.0]

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, C in enumerate(C_values):
    svm_c = SVC(kernel='rbf', C=C, gamma='auto')
    svm_c.fit(X_overlap, y_overlap)
    
    # Create mesh
    x_min, x_max = X_overlap[:, 0].min() - 1, X_overlap[:, 0].max() + 1
    y_min, y_max = X_overlap[:, 1].min() - 1, X_overlap[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
    Z = svm_c.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axes[idx].contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)
    axes[idx].contour(xx, yy, Z, levels=[0], colors='black', linewidths=3)
    
    axes[idx].scatter(X_overlap[y_overlap==0, 0], X_overlap[y_overlap==0, 1],
                     c='blue', marker='o', s=50, edgecolors='k', alpha=0.7)
    axes[idx].scatter(X_overlap[y_overlap==1, 0], X_overlap[y_overlap==1, 1],
                     c='red', marker='s', s=50, edgecolors='k', alpha=0.7)
    
    sv = svm_c.support_vectors_
    axes[idx].scatter(sv[:, 0], sv[:, 1], s=200, linewidth=2,
                     facecolors='none', edgecolors='green')
    
    acc = accuracy_score(y_overlap, svm_c.predict(X_overlap))
    
    axes[idx].set_xlabel('Feature 1', fontsize=11)
    axes[idx].set_ylabel('Feature 2', fontsize=11)
    axes[idx].set_title(f'C = {C}\nAcc: {acc:.3f}, SVs: {len(sv)}',
                       fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nC Parameter Effect:")
print("- C = 0.01: Very soft margin, many SVs, simpler boundary")
print("- C = 1.0: Balanced")
print("- C = 100: Hard margin, fewer SVs, complex boundary")
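In practice C and γ are tuned jointly by cross-validation rather than eyeballed. A minimal GridSearchCV sketch on the moons data (grid values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# 4 x 4 grid of (C, gamma) pairs, 5-fold cross-validation
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```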

5.6 Multi-class SVMs

SVMs are binary classifiers. For multi-class:

  1. One-vs-One (OvO): Train K(K-1)/2 classifiers

  2. One-vs-All (OvA): Train K classifiers

scikit-learn's SVC always trains OvO internally; decision_function_shape only controls the shape of the reported decision values.

from sklearn.datasets import load_iris

# Load iris (3 classes)
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Split and scale
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(X_train_iris)
X_test_iris_scaled = scaler_iris.transform(X_test_iris)

# Train multi-class SVM
svm_multi = SVC(kernel='rbf', C=1.0, gamma='auto', decision_function_shape='ovo')
svm_multi.fit(X_train_iris_scaled, y_train_iris)

# Predictions
y_pred_iris = svm_multi.predict(X_test_iris_scaled)
acc_iris = accuracy_score(y_test_iris, y_pred_iris)

print("Multi-class SVM (Iris Dataset):")
print(f"  Test Accuracy: {acc_iris:.4f}")
print(f"  Number of classes: {len(iris.target_names)}")
print(f"  Number of binary classifiers (OvO): {len(iris.target_names) * (len(iris.target_names)-1) // 2}")

# Classification report
print("\nClassification Report:")
print(classification_report(y_test_iris, y_pred_iris, target_names=iris.target_names))

# Confusion matrix
cm = confusion_matrix(y_test_iris, y_pred_iris)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
           xticklabels=iris.target_names,
           yticklabels=iris.target_names)
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.title('Multi-class SVM: Iris Classification', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Key Takeaways

1. Margin Maximization

  • SVMs find the maximum-margin hyperplane

  • Larger margin → better generalization

  • Only support vectors affect the decision boundary

2. Kernel Trick

  • Map data to high-dimensional space without computing \(\phi(x)\) explicitly

  • RBF kernel most commonly used: \(K(x,z) = \exp(-\gamma||x-z||^2)\)

  • Enables non-linear decision boundaries

3. Hyperparameters

  • C: Regularization (small = soft margin, large = hard margin)

  • γ: RBF kernel width (small = smooth, large = complex)

  • Use cross-validation to tune both

4. Advantages

  • Effective in high dimensions

  • Memory efficient (only stores support vectors)

  • Versatile (different kernels for different data)

  • Strong theoretical guarantees

5. Disadvantages

  • Slow for large datasets (O(n²) to O(n³))

  • Sensitive to feature scaling

  • Hard to interpret (black box)

  • Choosing right kernel and parameters critical

6. Practical Tips

  • Always scale features (StandardScaler)

  • Start with RBF kernel

  • Try C ∈ [0.1, 1, 10, 100] and γ ∈ [0.001, 0.01, 0.1, 1]

  • For large datasets, use LinearSVC (faster)

  • For text, try linear kernel first

7. When to Use SVMs

  • Medium-sized datasets (100s to 10,000s of samples)

  • High-dimensional data (text, genomics)

  • Clear margin of separation

  • Need probabilistic interpretation → use sklearn’s probability=True
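On the last point: probability=True fits a Platt-scaling sigmoid on top of the decision values using internal cross-validation, so it adds training cost. A minimal sketch:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
clf = SVC(kernel='rbf', gamma=1.0, C=1.0, probability=True).fit(X, y)

proba = clf.predict_proba(X[:3])   # one row per sample, one column per class
print(proba)                       # each row sums to 1
```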

Practice Exercises

  1. Margin Calculation: Given w, b, and data points, calculate geometric margin. Verify support vectors.

  2. Kernel Implementation: Implement polynomial and RBF kernels from scratch. Test on 2D data.

  3. Soft vs Hard Margin: Generate data with outliers. Compare hard margin (large C) vs soft margin (small C). Which is more robust?

  4. Custom Kernel: Design and implement a custom kernel for a specific problem (e.g., string matching).

  5. Scaling Impact: Train SVM with and without feature scaling. Compare performance. Why does scaling matter?

  6. Multi-class Comparison: Implement both One-vs-One and One-vs-All strategies. Compare accuracy and training time.

  7. SVM vs Logistic Regression: Compare on multiple datasets. When does each perform better?

  8. Probability Calibration: Use predict_proba(). Are the probabilities well-calibrated? Plot calibration curve.

  9. Large Dataset: Try SVM on MNIST (60k samples). Use LinearSVC or SGDClassifier for speed. Compare with full SVM.

  10. Imbalanced Classes: Create dataset with 95/5 class split. Use class_weight='balanced'. Compare with standard SVM.

References

  1. CS229 Lecture Notes: Andrew Ng’s notes on SVMs

  2. “A Tutorial on Support Vector Machines”: Burges (1998)

  3. “Statistical Learning Theory”: Vapnik (1998)

  4. The Elements of Statistical Learning: Hastie et al., Chapter 12

  5. “LIBSVM: A Library for Support Vector Machines”: Chang & Lin

  6. “Sequential Minimal Optimization” (SMO): Platt (1998)

Next: Neural Networks - Basics