Lecture 6-7: Support Vector Machines¶
📹 Watch Lecture 6 | Watch Lecture 7
From Andrew Ng’s CS229 Lectures 6-7¶
“Support vector machine is one of my favorite algorithms because it’s very turnkey. If you have a classification problem, you just run it and it more or less works.” - Andrew Ng
The Optimal Margin Classifier¶
Key Idea: Among all decision boundaries that separate the data, choose the one with the largest margin (distance to nearest point).
Why Maximum Margin?
Better generalization
More robust to noise
Unique solution
The Representer Theorem¶
From Lecture 7:
“We’ll work in potentially very high-dimensional, like 100,000 dimensional, or a million dimensional, or 100 billion dimensional, or even infinite-dimensional feature spaces.” - Andrew Ng
Key Insight: The optimal w can be represented as \(w = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}\)
Implications:
w is a linear combination of training examples
Even in infinite dimensions, only need to store m coefficients (α’s)
This enables the kernel trick!
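We can sanity-check the representer form with scikit-learn: for a linear kernel, SVC stores the products \(\alpha_i y^{(i)}\) in dual_coef_, so summing them against the support vectors should reproduce the learned weight vector coef_. A minimal sketch on synthetic blobs (the dataset here is just an illustrative stand-in):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two separable blobs; labels in {0, 1}
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

svm = SVC(kernel='linear', C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector, so
# w = sum_i (alpha_i * y_i) * x^(i) should match coef_
w_from_alphas = svm.dual_coef_ @ svm.support_vectors_
print(np.allclose(w_from_alphas, svm.coef_))  # True
```

Note that the sum runs only over support vectors: every non-support example has \(\alpha_i = 0\) and drops out of the expansion.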
The Kernel Trick¶
From the lecture: “Kernels are the mechanism for working in these incredibly high dimensional feature spaces”
Basic idea:
Want to work in high-dimensional feature space φ(x)
But computing φ(x) and storing w is expensive
Solution: Never compute φ(x) explicitly!
Instead, use kernel function K(x, z) = ⟨φ(x), φ(z)⟩
Example: Polynomial kernel of degree d
Feature space: All monomials up to degree d
Dimension: Exponential in d!
But K(x, z) = (xᵀz + 1)^d is computable in O(n) time
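For intuition, here is a small sketch for d = 2 in two dimensions: an explicit feature map φ (containing the constant, the linear terms, the squares, and the cross term, with √2 scaling so the inner products line up) whose dot product reproduces (xᵀz + 1)², matching the O(n) kernel evaluation:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D input, chosen so that
    # <phi(x), phi(z)> = (x^T z + 1)^2
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)           # inner product in 6-D feature space
kernel   = (x @ z + 1) ** 2          # direct O(n) kernel evaluation
print(np.isclose(explicit, kernel))  # True
```

For general n and d the explicit map has roughly \(\binom{n+d}{d}\) coordinates, while the kernel evaluation stays a single dot product plus a power.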
Common Kernels¶
Linear: K(x, z) = xᵀz
Equivalent to no feature mapping
Good for high-dimensional data with linear patterns
Polynomial: K(x, z) = (xᵀz + c)^d
Degree d polynomial features
c is constant term
RBF (Gaussian): K(x, z) = exp(-γ||x - z||²)
Infinite dimensional feature space!
Most popular kernel
γ controls width of Gaussian
Setup¶
We import NumPy, Matplotlib, and scikit-learn’s SVM implementation (SVC). Support vector machines involve solving a quadratic programming problem to find the maximum-margin hyperplane, and scikit-learn provides efficient solvers via libsvm. We also load visualization utilities for plotting decision boundaries and kernel transformations. Understanding SVMs requires seeing how different kernels and hyperparameters (\(C\), \(\gamma\)) reshape the decision boundary, so rich plotting support is essential for this notebook.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_circles, make_moons, load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from scipy.optimize import minimize
import warnings
warnings.filterwarnings('ignore')
# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)
print("Libraries loaded successfully!")
5.1 Understanding Margins¶
The margin is the distance from the decision boundary to the nearest data point. SVMs find the boundary that maximizes this margin.
Intuition: Larger margin → more confidence → better generalization
# Generate linearly separable data
np.random.seed(42)
n_samples = 100
# Class 0: centered at (-1.5, -1.5)
X0 = np.random.randn(n_samples//2, 2) * 0.5 + np.array([-1.5, -1.5])
y0 = np.zeros(n_samples//2)
# Class 1: centered at (1.5, 1.5)
X1 = np.random.randn(n_samples//2, 2) * 0.5 + np.array([1.5, 1.5])
y1 = np.ones(n_samples//2)
X_linearsep = np.vstack([X0, X1])
y_linearsep = np.hstack([y0, y1])
y_linearsep_signed = 2 * y_linearsep - 1 # Convert to {-1, +1}
# Train SVM
svm = SVC(kernel='linear', C=1000) # Large C for hard margin
svm.fit(X_linearsep, y_linearsep)
# Get support vectors
sv = svm.support_vectors_
# Plot
fig, ax = plt.subplots(figsize=(12, 8))
# Create mesh
x_min, x_max = X_linearsep[:, 0].min() - 1, X_linearsep[:, 0].max() + 1
y_min, y_max = X_linearsep[:, 1].min() - 1, X_linearsep[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
np.linspace(y_min, y_max, 200))
# Decision function
Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot decision boundary and margins
ax.contour(xx, yy, Z, levels=[-1, 0, 1], linestyles=['--', '-', '--'],
colors=['blue', 'black', 'red'], linewidths=[2, 3, 2])
# Scatter plot
ax.scatter(X0[:, 0], X0[:, 1], c='blue', marker='o', s=100,
alpha=0.6, edgecolors='k', label='Class 0')
ax.scatter(X1[:, 0], X1[:, 1], c='red', marker='s', s=100,
alpha=0.6, edgecolors='k', label='Class 1')
# Highlight support vectors
ax.scatter(sv[:, 0], sv[:, 1], s=300, linewidth=2,
facecolors='none', edgecolors='green',
label='Support Vectors')
ax.set_xlabel('Feature 1', fontsize=12)
ax.set_ylabel('Feature 2', fontsize=12)
ax.set_title('SVM: Maximum Margin Classifier\n(Black = Decision Boundary, Dashed = Margin)',
fontsize=14, fontweight='bold')
ax.legend(fontsize=11, loc='upper left')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Number of support vectors: {len(sv)}")
print(f"Total training points: {len(X_linearsep)}")
print(f"Support vector ratio: {len(sv)/len(X_linearsep):.2%}")
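The margin width can also be read directly off the fitted coefficients: for the canonical hyperplane (|wᵀx + b| = 1 on the margin lines), each margin line is at distance 1/‖w‖ from the boundary, so the full width is 2/‖w‖. A self-contained sketch on similar synthetic data (regenerated here, so the exact value may differ from the plot above):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X0 = rng.normal(size=(50, 2)) * 0.5 + np.array([-1.5, -1.5])
X1 = rng.normal(size=(50, 2)) * 0.5 + np.array([1.5, 1.5])
X = np.vstack([X0, X1])
y = np.hstack([np.zeros(50), np.ones(50)])

svm = SVC(kernel='linear', C=1000).fit(X, y)  # large C ~ hard margin

# Margin width = 2 / ||w|| for the canonical hyperplane
w = svm.coef_[0]
print(f"margin width = {2 / np.linalg.norm(w):.4f}")
```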
5.2 Linear SVM Implementation¶
Let’s implement a simple linear SVM using the dual formulation and quadratic programming.
from scipy.optimize import minimize
class LinearSVM:
    """
    Linear SVM trained via the dual formulation.
    """
    def __init__(self, C=1.0):
        self.C = C
        self.alpha = None
        self.w = None
        self.b = None
        self.support_vectors = None

    def fit(self, X, y):
        """
        Train SVM using the dual formulation.
        y should be in {-1, +1}
        """
        m, n = X.shape
        # Gram matrix: K[i,j] = <x^(i), x^(j)>
        K = X @ X.T

        # Minimize the negated dual:
        # 0.5 * sum_ij(alpha_i * alpha_j * y_i * y_j * K_ij) - sum(alpha_i)
        def objective(alpha):
            return 0.5 * np.sum((alpha * y) @ K * (alpha * y)) - np.sum(alpha)

        # Gradient of the negated dual
        def gradient(alpha):
            return (alpha * y) @ K * y - 1

        # Constraint: sum(alpha_i * y_i) = 0
        constraints = {'type': 'eq', 'fun': lambda alpha: np.dot(alpha, y)}
        # Bounds: 0 <= alpha_i <= C
        bounds = [(0, self.C) for _ in range(m)]
        # Initial guess
        alpha0 = np.zeros(m)
        # Solve
        result = minimize(objective, alpha0, method='SLSQP',
                          jac=gradient, bounds=bounds, constraints=constraints)
        self.alpha = result.x

        # Support vectors (alpha > threshold)
        sv_idx = self.alpha > 1e-5
        self.support_vectors = X[sv_idx]
        # Compute w via the representer expansion
        self.w = np.sum((self.alpha * y)[:, np.newaxis] * X, axis=0)
        # Compute b from free support vectors (0 < alpha < C), which lie
        # exactly on the margin; fall back to all SVs if none are free
        free_idx = sv_idx & (self.alpha < self.C - 1e-5)
        if not np.any(free_idx):
            free_idx = sv_idx
        self.b = np.mean(y[free_idx] - X[free_idx] @ self.w)
        return self

    def decision_function(self, X):
        """
        Compute decision function: w^T x + b
        """
        return X @ self.w + self.b

    def predict(self, X):
        """
        Predict class labels.
        """
        return np.sign(self.decision_function(X))
# Train our implementation
my_svm = LinearSVM(C=1.0)
my_svm.fit(X_linearsep, y_linearsep_signed)
# Predictions
y_pred_mine = my_svm.predict(X_linearsep)
accuracy_mine = np.mean(y_pred_mine == y_linearsep_signed)
print("Custom Linear SVM:")
print(f" Accuracy: {accuracy_mine:.4f}")
print(f" Number of support vectors: {len(my_svm.support_vectors)}")
print(f" w: {my_svm.w}")
print(f" b: {my_svm.b:.4f}")
5.3 Kernel Functions¶
For non-linearly separable data, we use the kernel trick: \(K(x, z) = \langle \phi(x), \phi(z) \rangle\)
Common Kernels:
Linear: \(K(x, z) = x^T z\)
Polynomial: \(K(x, z) = (x^T z + c)^d\)
RBF (Gaussian): \(K(x, z) = \exp(-\gamma ||x - z||^2)\)
Sigmoid: \(K(x, z) = \tanh(\kappa x^T z + \theta)\)
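As a sketch, the polynomial and RBF kernels above can be implemented in a few lines and checked against scikit-learn's pairwise kernel functions. One convention to note: sklearn's polynomial_kernel computes \((\gamma x^T z + c_0)^d\), so we set gamma=1 and coef0=c to match the formula above.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = rng.normal(size=(4, 3))

def my_poly(X, Z, c=1.0, d=3):
    # (x^T z + c)^d for all pairs of rows
    return (X @ Z.T + c) ** d

def my_rbf(X, Z, gamma=0.5):
    # ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x^T z, computed for all pairs
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

print(np.allclose(my_poly(X, Z),
                  polynomial_kernel(X, Z, degree=3, gamma=1.0, coef0=1.0)))  # True
print(np.allclose(my_rbf(X, Z), rbf_kernel(X, Z, gamma=0.5)))                # True
```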
# Generate non-linearly separable data (XOR-like)
np.random.seed(42)
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=42)
y_circles_signed = 2 * y_circles - 1
# Test different kernels
kernels = ['linear', 'poly', 'rbf']
kernel_names = ['Linear', 'Polynomial (degree=3)', 'RBF (Gaussian)']
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for idx, (kernel, name) in enumerate(zip(kernels, kernel_names)):
    # Train SVM
    if kernel == 'poly':
        svm_kernel = SVC(kernel=kernel, degree=3, C=1.0)
    else:
        svm_kernel = SVC(kernel=kernel, C=1.0, gamma='auto')
    svm_kernel.fit(X_circles, y_circles)
    # Create mesh
    x_min, x_max = X_circles[:, 0].min() - 0.5, X_circles[:, 0].max() + 0.5
    y_min, y_max = X_circles[:, 1].min() - 0.5, X_circles[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    # Decision function
    Z = svm_kernel.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot
    axes[idx].contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)
    axes[idx].contour(xx, yy, Z, levels=[0], colors='black', linewidths=3)
    # Scatter
    axes[idx].scatter(X_circles[y_circles==0, 0], X_circles[y_circles==0, 1],
                      c='blue', marker='o', s=50, edgecolors='k', alpha=0.7)
    axes[idx].scatter(X_circles[y_circles==1, 0], X_circles[y_circles==1, 1],
                      c='red', marker='s', s=50, edgecolors='k', alpha=0.7)
    # Support vectors
    sv = svm_kernel.support_vectors_
    axes[idx].scatter(sv[:, 0], sv[:, 1], s=200, linewidth=2,
                      facecolors='none', edgecolors='green')
    # Accuracy
    acc = accuracy_score(y_circles, svm_kernel.predict(X_circles))
    axes[idx].set_xlabel('Feature 1', fontsize=11)
    axes[idx].set_ylabel('Feature 2', fontsize=11)
    axes[idx].set_title(f'{name}\nAccuracy: {acc:.3f}, SVs: {len(sv)}',
                        fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nObservations:")
print("- Linear kernel fails (data not linearly separable)")
print("- Polynomial and RBF kernels succeed")
print("- RBF most commonly used in practice")
5.4 RBF Kernel Deep Dive¶
The RBF (Radial Basis Function) kernel is \(K(x, z) = \exp\left(-\gamma ||x - z||^2\right)\)
where \(\gamma = \frac{1}{2\sigma^2}\) controls the “reach” of each training example:
Small γ (large σ): Smooth boundary, far-reaching influence
Large γ (small σ): Complex boundary, local influence
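A small numerical sketch of this "reach": the kernel value between two points at a few fixed distances, for the same γ values explored in the plots. The larger γ is, the faster a training point's influence dies off with distance.

```python
import numpy as np

# K(x, z) = exp(-gamma * ||x - z||^2): a training point's influence
# on a query point, as a function of their distance
for gamma in (0.1, 1.0, 10.0):
    for dist in (0.5, 1.0, 2.0):
        k = np.exp(-gamma * dist**2)
        print(f"gamma={gamma:5.1f}  dist={dist:.1f}  K={k:.4f}")
```

With γ = 10 the kernel is already near zero at distance 1, so each support vector only shapes the boundary locally; with γ = 0.1 its influence extends across most of the dataset.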
# Generate moons dataset
X_moons, y_moons = make_moons(n_samples=200, noise=0.15, random_state=42)
# Try different gamma values
gammas = [0.1, 1.0, 10.0]
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for idx, gamma in enumerate(gammas):
    # Train SVM with RBF kernel
    svm_rbf = SVC(kernel='rbf', gamma=gamma, C=1.0)
    svm_rbf.fit(X_moons, y_moons)
    # Create mesh
    x_min, x_max = X_moons[:, 0].min() - 0.5, X_moons[:, 0].max() + 0.5
    y_min, y_max = X_moons[:, 1].min() - 0.5, X_moons[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    Z = svm_rbf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot
    axes[idx].contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)
    axes[idx].contour(xx, yy, Z, levels=[0], colors='black', linewidths=3)
    axes[idx].scatter(X_moons[y_moons==0, 0], X_moons[y_moons==0, 1],
                      c='blue', marker='o', s=50, edgecolors='k', alpha=0.7)
    axes[idx].scatter(X_moons[y_moons==1, 0], X_moons[y_moons==1, 1],
                      c='red', marker='s', s=50, edgecolors='k', alpha=0.7)
    sv = svm_rbf.support_vectors_
    axes[idx].scatter(sv[:, 0], sv[:, 1], s=200, linewidth=2,
                      facecolors='none', edgecolors='green')
    acc = accuracy_score(y_moons, svm_rbf.predict(X_moons))
    axes[idx].set_xlabel('Feature 1', fontsize=11)
    axes[idx].set_ylabel('Feature 2', fontsize=11)
    axes[idx].set_title(f'γ = {gamma}\nAcc: {acc:.3f}, SVs: {len(sv)}',
                        fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nGamma (γ) Effect:")
print("- γ = 0.1: Smooth, possibly underfitting")
print("- γ = 1.0: Good balance")
print("- γ = 10.0: Complex, possibly overfitting")
5.5 Soft Margin SVM (C Parameter)¶
For non-separable data, we allow some mistakes with slack variables \(\xi_i\):
C parameter controls trade-off:
Large C: Hard margin (fewer errors, risk overfitting)
Small C: Soft margin (more errors, better generalization)
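The trade-off can be measured directly: the soft-margin objective penalizes the total slack \(\sum_i \max(0, 1 - y^{(i)}(w^T x^{(i)} + b))\) with weight C. A sketch on synthetic data (exact counts are data-dependent, so treat the trend rather than the numbers as the point):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           class_sep=0.8, random_state=42)
y_signed = 2 * y - 1  # {-1, +1}

# As C grows the optimizer buys less slack: typically fewer support
# vectors and a smaller total margin violation on the training set
for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel='linear', C=C).fit(X, y)
    slack = np.maximum(0, 1 - y_signed * svm.decision_function(X))
    print(f"C={C:6.2f}  SVs={len(svm.support_vectors_)}  "
          f"total slack={slack.sum():.2f}")
```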
# Generate slightly overlapping data
np.random.seed(42)
X_overlap, y_overlap = make_classification(n_samples=200, n_features=2,
n_redundant=0, n_informative=2,
n_clusters_per_class=1,
class_sep=0.8, random_state=42)
# Try different C values
C_values = [0.01, 1.0, 100.0]
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for idx, C in enumerate(C_values):
    svm_c = SVC(kernel='rbf', C=C, gamma='auto')
    svm_c.fit(X_overlap, y_overlap)
    # Create mesh
    x_min, x_max = X_overlap[:, 0].min() - 1, X_overlap[:, 0].max() + 1
    y_min, y_max = X_overlap[:, 1].min() - 1, X_overlap[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    Z = svm_c.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    axes[idx].contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)
    axes[idx].contour(xx, yy, Z, levels=[0], colors='black', linewidths=3)
    axes[idx].scatter(X_overlap[y_overlap==0, 0], X_overlap[y_overlap==0, 1],
                      c='blue', marker='o', s=50, edgecolors='k', alpha=0.7)
    axes[idx].scatter(X_overlap[y_overlap==1, 0], X_overlap[y_overlap==1, 1],
                      c='red', marker='s', s=50, edgecolors='k', alpha=0.7)
    sv = svm_c.support_vectors_
    axes[idx].scatter(sv[:, 0], sv[:, 1], s=200, linewidth=2,
                      facecolors='none', edgecolors='green')
    acc = accuracy_score(y_overlap, svm_c.predict(X_overlap))
    axes[idx].set_xlabel('Feature 1', fontsize=11)
    axes[idx].set_ylabel('Feature 2', fontsize=11)
    axes[idx].set_title(f'C = {C}\nAcc: {acc:.3f}, SVs: {len(sv)}',
                        fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nC Parameter Effect:")
print("- C = 0.01: Very soft margin, many SVs, simpler boundary")
print("- C = 1.0: Balanced")
print("- C = 100: Hard margin, fewer SVs, complex boundary")
5.6 Hyperparameter Tuning with Grid Search¶
What: Systematically finding optimal SVM hyperparameters¶
SVMs have two critical hyperparameters: the margin-violation penalty \(C\) (controlling the bias-variance tradeoff via the slack variables; larger \(C\) means less regularization) and the kernel parameter \(\gamma\) (controlling the complexity of the decision boundary for RBF kernels). Grid search evaluates every combination of \(C\) and \(\gamma\) from predefined ranges using cross-validation, then selects the combination with the best validation performance.
Why: SVM performance is highly sensitive to hyperparameters¶
A poor choice of \(\gamma\) in the RBF kernel can make the classifier either memorize individual training points (large \(\gamma\), overfitting) or treat all points as identical (small \(\gamma\), underfitting). Cross-validated grid search is the standard approach to navigating this 2D hyperparameter space, and the resulting heatmap of validation scores provides a clear picture of which regions produce robust models. In practice, a logarithmic grid (e.g., \(C \in \{0.1, 1, 10, 100\}\), \(\gamma \in \{0.001, 0.01, 0.1, 1\}\)) is the standard starting point.
# Load breast cancer dataset
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Breast Cancer Dataset:")
print(f" Training: {X_train_scaled.shape}")
print(f" Test: {X_test_scaled.shape}")
# Grid search
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': [0.001, 0.01, 0.1, 1]
}
print("\nPerforming Grid Search...")
grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5,
scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
# Test best model
best_svm = grid_search.best_estimator_
y_pred = best_svm.predict(X_test_scaled)
test_acc = accuracy_score(y_test, y_pred)
print(f"\nTest accuracy: {test_acc:.4f}")
print(f"Number of support vectors: {len(best_svm.support_vectors_)}")
# Visualize grid search results
results = pd.DataFrame(grid_search.cv_results_)
scores = results.pivot_table(values='mean_test_score',
index='param_gamma',
columns='param_C')
plt.figure(figsize=(10, 8))
sns.heatmap(scores, annot=True, fmt='.3f', cmap='YlGnBu', cbar_kws={'label': 'CV Accuracy'})
plt.xlabel('C', fontsize=12)
plt.ylabel('γ (gamma)', fontsize=12)
plt.title('SVM Hyperparameter Grid Search\n(Breast Cancer Classification)',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
5.7 Multi-class SVMs¶
SVMs are binary classifiers. For multi-class:
One-vs-One (OvO): Train K(K-1)/2 classifiers
One-vs-All (OvA): Train K classifiers
scikit-learn uses OvO by default for SVC.
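The K(K-1)/2 count can be verified directly: with decision_function_shape='ovo', SVC exposes one decision-function column per pairwise classifier. A quick check on iris (K = 3 classes):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

svm = SVC(kernel='rbf', decision_function_shape='ovo').fit(X, y)

# One column per pairwise (OvO) classifier: 3 * 2 / 2 = 3 columns
print(svm.decision_function(X[:2]).shape)  # (2, 3)
```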
from sklearn.datasets import load_iris
# Load iris (3 classes)
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
# Split and scale
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)
scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(X_train_iris)
X_test_iris_scaled = scaler_iris.transform(X_test_iris)
# Train multi-class SVM
svm_multi = SVC(kernel='rbf', C=1.0, gamma='auto', decision_function_shape='ovo')
svm_multi.fit(X_train_iris_scaled, y_train_iris)
# Predictions
y_pred_iris = svm_multi.predict(X_test_iris_scaled)
acc_iris = accuracy_score(y_test_iris, y_pred_iris)
print("Multi-class SVM (Iris Dataset):")
print(f" Test Accuracy: {acc_iris:.4f}")
print(f" Number of classes: {len(iris.target_names)}")
print(f" Number of binary classifiers (OvO): {len(iris.target_names) * (len(iris.target_names)-1) // 2}")
# Classification report
print("\nClassification Report:")
print(classification_report(y_test_iris, y_pred_iris, target_names=iris.target_names))
# Confusion matrix
cm = confusion_matrix(y_test_iris, y_pred_iris)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=iris.target_names,
yticklabels=iris.target_names)
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.title('Multi-class SVM: Iris Classification', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
Key Takeaways¶
1. Margin Maximization¶
SVMs find the maximum-margin hyperplane
Larger margin → better generalization
Only support vectors affect the decision boundary
2. Kernel Trick¶
Map data to high-dimensional space without computing \(\phi(x)\) explicitly
RBF kernel most commonly used: \(K(x,z) = \exp(-\gamma||x-z||^2)\)
Enables non-linear decision boundaries
3. Hyperparameters¶
C: Regularization (small = soft margin, large = hard margin)
γ: RBF kernel width (small = smooth, large = complex)
Use cross-validation to tune both
4. Advantages¶
Effective in high dimensions
Memory efficient (only stores support vectors)
Versatile (different kernels for different data)
Strong theoretical guarantees
5. Disadvantages¶
Slow to train on large datasets (O(m²) to O(m³) in the number of training examples)
Sensitive to feature scaling
Hard to interpret (black box)
Choosing right kernel and parameters critical
6. Practical Tips¶
Always scale features (StandardScaler)
Start with RBF kernel
Try C ∈ [0.1, 1, 10, 100] and γ ∈ [0.001, 0.01, 0.1, 1]
For large datasets, use LinearSVC (faster)
For text, try linear kernel first
7. When to Use SVMs¶
Medium-sized datasets (100s to 10,000s of samples)
High-dimensional data (text, genomics)
Clear margin of separation
Need probabilistic interpretation → use sklearn’s probability=True
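A minimal sketch of that last point: probability=True fits Platt scaling on top of the SVM scores, which slows training but enables predict_proba. On the breast cancer data (scaled, as recommended above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
scaler = StandardScaler().fit(X_tr)

# probability=True adds an internal cross-validated Platt scaling step
svm = SVC(kernel='rbf', probability=True, random_state=42)
svm.fit(scaler.transform(X_tr), y_tr)

proba = svm.predict_proba(scaler.transform(X_te[:3]))
print(proba)  # each row sums to 1
```

These probabilities come from an extra calibration model, so they are worth checking with a calibration curve (see the exercises) before being trusted.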
Practice Exercises¶
Margin Calculation: Given w, b, and data points, calculate geometric margin. Verify support vectors.
Kernel Implementation: Implement polynomial and RBF kernels from scratch. Test on 2D data.
Soft vs Hard Margin: Generate data with outliers. Compare hard margin (large C) vs soft margin (small C). Which is more robust?
Custom Kernel: Design and implement a custom kernel for a specific problem (e.g., string matching).
Scaling Impact: Train SVM with and without feature scaling. Compare performance. Why does scaling matter?
Multi-class Comparison: Implement both One-vs-One and One-vs-All strategies. Compare accuracy and training time.
SVM vs Logistic Regression: Compare on multiple datasets. When does each perform better?
Probability Calibration: Use
predict_proba(). Are the probabilities well-calibrated? Plot calibration curve.Large Dataset: Try SVM on MNIST (60k samples). Use LinearSVC or SGDClassifier for speed. Compare with full SVM.
Imbalanced Classes: Create a dataset with a 95/5 class split. Use class_weight='balanced'. Compare with a standard SVM.
References¶
CS229 Lecture Notes: Andrew Ng’s notes on SVMs
“A Tutorial on Support Vector Machines”: Burges (1998)
“Statistical Learning Theory”: Vapnik (1998)
The Elements of Statistical Learning: Hastie et al., Chapter 12
“LIBSVM: A Library for Support Vector Machines”: Chang & Lin
“Sequential Minimal Optimization” (SMO): Platt (1998)