Chapter 9: Support Vector Machines

Overview

Support Vector Machines (SVMs) are powerful classifiers that find the optimal separating hyperplane between classes.

Evolution of SVMs

1. Maximal Margin Classifier

  • Linearly separable data only

  • Finds hyperplane with maximum margin

  • Margin = distance from hyperplane to nearest point

  • Problem: Too restrictive (requires perfect separation)

2. Support Vector Classifier (Soft Margin)

  • Allows some misclassifications

  • Introduces slack variables εᵢ

  • Tuning parameter C controls bias-variance tradeoff

  • Works when data not perfectly separable

3. Support Vector Machine (Kernel Trick)

  • Handles non-linear boundaries

  • Uses kernel functions to implicitly map to higher dimensions

  • Common kernels: Linear, Polynomial, RBF (Gaussian), Sigmoid

  • Most flexible and powerful

Key Concepts

Hyperplane

In p dimensions: \[\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0\]

Classification rule: \[f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

  • If \(f(X) > 0\): Class +1

  • If \(f(X) < 0\): Class -1

Margin

Distance from hyperplane to nearest training point: \[M = \min_i \frac{|f(x_i)|}{||\beta||}\]

Goal: Maximize M

Support Vectors

  • Training points that lie on the margin

  • Only these points affect the hyperplane!

  • All other points can move without changing solution

  • Typically only small fraction of training points

Advantages

✅ Effective in high dimensions
✅ Memory efficient (only uses support vectors)
✅ Versatile (different kernels for different data)
✅ Works well with clear margin
✅ Robust to outliers (with proper C tuning)

Disadvantages

❌ Computationally expensive (O(n²) to O(n³))
❌ Difficult for large datasets (n > 10,000)
❌ Sensitive to kernel choice
❌ No probabilistic interpretation (by default)
❌ Hyperparameter tuning critical

9.1 Maximal Margin Classifier

Optimization Problem

Maximize: \(M\)
Subject to:
\[||\beta|| = 1\]
\[y_i (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \geq M \quad \forall i\]

where:

  • \(M\) = margin width

  • \(y_i \in \{-1, +1\}\) = class labels

  • \(||\beta|| = 1\) = normalization constraint

Geometric Interpretation

  • Hyperplane: \(\beta_0 + \beta^T x = 0\)

  • Margin boundaries: \(\beta_0 + \beta^T x = \pm M\)

  • Support vectors: Points on margin boundaries

  • Decision: Sign of \(\beta_0 + \beta^T x\)

Limitation

Requires perfect linear separability - rare in practice!

# Maximal Margin Classifier Demo (Linearly Separable Data)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

# Generate linearly separable data
np.random.seed(42)
X_sep = np.random.randn(40, 2)
y_sep = np.ones(40)
X_sep[:20] -= 2
y_sep[:20] = -1

# Fit SVC with large C (hard margin approximation)
svm_hard = SVC(kernel='linear', C=1e10)
svm_hard.fit(X_sep, y_sep)

# Plotting function for decision boundary
def plot_svm_boundary(X, y, model, title, ax):
    # Create mesh
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # Predict on mesh
    Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot decision boundary and margins
    ax.contourf(xx, yy, Z, levels=[-1e10, 0, 1e10], colors=['#FFAAAA', '#AAAAFF'], alpha=0.3)
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors=['k', 'k', 'k'], 
              linestyles=['--', '-', '--'], linewidths=[1, 2, 1])
    
    # Plot points
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', s=100, edgecolors='k')
    
    # Highlight support vectors
    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
              s=300, linewidth=2, facecolors='none', edgecolors='green', label='Support Vectors')
    
    ax.set_xlabel('X₁')
    ax.set_ylabel('X₂')
    ax.set_title(title)
    ax.legend()
    ax.grid(True, alpha=0.3)

fig, ax = plt.subplots(1, 1, figsize=(10, 6))
plot_svm_boundary(X_sep, y_sep, svm_hard,
                 f'Maximal Margin Classifier\n{len(svm_hard.support_vectors_)} Support Vectors', ax)
plt.show()

print(f"\n📊 Maximal Margin Classifier:")
print(f"   • Number of support vectors: {len(svm_hard.support_vectors_)}")
print(f"   • Training accuracy: {svm_hard.score(X_sep, y_sep):.3f}")
print(f"   • Margin width: {2 / np.linalg.norm(svm_hard.coef_):.3f}")
print(f"\n💡 Solid line = decision boundary (hyperplane)")
print(f"   Dashed lines = margin boundaries")
print(f"   Green circles = support vectors (on margin)")

9.2 Support Vector Classifier (Soft Margin)

The Problem

Real data is rarely perfectly separable!

The Solution: Soft Margin

Allow some points to:

  1. Be on the wrong side of margin

  2. Be misclassified

Optimization with Slack Variables

Maximize: \(M\)
Subject to:
\[y_i (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \geq M(1 - \epsilon_i)\]
\[\epsilon_i \geq 0, \quad \sum_{i=1}^n \epsilon_i \leq C\]

where:

  • \(\epsilon_i\) = slack variable for observation i

  • \(\epsilon_i = 0\): Correct side of margin

  • \(0 < \epsilon_i < 1\): Wrong side of margin, but correct class

  • \(\epsilon_i > 1\): Misclassified

  • \(C\) = budget for total violations in this formulation (note: sklearn's C parameter, used in the code below, is inversely related — large sklearn C means few violations allowed)

Tuning Parameter C

Large C:

  • Few violations allowed

  • Narrow margin

  • Low bias, high variance

  • Risk of overfitting

Small C:

  • Many violations allowed

  • Wide margin

  • High bias, low variance

  • More robust, may underfit

Typical values: 0.01, 0.1, 1, 10, 100

Equivalent Formulation

Minimize: \[\frac{1}{2}||\beta||^2 + C \sum_{i=1}^n \epsilon_i\]

This is a regularization problem:

  • First term: Model complexity

  • Second term: Training error

  • C controls the tradeoff, playing the role of 1/λ in ridge/lasso (large C = weak regularization)
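As a sanity check, the penalized objective above can be evaluated by hand for a fitted linear SVC. This is a sketch on made-up toy data; it uses the fact that, at the optimum, each slack variable εᵢ equals the hinge loss max(0, 1 − yᵢ f(xᵢ)):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-class data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

C = 1.0
svm = SVC(kernel='linear', C=C).fit(X, y)

w = svm.coef_.ravel()
b = svm.intercept_[0]

# Slack variables at the optimum = hinge losses max(0, 1 - y_i f(x_i))
slack = np.maximum(0, 1 - y * (X @ w + b))
objective = 0.5 * w @ w + C * slack.sum()   # (1/2)||beta||^2 + C * sum(eps_i)

print(f"margin violations (eps_i > 0): {int((slack > 0).sum())}")
print(f"objective value: {objective:.3f}")
```

Points strictly outside the margin contribute nothing to the second term, which is exactly why only support vectors matter.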

# Support Vector Classifier: Effect of C

# Generate overlapping data
np.random.seed(42)
X_overlap = np.random.randn(100, 2)
y_overlap = np.ones(100)
X_overlap[:50] -= 1
y_overlap[:50] = -1
X_overlap[40:50] += 1.5  # Create overlap

# Try different C values
C_values = [0.01, 0.1, 1, 100]
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, C in enumerate(C_values):
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_overlap, y_overlap)
    
    plot_svm_boundary(X_overlap, y_overlap, svm,
                     f'C = {C}\n{len(svm.support_vectors_)} Support Vectors\n'
                     f'Accuracy: {svm.score(X_overlap, y_overlap):.3f}',
                     axes[idx])

plt.tight_layout()
plt.show()

print("\n💡 Effect of C:")
print("   • C = 0.01:  Very wide margin, many support vectors, smooth boundary")
print("   • C = 0.1:   Wide margin, still many support vectors")
print("   • C = 1:     Moderate margin (often good default)")
print("   • C = 100:   Narrow margin, fewer support vectors, tries to fit all points")
print("\n   Smaller C → wider margin → more bias, less variance")
print("   Larger C → narrower margin → less bias, more variance")

9.3 Support Vector Machines with Kernels

The Non-Linear Problem

Many datasets are not linearly separable in the original space.

The Kernel Trick

Idea: Map data to higher-dimensional space where it becomes linearly separable

\[\phi: \mathbb{R}^p \rightarrow \mathbb{R}^q \quad (q \gg p)\]

But computing \(\phi(x)\) explicitly can be expensive or impossible!

Solution: Use kernel function \[K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle\]

Compute inner product in high-dimensional space without explicitly going there!
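To make this concrete, here is a small sketch (toy vectors; the helper names poly_kernel and phi are made up for illustration) showing that the degree-2 polynomial kernel equals an ordinary inner product under an explicit 6-dimensional feature map:

```python
import numpy as np

def poly_kernel(x, z):
    # Degree-2 polynomial kernel: K(x, z) = (1 + x.z)^2, computed in the original 2-D space
    return (1 + x @ z) ** 2

def phi(x):
    # Explicit degree-2 feature map for 2-D input: 6 dimensions
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

print(poly_kernel(x, z))    # 0.25 — computed in 2-D
print(phi(x) @ phi(z))      # 0.25 — same value via the explicit 6-D map
```

The kernel gives the same number without ever constructing the 6-dimensional vectors; for the RBF kernel the implicit space is infinite-dimensional, so the explicit route is not even possible.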

Common Kernels

1. Linear Kernel

\[K(x_i, x_j) = x_i^T x_j\]
  • Standard inner product

  • Same as Support Vector Classifier

2. Polynomial Kernel

\[K(x_i, x_j) = (1 + x_i^T x_j)^d\]
  • \(d\) = degree (usually 2, 3, or 4)

  • Creates polynomial decision boundaries

  • Higher \(d\) → more flexible, but can overfit

3. Radial Basis Function (RBF) / Gaussian Kernel

\[K(x_i, x_j) = \exp(-\gamma ||x_i - x_j||^2)\]
  • \(\gamma\) = kernel coefficient (sklearn default 'scale' = 1/(n_features × X.var()))

  • Most popular kernel

  • Infinite-dimensional feature space!

  • Creates local, non-linear boundaries

Effect of γ:

  • Large γ: Narrow influence, complex boundary, overfitting risk

  • Small γ: Wide influence, smooth boundary, underfitting risk

4. Sigmoid Kernel

\[K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)\]
  • Similar to neural networks

  • Less commonly used

Kernel Selection Guidelines

  1. Start with RBF - usually best default

  2. Try Linear if:

    • Many features (p large)

    • Linear separability suspected

    • Need interpretability

  3. Try Polynomial for:

    • Interaction effects important

    • Known polynomial relationships

  4. Use cross-validation to choose!

# Kernel Comparison on Non-Linear Data
from sklearn.datasets import make_circles, make_moons

# Generate non-linear datasets
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=42)
X_moons, y_moons = make_moons(n_samples=200, noise=0.15, random_state=42)

# Convert labels to -1, 1
y_circles = 2 * y_circles - 1
y_moons = 2 * y_moons - 1

# Kernels to test
kernels = ['linear', 'poly', 'rbf']
kernel_params = {
    'linear': {},
    'poly': {'degree': 3},
    'rbf': {'gamma': 'scale'}
}

# Plot for circles dataset
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

for idx, kernel in enumerate(kernels):
    # Circles
    svm_circles = SVC(kernel=kernel, C=1, **kernel_params[kernel])
    svm_circles.fit(X_circles, y_circles)
    plot_svm_boundary(X_circles, y_circles, svm_circles,
                     f'Circles: {kernel.upper()} kernel\n'
                     f'Acc: {svm_circles.score(X_circles, y_circles):.3f}',
                     axes[0, idx])
    
    # Moons
    svm_moons = SVC(kernel=kernel, C=1, **kernel_params[kernel])
    svm_moons.fit(X_moons, y_moons)
    plot_svm_boundary(X_moons, y_moons, svm_moons,
                     f'Moons: {kernel.upper()} kernel\n'
                     f'Acc: {svm_moons.score(X_moons, y_moons):.3f}',
                     axes[1, idx])

plt.tight_layout()
plt.show()

print("\n💡 Kernel Observations:")
print("   • LINEAR: Fails on non-linear data (straight line boundary)")
print("   • POLY: Works on moons, struggles with circles")
print("   • RBF: Handles both datasets well (most flexible)")
print("\n   RBF is often the best default choice!")

# Effect of gamma parameter in RBF kernel

gamma_values = [0.01, 0.1, 1, 10]
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, gamma in enumerate(gamma_values):
    svm = SVC(kernel='rbf', C=1, gamma=gamma)
    svm.fit(X_moons, y_moons)
    
    plot_svm_boundary(X_moons, y_moons, svm,
                     f'RBF Kernel: γ = {gamma}\n'
                     f'Accuracy: {svm.score(X_moons, y_moons):.3f}\n'
                     f'{len(svm.support_vectors_)} Support Vectors',
                     axes[idx])

plt.tight_layout()
plt.show()

print("\n💡 Effect of γ in RBF kernel:")
print("   • γ = 0.01:  Very smooth, wide influence (underfitting)")
print("   • γ = 0.1:   Balanced, good generalization")
print("   • γ = 1:     More complex boundary")
print("   • γ = 10:    Very wiggly, tight around points (overfitting)")
print("\n   Larger γ → more complex boundary → higher variance")
print("   Smaller γ → smoother boundary → higher bias")

9.4 Multi-class SVMs

SVMs are inherently binary classifiers. For K > 2 classes:

One-vs-One (OVO)

  • Train \(\binom{K}{2}\) classifiers, one for each pair of classes

  • For K=3: 3 classifiers (1 vs 2, 1 vs 3, 2 vs 3)

  • For K=10: 45 classifiers

  • Prediction: Vote among all classifiers

  • Pros: Each classifier only sees relevant data

  • Cons: Many classifiers to train

  • sklearn default for SVC

One-vs-Rest (OVR) / One-vs-All

  • Train K classifiers, one for each class vs all others

  • For K=10: 10 classifiers

  • Prediction: Choose class with highest confidence

  • Pros: Fewer classifiers

  • Cons: Imbalanced training data

sklearn's SVC automatically handles multi-class using OVO.
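The two strategies can also be compared directly using sklearn's OneVsOneClassifier and OneVsRestClassifier wrappers. A sketch on the 10-class digits dataset (the dataset choice is illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # 10 classes

# Wrap the same base classifier in each multi-class strategy
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)

print(len(ovo.estimators_))  # K(K-1)/2 = 45 pairwise classifiers
print(len(ovr.estimators_))  # K = 10 one-vs-rest classifiers
```

With 10 classes the OVO cost (45 fits) is visible, but each pairwise fit sees only two classes' worth of data, so individual fits are small and fast.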

# Multi-class SVM Example

# Generate 3-class data
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X_multi, y_multi = make_blobs(n_samples=300, centers=3, n_features=2,
                              cluster_std=1.0, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_multi, y_multi, test_size=0.3, random_state=42)

# Train SVM (automatically handles multi-class)
svm_multi = SVC(kernel='rbf', C=1, gamma='scale')
svm_multi.fit(X_train, y_train)

# Plotting
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Decision boundary
h = 0.02
x_min, x_max = X_multi[:, 0].min() - 1, X_multi[:, 0].max() + 1
y_min, y_max = X_multi[:, 1].min() - 1, X_multi[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = svm_multi.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

ax1.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
ax1.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis',
           s=100, edgecolors='k', marker='o', label='Train')
ax1.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='viridis',
           s=100, edgecolors='k', marker='s', label='Test')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.set_title(f'Multi-class SVM (3 classes)\n'
             f'Train Acc: {svm_multi.score(X_train, y_train):.3f}, '
             f'Test Acc: {svm_multi.score(X_test, y_test):.3f}')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Confusion matrix
import seaborn as sns
from sklearn.metrics import confusion_matrix
y_pred = svm_multi.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax2)
ax2.set_xlabel('Predicted')
ax2.set_ylabel('Actual')
ax2.set_title('Confusion Matrix')

plt.tight_layout()
plt.show()

print(f"\n📊 Multi-class SVM:")
print(f"   • Number of classes: {len(np.unique(y_multi))}")
n_classes = len(svm_multi.n_support_)
print(f"   • Number of binary classifiers trained: {n_classes * (n_classes - 1) // 2}")
print(f"   • Total support vectors: {len(svm_multi.support_vectors_)}")
print(f"   • Support vectors per class: {svm_multi.n_support_}")
print(f"\n   sklearn's SVC uses One-vs-One (OVO) strategy")
print(f"   For 3 classes: trains 3 binary classifiers")

9.5 Real Dataset Example: Breast Cancer

Hyperparameter Tuning

Key parameters to tune:

  • C: Regularization (0.1, 1, 10, 100)

  • gamma: RBF kernel width (0.001, 0.01, 0.1, 1)

  • kernel: Linear, RBF, Poly

Use GridSearchCV for systematic search.

# Breast Cancer Classification with SVM
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Load data
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
feature_names = data.feature_names

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# IMPORTANT: Scale features for SVM!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("🔍 Hyperparameter Tuning with GridSearchCV...\n")

# Grid search
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}

grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.4f}")

# Best model
best_svm = grid.best_estimator_
y_pred = best_svm.predict(X_test_scaled)

print(f"\nTest accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')
axes[0].set_xticklabels(data.target_names)
axes[0].set_yticklabels(data.target_names)

# Grid search heatmap (for RBF kernel)
rbf_results = grid.cv_results_
C_vals = [0.1, 1, 10, 100]
gamma_vals = [0.001, 0.01, 0.1, 1]

scores = np.zeros((len(gamma_vals), len(C_vals)))
for i, gamma in enumerate(gamma_vals):
    for j, C in enumerate(C_vals):
        idx = [k for k, p in enumerate(rbf_results['params']) 
               if p.get('kernel') == 'rbf' and p['C'] == C and p['gamma'] == gamma]
        if idx:
            scores[i, j] = rbf_results['mean_test_score'][idx[0]]

sns.heatmap(scores, annot=True, fmt='.3f', cmap='RdYlGn', 
           xticklabels=C_vals, yticklabels=gamma_vals, ax=axes[1])
axes[1].set_xlabel('C')
axes[1].set_ylabel('γ (gamma)')
axes[1].set_title('Grid Search Results (RBF Kernel)\nCV Accuracy')

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("   • Feature scaling is CRITICAL for SVM!")
print("   • Grid search found optimal C and gamma")
print("   • RBF kernel often works best")
print("   • SVM achieves excellent performance on this dataset")

9.6 Support Vector Regression (SVR)

SVMs can also be used for regression!

Key Difference from Classification

Classification: Find the hyperplane with maximum margin
Regression: Find a function that keeps the training points inside an ε-insensitive tube

ε-Insensitive Loss

Ignore errors smaller than ε: \[L_\epsilon(y, f(x)) = \begin{cases} 0 & \text{if } |y - f(x)| \leq \epsilon \\ |y - f(x)| - \epsilon & \text{otherwise} \end{cases}\]
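A minimal NumPy sketch of this loss (the function name eps_insensitive_loss is made up for illustration):

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    # Zero loss inside the eps-tube, linear loss outside it
    residual = np.abs(y_true - y_pred)
    return np.maximum(0, residual - eps)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 2.0])
print(eps_insensitive_loss(y_true, y_pred, eps=0.1))  # [0.  0.4 0.9]
```

The first point sits inside the tube (residual 0.05 ≤ ε) and contributes nothing; the other two pay only for the part of the residual that exceeds ε.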

Support Vectors in SVR

Points on or outside the ε-tube (prediction errors ≥ ε)

Parameters

  • C: Regularization (same as SVC)

  • epsilon: Width of tube (default: 0.1)

  • kernel, gamma: Same as SVC

# Support Vector Regression Example
from sklearn.svm import SVR

# Generate regression data
np.random.seed(42)
n = 100
X_reg = np.linspace(0, 10, n).reshape(-1, 1)
y_reg = np.sin(X_reg).ravel() + np.random.randn(n) * 0.3

# Compare different epsilon values
epsilons = [0.05, 0.1, 0.3, 0.5]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

X_plot = np.linspace(0, 10, 300).reshape(-1, 1)

for idx, eps in enumerate(epsilons):
    svr = SVR(kernel='rbf', C=100, epsilon=eps, gamma='scale')
    svr.fit(X_reg, y_reg)
    y_pred = svr.predict(X_plot)
    
    # Plot data and prediction
    axes[idx].scatter(X_reg, y_reg, alpha=0.5, s=50, label='Data')
    axes[idx].plot(X_plot, y_pred, 'r-', linewidth=2, label='SVR')
    
    # Plot epsilon tube
    axes[idx].fill_between(X_plot.ravel(), 
                          y_pred - eps, y_pred + eps,
                          alpha=0.2, color='red', label=f'ε-tube (ε={eps})')
    
    # Highlight support vectors
    axes[idx].scatter(X_reg[svr.support_], y_reg[svr.support_],
                     s=200, facecolors='none', edgecolors='green', 
                     linewidth=2, label=f'{len(svr.support_)} SVs')
    
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('y')
    axes[idx].set_title(f'SVR: ε = {eps}\n'
                       f'R² = {svr.score(X_reg, y_reg):.3f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Effect of ε in SVR:")
print("   • Smaller ε: Narrower tube, more support vectors, closer fit")
print("   • Larger ε: Wider tube, fewer support vectors, smoother fit")
print("   • Points inside tube: not support vectors (zero penalty)")
print("   • Points outside tube: support vectors (contribute to model)")

Key Takeaways

When to Use SVMs

Good For: ✅ High-dimensional data (text, genomics)
✅ Clear margin of separation exists
✅ More features than samples (p > n)
✅ Non-linear boundaries (with kernels)
✅ Binary classification

Not Ideal For: ❌ Very large datasets (n > 10,000)
❌ Noisy data with overlapping classes
❌ Need probability estimates
❌ Need interpretability
❌ Real-time prediction critical

Best Practices

1. Always Scale Features

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

SVMs are sensitive to feature scales!

2. Start with RBF Kernel

svm = SVC(kernel='rbf', C=1, gamma='scale')

Best default for most problems

3. Tune Hyperparameters

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}
GridSearchCV(SVC(), param_grid, cv=5)

4. Use Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

Prevents data leakage!

Hyperparameter Guidelines

Parameter C (Regularization)

  • C → ∞: Hard margin, low bias, high variance

  • C → 0: Soft margin, high bias, low variance

  • Typical: Try 0.1, 1, 10, 100

  • Default: 1 (often reasonable)

Parameter γ (RBF kernel)

  • γ → ∞: High complexity, overfitting

  • γ → 0: Low complexity, underfitting

  • Typical: Try 0.001, 0.01, 0.1, 1

  • Default: 'scale' = 1/(n_features × X.var())

Kernel Selection

  • Linear: Fast, interpretable, high-dimensional

  • RBF: Flexible, non-linear, default choice

  • Poly: Specific polynomial relationships

  • Sigmoid: Rarely used

Comparison with Other Methods

| Aspect           | SVM           | Logistic Regression | Random Forest |
|------------------|---------------|---------------------|---------------|
| Speed            | Slow (O(n²))  | Fast                | Medium        |
| High-dim         | Excellent     | Good                | Good          |
| Non-linear       | Yes (kernels) | No (needs features) | Yes           |
| Interpretability | Low           | High                | Medium        |
| Tuning           | Critical      | Easy                | Less critical |
| Probabilities    | Not native    | Native              | Native        |
| Large data       | Poor          | Good                | Good          |

Common Pitfalls

❌ Forgetting to scale features → Poor performance
❌ Using default parameters → Suboptimal results
❌ Wrong kernel choice → Missing patterns
❌ Too large C or gamma → Overfitting
❌ Not using cross-validation → Unreliable estimates
❌ Applying to huge datasets → Extremely slow

Practical Workflow

  1. Preprocess: Scale features (StandardScaler)

  2. Baseline: Try linear kernel first

  3. Non-linear: Try RBF with default parameters

  4. Tune: Grid search over C and gamma

  5. Validate: Use cross-validation

  6. Evaluate: Test set performance

  7. Compare: Try other methods (RF, XGBoost)

Advanced Tips

For Large Datasets:

  • Use LinearSVC (faster than SVC(kernel='linear'))

  • Consider subsampling

  • Try kernel approximation (Nystroem, RBFSampler)

For Imbalanced Data:

  • Use class_weight='balanced'

  • Or manually set class_weight={0: w0, 1: w1}
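The effect of class_weight can be checked on synthetic imbalanced data. A sketch (sample sizes and class ratio are illustrative; the lift in minority-class recall is typical, not guaranteed):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Imbalanced 2-class toy data: roughly 90% / 10%
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

plain = SVC(kernel='rbf', gamma='scale').fit(X, y)
balanced = SVC(kernel='rbf', gamma='scale', class_weight='balanced').fit(X, y)

# Recall on the minority class: balancing reweights C per class
minority = y == 1
recall_plain = (plain.predict(X[minority]) == 1).mean()
recall_balanced = (balanced.predict(X[minority]) == 1).mean()
print(recall_plain, recall_balanced)
```

class_weight='balanced' scales each class's C by n_samples / (n_classes × class count), so errors on the rare class cost more.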

For Probability Estimates:

  • Use probability=True in SVC

  • Slower (uses cross-validation internally)

  • Then can use predict_proba()
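A sketch of the probability workflow (dataset chosen arbitrarily; scaling included per the best practices above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# probability=True enables Platt scaling via internal cross-validation (slower to fit)
svm = SVC(kernel='rbf', gamma='scale', probability=True).fit(X_train, y_train)

proba = svm.predict_proba(X_test[:3])
print(proba.shape)         # (3, 2): one column per class, in svm.classes_ order
print(proba.sum(axis=1))   # each row sums to 1
```

Note that predict() and predict_proba() can occasionally disagree near the boundary, because the probabilities come from a separately calibrated model.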

sklearn Implementation Notes

SVC vs LinearSVC:

  • SVC(kernel='linear'): Uses libsvm (slower, but supports the full SVC interface)

  • LinearSVC: Uses liblinear (much faster on large datasets, linear kernel only)

  • For linear kernels on large data: use LinearSVC
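A quick sketch contrasting the two linear solvers on synthetic data (sample sizes are illustrative; max_iter is raised to avoid convergence warnings):

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X = StandardScaler().fit_transform(X)  # scaling helps both solvers converge

libsvm_linear = SVC(kernel='linear', C=1).fit(X, y)       # libsvm backend
liblinear = LinearSVC(C=1, max_iter=5000).fit(X, y)       # liblinear backend

print(libsvm_linear.score(X, y), liblinear.score(X, y))   # typically very close
```

The decision functions differ slightly (LinearSVC uses squared hinge loss and also penalizes the intercept by default), but on scaled data the fitted boundaries are usually near-identical, with LinearSVC scaling far better in n.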

SVR vs LinearSVR:

  • Same distinction as classification

Next Chapter

Chapter 10: Deep Learning

  • Neural Networks

  • Activation Functions

  • Backpropagation

  • Convolutional Neural Networks

  • Recurrent Neural Networks

Practice Exercises

Exercise 1: Margin Analysis

  1. Generate linearly separable data

  2. Fit SVC with different C values (0.01, 0.1, 1, 10, 100)

  3. For each, calculate and plot:

    • Margin width

    • Number of support vectors

    • Training and test accuracy

  4. Visualize the relationship between C and these metrics

Exercise 2: Kernel Comparison

Using make_classification with varying parameters:

  1. Create datasets with different separability

  2. Test all kernels (linear, poly degree 2-4, RBF)

  3. Record training time and accuracy for each

  4. Create recommendation matrix: dataset type β†’ best kernel

Exercise 3: Hyperparameter Sensitivity

  1. Generate non-linear data (circles or moons)

  2. Create heatmap: C (x-axis) vs gamma (y-axis) vs accuracy (color)

  3. Use at least 10 values for each parameter

  4. Identify optimal region and overfitting region

  5. Plot decision boundaries for corners and center

Exercise 4: Feature Scaling Impact

Using breast cancer dataset:

  1. Train SVM on raw features (no scaling)

  2. Train SVM on standardized features

  3. Train SVM on normalized features (min-max)

  4. Compare accuracy, training time, support vectors

  5. Explain why scaling matters

Exercise 5: Multi-class Strategy

Using iris or digits dataset:

  1. Implement One-vs-One manually

  2. Implement One-vs-Rest manually

  3. Compare with sklearn's default

  4. Analyze which class pairs are hardest to separate

  5. Visualize decision regions (use PCA if needed)

Exercise 6: SVR Parameter Exploration

  1. Generate non-linear regression data

  2. Test epsilon values: 0.01, 0.05, 0.1, 0.3, 0.5

  3. Test C values: 0.1, 1, 10, 100

  4. For each combination, record:

    • Number of support vectors

    • R² score

    • Prediction smoothness

  5. Identify sweet spot for bias-variance tradeoff