Chapter 9: Support Vector Machines

Overview

Support Vector Machines (SVMs) are powerful classifiers that find the optimal separating hyperplane between classes.

Evolution of SVMs

1. Maximal Margin Classifier

  • Linearly separable data only

  • Finds hyperplane with maximum margin

  • Margin = distance from hyperplane to nearest point

  • Problem: Too restrictive (requires perfect separation)

2. Support Vector Classifier (Soft Margin)

  • Allows some misclassifications

  • Introduces slack variables εᵢ

  • Tuning parameter C controls bias-variance tradeoff

  • Works when data not perfectly separable

3. Support Vector Machine (Kernel Trick)

  • Handles non-linear boundaries

  • Uses kernel functions to implicitly map to higher dimensions

  • Common kernels: Linear, Polynomial, RBF (Gaussian), Sigmoid

  • Most flexible and powerful

Key Concepts

Hyperplane

In p dimensions: \[\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0\]

Classification rule: \[f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

  • If \(f(X) > 0\): Class +1

  • If \(f(X) < 0\): Class -1

Margin

Distance from hyperplane to nearest training point: \[M = \min_i \frac{|f(x_i)|}{||\beta||}\]

Goal: Maximize M

Support Vectors

  • Training points that lie on the margin

  • Only these points affect the hyperplane!

  • All other points can move without changing solution

  • Typically only small fraction of training points

Advantages

✅ Effective in high dimensions
✅ Memory efficient (only uses support vectors)
✅ Versatile (different kernels for different data)
✅ Works well with clear margin
✅ Robust to outliers (with proper C tuning)

Disadvantages

❌ Computationally expensive (O(n²) to O(n³))
❌ Difficult for large datasets (n > 10,000)
❌ Sensitive to kernel choice
❌ No probabilistic interpretation (by default)
❌ Hyperparameter tuning critical

9.1 Maximal Margin Classifier

Optimization Problem

Maximize: \(M\)
Subject to:
\[||\beta|| = 1\]
\[y_i (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \geq M \quad \forall i\]

where:

  • \(M\) = margin width

  • \(y_i \in \{-1, +1\}\) = class labels

  • \(||\beta|| = 1\) = normalization constraint

Geometric Interpretation

  • Hyperplane: \(\beta_0 + \beta^T x = 0\)

  • Margin boundaries: \(\beta_0 + \beta^T x = \pm M\)

  • Support vectors: Points on margin boundaries

  • Decision: Sign of \(\beta_0 + \beta^T x\)

Limitation

Requires perfect linear separability - rare in practice!

# Maximal Margin Classifier Demo (Linearly Separable Data)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

# Generate linearly separable data
np.random.seed(42)
X_sep = np.random.randn(40, 2)
y_sep = np.ones(40)
X_sep[:20] -= 2
y_sep[:20] = -1

# Fit SVC with large C (hard margin approximation)
svm_hard = SVC(kernel='linear', C=1e10)
svm_hard.fit(X_sep, y_sep)

# Plotting function for decision boundary
def plot_svm_boundary(X, y, model, title, ax):
    # Create mesh
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # Predict on mesh
    Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot decision boundary and margins
    ax.contourf(xx, yy, Z, levels=[-1e10, 0, 1e10], colors=['#FFAAAA', '#AAAAFF'], alpha=0.3)
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors=['k', 'k', 'k'], 
              linestyles=['--', '-', '--'], linewidths=[1, 2, 1])
    
    # Plot points
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', s=100, edgecolors='k')
    
    # Highlight support vectors
    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
              s=300, linewidth=2, facecolors='none', edgecolors='green', label='Support Vectors')
    
    ax.set_xlabel('X₁')
    ax.set_ylabel('X₂')
    ax.set_title(title)
    ax.legend()
    ax.grid(True, alpha=0.3)

fig, ax = plt.subplots(1, 1, figsize=(10, 6))
plot_svm_boundary(X_sep, y_sep, svm_hard,
                 f'Maximal Margin Classifier\n{len(svm_hard.support_vectors_)} Support Vectors', ax)
plt.show()

print(f"\n📊 Maximal Margin Classifier:")
print(f"   • Number of support vectors: {len(svm_hard.support_vectors_)}")
print(f"   • Training accuracy: {svm_hard.score(X_sep, y_sep):.3f}")
print(f"   • Margin width: {2 / np.linalg.norm(svm_hard.coef_):.3f}")
print(f"\n💡 Solid line = decision boundary (hyperplane)")
print(f"   Dashed lines = margin boundaries")
print(f"   Green circles = support vectors (on margin)")

9.2 Support Vector Classifier (Soft Margin)

The Problem

Real data is rarely perfectly separable!

The Solution: Soft Margin

Allow some points to:

  1. Be on the wrong side of margin

  2. Be misclassified

Optimization with Slack Variables

Maximize: \(M\)
Subject to:
\[y_i (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \geq M(1 - \epsilon_i)\]
\[\epsilon_i \geq 0, \quad \sum_{i=1}^n \epsilon_i \leq C\]

where:

  • \(\epsilon_i\) = slack variable for observation i

  • \(\epsilon_i = 0\): Correct side of margin

  • \(0 < \epsilon_i < 1\): Wrong side of margin, but correct class

  • \(\epsilon_i > 1\): Misclassified

  • \(C\) = budget for total violations in this formulation (note: sklearn's C parameter, used in the code below, is inversely related — large sklearn C means few violations allowed)

Tuning Parameter C

Large C:

  • Few violations allowed

  • Narrow margin

  • Low bias, high variance

  • Risk of overfitting

Small C:

  • Many violations allowed

  • Wide margin

  • High bias, low variance

  • More robust, may underfit

Typical values: 0.01, 0.1, 1, 10, 100

Equivalent Formulation

Minimize: \[\frac{1}{2}||\beta||^2 + C \sum_{i=1}^n \epsilon_i\]

This is a regularization problem:

  • First term: Model complexity

  • Second term: Training error

  • C controls the tradeoff, playing the role of 1/λ in ridge/lasso (large C = weak regularization)
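As a sanity check, the penalized objective above can be evaluated by hand for a fitted linear SVC. This is a sketch on made-up toy data; it uses the fact that, at the optimum, each slack variable εᵢ equals the hinge loss max(0, 1 − yᵢ f(xᵢ)):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-class data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

C = 1.0
svm = SVC(kernel='linear', C=C).fit(X, y)

w = svm.coef_.ravel()
b = svm.intercept_[0]

# Slack variables at the optimum = hinge losses max(0, 1 - y_i f(x_i))
slack = np.maximum(0, 1 - y * (X @ w + b))
objective = 0.5 * w @ w + C * slack.sum()   # (1/2)||beta||^2 + C * sum(eps_i)

print(f"margin violations (eps_i > 0): {int((slack > 0).sum())}")
print(f"objective value: {objective:.3f}")
```

Points strictly outside the margin contribute nothing to the second term, which is exactly why only support vectors matter.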

# Support Vector Classifier: Effect of C

# Generate overlapping data
np.random.seed(42)
X_overlap = np.random.randn(100, 2)
y_overlap = np.ones(100)
X_overlap[:50] -= 1
y_overlap[:50] = -1
X_overlap[40:50] += 1.5  # Create overlap

# Try different C values
C_values = [0.01, 0.1, 1, 100]
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, C in enumerate(C_values):
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_overlap, y_overlap)
    
    plot_svm_boundary(X_overlap, y_overlap, svm,
                     f'C = {C}\n{len(svm.support_vectors_)} Support Vectors\n'
                     f'Accuracy: {svm.score(X_overlap, y_overlap):.3f}',
                     axes[idx])

plt.tight_layout()
plt.show()

print("\n💡 Effect of C:")
print("   • C = 0.01:  Very wide margin, many support vectors, smooth boundary")
print("   • C = 0.1:   Wide margin, still many support vectors")
print("   • C = 1:     Moderate margin (often good default)")
print("   • C = 100:   Narrow margin, fewer support vectors, tries to fit all points")
print("\n   Smaller C → wider margin → more bias, less variance")
print("   Larger C → narrower margin → less bias, more variance")

9.3 Support Vector Machines with Kernels

The Non-Linear Problem

Many datasets are not linearly separable in the original space.

The Kernel Trick

Idea: Map data to higher-dimensional space where it becomes linearly separable

\[\phi: \mathbb{R}^p \rightarrow \mathbb{R}^q \quad (q \gg p)\]

But computing \(\phi(x)\) explicitly can be expensive or impossible!

Solution: Use kernel function \[K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle\]

Compute inner product in high-dimensional space without explicitly going there!
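To make this concrete, here is a small sketch (toy vectors; the helper names poly_kernel and phi are made up for illustration) showing that the degree-2 polynomial kernel equals an ordinary inner product under an explicit 6-dimensional feature map:

```python
import numpy as np

def poly_kernel(x, z):
    # Degree-2 polynomial kernel: K(x, z) = (1 + x.z)^2, computed in the original 2-D space
    return (1 + x @ z) ** 2

def phi(x):
    # Explicit degree-2 feature map for 2-D input: 6 dimensions
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

print(poly_kernel(x, z))    # 0.25 — computed in 2-D
print(phi(x) @ phi(z))      # 0.25 — same value via the explicit 6-D map
```

The kernel gives the same number without ever constructing the 6-dimensional vectors; for the RBF kernel the implicit space is infinite-dimensional, so the explicit route is not even possible.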

Common Kernels

1. Linear Kernel

\[K(x_i, x_j) = x_i^T x_j\]
  • Standard inner product

  • Same as Support Vector Classifier

2. Polynomial Kernel

\[K(x_i, x_j) = (1 + x_i^T x_j)^d\]
  • \(d\) = degree (usually 2, 3, or 4)

  • Creates polynomial decision boundaries

  • Higher \(d\) → more flexible, but can overfit

3. Radial Basis Function (RBF) / Gaussian Kernel

\[K(x_i, x_j) = \exp(-\gamma ||x_i - x_j||^2)\]
  • \(\gamma\) = kernel coefficient (sklearn default 'scale' = 1/(n_features × X.var()))

  • Most popular kernel

  • Infinite-dimensional feature space!

  • Creates local, non-linear boundaries

Effect of γ:

  • Large γ: Narrow influence, complex boundary, overfitting risk

  • Small γ: Wide influence, smooth boundary, underfitting risk

4. Sigmoid Kernel

\[K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)\]
  • Similar to neural networks

  • Less commonly used

Kernel Selection Guidelines

  1. Start with RBF - usually best default

  2. Try Linear if:

    • Many features (p large)

    • Linear separability suspected

    • Need interpretability

  3. Try Polynomial for:

    • Interaction effects important

    • Known polynomial relationships

  4. Use cross-validation to choose!

# Kernel Comparison on Non-Linear Data
from sklearn.datasets import make_circles, make_moons

# Generate non-linear datasets
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=42)
X_moons, y_moons = make_moons(n_samples=200, noise=0.15, random_state=42)

# Convert labels to -1, 1
y_circles = 2 * y_circles - 1
y_moons = 2 * y_moons - 1

# Kernels to test
kernels = ['linear', 'poly', 'rbf']
kernel_params = {
    'linear': {},
    'poly': {'degree': 3},
    'rbf': {'gamma': 'scale'}
}

# Plot for circles dataset
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

for idx, kernel in enumerate(kernels):
    # Circles
    svm_circles = SVC(kernel=kernel, C=1, **kernel_params[kernel])
    svm_circles.fit(X_circles, y_circles)
    plot_svm_boundary(X_circles, y_circles, svm_circles,
                     f'Circles: {kernel.upper()} kernel\n'
                     f'Acc: {svm_circles.score(X_circles, y_circles):.3f}',
                     axes[0, idx])
    
    # Moons
    svm_moons = SVC(kernel=kernel, C=1, **kernel_params[kernel])
    svm_moons.fit(X_moons, y_moons)
    plot_svm_boundary(X_moons, y_moons, svm_moons,
                     f'Moons: {kernel.upper()} kernel\n'
                     f'Acc: {svm_moons.score(X_moons, y_moons):.3f}',
                     axes[1, idx])

plt.tight_layout()
plt.show()

print("\n💡 Kernel Observations:")
print("   • LINEAR: Fails on non-linear data (straight line boundary)")
print("   • POLY: Works on moons, struggles with circles")
print("   • RBF: Handles both datasets well (most flexible)")
print("\n   RBF is often the best default choice!")

# Effect of gamma parameter in RBF kernel

gamma_values = [0.01, 0.1, 1, 10]
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, gamma in enumerate(gamma_values):
    svm = SVC(kernel='rbf', C=1, gamma=gamma)
    svm.fit(X_moons, y_moons)
    
    plot_svm_boundary(X_moons, y_moons, svm,
                     f'RBF Kernel: γ = {gamma}\n'
                     f'Accuracy: {svm.score(X_moons, y_moons):.3f}\n'
                     f'{len(svm.support_vectors_)} Support Vectors',
                     axes[idx])

plt.tight_layout()
plt.show()

print("\n💡 Effect of γ in RBF kernel:")
print("   • γ = 0.01:  Very smooth, wide influence (underfitting)")
print("   • γ = 0.1:   Balanced, good generalization")
print("   • γ = 1:     More complex boundary")
print("   • γ = 10:    Very wiggly, tight around points (overfitting)")
print("\n   Larger γ → more complex boundary → higher variance")
print("   Smaller γ → smoother boundary → higher bias")

9.4 Multi-class SVMs

SVMs are inherently binary classifiers. For K > 2 classes:

One-vs-One (OVO)

  • Train \(\binom{K}{2}\) classifiers, one for each pair of classes

  • For K=3: 3 classifiers (1 vs 2, 1 vs 3, 2 vs 3)

  • For K=10: 45 classifiers

  • Prediction: Vote among all classifiers

  • Pros: Each classifier only sees relevant data

  • Cons: Many classifiers to train

  • sklearn default for SVC

One-vs-Rest (OVR) / One-vs-All

  • Train K classifiers, one for each class vs all others

  • For K=10: 10 classifiers

  • Prediction: Choose class with highest confidence

  • Pros: Fewer classifiers

  • Cons: Imbalanced training data

sklearn's SVC automatically handles multi-class using OVO.
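The two strategies can also be compared directly using sklearn's OneVsOneClassifier and OneVsRestClassifier wrappers. A sketch on the 10-class digits dataset (the dataset choice is illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # 10 classes

# Wrap the same base classifier in each multi-class strategy
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)

print(len(ovo.estimators_))  # K(K-1)/2 = 45 pairwise classifiers
print(len(ovr.estimators_))  # K = 10 one-vs-rest classifiers
```

With 10 classes the OVO cost (45 fits) is visible, but each pairwise fit sees only two classes' worth of data, so individual fits are small and fast.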

# Multi-class SVM Example

# Generate 3-class data
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X_multi, y_multi = make_blobs(n_samples=300, centers=3, n_features=2,
                              cluster_std=1.0, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_multi, y_multi, test_size=0.3, random_state=42)

# Train SVM (automatically handles multi-class)
svm_multi = SVC(kernel='rbf', C=1, gamma='scale')
svm_multi.fit(X_train, y_train)

# Plotting
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Decision boundary
h = 0.02
x_min, x_max = X_multi[:, 0].min() - 1, X_multi[:, 0].max() + 1
y_min, y_max = X_multi[:, 1].min() - 1, X_multi[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = svm_multi.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

ax1.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
ax1.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis',
           s=100, edgecolors='k', marker='o', label='Train')
ax1.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='viridis',
           s=100, edgecolors='k', marker='s', label='Test')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.set_title(f'Multi-class SVM (3 classes)\n'
             f'Train Acc: {svm_multi.score(X_train, y_train):.3f}, '
             f'Test Acc: {svm_multi.score(X_test, y_test):.3f}')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Confusion matrix
import seaborn as sns
from sklearn.metrics import confusion_matrix
y_pred = svm_multi.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax2)
ax2.set_xlabel('Predicted')
ax2.set_ylabel('Actual')
ax2.set_title('Confusion Matrix')

plt.tight_layout()
plt.show()

print(f"\n📊 Multi-class SVM:")
print(f"   • Number of classes: {len(np.unique(y_multi))}")
n_classes = len(svm_multi.n_support_)
print(f"   • Number of binary classifiers trained: {n_classes * (n_classes - 1) // 2}")
print(f"   • Total support vectors: {len(svm_multi.support_vectors_)}")
print(f"   • Support vectors per class: {svm_multi.n_support_}")
print(f"\n   sklearn's SVC uses One-vs-One (OVO) strategy")
print(f"   For 3 classes: trains 3 binary classifiers")

9.5 Real Dataset Example: Breast Cancer

Hyperparameter Tuning

Key parameters to tune:

  • C: Regularization (0.1, 1, 10, 100)

  • gamma: RBF kernel width (0.001, 0.01, 0.1, 1)

  • kernel: Linear, RBF, Poly

Use GridSearchCV for systematic search.

# Breast Cancer Classification with SVM
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Load data
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
feature_names = data.feature_names

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# IMPORTANT: Scale features for SVM!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("🔍 Hyperparameter Tuning with GridSearchCV...\n")

# Grid search
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}

grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.4f}")

# Best model
best_svm = grid.best_estimator_
y_pred = best_svm.predict(X_test_scaled)

print(f"\nTest accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')
axes[0].set_xticklabels(data.target_names)
axes[0].set_yticklabels(data.target_names)

# Grid search heatmap (for RBF kernel)
rbf_results = grid.cv_results_
C_vals = [0.1, 1, 10, 100]
gamma_vals = [0.001, 0.01, 0.1, 1]

scores = np.zeros((len(gamma_vals), len(C_vals)))
for i, gamma in enumerate(gamma_vals):
    for j, C in enumerate(C_vals):
        idx = [k for k, p in enumerate(rbf_results['params']) 
               if p.get('kernel') == 'rbf' and p['C'] == C and p['gamma'] == gamma]
        if idx:
            scores[i, j] = rbf_results['mean_test_score'][idx[0]]

sns.heatmap(scores, annot=True, fmt='.3f', cmap='RdYlGn', 
           xticklabels=C_vals, yticklabels=gamma_vals, ax=axes[1])
axes[1].set_xlabel('C')
axes[1].set_ylabel('γ (gamma)')
axes[1].set_title('Grid Search Results (RBF Kernel)\nCV Accuracy')

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("   • Feature scaling is CRITICAL for SVM!")
print("   • Grid search found optimal C and gamma")
print("   • RBF kernel often works best")
print("   • SVM achieves excellent performance on this dataset")

9.6 Support Vector Regression (SVR)

SVMs can also be used for regression!

Key Difference from Classification

Classification: Find the hyperplane with maximum margin
Regression: Find a function that keeps the training points inside an ε-insensitive tube

ε-Insensitive Loss

Ignore errors smaller than ε: \[L_\epsilon(y, f(x)) = \begin{cases} 0 & \text{if } |y - f(x)| \leq \epsilon \\ |y - f(x)| - \epsilon & \text{otherwise} \end{cases}\]
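A minimal NumPy sketch of this loss (the function name eps_insensitive_loss is made up for illustration):

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    # Zero loss inside the eps-tube, linear loss outside it
    residual = np.abs(y_true - y_pred)
    return np.maximum(0, residual - eps)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 2.0])
print(eps_insensitive_loss(y_true, y_pred, eps=0.1))  # [0.  0.4 0.9]
```

The first point sits inside the tube (residual 0.05 ≤ ε) and contributes nothing; the other two pay only for the part of the residual that exceeds ε.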

Support Vectors in SVR

Points on or outside the ε-tube (prediction errors ≥ ε)

Parameters

  • C: Regularization (same as SVC)

  • epsilon: Width of tube (default: 0.1)

  • kernel, gamma: Same as SVC

# Support Vector Regression Example
from sklearn.svm import SVR

# Generate regression data
np.random.seed(42)
n = 100
X_reg = np.linspace(0, 10, n).reshape(-1, 1)
y_reg = np.sin(X_reg).ravel() + np.random.randn(n) * 0.3

# Compare different epsilon values
epsilons = [0.05, 0.1, 0.3, 0.5]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

X_plot = np.linspace(0, 10, 300).reshape(-1, 1)

for idx, eps in enumerate(epsilons):
    svr = SVR(kernel='rbf', C=100, epsilon=eps, gamma='scale')
    svr.fit(X_reg, y_reg)
    y_pred = svr.predict(X_plot)
    
    # Plot data and prediction
    axes[idx].scatter(X_reg, y_reg, alpha=0.5, s=50, label='Data')
    axes[idx].plot(X_plot, y_pred, 'r-', linewidth=2, label='SVR')
    
    # Plot epsilon tube
    axes[idx].fill_between(X_plot.ravel(), 
                          y_pred - eps, y_pred + eps,
                          alpha=0.2, color='red', label=f'ε-tube (ε={eps})')
    
    # Highlight support vectors
    axes[idx].scatter(X_reg[svr.support_], y_reg[svr.support_],
                     s=200, facecolors='none', edgecolors='green', 
                     linewidth=2, label=f'{len(svr.support_)} SVs')
    
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('y')
    axes[idx].set_title(f'SVR: ε = {eps}\n'
                       f'R² = {svr.score(X_reg, y_reg):.3f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Effect of ε in SVR:")
print("   • Smaller ε: Narrower tube, more support vectors, closer fit")
print("   • Larger ε: Wider tube, fewer support vectors, smoother fit")
print("   • Points inside tube: not support vectors (zero penalty)")
print("   • Points outside tube: support vectors (contribute to model)")

Key Takeaways

When to Use SVMs

Good For: ✅ High-dimensional data (text, genomics)
✅ Clear margin of separation exists
✅ More features than samples (p > n)
✅ Non-linear boundaries (with kernels)
✅ Binary classification

Not Ideal For: ❌ Very large datasets (n > 10,000)
❌ Noisy data with overlapping classes
❌ Need probability estimates
❌ Need interpretability
❌ Real-time prediction critical

Best Practices

1. Always Scale Features

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

SVMs are sensitive to feature scales!

2. Start with RBF Kernel

svm = SVC(kernel='rbf', C=1, gamma='scale')

Best default for most problems

3. Tune Hyperparameters

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}
GridSearchCV(SVC(), param_grid, cv=5)

4. Use Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

Prevents data leakage!

Hyperparameter Guidelines

Parameter C (Regularization)

  • C → ∞: Hard margin, low bias, high variance

  • C → 0: Soft margin, high bias, low variance

  • Typical: Try 0.1, 1, 10, 100

  • Default: 1 (often reasonable)

Parameter γ (RBF kernel)

  • γ → ∞: High complexity, overfitting

  • γ → 0: Low complexity, underfitting

  • Typical: Try 0.001, 0.01, 0.1, 1

  • Default: 'scale' = 1/(n_features × X.var())

Kernel Selection

  • Linear: Fast, interpretable, high-dimensional

  • RBF: Flexible, non-linear, default choice

  • Poly: Specific polynomial relationships

  • Sigmoid: Rarely used

Comparison with Other Methods

| Aspect           | SVM           | Logistic Regression | Random Forest |
|------------------|---------------|---------------------|---------------|
| Speed            | Slow (O(n²))  | Fast                | Medium        |
| High-dim         | Excellent     | Good                | Good          |
| Non-linear       | Yes (kernels) | No (needs features) | Yes           |
| Interpretability | Low           | High                | Medium        |
| Tuning           | Critical      | Easy                | Less critical |
| Probabilities    | Not native    | Native              | Native        |
| Large data       | Poor          | Good                | Good          |

Common Pitfalls

❌ Forgetting to scale features → Poor performance
❌ Using default parameters → Suboptimal results
❌ Wrong kernel choice → Missing patterns
❌ Too large C or gamma → Overfitting
❌ Not using cross-validation → Unreliable estimates
❌ Applying to huge datasets → Extremely slow

Practical Workflow

  1. Preprocess: Scale features (StandardScaler)

  2. Baseline: Try linear kernel first

  3. Non-linear: Try RBF with default parameters

  4. Tune: Grid search over C and gamma

  5. Validate: Use cross-validation

  6. Evaluate: Test set performance

  7. Compare: Try other methods (RF, XGBoost)

Advanced Tips

For Large Datasets:

  • Use LinearSVC (faster than SVC(kernel='linear'))

  • Consider subsampling

  • Try kernel approximation (Nystroem, RBFSampler)

For Imbalanced Data:

  • Use class_weight='balanced'

  • Or manually set class_weight={0: w0, 1: w1}
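The effect of class_weight can be checked on synthetic imbalanced data. A sketch (sample sizes and class ratio are illustrative; the lift in minority-class recall is typical, not guaranteed):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Imbalanced 2-class toy data: roughly 90% / 10%
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

plain = SVC(kernel='rbf', gamma='scale').fit(X, y)
balanced = SVC(kernel='rbf', gamma='scale', class_weight='balanced').fit(X, y)

# Recall on the minority class: balancing reweights C per class
minority = y == 1
recall_plain = (plain.predict(X[minority]) == 1).mean()
recall_balanced = (balanced.predict(X[minority]) == 1).mean()
print(recall_plain, recall_balanced)
```

class_weight='balanced' scales each class's C by n_samples / (n_classes × class count), so errors on the rare class cost more.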

For Probability Estimates:

  • Use probability=True in SVC

  • Slower (uses cross-validation internally)

  • Then can use predict_proba()
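A sketch of the probability workflow (dataset chosen arbitrarily; scaling included per the best practices above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# probability=True enables Platt scaling via internal cross-validation (slower to fit)
svm = SVC(kernel='rbf', gamma='scale', probability=True).fit(X_train, y_train)

proba = svm.predict_proba(X_test[:3])
print(proba.shape)         # (3, 2): one column per class, in svm.classes_ order
print(proba.sum(axis=1))   # each row sums to 1
```

Note that predict() and predict_proba() can occasionally disagree near the boundary, because the probabilities come from a separately calibrated model.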

sklearn Implementation Notes

SVC vs LinearSVC:

  • SVC(kernel='linear'): Uses libsvm (slower, but supports the full SVC interface)

  • LinearSVC: Uses liblinear (much faster on large datasets, linear kernel only)

  • For linear kernels on large data: use LinearSVC
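A quick sketch contrasting the two linear solvers on synthetic data (sample sizes are illustrative; max_iter is raised to avoid convergence warnings):

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X = StandardScaler().fit_transform(X)  # scaling helps both solvers converge

libsvm_linear = SVC(kernel='linear', C=1).fit(X, y)       # libsvm backend
liblinear = LinearSVC(C=1, max_iter=5000).fit(X, y)       # liblinear backend

print(libsvm_linear.score(X, y), liblinear.score(X, y))   # typically very close
```

The decision functions differ slightly (LinearSVC uses squared hinge loss and also penalizes the intercept by default), but on scaled data the fitted boundaries are usually near-identical, with LinearSVC scaling far better in n.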

SVR vs LinearSVR:

  • Same distinction as classification

Next Chapter

Chapter 10: Deep Learning

  • Neural Networks

  • Activation Functions

  • Backpropagation

  • Convolutional Neural Networks

  • Recurrent Neural Networks

Practice Exercises

Exercise 1: Margin Analysis

  1. Generate linearly separable data

  2. Fit SVC with different C values (0.01, 0.1, 1, 10, 100)

  3. For each, calculate and plot:

    • Margin width

    • Number of support vectors

    • Training and test accuracy

  4. Visualize the relationship between C and these metrics

Exercise 2: Kernel Comparison

Using make_classification with varying parameters:

  1. Create datasets with different separability

  2. Test all kernels (linear, poly degree 2-4, RBF)

  3. Record training time and accuracy for each

  4. Create recommendation matrix: dataset type β†’ best kernel

Exercise 3: Hyperparameter Sensitivity

  1. Generate non-linear data (circles or moons)

  2. Create heatmap: C (x-axis) vs gamma (y-axis) vs accuracy (color)

  3. Use at least 10 values for each parameter

  4. Identify optimal region and overfitting region

  5. Plot decision boundaries for corners and center

Exercise 4: Feature Scaling Impact

Using breast cancer dataset:

  1. Train SVM on raw features (no scaling)

  2. Train SVM on standardized features

  3. Train SVM on normalized features (min-max)

  4. Compare accuracy, training time, support vectors

  5. Explain why scaling matters

Exercise 5: Multi-class Strategy

Using iris or digits dataset:

  1. Implement One-vs-One manually

  2. Implement One-vs-Rest manually

  3. Compare with sklearn's default

  4. Analyze which class pairs are hardest to separate

  5. Visualize decision regions (use PCA if needed)

Exercise 6: SVR Parameter Exploration

  1. Generate non-linear regression data

  2. Test epsilon values: 0.01, 0.05, 0.1, 0.3, 0.5

  3. Test C values: 0.1, 1, 10, 100

  4. For each combination, record:

    • Number of support vectors

    • R² score

    • Prediction smoothness

  5. Identify sweet spot for bias-variance tradeoff