Chapter 9: Support Vector Machines
Overview
Support Vector Machines (SVMs) are powerful classifiers that find the optimal separating hyperplane between classes.
Evolution of SVMs
1. Maximal Margin Classifier
Linearly separable data only
Finds hyperplane with maximum margin
Margin = distance from hyperplane to nearest point
Problem: Too restrictive (requires perfect separation)
2. Support Vector Classifier (Soft Margin)
Allows some misclassifications
Introduces slack variables εᵢ
Tuning parameter C controls bias-variance tradeoff
Works when data not perfectly separable
3. Support Vector Machine (Kernel Trick)
Handles non-linear boundaries
Uses kernel functions to implicitly map to higher dimensions
Common kernels: Linear, Polynomial, RBF (Gaussian), Sigmoid
Most flexible and powerful
Key Concepts
Hyperplane
In p dimensions: \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0\)
Classification rule: \(f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\)
If \(f(X) > 0\): Class +1
If \(f(X) < 0\): Class -1
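As a minimal sketch of this rule (the coefficients below are made-up, purely illustrative values), classification is just the sign of \(f(X)\):

```python
import numpy as np

# Hypothetical coefficients for a 2-feature hyperplane (illustrative values only)
beta0 = -1.0
beta = np.array([2.0, -0.5])

def f(X):
    """Evaluate f(X) = beta0 + beta^T X for each row of X."""
    return beta0 + X @ beta

X = np.array([[1.0, 0.0],    # f = -1 + 2 = 1  -> class +1
              [0.0, 4.0]])   # f = -1 - 2 = -3 -> class -1
print(np.sign(f(X)))
```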
Margin
Distance from hyperplane to nearest training point: \(M = \min_i \frac{|f(x_i)|}{||\beta||}\)
Goal: Maximize M
Support Vectors
Training points that lie on the margin
Only these points affect the hyperplane!
All other points can move without changing solution
Typically only small fraction of training points
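A quick way to see this on synthetic data (a sketch, not part of the chapter's demos): drop every non-support-vector point, refit, and check that the decision rule is unchanged.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters, so a near-hard-margin fit is possible
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 3, rng.randn(20, 2) + 3])
y = np.array([-1] * 20 + [1] * 20)

svm_full = SVC(kernel='linear', C=1e6).fit(X, y)
sv = svm_full.support_                      # indices of the support vectors
svm_sv = SVC(kernel='linear', C=1e6).fit(X[sv], y[sv])  # refit on SVs only

print(len(sv), "support vectors out of", len(X), "points")
print((svm_full.predict(X) == svm_sv.predict(X)).all())  # same decisions
```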
Advantages
✅ Effective in high dimensions
✅ Memory efficient (only uses support vectors)
✅ Versatile (different kernels for different data)
✅ Works well with clear margin
✅ Robust to outliers (with proper C tuning)
Disadvantages
❌ Computationally expensive (O(n²) to O(n³))
❌ Difficult for large datasets (n > 10,000)
❌ Sensitive to kernel choice
❌ No probabilistic interpretation (by default)
❌ Hyperparameter tuning critical
9.1 Maximal Margin Classifier
Optimization Problem
Maximize: \(M\)
Subject to:
\(||\beta|| = 1\)
\(y_i (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \geq M \quad \forall i\)
where:
\(M\) = margin width
\(y_i \in \{-1, +1\}\) = class labels
\(||\beta|| = 1\) = normalization constraint
Geometric Interpretation
Hyperplane: \(\beta_0 + \beta^T x = 0\)
Margin boundaries: \(\beta_0 + \beta^T x = \pm M\)
Support vectors: Points on margin boundaries
Decision: Sign of \(\beta_0 + \beta^T x\)
Limitation
Requires perfect linear separability - rare in practice!
# Maximal Margin Classifier Demo (Linearly Separable Data)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

# Generate linearly separable data
np.random.seed(42)
X_sep = np.random.randn(40, 2)
y_sep = np.ones(40)
X_sep[:20] -= 2
y_sep[:20] = -1

# Fit SVC with large C (hard margin approximation)
svm_hard = SVC(kernel='linear', C=1e10)
svm_hard.fit(X_sep, y_sep)

# Plotting function for decision boundary
def plot_svm_boundary(X, y, model, title, ax):
    # Create mesh
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # Predict on mesh
    Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot decision boundary and margins
    ax.contourf(xx, yy, Z, levels=[-1e10, 0, 1e10], colors=['#FFAAAA', '#AAAAFF'], alpha=0.3)
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors=['k', 'k', 'k'],
               linestyles=['--', '-', '--'], linewidths=[1, 2, 1])
    # Plot points
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', s=100, edgecolors='k')
    # Highlight support vectors
    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
               s=300, linewidth=2, facecolors='none', edgecolors='green', label='Support Vectors')
    ax.set_xlabel('X₁')
    ax.set_ylabel('X₂')
    ax.set_title(title)
    ax.legend()
    ax.grid(True, alpha=0.3)

fig, ax = plt.subplots(1, 1, figsize=(10, 6))
plot_svm_boundary(X_sep, y_sep, svm_hard,
                  f'Maximal Margin Classifier\n{len(svm_hard.support_vectors_)} Support Vectors', ax)
plt.show()

print(f"\n📊 Maximal Margin Classifier:")
print(f"  • Number of support vectors: {len(svm_hard.support_vectors_)}")
print(f"  • Training accuracy: {svm_hard.score(X_sep, y_sep):.3f}")
print(f"  • Margin width: {2 / np.linalg.norm(svm_hard.coef_):.3f}")
print(f"\n💡 Solid line = decision boundary (hyperplane)")
print(f"   Dashed lines = margin boundaries")
print(f"   Green circles = support vectors (on margin)")
9.2 Support Vector Classifier (Soft Margin)
The Problem
Real data is rarely perfectly separable!
The Solution: Soft Margin
Allow some points to:
Be on the wrong side of margin
Be misclassified
Optimization with Slack Variables
Maximize: \(M\)
Subject to:
\(y_i (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \geq M(1 - \epsilon_i)\)
\(\epsilon_i \geq 0, \quad \sum_{i=1}^n \epsilon_i \leq C\)
where:
\(\epsilon_i\) = slack variable for observation i
\(\epsilon_i = 0\): Correct side of margin
\(0 < \epsilon_i < 1\): Wrong side of margin, but correct class
\(\epsilon_i > 1\): Misclassified
\(C\) = budget for total violations in this formulation (note: sklearn's C parameter, discussed below, instead penalizes violations, so it behaves inversely to this budget)
Tuning Parameter C
Large C:
Few violations allowed
Narrow margin
Low bias, high variance
Risk of overfitting
Small C:
Many violations allowed
Wide margin
High bias, low variance
More robust, may underfit
Typical values: 0.01, 0.1, 1, 10, 100
Equivalent Formulation
Minimize: \(\frac{1}{2}||\beta||^2 + C \sum_{i=1}^n \epsilon_i\)
This is a regularization problem:
First term: Model complexity
Second term: Training error
C controls the tradeoff (analogous to λ in ridge/lasso, but inverted: larger C penalizes training errors more, i.e. less regularization)
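To make the equivalence concrete, here is a sketch (on synthetic data) that evaluates this objective for a fitted linear SVC; the slack \(\epsilon_i\) for each point is just its hinge loss \(\max(0, 1 - y_i f(x_i))\):

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping synthetic data so some slacks are nonzero
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])
y = np.array([-1] * 50 + [1] * 50)

C = 1.0
svm = SVC(kernel='linear', C=C).fit(X, y)
beta, beta0 = svm.coef_.ravel(), svm.intercept_[0]

margins = y * (X @ beta + beta0)             # y_i * f(x_i)
slacks = np.maximum(0, 1 - margins)          # epsilon_i = hinge loss per point
objective = 0.5 * beta @ beta + C * slacks.sum()
print(f"objective = {objective:.3f}, margin violations = {(slacks > 0).sum()}")
```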
# Support Vector Classifier: Effect of C
# Generate overlapping data
np.random.seed(42)
X_overlap = np.random.randn(100, 2)
y_overlap = np.ones(100)
X_overlap[:50] -= 1
y_overlap[:50] = -1
X_overlap[40:50] += 1.5  # Create overlap

# Try different C values
C_values = [0.01, 0.1, 1, 100]
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()
for idx, C in enumerate(C_values):
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_overlap, y_overlap)
    plot_svm_boundary(X_overlap, y_overlap, svm,
                      f'C = {C}\n{len(svm.support_vectors_)} Support Vectors\n'
                      f'Accuracy: {svm.score(X_overlap, y_overlap):.3f}',
                      axes[idx])
plt.tight_layout()
plt.show()

print("\n💡 Effect of C:")
print("  • C = 0.01: Very wide margin, many support vectors, smooth boundary")
print("  • C = 0.1: Wide margin, still many support vectors")
print("  • C = 1: Moderate margin (often good default)")
print("  • C = 100: Narrow margin, fewer support vectors, tries to fit all points")
print("\n  Smaller C → wider margin → more bias, less variance")
print("  Larger C → narrower margin → less bias, more variance")
9.3 Support Vector Machines with Kernels
The Non-Linear Problem
Many datasets are not linearly separable in the original space
The Kernel Trick
Idea: Map data to a higher-dimensional space where it becomes linearly separable
But computing \(\phi(x)\) explicitly can be expensive or impossible!
Solution: Use a kernel function \(K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle\)
Compute the inner product in the high-dimensional space without explicitly going there!
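A small numerical check of this idea: for the degree-2 homogeneous polynomial kernel, \((x \cdot z)^2\) equals the inner product of the explicit feature maps \(\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\).

```python
import numpy as np

# Kernel trick check: same number, two ways of computing it
def phi(x):
    """Explicit degree-2 feature map for 2-d input."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
k_direct = (x @ z) ** 2        # kernel: O(p) work in the original space
k_explicit = phi(x) @ phi(z)   # explicit map: O(p^2) features, same answer
print(k_direct, k_explicit)    # both ≈ 121
```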
Common Kernels
1. Linear Kernel
\(K(x_i, x_j) = x_i^T x_j\)
Standard inner product
Same as Support Vector Classifier
2. Polynomial Kernel
\(K(x_i, x_j) = (\gamma\, x_i^T x_j + r)^d\)
\(d\) = degree (usually 2, 3, or 4)
Creates polynomial decision boundaries
Higher \(d\) → more flexible, but can overfit
3. Radial Basis Function (RBF) / Gaussian Kernel
\(K(x_i, x_j) = \exp(-\gamma\, ||x_i - x_j||^2)\)
\(\gamma\) = kernel coefficient (sklearn default: 'scale')
Most popular kernel
Infinite-dimensional feature space!
Creates local, non-linear boundaries
Effect of γ:
Large γ: Narrow influence, complex boundary, overfitting risk
Small γ: Wide influence, smooth boundary, underfitting risk
4. Sigmoid Kernel
\(K(x_i, x_j) = \tanh(\gamma\, x_i^T x_j + r)\)
Similar to neural networks
Less commonly used
Kernel Selection Guidelines
Start with RBF - usually best default
Try Linear if:
Many features (p large)
Linear separability suspected
Need interpretability
Try Polynomial for:
Interaction effects important
Known polynomial relationships
Use cross-validation to choose!
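A minimal sketch of that cross-validated choice, letting GridSearchCV pick the kernel on a toy moons dataset:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Let 5-fold cross-validation pick the kernel instead of guessing
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
grid = GridSearchCV(SVC(C=1),
                    {'kernel': ['linear', 'poly', 'rbf']},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_['kernel'])   # RBF typically wins on moons-shaped data
```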
# Kernel Comparison on Non-Linear Data
from sklearn.datasets import make_circles, make_moons

# Generate non-linear datasets
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=42)
X_moons, y_moons = make_moons(n_samples=200, noise=0.15, random_state=42)

# Convert labels to -1, 1
y_circles = 2 * y_circles - 1
y_moons = 2 * y_moons - 1

# Kernels to test
kernels = ['linear', 'poly', 'rbf']
kernel_params = {
    'linear': {},
    'poly': {'degree': 3},
    'rbf': {'gamma': 'scale'}
}

# Plot both datasets for each kernel
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
for idx, kernel in enumerate(kernels):
    # Circles
    svm_circles = SVC(kernel=kernel, C=1, **kernel_params[kernel])
    svm_circles.fit(X_circles, y_circles)
    plot_svm_boundary(X_circles, y_circles, svm_circles,
                      f'Circles: {kernel.upper()} kernel\n'
                      f'Acc: {svm_circles.score(X_circles, y_circles):.3f}',
                      axes[0, idx])
    # Moons
    svm_moons = SVC(kernel=kernel, C=1, **kernel_params[kernel])
    svm_moons.fit(X_moons, y_moons)
    plot_svm_boundary(X_moons, y_moons, svm_moons,
                      f'Moons: {kernel.upper()} kernel\n'
                      f'Acc: {svm_moons.score(X_moons, y_moons):.3f}',
                      axes[1, idx])
plt.tight_layout()
plt.show()

print("\n💡 Kernel Observations:")
print("  • LINEAR: Fails on non-linear data (straight line boundary)")
print("  • POLY: Works on moons, struggles with circles")
print("  • RBF: Handles both datasets well (most flexible)")
print("\n  RBF is often the best default choice!")
# Effect of gamma parameter in RBF kernel
gamma_values = [0.01, 0.1, 1, 10]
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()
for idx, gamma in enumerate(gamma_values):
    svm = SVC(kernel='rbf', C=1, gamma=gamma)
    svm.fit(X_moons, y_moons)
    plot_svm_boundary(X_moons, y_moons, svm,
                      f'RBF Kernel: γ = {gamma}\n'
                      f'Accuracy: {svm.score(X_moons, y_moons):.3f}\n'
                      f'{len(svm.support_vectors_)} Support Vectors',
                      axes[idx])
plt.tight_layout()
plt.show()

print("\n💡 Effect of γ in RBF kernel:")
print("  • γ = 0.01: Very smooth, wide influence (underfitting)")
print("  • γ = 0.1: Balanced, good generalization")
print("  • γ = 1: More complex boundary")
print("  • γ = 10: Very wiggly, tight around points (overfitting)")
print("\n  Larger γ → more complex boundary → higher variance")
print("  Smaller γ → smoother boundary → higher bias")
9.4 Multi-class SVMs
SVMs are inherently binary classifiers. For K > 2 classes:
One-vs-One (OVO)
Train \(\binom{K}{2}\) classifiers, one for each pair of classes
For K=3: 3 classifiers (1 vs 2, 1 vs 3, 2 vs 3)
For K=10: 45 classifiers
Prediction: Vote among all classifiers
Pros: Each classifier only sees relevant data
Cons: Many classifiers to train
sklearn default for SVC
One-vs-Rest (OVR) / One-vs-All
Train K classifiers, one for each class vs all others
For K=10: 10 classifiers
Prediction: Choose class with highest confidence
Pros: Fewer classifiers
Cons: Imbalanced training data
sklearn's SVC automatically handles multi-class using OVO.
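sklearn also exposes both strategies as explicit wrappers. A sketch on the 10-class digits dataset, where the classifier counts (45 vs 10) make the difference visible:

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Compare the number of binary classifiers each strategy trains (K = 10)
X, y = load_digits(return_X_y=True)

ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)

print("OVO estimators:", len(ovo.estimators_))   # K(K-1)/2 = 45
print("OVR estimators:", len(ovr.estimators_))   # K = 10
```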
# Multi-class SVM Example
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import seaborn as sns

# Generate 3-class data
X_multi, y_multi = make_blobs(n_samples=300, centers=3, n_features=2,
                              cluster_std=1.0, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_multi, y_multi, test_size=0.3, random_state=42)

# Train SVM (automatically handles multi-class)
svm_multi = SVC(kernel='rbf', C=1, gamma='scale')
svm_multi.fit(X_train, y_train)

# Plotting
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Decision boundary
h = 0.02
x_min, x_max = X_multi[:, 0].min() - 1, X_multi[:, 0].max() + 1
y_min, y_max = X_multi[:, 1].min() - 1, X_multi[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = svm_multi.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
ax1.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
ax1.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis',
            s=100, edgecolors='k', marker='o', label='Train')
ax1.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='viridis',
            s=100, edgecolors='k', marker='s', label='Test')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.set_title(f'Multi-class SVM (3 classes)\n'
              f'Train Acc: {svm_multi.score(X_train, y_train):.3f}, '
              f'Test Acc: {svm_multi.score(X_test, y_test):.3f}')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = svm_multi.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax2)
ax2.set_xlabel('Predicted')
ax2.set_ylabel('Actual')
ax2.set_title('Confusion Matrix')
plt.tight_layout()
plt.show()

n_classes = len(np.unique(y_multi))
print(f"\n📊 Multi-class SVM:")
print(f"  • Number of classes: {n_classes}")
print(f"  • Number of binary classifiers (OVO): {n_classes * (n_classes - 1) // 2}")
print(f"  • Total support vectors: {len(svm_multi.support_vectors_)}")
print(f"  • Support vectors per class: {svm_multi.n_support_}")
print(f"\n  sklearn's SVC uses One-vs-One (OVO) strategy")
print(f"  For 3 classes: trains 3 binary classifiers")
9.5 Real Dataset Example: Breast Cancer
Hyperparameter Tuning
Key parameters to tune:
C: Regularization (0.1, 1, 10, 100)
gamma: RBF kernel width (0.001, 0.01, 0.1, 1)
kernel: Linear, RBF, Poly
Use GridSearchCV for systematic search.
# Breast Cancer Classification with SVM
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load data
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# IMPORTANT: Scale features for SVM!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("🔍 Hyperparameter Tuning with GridSearchCV...\n")

# Grid search
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train_scaled, y_train)
print(f"Best parameters: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.4f}")

# Best model
best_svm = grid.best_estimator_
y_pred = best_svm.predict(X_test_scaled)
print(f"\nTest accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')
axes[0].set_xticklabels(data.target_names)
axes[0].set_yticklabels(data.target_names)

# Grid search heatmap (for RBF kernel)
results = grid.cv_results_
C_vals = [0.1, 1, 10, 100]
gamma_vals = [0.001, 0.01, 0.1, 1]
scores = np.zeros((len(gamma_vals), len(C_vals)))
for i, gamma in enumerate(gamma_vals):
    for j, C in enumerate(C_vals):
        idx = [k for k, p in enumerate(results['params'])
               if p.get('kernel') == 'rbf' and p['C'] == C and p['gamma'] == gamma]
        if idx:
            scores[i, j] = results['mean_test_score'][idx[0]]
sns.heatmap(scores, annot=True, fmt='.3f', cmap='RdYlGn',
            xticklabels=C_vals, yticklabels=gamma_vals, ax=axes[1])
axes[1].set_xlabel('C')
axes[1].set_ylabel('γ (gamma)')
axes[1].set_title('Grid Search Results (RBF Kernel)\nCV Accuracy')
plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("  • Feature scaling is CRITICAL for SVM!")
print("  • Grid search found optimal C and gamma")
print("  • RBF kernel often works best")
print("  • SVM achieves excellent performance on this dataset")
9.6 Support Vector Regression (SVR)
SVMs can also be used for regression!
Key Difference from Classification
Classification: Find hyperplane with maximum margin
Regression: Find a function that keeps most points within an ε-insensitive tube
ε-Insensitive Loss
Ignore errors smaller than ε: \(L_\epsilon(y, f(x)) = \begin{cases} 0 & \text{if } |y - f(x)| \leq \epsilon \\ |y - f(x)| - \epsilon & \text{otherwise} \end{cases}\)
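A direct sketch of this loss in NumPy, evaluated on a few residuals:

```python
import numpy as np

# epsilon-insensitive loss: residuals inside the tube cost nothing,
# residuals outside it are penalized linearly beyond epsilon
def eps_loss(y, y_hat, eps=0.1):
    return np.maximum(0.0, np.abs(y - y_hat) - eps)

residuals = np.array([0.05, -0.08, 0.25, -0.60])   # errors around a prediction
print(eps_loss(residuals, 0.0))                    # first two fall inside the tube
```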
Support Vectors in SVR
Points on or outside the ε-tube (prediction errors ≥ ε)
Parameters
C: Regularization (same as SVC)
epsilon: Width of tube (default: 0.1)
kernel, gamma: Same as SVC
# Support Vector Regression Example
from sklearn.svm import SVR

# Generate regression data
np.random.seed(42)
n = 100
X_reg = np.linspace(0, 10, n).reshape(-1, 1)
y_reg = np.sin(X_reg).ravel() + np.random.randn(n) * 0.3

# Compare different epsilon values
epsilons = [0.05, 0.1, 0.3, 0.5]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()
X_plot = np.linspace(0, 10, 300).reshape(-1, 1)
for idx, eps in enumerate(epsilons):
    svr = SVR(kernel='rbf', C=100, epsilon=eps, gamma='scale')
    svr.fit(X_reg, y_reg)
    y_pred = svr.predict(X_plot)
    # Plot data and prediction
    axes[idx].scatter(X_reg, y_reg, alpha=0.5, s=50, label='Data')
    axes[idx].plot(X_plot, y_pred, 'r-', linewidth=2, label='SVR')
    # Plot epsilon tube
    axes[idx].fill_between(X_plot.ravel(),
                           y_pred - eps, y_pred + eps,
                           alpha=0.2, color='red', label=f'ε-tube (ε={eps})')
    # Highlight support vectors
    axes[idx].scatter(X_reg[svr.support_], y_reg[svr.support_],
                      s=200, facecolors='none', edgecolors='green',
                      linewidth=2, label=f'{len(svr.support_)} SVs')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('y')
    axes[idx].set_title(f'SVR: ε = {eps}\n'
                        f'R² = {svr.score(X_reg, y_reg):.3f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Effect of ε in SVR:")
print("  • Smaller ε: Narrower tube, more support vectors, closer fit")
print("  • Larger ε: Wider tube, fewer support vectors, smoother fit")
print("  • Points inside tube: not support vectors (zero penalty)")
print("  • Points outside tube: support vectors (contribute to model)")
Key Takeaways
When to Use SVMs
Good For:
✅ High-dimensional data (text, genomics)
✅ Clear margin of separation exists
✅ More features than samples (p > n)
✅ Non-linear boundaries (with kernels)
✅ Binary classification
Not Ideal For:
❌ Very large datasets (n > 10,000)
❌ Noisy data with overlapping classes
❌ Need probability estimates
❌ Need interpretability
❌ Real-time prediction critical
Best Practices
1. Always Scale Features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
SVMs are sensitive to feature scales!
2. Start with RBF Kernel
svm = SVC(kernel='rbf', C=1, gamma='scale')
Best default for most problems
3. Tune Hyperparameters
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}
GridSearchCV(SVC(), param_grid, cv=5)
4. Use Pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])
Prevents data leakage!
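Combining the practices above, a sketch that grid-searches over the pipeline end to end; parameters are addressed as step__param so the scaler is refit inside every CV fold:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Tune scaler + SVM together; no scaling information leaks across folds
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
grid = GridSearchCV(pipe,
                    {'svm__C': [0.1, 1, 10], 'svm__gamma': [0.01, 0.1]},
                    cv=5)
grid.fit(X_train, y_train)
print(f"test accuracy: {grid.score(X_test, y_test):.3f}")
```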
Hyperparameter Guidelines
Parameter C (Regularization)
C → ∞: Hard margin, low bias, high variance
C → 0: Soft margin, high bias, low variance
Typical: Try 0.1, 1, 10, 100
Default: 1 (often reasonable)
Parameter γ (RBF kernel)
γ → ∞: High complexity, overfitting
γ → 0: Low complexity, underfitting
Typical: Try 0.001, 0.01, 0.1, 1
Default: 'scale' = 1/(n_features × X.var())
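For reference, a quick sketch of what that 'scale' default works out to on a real dataset (illustrative only):

```python
from sklearn.datasets import load_breast_cancer

# gamma='scale' resolves to 1 / (n_features * X.var())
X, _ = load_breast_cancer(return_X_y=True)
gamma_scale = 1.0 / (X.shape[1] * X.var())
print(f"n_features = {X.shape[1]}, gamma('scale') = {gamma_scale:.3e}")
```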
Kernel Selection
Linear: Fast, interpretable, high-dimensional
RBF: Flexible, non-linear, default choice
Poly: Specific polynomial relationships
Sigmoid: Rarely used
Comparison with Other Methods

| Aspect | SVM | Logistic Regression | Random Forest |
|---|---|---|---|
| Speed | Slow (O(n²)) | Fast | Medium |
| High-dim | Excellent | Good | Good |
| Non-linear | Yes (kernels) | No (needs features) | Yes |
| Interpretability | Low | High | Medium |
| Tuning | Critical | Easy | Less critical |
| Probabilities | Not native | Native | Native |
| Large data | Poor | Good | Good |
Common Pitfalls
❌ Forgetting to scale features → Poor performance
❌ Using default parameters → Suboptimal results
❌ Wrong kernel choice → Missing patterns
❌ Too large C or gamma → Overfitting
❌ Not using cross-validation → Unreliable estimates
❌ Applying to huge datasets → Extremely slow
Practical Workflow
Preprocess: Scale features (StandardScaler)
Baseline: Try linear kernel first
Non-linear: Try RBF with default parameters
Tune: Grid search over C and gamma
Validate: Use cross-validation
Evaluate: Test set performance
Compare: Try other methods (RF, XGBoost)
Advanced Tips
For Large Datasets:
Use LinearSVC (faster than SVC(kernel='linear'))
Consider subsampling
Try kernel approximation (Nystroem, RBFSampler)
For Imbalanced Data:
Use class_weight='balanced'
Or manually set class_weight={0: w0, 1: w1}
For Probability Estimates:
Use probability=True in SVC
Slower (uses cross-validation internally)
Then can use predict_proba()
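The kernel-approximation route for large datasets can be sketched as a pipeline: map inputs through Nystroem RBF features, then fit the fast linear solver on top.

```python
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Approximate the RBF kernel with 100 Nystroem features, then train LinearSVC;
# a common recipe when n is too large for the O(n^2) SVC solver
X, y = make_moons(n_samples=2000, noise=0.2, random_state=0)
pipe = Pipeline([
    ('feature_map', Nystroem(gamma=1.0, n_components=100, random_state=0)),
    ('svm', LinearSVC(max_iter=5000)),
])
pipe.fit(X, y)
print(f"training accuracy: {pipe.score(X, y):.3f}")
```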
sklearn Implementation Notes
SVC vs LinearSVC:
SVC(kernel='linear'): Uses libsvm (slower, more features)
LinearSVC: Uses liblinear (faster, fewer features)
For linear kernels on large data: use LinearSVC
SVR vs LinearSVR:
Same distinction as classification
Next Chapter
Chapter 10: Deep Learning
Neural Networks
Activation Functions
Backpropagation
Convolutional Neural Networks
Recurrent Neural Networks
Practice Exercises
Exercise 1: Margin Analysis
Generate linearly separable data
Fit SVC with different C values (0.01, 0.1, 1, 10, 100)
For each, calculate and plot:
Margin width
Number of support vectors
Training and test accuracy
Visualize the relationship between C and these metrics
Exercise 2: Kernel Comparison
Using make_classification with varying parameters:
Create datasets with different separability
Test all kernels (linear, poly degree 2-4, RBF)
Record training time and accuracy for each
Create recommendation matrix: dataset type → best kernel
Exercise 3: Hyperparameter Sensitivity
Generate non-linear data (circles or moons)
Create heatmap: C (x-axis) vs gamma (y-axis) vs accuracy (color)
Use at least 10 values for each parameter
Identify optimal region and overfitting region
Plot decision boundaries for corners and center
Exercise 4: Feature Scaling Impact
Using breast cancer dataset:
Train SVM on raw features (no scaling)
Train SVM on standardized features
Train SVM on normalized features (min-max)
Compare accuracy, training time, support vectors
Explain why scaling matters
Exercise 5: Multi-class Strategy
Using iris or digits dataset:
Implement One-vs-One manually
Implement One-vs-Rest manually
Compare with sklearn's default
Analyze which class pairs are hardest to separate
Visualize decision regions (use PCA if needed)
Exercise 6: SVR Parameter Exploration
Generate non-linear regression data
Test epsilon values: 0.01, 0.05, 0.1, 0.3, 0.5
Test C values: 0.1, 1, 10, 100
For each combination, record:
Number of support vectors
RΒ² score
Prediction smoothness
Identify sweet spot for bias-variance tradeoff