import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torchvision import datasets, transforms
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")
Advanced Adversarial Robustness Theory
1. Foundations and Threat Models
Definition: Adversarial examples are inputs deliberately crafted to cause misclassification:
\[x_{\text{adv}} = x + \delta, \quad \|\delta\|_p \leq \epsilon, \quad f(x_{\text{adv}}) \neq y\]
Where:
\(x\): Original input (clean example)
\(\delta\): Adversarial perturbation
\(\epsilon\): Perturbation budget
\(\|\cdot\|_p\): L_p norm (p ∈ {0, 1, 2, ∞})
Historical Context:
Szegedy et al. (2013): First observation of adversarial examples in deep networks
Goodfellow et al. (2014): FGSM attack - showed linear nature of adversarial vulnerability
Madry et al. (2017): PGD attack and adversarial training - first strong defense
2. Threat Models and Attack Types
2.1 Attack Goals
Untargeted attack:
\[\max_{\|\delta\|_p \leq \epsilon} \mathcal{L}(f(x + \delta), y)\]
Goal: Make the model predict any incorrect class.
Targeted attack:
\[\min_{\|\delta\|_p \leq \epsilon} \mathcal{L}(f(x + \delta), y_{\text{target}})\]
Goal: Make the model predict a specific target class \(y_{\text{target}}\).
2.2 Attack Knowledge (White-box vs Black-box)
White-box: Attacker has full access to:
Model architecture
Model parameters \(\theta\)
Training data (sometimes)
Gradient information
Black-box: Attacker only has:
Query access (input → output)
No gradient information
Limited queries (query budget)
Gray-box: Partial knowledge (e.g., architecture but not weights)
2.3 Perturbation Norms
L_∞ norm (Chebyshev): \(\|\delta\|_\infty = \max_i |\delta_i|\)
Per-pixel perturbation bounded. Most common in practice.
L_2 norm (Euclidean): \(\|\delta\|_2 = \sqrt{\sum_i \delta_i^2}\)
Total energy bounded. Allows larger changes in some pixels.
L_0 norm (sparsity): \(\|\delta\|_0 = |\{i : \delta_i \neq 0\}|\)
Number of changed pixels bounded. Hardest to optimize (NP-hard).
L_1 norm (Manhattan): \(\|\delta\|_1 = \sum_i |\delta_i|\)
Sum of absolute changes bounded.
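The four norms rank the same perturbation very differently; a minimal sketch with illustrative values:

```python
import torch

# Compare the L_p norms of one small perturbation (illustrative values).
delta = torch.tensor([0.03, -0.03, 0.0, 0.01])

l_inf = delta.abs().max()      # largest single-coordinate change
l_2 = delta.norm(p=2)          # total "energy"
l_1 = delta.abs().sum()        # total absolute change
l_0 = (delta != 0).sum()       # number of changed coordinates

print(l_inf.item(), l_2.item(), l_1.item(), l_0.item())
```

Note how a perturbation that is tiny in L_∞ can still be sizeable in L_1 once spread over many pixels.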
3. Gradient-Based Attacks
3.1 Fast Gradient Sign Method (FGSM)
Original formulation (Goodfellow et al., 2014):
\[x_{\text{adv}} = x + \epsilon \cdot \text{sign}(\nabla_x \mathcal{L}(f(x), y))\]
Intuition: Move in the direction that maximizes the loss, with step size \(\epsilon\).
Linear approximation:
\[\mathcal{L}(x + \delta) \approx \mathcal{L}(x) + \delta^T \nabla_x \mathcal{L}\]
Maximize by setting \(\delta = \epsilon \cdot \text{sign}(\nabla_x \mathcal{L})\).
L_∞ constraint: \(\|\delta\|_{\infty} = \epsilon\)
Advantages:
Single gradient computation → very fast
Easy to implement
Limitations:
Single-step → weaker than iterative methods
Suboptimal for non-linear models
Targeted FGSM:
\[x_{\text{adv}} = x - \epsilon \cdot \text{sign}(\nabla_x \mathcal{L}(f(x), y_{\text{target}}))\]
Minimize loss for target class (note the negative sign).
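A minimal sketch of the targeted variant, assuming a generic PyTorch classifier; `model`, `x`, `y_target`, and `epsilon` are placeholders:

```python
import torch
import torch.nn.functional as F

def targeted_fgsm(model, x, y_target, epsilon=0.03):
    """One-step targeted FGSM: step *against* the loss gradient for the
    target class, so the loss w.r.t. y_target decreases."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_target)
    loss.backward()
    x_adv = x - epsilon * x.grad.sign()   # note the minus sign
    return x_adv.clamp(0, 1).detach()     # stay in valid pixel range
```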
3.2 Projected Gradient Descent (PGD)
Formulation (Madry et al., 2017):
\[x_{\text{adv}}^{t+1} = \text{Proj}_{\mathcal{S}}\left(x_{\text{adv}}^{t} + \alpha \cdot \text{sign}(\nabla_x \mathcal{L}(f(x_{\text{adv}}^{t}), y))\right)\]
Where:
\(\mathcal{S} = \{x' : \|x' - x\|_{\infty} \leq \epsilon\}\) (constraint set)
\(\alpha\): Step size (typically \(\alpha = \epsilon / K\) where K is the iteration count)
Proj: Projection onto the constraint set
Initialize: \(x_{\text{adv}}^0 = x + \text{Uniform}(-\epsilon, \epsilon)\) (random start)
Projection operator:
\[\text{Proj}_{\mathcal{S}}(x') = x + \text{clip}(x' - x, -\epsilon, \epsilon)\]
Ensures \(\|x_{\text{adv}} - x\|_{\infty} \leq \epsilon\) at each step.
Why random initialization:
Escapes poor local maxima
Finds stronger adversarial examples
Essential for good attack success
Convergence: PGD approximates a solution to the inner maximization
\[\max_{\|\delta\|_{\infty} \leq \epsilon} \mathcal{L}(f(x + \delta), y)\]
Typical hyperparameters:
Iterations: K = 40-100 for evaluation, K = 7-10 for training
Step size: α = ε/K or α = 2.5ε/K
Restarts: 5-10 random restarts for the strongest attack
3.3 Iterative FGSM (I-FGSM / BIM)
Basic Iterative Method:
\[x_{\text{adv}}^{t+1} = \text{clip}_{x, \epsilon}\left(x_{\text{adv}}^{t} + \alpha \cdot \text{sign}(\nabla_x \mathcal{L}(f(x_{\text{adv}}^{t}), y))\right)\]
Same as PGD but typically without random initialization.
3.4 Momentum Iterative FGSM (MI-FGSM)
Add momentum for better transferability:
\[g_{t+1} = \mu \cdot g_t + \frac{\nabla_x \mathcal{L}}{\|\nabla_x \mathcal{L}\|_1}, \qquad x_{\text{adv}}^{t+1} = x_{\text{adv}}^{t} + \alpha \cdot \text{sign}(g_{t+1})\]
Where \(\mu = 1.0\) is the momentum factor.
Benefit: Better black-box transferability across models.
4. Optimization-Based Attacks
4.1 Carlini & Wagner (C&W) Attack
Formulation (Carlini & Wagner, 2017):
\[\min_{\delta} \|\delta\|_2^2 + c \cdot f(x + \delta)\]
Where the objective function \(f\) encourages misclassification:
\[f(x') = \max\left(Z(x')_t - \max_{i \neq t} Z(x')_i,\; -\kappa\right)\]
(untargeted form; for a targeted attack with target class t, swap the two logit terms). Where:
\(Z(x')\): Logits (pre-softmax outputs)
\(t\): True class
\(\kappa\): Confidence parameter (typically κ = 0)
Change of variables: To enforce \(x' \in [0,1]\), use:
\[x' = \tfrac{1}{2}(\tanh(w) + 1)\]
Optimize over unconstrained \(w\) instead of \(x'\).
Binary search on \(c\):
Initialize \(c_{\min} = 0\), \(c_{\max} = 10^{10}\)
For \(c = (c_{\min} + c_{\max})/2\):
Optimize to find \(\delta\)
If attack succeeds: \(c_{\max} = c\)
If attack fails: \(c_{\min} = c\)
Repeat until convergence
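The binary-search loop above can be sketched as follows; `attack_fn` is a hypothetical interface returning the adversarial example and a success flag:

```python
def binary_search_c(attack_fn, x, y, steps=10, c_min=0.0, c_max=1e10):
    """Binary search for the smallest weight c that still yields a
    successful C&W-style attack. `attack_fn(x, y, c)` is assumed to
    return (x_adv, success: bool) - a hypothetical interface."""
    best_adv = None
    for _ in range(steps):
        c = (c_min + c_max) / 2
        x_adv, success = attack_fn(x, y, c)
        if success:
            best_adv, c_max = x_adv, c   # attack worked: try a smaller c
        else:
            c_min = c                    # attack failed: need a larger c
    return best_adv, c_max
```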
Advantages:
Finds minimal perturbations
Strong attack (often defeats defenses)
Can target any class
Disadvantages:
Computationally expensive (100-1000 iterations)
Many hyperparameters
Requires careful tuning
4.2 Elastic Net Attack (EAD)
Combines L_1 and L_2 regularization:
\[\min_{\delta}\; c \cdot f(x + \delta) + \beta \|\delta\|_1 + \|\delta\|_2^2\]
Encourages sparse perturbations.
5. Black-Box Attacks
5.1 Transfer-Based Attacks
Observation: Adversarial examples transfer across models.
Method:
Train substitute model on queries to target model
Generate adversarial examples for substitute
Transfer to target model
Transferability factors:
Similar architectures → Higher transfer
Ensemble attacks → Better transfer
Momentum methods → Better transfer
5.2 Query-Based Attacks
ZOO (Zeroth Order Optimization):
Estimate the gradient using finite differences:
\[\frac{\partial \mathcal{L}}{\partial x_i} \approx \frac{\mathcal{L}(x + h e_i) - \mathcal{L}(x - h e_i)}{2h}\]
Where \(e_i\) is the i-th standard basis (one-hot) vector and \(h\) is a small constant.
Cost: O(d) queries where d is the input dimension.
Square Attack: Random search in the L_∞ ball, keeping the best perturbations.
SimBA: Simple Black-box Attack using random directions.
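The ZOO finite-difference estimator can be sketched as follows, with `loss_fn` standing in for query access to the target model (a full sweep over all coordinates, hence O(d) queries):

```python
import torch

def zoo_gradient(loss_fn, x, h=1e-4):
    """Zeroth-order (finite-difference) gradient estimate, as in ZOO.
    `loss_fn(x)` returns a scalar loss; no gradients are needed."""
    grad = torch.zeros_like(x)
    flat = grad.view(-1)
    x_flat = x.view(-1)
    for i in range(x_flat.numel()):          # O(d) queries per estimate
        e = torch.zeros_like(x_flat)
        e[i] = h
        flat[i] = (loss_fn((x_flat + e).view_as(x)) -
                   loss_fn((x_flat - e).view_as(x))) / (2 * h)
    return grad
```

Practical ZOO implementations estimate only a random subset of coordinates per step to keep the query budget manageable.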
6. Adversarial Training
6.1 Standard Adversarial Training
Min-max formulation (Madry et al., 2017):
\[\min_{\theta} \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\max_{\|\delta\|_{\infty} \leq \epsilon} \mathcal{L}(f_{\theta}(x + \delta), y)\right]\]
Algorithm:
For each batch (x, y):
1. Generate adversarial examples:
x_adv = PGD(model, x, y, ε)
2. Update model:
θ ← θ - η·∇_θ L(f_θ(x_adv), y)
Inner maximization: Find the worst-case perturbation (PGD attack). Outer minimization: Train the model to be robust to those perturbations.
Theoretical justification: Robust optimization: find parameters that minimize the worst-case loss.
6.2 TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization)
Formulation (Zhang et al., 2019):
\[\min_{\theta} \mathbb{E}\left[\mathcal{L}_{\text{CE}}(f_{\theta}(x), y) + \beta \cdot \max_{\|\delta\| \leq \epsilon} \text{KL}\left(f_{\theta}(x) \,\|\, f_{\theta}(x + \delta)\right)\right]\]
Where:
First term: Standard loss on clean examples
Second term: Consistency between clean and adversarial predictions
\(\beta\): Trade-off parameter
Advantages:
Better clean accuracy than standard adversarial training
Explicit trade-off control
Theoretical guarantees
6.3 MART (Misclassification Aware adversarial Training)
Boosted CE loss (Wang et al., 2020):
\[\mathcal{L} = \text{BCE}(p(x_{\text{adv}}), y) + \lambda \cdot \text{KL}\left(p(x) \,\|\, p(x_{\text{adv}})\right) \cdot \left(1 - p_y(x)\right)\]
Where BCE is the boosted cross-entropy; the \(1 - p_y(x)\) weight focuses training on misclassified examples.
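MART is not implemented in the code cell below, so here is a simplified sketch of the loss; the exact boosted-CE margin term follows Wang et al. (2020), and `lam` is an assumed trade-off weight:

```python
import torch
import torch.nn.functional as F

def mart_loss(logits_adv, logits_clean, y, lam=5.0):
    """Simplified sketch of the MART objective (Wang et al., 2020)."""
    p_adv = F.softmax(logits_adv, dim=1)
    p_clean = F.softmax(logits_clean, dim=1)
    # Boosted CE: standard CE plus a margin term on the runner-up class.
    ce = F.cross_entropy(logits_adv, y)
    runner_up = p_adv.clone().scatter_(1, y.unsqueeze(1), 0).max(dim=1).values
    boosted = ce - torch.log(1 - runner_up + 1e-12).mean()
    # KL consistency, weighted by how badly the clean example is classified.
    kl = (p_clean * (torch.log(p_clean + 1e-12) -
                     torch.log(p_adv + 1e-12))).sum(dim=1)
    weight = 1 - p_clean.gather(1, y.unsqueeze(1)).squeeze(1)
    return boosted + lam * (kl * weight).mean()
```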
7. Certified Defenses
7.1 Randomized Smoothing
Definition (Cohen et al., 2019):
Smoothed classifier:
\[g(x) = \arg\max_{c} \mathbb{P}_{\delta \sim \mathcal{N}(0, \sigma^2 I)}\left[f(x + \delta) = c\right]\]
Certification: If \(\mathbb{P}[f(x + \delta) = c_A] \geq p_A\) and \(\mathbb{P}[f(x + \delta) = c_B] \leq p_B\) for \(c_B \neq c_A\), then g is constant within the L_2 radius:
\[R = \frac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right)\]
Where \(\Phi\) is the standard Gaussian CDF.
Algorithm:
Sample \(n\) Gaussian perturbations
Count votes for each class
Compute certified radius using Neyman-Pearson lemma
Advantages:
Provable robustness guarantee
Scales to large models
Any base classifier
Disadvantages:
Requires many samples (n = 100-100000)
Accuracy-robustness trade-off
Only L_2 certification
7.2 Interval Bound Propagation (IBP)
Compute bounds on activations: for a linear layer \(z = Wx + b\) with \(x \in [\underline{x}, \overline{x}]\),
\[\underline{z} = W^{+}\underline{x} + W^{-}\overline{x} + b, \qquad \overline{z} = W^{+}\overline{x} + W^{-}\underline{x} + b\]
where \(W^{+} = \max(W, 0)\) and \(W^{-} = \min(W, 0)\). For a ReLU network, propagate the intervals \([\underline{z}, \overline{z}]\) through the layers; ReLU maps an interval to \([\max(\underline{z}, 0), \max(\overline{z}, 0)]\).
Certified if: The lower bound of the true-class logit doesn't overlap the upper bounds of the other classes.
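The interval rules above can be sketched for a single linear + ReLU layer (a minimal illustration, not a full verifier):

```python
import torch

def ibp_linear(lo, hi, W, b):
    """Propagate the elementwise interval [lo, hi] through z = x W^T + b."""
    W_pos, W_neg = W.clamp(min=0), W.clamp(max=0)
    z_lo = lo @ W_pos.T + hi @ W_neg.T + b   # worst case per output unit
    z_hi = hi @ W_pos.T + lo @ W_neg.T + b   # best case per output unit
    return z_lo, z_hi

def ibp_relu(lo, hi):
    # ReLU is monotone, so it maps interval endpoints to endpoints.
    return lo.clamp(min=0), hi.clamp(min=0)
```

Chaining `ibp_linear` and `ibp_relu` over all layers yields output logit intervals, which is exactly the O(L) cost noted in the complexity section.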
7.3 Lipschitz-Constrained Networks
Lipschitz constant: \(L = \max_{x \neq x'} \frac{\|f(x) - f(x')\|}{\|x - x'\|}\)
Certified radius: If \(\|f(x) - f(x')\| \leq L \cdot \|x - x'\|\), then perturbing x by ε changes the output by at most \(L \cdot \epsilon\).
Methods to enforce:
Spectral normalization
Parseval networks
Orthogonal weight initialization
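Of these, spectral normalization is the easiest to try in PyTorch; a minimal sketch using the built-in `nn.utils.spectral_norm` wrapper, which constrains each layer's largest singular value (and hence its L_2 Lipschitz constant) to roughly 1 via one power iteration per forward pass:

```python
import torch
import torch.nn as nn

# Wrap a layer so its weight is divided by an estimate of its
# largest singular value (updated by power iteration each forward).
layer = nn.utils.spectral_norm(nn.Linear(64, 64))

x = torch.randn(8, 64)
for _ in range(50):          # let the power-iteration estimate converge
    _ = layer(x)

with torch.no_grad():
    sigma = torch.linalg.matrix_norm(layer.weight, ord=2)
print(f"largest singular value of normalized weight: {sigma:.3f}")
```

After the estimate converges, `sigma` sits close to 1, giving a per-layer Lipschitz bound that can be multiplied across layers for a (loose) network-level certificate.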
8. Detection and Input Preprocessing
8.1 Adversarial Detection
Statistical tests:
Kernel density estimation
Local intrinsic dimensionality
Mahalanobis distance in feature space
Detector network: Train binary classifier to distinguish clean vs adversarial.
Limitations:
Adaptive attacks can evade detectors
Arms race problem
8.2 Input Transformations
Defenses:
JPEG compression: Remove high-frequency adversarial noise
Total Variation denoising: Smooth perturbations
Quantization: Reduce precision
Random resizing: Scale and pad images
Effectiveness:
Can reduce attack success
Often bypassed by adaptive attacks
Combine with adversarial training
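Two of these transformations are easy to sketch with plain tensor ops; the forms below are minimal illustrations, not tuned defenses:

```python
import torch

def quantize_defense(x, levels=8):
    """Bit-depth reduction: snap each pixel to one of `levels` values.
    Perturbations smaller than half a quantization step are erased."""
    return torch.round(x * (levels - 1)) / (levels - 1)

def total_variation(x):
    """Total variation of an image batch (..., H, W): the smoothness
    measure that TV-denoising defenses minimize."""
    dh = (x[..., 1:, :] - x[..., :-1, :]).abs().sum()
    dw = (x[..., :, 1:] - x[..., :, :-1]).abs().sum()
    return dh + dw
```

Both are differentiable almost everywhere (quantization has zero gradient), which is precisely why adaptive attacks bypass them with straight-through or BPDA-style gradient approximations.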
9. Robustness Metrics and Evaluation
9.1 Adversarial Accuracy
\[\text{Acc}_{\text{adv}}(\epsilon) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[f(x_{\text{adv},i}) = y_i\right]\]
where \(x_{\text{adv}}\) is an adversarial example with \(\|x_{\text{adv}} - x\|_p \leq \epsilon\).
9.2 Robustness Curve
Plot \(\text{Acc}_{\text{adv}}(\epsilon)\) for varying \(\epsilon\).
9.3 AutoAttack
Standard evaluation suite (Croce & Hein, 2020): Ensemble of attacks:
APGD-CE (Auto PGD with cross-entropy)
APGD-DLR (Auto PGD with DLR loss)
FAB (Fast Adaptive Boundary)
Square Attack
Benefits:
Parameter-free (automatic step size)
Strong baseline for evaluation
Adaptive to defenses
9.4 Robust Accuracy Leaderboard
RobustBench: Standardized benchmark
CIFAR-10: ε = 8/255 (L_∞)
ImageNet: ε = 4/255 (L_∞)
Common Corruptions
L_2 robustness
10. Theoretical Understanding
10.1 Linear Hypothesis (Goodfellow et al., 2014)
For high-dimensional inputs, even a small \(\|\delta\|_{\infty}\) can cause a large change \(\delta^T \nabla_x \mathcal{L}\): with \(\delta = \epsilon \cdot \text{sign}(\nabla_x \mathcal{L})\), the change is \(\epsilon \|\nabla_x \mathcal{L}\|_1\), which grows linearly with the input dimension.
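A small numerical illustration of the linear hypothesis, using a random vector as a stand-in for a real model's gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.01  # tiny per-pixel budget

# With O(1)-sized gradient coordinates, the worst-case linear loss
# change eps * ||grad||_1 grows linearly with the dimension d.
for d in [10, 1_000, 100_000]:
    grad = rng.normal(size=d)
    delta = epsilon * np.sign(grad)   # FGSM-style perturbation
    print(d, float(delta @ grad))     # equals eps * ||grad||_1
```

The same ε that is harmless in 10 dimensions moves the (linearized) loss by hundreds in image-sized input spaces.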
10.2 Boundary Tilting (Tanay & Griffin, 2016)
Adversarial examples exist near decision boundaries where small perturbations flip predictions.
10.3 Robust Features Hypothesis (Ilyas et al., 2019)
Models rely on non-robust features (high predictive power but brittle) instead of robust features.
Adversarial training forces models to use robust features.
10.4 Accuracy-Robustness Trade-off
Theorem (Tsipras et al., 2019): Provable trade-off between standard accuracy and robust accuracy exists for certain data distributions.
Empirical observation: Adversarially trained models typically lose 5-15% clean accuracy.
11. Advanced Topics
11.1 Adaptive Attacks
Problem: Defenses often broken by adaptive attacks that know defense mechanism.
Principles:
White-box access to defense
Optimize attack for specific defense
Backward pass through defense
11.2 Robustness to Natural Perturbations
Common Corruptions (Hendrycks & Dietterich, 2019):
Gaussian noise, shot noise
Motion blur, defocus blur
Snow, frost, fog
JPEG compression
Distribution shift robustness: Train on multiple domains, test on unseen domains.
11.3 Adversarial Examples in Real World
Physical adversarial examples:
Adversarial patches
3D adversarial objects
Robust to transformations (angle, lighting)
Expectation over Transformation (EOT):
\[\delta^* = \arg\max_{\delta} \mathbb{E}_{t \sim T}\left[\mathcal{L}(f(t(x + \delta)), y)\right]\]
Optimize over a distribution of transformations \(T\).
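EOT can be sketched as averaging the loss gradient over sampled transformations; `transforms` is a hypothetical callable returning a randomly transformed copy of its input:

```python
import torch

def eot_gradient(model, loss_fn, x, y, transforms, n_samples=8):
    """Average the loss gradient over random transformations so the
    perturbation survives them (Expectation over Transformation)."""
    x = x.clone().detach().requires_grad_(True)
    total = 0.0
    for _ in range(n_samples):
        total = total + loss_fn(model(transforms(x)), y)
    (total / n_samples).backward()   # Monte Carlo estimate of E_t[grad]
    return x.grad
```

Plugging this gradient into a PGD-style loop is what makes physical attacks (patches, 3D objects) robust to viewpoint and lighting changes.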
12. State-of-the-Art Methods
12.1 Adversarial Training Improvements
Fast AT (Wong et al., 2020):
FGSM adversarial training with random initialization
10× faster than PGD-AT
Competitive robustness
Free AT (Shafahi et al., 2019):
Reuse gradients for attack and model update
Essentially free adversarial training
AWP (Wu et al., 2020): Adversarial Weight Perturbation - perturb weights during training for better robustness.
12.2 Self-Supervised Robust Learning
RoCL (Kim et al., 2020): Robust Contrastive Learning - combine adversarial training with contrastive loss.
AdvCL: Adversarial examples as positive pairs in contrastive learning.
13. Practical Considerations
13.1 Hyperparameter Selection
For attacks:
ε: Dataset-dependent (MNIST: 0.3, CIFAR-10: 8/255, ImageNet: 4/255)
Iterations: More is better (40-100 for evaluation)
Step size: α = ε/iterations or α = 2.5ε/iterations
For adversarial training:
Training ε: Slightly larger than evaluation ε
Attack iterations: 7-10 sufficient during training
Learning rate: Often need to reduce by 10×
13.2 Computational Cost
Adversarial training overhead:
Standard PGD-AT: 10-20× slower than normal training
Fast AT: 2-3× slower
Free AT: ~1× (same as normal training)
Evaluation cost:
AutoAttack: ~1000× slower than standard evaluation
Single attack: 40-100 iterations per sample
13.3 Engineering Best Practices
Normalization: Normalize images to [0,1] before attack, clamp to valid range after.
Learning rate schedule: Use step decay or cosine annealing, often different from standard training.
Early stopping: Monitor robust accuracy on validation set, not clean accuracy.
14. Key Papers Timeline
Foundation (2013-2014):
Szegedy et al. 2013: Intriguing Properties - Discovery of adversarial examples
Goodfellow et al. 2014: FGSM - Fast gradient sign method, linear hypothesis
Attacks (2015-2017):
Papernot et al. 2016: Transferability - Black-box attacks via transfer
Carlini & Wagner 2017: C&W Attack - Strong optimization-based attack
Madry et al. 2017: PGD - Projected gradient descent, adversarial training
Defenses (2018-2020):
Zhang et al. 2019: TRADES - Accuracy-robustness trade-off
Cohen et al. 2019: Randomized Smoothing - Certified L_2 robustness
Wong et al. 2020: Fast AT - Efficient adversarial training
Understanding (2018-2021):
Ilyas et al. 2019: Robust Features - Non-robust vs robust features
Croce & Hein 2020: AutoAttack - Reliable evaluation benchmark
Bai et al. 2021: Recent Advances - Comprehensive survey
Computational Complexity
Attack complexity:
FGSM: O(1) gradient computation
PGD: O(K) where K is iterations
C&W: O(K·B) where B is binary search steps
Adversarial training:
Per epoch: O(K·N) where N is dataset size
Total: O(E·K·N) where E is epochs
Certified defenses:
Randomized smoothing: O(n·C) where n is samples, C is forward pass cost
IBP: O(L) where L is number of layers
"""
Advanced Adversarial Robustness Implementations
This cell provides reference implementations of:
1. Advanced Attacks: PGD, C&W, MI-FGSM
2. Adversarial Training: Standard, TRADES
3. Certified Defenses: Randomized Smoothing
4. Robustness Evaluation: Multi-epsilon curves, attack comparison
5. Detection Methods: Mahalanobis distance
6. Visualization Tools: Attack success rates, robustness curves
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, TensorDataset
from scipy import stats
from sklearn.covariance import EmpiricalCovariance
import warnings
warnings.filterwarnings('ignore')
# ============================================================================
# Advanced Attacks
# ============================================================================
class PGDAttack:
"""
Projected Gradient Descent (Madry et al., 2017)
Theory:
- Iterative FGSM with projection onto ε-ball
- x^{t+1} = Proj_S(x^t + α·sign(∇L))
- Random initialization for stronger attack
"""
def __init__(self, model, epsilon=0.3, alpha=0.01, num_iter=40,
random_start=True, targeted=False):
"""
Args:
model: Neural network to attack
epsilon: Maximum perturbation (L_∞)
alpha: Step size per iteration
num_iter: Number of iterations
random_start: Initialize with random noise
targeted: Targeted (minimize loss) or untargeted (maximize loss)
"""
self.model = model
self.epsilon = epsilon
self.alpha = alpha
self.num_iter = num_iter
self.random_start = random_start
self.targeted = targeted
def attack(self, x, y):
"""
Generate adversarial examples
Args:
x: Clean inputs (B, C, H, W)
y: True labels (B,) for untargeted, target labels for targeted
Returns:
x_adv: Adversarial examples
"""
x_adv = x.clone().detach()
# Random initialization
if self.random_start:
noise = torch.empty_like(x).uniform_(-self.epsilon, self.epsilon)
x_adv = x_adv + noise
x_adv = torch.clamp(x_adv, 0, 1)
for i in range(self.num_iter):
x_adv.requires_grad = True
# Forward pass
output = self.model(x_adv)
# Compute loss
loss = F.cross_entropy(output, y)
# Backward pass
self.model.zero_grad()
loss.backward()
# Update adversarial example
grad_sign = x_adv.grad.sign()
if self.targeted:
# Targeted: minimize loss (move toward target class)
x_adv = x_adv - self.alpha * grad_sign
else:
# Untargeted: maximize loss (move away from true class)
x_adv = x_adv + self.alpha * grad_sign
# Project back to Ξ΅-ball
delta = torch.clamp(x_adv - x, -self.epsilon, self.epsilon)
x_adv = torch.clamp(x + delta, 0, 1).detach()
return x_adv
class MomentumIterativeFGSM:
"""
Momentum Iterative FGSM (Dong et al., 2018)
Theory:
- Add momentum to gradient for better transferability
- g_{t+1} = μ·g_t + ∇L / ||∇L||_1
- Better black-box attack performance
"""
def __init__(self, model, epsilon=0.3, alpha=0.01, num_iter=10,
momentum=1.0):
self.model = model
self.epsilon = epsilon
self.alpha = alpha
self.num_iter = num_iter
self.momentum = momentum
def attack(self, x, y):
"""Generate adversarial examples with momentum"""
x_adv = x.clone().detach()
g = torch.zeros_like(x) # Momentum accumulator
for i in range(self.num_iter):
x_adv.requires_grad = True
output = self.model(x_adv)
loss = F.cross_entropy(output, y)
self.model.zero_grad()
loss.backward()
# Update momentum
grad = x_adv.grad
grad_norm = torch.sum(torch.abs(grad), dim=(1,2,3), keepdim=True)
grad = grad / (grad_norm + 1e-8)
g = self.momentum * g + grad
# Update adversarial example
x_adv = x_adv + self.alpha * g.sign()
# Project
delta = torch.clamp(x_adv - x, -self.epsilon, self.epsilon)
x_adv = torch.clamp(x + delta, 0, 1).detach()
return x_adv
class CarliniWagnerL2:
"""
Carlini & Wagner L2 Attack (Carlini & Wagner, 2017)
Theory:
- min ||δ||_2^2 + c·f(x+δ)
- f(x') = max(max_{i≠t} Z_i - Z_t, -κ)
- Binary search on c for minimal perturbation
"""
def __init__(self, model, targeted=False, c=1.0, kappa=0,
max_iter=1000, learning_rate=0.01):
"""
Args:
model: Neural network to attack
targeted: Targeted attack or untargeted
c: Weight for classification objective
kappa: Confidence parameter
max_iter: Maximum optimization iterations
learning_rate: Adam learning rate
"""
self.model = model
self.targeted = targeted
self.c = c
self.kappa = kappa
self.max_iter = max_iter
self.lr = learning_rate
def attack(self, x, y, num_classes=10):
"""
Generate adversarial examples
Args:
x: Clean inputs (B, C, H, W)
y: Labels (true for untargeted, target for targeted)
num_classes: Number of output classes
Returns:
x_adv: Adversarial examples
"""
batch_size = x.size(0)
# Change of variables: x' = 0.5·(tanh(w) + 1)
# Initialize w such that tanh(w) = 2x - 1
w = torch.zeros_like(x, requires_grad=True)
with torch.no_grad():
w.data = torch.atanh(torch.clamp(2*x - 1, -0.999, 0.999))
optimizer = torch.optim.Adam([w], lr=self.lr)
best_adv = x.clone()
best_l2 = torch.full((batch_size,), float('inf'), device=x.device)  # keep on same device as l2_dist
for iteration in range(self.max_iter):
# Compute adversarial example
x_adv = 0.5 * (torch.tanh(w) + 1)
# L2 distance
l2_dist = torch.sum((x_adv - x)**2, dim=(1,2,3))
# Classification loss
logits = self.model(x_adv)
# Create one-hot encoding for target/true class
y_onehot = F.one_hot(y, num_classes).float()
# Z_t: logit for target class
z_target = torch.sum(logits * y_onehot, dim=1)
# max_{i≠t} Z_i
z_other = torch.max((1 - y_onehot) * logits - y_onehot * 1e9, dim=1)[0]
if self.targeted:
# Targeted: want Z_target > Z_other + κ
f_loss = torch.clamp(z_other - z_target + self.kappa, min=0)
else:
# Untargeted: want Z_target < Z_other - κ
f_loss = torch.clamp(z_target - z_other + self.kappa, min=0)
# Total loss
loss = torch.sum(l2_dist + self.c * f_loss)
# Optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Update best adversarial examples
with torch.no_grad():
pred = logits.argmax(dim=1)
if self.targeted:
success = (pred == y)
else:
success = (pred != y)
for i in range(batch_size):
if success[i] and l2_dist[i] < best_l2[i]:
best_l2[i] = l2_dist[i]
best_adv[i] = x_adv[i]
return best_adv
# ============================================================================
# Adversarial Training Methods
# ============================================================================
class StandardAdversarialTraining:
"""
Standard Adversarial Training (Madry et al., 2017)
Theory:
- min_θ E[ max_{||δ||≤ε} L(f_θ(x+δ), y) ]
- Inner max: Generate adversarial examples with PGD
- Outer min: Train model on adversarial examples
"""
def __init__(self, model, epsilon=0.3, alpha=0.01, num_iter=7):
"""
Args:
model: Neural network to train
epsilon: Perturbation budget for training
alpha: PGD step size
num_iter: PGD iterations during training
"""
self.model = model
self.pgd_attack = PGDAttack(
model, epsilon=epsilon, alpha=alpha,
num_iter=num_iter, random_start=True
)
def train_step(self, x, y, optimizer):
"""
Single training step with adversarial examples
Args:
x: Clean inputs
y: Labels
optimizer: Optimizer for model parameters
Returns:
loss: Training loss on adversarial examples
"""
self.model.train()
# Generate adversarial examples
x_adv = self.pgd_attack.attack(x, y)
# Forward pass on adversarial examples
output = self.model(x_adv)
loss = F.cross_entropy(output, y)
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
return loss.item()
class TRADESTraining:
"""
TRADES: TRadeoff-inspired Adversarial DEfense (Zhang et al., 2019)
Theory:
- L = L_CE(f(x), y) + β·max_{||δ||≤ε} KL(f(x) || f(x+δ))
- Balance natural accuracy and robustness explicitly
- β controls accuracy-robustness trade-off
"""
def __init__(self, model, epsilon=0.3, alpha=0.01, num_iter=7, beta=6.0):
"""
Args:
model: Neural network to train
epsilon: Perturbation budget
alpha: Step size
num_iter: Attack iterations
beta: Trade-off parameter (higher = more robust)
"""
self.model = model
self.epsilon = epsilon
self.alpha = alpha
self.num_iter = num_iter
self.beta = beta
def trades_loss(self, x, y):
"""
Compute TRADES loss
Returns:
loss: TRADES loss
natural_loss: CE loss on clean examples
robust_loss: KL divergence term
"""
# Natural loss
logits_clean = self.model(x)
natural_loss = F.cross_entropy(logits_clean, y)
# Generate adversarial examples (maximize KL divergence)
x_adv = x.clone().detach() + 0.001 * torch.randn_like(x)  # small random start: the KL gradient is exactly zero at x_adv == x
for i in range(self.num_iter):
x_adv.requires_grad = True
logits_adv = self.model(x_adv)
# KL divergence: KL(f(x) || f(x_adv))
kl_div = F.kl_div(
F.log_softmax(logits_adv, dim=1),
F.softmax(logits_clean.detach(), dim=1),
reduction='batchmean'
)
self.model.zero_grad()
kl_div.backward()
# Update
x_adv = x_adv + self.alpha * x_adv.grad.sign()
# Project
delta = torch.clamp(x_adv - x, -self.epsilon, self.epsilon)
x_adv = torch.clamp(x + delta, 0, 1).detach()
# Final KL divergence
logits_adv = self.model(x_adv)
robust_loss = F.kl_div(
F.log_softmax(logits_adv, dim=1),
F.softmax(logits_clean, dim=1),
reduction='batchmean'
)
# Combined loss
loss = natural_loss + self.beta * robust_loss
return loss, natural_loss, robust_loss
def train_step(self, x, y, optimizer):
"""Training step with TRADES"""
self.model.train()
loss, nat_loss, rob_loss = self.trades_loss(x, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
return loss.item(), nat_loss.item(), rob_loss.item()
# ============================================================================
# Certified Defense: Randomized Smoothing
# ============================================================================
class RandomizedSmoothing:
"""
Randomized Smoothing for Certified L2 Robustness (Cohen et al., 2019)
Theory:
- g(x) = argmax_c P_{δ~N(0,σ²I)}[f(x+δ) = c]
- Certified radius: r = σ/2 · (Φ^{-1}(p_A) - Φ^{-1}(p_B))
- Provable robustness guarantee for L2 perturbations
"""
def __init__(self, base_classifier, num_classes=10, sigma=0.5):
"""
Args:
base_classifier: Base neural network
num_classes: Number of output classes
sigma: Noise standard deviation
"""
self.base_classifier = base_classifier
self.num_classes = num_classes
self.sigma = sigma
def predict(self, x, n_samples=1000):
"""
Predict smoothed class
Args:
x: Input image (1, C, H, W)
n_samples: Number of noise samples
Returns:
prediction: Most likely class
counts: Vote counts for each class
"""
self.base_classifier.eval()
counts = torch.zeros(self.num_classes)
with torch.no_grad():
for _ in range(n_samples):
# Sample Gaussian noise
noise = torch.randn_like(x) * self.sigma
x_noisy = x + noise
# Classify
output = self.base_classifier(x_noisy)
pred = output.argmax(dim=1).item()
counts[pred] += 1
prediction = counts.argmax().item()
return prediction, counts
def certify(self, x, n_samples_estimate=1000, n_samples_cert=10000,
alpha=0.001):
"""
Certify robustness radius
Args:
x: Input image (1, C, H, W)
n_samples_estimate: Samples for initial estimate
n_samples_cert: Samples for certification
alpha: Failure probability
Returns:
prediction: Certified class (-1 if abstain)
radius: Certified L2 radius (0 if abstain)
"""
# Step 1: Estimate top class
pred_A, counts_est = self.predict(x, n_samples_estimate)
# Step 2: Certify with more samples
_, counts_cert = self.predict(x, n_samples_cert)
# Counts for top class
n_A = counts_cert[pred_A].item()
# Lower confidence bound (using Clopper-Pearson)
p_A_lower = self._lower_confidence_bound(n_A, n_samples_cert, alpha)
if p_A_lower < 0.5:
# Abstain: not confident enough
return -1, 0.0
# Certified radius
radius = self.sigma * (stats.norm.ppf(p_A_lower) - stats.norm.ppf(0.5))
return pred_A, radius
def _lower_confidence_bound(self, n_success, n_total, alpha):
"""Clopper-Pearson lower confidence bound"""
return stats.beta.ppf(alpha, n_success, n_total - n_success + 1)
# ============================================================================
# Adversarial Detection
# ============================================================================
class MahalanobisDetector:
"""
Mahalanobis Distance-Based Detection (Lee et al., 2018)
Theory:
- Compute Mahalanobis distance in feature space
- M(x) = (x-μ)^T Σ^{-1} (x-μ)
- Threshold distance to detect adversarials
"""
def __init__(self, model, layer_name='fc1'):
"""
Args:
model: Neural network
layer_name: Layer to extract features from
"""
self.model = model
self.layer_name = layer_name
self.class_means = None
self.precision = None
self.features = None
# Register hook
self._register_hook()
def _register_hook(self):
"""Register forward hook to extract features"""
def hook_fn(module, input, output):
self.features = output.detach()
# Find layer
for name, module in self.model.named_modules():
if name == self.layer_name:
module.register_forward_hook(hook_fn)
return
def fit(self, train_loader, num_classes=10):
"""
Estimate class means and precision matrix
Args:
train_loader: DataLoader for clean training data
num_classes: Number of classes
"""
self.model.eval()
# Collect features for each class
class_features = [[] for _ in range(num_classes)]
with torch.no_grad():
for x, y in train_loader:
_ = self.model(x)
features = self.features.cpu().numpy()
for i, label in enumerate(y):
class_features[label.item()].append(features[i])
# Compute class means
self.class_means = []
all_features = []
for features in class_features:
features = np.array(features)
self.class_means.append(features.mean(axis=0))
all_features.append(features)
self.class_means = np.array(self.class_means)
# Compute tied precision matrix (inverse covariance)
all_features = np.vstack(all_features)
# Center features
centered = all_features - all_features.mean(axis=0)
# Compute covariance and invert
cov = EmpiricalCovariance().fit(centered)
self.precision = cov.precision_
def compute_distance(self, x, y):
"""
Compute Mahalanobis distance for samples
Args:
x: Input images
y: Predicted classes
Returns:
distances: Mahalanobis distances
"""
self.model.eval()
with torch.no_grad():
_ = self.model(x)
features = self.features.cpu().numpy()
distances = []
for i, label in enumerate(y):
mean = self.class_means[label.item()]
delta = features[i] - mean
# M(x) = (x-μ)^T Σ^{-1} (x-μ)
dist = np.sqrt(delta @ self.precision @ delta.T)
distances.append(dist)
return np.array(distances)
def detect(self, x, y, threshold):
"""
Detect adversarial examples
Args:
x: Input images
y: Predicted classes
threshold: Detection threshold
Returns:
is_adversarial: Boolean array (True = adversarial)
"""
distances = self.compute_distance(x, y)
return distances > threshold
# ============================================================================
# Robustness Evaluation Tools
# ============================================================================
class RobustnessEvaluator:
"""Comprehensive robustness evaluation tools"""
@staticmethod
def evaluate_multiple_epsilons(model, test_loader, attack_class,
epsilons, device='cpu'):
"""
Evaluate robustness across multiple epsilon values
Args:
model: Neural network
test_loader: Test data loader
attack_class: Attack class (e.g., PGDAttack)
epsilons: List of epsilon values
device: Device to run on
Returns:
results: Dict with clean_acc and adversarial accuracies
"""
model.eval()
results = {'epsilon': epsilons, 'accuracy': []}
# Clean accuracy
correct = 0
total = 0
with torch.no_grad():
for x, y in test_loader:
x, y = x.to(device), y.to(device)
output = model(x)
pred = output.argmax(dim=1)
correct += (pred == y).sum().item()
total += len(y)
clean_acc = 100 * correct / total
print(f"Clean Accuracy: {clean_acc:.2f}%")
# Adversarial accuracy for each epsilon
for eps in epsilons:
attack = attack_class(model, epsilon=eps)
correct = 0
total = 0
for x, y in test_loader:
x, y = x.to(device), y.to(device)
x_adv = attack.attack(x, y)
with torch.no_grad():
output = model(x_adv)
pred = output.argmax(dim=1)
correct += (pred == y).sum().item()
total += len(y)
adv_acc = 100 * correct / total
results['accuracy'].append(adv_acc)
print(f"ε = {eps:.3f}: {adv_acc:.2f}%")
return clean_acc, results
@staticmethod
def plot_robustness_curve(epsilons, accuracies, title="Robustness Curve"):
"""Plot accuracy vs epsilon curve"""
plt.figure(figsize=(10, 6))
plt.plot(epsilons, accuracies, 'b-o', linewidth=2, markersize=8)
plt.fill_between(epsilons, accuracies, alpha=0.3)
plt.xlabel('ε (Perturbation Budget)', fontsize=12)
plt.ylabel('Accuracy (%)', fontsize=12)
plt.title(title, fontsize=13)
plt.grid(True, alpha=0.3)
plt.tight_layout()
return plt.gcf()
@staticmethod
def compare_attacks(model, test_loader, attacks_dict, epsilon=0.3,
device='cpu'):
"""
Compare multiple attack methods
Args:
model: Neural network
test_loader: Test data
attacks_dict: Dict of {name: attack_instance}
epsilon: Perturbation budget
device: Device
Returns:
results: Dict with attack success rates
"""
model.eval()
results = {}
for name, attack in attacks_dict.items():
correct = 0
total = 0
for x, y in test_loader:
x, y = x.to(device), y.to(device)
x_adv = attack.attack(x, y)
with torch.no_grad():
output = model(x_adv)
pred = output.argmax(dim=1)
correct += (pred == y).sum().item()
total += len(y)
acc = 100 * correct / total
results[name] = acc
print(f"{name}: {acc:.2f}%")
return results
# ============================================================================
# Demonstration
# ============================================================================
print("Advanced Adversarial Robustness Methods Implemented:")
print("=" * 70)
print("1. PGDAttack - Projected Gradient Descent (strongest first-order)")
print("2. MomentumIterativeFGSM - Better transferability with momentum")
print("3. CarliniWagnerL2 - Strong optimization-based attack")
print("4. StandardAdversarialTraining - Madry et al. robust training")
print("5. TRADESTraining - Accuracy-robustness trade-off")
print("6. RandomizedSmoothing - Certified L2 defense")
print("7. MahalanobisDetector - Statistical adversarial detection")
print("8. RobustnessEvaluator - Comprehensive evaluation tools")
print("=" * 70)
# Simple demonstration
print("\nExample: PGD Attack vs FGSM")
print("-" * 70)
# Create simple model and data for demo
class TinyNet(nn.Module):
def __init__(self):
super().__init__()
self.fc = nn.Linear(784, 10)
def forward(self, x):
return self.fc(x.view(x.size(0), -1))
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tiny_model = TinyNet().to(device)
# Generate toy data
x_demo = torch.randn(100, 1, 28, 28).to(device)
y_demo = torch.randint(0, 10, (100,)).to(device)
# Compare attacks
fgsm = PGDAttack(tiny_model, epsilon=0.3, num_iter=1, random_start=False)
pgd = PGDAttack(tiny_model, epsilon=0.3, alpha=0.01, num_iter=40, random_start=True)
x_fgsm = fgsm.attack(x_demo[:10], y_demo[:10])
x_pgd = pgd.attack(x_demo[:10], y_demo[:10])
l2_fgsm = torch.norm((x_fgsm - x_demo[:10]).view(10, -1), dim=1).mean()
l2_pgd = torch.norm((x_pgd - x_demo[:10]).view(10, -1), dim=1).mean()
linf_fgsm = torch.max(torch.abs(x_fgsm - x_demo[:10]))
linf_pgd = torch.max(torch.abs(x_pgd - x_demo[:10]))
print(f"FGSM - L2: {l2_fgsm:.4f}, L∞: {linf_fgsm:.4f}")
print(f"PGD - L2: {l2_pgd:.4f}, L∞: {linf_pgd:.4f}")
print("\nPGD typically finds stronger adversarial examples")
print("\n" + "=" * 70)
print("Key Takeaways:")
print("=" * 70)
print("1. PGD: Iterative attack with random start → stronger than FGSM")
print("2. C&W: Optimization-based → minimal perturbations, very strong")
print("3. Adversarial Training: Most effective defense, 10-20× slower")
print("4. TRADES: Better clean accuracy than standard adv training")
print("5. Randomized Smoothing: Provable certified robustness (L2)")
print("6. Detection: Useful but can be evaded by adaptive attacks")
print("7. Evaluation: Use AutoAttack or multiple strong attacks")
print("=" * 70)
1. Adversarial Examples¶
Definition¶
An adversarial example is a perturbed input \(x' = x + \delta\) with \(\|\delta\|_p \leq \epsilon\), where \(f(x') \neq f(x)\) but \(x' \approx x\).
FGSM (Fast Gradient Sign Method)¶
class SimpleNet(nn.Module):
"""Simple CNN for MNIST."""
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
self.fc1 = nn.Linear(64 * 7 * 7, 128)
self.fc2 = nn.Linear(128, 10)
def forward(self, x):
x = F.relu(self.conv1(x))
x = F.max_pool2d(x, 2)
x = F.relu(self.conv2(x))
x = F.max_pool2d(x, 2)
x = x.view(x.size(0), -1)
x = F.relu(self.fc1(x))
return self.fc2(x)
def fgsm_attack(model, x, y, epsilon=0.3):
    """FGSM attack: one signed-gradient step of size epsilon."""
    # Work on a detached copy so we don't mutate the caller's tensor
    x = x.clone().detach().requires_grad_(True)
    output = model(x)
    loss = F.cross_entropy(output, y)
    model.zero_grad()
    loss.backward()
    # Perturb in the direction that increases the loss
    x_adv = x + epsilon * x.grad.sign()
    # Keep pixels in the valid [0, 1] range
    x_adv = torch.clamp(x_adv, 0, 1)
    return x_adv.detach()
# Load data
transform = transforms.Compose([transforms.ToTensor()])
mnist = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_mnist = datasets.MNIST('./data', train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(mnist, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_mnist, batch_size=1000)
print("Data loaded")
Train Standard Model¶
Before studying adversarial robustness, we train a standard classifier on clean data as a baseline. Standard training minimizes the empirical cross-entropy loss: \(\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i | x_i)\). This produces a model with high clean accuracy but, as we will see, it is surprisingly fragile to small, carefully crafted input perturbations. The gap between clean accuracy and adversarial accuracy is the primary metric that motivates adversarial robustness research.
def train_standard(model, train_loader, n_epochs=5):
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(n_epochs):
model.train()
for x, y in train_loader:
x, y = x.to(device), y.to(device)
output = model(x)
loss = F.cross_entropy(output, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"Epoch {epoch+1}/{n_epochs} complete")
model = SimpleNet().to(device)
train_standard(model, train_loader, n_epochs=5)
Evaluate Robustness¶
We evaluate robustness by attacking the trained model with FGSM (Fast Gradient Sign Method) and PGD (Projected Gradient Descent), the two most widely used attack algorithms. FGSM generates adversarial examples in a single step: \(x' = x + \epsilon \cdot \text{sign}(\nabla_x \mathcal{L})\), while PGD iterates this process with smaller steps and projects back onto the \(\epsilon\)-ball after each step. Testing across multiple perturbation budgets \(\epsilon\) produces a robustness curve showing how accuracy degrades as the attacker grows stronger. A steep drop at small \(\epsilon\) values exposes the model's reliance on non-robust, imperceptible features.
def evaluate_robustness(model, test_loader, epsilon=0.3):
model.eval()
clean_correct = 0
adv_correct = 0
total = 0
for x, y in test_loader:
x, y = x.to(device), y.to(device)
# Clean accuracy
with torch.no_grad():
output = model(x)
pred = output.argmax(dim=1)
clean_correct += (pred == y).sum().item()
# Adversarial accuracy
x_adv = fgsm_attack(model, x, y, epsilon)
with torch.no_grad():
output_adv = model(x_adv)
pred_adv = output_adv.argmax(dim=1)
adv_correct += (pred_adv == y).sum().item()
total += y.size(0)
clean_acc = 100 * clean_correct / total
adv_acc = 100 * adv_correct / total
return clean_acc, adv_acc
clean_acc, adv_acc = evaluate_robustness(model, test_loader, epsilon=0.3)
print(f"Clean Accuracy: {clean_acc:.2f}%")
print(f"Adversarial Accuracy (ε=0.3): {adv_acc:.2f}%")
Visualize Attacks¶
Visualizing adversarial examples side-by-side with their clean originals makes the threat concrete: the perturbations are typically imperceptible to humans (they look like faint noise), yet they completely change the model's prediction. Displaying the perturbation pattern itself (amplified for visibility) reveals which pixels the attack modifies most, often edges and texture regions that the model relies on for classification. This visualization is essential for communicating adversarial risks to non-technical stakeholders and motivates the need for robust training procedures.
# Get sample
x_test, y_test = next(iter(test_loader))
x_sample = x_test[:5].to(device)
y_sample = y_test[:5].to(device)
# Generate adversarial
epsilons = [0.0, 0.1, 0.2, 0.3]
fig, axes = plt.subplots(5, 4, figsize=(12, 15))
for i in range(5):
for j, eps in enumerate(epsilons):
        if eps == 0:
            img = x_sample[i:i+1]  # keep the batch dim: the model and img[0, 0] indexing below expect shape (1, 1, 28, 28)
else:
img = fgsm_attack(model, x_sample[i:i+1], y_sample[i:i+1], eps)
with torch.no_grad():
pred = model(img).argmax(dim=1).item()
axes[i, j].imshow(img[0, 0].cpu(), cmap='gray')
        axes[i, j].set_title(f"ε={eps}, pred={pred}", fontsize=9)
axes[i, j].axis('off')
plt.suptitle('FGSM Attack Examples', fontsize=13)
plt.tight_layout()
plt.show()
5. PGD Attack¶
Projected Gradient Descent¶
def pgd_attack(model, x, y, epsilon=0.3, alpha=0.01, num_iter=40):
"""PGD attack."""
x_adv = x.clone().detach()
# Random start
x_adv = x_adv + torch.empty_like(x_adv).uniform_(-epsilon, epsilon)
x_adv = torch.clamp(x_adv, 0, 1)
for _ in range(num_iter):
x_adv.requires_grad = True
output = model(x_adv)
loss = F.cross_entropy(output, y)
model.zero_grad()
loss.backward()
# Update
x_adv = x_adv + alpha * x_adv.grad.sign()
# Project
delta = torch.clamp(x_adv - x, -epsilon, epsilon)
x_adv = torch.clamp(x + delta, 0, 1).detach()
return x_adv
# Compare FGSM vs PGD
x_sample = x_test[:100].to(device)
y_sample = y_test[:100].to(device)
x_fgsm = fgsm_attack(model, x_sample, y_sample, 0.3)
x_pgd = pgd_attack(model, x_sample, y_sample, 0.3)
with torch.no_grad():
pred_fgsm = model(x_fgsm).argmax(dim=1)
pred_pgd = model(x_pgd).argmax(dim=1)
fgsm_success = (pred_fgsm != y_sample).sum().item()
pgd_success = (pred_pgd != y_sample).sum().item()
print(f"FGSM attack success: {fgsm_success}/100")
print(f"PGD attack success: {pgd_success}/100")
Adversarial Training¶
Adversarial training (Madry et al., 2018) is the most effective known defense: instead of minimizing loss on clean inputs, we minimize loss on adversarially perturbed inputs: \(\min_\theta \mathbb{E}[\max_{\|\delta\| \le \epsilon} \mathcal{L}(f_\theta(x + \delta), y)]\). At each training step, we first generate an adversarial example for each batch element (Madry et al. use multi-step PGD; the code below uses single-step FGSM to keep the demo fast), then update the model weights using the adversarial loss. With multi-step PGD this min-max optimization is roughly 3-10x more expensive than standard training, because each step requires multiple forward-backward passes for the inner maximization. The resulting model achieves substantially higher adversarial accuracy, though typically at the cost of some clean accuracy, a trade-off known as the accuracy-robustness tension.
def train_adversarial(model, train_loader, n_epochs=5, epsilon=0.3):
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(n_epochs):
model.train()
for x, y in train_loader:
x, y = x.to(device), y.to(device)
# Generate adversarial examples
x_adv = fgsm_attack(model, x, y, epsilon)
# Train on adversarial
output = model(x_adv)
loss = F.cross_entropy(output, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"Epoch {epoch+1}/{n_epochs} complete")
# Train robust model
robust_model = SimpleNet().to(device)
train_adversarial(robust_model, train_loader, n_epochs=5, epsilon=0.3)
# Evaluate
clean_acc_robust, adv_acc_robust = evaluate_robustness(robust_model, test_loader, epsilon=0.3)
print(f"\nRobust Model - Clean: {clean_acc_robust:.2f}%, Adversarial: {adv_acc_robust:.2f}%")
print(f"Standard Model - Clean: {clean_acc:.2f}%, Adversarial: {adv_acc:.2f}%")
Robustness vs Epsilon¶
Plotting adversarial accuracy as a function of the perturbation budget \(\epsilon\) for both standard and adversarially trained models reveals the impact of robust training. The standard model's accuracy drops precipitously even at tiny \(\epsilon\) values, while the adversarially trained model maintains reasonable accuracy up to the \(\epsilon\) it was trained against. Beyond the training \(\epsilon\), even robust models eventually break; there is no free lunch in adversarial robustness. This analysis helps practitioners choose an appropriate \(\epsilon\) for their threat model and budget the computational cost of adversarial training accordingly.
epsilons = np.linspace(0, 0.5, 11)
standard_accs = []
robust_accs = []
for eps in epsilons:
_, acc_std = evaluate_robustness(model, test_loader, eps)
_, acc_rob = evaluate_robustness(robust_model, test_loader, eps)
standard_accs.append(acc_std)
robust_accs.append(acc_rob)
plt.figure(figsize=(10, 6))
plt.plot(epsilons, standard_accs, 'b-o', label='Standard Model', markersize=6)
plt.plot(epsilons, robust_accs, 'r-o', label='Robust Model', markersize=6)
plt.xlabel('ε (perturbation)', fontsize=11)
plt.ylabel('Accuracy (%)', fontsize=11)
plt.title('Adversarial Robustness', fontsize=12)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Summary¶
Attacks:¶
FGSM: Single-step, fast
PGD: Multi-step, stronger
C&W: Optimization-based
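The C&W attack is listed above but not implemented in this section. As a minimal sketch of its L2 formulation (the function name and hyperparameters here are illustrative, not from a library): optimize an unconstrained variable \(w\) with \(x' = \frac{1}{2}(\tanh(w) + 1)\) so the adversarial image stays in \([0, 1]\), and minimize the perturbation size plus a margin-based misclassification term.

```python
import torch

def cw_l2_attack(model, x, y, c=1.0, steps=100, lr=0.01, kappa=0.0):
    """Minimal C&W L2 sketch: minimize ||x' - x||^2 + c * f(x'),
    with f(x') = max(Z_y - max_{j != y} Z_j, -kappa)."""
    # Change of variables keeps x_adv = 0.5 * (tanh(w) + 1) inside [0, 1]
    w = torch.atanh((2 * x - 1).clamp(-0.999, 0.999)).detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        x_adv = 0.5 * (torch.tanh(w) + 1)
        logits = model(x_adv)
        # Margin loss: push the true-class logit below the best other logit
        true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
        other = logits.clone()
        other.scatter_(1, y.unsqueeze(1), float('-inf'))
        f_loss = torch.clamp(true_logit - other.max(dim=1).values, min=-kappa)
        # Perturbation size term (squared L2 per example)
        l2 = ((x_adv - x) ** 2).flatten(1).sum(dim=1)
        loss = (l2 + c * f_loss).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (0.5 * (torch.tanh(w) + 1)).detach()
```

In the full attack, the constant `c` is tuned by binary search per example; this sketch fixes it for brevity.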
Defenses:¶
Adversarial training
Certified defenses
Randomized smoothing
Input transformations
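Randomized smoothing appears in the list above without code in this section. A minimal prediction-side sketch (function name ours; a full implementation in the style of Cohen et al., 2019 adds a statistical abstention test and derives a certified L2 radius from the vote counts):

```python
import torch
import torch.nn.functional as F

def smoothed_predict(model, x, sigma=0.25, n_samples=100):
    """Classify x by majority vote over Gaussian-noised copies."""
    model.eval()
    with torch.no_grad():
        num_classes = model(x).shape[1]
        counts = torch.zeros(x.shape[0], num_classes, dtype=torch.long)
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)   # N(0, sigma^2) noise
            counts += F.one_hot(model(noisy).argmax(dim=1), num_classes)
        return counts.argmax(dim=1), counts
```

The base model must itself be trained on Gaussian-noised inputs for the smoothed classifier to be accurate; smoothing a standard model usually hurts clean accuracy badly.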
Tradeoffs:¶
Robustness vs accuracy
Computation cost
Threat model assumptions
Applications:¶
Security-critical systems
Autonomous vehicles
Medical diagnosis
Malware detection
Next Steps:¶
Study certified defenses
Explore adaptive attacks
Learn robustness verification
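As a first step toward the certified defenses and verification mentioned above, interval bound propagation (IBP) is a common entry point: propagate elementwise input bounds layer by layer to obtain sound output bounds. A minimal sketch for a single linear layer (helper name ours, not from a library; for a linear layer the bounds are exact):

```python
import torch

def ibp_linear(W, b, lower, upper):
    """Propagate elementwise input bounds [lower, upper] through
    y = x @ W.T + b, returning sound elementwise output bounds."""
    mu = (lower + upper) / 2       # interval centers
    r = (upper - lower) / 2        # interval radii (non-negative)
    mu_out = mu @ W.T + b
    r_out = r @ W.abs().T          # |W| maps input radii to output radii
    return mu_out - r_out, mu_out + r_out
```

For example, with \(W = [1, -2]\), \(b = 0.5\), and inputs in \([0, 1]^2\), the output \(x_1 - 2x_2 + 0.5\) ranges over \([-1.5, 1.5]\), which is exactly what `ibp_linear` returns. Chaining such bounds through ReLUs (clamp the interval at zero) yields a certificate when the lower bound of the true-class logit exceeds the upper bounds of all others.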