import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torchvision import datasets, transforms

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

Advanced Adversarial Robustness Theory

1. Foundations and Threat Models

Definition: Adversarial examples are inputs deliberately crafted to cause misclassification:

\[x_{\text{adv}} = x + \delta \quad \text{s.t.} \quad f(x_{\text{adv}}) \neq f(x), \quad \|\delta\|_p \leq \epsilon\]

Where:

  • \(x\): Original input (clean example)

  • \(\delta\): Adversarial perturbation

  • \(\epsilon\): Perturbation budget

  • \(\|\cdot\|_p\): L_p norm (p ∈ {0, 1, 2, ∞})

Historical Context:

  • Szegedy et al. (2013): First observation of adversarial examples in deep networks

  • Goodfellow et al. (2014): FGSM attack - showed linear nature of adversarial vulnerability

  • Madry et al. (2017): PGD attack and adversarial training - first strong defense

2. Threat Models and Attack Types

2.1 Attack Goals

Untargeted attack:

\[\max_{\|\delta\|_p \leq \epsilon} \mathcal{L}(f_{\theta}(x + \delta), y)\]

Goal: Make model predict any incorrect class.

Targeted attack:

\[\min_{\|\delta\|_p \leq \epsilon} \mathcal{L}(f_{\theta}(x + \delta), y_{\text{target}})\]

Goal: Make model predict specific target class \(y_{\text{target}}\).

2.2 Attack Knowledge (White-box vs Black-box)

White-box: Attacker has full access to:

  • Model architecture

  • Model parameters \(\theta\)

  • Training data (sometimes)

  • Gradient information

Black-box: Attacker only has:

  • Query access (input → output)

  • No gradient information

  • Limited queries (query budget)

Gray-box: Partial knowledge (e.g., architecture but not weights)

2.3 Perturbation Norms

L_∞ norm (Chebyshev):

\[\|\delta\|_{\infty} = \max_i |\delta_i| \leq \epsilon\]

Per-pixel perturbation bounded. Most common in practice.

L_2 norm (Euclidean):

\[\|\delta\|_2 = \sqrt{\sum_i \delta_i^2} \leq \epsilon\]

Total energy bounded. Allows larger changes in some pixels.

L_0 norm (Sparsity):

\[\|\delta\|_0 = |\{i : \delta_i \neq 0\}| \leq k\]

Number of changed pixels bounded. Hardest to optimize (NP-hard).

L_1 norm (Manhattan):

\[\|\delta\|_1 = \sum_i |\delta_i| \leq \epsilon\]

Sum of absolute changes bounded.

3. Gradient-Based Attacks

3.1 Fast Gradient Sign Method (FGSM)

Original formulation (Goodfellow et al., 2014):

\[x_{\text{adv}} = x + \epsilon \cdot \text{sign}(\nabla_x \mathcal{L}(f_{\theta}(x), y))\]

Intuition: Move in direction that maximizes loss, step size \(\epsilon\).

Linear approximation:

\[\mathcal{L}(f_{\theta}(x + \delta), y) \approx \mathcal{L}(f_{\theta}(x), y) + \delta^T \nabla_x \mathcal{L}\]

Maximize by setting \(\delta = \epsilon \cdot \text{sign}(\nabla_x \mathcal{L})\).

L_∞ constraint: \(\|\delta\|_{\infty} = \epsilon\)

Advantages:

  • Single gradient computation → Very fast

  • Easy to implement

Limitations:

  • Single-step → Weaker than iterative methods

  • Suboptimal for non-linear models

Targeted FGSM:

\[x_{\text{adv}} = x - \epsilon \cdot \text{sign}(\nabla_x \mathcal{L}(f_{\theta}(x), y_{\text{target}}))\]

Minimize loss for target class (note the negative sign).
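The two variants above can be sketched in a few lines of PyTorch; `model` stands for any differentiable classifier with inputs in [0, 1] (a hypothetical toy setup, not tied to a specific architecture):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon, targeted=False):
    """Single-step FGSM. Untargeted: y is the true label and we step
    along sign(grad) to increase the loss. Targeted: y is the target
    label and we step against the gradient to decrease its loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    step = -epsilon if targeted else epsilon
    return torch.clamp(x_adv + step * grad.sign(), 0, 1).detach()
```

The final clamp keeps the adversarial example in the valid pixel range, so the effective perturbation satisfies \(\|\delta\|_{\infty} \leq \epsilon\).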

3.2 Projected Gradient Descent (PGD)

Formulation (Madry et al., 2017):

\[x_{\text{adv}}^{t+1} = \text{Proj}_{\mathcal{S}}(x_{\text{adv}}^t + \alpha \cdot \text{sign}(\nabla_x \mathcal{L}(f_{\theta}(x_{\text{adv}}^t), y)))\]

Where:

  • \(\mathcal{S} = \{x' : \|x' - x\|_{\infty} \leq \epsilon\}\) (constraint set)

  • \(\alpha\): Step size (typically \(\alpha = \epsilon / K\) where K is iterations)

  • Proj: Projection onto constraint set

  • Initialize: \(x_{\text{adv}}^0 = x + \text{Uniform}(-\epsilon, \epsilon)\) (random start)

Projection operator:

\[\text{Proj}_{\mathcal{S}}(z) = \text{clip}(z, x - \epsilon, x + \epsilon)\]

Ensure \(\|x_{\text{adv}} - x\|_{\infty} \leq \epsilon\) at each step.

Why random initialization:

  • Escapes poor local maxima

  • Finds stronger adversarial examples

  • Essential for good attack success

Convergence: PGD approximates solution to:

\[\max_{\|x' - x\|_{\infty} \leq \epsilon} \mathcal{L}(f_{\theta}(x'), y)\]

Typical hyperparameters:

  • Iterations: K = 40-100 for evaluation, K = 7-10 for training

  • Step size: α = ε/K or α = 2.5ε/K

  • Restarts: 5-10 random restarts for strongest attack

3.3 Iterative FGSM (I-FGSM / BIM)

Basic Iterative Method:

\[x_{\text{adv}}^{t+1} = \text{clip}(x_{\text{adv}}^t + \alpha \cdot \text{sign}(\nabla_x \mathcal{L}), x - \epsilon, x + \epsilon)\]

Same as PGD but typically no random initialization.

3.4 Momentum Iterative FGSM (MI-FGSM)

Add momentum for better transferability:

\[g_{t+1} = \mu \cdot g_t + \frac{\nabla_x \mathcal{L}(x_{\text{adv}}^t, y)}{\|\nabla_x \mathcal{L}(x_{\text{adv}}^t, y)\|_1}\]
\[x_{\text{adv}}^{t+1} = \text{clip}(x_{\text{adv}}^t + \alpha \cdot \text{sign}(g_{t+1}), x - \epsilon, x + \epsilon)\]

Where \(\mu\) is the momentum factor (typically \(\mu = 1.0\)).

Benefit: Better black-box transferability across models.

4. Optimization-Based Attacks

4.1 Carlini & Wagner (C&W) Attack

Formulation (Carlini & Wagner, 2017):

\[\min_{\delta} \|\delta\|_2^2 + c \cdot f(x + \delta)\]

Where objective function \(f\) encourages misclassification:

\[f(x') = \max(\max_{i \neq t} Z(x')_i - Z(x')_t, -\kappa)\]

Where:

  • \(Z(x')\): Logits (pre-softmax outputs)

  • \(t\): Target class for a targeted attack; for an untargeted attack \(t\) is the true class and the two logit terms are swapped, i.e. \(f(x') = \max(Z(x')_t - \max_{i \neq t} Z(x')_i, -\kappa)\)

  • \(\kappa\): Confidence parameter (typically \(\kappa = 0\))

Change of variables: To enforce \(x' \in [0,1]\), use:

\[x' = \frac{1}{2}(\tanh(w) + 1)\]

Optimize over unconstrained \(w\) instead of \(x'\).

Binary search on \(c\):

  1. Initialize \(c_{\min} = 0\), \(c_{\max} = 10^{10}\)

  2. For \(c = (c_{\min} + c_{\max})/2\):

    • Optimize to find \(\delta\)

    • If attack succeeds: \(c_{\max} = c\)

    • If attack fails: \(c_{\min} = c\)

  3. Repeat until convergence
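The binary-search loop can be sketched independently of the inner optimizer; `attack_with_c` here is a hypothetical callable that runs the C&W optimization for a fixed \(c\) and reports whether the attack succeeded:

```python
def binary_search_c(attack_with_c, c_min=0.0, c_max=1e10, steps=40):
    """Find (approximately) the smallest c whose attack succeeds.
    attack_with_c(c) -> (success: bool, delta) is a hypothetical
    inner optimizer. Returns the delta from the smallest successful
    c seen, or None if no tried c succeeded."""
    best_delta = None
    for _ in range(steps):
        c = (c_min + c_max) / 2
        success, delta = attack_with_c(c)
        if success:
            best_delta, c_max = delta, c  # attack worked: try smaller c
        else:
            c_min = c                     # attack failed: need larger c
    return best_delta
```

Each iteration halves the search interval, so a few dozen steps locate the success threshold precisely even with the huge initial range.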

Advantages:

  • Finds minimal perturbations

  • Strong attack (often defeats defenses)

  • Can target any class

Disadvantages:

  • Computationally expensive (100-1000 iterations)

  • Many hyperparameters

  • Requires careful tuning

4.2 Elastic Net Attack (EAD)

Combines L_1 and L_2:

\[\min_{\delta} \lambda \|\delta\|_1 + \|\delta\|_2^2 + c \cdot f(x + \delta)\]

Encourages sparse perturbations.

5. Black-Box Attacks

5.1 Transfer-Based Attacks

Observation: Adversarial examples transfer across models.

Method:

  1. Train substitute model on queries to target model

  2. Generate adversarial examples for substitute

  3. Transfer to target model

Transferability factors:

  • Similar architectures → Higher transfer

  • Ensemble attacks → Better transfer

  • Momentum methods → Better transfer

5.2 Query-Based Attacks

ZOO (Zeroth Order Optimization):

Estimate gradient using finite differences:

\[\frac{\partial \mathcal{L}}{\partial x_i} \approx \frac{\mathcal{L}(x + h \cdot e_i) - \mathcal{L}(x - h \cdot e_i)}{2h}\]

Where \(e_i\) is one-hot vector, \(h\) is small constant.

Cost: O(d) queries where d is dimension.
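A minimal NumPy sketch of the ZOO gradient estimate; `loss_fn` stands in for the query interface (the attacker only observes loss values, never gradients):

```python
import numpy as np

def zoo_gradient(loss_fn, x, h=1e-4):
    """Coordinate-wise two-sided finite differences: 2*d queries for
    a d-dimensional input, using no model internals at all."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e.flat[i] = h  # one-hot probe direction e_i
        grad.flat[i] = (loss_fn(x + e) - loss_fn(x - e)) / (2 * h)
    return grad
```

In practice ZOO estimates only a random subset of coordinates per step to stay within the query budget.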

Square Attack: Random search in L_∞ ball, keep best perturbations.

SimBA: Simple Black-box Attack using random directions.

6. Adversarial Training

6.1 Standard Adversarial Training

Min-max formulation (Madry et al., 2017):

\[\min_{\theta} \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \max_{\|\delta\|_{\infty} \leq \epsilon} \mathcal{L}(f_{\theta}(x + \delta), y) \right]\]

Algorithm:

For each batch (x, y):
  1. Generate adversarial examples:
     x_adv = PGD(model, x, y, ε)
  2. Update model:
     θ ← θ - η·∇_θ L(f_θ(x_adv), y)

Inner maximization: find the worst-case perturbation (PGD attack).

Outer minimization: train the model to be robust to those perturbations.

Theoretical justification: Robust optimization: Find parameters that minimize worst-case loss.

6.2 TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization)

Formulation (Zhang et al., 2019):

\[\min_{\theta} \mathbb{E} \left[ \mathcal{L}(f_{\theta}(x), y) + \beta \cdot \max_{\|\delta\|_p \leq \epsilon} \text{KL}(f_{\theta}(x) \| f_{\theta}(x + \delta)) \right]\]

Where:

  • First term: Standard loss on clean examples

  • Second term: Consistency between clean and adversarial predictions

  • \(\beta\): Trade-off parameter

Advantages:

  • Better clean accuracy than standard adversarial training

  • Explicit trade-off control

  • Theoretical guarantees

6.3 MART (Misclassification Aware adversarial Training)

Boosted CE loss:

\[\mathcal{L}_{\text{MART}} = \text{BCE}(f_{\theta}(x_{\text{adv}}), y) + \beta \cdot \text{KL}(f_{\theta}(x) \| f_{\theta}(x_{\text{adv}}))\]

Where BCE is boosted cross-entropy, focusing on misclassified examples.
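A sketch of this loss, assuming the adversarial logits have already been computed and following the simplified form above (the full MART additionally weights the KL term by the clean misclassification probability 1 - p_y(x), omitted here):

```python
import torch
import torch.nn.functional as F

def mart_loss(logits_clean, logits_adv, y, beta=6.0):
    """Simplified MART loss: boosted CE on adversarial logits plus a
    KL consistency term. (Full MART also weights the KL term by
    1 - p_y(x) on clean inputs; omitted in this sketch.)"""
    p_adv = F.softmax(logits_adv, dim=1)
    p_true = p_adv.gather(1, y.unsqueeze(1)).squeeze(1)
    # Boosted CE also pushes down the strongest wrong class
    p_wrong = p_adv.scatter(1, y.unsqueeze(1), 0.0).max(dim=1).values
    bce = -torch.log(p_true + 1e-12) - torch.log(1.0 - p_wrong + 1e-12)
    kl = F.kl_div(F.log_softmax(logits_adv, dim=1),
                  F.softmax(logits_clean, dim=1),
                  reduction='none').sum(dim=1)
    return (bce + beta * kl).mean()
```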

7. Certified Defenses

7.1 Randomized Smoothing

Definition (Cohen et al., 2019):

Smoothed classifier:

\[g(x) = \arg\max_c \mathbb{P}_{\delta \sim \mathcal{N}(0, \sigma^2 I)}[f(x + \delta) = c]\]

Certification: If \(\mathbb{P}[f(x + \delta) = c_A] \geq p_A\) and \(\mathbb{P}[f(x + \delta) = c_B] \leq p_B\) for \(c_B \neq c_A\), then:

\[g(x + \delta') = c_A \quad \forall \|\delta'\|_2 \leq \frac{\sigma}{2}(\Phi^{-1}(p_A) - \Phi^{-1}(p_B))\]

Where \(\Phi\) is standard Gaussian CDF.

Algorithm:

  1. Sample \(n\) Gaussian perturbations

  2. Count votes for each class

  3. Compute certified radius using Neyman-Pearson lemma

Advantages:

  • Provable robustness guarantee

  • Scales to large models

  • Any base classifier

Disadvantages:

  • Requires many samples (n = 100-100000)

  • Accuracy-robustness trade-off

  • Only L_2 certification

7.2 Interval Bound Propagation (IBP)

Compute bounds on activations:

For ReLU network, propagate intervals \([\underline{z}, \overline{z}]\) through layers.

Certified if: Output interval for true class doesn’t overlap with other classes.
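A minimal NumPy sketch of one propagation step through an affine layer followed by ReLU (illustrative only; IBP training additionally tightens these bounds via a mixed loss):

```python
import numpy as np

def ibp_affine(lo, hi, W, b):
    """Propagate the box [lo, hi] through z = W @ x + b: split W into
    positive and negative parts so each output bound pairs with the
    worst-case input endpoint."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def ibp_relu(lo, hi):
    """ReLU is monotone, so bounds pass through elementwise."""
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)
```

Chaining these two functions layer by layer yields output intervals for the whole ReLU network.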

7.3 Lipschitz-Constrained Networks

Lipschitz constant: \(L = \max_{x \neq x'} \frac{\|f(x) - f(x')\|}{\|x - x'\|}\)

Certified radius: If \(\|f(x) - f(x')\| \leq L \cdot \|x - x'\|\), then perturbing x by Ξ΅ changes output by at most \(L \cdot \epsilon\).

Methods to enforce:

  • Spectral normalization

  • Parseval networks

  • Orthogonal weight initialization
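Spectral normalization is available directly in PyTorch; a sketch constraining each linear layer's spectral norm to roughly 1 (with 1-Lipschitz activations such as ReLU, the network's Lipschitz constant is then bounded by the product of the layer norms):

```python
import torch
import torch.nn as nn

# Each wrapped layer divides its weight by a power-iteration estimate
# of its largest singular value, keeping its spectral norm near 1.
net = nn.Sequential(
    nn.utils.spectral_norm(nn.Linear(8, 16)),
    nn.ReLU(),
    nn.utils.spectral_norm(nn.Linear(16, 4)),
)
out = net(torch.randn(2, 8))
```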

8. Detection and Input Preprocessing

8.1 Adversarial Detection

Statistical tests:

  • Kernel density estimation

  • Local intrinsic dimensionality

  • Mahalanobis distance in feature space

Detector network: Train binary classifier to distinguish clean vs adversarial.

Limitations:

  • Adaptive attacks can evade detectors

  • Arms race problem

8.2 Input Transformations

Defenses:

  • JPEG compression: Remove high-frequency adversarial noise

  • Total Variation denoising: Smooth perturbations

  • Quantization: Reduce precision

  • Random resizing: Scale and pad images

Effectiveness:

  • Can reduce attack success

  • Often bypassed by adaptive attacks

  • Combine with adversarial training
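As a concrete example, bit-depth quantization is a one-liner (a weak defense on its own, as noted above, and easily bypassed by adaptive attacks):

```python
import torch

def quantize(x, levels=16):
    """Snap each pixel in [0, 1] to the nearest of `levels` values,
    destroying perturbations finer than the quantization step."""
    return torch.round(x * (levels - 1)) / (levels - 1)
```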

9. Robustness Metrics and Evaluation

9.1 Adversarial Accuracy

\[\text{Acc}_{\text{adv}}(\epsilon) = \mathbb{P}_{(x,y) \sim \mathcal{D}}[f(x_{\text{adv}}) = y]\]

where \(x_{\text{adv}}\) is adversarial example with \(\|x_{\text{adv}} - x\|_p \leq \epsilon\).

9.2 Robustness Curve

Plot \(\text{Acc}_{\text{adv}}(\epsilon)\) for varying \(\epsilon\).
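Both metrics reduce to a single evaluation loop; `attack(model, x, y, eps)` is a hypothetical attack callable (e.g. wrapping PGD), not a fixed API:

```python
import torch

def robust_accuracy(model, loader, attack, epsilons):
    """Adversarial accuracy for each ε in `epsilons`; plotting the
    returned dict over ε gives the robustness curve."""
    model.eval()
    accs = {}
    for eps in epsilons:
        correct, total = 0, 0
        for x, y in loader:
            x_adv = attack(model, x, y, eps)
            with torch.no_grad():
                correct += (model(x_adv).argmax(dim=1) == y).sum().item()
            total += y.size(0)
        accs[eps] = correct / total
    return accs
```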

9.3 AutoAttack

Standard evaluation suite (Croce & Hein, 2020), an ensemble of four attacks:

  1. APGD-CE (Auto PGD with cross-entropy)

  2. APGD-DLR (Auto PGD with DLR loss)

  3. FAB (Fast Adaptive Boundary)

  4. Square Attack

Benefits:

  • Parameter-free (automatic step size)

  • Strong baseline for evaluation

  • Adaptive to defenses

9.4 Robust Accuracy Leaderboard

RobustBench: Standardized benchmark

  • CIFAR-10: ε = 8/255 (L_∞)

  • ImageNet: ε = 4/255 (L_∞)

  • Common Corruptions

  • L_2 robustness

10. Theoretical Understanding

10.1 Linear Hypothesis (Goodfellow et al., 2014)

\[\mathcal{L}(f(x + \delta), y) \approx \mathcal{L}(f(x), y) + \delta^T \nabla_x \mathcal{L}\]

For high-dimensional inputs, even small \(\|\delta\|_{\infty}\) can cause large \(\delta^T \nabla_x \mathcal{L}\).

10.2 Boundary Tilting (Tanay & Griffin, 2016)

Adversarial examples exist near decision boundaries where small perturbations flip predictions.

10.3 Robust Features Hypothesis (Ilyas et al., 2019)

Models rely on non-robust features (high predictive power but brittle) instead of robust features.

Adversarial training forces models to use robust features.

10.4 Accuracy-Robustness Trade-off

Theorem (Tsipras et al., 2019): Provable trade-off between standard accuracy and robust accuracy exists for certain data distributions.

Empirical observation: Adversarially trained models typically lose 5-15% clean accuracy.

11. Advanced Topics

11.1 Adaptive Attacks

Problem: Defenses often broken by adaptive attacks that know defense mechanism.

Principles:

  • White-box access to defense

  • Optimize attack for specific defense

  • Backward pass through defense

11.2 Robustness to Natural Perturbations

Common Corruptions (Hendrycks & Dietterich, 2019):

  • Gaussian noise, shot noise

  • Motion blur, defocus blur

  • Snow, frost, fog

  • JPEG compression

Distribution shift robustness: Train on multiple domains, test on unseen domains.

11.3 Adversarial Examples in Real World

Physical adversarial examples:

  • Adversarial patches

  • 3D adversarial objects

  • Robust to transformations (angle, lighting)

Expectation over Transformation (EOT):

\[\mathbb{E}_{t \sim T}[\mathcal{L}(f(t(x + \delta)), y)]\]

Optimize over distribution of transformations.
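The EOT gradient is just the average loss gradient over sampled transformations; `transform` here is any differentiable transformation sampler (a hypothetical stand-in, e.g. a random shift or brightness jitter):

```python
import torch
import torch.nn.functional as F

def eot_gradient(model, x, y, transform, n_samples=8):
    """Average the loss gradient over sampled transforms t ~ T so the
    resulting perturbation survives the whole distribution."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = sum(F.cross_entropy(model(transform(x_adv)), y)
               for _ in range(n_samples)) / n_samples
    return torch.autograd.grad(loss, x_adv)[0]
```

An attack then takes FGSM/PGD-style steps along `eot_gradient` instead of the plain gradient.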

12. State-of-the-Art Methods

12.1 Adversarial Training Improvements

Fast AT (Wong et al., 2020):

  • FGSM adversarial training with random initialization

  • 10× faster than PGD-AT

  • Competitive robustness

Free AT (Shafahi et al., 2019):

  • Reuse gradients for attack and model update

  • Essentially free adversarial training

AWP (Wu et al., 2020): Adversarial Weight Perturbation - perturb weights during training for better robustness.

12.2 Self-Supervised Robust Learning

RoCL (Kim et al., 2020): Robust Contrastive Learning - combine adversarial training with contrastive loss.

AdvCL: Adversarial examples as positive pairs in contrastive learning.

13. Practical Considerations

13.1 Hyperparameter Selection

For attacks:

  • ε: Dataset-dependent (MNIST: 0.3, CIFAR-10: 8/255, ImageNet: 4/255)

  • Iterations: More is better (40-100 for evaluation)

  • Step size: α = ε/iterations or α = 2.5ε/iterations

For adversarial training:

  • Training ε: Slightly larger than evaluation ε

  • Attack iterations: 7-10 sufficient during training

  • Learning rate: Often needs to be reduced by 10×

13.2 Computational Cost

Adversarial training overhead:

  • Standard PGD-AT: 10-20× slower than normal training

  • Fast AT: 2-3× slower

  • Free AT: ~1× (same as normal training)

Evaluation cost:

  • AutoAttack: ~1000× slower than standard evaluation

  • Single attack: 40-100 iterations per sample

13.3 Engineering Best Practices

Normalization: Normalize images to [0,1] before attack, clamp to valid range after.

Learning rate schedule: Use step decay or cosine annealing, often different from standard training.

Early stopping: Monitor robust accuracy on validation set, not clean accuracy.

14. Key Papers Timeline

Foundation (2013-2014):

  • Szegedy et al. 2013: Intriguing Properties - Discovery of adversarial examples

  • Goodfellow et al. 2014: FGSM - Fast gradient sign method, linear hypothesis

Attacks (2015-2017):

  • Papernot et al. 2016: Transferability - Black-box attacks via transfer

  • Carlini & Wagner 2017: C&W Attack - Strong optimization-based attack

  • Madry et al. 2017: PGD - Projected gradient descent, adversarial training

Defenses (2018-2020):

  • Zhang et al. 2019: TRADES - Accuracy-robustness trade-off

  • Cohen et al. 2019: Randomized Smoothing - Certified L_2 robustness

  • Wong et al. 2020: Fast AT - Efficient adversarial training

Understanding (2018-2021):

  • Ilyas et al. 2019: Robust Features - Non-robust vs robust features

  • Croce & Hein 2020: AutoAttack - Reliable evaluation benchmark

  • Bai et al. 2021: Recent Advances - Comprehensive survey

Computational Complexity

Attack complexity:

  • FGSM: O(1) gradient computation

  • PGD: O(K) where K is iterations

  • C&W: O(K·B) where B is binary search steps

Adversarial training:

  • Per epoch: O(K·N) where N is dataset size

  • Total: O(E·K·N) where E is epochs

Certified defenses:

  • Randomized smoothing: O(n·C) where n is samples, C is forward pass cost

  • IBP: O(L) where L is number of layers

"""
Advanced Adversarial Robustness Implementations

This cell provides production-ready implementations of:
1. Advanced Attacks: PGD, C&W, MI-FGSM, AutoAttack
2. Adversarial Training: Standard, TRADES, MART
3. Certified Defenses: Randomized Smoothing
4. Robustness Evaluation: Multi-epsilon curves, AutoAttack suite
5. Detection Methods: Statistical tests, Mahalanobis distance
6. Visualization Tools: Attack success rates, robustness curves
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, TensorDataset
from scipy import stats
from sklearn.covariance import EmpiricalCovariance
import warnings
warnings.filterwarnings('ignore')

# ============================================================================
# Advanced Attacks
# ============================================================================

class PGDAttack:
    """
    Projected Gradient Descent (Madry et al., 2017)
    
    Theory:
    - Iterative FGSM with projection onto Ξ΅-ball
    - x^{t+1} = Proj_S(x^t + α·sign(∇L))
    - Random initialization for stronger attack
    """
    
    def __init__(self, model, epsilon=0.3, alpha=0.01, num_iter=40, 
                 random_start=True, targeted=False):
        """
        Args:
            model: Neural network to attack
            epsilon: Maximum perturbation (L_∞)
            alpha: Step size per iteration
            num_iter: Number of iterations
            random_start: Initialize with random noise
            targeted: Targeted (minimize loss) or untargeted (maximize loss)
        """
        self.model = model
        self.epsilon = epsilon
        self.alpha = alpha
        self.num_iter = num_iter
        self.random_start = random_start
        self.targeted = targeted
    
    def attack(self, x, y):
        """
        Generate adversarial examples
        
        Args:
            x: Clean inputs (B, C, H, W)
            y: True labels (B,) for untargeted, target labels for targeted
            
        Returns:
            x_adv: Adversarial examples
        """
        x_adv = x.clone().detach()
        
        # Random initialization
        if self.random_start:
            noise = torch.empty_like(x).uniform_(-self.epsilon, self.epsilon)
            x_adv = x_adv + noise
            x_adv = torch.clamp(x_adv, 0, 1)
        
        for i in range(self.num_iter):
            x_adv.requires_grad = True
            
            # Forward pass
            output = self.model(x_adv)
            
            # Compute loss
            loss = F.cross_entropy(output, y)
            
            # Backward pass
            self.model.zero_grad()
            loss.backward()
            
            # Update adversarial example
            grad_sign = x_adv.grad.sign()
            
            if self.targeted:
                # Targeted: minimize loss (move toward target class)
                x_adv = x_adv - self.alpha * grad_sign
            else:
                # Untargeted: maximize loss (move away from true class)
                x_adv = x_adv + self.alpha * grad_sign
            
            # Project back to Ξ΅-ball
            delta = torch.clamp(x_adv - x, -self.epsilon, self.epsilon)
            x_adv = torch.clamp(x + delta, 0, 1).detach()
        
        return x_adv


class MomentumIterativeFGSM:
    """
    Momentum Iterative FGSM (Dong et al., 2018)
    
    Theory:
    - Add momentum to gradient for better transferability
    - g_{t+1} = μ·g_t + ∇L / ||∇L||_1
    - Better black-box attack performance
    """
    
    def __init__(self, model, epsilon=0.3, alpha=0.01, num_iter=10, 
                 momentum=1.0):
        self.model = model
        self.epsilon = epsilon
        self.alpha = alpha
        self.num_iter = num_iter
        self.momentum = momentum
    
    def attack(self, x, y):
        """Generate adversarial examples with momentum"""
        x_adv = x.clone().detach()
        g = torch.zeros_like(x)  # Momentum accumulator
        
        for i in range(self.num_iter):
            x_adv.requires_grad = True
            
            output = self.model(x_adv)
            loss = F.cross_entropy(output, y)
            
            self.model.zero_grad()
            loss.backward()
            
            # Update momentum
            grad = x_adv.grad
            grad_norm = torch.sum(torch.abs(grad), dim=(1,2,3), keepdim=True)
            grad = grad / (grad_norm + 1e-8)
            
            g = self.momentum * g + grad
            
            # Update adversarial example
            x_adv = x_adv + self.alpha * g.sign()
            
            # Project
            delta = torch.clamp(x_adv - x, -self.epsilon, self.epsilon)
            x_adv = torch.clamp(x + delta, 0, 1).detach()
        
        return x_adv


class CarliniWagnerL2:
    """
    Carlini & Wagner L2 Attack (Carlini & Wagner, 2017)
    
    Theory:
    - min ||δ||_2^2 + c·f(x+δ)
    - f(x') = max(max_{i≠t} Z_i - Z_t, -κ)
    - Binary search on c for minimal perturbation
    """
    
    def __init__(self, model, targeted=False, c=1.0, kappa=0, 
                 max_iter=1000, learning_rate=0.01):
        """
        Args:
            model: Neural network to attack
            targeted: Targeted attack or untargeted
            c: Weight for classification objective
            kappa: Confidence parameter
            max_iter: Maximum optimization iterations
            learning_rate: Adam learning rate
        """
        self.model = model
        self.targeted = targeted
        self.c = c
        self.kappa = kappa
        self.max_iter = max_iter
        self.lr = learning_rate
    
    def attack(self, x, y, num_classes=10):
        """
        Generate adversarial examples
        
        Args:
            x: Clean inputs (B, C, H, W)
            y: Labels (true for untargeted, target for targeted)
            num_classes: Number of output classes
            
        Returns:
            x_adv: Adversarial examples
        """
        batch_size = x.size(0)
        
        # Change of variables: x' = 0.5Β·(tanh(w) + 1)
        # Initialize w such that tanh(w) = 2x - 1
        w = torch.zeros_like(x, requires_grad=True)
        with torch.no_grad():
            w.data = torch.atanh(torch.clamp(2*x - 1, -0.999, 0.999))
        
        optimizer = torch.optim.Adam([w], lr=self.lr)
        
        best_adv = x.clone()
        best_l2 = torch.full((batch_size,), float('inf'), device=x.device)
        
        for iteration in range(self.max_iter):
            # Compute adversarial example
            x_adv = 0.5 * (torch.tanh(w) + 1)
            
            # L2 distance
            l2_dist = torch.sum((x_adv - x)**2, dim=(1,2,3))
            
            # Classification loss
            logits = self.model(x_adv)
            
            # Create one-hot encoding for target/true class
            y_onehot = F.one_hot(y, num_classes).float()
            
            # Z_t: logit for target class
            z_target = torch.sum(logits * y_onehot, dim=1)
            
            # max_{i≠t} Z_i
            z_other = torch.max((1 - y_onehot) * logits - y_onehot * 1e9, dim=1)[0]
            
            if self.targeted:
                # Targeted: want Z_target > Z_other + κ
                f_loss = torch.clamp(z_other - z_target + self.kappa, min=0)
            else:
                # Untargeted: want Z_target < Z_other - κ
                f_loss = torch.clamp(z_target - z_other + self.kappa, min=0)
            
            # Total loss
            loss = torch.sum(l2_dist + self.c * f_loss)
            
            # Optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            # Update best adversarial examples
            with torch.no_grad():
                pred = logits.argmax(dim=1)
                
                if self.targeted:
                    success = (pred == y)
                else:
                    success = (pred != y)
                
                for i in range(batch_size):
                    if success[i] and l2_dist[i] < best_l2[i]:
                        best_l2[i] = l2_dist[i]
                        best_adv[i] = x_adv[i]
        
        return best_adv


# ============================================================================
# Adversarial Training Methods
# ============================================================================

class StandardAdversarialTraining:
    """
    Standard Adversarial Training (Madry et al., 2017)
    
    Theory:
    - min_θ E[ max_{||δ||≤ε} L(f_θ(x+δ), y) ]
    - Inner max: Generate adversarial examples with PGD
    - Outer min: Train model on adversarial examples
    """
    
    def __init__(self, model, epsilon=0.3, alpha=0.01, num_iter=7):
        """
        Args:
            model: Neural network to train
            epsilon: Perturbation budget for training
            alpha: PGD step size
            num_iter: PGD iterations during training
        """
        self.model = model
        self.pgd_attack = PGDAttack(
            model, epsilon=epsilon, alpha=alpha, 
            num_iter=num_iter, random_start=True
        )
    
    def train_step(self, x, y, optimizer):
        """
        Single training step with adversarial examples
        
        Args:
            x: Clean inputs
            y: Labels
            optimizer: Optimizer for model parameters
            
        Returns:
            loss: Training loss on adversarial examples
        """
        self.model.train()
        
        # Generate adversarial examples
        x_adv = self.pgd_attack.attack(x, y)
        
        # Forward pass on adversarial examples
        output = self.model(x_adv)
        loss = F.cross_entropy(output, y)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        return loss.item()


class TRADESTraining:
    """
    TRADES: TRadeoff-inspired Adversarial DEfense (Zhang et al., 2019)
    
    Theory:
    - L = L_CE(f(x), y) + β·max_{||δ||≤ε} KL(f(x) || f(x+δ))
    - Balance natural accuracy and robustness explicitly
    - β controls the accuracy-robustness trade-off
    """
    
    def __init__(self, model, epsilon=0.3, alpha=0.01, num_iter=7, beta=6.0):
        """
        Args:
            model: Neural network to train
            epsilon: Perturbation budget
            alpha: Step size
            num_iter: Attack iterations
            beta: Trade-off parameter (higher = more robust)
        """
        self.model = model
        self.epsilon = epsilon
        self.alpha = alpha
        self.num_iter = num_iter
        self.beta = beta
    
    def trades_loss(self, x, y):
        """
        Compute TRADES loss
        
        Returns:
            loss: TRADES loss
            natural_loss: CE loss on clean examples
            robust_loss: KL divergence term
        """
        # Natural loss
        logits_clean = self.model(x)
        natural_loss = F.cross_entropy(logits_clean, y)
        
        # Generate adversarial examples (maximize KL divergence)
        # Small random start: the KL gradient is exactly zero at δ = 0
        x_adv = x.detach() + 0.001 * torch.randn_like(x)
        
        for i in range(self.num_iter):
            x_adv.requires_grad = True
            
            logits_adv = self.model(x_adv)
            
            # KL divergence: KL(f(x) || f(x_adv))
            kl_div = F.kl_div(
                F.log_softmax(logits_adv, dim=1),
                F.softmax(logits_clean.detach(), dim=1),
                reduction='batchmean'
            )
            
            self.model.zero_grad()
            kl_div.backward()
            
            # Update
            x_adv = x_adv + self.alpha * x_adv.grad.sign()
            
            # Project
            delta = torch.clamp(x_adv - x, -self.epsilon, self.epsilon)
            x_adv = torch.clamp(x + delta, 0, 1).detach()
        
        # Final KL divergence
        logits_adv = self.model(x_adv)
        robust_loss = F.kl_div(
            F.log_softmax(logits_adv, dim=1),
            F.softmax(logits_clean, dim=1),
            reduction='batchmean'
        )
        
        # Combined loss
        loss = natural_loss + self.beta * robust_loss
        
        return loss, natural_loss, robust_loss
    
    def train_step(self, x, y, optimizer):
        """Training step with TRADES"""
        self.model.train()
        
        loss, nat_loss, rob_loss = self.trades_loss(x, y)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        return loss.item(), nat_loss.item(), rob_loss.item()


# ============================================================================
# Certified Defense: Randomized Smoothing
# ============================================================================

class RandomizedSmoothing:
    """
    Randomized Smoothing for Certified L2 Robustness (Cohen et al., 2019)
    
    Theory:
    - g(x) = argmax_c P_{δ~N(0,σ²I)}[f(x+δ) = c]
    - Certified radius: r = σ/2 · (Φ^{-1}(p_A) - Φ^{-1}(p_B))
    - Provable robustness guarantee for L2 perturbations
    """
    
    def __init__(self, base_classifier, num_classes=10, sigma=0.5):
        """
        Args:
            base_classifier: Base neural network
            num_classes: Number of output classes
            sigma: Noise standard deviation
        """
        self.base_classifier = base_classifier
        self.num_classes = num_classes
        self.sigma = sigma
    
    def predict(self, x, n_samples=1000):
        """
        Predict smoothed class
        
        Args:
            x: Input image (1, C, H, W)
            n_samples: Number of noise samples
            
        Returns:
            prediction: Most likely class
            counts: Vote counts for each class
        """
        self.base_classifier.eval()
        
        counts = torch.zeros(self.num_classes)
        
        with torch.no_grad():
            for _ in range(n_samples):
                # Sample Gaussian noise
                noise = torch.randn_like(x) * self.sigma
                x_noisy = x + noise
                
                # Classify
                output = self.base_classifier(x_noisy)
                pred = output.argmax(dim=1).item()
                counts[pred] += 1
        
        prediction = counts.argmax().item()
        
        return prediction, counts
    
    def certify(self, x, n_samples_estimate=1000, n_samples_cert=10000, 
                alpha=0.001):
        """
        Certify robustness radius
        
        Args:
            x: Input image (1, C, H, W)
            n_samples_estimate: Samples for initial estimate
            n_samples_cert: Samples for certification
            alpha: Failure probability
            
        Returns:
            prediction: Certified class (-1 if abstain)
            radius: Certified L2 radius (0 if abstain)
        """
        # Step 1: Estimate top class
        pred_A, counts_est = self.predict(x, n_samples_estimate)
        
        # Step 2: Certify with more samples
        _, counts_cert = self.predict(x, n_samples_cert)
        
        # Counts for top class
        n_A = counts_cert[pred_A].item()
        
        # Lower confidence bound (using Clopper-Pearson)
        p_A_lower = self._lower_confidence_bound(n_A, n_samples_cert, alpha)
        
        if p_A_lower < 0.5:
            # Abstain: not confident enough
            return -1, 0.0
        
        # Certified L2 radius (Cohen et al., 2019): R = σ · Φ^{-1}(p_A_lower)
        radius = self.sigma * stats.norm.ppf(p_A_lower)
        
        return pred_A, radius
    
    def _lower_confidence_bound(self, n_success, n_total, alpha):
        """One-sided (1 - alpha) Clopper-Pearson lower confidence bound.
        
        Requires scipy: `from scipy import stats`.
        """
        if n_success == 0:
            return 0.0
        return stats.beta.ppf(alpha, n_success, n_total - n_success + 1)
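To build intuition for the certification rule, here is a standalone numeric sketch (assuming `scipy` is available; the `certified_radius` helper below is illustrative, not part of the class): more votes for the top class under noise yield a tighter lower bound on \(p_A\) and hence a larger certified radius.

```python
from scipy import stats

def certified_radius(n_top, n_total, sigma, alpha=0.001):
    """Clopper-Pearson lower bound on p_A, then R = sigma * Phi^{-1}(p_A_lower)."""
    if n_top == 0:
        return 0.0
    p_lower = stats.beta.ppf(alpha, n_top, n_total - n_top + 1)
    if p_lower < 0.5:
        return 0.0  # abstain: not confident the top class wins under noise
    return sigma * stats.norm.ppf(p_lower)

# Stronger vote consensus -> larger certified radius
r_strong = certified_radius(9900, 10000, sigma=0.5)
r_weak = certified_radius(6000, 10000, sigma=0.5)
print(f"strong consensus: R = {r_strong:.3f}")
print(f"weak consensus:   R = {r_weak:.3f}")
```

Note that at 4000/10000 votes the lower bound falls below 0.5 and the smoothed classifier must abstain (radius 0).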


# ============================================================================
# Adversarial Detection
# ============================================================================

class MahalanobisDetector:
    """
    Mahalanobis Distance-Based Detection (Lee et al., 2018)
    
    Theory:
    - Compute Mahalanobis distance in feature space
    - M(x) = sqrt((x-μ)^T Σ^{-1} (x-μ))
    - Threshold distance to detect adversarials
    """
    
    def __init__(self, model, layer_name='fc1'):
        """
        Args:
            model: Neural network
            layer_name: Layer to extract features from
        """
        self.model = model
        self.layer_name = layer_name
        self.class_means = None
        self.precision = None
        self.features = None
        
        # Register hook
        self._register_hook()
    
    def _register_hook(self):
        """Register forward hook to extract features"""
        def hook_fn(module, input, output):
            self.features = output.detach()
        
        # Find layer
        for name, module in self.model.named_modules():
            if name == self.layer_name:
                module.register_forward_hook(hook_fn)
                return
    
    def fit(self, train_loader, num_classes=10):
        """
        Estimate class means and precision matrix
        
        Args:
            train_loader: DataLoader for clean training data
            num_classes: Number of classes
        """
        self.model.eval()
        
        # Collect features for each class
        class_features = [[] for _ in range(num_classes)]
        
        with torch.no_grad():
            for x, y in train_loader:
                _ = self.model(x)
                features = self.features.cpu().numpy()
                
                for i, label in enumerate(y):
                    class_features[label.item()].append(features[i])
        
        # Compute class means
        self.class_means = []
        all_features = []
        
        for features in class_features:
            features = np.array(features)
            self.class_means.append(features.mean(axis=0))
            all_features.append(features)
        
        self.class_means = np.array(self.class_means)
        
        # Compute tied precision matrix (inverse covariance)
        all_features = np.vstack(all_features)
        
        # Center features
        centered = all_features - all_features.mean(axis=0)
        
        # Compute covariance and invert (requires sklearn:
        # from sklearn.covariance import EmpiricalCovariance)
        cov = EmpiricalCovariance().fit(centered)
        self.precision = cov.precision_
    
    def compute_distance(self, x, y):
        """
        Compute Mahalanobis distance for samples
        
        Args:
            x: Input images
            y: Predicted classes
            
        Returns:
            distances: Mahalanobis distances
        """
        self.model.eval()
        
        with torch.no_grad():
            _ = self.model(x)
            features = self.features.cpu().numpy()
        
        distances = []
        
        for i, label in enumerate(y):
            mean = self.class_means[label.item()]
            delta = features[i] - mean
            
            # Mahalanobis distance: sqrt((x-μ)^T Σ^{-1} (x-μ))
            dist = np.sqrt(delta @ self.precision @ delta)
            distances.append(dist)
        
        return np.array(distances)
    
    def detect(self, x, y, threshold):
        """
        Detect adversarial examples
        
        Args:
            x: Input images
            y: Predicted classes
            threshold: Detection threshold
            
        Returns:
            is_adversarial: Boolean array (True = adversarial)
        """
        distances = self.compute_distance(x, y)
        return distances > threshold
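The distance computation itself can be sanity-checked without a network. A small synthetic sketch (assuming scikit-learn, which also provides the `EmpiricalCovariance` used above) shows that a far-off point receives a much larger Mahalanobis distance than in-distribution points, which is exactly the signal the detector thresholds:

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

rng = np.random.default_rng(0)

# "Clean" features: a 2-D Gaussian blob standing in for one class's features
features = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(500, 2))
mean = features.mean(axis=0)

# Tied precision matrix, as in MahalanobisDetector.fit
cov = EmpiricalCovariance().fit(features - mean)
precision = cov.precision_

def mahalanobis(x):
    delta = x - mean
    return np.sqrt(delta @ precision @ delta)

d_in = mahalanobis(features[0])             # typical in-distribution point
d_out = mahalanobis(np.array([6.0, 3.0]))   # far-off (adversarial-like) point
print(f"in-distribution:     {d_in:.2f}")
print(f"out-of-distribution: {d_out:.2f}")
```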


# ============================================================================
# Robustness Evaluation Tools
# ============================================================================

class RobustnessEvaluator:
    """Comprehensive robustness evaluation tools"""
    
    @staticmethod
    def evaluate_multiple_epsilons(model, test_loader, attack_class, 
                                   epsilons, device='cpu'):
        """
        Evaluate robustness across multiple epsilon values
        
        Args:
            model: Neural network
            test_loader: Test data loader
            attack_class: Attack class (e.g., PGDAttack)
            epsilons: List of epsilon values
            device: Device to run on
            
        Returns:
            clean_acc: Clean accuracy (%)
            results: Dict with 'epsilon' list and adversarial 'accuracy' list
        """
        model.eval()
        results = {'epsilon': epsilons, 'accuracy': []}
        
        # Clean accuracy
        correct = 0
        total = 0
        
        with torch.no_grad():
            for x, y in test_loader:
                x, y = x.to(device), y.to(device)
                output = model(x)
                pred = output.argmax(dim=1)
                correct += (pred == y).sum().item()
                total += len(y)
        
        clean_acc = 100 * correct / total
        print(f"Clean Accuracy: {clean_acc:.2f}%")
        
        # Adversarial accuracy for each epsilon
        for eps in epsilons:
            attack = attack_class(model, epsilon=eps)
            
            correct = 0
            total = 0
            
            for x, y in test_loader:
                x, y = x.to(device), y.to(device)
                
                x_adv = attack.attack(x, y)
                
                with torch.no_grad():
                    output = model(x_adv)
                    pred = output.argmax(dim=1)
                    correct += (pred == y).sum().item()
                    total += len(y)
            
            adv_acc = 100 * correct / total
            results['accuracy'].append(adv_acc)
            print(f"ε = {eps:.3f}: {adv_acc:.2f}%")
        
        return clean_acc, results
    
    @staticmethod
    def plot_robustness_curve(epsilons, accuracies, title="Robustness Curve"):
        """Plot accuracy vs epsilon curve"""
        plt.figure(figsize=(10, 6))
        plt.plot(epsilons, accuracies, 'b-o', linewidth=2, markersize=8)
        plt.fill_between(epsilons, accuracies, alpha=0.3)
        plt.xlabel('ε (Perturbation Budget)', fontsize=12)
        plt.ylabel('Accuracy (%)', fontsize=12)
        plt.title(title, fontsize=13)
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        return plt.gcf()
    
    @staticmethod
    def compare_attacks(model, test_loader, attacks_dict, epsilon=0.3, 
                       device='cpu'):
        """
        Compare multiple attack methods
        
        Args:
            model: Neural network
            test_loader: Test data
            attacks_dict: Dict of {name: attack_instance}
            epsilon: Perturbation budget
            device: Device
            
        Returns:
            results: Dict with attack success rates
        """
        model.eval()
        results = {}
        
        for name, attack in attacks_dict.items():
            correct = 0
            total = 0
            
            for x, y in test_loader:
                x, y = x.to(device), y.to(device)
                
                x_adv = attack.attack(x, y)
                
                with torch.no_grad():
                    output = model(x_adv)
                    pred = output.argmax(dim=1)
                    correct += (pred == y).sum().item()
                    total += len(y)
            
            acc = 100 * correct / total
            results[name] = acc
            print(f"{name}: {acc:.2f}%")
        
        return results


# ============================================================================
# Demonstration
# ============================================================================

print("Advanced Adversarial Robustness Methods Implemented:")
print("=" * 70)
print("1. PGDAttack - Projected Gradient Descent (strongest first-order)")
print("2. MomentumIterativeFGSM - Better transferability with momentum")
print("3. CarliniWagnerL2 - Strong optimization-based attack")
print("4. StandardAdversarialTraining - Madry et al. robust training")
print("5. TRADESTraining - Accuracy-robustness trade-off")
print("6. RandomizedSmoothing - Certified L2 defense")
print("7. MahalanobisDetector - Statistical adversarial detection")
print("8. RobustnessEvaluator - Comprehensive evaluation tools")
print("=" * 70)

# Simple demonstration
print("\nExample: PGD Attack vs FGSM")
print("-" * 70)

# Create simple model and data for demo
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10)
    
    def forward(self, x):
        return self.fc(x.view(x.size(0), -1))

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tiny_model = TinyNet().to(device)

# Generate toy data
x_demo = torch.randn(100, 1, 28, 28).to(device)
y_demo = torch.randint(0, 10, (100,)).to(device)

# Compare attacks
fgsm = PGDAttack(tiny_model, epsilon=0.3, alpha=0.3, num_iter=1, random_start=False)  # one full-size step == FGSM
pgd = PGDAttack(tiny_model, epsilon=0.3, alpha=0.01, num_iter=40, random_start=True)

x_fgsm = fgsm.attack(x_demo[:10], y_demo[:10])
x_pgd = pgd.attack(x_demo[:10], y_demo[:10])

l2_fgsm = torch.norm((x_fgsm - x_demo[:10]).view(10, -1), dim=1).mean()
l2_pgd = torch.norm((x_pgd - x_demo[:10]).view(10, -1), dim=1).mean()
linf_fgsm = torch.max(torch.abs(x_fgsm - x_demo[:10]))
linf_pgd = torch.max(torch.abs(x_pgd - x_demo[:10]))

print(f"FGSM - L2: {l2_fgsm:.4f}, L∞: {linf_fgsm:.4f}")
print(f"PGD  - L2: {l2_pgd:.4f}, L∞: {linf_pgd:.4f}")
print("\nPGD typically finds stronger adversarial examples")

print("\n" + "=" * 70)
print("Key Takeaways:")
print("=" * 70)
print("1. PGD: Iterative attack with random start → stronger than FGSM")
print("2. C&W: Optimization-based → minimal perturbations, very strong")
print("3. Adversarial Training: Most effective defense, 10-20× slower")
print("4. TRADES: Better clean accuracy than standard adv training")
print("5. Randomized Smoothing: Provable certified robustness (L2)")
print("6. Detection: Useful but can be evaded by adaptive attacks")
print("7. Evaluation: Use AutoAttack or multiple strong attacks")
print("=" * 70)

1. Adversarial Examples¶

Definition¶

\[x' = x + \delta, \quad \|\delta\|_p \leq \epsilon\]

where \(f(x') \neq f(x)\) but \(x' \approx x\).

FGSM (Fast Gradient Sign Method)¶

\[x' = x + \epsilon \cdot \text{sign}(\nabla_x \mathcal{L}(\theta, x, y))\]


class SimpleNet(nn.Module):
    """Simple CNN for MNIST."""
    
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)
    
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

def fgsm_attack(model, x, y, epsilon=0.3):
    """FGSM: a single signed-gradient step of size epsilon."""
    # Work on a fresh leaf tensor so the caller's input is not modified
    x = x.clone().detach().requires_grad_(True)
    
    output = model(x)
    loss = F.cross_entropy(output, y)
    
    model.zero_grad()
    loss.backward()
    
    # Step in the direction that increases the loss
    perturbation = epsilon * x.grad.sign()
    x_adv = x + perturbation
    x_adv = torch.clamp(x_adv, 0, 1)  # keep pixels in the valid [0, 1] range
    
    return x_adv.detach()

# Load data
transform = transforms.Compose([transforms.ToTensor()])
mnist = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_mnist = datasets.MNIST('./data', train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(mnist, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_mnist, batch_size=1000)

print("Data loaded")

Train Standard Model¶

Before studying adversarial robustness, we train a standard classifier on clean data as a baseline. Standard training minimizes the empirical cross-entropy loss: \(\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i | x_i)\). This produces a model with high clean accuracy; as we will see, though, it is surprisingly fragile to small, carefully crafted input perturbations. The gap between clean accuracy and adversarial accuracy is the primary metric that motivates adversarial robustness research.

def train_standard(model, train_loader, n_epochs=5):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    
    for epoch in range(n_epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            
            output = model(x)
            loss = F.cross_entropy(output, y)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        print(f"Epoch {epoch+1}/{n_epochs} complete")

model = SimpleNet().to(device)
train_standard(model, train_loader, n_epochs=5)

Evaluate Robustness¶

We evaluate robustness by attacking the trained model with FGSM (Fast Gradient Sign Method) and PGD (Projected Gradient Descent), the two most widely used attack algorithms. FGSM generates adversarial examples in a single step: \(x' = x + \epsilon \cdot \text{sign}(\nabla_x \mathcal{L})\), while PGD iterates this process with smaller steps and projects back onto the \(\epsilon\)-ball after each step. Testing across multiple perturbation budgets \(\epsilon\) produces a robustness curve showing how accuracy degrades as the attacker grows stronger. A steep drop at small \(\epsilon\) values exposes the model's reliance on non-robust, imperceptible features.

def evaluate_robustness(model, test_loader, epsilon=0.3):
    model.eval()
    
    clean_correct = 0
    adv_correct = 0
    total = 0
    
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        
        # Clean accuracy
        with torch.no_grad():
            output = model(x)
            pred = output.argmax(dim=1)
            clean_correct += (pred == y).sum().item()
        
        # Adversarial accuracy
        x_adv = fgsm_attack(model, x, y, epsilon)
        with torch.no_grad():
            output_adv = model(x_adv)
            pred_adv = output_adv.argmax(dim=1)
            adv_correct += (pred_adv == y).sum().item()
        
        total += y.size(0)
    
    clean_acc = 100 * clean_correct / total
    adv_acc = 100 * adv_correct / total
    
    return clean_acc, adv_acc

clean_acc, adv_acc = evaluate_robustness(model, test_loader, epsilon=0.3)
print(f"Clean Accuracy: {clean_acc:.2f}%")
print(f"Adversarial Accuracy (ε=0.3): {adv_acc:.2f}%")

Visualize Attacks¶

Visualizing adversarial examples side-by-side with their clean originals makes the threat concrete: the perturbations are typically imperceptible to humans (they look like faint noise), yet they completely change the model's prediction. Displaying the perturbation pattern itself (amplified for visibility) reveals which pixels the attack modifies most, often edges and texture regions that the model relies on for classification. This visualization is essential for communicating adversarial risks to non-technical stakeholders and motivates the need for robust training procedures.

# Get sample
x_test, y_test = next(iter(test_loader))
x_sample = x_test[:5].to(device)
y_sample = y_test[:5].to(device)

# Generate adversarial
epsilons = [0.0, 0.1, 0.2, 0.3]

fig, axes = plt.subplots(5, 4, figsize=(12, 15))

for i in range(5):
    for j, eps in enumerate(epsilons):
        if eps == 0:
            img = x_sample[i:i+1]  # keep the batch dimension for the model
        else:
            img = fgsm_attack(model, x_sample[i:i+1], y_sample[i:i+1], eps)
        
        with torch.no_grad():
            pred = model(img).argmax(dim=1).item()
        
        axes[i, j].imshow(img[0, 0].cpu(), cmap='gray')
        axes[i, j].set_title(f"ε={eps}, pred={pred}", fontsize=9)
        axes[i, j].axis('off')

plt.suptitle('FGSM Attack Examples', fontsize=13)
plt.tight_layout()
plt.show()

5. PGD Attack¶

Projected Gradient Descent¶

\[x^{t+1} = \text{Proj}_{\|\delta\| \leq \epsilon}\left(x^t + \alpha \cdot \text{sign}(\nabla_x \mathcal{L})\right)\]

def pgd_attack(model, x, y, epsilon=0.3, alpha=0.01, num_iter=40):
    """PGD attack."""
    x_adv = x.clone().detach()
    
    # Random start
    x_adv = x_adv + torch.empty_like(x_adv).uniform_(-epsilon, epsilon)
    x_adv = torch.clamp(x_adv, 0, 1)
    
    for _ in range(num_iter):
        x_adv.requires_grad = True
        
        output = model(x_adv)
        loss = F.cross_entropy(output, y)
        
        model.zero_grad()
        loss.backward()
        
        # Update
        x_adv = x_adv + alpha * x_adv.grad.sign()
        
        # Project
        delta = torch.clamp(x_adv - x, -epsilon, epsilon)
        x_adv = torch.clamp(x + delta, 0, 1).detach()
    
    return x_adv

# Compare FGSM vs PGD
x_sample = x_test[:100].to(device)
y_sample = y_test[:100].to(device)

x_fgsm = fgsm_attack(model, x_sample, y_sample, 0.3)
x_pgd = pgd_attack(model, x_sample, y_sample, 0.3)

with torch.no_grad():
    pred_fgsm = model(x_fgsm).argmax(dim=1)
    pred_pgd = model(x_pgd).argmax(dim=1)
    
    fgsm_success = (pred_fgsm != y_sample).sum().item()
    pgd_success = (pred_pgd != y_sample).sum().item()

print(f"FGSM attack success: {fgsm_success}/100")
print(f"PGD attack success: {pgd_success}/100")

Adversarial Training¶

Adversarial training (Madry et al., 2018) is the most effective known defense: instead of minimizing loss on clean inputs, we minimize loss on adversarially perturbed inputs: \(\min_\theta \mathbb{E}[\max_{\|\delta\| \le \epsilon} \mathcal{L}(f_\theta(x + \delta), y)]\). At each training step, we first generate an adversarial example for each batch element (multi-step PGD in the full Madry et al. procedure; the demo below uses single-step FGSM to keep training fast), then update the model weights using the adversarial loss. This min-max optimization is several times more expensive than standard training, roughly K + 1 forward-backward passes per step for K-step PGD, because the inner maximization must be solved anew for every batch. The resulting model achieves substantially higher adversarial accuracy, though typically at the cost of some clean accuracy, a trade-off known as the accuracy-robustness tension.

def train_adversarial(model, train_loader, n_epochs=5, epsilon=0.3):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    
    for epoch in range(n_epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            
            # Generate adversarial examples
            x_adv = fgsm_attack(model, x, y, epsilon)
            
            # Train on adversarial
            output = model(x_adv)
            loss = F.cross_entropy(output, y)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        print(f"Epoch {epoch+1}/{n_epochs} complete")

# Train robust model
robust_model = SimpleNet().to(device)
train_adversarial(robust_model, train_loader, n_epochs=5, epsilon=0.3)

# Evaluate
clean_acc_robust, adv_acc_robust = evaluate_robustness(robust_model, test_loader, epsilon=0.3)
print(f"\nRobust Model - Clean: {clean_acc_robust:.2f}%, Adversarial: {adv_acc_robust:.2f}%")
print(f"Standard Model - Clean: {clean_acc:.2f}%, Adversarial: {adv_acc:.2f}%")
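The training loop above uses single-step FGSM in the inner loop for speed; the full Madry et al. procedure solves the inner maximization with multi-step PGD. A self-contained sketch of one PGD training step, with a toy linear model standing in for SimpleNet (the `pgd_train_step` helper is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pgd_train_step(model, optimizer, x, y, epsilon=0.3, alpha=0.01, num_iter=7):
    """One step of Madry-style adversarial training: inner PGD max, outer SGD min."""
    model.eval()  # freeze batch-norm/dropout behavior while crafting the attack
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0, 1)
    for _ in range(num_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back onto the epsilon-ball around x and the valid pixel range
        x_adv = (x + (x_adv - x).clamp(-epsilon, epsilon)).clamp(0, 1).detach()
    
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)  # outer minimization on the adversarial batch
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage on random data
toy = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
opt = torch.optim.Adam(toy.parameters(), lr=1e-3)
x = torch.rand(32, 1, 28, 28)
y = torch.randint(0, 10, (32,))
loss_val = pgd_train_step(toy, opt, x, y)
print(f"adversarial loss: {loss_val:.3f}")
```

Using `torch.autograd.grad` rather than `.backward()` in the inner loop avoids accumulating gradients on the model parameters while the attack is being generated.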

Robustness vs Epsilon¶

Plotting adversarial accuracy as a function of the perturbation budget \(\epsilon\) for both standard and adversarially trained models reveals the impact of robust training. The standard model's accuracy drops precipitously even at tiny \(\epsilon\) values, while the adversarially trained model maintains reasonable accuracy up to the \(\epsilon\) it was trained against. Beyond the training \(\epsilon\), even robust models eventually break; there is no free lunch in adversarial robustness. This analysis helps practitioners choose an appropriate \(\epsilon\) for their threat model and budget the computational cost of adversarial training accordingly.

epsilons = np.linspace(0, 0.5, 11)
standard_accs = []
robust_accs = []

for eps in epsilons:
    _, acc_std = evaluate_robustness(model, test_loader, eps)
    _, acc_rob = evaluate_robustness(robust_model, test_loader, eps)
    standard_accs.append(acc_std)
    robust_accs.append(acc_rob)

plt.figure(figsize=(10, 6))
plt.plot(epsilons, standard_accs, 'b-o', label='Standard Model', markersize=6)
plt.plot(epsilons, robust_accs, 'r-o', label='Robust Model', markersize=6)
plt.xlabel('ε (perturbation)', fontsize=11)
plt.ylabel('Accuracy (%)', fontsize=11)
plt.title('Adversarial Robustness', fontsize=12)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Summary¶

Attacks:¶

  • FGSM: Single-step, fast

  • PGD: Multi-step, stronger

  • C&W: Optimization-based

Defenses:¶

  1. Adversarial training

  2. Certified defenses

  3. Randomized smoothing

  4. Input transformations

Tradeoffs:¶

  • Robustness vs accuracy

  • Computation cost

  • Threat model assumptions

Applications:¶

  • Security-critical systems

  • Autonomous vehicles

  • Medical diagnosis

  • Malware detection

Next Steps:¶

  • Study certified defenses

  • Explore adaptive attacks

  • Learn robustness verification