import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torchvision import datasets, transforms

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

1. Bayesian Neural Networks

Posterior:

\[p(w | \mathcal{D}) = \frac{p(\mathcal{D} | w) p(w)}{p(\mathcal{D})}\]

Predictive:

\[p(y^* | x^*, \mathcal{D}) = \int p(y^* | x^*, w) p(w | \mathcal{D}) dw\]

Variational Inference:

Approximate \(p(w|\mathcal{D})\) with \(q(w|\theta)\):

\[\mathcal{L} = \mathbb{E}_{q(w)}[\log p(\mathcal{D}|w)] - \text{KL}(q(w) \| p(w))\]

📚 Reference Materials:

Bayesian Neural Networks: Deep Theory and Variational Inference

1. Bayesian Inference for Neural Networks

The core idea of Bayesian Neural Networks (BNNs) is to treat network weights as random variables with distributions rather than point estimates.

Posterior Distribution:

Given dataset \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N\), the posterior over weights is:

\[p(w | \mathcal{D}) = \frac{p(\mathcal{D} | w) p(w)}{p(\mathcal{D})} = \frac{\prod_{i=1}^N p(y_i | x_i, w) \cdot p(w)}{\int \prod_{i=1}^N p(y_i | x_i, w') \cdot p(w') dw'}\]

The Challenge: The denominator (evidence) requires integrating over all possible weight configurations, which is intractable for neural networks with millions of parameters.

Predictive Distribution:

For a new input \(x^*\), we want:

\[p(y^* | x^*, \mathcal{D}) = \int p(y^* | x^*, w) p(w | \mathcal{D}) dw\]

This marginalizes over the posterior, naturally providing uncertainty quantification.
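In practice this integral is approximated by Monte Carlo: sample weights from the posterior, push each sample through the model, and average. A minimal sketch for a 1-D linear model with a pretend Gaussian posterior over its single weight (all numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y* = w * x* + noise, with a pretend posterior w | D ~ N(1.0, 0.1^2)
# and observation noise eps ~ N(0, 0.05^2)
post_mean, post_std = 1.0, 0.1
noise_std = 0.05
x_star = 2.0

# Monte Carlo approximation of p(y* | x*, D): sample weights, then outputs
S = 200_000
w_samples = rng.normal(post_mean, post_std, size=S)
y_samples = w_samples * x_star + rng.normal(0.0, noise_std, size=S)

mc_mean = y_samples.mean()
mc_var = y_samples.var()

# Analytic check: mean = x* E[w], var = x*^2 Var[w] + noise variance
true_mean = x_star * post_mean
true_var = x_star**2 * post_std**2 + noise_std**2
print(f"MC mean {mc_mean:.3f} vs {true_mean:.3f}, MC var {mc_var:.4f} vs {true_var:.4f}")
```

For a BNN the exact posterior is unavailable, so the samples come from an approximation (variational posterior, dropout masks, or ensemble members), as developed below.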

2. Variational Inference for BNNs

Since exact inference is intractable, we use variational inference to approximate \(p(w | \mathcal{D})\) with a simpler distribution \(q(w | \theta)\) parameterized by \(\theta\) (e.g., Gaussian with mean \(\mu\) and variance \(\sigma^2\)).

Evidence Lower Bound (ELBO):

We maximize the ELBO instead of the marginal likelihood:

\[\mathcal{L}(\theta) = \mathbb{E}_{q(w|\theta)}[\log p(\mathcal{D} | w)] - \text{KL}(q(w|\theta) \| p(w))\]

Derivation:

Starting from the log evidence:

\[\log p(\mathcal{D}) = \log \int p(\mathcal{D}, w) dw = \log \int \frac{p(\mathcal{D}, w)}{q(w|\theta)} q(w|\theta) dw\]

By Jensen's inequality (log is concave):

\[\log p(\mathcal{D}) \geq \mathbb{E}_{q(w|\theta)}[\log \frac{p(\mathcal{D}, w)}{q(w|\theta)}] = \mathbb{E}_q[\log p(\mathcal{D}|w)] + \mathbb{E}_q[\log \frac{p(w)}{q(w|\theta)}]\]
\[= \mathbb{E}_q[\log p(\mathcal{D}|w)] - \text{KL}(q(w|\theta) \| p(w)) = \mathcal{L}(\theta)\]

Components:

  1. Likelihood Term: \(\mathbb{E}_{q(w|\theta)}[\log p(\mathcal{D} | w)]\) - Data fit (reconstruction)

  2. KL Regularizer: \(\text{KL}(q(w|\theta) \| p(w))\) - Complexity penalty (keep weights close to prior)

3. Bayes by Backprop Algorithm

Reparameterization Trick:

To compute gradients w.r.t. \(\theta\), we use:

\[w = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]

where \(\sigma = \log(1 + \exp(\rho))\) (softplus to ensure positivity).
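A minimal sketch of this trick in PyTorch: softplus keeps σ positive, and because the sample is a deterministic function of (μ, ρ, ε), gradients flow back to both variational parameters (the loss here is an arbitrary stand-in for the ELBO terms):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

mu = torch.zeros(5, requires_grad=True)   # variational mean
rho = torch.zeros(5, requires_grad=True)  # unconstrained; softplus(rho) = sigma

eps = torch.randn(5)                 # epsilon ~ N(0, I), no gradient needed
sigma = F.softplus(rho)              # sigma = log(1 + exp(rho)) > 0
w = mu + sigma * eps                 # reparameterized weight sample

loss = (w ** 2).sum()                # stand-in for the ELBO terms
loss.backward()                      # gradients reach mu AND rho through w
```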

Loss Function (per minibatch):

\[\mathcal{L}_{\text{BB}} = \frac{1}{M} \sum_{i=1}^M \log q(w^{(i)}|\theta) - \log p(w^{(i)}) - \log p(\mathcal{D} | w^{(i)})\]

where \(w^{(i)} \sim q(w|\theta)\) are sampled weights.

Algorithm:

For each minibatch:
  1. Sample ε ~ N(0, I)
  2. Compute w = μ + σ ⊙ ε
  3. Forward pass: ŷ = f(x; w)
  4. Compute NLL: -log p(D|w) = MSE or cross-entropy loss
  5. Compute KL: KL(q(w|θ) || p(w))
  6. Total loss: L = NLL + λ·KL
  7. Backpropagate and update μ, ρ
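The steps above can be sketched end-to-end for a single scalar weight, using the closed-form Gaussian KL to a standard normal prior (the data and the KL weight λ are made up for illustration):

```python
import torch

torch.manual_seed(0)

# Variational parameters for one weight; prior p(w) = N(0, 1)
mu = torch.tensor(0.0, requires_grad=True)
rho = torch.tensor(-3.0, requires_grad=True)
opt = torch.optim.Adam([mu, rho], lr=1e-2)

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.1, 3.9, 6.2])     # roughly y = 2x
lam = 0.1                              # KL weight (lambda)

for step in range(500):
    eps = torch.randn(())                          # 1. sample epsilon
    sigma = torch.log1p(torch.exp(rho))            #    softplus
    w = mu + sigma * eps                           # 2. reparameterize
    y_hat = w * x                                  # 3. forward pass
    nll = ((y_hat - y) ** 2).mean()                # 4. NLL (MSE)
    kl = 0.5 * (sigma**2 + mu**2 - torch.log(sigma**2) - 1)  # 5. KL to N(0, 1)
    loss = nll + lam * kl                          # 6. total loss
    opt.zero_grad()
    loss.backward()                                # 7. update mu, rho
    opt.step()

print(f"posterior mean {mu.item():.2f}, std {torch.log1p(torch.exp(rho)).item():.3f}")
```

The posterior mean ends up near the least-squares slope (about 2), while the data term keeps the posterior standard deviation small.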

KL Divergence (Gaussian case):

For \(q(w_j|\theta) = \mathcal{N}(\mu_j, \sigma_j^2)\) and \(p(w_j) = \mathcal{N}(0, \sigma_p^2)\):

\[\text{KL}(q \| p) = \frac{1}{2} \sum_j \left[ \frac{\sigma_j^2 + \mu_j^2}{\sigma_p^2} - \log \frac{\sigma_j^2}{\sigma_p^2} - 1 \right]\]
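This closed form is easy to check against torch.distributions (a small sanity sketch; the numbers are arbitrary):

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
mu = torch.randn(10)            # posterior means
sigma = torch.rand(10) + 0.1    # posterior stds (kept positive)
prior_std = 1.0

# Closed form: 0.5 * sum[(sigma^2 + mu^2)/sigma_p^2 - log(sigma^2/sigma_p^2) - 1]
kl_manual = 0.5 * torch.sum(
    (sigma**2 + mu**2) / prior_std**2
    - torch.log(sigma**2 / prior_std**2)
    - 1.0
)

# Reference value from the registered Gaussian-Gaussian KL
kl_ref = kl_divergence(Normal(mu, sigma), Normal(torch.zeros(10), prior_std)).sum()
print(kl_manual.item(), kl_ref.item())
```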

4. Uncertainty Decomposition

BNNs provide two types of uncertainty:

A. Aleatoric Uncertainty (Data Uncertainty):

  • Inherent noise in the data

  • Cannot be reduced with more data or model capacity

  • Example: Sensor noise, label ambiguity

  • Modeled by output noise: \(y = f(x; w) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma_{\text{noise}}^2)\)

B. Epistemic Uncertainty (Model Uncertainty):

  • Uncertainty about model parameters \(w\)

  • Can be reduced with more training data

  • Example: Uncertainty far from training data

  • Captured by weight distribution \(p(w | \mathcal{D})\)

Total Predictive Variance:

\[\mathbb{V}[y^*] = \underbrace{\mathbb{E}_w[\sigma_{\text{noise}}^2(x^*; w)]}_{\text{Aleatoric}} + \underbrace{\mathbb{V}_w[\mu(x^*; w)]}_{\text{Epistemic}}\]

Law of Total Variance:

\[\mathbb{V}[y^*] = \mathbb{E}_w[\mathbb{V}[y^* | w]] + \mathbb{V}_w[\mathbb{E}[y^* | w]]\]
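A quick numeric check of this identity for the toy model y = w·x + ε, with w ~ N(0, σ_w²) and ε ~ N(0, σ_n²) (illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(1)
x, sigma_w, sigma_n = 2.0, 0.3, 0.1
S = 500_000

w = rng.normal(0.0, sigma_w, size=S)          # "posterior" samples of w
y = w * x + rng.normal(0.0, sigma_n, size=S)  # predictive samples

total = y.var()
aleatoric = sigma_n**2                         # E_w[Var[y|w]] (constant here)
epistemic = (w * x).var()                      # Var_w[E[y|w]]

# Law of total variance: the two pieces add up to the total
print(total, aleatoric + epistemic)
```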

5. MC Dropout as Approximate Bayesian Inference

Gal & Ghahramani (2016) showed that dropout training approximates variational inference.

Training: Apply dropout with rate \(p\) during training:

\[\hat{y} = \text{Softmax}(W_2 \cdot (m \odot \text{ReLU}(W_1 x)))\]

where \(m \sim \text{Bernoulli}(1-p)\) is the dropout mask.

Inference: Keep dropout active during test time and average predictions:

\[\mathbb{E}[y^*] \approx \frac{1}{T} \sum_{t=1}^T f(x^*; w^{(t)})\]

Uncertainty Estimation:

\[\mathbb{V}[y^*] \approx \frac{1}{T} \sum_{t=1}^T (f(x^*; w^{(t)}) - \bar{y})^2\]

Connection to Variational Inference:

Dropout approximates \(q(w|\theta)\) with a Bernoulli distribution over weight masking. The variational distribution is:

\[q(w|\theta) = \prod_l \text{Bernoulli}(m_l; 1-p)\]

The ELBO corresponds to minimizing:

\[\mathcal{L} = -\frac{1}{N} \sum_i \log p(y_i | x_i, w) + \lambda \|w\|_2^2\]

where \(\lambda = \frac{p l^2}{2N\tau}\) links the weight-decay coefficient to the dropout rate \(p\), prior length-scale \(l\), and model precision \(\tau\).

6. Deep Ensembles

An alternative to full Bayesian inference: train \(M\) neural networks with different random initializations.

Predictive Mean:

\[\mu_{\text{ens}}(x) = \frac{1}{M} \sum_{m=1}^M f_m(x)\]

Predictive Variance:

\[\sigma_{\text{ens}}^2(x) = \frac{1}{M} \sum_{m=1}^M [f_m(x) - \mu_{\text{ens}}(x)]^2\]
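With predictions stacked as an (M, N) array, both quantities are one-liners (a small sketch with made-up values):

```python
import numpy as np

# Predictions of M=3 models at N=4 test points (made-up values)
preds = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.2, 1.8, 3.3, 3.9],
    [0.8, 2.2, 2.7, 4.1],
])

mu_ens = preds.mean(axis=0)                     # predictive mean per test point
var_ens = ((preds - mu_ens) ** 2).mean(axis=0)  # predictive variance per test point

print(mu_ens, var_ens)
```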

Advantages:

  • Simple to implement (no architecture changes)

  • Embarrassingly parallel training

  • Often outperforms BNNs in practice

Disadvantages:

  • Computationally expensive (\(M \times\) cost)

  • Not truly Bayesian (no prior, no marginalization)

7. Comparison: BNN vs MC Dropout vs Ensembles

Method      | Training Cost        | Inference Cost            | Uncertainty Quality | Calibration
----------- | -------------------- | ------------------------- | ------------------- | -----------
BNN (VI)    | High (KL term)       | Medium (T samples)        | High (principled)   | Good
MC Dropout  | Low (standard)       | Medium (T forward passes) | Medium              | Fair
Ensembles   | Very high (M models) | High (M forward passes)   | High (empirical)    | Excellent

When to use:

  • BNN: Need principled uncertainty, small models, interpretability

  • MC Dropout: Quick uncertainty estimates, existing models

  • Ensembles: Best performance, resources available, calibration critical

8. Advanced Topics

A. Last-Layer Bayesian Approximation:

  • Only make the final layer Bayesian (computationally cheaper)

  • Feature extractor remains deterministic

  • Works well when representation learning is more important

B. Structured Variational Distributions:

  • Diagonal Gaussian: \(q(w) = \prod_j \mathcal{N}(w_j | \mu_j, \sigma_j^2)\) (independent weights)

  • Matrix Gaussian: \(q(W) = \mathcal{N}(W | M, \Sigma)\) (correlated weights)

  • Normalizing Flows: \(q(w) = q_0(z) \left| \det \frac{\partial f}{\partial z} \right|^{-1}\) (flexible)

C. Temperature Scaling for Calibration:

After training, scale logits by temperature \(T\):

\[p(y=k|x) = \frac{\exp(z_k / T)}{\sum_j \exp(z_j / T)}\]

Optimize \(T\) on validation set to minimize NLL. Improves calibration without retraining.
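A sketch of the effect: dividing the logits by T > 1 softens the distribution, lowering the top-class confidence without changing the argmax (the logits are made up):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax for a 1-D logit vector
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([4.0, 1.0, 0.0])   # overconfident logits (illustrative)

p1 = softmax(logits)                 # T = 1: original probabilities
p2 = softmax(logits / 2.0)           # T = 2: softened probabilities

print(p1.max(), p2.max())
```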

class BayesianLinear(nn.Module):
    """Bayesian linear layer with Gaussian weights."""
    
    def __init__(self, in_features, out_features, prior_std=1.0):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.prior_std = prior_std
        
        # Weight parameters
        self.weight_mu = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.weight_rho = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        
        # Bias parameters
        self.bias_mu = nn.Parameter(torch.randn(out_features) * 0.1)
        self.bias_rho = nn.Parameter(torch.randn(out_features) * 0.1)
    
    def forward(self, x):
        # Sample weights
        weight_std = torch.log1p(torch.exp(self.weight_rho))
        weight = self.weight_mu + weight_std * torch.randn_like(self.weight_mu)
        
        # Sample bias
        bias_std = torch.log1p(torch.exp(self.bias_rho))
        bias = self.bias_mu + bias_std * torch.randn_like(self.bias_mu)
        
        return F.linear(x, weight, bias)
    
    def kl_divergence(self):
        """KL divergence to prior."""
        weight_std = torch.log1p(torch.exp(self.weight_rho))
        bias_std = torch.log1p(torch.exp(self.bias_rho))
        
        # KL for weights
        kl_weight = 0.5 * torch.sum(
            (self.weight_mu ** 2 + weight_std ** 2) / (self.prior_std ** 2)
            - torch.log(weight_std ** 2 / (self.prior_std ** 2))
            - 1
        )
        
        # KL for bias
        kl_bias = 0.5 * torch.sum(
            (self.bias_mu ** 2 + bias_std ** 2) / (self.prior_std ** 2)
            - torch.log(bias_std ** 2 / (self.prior_std ** 2))
            - 1
        )
        
        return kl_weight + kl_bias

print("BayesianLinear defined")
# Advanced BNN Implementation: Comparison of Methods


# ============================================================
# 1. MC Dropout Network
# ============================================================

class MCDropoutNN(nn.Module):
    """Network with MC Dropout for uncertainty estimation."""
    
    def __init__(self, input_dim, hidden_dim, output_dim, dropout_rate=0.1):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(p=dropout_rate)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x)  # Apply dropout
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x
    
    def predict_with_uncertainty(self, x, n_samples=100):
        """MC Dropout inference: keep dropout active."""
        self.train()  # Enable dropout during inference
        predictions = []
        
        with torch.no_grad():
            for _ in range(n_samples):
                pred = self.forward(x)
                predictions.append(pred.cpu().numpy())
        
        predictions = np.array(predictions)
        mean = predictions.mean(axis=0)
        std = predictions.std(axis=0)
        
        return mean, std

# ============================================================
# 2. Deep Ensemble
# ============================================================

class EnsembleNN:
    """Ensemble of neural networks for uncertainty."""
    
    def __init__(self, input_dim, hidden_dim, output_dim, n_models=5):
        self.models = []
        for _ in range(n_models):
            model = nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, output_dim)
            )
            self.models.append(model)
        
        self.n_models = n_models
    
    def to(self, device):
        for model in self.models:
            model.to(device)
        return self
    
    def train_model(self, idx, X, y, optimizer, n_epochs=1000):
        """Train individual model in ensemble."""
        model = self.models[idx]
        model.train()
        
        for epoch in range(n_epochs):
            y_pred = model(X)
            loss = F.mse_loss(y_pred, y)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    
    def predict_with_uncertainty(self, x):
        """Ensemble prediction."""
        predictions = []
        
        for model in self.models:
            model.eval()
            with torch.no_grad():
                pred = model(x)
                predictions.append(pred.cpu().numpy())
        
        predictions = np.array(predictions)
        mean = predictions.mean(axis=0)
        std = predictions.std(axis=0)
        
        return mean, std

# ============================================================
# 3. Uncertainty Decomposition Utilities
# ============================================================

def decompose_uncertainty(model, x_test, y_test, n_samples=100, noise_std=0.1):
    """
    Decompose total uncertainty into aleatoric and epistemic.
    
    For regression: y = f(x; w) + ε, ε ~ N(0, σ²)
    
    Total variance: Var[y*] = E_w[σ²] + Var_w[μ(x*; w)]
                              ^^^^^^^^   ^^^^^^^^^^^^^^^^
                              Aleatoric   Epistemic
    
    Note: y_test is unused; it is kept for API symmetry.
    """
    model.eval()
    predictions = []
    
    with torch.no_grad():
        for _ in range(n_samples):
            pred = model(x_test)
            predictions.append(pred.cpu().numpy())
    
    predictions = np.array(predictions).squeeze()
    
    # Epistemic uncertainty: variance of predictions across weight samples
    epistemic = predictions.var(axis=0)
    
    # Aleatoric uncertainty: inherent noise (known or estimated)
    aleatoric = noise_std ** 2 * np.ones_like(epistemic)
    
    # Total uncertainty
    total = aleatoric + epistemic
    
    return {
        'total': total,
        'aleatoric': aleatoric,
        'epistemic': epistemic,
        'predictions': predictions
    }

# ============================================================
# 4. Calibration Metrics
# ============================================================

def compute_calibration_curve(confidences, accuracies, n_bins=10):
    """
    Expected Calibration Error (ECE).
    
    ECE = Σ_m |acc(B_m) - conf(B_m)| · |B_m| / N
    
    Perfect calibration: confidence = accuracy
    """
    bins = np.linspace(0, 1, n_bins + 1)
    bin_indices = np.digitize(confidences, bins) - 1
    bin_indices = np.clip(bin_indices, 0, n_bins - 1)  # keep confidence == 1.0 in the last bin
    
    ece = 0.0
    bin_accs = []
    bin_confs = []
    bin_counts = []
    
    for i in range(n_bins):
        mask = bin_indices == i
        if mask.sum() > 0:
            bin_acc = accuracies[mask].mean()
            bin_conf = confidences[mask].mean()
            bin_count = mask.sum()
            
            ece += np.abs(bin_acc - bin_conf) * bin_count / len(confidences)
            
            bin_accs.append(bin_acc)
            bin_confs.append(bin_conf)
            bin_counts.append(bin_count)
        else:
            bin_accs.append(0)
            bin_confs.append(0)
            bin_counts.append(0)
    
    return ece, bin_accs, bin_confs, bin_counts
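The binning logic can be sanity-checked by hand on a tiny example (this mirrors the computation rather than calling the function above; all values are made up):

```python
import numpy as np

# Four predictions: confidences and whether each was correct (1 = correct)
confidences = np.array([0.6, 0.7, 0.9, 1.0])
accuracies  = np.array([1.0, 0.0, 1.0, 1.0])

# Split by hand into two bins at 0.8 for illustration
low = confidences < 0.8
ece = 0.0
for mask in (low, ~low):
    acc = accuracies[mask].mean()     # bin accuracy
    conf = confidences[mask].mean()   # bin confidence
    ece += abs(acc - conf) * mask.sum() / len(confidences)

# Low bin: |0.5 - 0.65| * 2/4 = 0.075; high bin: |1.0 - 0.95| * 2/4 = 0.025
print(ece)
```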

def temperature_scaling(logits, labels, T_range=(0.1, 5.0), n_trials=50):
    """
    Find optimal temperature T to minimize NLL.
    
    Calibrated probabilities: p(y|x) = softmax(z/T)
    """
    best_T = 1.0
    best_nll = float('inf')
    
    temperatures = np.linspace(T_range[0], T_range[1], n_trials)
    
    for T in temperatures:
        scaled_logits = logits / T
        log_probs = F.log_softmax(torch.FloatTensor(scaled_logits), dim=1)
        nll = F.nll_loss(log_probs, torch.LongTensor(labels)).item()
        
        if nll < best_nll:
            best_nll = nll
            best_T = T
    
    return best_T
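A self-contained sketch of the same grid search on synthetic overconfident logits (the data-generation scheme is made up for illustration): scaling up the logit magnitudes makes wrong predictions confidently wrong, so a temperature above 1 improves the NLL.

```python
import numpy as np
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic 3-class problem: often right, but far too confident
n, k = 500, 3
labels = torch.randint(0, k, (n,))
noise = torch.randn(n, k)
logits = 10.0 * (F.one_hot(labels, k).float() + 0.8 * noise)  # inflated logits

def nll_at(T):
    # NLL of temperature-scaled probabilities softmax(z / T)
    return F.nll_loss(F.log_softmax(logits / T, dim=1), labels).item()

temps = np.linspace(0.1, 5.0, 50)
best_T = min(temps, key=nll_at)

print(f"NLL at T=1: {nll_at(1.0):.3f}, at T={best_T:.2f}: {nll_at(best_T):.3f}")
```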

print("Advanced BNN implementations loaded: MC Dropout, Ensembles, Uncertainty Decomposition, Calibration")

BNN for Regression

A Bayesian Neural Network places probability distributions over the network weights instead of learning single point estimates. Each weight \(w\) is represented by a distribution \(q(w) = \mathcal{N}(\mu_w, \sigma_w^2)\) where both the mean and variance are learnable parameters. During a forward pass, weights are sampled from their distributions, making each forward pass stochastic. Training optimizes the ELBO (Evidence Lower Bound): \(\mathcal{L} = \mathbb{E}_{q(w)}[\log p(y|x, w)] - \text{KL}(q(w) \| p(w))\), balancing data fit with a prior regularizer. The result is a model that provides principled uncertainty estimates for every prediction, distinguishing between confident and uncertain regions of the input space.

class BayesianNN(nn.Module):
    """Bayesian neural network."""
    
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = BayesianLinear(input_dim, hidden_dim)
        self.fc2 = BayesianLinear(hidden_dim, output_dim)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
    
    def kl_divergence(self):
        return self.fc1.kl_divergence() + self.fc2.kl_divergence()

print("BayesianNN defined")

Generate Data and Train

We generate synthetic regression data with clear regions of data density (where the model should be confident) and gaps (where the model should be uncertain). Training a BNN on this data with variational inference (using the reparameterization trick to draw differentiable weight samples) converges much like standard training but yields a distribution over functions rather than a single function. The KL divergence term acts as a regularizer, preventing the posterior from collapsing to a point estimate and ensuring meaningful uncertainty quantification.

# Data
def f(x):
    return np.sin(3*x)

np.random.seed(42)
X_train = np.random.uniform(-1, 1, 30).reshape(-1, 1)
y_train = f(X_train) + 0.1 * np.random.randn(30, 1)

X_train_t = torch.FloatTensor(X_train).to(device)
y_train_t = torch.FloatTensor(y_train).to(device)

# Model
model = BayesianNN(1, 64, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Train
n_epochs = 2000
losses = []

for epoch in range(n_epochs):
    # Forward (sample weights)
    y_pred = model(X_train_t)
    
    # ELBO loss
    nll = F.mse_loss(y_pred, y_train_t)
    kl = model.kl_divergence() / len(X_train)
    loss = nll + 0.01 * kl
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    losses.append(loss.item())
    
    if (epoch + 1) % 500 == 0:
        print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}, NLL: {nll.item():.4f}, KL: {kl.item():.4f}")

Predictive Uncertainty

Predictive uncertainty in a BNN is obtained by running multiple forward passes (each with different weight samples) and computing statistics of the outputs. The mean of the forward passes gives the expected prediction, while the variance decomposes into aleatoric uncertainty (inherent data noise, here treated as a known constant, though it can also be predicted by a dedicated noise output head) and epistemic uncertainty (model uncertainty due to limited data, captured by the spread across weight samples). This decomposition is uniquely valuable: epistemic uncertainty decreases with more training data, while aleatoric uncertainty does not; knowing which type dominates guides data collection and model improvement strategies.

# Test data
X_test = np.linspace(-1.5, 1.5, 200).reshape(-1, 1)
X_test_t = torch.FloatTensor(X_test).to(device)

# MC samples (BayesianLinear resamples weights on every forward pass, even in eval mode)
model.eval()
n_samples = 100
predictions = []

with torch.no_grad():
    for _ in range(n_samples):
        y_pred = model(X_test_t)
        predictions.append(y_pred.cpu().numpy())

predictions = np.array(predictions).squeeze()

# Statistics
mean_pred = predictions.mean(axis=0)
std_pred = predictions.std(axis=0)

print(f"Predictions: {predictions.shape}, Mean: {mean_pred.shape}")

Visualize Results

Plotting multiple sampled functions from the BNN posterior, along with the mean prediction and uncertainty bands, provides a visual analogue to Gaussian Process regression. Near training data, the sampled functions agree closely (low epistemic uncertainty); far from training data, they diverge (high epistemic uncertainty). Comparing the BNN's uncertainty estimates to a standard neural network (which produces only point predictions) and to a Gaussian Process (which provides exact uncertainty) highlights the trade-offs: BNNs scale better than GPs to large datasets and complex architectures while providing uncertainty estimates that standard networks lack.

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Predictions with uncertainty
axes[0].plot(X_test, f(X_test), 'k--', label='True', linewidth=2)
axes[0].plot(X_test, mean_pred, 'b-', label='BNN mean', linewidth=2)
axes[0].fill_between(X_test.ravel(), mean_pred - 2*std_pred, mean_pred + 2*std_pred, 
                      alpha=0.3, label='±2σ')
axes[0].scatter(X_train, y_train, c='r', s=50, zorder=10, label='Data')
axes[0].set_xlabel('x', fontsize=12)
axes[0].set_ylabel('y', fontsize=12)
axes[0].set_title('Bayesian NN Predictions', fontsize=13)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Training loss
axes[1].plot(losses)
axes[1].set_xlabel('Iteration', fontsize=12)
axes[1].set_ylabel('Loss', fontsize=12)
axes[1].set_title('Training Loss', fontsize=13)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
# Uncertainty Visualization and Method Comparison

# Generate comparison data
X_comp = np.linspace(-1.5, 1.5, 200).reshape(-1, 1)
X_comp_t = torch.FloatTensor(X_comp).to(device)

# ============================================================
# Train MC Dropout model
# ============================================================
mc_model = MCDropoutNN(1, 64, 1, dropout_rate=0.1).to(device)
mc_optimizer = torch.optim.Adam(mc_model.parameters(), lr=1e-2)

print("Training MC Dropout model...")
for epoch in range(1000):
    mc_model.train()
    y_pred = mc_model(X_train_t)
    loss = F.mse_loss(y_pred, y_train_t)
    
    mc_optimizer.zero_grad()
    loss.backward()
    mc_optimizer.step()

mc_mean, mc_std = mc_model.predict_with_uncertainty(X_comp_t, n_samples=100)

# ============================================================
# Train Ensemble
# ============================================================
ensemble = EnsembleNN(1, 64, 1, n_models=5).to(device)

print("Training ensemble (5 models)...")
for idx in range(ensemble.n_models):
    optimizer = torch.optim.Adam(ensemble.models[idx].parameters(), lr=1e-2)
    ensemble.train_model(idx, X_train_t, y_train_t, optimizer, n_epochs=1000)

ens_mean, ens_std = ensemble.predict_with_uncertainty(X_comp_t)

# ============================================================
# BNN predictions (already trained)
# ============================================================
bnn_predictions = []
model.eval()
with torch.no_grad():
    for _ in range(100):
        pred = model(X_comp_t)
        bnn_predictions.append(pred.cpu().numpy())

bnn_predictions = np.array(bnn_predictions).squeeze()
bnn_mean = bnn_predictions.mean(axis=0)
bnn_std = bnn_predictions.std(axis=0)

# ============================================================
# Visualization: Compare all methods
# ============================================================
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

methods = [
    ('BNN (Variational)', bnn_mean, bnn_std, 'blue'),
    ('MC Dropout', mc_mean.squeeze(), mc_std.squeeze(), 'green'),
    ('Ensemble (5 models)', ens_mean.squeeze(), ens_std.squeeze(), 'red')
]

for idx, (name, mean, std, color) in enumerate(methods):
    ax = axes[idx // 2, idx % 2]
    
    # True function
    ax.plot(X_comp, f(X_comp), 'k--', label='True function', linewidth=2, alpha=0.7)
    
    # Predictions
    ax.plot(X_comp, mean, color=color, label=f'{name} mean', linewidth=2)
    ax.fill_between(X_comp.ravel(), mean - 2*std, mean + 2*std, 
                     alpha=0.3, color=color, label='±2σ')
    
    # Training data
    ax.scatter(X_train, y_train, c='black', s=80, zorder=10, 
               edgecolors='white', linewidths=2, label='Training data')
    
    ax.set_xlabel('x', fontsize=13)
    ax.set_ylabel('y', fontsize=13)
    ax.set_title(f'{name} - Uncertainty Quantification', fontsize=14, fontweight='bold')
    ax.legend(fontsize=11, loc='upper left')
    ax.grid(True, alpha=0.3)
    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-1.5, 1.5)

# ============================================================
# Uncertainty comparison plot
# ============================================================
ax = axes[1, 1]
ax.plot(X_comp, bnn_std, 'b-', label='BNN', linewidth=2)
ax.plot(X_comp, mc_std.squeeze(), 'g-', label='MC Dropout', linewidth=2)
ax.plot(X_comp, ens_std.squeeze(), 'r-', label='Ensemble', linewidth=2)

# Highlight extrapolation regions
ax.axvspan(-1.5, -1.0, alpha=0.2, color='gray', label='Extrapolation')
ax.axvspan(1.0, 1.5, alpha=0.2, color='gray')

ax.set_xlabel('x', fontsize=13)
ax.set_ylabel('Predictive Std Dev (σ)', fontsize=13)
ax.set_title('Uncertainty Comparison Across Methods', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('bnn_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

# ============================================================
# Uncertainty Decomposition for BNN
# ============================================================
noise_std = 0.1
uncertainty_data = decompose_uncertainty(model, X_comp_t, None, n_samples=100, noise_std=noise_std)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: Total uncertainty
axes[0].plot(X_comp, np.sqrt(uncertainty_data['total']), 'purple', linewidth=2)
axes[0].fill_between(X_comp.ravel(), 0, np.sqrt(uncertainty_data['total']), alpha=0.3, color='purple')
axes[0].scatter(X_train, np.zeros_like(X_train), c='red', s=50, zorder=10, label='Training data')
axes[0].set_xlabel('x', fontsize=12)
axes[0].set_ylabel('Total Uncertainty (σ)', fontsize=12)
axes[0].set_title('Total Predictive Uncertainty', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].legend()

# Plot 2: Aleatoric vs Epistemic
axes[1].plot(X_comp, np.sqrt(uncertainty_data['aleatoric']), 'orange', 
             linewidth=2, label='Aleatoric (data noise)')
axes[1].plot(X_comp, np.sqrt(uncertainty_data['epistemic']), 'blue', 
             linewidth=2, label='Epistemic (model)')
axes[1].fill_between(X_comp.ravel(), 0, np.sqrt(uncertainty_data['aleatoric']), 
                     alpha=0.2, color='orange')
axes[1].fill_between(X_comp.ravel(), 0, np.sqrt(uncertainty_data['epistemic']), 
                     alpha=0.2, color='blue')
axes[1].set_xlabel('x', fontsize=12)
axes[1].set_ylabel('Uncertainty (σ)', fontsize=12)
axes[1].set_title('Uncertainty Decomposition', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

# Plot 3: Stacked uncertainty
axes[2].fill_between(X_comp.ravel(), 0, np.sqrt(uncertainty_data['aleatoric']), 
                     alpha=0.5, color='orange', label='Aleatoric')
axes[2].fill_between(X_comp.ravel(), 
                     np.sqrt(uncertainty_data['aleatoric']), 
                     np.sqrt(uncertainty_data['aleatoric']) + np.sqrt(uncertainty_data['epistemic']), 
                     alpha=0.5, color='blue', label='Epistemic')
axes[2].scatter(X_train, np.zeros_like(X_train), c='red', s=50, zorder=10)
axes[2].set_xlabel('x', fontsize=12)
axes[2].set_ylabel('Cumulative Uncertainty (σ)', fontsize=12)
axes[2].set_title('Stacked Uncertainty Components', fontsize=13, fontweight='bold')
axes[2].legend(fontsize=11, loc='upper left')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('uncertainty_decomposition.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n" + "="*70)
print("UNCERTAINTY ANALYSIS")
print("="*70)
print(f"Mean Aleatoric Uncertainty:  {np.sqrt(uncertainty_data['aleatoric']).mean():.4f}")
print(f"Mean Epistemic Uncertainty:  {np.sqrt(uncertainty_data['epistemic']).mean():.4f}")
print(f"Mean Total Uncertainty:      {np.sqrt(uncertainty_data['total']).mean():.4f}")
print("\nNote: Epistemic uncertainty is HIGH in extrapolation regions (|x| > 1.0)")
print("      Aleatoric uncertainty is CONSTANT (inherent data noise)")
print("="*70)

MC Dropout

MC (Monte Carlo) Dropout is a practical approximation to Bayesian inference: simply keep dropout active at test time and run multiple forward passes. Gal & Ghahramani (2016) showed that dropout at test time is mathematically equivalent to approximate variational inference in a deep Gaussian process. The mean of multiple stochastic forward passes estimates the predictive mean, and the variance estimates the predictive uncertainty. MC Dropout requires no architectural changes beyond standard dropout, making it the most accessible method for adding uncertainty estimation to any existing neural network. The trade-off is that the uncertainty estimates are less calibrated than full variational inference or ensemble methods.

class MCDropoutNN(nn.Module):
    """Network with MC Dropout for uncertainty."""
    
    def __init__(self, input_dim, hidden_dim, output_dim, dropout_p=0.3):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout_p)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Train MC Dropout model
mc_model = MCDropoutNN(1, 64, 1).to(device)
optimizer = torch.optim.Adam(mc_model.parameters(), lr=1e-2)

for epoch in range(1000):
    mc_model.train()
    y_pred = mc_model(X_train_t)
    loss = F.mse_loss(y_pred, y_train_t)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# MC predictions (keep dropout on)
mc_model.train()  # Keep dropout enabled
mc_predictions = []

with torch.no_grad():
    for _ in range(100):
        y_pred = mc_model(X_test_t)
        mc_predictions.append(y_pred.cpu().numpy())

mc_predictions = np.array(mc_predictions).squeeze()
mc_mean = mc_predictions.mean(axis=0)
mc_std = mc_predictions.std(axis=0)

# Plot
plt.figure(figsize=(12, 6))
plt.plot(X_test, f(X_test), 'k--', label='True', linewidth=2)
plt.plot(X_test, mc_mean, 'g-', label='MC Dropout mean', linewidth=2)
plt.fill_between(X_test.ravel(), mc_mean - 2*mc_std, mc_mean + 2*mc_std, 
                 alpha=0.3, color='g', label='±2σ')
plt.scatter(X_train, y_train, c='r', s=50, zorder=10, label='Data')
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('MC Dropout Uncertainty', fontsize=13)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Summary

Bayesian Neural Networks:

Key Ideas:

  1. Probability distribution over weights

  2. Predictive uncertainty via integration

  3. Variational inference for tractability

  4. ELBO = likelihood - KL divergence

Uncertainty Types:

  • Epistemic: Model uncertainty (reducible with data)

  • Aleatoric: Data noise (irreducible)

Methods:

Variational Inference:

  • Gaussian posterior q(w)

  • Reparameterization trick

  • KL to prior regularization

MC Dropout:

  • Dropout as Bayesian approximation

  • Enable dropout at test time

  • Multiple forward passes

Advantages:

  • Calibrated uncertainty

  • Active learning

  • Out-of-distribution detection

  • Safety-critical applications

Applications:

  • Medical diagnosis

  • Autonomous vehicles

  • Reinforcement learning (exploration)

  • Bayesian optimization

Variants:

  • Laplace approximation

  • Ensemble methods

  • Deep ensembles

  • SWAG (Stochastic Weight Averaging-Gaussian)

Challenges:

  • Computational cost

  • Hyperparameter tuning

  • Posterior approximation quality

Advanced Bayesian Neural Networks Theory

1. Introduction to Bayesian Deep Learning

1.1 Motivation: Uncertainty Quantification

Traditional neural networks provide point estimates θ̂ for parameters, leading to:

  • Overconfidence on out-of-distribution data

  • No uncertainty quantification (cannot distinguish "don't know" from "sure but wrong")

  • Poor calibration (predicted probabilities ≠ true probabilities)

Bayesian Neural Networks (BNNs) maintain distributions over weights p(ΞΈ|D), enabling:

  • Epistemic uncertainty (model uncertainty, reducible with more data)

  • Aleatoric uncertainty (data noise, irreducible)

  • Principled decision-making under uncertainty

  • Calibrated predictions with confidence intervals

1.2 Bayesian FrameworkΒΆ

Prior: p(ΞΈ) - belief before seeing data
Likelihood: p(D|θ) = ∏ᡒ p(yᡒ|xᡒ, θ)
Posterior: p(ΞΈ|D) = p(D|ΞΈ)p(ΞΈ) / p(D) via Bayes’ theorem

Prediction for new input x*:

p(y*|x*, D) = ∫ p(y*|x*, θ) p(θ|D) dθ

Challenge: Posterior p(ΞΈ|D) is intractable for neural networks (millions of parameters)

2. Bayesian Inference MethodsΒΆ

2.1 Variational Inference (VI)ΒΆ

Idea: Approximate intractable posterior p(ΞΈ|D) with tractable q_Ο†(ΞΈ)

Objective: Minimize KL divergence

KL(q_Ο†(ΞΈ) || p(ΞΈ|D)) = ∫ q_Ο†(ΞΈ) log[q_Ο†(ΞΈ) / p(ΞΈ|D)] dΞΈ

Equivalent to maximizing ELBO (Evidence Lower Bound):

ELBO(Ο†) = E_{q_Ο†(ΞΈ)}[log p(D|ΞΈ)] - KL(q_Ο†(ΞΈ) || p(ΞΈ))
         = Ξ£α΅’ E_{q_Ο†(ΞΈ)}[log p(yα΅’|xα΅’, ΞΈ)] - KL(q_Ο†(ΞΈ) || p(ΞΈ))

Bayes by Backprop (Blundell et al., 2015):

  • Parameterize q_Ο†(ΞΈ) = N(ΞΌ, σ²) (mean-field Gaussian)

  • Reparameterization trick: ΞΈ = ΞΌ + Οƒ βŠ™ Ξ΅, Ξ΅ ~ N(0, I)

  • Gradient: βˆ‡_Ο† ELBO = βˆ‡_Ο† E_Ξ΅[log p(D|ΞΈ(Ξ΅)) - log q_Ο†(ΞΈ(Ξ΅)) + log p(ΞΈ(Ξ΅))]

Advantages: Scalable, differentiable, GPU-friendly
Disadvantages: Mean-field assumption (independence), local optima

2.2 Monte Carlo DropoutΒΆ

Observation (Gal & Ghahramani, 2016): Dropout is approximate Bayesian inference!

Standard dropout:

y = Wβ‚‚ Β· dropout(ReLU(W₁x))

Interpretation: Each dropout mask ~ sample from posterior q(ΞΈ)

Inference:

  1. Train with dropout (rate p)

  2. At test time, keep dropout active

  3. Sample T predictions: ŷ₁, …, Ε·_T (different masks)

  4. Mean: E[y*|x*] β‰ˆ 1/T Ξ£β‚œ Ε·β‚œ

  5. Variance: Var[y*|x*] β‰ˆ 1/T Ξ£β‚œ (Ε·β‚œ - mean)Β²

Advantages: Zero training overhead, works with any architecture
Disadvantages: Limited expressiveness, heuristic connection to VI

2.3 Deep EnsemblesΒΆ

Idea: Train M independent models with different initializations

Prediction:

p(y*|x*, D) β‰ˆ 1/M Ξ£β‚˜ p(y*|x*, ΞΈβ‚˜)

Training:

  • Different random seeds

  • Different data subsets (bagging)

  • Adversarial training for diversity

Advantages: Simple, strong empirical performance, diverse hypotheses
Disadvantages: MΓ— training cost, not truly Bayesian (no prior)

2.4 Markov Chain Monte Carlo (MCMC)ΒΆ

Goal: Sample ΞΈ ~ p(ΞΈ|D) exactly (asymptotically)

Stochastic Gradient Langevin Dynamics (SGLD):

ΞΈβ‚œβ‚Šβ‚ = ΞΈβ‚œ + (Ξ·/2)βˆ‡log p(ΞΈβ‚œ|D) + N(0, Ξ·)
     = ΞΈβ‚œ + (Ξ·/2)[βˆ‡log p(D|ΞΈβ‚œ) + βˆ‡log p(ΞΈβ‚œ)] + N(0, Ξ·)
  • Gradient ascent + Langevin noise

  • Converges to true posterior as Ξ· β†’ 0

Hamiltonian Monte Carlo (HMC):

  • Introduce momentum variables

  • Leapfrog integrator for proposals

  • Higher acceptance rate than SGLD

Advantages: Asymptotically exact, no variational gap
Disadvantages: Slow convergence, high computational cost, tuning required
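The SGLD update above can be sketched on a toy one-dimensional target; the quadratic log-posterior (equivalent to an exact N(2, 1) posterior) and the fixed step size are illustrative assumptions, whereas a real BNN would plug in a minibatch gradient estimate:

```python
import statistics

import torch

# SGLD sketch (assumption: toy log-posterior of N(2, 1), so βˆ‡log p(ΞΈ|D) = -(ΞΈ - 2))
def grad_log_post(theta):
    return -(theta - 2.0)

torch.manual_seed(0)
theta = torch.tensor(0.0)
eta = 0.1                                  # step size (also the injected noise variance)
samples = []
for t in range(5000):
    noise = torch.randn(()) * eta ** 0.5   # Langevin noise ~ N(0, Ξ·)
    theta = theta + 0.5 * eta * grad_log_post(theta) + noise
    if t >= 1000:                          # discard burn-in iterations
        samples.append(theta.item())

print(statistics.mean(samples))            # posterior-mean estimate, near 2.0
```

With a constant step size the chain has a small bias, which is why SGLD is usually run with a decreasing schedule Ξ· β†’ 0.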

2.5 Laplace ApproximationΒΆ

Idea: Approximate posterior with Gaussian at MAP estimate

Procedure:

  1. Find MAP: ΞΈ_MAP = argmax p(ΞΈ|D)

  2. Compute Hessian: H = -βˆ‡Β²log p(ΞΈ|D)|_{ΞΈ_MAP}

  3. Approximate: p(ΞΈ|D) β‰ˆ N(ΞΈ_MAP, H⁻¹)

Challenges for NNs:

  • Hessian is huge (millions Γ— millions)

  • Expensive to compute and invert

Modern approaches:

  • KFAC (Kronecker-Factored Approximate Curvature): Block-diagonal Hessian

  • Diagonal Laplace: Only diagonal of H

  • Last-layer Laplace: Only linearize last layer (cheap!)

Advantages: Post-hoc (apply to pretrained models), principled
Disadvantages: Gaussian assumption, expensive for full network

3. Structured Variational InferenceΒΆ

3.1 Mean-Field vs. Structured ApproximationsΒΆ

Mean-field: q(θ) = ∏ᡒ q(θᡒ) (fully factorized)

  • Simple, scalable

  • Too restrictive: Ignores correlations

Matrix Variate Gaussian (Louizos & Welling, 2016):

q(W) = MN(M, U, V)  (W is the weight matrix)

  • Captures row/column correlations

  • Kronecker structure for efficiency

Normalizing Flows (Rezende & Mohamed, 2015):

ΞΈ = f_K(...fβ‚‚(f₁(Ξ΅))...)  where Ξ΅ ~ pβ‚€(Ξ΅)
Change of variables: q(ΞΈ) = pβ‚€(Ξ΅) |det J_f|⁻¹

  • Expressive posteriors via invertible transformations

  • Planar flows, RealNVP, MAF for BNNs

3.2 Weight Uncertainty in Neural NetworksΒΆ

Hierarchical priors for automatic relevance determination:

p(W) = ∏ᡒⱼ N(Wᡒⱼ|0, αᡒⱼ⁻¹)
p(α) = ∏ᡒⱼ Gamma(αᡒⱼ|a, b)

Sparse variational dropout (Molchanov et al., 2017):

  • Learn dropout rate per weight

  • Prune weights with high dropout (Ξ±α΅’β±Ό β†’ ∞)

  • Achieves sparsity + uncertainty

4. Scalable BNN TrainingΒΆ

4.1 Minibatch Scaling for ELBOΒΆ

Full ELBO:

L(Ο†) = Ξ£α΅’β‚Œβ‚α΄Ί E_{q_Ο†}[log p(yα΅’|xα΅’, ΞΈ)] - KL(q_Ο† || p)

Minibatch estimate:

LΜƒ(Ο†) = (N/B) Σᡒ∈batch E_{q_Ο†}[log p(yα΅’|xα΅’, ΞΈ)] - KL(q_Ο† || p)

  • KL term computed once per step (not rescaled by the batch)

  • Likelihood term scaled by N/B
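Assembling this estimate can be sketched as follows; the interface is hypothetical (a `model` exposing a summed `kl_divergence()` over its Bayesian layers, as the `BayesianMLP` implementation later in this notebook does) and a unit-variance Gaussian likelihood is assumed so the NLL reduces to a sum of squared errors:

```python
import torch
import torch.nn.functional as F

# Minibatch negative-ELBO sketch: likelihood term rescaled by N/B, KL added once.
def minibatch_negative_elbo(model, batch_x, batch_y, dataset_size):
    B = batch_x.size(0)
    output = model(batch_x)                              # one posterior sample per batch
    nll = F.mse_loss(output, batch_y, reduction='sum')   # -log p(batch|ΞΈ) up to constants
    return (dataset_size / B) * nll + model.kl_divergence()
```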

4.2 Local Reparameterization TrickΒΆ

Standard reparameterization: Sample full ΞΈ per minibatch

  • High variance gradients

Local reparameterization (Kingma et al., 2015):

  • Sample activations instead of weights

  • For linear layer: z = Wx where W ~ q(W)

    E[z] = E[W]x = ΞΌx
    Var[z] = x^T diag(σ²) x
    z ~ N(ΞΌx, Var[z])
    
  • Lower variance, faster
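The moment computations above can be sketched for a single mean-field Bayesian linear layer; `mu_w` and `sigma_w` here are stand-in variational parameters, not trained values:

```python
import torch
import torch.nn.functional as F

# Local reparameterization sketch: sample pre-activations z ~ N(E[z], Var[z])
# instead of sampling the full weight matrix W.
def local_reparam_linear(x, mu_w, sigma_w):
    mu_z = F.linear(x, mu_w)                    # E[z] = x ΞΌ_Wα΅€
    var_z = F.linear(x ** 2, sigma_w ** 2)      # Var[z_j] = Ξ£α΅’ xᡒ² σᡒⱼ²
    eps = torch.randn_like(mu_z)
    return mu_z + var_z.sqrt() * eps

x = torch.randn(4, 10)
mu_w = torch.randn(5, 10) * 0.1
sigma_w = torch.full((5, 10), 0.05)
z = local_reparam_linear(x, mu_w, sigma_w)
print(z.shape)   # torch.Size([4, 5])
```

Each example in the batch now gets its own independent noise, which is what lowers the gradient variance relative to sharing one sampled W across the batch.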

4.3 Natural Gradient VIΒΆ

Natural gradient: Adjust for parameter space curvature

Ο†β‚œβ‚Šβ‚ = Ο†β‚œ + Ξ· F⁻¹ βˆ‡_Ο† ELBO

where F is Fisher information matrix

Practical: Use Adam (approximates natural gradient)

5. Uncertainty DecompositionΒΆ

5.1 Epistemic vs. Aleatoric UncertaintyΒΆ

Epistemic (model uncertainty):

  • Reducible with more data

  • Captured by posterior variance p(ΞΈ|D)

  • Example: Insufficient data in region

Aleatoric (data noise):

  • Irreducible (inherent randomness)

  • Captured by likelihood variance p(y|x, ΞΈ)

  • Example: Sensor noise, label ambiguity

5.2 Heteroscedastic Aleatoric UncertaintyΒΆ

Homoscedastic: σ² is constant
Heteroscedastic: σ²(x) varies with input

Model:

NN: x β†’ (ΞΌ(x), σ²(x))
p(y|x, ΞΈ) = N(y|ΞΌ(x), σ²(x))

Loss (negative log-likelihood):

L = (y - ΞΌ(x))Β² / (2σ²(x)) + (1/2) log σ²(x)

  • First term: Precision-weighted MSE

  • Second term: Regularization (prevents Οƒ β†’ ∞)

Combined epistemic + aleatoric:

Var[y*] = E_ΞΈ[Var[y|x, ΞΈ]] + Var_ΞΈ[E[y|x, ΞΈ]]
        = E[σ²(x)] + Var[ΞΌ(x)]
        = aleatoric + epistemic
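The decomposition can be sketched with stand-in Monte Carlo samples (random tensors in place of real posterior draws of ΞΌ(x) and σ²(x) from a heteroscedastic BNN):

```python
import torch

# Variance decomposition sketch over T posterior samples per input.
torch.manual_seed(0)
T, batch = 50, 8
mus = torch.randn(T, batch)                  # stand-ins for ΞΌ_t(x) samples
sigma2s = torch.rand(T, batch)               # stand-ins for Οƒ_tΒ²(x) samples

aleatoric = sigma2s.mean(dim=0)              # E_ΞΈ[σ²(x)]
epistemic = mus.var(dim=0, unbiased=False)   # Var_ΞΈ[ΞΌ(x)]
total = aleatoric + epistemic                # Var[y*] per input

print(total.shape)                           # torch.Size([8])
```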

6. Priors for Neural NetworksΒΆ

6.1 Weight PriorsΒΆ

Gaussian prior: p(ΞΈ) = N(0, σ²_p I)

  • Corresponds to L2 regularization (MAP = ridge)

  • Induces smoothness

Laplace prior: p(ΞΈ) = Laplace(0, b)

  • Corresponds to L1 regularization (MAP = lasso)

  • Induces sparsity

Horseshoe prior:

p(wⱼ) = N(0, λⱼ²τ²)
p(λⱼ) = C⁺(0, 1)  (half-Cauchy)

  • Sparse but keeps important weights large

  • Better than Laplace for high-dimensional data
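A sampling sketch of this prior, with the global scale Ο„ fixed to 1 purely for illustration, shows the characteristic shape (sharp spike at zero plus heavy tails):

```python
import torch
from torch.distributions import HalfCauchy, Normal

# Horseshoe prior sampling sketch (Ο„ = 1 assumed for illustration).
torch.manual_seed(0)
tau = 1.0
lam = HalfCauchy(scale=torch.ones(10000)).sample()   # local scales λⱼ ~ C⁺(0, 1)
w = Normal(0.0, lam * tau).sample()                  # wβ±Ό ~ N(0, λⱼ²τ²)

# Many near-zero weights, a few very large ones.
print((w.abs() < 0.1).float().mean().item(), w.abs().max().item())
```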

6.2 Functional Priors (Neural Network Gaussian Processes)ΒΆ

Observation: Infinite-width NN β†’ Gaussian Process (Neal, 1996)

For single hidden layer with width H β†’ ∞:

f(x) = (b/√H) Ξ£β‚• vβ‚• Ο†(wβ‚•α΅€x)

where wβ‚• ~ N(0, I), vβ‚• ~ N(0, 1)

Limit: f(x) ~ GP(0, K(x, x’))
Kernel (NNGP):

K(x, x') = E_w[Ο†(wα΅€x) Ο†(wα΅€x')]

For ReLU: Kernel has closed form (Cho & Saul, 2009)

Deep GPs: Multi-layer limit (Lee et al., 2018)

  • Provides prior over functions

  • Useful for architecture search, initialization
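For ReLU, the expectation above has the arc-cosine closed form K(x, x') = (1/2Ο€) Β· ||x|| ||x'|| Β· (sin ΞΈ + (Ο€ βˆ’ ΞΈ) cos ΞΈ) with ΞΈ the angle between the inputs; a small sketch with a sanity check against E[relu(z)Β²] = ||x||Β²/2 for x = x':

```python
import math

import torch

# NNGP / arc-cosine (order-1) kernel sketch for ReLU, assuming w ~ N(0, I):
# K(x, x') = E_w[relu(wα΅€x) relu(wα΅€x')]
def relu_nngp_kernel(x, y):
    nx, ny = x.norm(), y.norm()
    cos_t = torch.clamp((x @ y) / (nx * ny), -1.0, 1.0)
    theta = torch.acos(cos_t)
    return (nx * ny / (2 * math.pi)) * (torch.sin(theta) + (math.pi - theta) * torch.cos(theta))

x = torch.tensor([1.0, 0.0])
print(relu_nngp_kernel(x, x).item())   # 0.5 = ||x||Β² / 2
```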

7. BNN ApplicationsΒΆ

7.1 Active LearningΒΆ

Goal: Select most informative points to label

BALD (Bayesian Active Learning by Disagreement):

I(y; ΞΈ|x, D) = H[y|x, D] - E_ΞΈ[H[y|x, ΞΈ]]
             = H[E_ΞΈ[p(y|x, ΞΈ)]] - E_ΞΈ[H[p(y|x, ΞΈ)]]

  • High when models disagree (epistemic uncertainty)

  • Query points with maximum mutual information
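The two-term form above translates directly to MC samples; a sketch with a hypothetical `probs` tensor of shape [T, batch, classes] standing in for dropout or posterior samples:

```python
import torch

# BALD sketch: entropy of the mean prediction minus mean entropy of predictions.
def bald_score(probs, eps=1e-9):
    mean_p = probs.mean(dim=0)                                   # E_ΞΈ[p(y|x, ΞΈ)]
    H_mean = -(mean_p * (mean_p + eps).log()).sum(dim=-1)        # H[E_ΞΈ[p]]
    mean_H = -(probs * (probs + eps).log()).sum(dim=-1).mean(0)  # E_ΞΈ[H[p]]
    return H_mean - mean_H                                       # β‰₯ 0, high on disagreement

agree = torch.tensor([[[0.9, 0.1]], [[0.9, 0.1]]])      # models agree β†’ score β‰ˆ 0
disagree = torch.tensor([[[0.9, 0.1]], [[0.1, 0.9]]])   # models disagree β†’ score β‰ˆ 0.368
print(bald_score(agree).item(), bald_score(disagree).item())
```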

7.2 Continual LearningΒΆ

Catastrophic forgetting: New tasks overwrite old knowledge

Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017):

L_new = L_task(ΞΈ) + (Ξ»/2) Ξ£α΅’ Fα΅’(ΞΈα΅’ - ΞΈα΅’*)Β²

  • Fα΅’: Fisher information (importance of weight i for old task)

  • Prevents large changes to important weights
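The penalty can be sketched as a loop over named parameters; the diagonal Fisher and the stored old-task parameters are assumed precomputed (identity Fisher used here as a stand-in):

```python
import torch
import torch.nn as nn

# EWC penalty sketch: quadratic pull toward the old-task parameters ΞΈ*,
# weighted per-parameter by the (diagonal) Fisher information.
def ewc_penalty(model, fisher, old_params, lam=1.0):
    loss = torch.tensor(0.0)
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * loss

model = nn.Linear(3, 1)
old = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}
print(ewc_penalty(model, fisher, old).item())   # 0.0 before any parameter drift
```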

Variational Continual Learning (Nguyen et al., 2018):

  • Posterior of task t becomes prior for task t+1

  • Maintains memory of all tasks

7.3 Out-of-Distribution DetectionΒΆ

Predictive entropy:

H[y|x, D] = -Ξ£_c p(y=c|x, D) log p(y=c|x, D)

  • High entropy β†’ uncertain β†’ likely OOD

Predictive variance: Var[y|x, D]

  • High variance β†’ epistemic uncertainty β†’ OOD

Threshold: Reject if H or Var > threshold
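The entropy-threshold rule can be sketched directly; the threshold value and the two example probability vectors are arbitrary stand-ins:

```python
import torch

# OOD-detection sketch: flag inputs whose predictive entropy exceeds a threshold.
def predictive_entropy(mean_probs, eps=1e-9):
    return -(mean_probs * (mean_probs + eps).log()).sum(dim=-1)

confident = torch.tensor([[0.98, 0.01, 0.01]])   # in-distribution-looking prediction
uncertain = torch.tensor([[0.34, 0.33, 0.33]])   # near-uniform, OOD-looking prediction
threshold = 0.5

print(predictive_entropy(confident) > threshold)  # tensor([False])
print(predictive_entropy(uncertain) > threshold)  # tensor([True])
```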

7.4 Safety-Critical ApplicationsΒΆ

  • Medical diagnosis: Uncertainty for β€œrefer to specialist”

  • Autonomous driving: Detect novel scenarios

  • Reinforcement learning: Risk-aware exploration

8. Computational ComplexityΒΆ

8.1 Training CostΒΆ

Method             | Forward Pass | Backward Pass | Memory
-------------------|--------------|---------------|-------
Standard NN        | O(W)         | O(W)          | O(W)
Bayes by Backprop  | O(SW)        | O(SW)         | O(2W)
MC Dropout         | O(W)         | O(W)          | O(W)
Deep Ensembles (M) | O(MW)        | O(MW)         | O(MW)
SGLD               | O(W)         | O(W)          | O(TW)

  • W: Number of weights

  • S: Samples per batch (typically 1-3)

  • M: Ensemble size (typically 5-10)

  • T: MCMC samples

8.2 Inference CostΒΆ

Single prediction:

  • Standard: 1 forward pass

  • BNN: T forward passes (sample θ₁, …, ΞΈ_T)

Typical T: 10-100 for uncertainty, 1000+ for calibration

9. Evaluation MetricsΒΆ

9.1 CalibrationΒΆ

Perfect calibration: Predicted probability = actual frequency

Expected Calibration Error (ECE):

ECE = Ξ£β‚˜ (|Bβ‚˜|/N) |acc(Bβ‚˜) - conf(Bβ‚˜)|

  • Partition predictions into M bins by confidence

  • Compare accuracy vs. confidence per bin

Reliability diagram: Plot accuracy vs. confidence

  • Ideal: Diagonal line

9.2 Negative Log-Likelihood (NLL)ΒΆ

NLL = -(1/N) Ξ£α΅’ log p(yα΅’|xα΅’, D)

  • Measures quality of predicted distribution

  • Lower is better

9.3 Brier ScoreΒΆ

BS = (1/N) Ξ£α΅’ Ξ£_c (p(y=c|xα΅’) - πŸ™[yα΅’=c])Β²

  • Squared error of predicted probabilities

  • Lower is better

10. Recent Advances (2017-2024)ΒΆ

10.1 Function-Space InferenceΒΆ

Neural Tangent Kernel (NTK) (Jacot et al., 2018):

  • Infinite-width limit at initialization

  • Kernel remains constant during training

  • Exact GP inference in function space

Limitations: Requires infinite width, doesn’t capture feature learning

10.2 Stochastic Weight Averaging Gaussian (SWAG)ΒΆ

Idea (Maddox et al., 2019): Approximate posterior from SGD trajectory

  1. Run SGD for T iterations

  2. Collect θ₁, …, ΞΈ_T in later epochs

  3. Fit Gaussian: ΞΌ = mean(ΞΈ), Ξ£ = cov(ΞΈ)

Advantages: Post-hoc, uses existing training, cheap
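The three steps can be sketched with a diagonal covariance; the "trajectory" here is a fake array of flattened parameters standing in for real SGD snapshots:

```python
import torch

# Diagonal-SWAG sketch: fit N(ΞΌ, diag(σ²)) to a stream of SGD iterates.
torch.manual_seed(0)
trajectory = 2.0 + 0.1 * torch.randn(30, 5)   # 30 snapshots of 5 parameters

mu = trajectory.mean(dim=0)                    # SWA mean
second = (trajectory ** 2).mean(dim=0)         # running second moment
var = (second - mu ** 2).clamp_min(1e-12)      # diagonal variance estimate

# Draw one weight sample from the fitted Gaussian posterior approximation.
sample = mu + var.sqrt() * torch.randn(5)
print(mu.shape, sample.shape)
```

The full SWAG method also keeps a low-rank deviation matrix on top of this diagonal; the sketch keeps only the diagonal part.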

10.3 Neural ProcessesΒΆ

Meta-learning BNNs:

  • Learn prior p(ΞΈ) from related tasks

  • Fast adaptation to new tasks

  • Combines benefits of GPs and NNs

10.4 Predictive Uncertainty with Deep Kernel LearningΒΆ

Combine NN feature extractor with GP:

k(x, x') = k_GP(Ο†_NN(x), Ο†_NN(x'))

  • NN learns features

  • GP provides calibrated uncertainty

10.5 Generalized Variational InferenceΒΆ

RΓ©nyi divergence instead of KL:

D_α(q||p) = (1/(α-1)) log ∫ q^α p^{1-α}
  • Ξ± = 1: Recovers KL

  • Ξ± > 1: More robust, mode-seeking

  • Ξ± < 1: Mass-covering

11. Practical GuidelinesΒΆ

11.1 Method SelectionΒΆ

Use MC Dropout if:

  • Have pretrained model

  • Need quick uncertainty estimate

  • Limited compute budget

Use Bayes by Backprop if:

  • Training from scratch

  • Want principled Bayesian inference

  • Have moderate data

Use Deep Ensembles if:

  • Need best empirical performance

  • Can afford MΓ— training

  • Want diversity

Use Last-Layer Laplace if:

  • Have pretrained model

  • Want post-hoc uncertainty

  • Need calibrated predictions

11.2 Hyperparameter TuningΒΆ

Prior variance σ²_p: Controls regularization

  • Too small: Underfitting

  • Too large: Overfitting

  • Tune via validation NLL

Posterior learning rate: Typically lower than standard

  • Adam with lr = 1e-4 to 1e-3

Number of samples T: Trade-off accuracy vs. speed

  • Training: 1-3 samples

  • Evaluation: 10-100 samples

12. Limitations and Open ProblemsΒΆ

12.1 Current ChallengesΒΆ

  1. Computational cost: TΓ— slower inference

  2. Scalability: Difficult for huge models (GPT-3)

  3. Variance underestimation: mean-field VI tends to underestimate posterior uncertainty

  4. Hyperprior selection: Sensitive to prior choice

  5. Calibration: Not guaranteed even for BNNs

12.2 Open Research QuestionsΒΆ

  • Scalable exact inference: MCMC for billions of parameters

  • Better posteriors: Beyond mean-field, tractable structured VI

  • Functional priors: Specify p(f) instead of p(ΞΈ)

  • Multi-modal posteriors: Capture symmetries, local optima

  • Uncertainty in generative models: BNNs for GANs, diffusion

14. Software LibrariesΒΆ

14.1 Python LibrariesΒΆ

  • Pyro: Probabilistic programming (PyTorch backend)

  • TensorFlow Probability: Bayesian layers, VI, MCMC

  • Blitz: Bayes by Backprop in PyTorch

  • Laplace: Last-layer Laplace approximation

  • GPyTorch: Scalable GPs for DNN features

14.2 Example: Bayes by Backprop LayerΒΆ

class BayesianLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.mu_w = nn.Parameter(torch.randn(out_features, in_features))
        self.rho_w = nn.Parameter(torch.randn(out_features, in_features))
        self.mu_b = nn.Parameter(torch.randn(out_features))
        self.rho_b = nn.Parameter(torch.randn(out_features))
    
    def forward(self, x):
        # Οƒ = softplus(ρ) keeps the standard deviation positive
        sigma_w = torch.log1p(torch.exp(self.rho_w))
        sigma_b = torch.log1p(torch.exp(self.rho_b))
        
        # Reparameterization trick: ΞΈ = ΞΌ + Οƒ βŠ™ Ξ΅, Ξ΅ ~ N(0, I)
        w = self.mu_w + sigma_w * torch.randn_like(sigma_w)
        b = self.mu_b + sigma_b * torch.randn_like(sigma_b)
        
        return F.linear(x, w, b)

15. Benchmarks and ResultsΒΆ

15.1 Classification TasksΒΆ

CIFAR-10 (Accuracy / NLL / ECE):

  • Standard ResNet: 95.5% / 0.18 / 0.05

  • MC Dropout (p=0.1): 95.2% / 0.16 / 0.04

  • Deep Ensemble (M=5): 96.1% / 0.14 / 0.02

  • SWAG: 95.8% / 0.15 / 0.03

ImageNet:

  • Standard: 76.1% / 1.02

  • Ensemble: 77.3% / 0.91

  • Temp scaling: 76.1% / 0.89

15.2 Regression TasksΒΆ

UCI datasets (Avg. RMSE / Avg. NLL):

  • Standard MLP: 0.52 / 1.21

  • MC Dropout: 0.49 / 0.98

  • Variational: 0.47 / 0.91

  • Ensemble: 0.46 / 0.88

15.3 Active LearningΒΆ

MNIST (Accuracy with 100 labels):

  • Random: 85%

  • Entropy: 88%

  • BALD (BNN): 92%

16. Key TakeawaysΒΆ

  1. BNNs provide uncertainty: Epistemic + aleatoric via posterior

  2. Trade-offs: Computational cost vs. uncertainty quality

  3. Practical methods: MC Dropout, ensembles, last-layer Laplace

  4. Calibration is crucial: Measure with ECE, NLL, reliability plots

  5. Applications: Active learning, OOD detection, safety-critical systems

  6. Open problems: Scalability, multi-modal posteriors, functional priors

When to use BNNs:

  • Safety-critical applications (medical, autonomous)

  • Small data regimes (active learning)

  • Need confidence intervals

  • Out-of-distribution detection

When NOT to use:

  • Computational budget limited

  • Only accuracy matters (not uncertainty)

  • Data is abundant and clean

17. ReferencesΒΆ

Foundational:

  • Neal (1996): Bayesian Learning for Neural Networks

  • MacKay (1992): Practical Bayesian Framework for Backprop

Variational Inference:

  • Blundell et al. (2015): Weight Uncertainty in Neural Networks

  • Kingma et al. (2015): Variational Dropout and Local Reparameterization

  • Louizos & Welling (2016): Structured and Efficient Variational Inference

MC Dropout:

  • Gal & Ghahramani (2016): Dropout as a Bayesian Approximation

Ensembles:

  • Lakshminarayanan et al. (2017): Simple and Scalable Predictive Uncertainty

Laplace:

  • Ritter et al. (2018): Scalable Laplace Approximation

  • Daxberger et al. (2021): Laplace Redux

Recent:

  • Maddox et al. (2019): SWAG

  • Wilson & Izmailov (2020): Bayesian Deep Learning and a Probabilistic Perspective

  • Fortuin (2022): Priors in Bayesian Deep Learning (survey)

18. Connection to Other TopicsΒΆ

Gaussian Processes: BNNs at infinite width
Meta-Learning: Neural Processes, learned priors
Continual Learning: EWC uses BNN principles
Generative Models: VAEs = BNNs for latent variables
Reinforcement Learning: Uncertainty-aware exploration

"""
Complete Bayesian Neural Networks Implementations
==================================================
Includes: Bayes by Backprop, MC Dropout, Deep Ensembles, Laplace approximation,
uncertainty quantification, calibration metrics.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Normal, kl_divergence
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# ============================================================================
# 1. Bayes by Backprop
# ============================================================================

class BayesianLinear(nn.Module):
    """
    Bayesian linear layer with weight uncertainty.
    
    Implements variational inference with mean-field Gaussian posterior.
    
    Args:
        in_features: Input dimension
        out_features: Output dimension
        prior_sigma: Prior standard deviation (regularization strength)
    """
    def __init__(self, in_features, out_features, prior_sigma=1.0):
        super(BayesianLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.prior_sigma = prior_sigma
        
        # Variational parameters: q(W) = N(ΞΌ_w, Οƒ_wΒ²)
        self.mu_w = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.rho_w = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        
        # Bias parameters
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.rho_b = nn.Parameter(torch.randn(out_features) * 0.01)
        
        # Register KL divergence
        self.kl_divergence = 0
    
    @property
    def sigma_w(self):
        """Compute Οƒ from ρ using softplus: Οƒ = log(1 + exp(ρ))"""
        return torch.log1p(torch.exp(self.rho_w))
    
    @property
    def sigma_b(self):
        return torch.log1p(torch.exp(self.rho_b))
    
    def forward(self, x):
        """
        Forward pass with reparameterization trick.
        
        Sample W ~ q(W) = N(ΞΌ, σ²) via W = ΞΌ + Οƒ βŠ™ Ξ΅, Ξ΅ ~ N(0, I)
        """
        # Sample weights
        epsilon_w = torch.randn_like(self.sigma_w)
        w = self.mu_w + self.sigma_w * epsilon_w
        
        # Sample bias
        epsilon_b = torch.randn_like(self.sigma_b)
        b = self.mu_b + self.sigma_b * epsilon_b
        
        # Compute KL divergence: KL(q(W) || p(W))
        # For q = N(ΞΌ, σ²), p = N(0, Οƒ_pΒ²):
        # KL = 0.5 * [σ²/Οƒ_pΒ² + ΞΌΒ²/Οƒ_pΒ² - 1 - log(σ²/Οƒ_pΒ²)]
        prior_var = self.prior_sigma ** 2
        
        kl_w = 0.5 * (
            (self.sigma_w ** 2 + self.mu_w ** 2) / prior_var
            - 1 
            - torch.log(self.sigma_w ** 2 / prior_var)
        ).sum()
        
        kl_b = 0.5 * (
            (self.sigma_b ** 2 + self.mu_b ** 2) / prior_var
            - 1
            - torch.log(self.sigma_b ** 2 / prior_var)
        ).sum()
        
        self.kl_divergence = kl_w + kl_b
        
        return F.linear(x, w, b)


class BayesianMLP(nn.Module):
    """
    Bayesian MLP with variational inference.
    
    Args:
        input_dim: Input dimension
        hidden_dims: List of hidden dimensions
        output_dim: Output dimension
        prior_sigma: Prior standard deviation
    """
    def __init__(self, input_dim, hidden_dims, output_dim, prior_sigma=1.0):
        super(BayesianMLP, self).__init__()
        
        layers = []
        dims = [input_dim] + hidden_dims + [output_dim]
        
        for i in range(len(dims) - 1):
            layers.append(BayesianLinear(dims[i], dims[i+1], prior_sigma))
            if i < len(dims) - 2:  # No activation after last layer
                layers.append(nn.ReLU())
        
        self.layers = nn.ModuleList(layers)
    
    def forward(self, x):
        """Forward pass."""
        for layer in self.layers:
            x = layer(x)
        return x
    
    def kl_divergence(self):
        """Sum KL divergence from all Bayesian layers."""
        kl = 0
        for layer in self.layers:
            if isinstance(layer, BayesianLinear):
                kl += layer.kl_divergence
        return kl
    
    def predict_with_uncertainty(self, x, num_samples=100):
        """
        Predict with uncertainty quantification.
        
        Args:
            x: Input [batch, input_dim]
            num_samples: Number of posterior samples
        
        Returns:
            mean: Predictive mean [batch, output_dim]
            std: Predictive std (epistemic uncertainty) [batch, output_dim]
        """
        self.eval()
        predictions = []
        
        with torch.no_grad():
            for _ in range(num_samples):
                pred = self.forward(x)
                predictions.append(pred)
        
        predictions = torch.stack(predictions)  # [num_samples, batch, output_dim]
        
        mean = predictions.mean(dim=0)
        std = predictions.std(dim=0)
        
        return mean, std


# ============================================================================
# 2. MC Dropout
# ============================================================================

class MCDropoutMLP(nn.Module):
    """
    MLP with Monte Carlo Dropout for uncertainty.
    
    Args:
        input_dim: Input dimension
        hidden_dims: List of hidden dimensions
        output_dim: Output dimension
        dropout_rate: Dropout probability
    """
    def __init__(self, input_dim, hidden_dims, output_dim, dropout_rate=0.1):
        super(MCDropoutMLP, self).__init__()
        
        layers = []
        dims = [input_dim] + hidden_dims + [output_dim]
        
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i+1]))
            if i < len(dims) - 2:
                layers.append(nn.ReLU())
                layers.append(nn.Dropout(p=dropout_rate))
        
        self.network = nn.Sequential(*layers)
        self.dropout_rate = dropout_rate
    
    def forward(self, x):
        """Forward pass."""
        return self.network(x)
    
    def predict_with_uncertainty(self, x, num_samples=100):
        """
        MC Dropout inference.
        
        Keep dropout active at test time to sample from approximate posterior.
        """
        self.train()  # Enable dropout!
        predictions = []
        
        with torch.no_grad():
            for _ in range(num_samples):
                pred = self.forward(x)
                predictions.append(pred)
        
        predictions = torch.stack(predictions)
        
        mean = predictions.mean(dim=0)
        std = predictions.std(dim=0)
        
        return mean, std


# ============================================================================
# 3. Deep Ensembles
# ============================================================================

class Ensemble(nn.Module):
    """
    Deep ensemble for uncertainty quantification.
    
    Args:
        model_class: Class of base model
        num_models: Number of ensemble members
        **model_kwargs: Arguments for model constructor
    """
    def __init__(self, model_class, num_models, **model_kwargs):
        super(Ensemble, self).__init__()
        
        self.models = nn.ModuleList([
            model_class(**model_kwargs) for _ in range(num_models)
        ])
        self.num_models = num_models
    
    def forward(self, x, model_idx=None):
        """
        Forward pass.
        
        Args:
            x: Input
            model_idx: If specified, use single model. Else, average all.
        """
        if model_idx is not None:
            return self.models[model_idx](x)
        else:
            outputs = [model(x) for model in self.models]
            return torch.stack(outputs).mean(dim=0)
    
    def predict_with_uncertainty(self, x):
        """
        Ensemble prediction with uncertainty.
        
        Returns:
            mean: Average prediction
            std: Ensemble disagreement (uncertainty)
        """
        self.eval()
        predictions = []
        
        with torch.no_grad():
            for model in self.models:
                pred = model(x)
                predictions.append(pred)
        
        predictions = torch.stack(predictions)
        
        mean = predictions.mean(dim=0)
        std = predictions.std(dim=0)
        
        return mean, std
    
    def train_ensemble(self, train_loader, optimizer_class, num_epochs, device='cpu'):
        """
        Train all ensemble members independently.
        
        Different initializations + stochasticity β†’ diverse models
        """
        optimizers = [optimizer_class(model.parameters()) for model in self.models]
        
        for epoch in range(num_epochs):
            for model_idx, (model, optimizer) in enumerate(zip(self.models, optimizers)):
                model.train()
                
                for batch_x, batch_y in train_loader:
                    batch_x, batch_y = batch_x.to(device), batch_y.to(device)
                    
                    optimizer.zero_grad()
                    output = model(batch_x)
                    loss = F.mse_loss(output, batch_y)
                    loss.backward()
                    optimizer.step()
            
            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch+1}/{num_epochs}, Model losses: "
                      f"{[F.mse_loss(m(batch_x), batch_y).item() for m in self.models]}")


# ============================================================================
# 4. Heteroscedastic Aleatoric Uncertainty
# ============================================================================

class HeteroscedasticMLP(nn.Module):
    """
    MLP that predicts mean AND variance (aleatoric uncertainty).
    
    Output: (ΞΌ(x), σ²(x))
    Loss: -log N(y|ΞΌ(x), σ²(x)) = (y - ΞΌ)Β²/(2σ²) + log Οƒ
    
    Args:
        input_dim: Input dimension
        hidden_dims: Hidden dimensions
        output_dim: Output dimension
    """
    def __init__(self, input_dim, hidden_dims, output_dim):
        super(HeteroscedasticMLP, self).__init__()
        
        # Shared trunk
        trunk_layers = []
        dims = [input_dim] + hidden_dims
        for i in range(len(dims) - 1):
            trunk_layers.append(nn.Linear(dims[i], dims[i+1]))
            trunk_layers.append(nn.ReLU())
        self.trunk = nn.Sequential(*trunk_layers)
        
        # Heads: mean and log_variance
        self.mean_head = nn.Linear(hidden_dims[-1], output_dim)
        self.log_var_head = nn.Linear(hidden_dims[-1], output_dim)
    
    def forward(self, x):
        """
        Returns:
            mean: ΞΌ(x)
            log_var: log σ²(x) (for numerical stability)
        """
        features = self.trunk(x)
        mean = self.mean_head(features)
        log_var = self.log_var_head(features)
        return mean, log_var
    
    def loss(self, x, y):
        """
        Negative log-likelihood loss.
        
        -log N(y|ΞΌ, σ²) = 0.5 * [(y - ΞΌ)Β²/σ² + log σ² + log(2Ο€)]
        """
        mean, log_var = self.forward(x)
        variance = torch.exp(log_var)
        
        # NLL (drop constant term)
        loss = 0.5 * ((y - mean) ** 2 / variance + log_var)
        return loss.mean()
    
    def predict_with_uncertainty(self, x):
        """
        Predict with aleatoric uncertainty.
        
        Returns:
            mean: ΞΌ(x)
            std: Οƒ(x) (aleatoric uncertainty from data noise)
        """
        self.eval()
        with torch.no_grad():
            mean, log_var = self.forward(x)
            std = torch.exp(0.5 * log_var)
        return mean, std


# ============================================================================
# 5. Last-Layer Laplace Approximation
# ============================================================================

class LastLayerLaplace:
    """
    Laplace approximation for last layer only.
    
    Fast post-hoc uncertainty for pretrained models.
    
    Args:
        model: Pretrained neural network
        prior_precision: Prior precision (1/Οƒ_pΒ²)
    """
    def __init__(self, model, prior_precision=1.0):
        self.model = model
        self.prior_precision = prior_precision
        
        # Extract last layer
        self.last_layer = None
        for module in model.modules():
            if isinstance(module, nn.Linear):
                self.last_layer = module
        
        assert self.last_layer is not None, "No linear layer found"
        
        # Hessian approximation
        self.H_inv = None
    
    def fit(self, train_loader, device='cpu'):
        """
        Compute Hessian approximation using Fisher information.
        
        H = Ξ£α΅’ βˆ‡log p(yα΅’|xα΅’) βˆ‡log p(yα΅’|xα΅’)α΅€ + Ξ»I
        
        For regression: H β‰ˆ Xα΅€ X + Ξ»I (Gauss-Newton)
        """
        self.model.eval()
        
        # Collect features and targets
        features_list = []
        
        with torch.no_grad():
            for batch_x, batch_y in train_loader:
                batch_x = batch_x.to(device)
                
                # Forward through leaf modules up to (not including) the last layer;
                # iterating over modules() directly would re-apply container modules
                for module in self.model.modules():
                    if module is self.last_layer:
                        break
                    if len(list(module.children())) == 0:  # skip containers
                        batch_x = module(batch_x)
                
                features_list.append(batch_x)
        
        features = torch.cat(features_list, dim=0)  # [N, hidden_dim]
        
        # Compute H = Xα΅€X + Ξ»I (Gauss-Newton with unit observation noise)
        H = features.T @ features + self.prior_precision * torch.eye(features.size(1))
        
        # Invert (use Cholesky for stability)
        try:
            L = torch.linalg.cholesky(H)
            self.H_inv = torch.cholesky_inverse(L)
        except RuntimeError:
            # Fall back to a direct inverse if Cholesky fails
            self.H_inv = torch.inverse(H)
        
        print(f"Fitted Laplace approximation. Hessian shape: {H.shape}")
    
    def predict_with_uncertainty(self, x, num_samples=100):
        """
        Predictive distribution via linearization.
        
        p(f*|x*, D) β‰ˆ N(f_MAP(x*), βˆ‡f_MAP(x*)α΅€ H⁻¹ βˆ‡f_MAP(x*))
        """
        self.model.eval()
        
        with torch.no_grad():
            # Extract features using leaf modules up to (not including) the last layer
            features = x
            for module in self.model.modules():
                if module is self.last_layer:
                    break
                if len(list(module.children())) == 0:  # skip containers
                    features = module(features)
            
            # MAP prediction
            mean = self.last_layer(features)
            
            # Predictive variance
            J = features  # Jacobian (for linear layer)
            variance = (J @ self.H_inv @ J.T).diagonal().unsqueeze(1)
            std = torch.sqrt(variance)
        
        return mean, std


# ============================================================================
# 6. Calibration Metrics
# ============================================================================

def expected_calibration_error(y_true, y_pred, y_conf, num_bins=10):
    """
    Compute Expected Calibration Error (ECE).
    
    Args:
        y_true: True labels [N]
        y_pred: Predicted labels [N]
        y_conf: Predicted confidence [N]
        num_bins: Number of bins
    
    Returns:
        ece: Expected calibration error
    """
    bin_boundaries = np.linspace(0, 1, num_bins + 1)
    ece = 0.0
    
    for i in range(num_bins):
        # Samples in bin
        mask = (y_conf >= bin_boundaries[i]) & (y_conf < bin_boundaries[i + 1])
        
        if mask.sum() > 0:
            bin_acc = (y_true[mask] == y_pred[mask]).float().mean()
            bin_conf = y_conf[mask].mean()
            bin_size = mask.sum().float()
            
            ece += (bin_size / len(y_true)) * torch.abs(bin_acc - bin_conf)
    
    return ece.item()
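A quick standalone check of the binning logic (the helper and toy tensors below are illustrative, with the last bin closed so conf == 1.0 is counted):

```python
import torch

def ece_sketch(y_true, y_pred, y_conf, num_bins=5):
    # Weighted |accuracy - confidence| gap, averaged over confidence bins
    edges = torch.linspace(0, 1, num_bins + 1)
    ece = torch.zeros(())
    for i in range(num_bins):
        hi = edges[i + 1] if i < num_bins - 1 else 1.0 + 1e-8
        mask = (y_conf >= edges[i]) & (y_conf < hi)
        if mask.sum() > 0:
            acc = (y_true[mask] == y_pred[mask]).float().mean()
            conf = y_conf[mask].mean()
            ece += mask.float().mean() * (acc - conf).abs()
    return ece.item()

# An overconfident classifier: 90% confidence but 50% accuracy -> ECE = 0.4
y_true = torch.tensor([1, 1, 0, 0])
y_pred = torch.tensor([1, 1, 1, 1])
y_conf = torch.tensor([0.9, 0.9, 0.9, 0.9])
print(round(ece_sketch(y_true, y_pred, y_conf), 2))  # 0.4
```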


def plot_reliability_diagram(y_true, y_pred, y_conf, num_bins=10):
    """
    Plot reliability diagram (calibration curve).
    
    Ideal: Points lie on diagonal (confidence = accuracy)
    """
    bin_boundaries = np.linspace(0, 1, num_bins + 1)
    bin_accs = []
    bin_confs = []
    
    for i in range(num_bins):
        upper = bin_boundaries[i + 1] if i < num_bins - 1 else 1.0 + 1e-8
        mask = (y_conf >= bin_boundaries[i]) & (y_conf < upper)
        
        if mask.sum() > 0:
            bin_acc = (y_true[mask] == y_pred[mask]).float().mean().item()
            bin_conf = y_conf[mask].mean().item()
            bin_accs.append(bin_acc)
            bin_confs.append(bin_conf)
        # Empty bins are skipped rather than plotted as zero accuracy
    
    plt.figure(figsize=(6, 6))
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
    plt.plot(bin_confs, bin_accs, 'o-', label='Model')
    plt.xlabel('Confidence')
    plt.ylabel('Accuracy')
    plt.title('Reliability Diagram')
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()


# ============================================================================
# 7. Demonstrations
# ============================================================================

def demo_bayes_by_backprop():
    """Demonstrate Bayes by Backprop."""
    print("="*70)
    print("Bayes by Backprop Demo")
    print("="*70)
    
    # Create Bayesian MLP
    model = BayesianMLP(input_dim=10, hidden_dims=[20, 20], output_dim=1, prior_sigma=1.0)
    
    # Sample data
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    
    # ELBO loss
    output = model(x)
    likelihood = F.mse_loss(output, y, reduction='sum')
    kl = model.kl_divergence()
    
    # Loss = NLL + KL (the negative ELBO; minimizing it maximizes the ELBO)
    loss = likelihood + kl / len(x)  # scale KL per data point (a common heuristic)
    
    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Likelihood term: {likelihood.item():.4f}")
    print(f"KL divergence: {kl.item():.4f}")
    print(f"Negative ELBO (loss): {loss.item():.4f}")
    print()
    
    # Uncertainty quantification
    x_test = torch.randn(5, 10)
    mean, std = model.predict_with_uncertainty(x_test, num_samples=100)
    
    print("Predictions with uncertainty:")
    for i in range(5):
        print(f"  Sample {i+1}: {mean[i, 0].item():.4f} Β± {std[i, 0].item():.4f}")
    print()
    
    # Count parameters
    params_mu = sum(p.numel() for n, p in model.named_parameters() if 'mu' in n)
    params_rho = sum(p.numel() for n, p in model.named_parameters() if 'rho' in n)
    print(f"Parameters:")
    print(f"  Mean (ΞΌ): {params_mu:,}")
    print(f"  Variance (ρ): {params_rho:,}")
    print(f"  Total: {params_mu + params_rho:,}")
    print()
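The ΞΌ/ρ split counted above comes from the reparameterization trick. A minimal sketch for a single variational weight vector, assuming the common Οƒ = softplus(ρ) parameterization (the exact parameterization inside `BayesianMLP` may differ):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# One variational weight vector: q(w) = N(mu, sigma^2), sigma = softplus(rho)
mu = torch.zeros(3, requires_grad=True)
rho = torch.full((3,), -3.0, requires_grad=True)

sigma = F.softplus(rho)            # softplus keeps sigma positive
eps = torch.randn_like(mu)
w = mu + sigma * eps               # reparameterized sample; gradients reach mu, rho

# Closed-form KL(q || N(0, I)), summed over weights
kl = (0.5 * (sigma ** 2 + mu ** 2 - 1.0) - torch.log(sigma)).sum()
(kl + w.sum()).backward()          # w.sum() stands in for a likelihood term
assert mu.grad is not None and rho.grad is not None
```

Because `eps` carries all the randomness, the sample `w` is a differentiable function of ΞΌ and ρ, which is what lets the ELBO be optimized by ordinary backprop.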


def demo_mc_dropout():
    """Demonstrate MC Dropout."""
    print("="*70)
    print("MC Dropout Demo")
    print("="*70)
    
    # Create MC Dropout model
    model = MCDropoutMLP(input_dim=10, hidden_dims=[20, 20], output_dim=1, dropout_rate=0.1)
    
    x_test = torch.randn(5, 10)
    
    # Standard prediction (dropout off)
    model.eval()
    with torch.no_grad():
        std_pred = model(x_test)
    
    # MC Dropout prediction (dropout on)
    mean, std = model.predict_with_uncertainty(x_test, num_samples=100)
    
    print("Standard vs. MC Dropout predictions:")
    for i in range(5):
        print(f"  Sample {i+1}: Standard={std_pred[i, 0].item():.4f}, "
              f"MC={mean[i, 0].item():.4f} Β± {std[i, 0].item():.4f}")
    print()
    
    print("Key insight: MC Dropout provides uncertainty at zero training cost")
    print("Interpretation: Each dropout mask ~ sample from approximate posterior")
    print()
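The whole trick fits in a few lines. A standalone sketch (independent of `MCDropoutMLP`; the architecture below is made up): keep dropout stochastic at prediction time and average T forward passes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# .train() leaves dropout ON; .eval() would disable it and collapse the samples
net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(16, 1))
net.train()

x = torch.randn(3, 4)
with torch.no_grad():
    samples = torch.stack([net(x) for _ in range(100)])  # shape [T, 3, 1]

mean, std = samples.mean(dim=0), samples.std(dim=0)
assert mean.shape == (3, 1) and (std >= 0).all()
```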


def demo_ensemble():
    """Demonstrate deep ensemble."""
    print("="*70)
    print("Deep Ensemble Demo")
    print("="*70)
    
    # Define simple MLP
    class SimpleMLP(nn.Module):
        def __init__(self, input_dim, hidden_dim, output_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, output_dim)
            )
        
        def forward(self, x):
            return self.net(x)
    
    # Create ensemble
    ensemble = Ensemble(SimpleMLP, num_models=5, input_dim=10, hidden_dim=20, output_dim=1)
    
    x_test = torch.randn(5, 10)
    
    # Individual model predictions
    print("Individual model predictions:")
    for i in range(ensemble.num_models):
        ensemble.eval()
        with torch.no_grad():
            pred = ensemble.forward(x_test, model_idx=i)
        print(f"  Model {i+1}: {pred[:3, 0].tolist()}")
    print()
    
    # Ensemble prediction
    mean, std = ensemble.predict_with_uncertainty(x_test)
    print("Ensemble prediction:")
    for i in range(3):
        print(f"  Sample {i+1}: {mean[i, 0].item():.4f} Β± {std[i, 0].item():.4f}")
    print()
    
    print("Advantage: Ensemble captures model uncertainty via disagreement")
    print("Cost: Train M models independently")
    print()
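The disagreement-as-uncertainty idea reduces to a stack-and-moment computation. A minimal sketch with hypothetical members (plain linear layers, not the `Ensemble` class above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# M independently initialized models; their spread estimates epistemic uncertainty
models = [nn.Linear(10, 1) for _ in range(5)]

x = torch.randn(3, 10)
with torch.no_grad():
    preds = torch.stack([m(x) for m in models])  # shape [M, 3, 1]

mean, std = preds.mean(dim=0), preds.std(dim=0)  # std measures disagreement
assert mean.shape == (3, 1) and (std > 0).all()
```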


def demo_heteroscedastic():
    """Demonstrate heteroscedastic aleatoric uncertainty."""
    print("="*70)
    print("Heteroscedastic Aleatoric Uncertainty Demo")
    print("="*70)
    
    # Create heteroscedastic model
    model = HeteroscedasticMLP(input_dim=1, hidden_dims=[50, 50], output_dim=1)
    
    # Synthetic data: y = x + noise, noise increases with |x|
    x_train = torch.linspace(-3, 3, 100).unsqueeze(1)
    noise_std = 0.1 + 0.5 * torch.abs(x_train)
    y_train = x_train + noise_std * torch.randn_like(x_train)
    
    # Train
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    for epoch in range(500):
        optimizer.zero_grad()
        loss = model.loss(x_train, y_train)
        loss.backward()
        optimizer.step()
        
        if (epoch + 1) % 100 == 0:
            print(f"Epoch {epoch+1}: Loss = {loss.item():.4f}")
    print()
    
    # Predict
    x_test = torch.linspace(-4, 4, 50).unsqueeze(1)
    mean, std = model.predict_with_uncertainty(x_test)
    
    print("Predictions with aleatoric uncertainty:")
    for i in range(0, 50, 10):
        print(f"  x={x_test[i, 0].item():.2f}: y={mean[i, 0].item():.4f} Β± {std[i, 0].item():.4f}")
    print()
    
    print("Key insight: Uncertainty increases with |x| (captures heteroscedastic noise)")
    print()
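The loss being minimized here is the heteroscedastic Gaussian NLL. A sketch of what `HeteroscedasticMLP.loss` presumably computes, up to constants (the network outputs log σ²(x) for numerical stability; the tensors below are made up), checked against PyTorch's built-in `gaussian_nll_loss`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# NLL = 0.5 * (log sigma^2 + (y - mu)^2 / sigma^2), averaged over the batch
mu = torch.randn(8, 1)        # predicted mean
log_var = torch.randn(8, 1)   # predicted log sigma^2(x)
y = torch.randn(8, 1)

nll = 0.5 * (log_var + (y - mu) ** 2 / log_var.exp()).mean()

# PyTorch ships the same loss (without the 0.5*log(2*pi) constant when full=False)
ref = F.gaussian_nll_loss(mu, y, log_var.exp(), full=False)
assert torch.allclose(nll, ref, atol=1e-5)
```

The Οƒ² term in the denominator lets the model down-weight residuals in noisy regions, which is exactly why the learned uncertainty grows with |x| above.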


def print_method_comparison():
    """Print comparison of BNN methods."""
    print("="*70)
    print("Bayesian Neural Network Method Comparison")
    print("="*70)
    print()
    
    comparison = """
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Method              β”‚ Training     β”‚ Inference    β”‚ Uncertainty β”‚ Best For     β”‚
β”‚                     β”‚ Cost         β”‚ Cost         β”‚ Type        β”‚              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Bayes by Backprop   β”‚ ~2Γ— standard β”‚ TΓ— forward   β”‚ Epistemic   β”‚ Principled   β”‚
β”‚                     β”‚ (sample ΞΈ)   β”‚ (sample ΞΈ)   β”‚             β”‚ Bayesian     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ MC Dropout          β”‚ 1Γ— standard  β”‚ TΓ— forward   β”‚ Epistemic   β”‚ Pretrained   β”‚
β”‚                     β”‚ (just drop.) β”‚ (drop. on)   β”‚ (approx)    β”‚ models       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Deep Ensembles      β”‚ MΓ— standard  β”‚ MΓ— forward   β”‚ Epistemic   β”‚ Best         β”‚
β”‚                     β”‚ (M models)   β”‚              β”‚ (diversity) β”‚ performance  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Laplace (last lay.) β”‚ 1Γ— standard  β”‚ 1Γ— forward   β”‚ Epistemic   β”‚ Post-hoc,    β”‚
β”‚                     β”‚ + Hessian    β”‚ + covariance β”‚             β”‚ fast         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Heteroscedastic     β”‚ 1Γ— standard  β”‚ 1Γ— forward   β”‚ Aleatoric   β”‚ Noisy data   β”‚
β”‚                     β”‚              β”‚              β”‚             β”‚              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ SGLD                β”‚ 1Γ— standard  β”‚ TΓ— forward   β”‚ Epistemic   β”‚ Research     β”‚
β”‚                     β”‚ + Langevin   β”‚ (MCMC)       β”‚ (asympt.)   β”‚              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

**Epistemic vs. Aleatoric:**

- **Epistemic**: Model uncertainty (lack of data)
  - Reducible: More data β†’ less uncertainty
  - Captured by: Weight distributions p(ΞΈ|D)
  
- **Aleatoric**: Data noise (inherent randomness)
  - Irreducible: More data doesn't help
  - Captured by: Predictive variance p(y|x, ΞΈ)

**Total uncertainty** = Epistemic + Aleatoric

**Decision guide:**

1. **Use MC Dropout if:**
   - Have pretrained model
   - Need quick uncertainty estimate
   - Limited computational budget

2. **Use Bayes by Backprop if:**
   - Training from scratch
   - Want principled Bayesian inference
   - Can afford ~2Γ— training cost

3. **Use Deep Ensembles if:**
   - Need best empirical performance
   - Can afford MΓ— training (M=5-10)
   - Want diverse predictions

4. **Use Last-Layer Laplace if:**
   - Have pretrained model
   - Want post-hoc uncertainty
   - Need fast inference

5. **Use Heteroscedastic if:**
   - Data has varying noise levels
   - Need aleatoric uncertainty
   - Input-dependent noise

**Illustrative performance (test NLL on a regression benchmark; lower is better):**

- Standard NN: 1.21
- MC Dropout: 0.98
- Bayes by Backprop: 0.91
- Deep Ensemble (M=5): 0.88
"""
    
    print(comparison)
    print()


# ============================================================================
# Run Demonstrations
# ============================================================================

if __name__ == "__main__":
    torch.manual_seed(42)
    np.random.seed(42)
    
    demo_bayes_by_backprop()
    demo_mc_dropout()
    demo_ensemble()
    demo_heteroscedastic()
    print_method_comparison()
    
    print("="*70)
    print("Bayesian Neural Networks Implementations Complete")
    print("="*70)
    print()
    print("Summary:")
    print("  β€’ Bayes by Backprop: Variational inference with weight distributions")
    print("  β€’ MC Dropout: Dropout as Bayesian approximation (zero training cost)")
    print("  β€’ Deep Ensembles: M independent models for diversity")
    print("  β€’ Heteroscedastic: Model aleatoric uncertainty σ²(x)")
    print("  β€’ Last-Layer Laplace: Post-hoc Gaussian approximation")
    print()
    print("Key insight: BNNs provide uncertainty quantification")
    print("Applications: Active learning, OOD detection, safety-critical systems")
    print("Trade-off: Computational cost vs. uncertainty quality")
    print()