1. The Sampling Problem

Why Sample?

Goal: Compute expectations under complex distribution \(p(x)\):

\[\mathbb{E}_{x \sim p}[f(x)] = \int f(x) p(x) dx\]

Problem:

  • High-dimensional integral (can’t compute analytically)

  • \(p(x)\) is complex (e.g., posterior in Bayesian inference)

  • May only know \(p(x)\) up to normalizing constant: \(p(x) = \frac{\tilde{p}(x)}{Z}\)

Monte Carlo Solution

If we can sample \(x_1, ..., x_N \sim p(x)\), then:

\[\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^N f(x_i)\]

Law of Large Numbers: The estimate converges to \(\mathbb{E}_{x \sim p}[f(x)]\) as \(N \to \infty\)

But: How to sample from \(p(x)\) when it’s complex?

Answer: Markov Chain Monte Carlo!

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 4)
np.random.seed(42)
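As a quick sanity check of the Monte Carlo estimator above, we can estimate \(\mathbb{E}[x^2] = 1\) for \(x \sim \mathcal{N}(0, 1)\), a case where direct sampling is easy (a minimal sketch; the test function and sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # direct sampling from p = N(0, 1)
estimate = np.mean(x**2)           # Monte Carlo estimate of E[x^2]
print(f"E[x^2] estimate: {estimate:.3f} (true value: 1)")
```

MCMC exists precisely for the cases where the first line, direct sampling from \(p\), is not available.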

2. Markov Chain Basics

Markov Chain

Definition: Sequence \(X_0, X_1, X_2, ...\) where:

\[P(X_{t+1} | X_t, X_{t-1}, ..., X_0) = P(X_{t+1} | X_t)\]

Transition kernel: \(T(x' | x) = P(X_{t+1} = x' | X_t = x)\)

Stationary Distribution

Definition: \(\pi(x)\) is stationary if:

\[\pi(x') = \int T(x' | x) \pi(x) dx\]

Key Property: If the chain has stationary distribution \(\pi\) and is:

  1. Irreducible: Can reach any state from any state

  2. Aperiodic: Not stuck in cycles

  3. Positive recurrent: Returns to states in finite time

Then: \(\lim_{t \to \infty} P(X_t = x) = \pi(x)\) regardless of \(X_0\)!

Detailed Balance

Sufficient condition for \(\pi\) to be stationary:

\[\pi(x) T(x' | x) = \pi(x') T(x | x')\]

This is called detailed balance or reversibility.
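Both conditions can be checked numerically on a small discrete chain. A sketch with a 3-state birth-death chain (moves only between neighbors, which makes it reversible by construction; the transition probabilities are illustrative):

```python
import numpy as np

# T[i, j] = P(next state = j | current state = i)
T = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

# Iterate the chain from a point mass: it converges to the
# stationary distribution regardless of the starting state.
pi = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    pi = pi @ T
print(pi)  # ~ [0.25, 0.5, 0.25]

# Stationarity: pi = pi T
print(np.allclose(pi, pi @ T))          # True

# Detailed balance: pi_i T(j|i) = pi_j T(i|j), i.e. the flux matrix is symmetric
flux = pi[:, None] * T
print(np.allclose(flux, flux.T))        # True
```

MCMC runs this logic in reverse: instead of finding \(\pi\) for a given \(T\), we design \(T\) so that its stationary distribution is the target we want to sample from.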

3. Metropolis-Hastings Algorithm

The Algorithm

Goal: Sample from \(p(x)\) (target distribution)

Given:

  • Target \(p(x)\) (known up to constant: \(p(x) = \tilde{p}(x) / Z\))

  • Proposal \(q(x' | x)\) (e.g., Gaussian centered at \(x\))

Algorithm:

  1. Initialize \(x_0\)

  2. For \(t = 0, 1, 2, ...\):

    • Sample \(x' \sim q(x' | x_t)\) (proposal)

    • Compute acceptance probability: \(\alpha = \min\left(1, \frac{p(x') q(x_t | x')}{p(x_t) q(x' | x_t)}\right) = \min\left(1, \frac{\tilde{p}(x')}{\tilde{p}(x_t)} \cdot \frac{q(x_t | x')}{q(x' | x_t)}\right)\)

    • With probability \(\alpha\): \(x_{t+1} = x'\) (accept)

    • Otherwise: \(x_{t+1} = x_t\) (reject)

Special case: If \(q\) is symmetric (\(q(x'|x) = q(x|x')\)): \(\alpha = \min\left(1, \frac{p(x')}{p(x_t)}\right)\)

This is the Metropolis algorithm.
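For completeness, a single MH step with the full proposal correction can be sketched as follows, working in log space for numerical stability (the implementation later in this notebook specializes to a symmetric proposal, where the two \(q\) terms cancel; all function names here are illustrative):

```python
import numpy as np

def mh_step(x, log_p_tilde, sample_q, log_q, rng):
    """One general Metropolis-Hastings step.

    log_p_tilde(x): unnormalized log target log p~(x)
    sample_q(x):    draw a proposal x' ~ q(. | x)
    log_q(a, b):    log q(a | b)
    """
    x_prop = sample_q(x)
    # log of the ratio p~(x') q(x | x') / (p~(x) q(x' | x))
    log_alpha = (log_p_tilde(x_prop) - log_p_tilde(x)
                 + log_q(x, x_prop) - log_q(x_prop, x))
    if np.log(rng.random()) < min(0.0, log_alpha):
        return x_prop, True   # accept
    return x, False           # reject
```

Comparing in log space avoids overflow and underflow when the target density spans many orders of magnitude, which is the common case in high dimensions.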

Why It Works

Theorem: The Markov chain satisfies detailed balance with respect to \(p(x)\):

\[p(x) T(x' | x) = p(x') T(x | x')\]

where \(T(x' | x) = q(x' | x) \alpha(x, x')\) is the transition kernel.

Proof: Exercise! (Hint: Consider two cases: \(p(x')q(x|x') > p(x)q(x'|x)\) and vice versa)

# Implement Metropolis-Hastings for 2D distribution

def target_distribution(x):
    """Target: Mixture of 2 Gaussians"""
    mu1, mu2 = np.array([0, 0]), np.array([3, 3])
    cov1 = np.array([[1, 0.5], [0.5, 1]])
    cov2 = np.array([[1, -0.5], [-0.5, 1]])
    
    p1 = stats.multivariate_normal.pdf(x, mu1, cov1)
    p2 = stats.multivariate_normal.pdf(x, mu2, cov2)
    
    return 0.6 * p1 + 0.4 * p2

def metropolis_hastings(target, n_samples=10000, proposal_std=0.5, x0=None):
    """Random-walk Metropolis (symmetric Gaussian proposal, so the q-ratio cancels)"""
    
    # Initialize
    if x0 is None:
        x = np.random.randn(2)
    else:
        x = x0.copy()
    
    samples = np.zeros((n_samples, 2))
    accepted = 0
    
    for i in range(n_samples):
        # Propose
        x_proposal = x + np.random.randn(2) * proposal_std
        
        # Compute acceptance ratio
        # For symmetric proposal, just ratio of target densities
        alpha = min(1, target(x_proposal) / target(x))
        
        # Accept/reject
        if np.random.rand() < alpha:
            x = x_proposal
            accepted += 1
        
        samples[i] = x
    
    acceptance_rate = accepted / n_samples
    return samples, acceptance_rate

# Run Metropolis-Hastings
print("Running Metropolis-Hastings...")
samples, acc_rate = metropolis_hastings(target_distribution, n_samples=10000, proposal_std=1.0)
print(f"Acceptance rate: {acc_rate:.2%}")

# Visualize
fig = plt.figure(figsize=(16, 5))

# True distribution
ax1 = fig.add_subplot(131)
x1 = np.linspace(-3, 6, 100)
x2 = np.linspace(-3, 6, 100)
X1, X2 = np.meshgrid(x1, x2)
Z = np.zeros_like(X1)
for i in range(X1.shape[0]):
    for j in range(X1.shape[1]):
        Z[i, j] = target_distribution(np.array([X1[i, j], X2[i, j]]))

ax1.contourf(X1, X2, Z, levels=20, cmap='viridis', alpha=0.8)
ax1.set_xlabel('$x_1$')
ax1.set_ylabel('$x_2$')
ax1.set_title('True Distribution (Target)', fontsize=14, fontweight='bold')
ax1.axis('equal')

# MCMC samples
ax2 = fig.add_subplot(132)
ax2.scatter(samples[:, 0], samples[:, 1], alpha=0.3, s=1, c='blue')
ax2.set_xlabel('$x_1$')
ax2.set_ylabel('$x_2$')
ax2.set_title(f'MCMC Samples (N={len(samples)})', fontsize=14, fontweight='bold')
ax2.axis('equal')

# Trace plot (first 500 samples)
ax3 = fig.add_subplot(133)
ax3.plot(samples[:500, 0], alpha=0.7, label='$x_1$')
ax3.plot(samples[:500, 1], alpha=0.7, label='$x_2$')
ax3.set_xlabel('Iteration')
ax3.set_ylabel('Value')
ax3.set_title('Trace Plot (First 500 iterations)', fontsize=14, fontweight='bold')
ax3.legend()
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n✅ Metropolis-Hastings successfully sampled from target distribution!")

4. Gibbs Sampling

The Algorithm

A special case of Metropolis-Hastings for multivariate distributions

Setup: Want to sample from \(p(x_1, ..., x_d)\)

Requirement: Can sample from conditionals \(p(x_i | x_{-i})\)

Algorithm:

  1. Initialize \((x_1^{(0)}, ..., x_d^{(0)})\)

  2. For \(t = 0, 1, 2, ...\):

    • Sample \(x_1^{(t+1)} \sim p(x_1 | x_2^{(t)}, ..., x_d^{(t)})\)

    • Sample \(x_2^{(t+1)} \sim p(x_2 | x_1^{(t+1)}, x_3^{(t)}, ..., x_d^{(t)})\)

    • …

    • Sample \(x_d^{(t+1)} \sim p(x_d | x_1^{(t+1)}, ..., x_{d-1}^{(t+1)})\)

Key Property: Acceptance probability is always 1! (No rejections)

Example: Bivariate Gaussian

For \(p(x_1, x_2)\) jointly Gaussian:

  • \(p(x_1 | x_2)\) is Gaussian

  • \(p(x_2 | x_1)\) is Gaussian

Both easy to sample from!
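Concretely, for zero means, unit variances, and correlation \(\rho\), the standard Gaussian conditioning formula gives:

\[x_1 \mid x_2 \sim \mathcal{N}(\rho x_2,\; 1 - \rho^2), \qquad x_2 \mid x_1 \sim \mathcal{N}(\rho x_1,\; 1 - \rho^2)\]

These are exactly the conditionals sampled in the code below.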

# Gibbs sampling for bivariate Gaussian

def gibbs_sampling_gaussian(n_samples=10000, rho=0.8):
    """Gibbs sampling for N([0,0], [[1, rho], [rho, 1]])"""
    
    samples = np.zeros((n_samples, 2))
    x = np.array([0.0, 0.0])  # Initialize
    
    for i in range(n_samples):
        # Sample x1 | x2
        # Conditional: x1 | x2 ~ N(rho * x2, 1 - rho^2)
        x[0] = np.random.randn() * np.sqrt(1 - rho**2) + rho * x[1]
        
        # Sample x2 | x1
        # Conditional: x2 | x1 ~ N(rho * x1, 1 - rho^2)
        x[1] = np.random.randn() * np.sqrt(1 - rho**2) + rho * x[0]
        
        samples[i] = x.copy()
    
    return samples

# Run Gibbs sampling
rho = 0.8
samples_gibbs = gibbs_sampling_gaussian(n_samples=5000, rho=rho)

# True samples (for comparison)
mean = [0, 0]
cov = [[1, rho], [rho, 1]]
samples_true = np.random.multivariate_normal(mean, cov, 5000)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# True distribution
axes[0].scatter(samples_true[:, 0], samples_true[:, 1], alpha=0.3, s=10)
axes[0].set_xlabel('$x_1$')
axes[0].set_ylabel('$x_2$')
axes[0].set_title('True Distribution', fontsize=14, fontweight='bold')
axes[0].axis('equal')
axes[0].grid(True, alpha=0.3)

# Gibbs samples
axes[1].scatter(samples_gibbs[:, 0], samples_gibbs[:, 1], alpha=0.3, s=10, color='orange')
axes[1].set_xlabel('$x_1$')
axes[1].set_ylabel('$x_2$')
axes[1].set_title(f'Gibbs Sampling (ρ={rho})', fontsize=14, fontweight='bold')
axes[1].axis('equal')
axes[1].grid(True, alpha=0.3)

# Trace plot showing zigzag pattern
axes[2].plot(samples_gibbs[:100, 0], samples_gibbs[:100, 1], 'o-', alpha=0.6, markersize=4)
axes[2].set_xlabel('$x_1$')
axes[2].set_ylabel('$x_2$')
axes[2].set_title('Gibbs Sampling Path (100 iterations)', fontsize=14, fontweight='bold')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✅ Gibbs sampling matches true distribution!")
print(f"True correlation: {rho:.3f}")
print(f"Gibbs correlation: {np.corrcoef(samples_gibbs.T)[0,1]:.3f}")

5. Convergence and Diagnostics

Burn-in Period

Problem: Early samples depend on initialization

Solution: Discard first \(B\) samples (burn-in)

Mixing and Autocorrelation

Good mixing: Chain explores space quickly

Poor mixing: Chain gets stuck, high autocorrelation

Autocorrelation: \(\text{ACF}(k) = \frac{\text{Cov}(X_t, X_{t+k})}{\text{Var}(X_t)}\)

Effective sample size: \(N_{\text{eff}} = \frac{N}{1 + 2\sum_{k=1}^\infty \text{ACF}(k)}\)
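A minimal sketch of this formula, truncating the ACF sum at the first non-positive term (a common heuristic; the AR(1) test chain and the lag cutoff are illustrative choices):

```python
import numpy as np

def effective_sample_size(x, max_lag=500):
    """N_eff = N / (1 + 2 * sum_k ACF(k)), with the sum truncated
    at the first non-positive ACF estimate."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    c0 = np.dot(x, x) / n
    tau = 1.0                                   # integrated autocorrelation time
    for k in range(1, min(max_lag, n)):
        acf_k = (np.dot(x[:-k], x[k:]) / n) / c0
        if acf_k <= 0:
            break
        tau += 2.0 * acf_k
    return n / tau

# AR(1) chain with phi = 0.9: theoretically N_eff / N = (1 - phi) / (1 + phi) ~ 0.053
rng = np.random.default_rng(0)
phi, n = 0.9, 50_000
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.standard_normal()
print(f"N_eff / N ~ {effective_sample_size(x) / n:.3f}")
```

Even with 50,000 draws, a strongly autocorrelated chain carries the information of only a few thousand independent samples.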

Diagnostics

  1. Trace plots: Visual inspection of mixing

  2. Autocorrelation plots: Check for independence

  3. Gelman-Rubin statistic: Compare multiple chains

  4. Effective sample size: Adjust for autocorrelation
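The Gelman-Rubin statistic from the list above can be sketched as follows (basic, non-split \(\hat{R}\): it compares between-chain and within-chain variance; the synthetic chains are illustrative):

```python
import numpy as np

def gelman_rubin(chains):
    """Basic R-hat for an (m, n) array of m chains of length n."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    B = n * chain_means.var(ddof=1)            # between-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(0)
mixed = rng.standard_normal((4, 2000))             # 4 well-mixed chains -> R-hat ~ 1
stuck = mixed + np.array([[0], [0], [0], [5.0]])   # one chain stuck elsewhere -> R-hat >> 1
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Values near 1 suggest the chains agree on where the mass is; values well above 1 indicate that at least one chain has not converged to the same distribution.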

# Compute autocorrelation

def autocorrelation(x, max_lag=50):
    """Compute autocorrelation function"""
    x = x - x.mean()
    c0 = np.dot(x, x) / len(x)
    
    acf = np.zeros(max_lag)
    for k in range(max_lag):
        # Lag-k autocovariance: dot product of the series with itself shifted by k
        c_k = np.dot(x[:-k], x[k:]) / len(x) if k > 0 else c0
        acf[k] = c_k / c0
    
    return acf

# Compare autocorrelation for different proposal stds
stds = [0.1, 1.0, 3.0]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, std in enumerate(stds):
    samples, acc_rate = metropolis_hastings(target_distribution, n_samples=5000, proposal_std=std)
    acf = autocorrelation(samples[:, 0], max_lag=100)
    
    axes[idx].plot(acf, linewidth=2)
    axes[idx].axhline(y=0, color='r', linestyle='--', alpha=0.5)
    axes[idx].set_xlabel('Lag')
    axes[idx].set_ylabel('Autocorrelation')
    axes[idx].set_title(f'Proposal std={std:.1f}\nAccept rate={acc_rate:.2%}', 
                       fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("📊 Autocorrelation analysis:")
print("  - Too small proposal → slow mixing, high autocorrelation")
print("  - Too large proposal → low acceptance, slow mixing")
print("  - Rule of thumb: 20-50% acceptance (~23% is asymptotically optimal for random-walk Metropolis)")

6. Summary

MCMC Framework

✅ Goal: Sample from complex distributions

✅ Method: Construct a Markov chain whose stationary distribution is the target

✅ Convergence: After burn-in, the chain yields samples from the target

Key Algorithms

  1. Metropolis-Hastings

    • General purpose

    • Requires proposal distribution

    • May have rejections

  2. Gibbs Sampling

    • Special case of MH

    • Sample from conditionals

    • No rejections, but can be slow

  3. Hamiltonian Monte Carlo (not covered)

    • Uses gradient information

    • Better mixing in high dimensions

    • Used in Stan, PyMC3

Practical Considerations

⚠️ Burn-in: Discard early samples

⚠️ Thinning: Keep every k-th sample to reduce autocorrelation

⚠️ Diagnostics: Check convergence!

⚠️ Multiple chains: Start from different initializations
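In code, burn-in and thinning reduce to simple array slicing (the chain here is a stand-in for real MCMC output, and the burn-in/thinning values are illustrative choices):

```python
import numpy as np

raw_chain = np.random.randn(10_000, 2)   # stand-in for a raw 2D MCMC chain
burn_in, thin = 1_000, 5                 # illustrative choices
kept = raw_chain[burn_in::thin]          # drop the first 1000 draws, keep every 5th
print(kept.shape)                        # (1800, 2)
```

Note that thinning discards information; it mainly helps when storage or downstream computation on the full chain is expensive.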

Applications

  • Bayesian inference: Sample from posterior

  • Statistical physics: Partition functions

  • Machine learning: Latent variable models

  • Optimization: Simulated annealing

Next Notebook: 11_variational_inference.ipynb