1. The Sampling Problem
Why Sample?
Goal: Compute expectations under a complex distribution \(p(x)\):
\(\mathbb{E}_{p}[f(x)] = \int f(x)\, p(x)\, dx\)
Problem:
High-dimensional integral (can't compute analytically)
\(p(x)\) is complex (e.g., posterior in Bayesian inference)
May only know \(p(x)\) up to a normalizing constant: \(p(x) = \frac{\tilde{p}(x)}{Z}\)
Monte Carlo Solution
If we can sample \(x_1, ..., x_N \sim p(x)\), then:
\(\mathbb{E}_{p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)\)
Law of Large Numbers: This estimate converges as \(N \to \infty\)
But: How do we sample from \(p(x)\) when it's complex?
Answer: Markov Chain Monte Carlo!
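As a quick sanity check of the Monte Carlo idea, here is a minimal sketch estimating \(\mathbb{E}[x^2]\) under a standard normal, where direct sampling is easy and the true value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N i.i.d. samples from p(x) = N(0, 1)
x = rng.standard_normal(100_000)

# Monte Carlo estimate of E[x^2] (true value: Var + mean^2 = 1)
estimate = np.mean(x**2)
print(f"Monte Carlo estimate of E[x^2]: {estimate:.4f}")
```

MCMC exists precisely for the cases where the sampling step above is *not* easy.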
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 4)
np.random.seed(42)
2. Markov Chain Basics
Markov Chain
Definition: Sequence \(X_0, X_1, X_2, ...\) where:
\(P(X_{t+1} \mid X_t, X_{t-1}, ..., X_0) = P(X_{t+1} \mid X_t)\) (the Markov property)
Transition kernel: \(T(x' | x) = P(X_{t+1} = x' | X_t = x)\)
Stationary Distribution
Definition: \(\pi(x)\) is stationary if:
\(\pi(x') = \sum_{x} \pi(x)\, T(x' | x)\)
Key Property: If the chain has a stationary distribution and is:
Irreducible: Can reach any state from any state
Aperiodic: Not stuck in cycles
Positive recurrent: Returns to states in finite expected time
Then: \(\lim_{t \to \infty} P(X_t = x) = \pi(x)\) regardless of \(X_0\)!
Detailed Balance
Sufficient condition for \(\pi\) to be stationary:
\(\pi(x)\, T(x' | x) = \pi(x')\, T(x | x')\) for all \(x, x'\)
This is called detailed balance or reversibility.
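These properties can be checked numerically. The sketch below uses a small 3-state transition matrix (chosen here purely for illustration), extracts its stationary distribution as the left eigenvector with eigenvalue 1, and confirms that repeated transitions converge to it from an arbitrary start:

```python
import numpy as np

# Row-stochastic transition matrix: T[i, j] = P(next = j | current = i)
T = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution: left eigenvector of T with eigenvalue 1
vals, vecs = np.linalg.eig(T.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

# Stationarity: pi T = pi
print(np.allclose(pi @ T, pi))  # True

# Convergence: the distribution of X_t approaches pi regardless of X_0
p0 = np.array([1.0, 0.0, 0.0])  # start deterministically in state 0
pt = p0 @ np.linalg.matrix_power(T, 50)
print(np.allclose(pt, pi, atol=1e-6))  # True
```

Note that detailed balance is sufficient but not necessary: a chain like this one can have a stationary distribution without being reversible.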
3. Metropolis-Hastings Algorithm
The Algorithm
Goal: Sample from \(p(x)\) (target distribution)
Given:
Target \(p(x)\) (known up to constant: \(p(x) = \tilde{p}(x) / Z\))
Proposal \(q(x' | x)\) (e.g., Gaussian centered at \(x\))
Algorithm:
Initialize \(x_0\)
For \(t = 0, 1, 2, ...\):
Sample \(x' \sim q(x' | x_t)\) (proposal)
Compute acceptance probability: \(\alpha = \min\left(1, \frac{p(x') q(x_t | x')}{p(x_t) q(x' | x_t)}\right) = \min\left(1, \frac{\tilde{p}(x')}{\tilde{p}(x_t)} \cdot \frac{q(x_t | x')}{q(x' | x_t)}\right)\) (the normalizing constant \(Z\) cancels)
With probability \(\alpha\): \(x_{t+1} = x'\) (accept)
Otherwise: \(x_{t+1} = x_t\) (reject)
Special case: If \(q\) is symmetric (\(q(x'|x) = q(x|x')\)): \(\alpha = \min\left(1, \frac{p(x')}{p(x_t)}\right)\)
This is the Metropolis algorithm.
Why It Works
Theorem: The Markov chain satisfies detailed balance with respect to \(p(x)\):
\(p(x)\, T(x' | x) = p(x')\, T(x | x')\)
where \(T(x' | x) = q(x' | x)\, \alpha(x, x')\) is the transition kernel (for \(x' \neq x\)).
Proof: Exercise! (Hint: Consider two cases: \(p(x')q(x|x') > p(x)q(x'|x)\) and vice versa)
# Implement Metropolis-Hastings for 2D distribution
def target_distribution(x):
    """Target: Mixture of 2 Gaussians"""
    mu1, mu2 = np.array([0, 0]), np.array([3, 3])
    cov1 = np.array([[1, 0.5], [0.5, 1]])
    cov2 = np.array([[1, -0.5], [-0.5, 1]])
    p1 = stats.multivariate_normal.pdf(x, mu1, cov1)
    p2 = stats.multivariate_normal.pdf(x, mu2, cov2)
    return 0.6 * p1 + 0.4 * p2
def metropolis_hastings(target, n_samples=10000, proposal_std=0.5, x0=None):
    """Metropolis-Hastings with Gaussian proposal"""
    # Initialize
    if x0 is None:
        x = np.random.randn(2)
    else:
        x = x0.copy()
    samples = np.zeros((n_samples, 2))
    accepted = 0
    for i in range(n_samples):
        # Propose
        x_proposal = x + np.random.randn(2) * proposal_std
        # Compute acceptance ratio
        # For a symmetric proposal, just the ratio of target densities
        alpha = min(1, target(x_proposal) / target(x))
        # Accept/reject
        if np.random.rand() < alpha:
            x = x_proposal
            accepted += 1
        samples[i] = x
    acceptance_rate = accepted / n_samples
    return samples, acceptance_rate
# Run Metropolis-Hastings
print("Running Metropolis-Hastings...")
samples, acc_rate = metropolis_hastings(target_distribution, n_samples=10000, proposal_std=1.0)
print(f"Acceptance rate: {acc_rate:.2%}")
# Visualize
fig = plt.figure(figsize=(16, 5))
# True distribution
ax1 = fig.add_subplot(131)
x1 = np.linspace(-3, 6, 100)
x2 = np.linspace(-3, 6, 100)
X1, X2 = np.meshgrid(x1, x2)
Z = np.zeros_like(X1)
for i in range(X1.shape[0]):
    for j in range(X1.shape[1]):
        Z[i, j] = target_distribution(np.array([X1[i, j], X2[i, j]]))
ax1.contourf(X1, X2, Z, levels=20, cmap='viridis', alpha=0.8)
ax1.set_xlabel('$x_1$')
ax1.set_ylabel('$x_2$')
ax1.set_title('True Distribution (Target)', fontsize=14, fontweight='bold')
ax1.axis('equal')
# MCMC samples
ax2 = fig.add_subplot(132)
ax2.scatter(samples[:, 0], samples[:, 1], alpha=0.3, s=1, c='blue')
ax2.set_xlabel('$x_1$')
ax2.set_ylabel('$x_2$')
ax2.set_title(f'MCMC Samples (N={len(samples)})', fontsize=14, fontweight='bold')
ax2.axis('equal')
# Trace plot (first 500 samples)
ax3 = fig.add_subplot(133)
ax3.plot(samples[:500, 0], alpha=0.7, label='$x_1$')
ax3.plot(samples[:500, 1], alpha=0.7, label='$x_2$')
ax3.set_xlabel('Iteration')
ax3.set_ylabel('Value')
ax3.set_title('Trace Plot (First 500 iterations)', fontsize=14, fontweight='bold')
ax3.legend()
ax3.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\n✅ Metropolis-Hastings successfully sampled from target distribution!")
4. Gibbs Sampling
The Algorithm
Special case of Metropolis-Hastings for multivariate distributions
Setup: Want to sample from \(p(x_1, ..., x_d)\)
Requirement: Can sample from conditionals \(p(x_i | x_{-i})\)
Algorithm:
Initialize \((x_1^{(0)}, ..., x_d^{(0)})\)
For \(t = 0, 1, 2, ...\):
Sample \(x_1^{(t+1)} \sim p(x_1 | x_2^{(t)}, ..., x_d^{(t)})\)
Sample \(x_2^{(t+1)} \sim p(x_2 | x_1^{(t+1)}, x_3^{(t)}, ..., x_d^{(t)})\)
⋮
Sample \(x_d^{(t+1)} \sim p(x_d | x_1^{(t+1)}, ..., x_{d-1}^{(t+1)})\)
Key Property: Acceptance probability is always 1! (No rejections)
Example: Bivariate Gaussian
For \(p(x_1, x_2)\) jointly Gaussian:
\(p(x_1 | x_2)\) is Gaussian
\(p(x_2 | x_1)\) is Gaussian
Both easy to sample from!
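Concretely, for the zero-mean bivariate Gaussian with unit variances and correlation \(\rho\) used in the code below, the conditionals have a standard closed form:

\(x_1 \mid x_2 \sim \mathcal{N}(\rho x_2,\, 1 - \rho^2)\) and \(x_2 \mid x_1 \sim \mathcal{N}(\rho x_1,\, 1 - \rho^2)\)

These are exactly the distributions sampled in each Gibbs step of the implementation.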
# Gibbs sampling for bivariate Gaussian
def gibbs_sampling_gaussian(n_samples=10000, rho=0.8):
    """Gibbs sampling for N([0,0], [[1, rho], [rho, 1]])"""
    samples = np.zeros((n_samples, 2))
    x = np.array([0.0, 0.0])  # Initialize
    for i in range(n_samples):
        # Sample x1 | x2 ~ N(rho * x2, 1 - rho^2)
        x[0] = np.random.randn() * np.sqrt(1 - rho**2) + rho * x[1]
        # Sample x2 | x1 ~ N(rho * x1, 1 - rho^2)
        x[1] = np.random.randn() * np.sqrt(1 - rho**2) + rho * x[0]
        samples[i] = x.copy()
    return samples
# Run Gibbs sampling
rho = 0.8
samples_gibbs = gibbs_sampling_gaussian(n_samples=5000, rho=rho)
# True samples (for comparison)
mean = [0, 0]
cov = [[1, rho], [rho, 1]]
samples_true = np.random.multivariate_normal(mean, cov, 5000)
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# True distribution
axes[0].scatter(samples_true[:, 0], samples_true[:, 1], alpha=0.3, s=10)
axes[0].set_xlabel('$x_1$')
axes[0].set_ylabel('$x_2$')
axes[0].set_title('True Distribution', fontsize=14, fontweight='bold')
axes[0].axis('equal')
axes[0].grid(True, alpha=0.3)
# Gibbs samples
axes[1].scatter(samples_gibbs[:, 0], samples_gibbs[:, 1], alpha=0.3, s=10, color='orange')
axes[1].set_xlabel('$x_1$')
axes[1].set_ylabel('$x_2$')
axes[1].set_title(f'Gibbs Sampling (ρ={rho})', fontsize=14, fontweight='bold')
axes[1].axis('equal')
axes[1].grid(True, alpha=0.3)
# Trace plot showing zigzag pattern
axes[2].plot(samples_gibbs[:100, 0], samples_gibbs[:100, 1], 'o-', alpha=0.6, markersize=4)
axes[2].set_xlabel('$x_1$')
axes[2].set_ylabel('$x_2$')
axes[2].set_title('Gibbs Sampling Path (100 iterations)', fontsize=14, fontweight='bold')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("✅ Gibbs sampling matches true distribution!")
print(f"True correlation: {rho:.3f}")
print(f"Gibbs correlation: {np.corrcoef(samples_gibbs.T)[0,1]:.3f}")
5. Convergence and Diagnostics
Burn-in Period
Problem: Early samples depend on initialization
Solution: Discard first \(B\) samples (burn-in)
Mixing and Autocorrelation
Good mixing: Chain explores space quickly
Poor mixing: Chain gets stuck, high autocorrelation
Autocorrelation: \(\text{ACF}(k) = \frac{\text{Cov}(X_t, X_{t+k})}{\text{Var}(X_t)}\)
Effective sample size: \(N_{\text{eff}} = \frac{N}{1 + 2\sum_{k=1}^\infty \text{ACF}(k)}\)
Diagnostics
Trace plots: Visual inspection of mixing
Autocorrelation plots: Check for independence
Gelman-Rubin statistic: Compare multiple chains
Effective sample size: Adjust for autocorrelation
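The Gelman-Rubin statistic compares between-chain and within-chain variance; values near 1 suggest convergence. A minimal sketch of the classic (non-split) version, applied to synthetic chains stored row-wise:

```python
import numpy as np

def gelman_rubin(chains):
    """Classic (non-split) R-hat for an (m, n) array: m chains of length n."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)

# Converged case: 4 chains all sampling the same N(0, 1)
good = rng.standard_normal((4, 5_000))
print(f"R-hat (converged): {gelman_rubin(good):.3f}")       # close to 1

# Non-converged case: chains stuck around different means
bad = good + np.arange(4)[:, None]
print(f"R-hat (not converged): {gelman_rubin(bad):.3f}")    # well above 1
```

Modern practice (e.g., in Stan) uses a split-chain, rank-normalized variant, but the idea is the same.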
# Compute autocorrelation
def autocorrelation(x, max_lag=50):
    """Compute the empirical autocorrelation function up to max_lag."""
    x = x - x.mean()
    c0 = np.dot(x, x) / len(x)
    acf = np.zeros(max_lag)
    for k in range(max_lag):
        # Lag-k autocovariance (lag 0 is just c0)
        c_k = np.dot(x[:-k], x[k:]) / len(x) if k > 0 else c0
        acf[k] = c_k / c0
    return acf
# Compare autocorrelation for different proposal stds
stds = [0.1, 1.0, 3.0]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for idx, std in enumerate(stds):
    samples, acc_rate = metropolis_hastings(target_distribution, n_samples=5000, proposal_std=std)
    acf = autocorrelation(samples[:, 0], max_lag=100)
    axes[idx].plot(acf, linewidth=2)
    axes[idx].axhline(y=0, color='r', linestyle='--', alpha=0.5)
    axes[idx].set_xlabel('Lag')
    axes[idx].set_ylabel('Autocorrelation')
    axes[idx].set_title(f'Proposal std={std:.1f}\nAccept rate={acc_rate:.2%}',
                        fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("📊 Autocorrelation analysis:")
print("  - Too small proposal → slow mixing, high autocorrelation")
print("  - Too large proposal → low acceptance, slow mixing")
print("  - Optimal: 20-50% acceptance rate")
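The effective sample size can be estimated directly from the ACF by truncating the sum at the first negative lag (a simple heuristic; fancier truncation rules exist). A self-contained sketch on a synthetic AR(1) chain, where theory predicts \(N_{\text{eff}}/N \approx (1-\phi)/(1+\phi)\):

```python
import numpy as np

def effective_sample_size(x, max_lag=200):
    """N_eff = N / (1 + 2 * sum of ACF), truncated at the first negative lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    c0 = np.dot(x, x) / n
    tau = 1.0
    for k in range(1, max_lag):
        rho_k = np.dot(x[:-k], x[k:]) / n / c0
        if rho_k < 0:          # truncation heuristic: stop at first negative lag
            break
        tau += 2.0 * rho_k
    return n / tau

# Synthetic AR(1) chain: x_t = phi * x_{t-1} + noise, with ACF(k) = phi^k
rng = np.random.default_rng(0)
phi, n = 0.9, 50_000
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.standard_normal()

ess = effective_sample_size(x)
print(f"N = {n}, N_eff = {ess:.0f}")
# Theory: N_eff / N = (1 - phi) / (1 + phi), about 5% here
```

Applying the same function to the MCMC samples above would quantify how much the autocorrelation seen in the plots shrinks the usable sample count.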
6. Summary
MCMC Framework
✅ Goal: Sample from complex distributions
✅ Method: Construct a Markov chain whose stationary distribution equals the target
✅ Convergence: Eventually samples from the target (after burn-in)
Key Algorithms
Metropolis-Hastings
General purpose
Requires proposal distribution
May have rejections
Gibbs Sampling
Special case of MH
Sample from conditionals
No rejections, but can be slow
Hamiltonian Monte Carlo (not covered)
Uses gradient information
Better mixing in high dimensions
Used in Stan, PyMC3
Practical Considerations
⚠️ Burn-in: Discard early samples
⚠️ Thinning: Keep every k-th sample to reduce autocorrelation
⚠️ Diagnostics: Check convergence!
⚠️ Multiple chains: Start from different initializations
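Burn-in and thinning amount to a single slice of the sample array. A minimal sketch (the burn-in length and thinning factor here are illustrative choices, not universal defaults):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal((10_000, 2))   # stand-in for raw MCMC output

burn_in = 1_000   # discard early, initialization-dependent draws
thin = 10         # keep every 10th draw to reduce autocorrelation

kept = samples[burn_in::thin]
print(kept.shape)  # (900, 2)
```

Note that thinning discards information; it is mainly useful for reducing storage, while burn-in and convergence diagnostics are the essential steps.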
Applications
Bayesian inference: Sample from posterior
Statistical physics: Partition functions
Machine learning: Latent variable models
Optimization: Simulated annealing
Next Notebook: 11_variational_inference.ipynb