import torch
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

1. Gaussian Process Theory

Definition:

\[f \sim \mathcal{GP}(m, k)\]

where \(m(x)\) is the mean function and \(k(x,x')\) is the kernel.

Posterior:

\[f_* | X, y, X_* \sim \mathcal{N}(\mu_*, \Sigma_*)\]
\[\mu_* = K_*^T (K + \sigma_n^2 I)^{-1} y\]
\[\Sigma_* = K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*\]

📚 Reference Materials:

1.5. Gaussian Process Theory: Deep Mathematical Foundations

What is a Gaussian Process?

Definition: A GP is a collection of random variables, any finite subset of which has a joint Gaussian distribution.

\[f \sim \mathcal{GP}(m(x), k(x, x'))\]

where:

  • Mean function: \(m(x) = \mathbb{E}[f(x)]\)

  • Covariance/Kernel function: \(k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]\)

Intuition: Instead of learning a single function, we learn a distribution over functions.

From Functions to Distributions

Traditional regression: \(y = f(x; \theta) + \epsilon\). Find the best parameters \(\theta\).

GP regression: \(f(x) \sim \mathcal{GP}(0, k(x, x'))\), with observations \(y = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma_n^2)\).

Every finite collection of function values is Gaussian:

\[\begin{split}\begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} k(x_1,x_1) & k(x_1,x_2) & \cdots \\ k(x_2,x_1) & k(x_2,x_2) & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}\right)\end{split}\]

Posterior Derivation: Step-by-Step

Setup:

  • Training: \(X = \{x_1, \ldots, x_n\}\), \(\mathbf{y} = \{y_1, \ldots, y_n\}\)

  • Test: \(X_* = \{x_{*1}, \ldots, x_{*m}\}\)

  • Want: \(p(f_* | X, \mathbf{y}, X_*)\)

Prior (before seeing data):

\[\begin{split}\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K + \sigma_n^2 I & K_* \\ K_*^T & K_{**} \end{bmatrix}\right)\end{split}\]

where:

  • \(K = k(X, X)\) is \(n \times n\) (training covariance)

  • \(K_* = k(X, X_*)\) is \(n \times m\) (cross-covariance)

  • \(K_{**} = k(X_*, X_*)\) is \(m \times m\) (test covariance)

Gaussian conditioning formula:

If \(\begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}\right)\)

Then: \(\mathbf{y} | \mathbf{x} \sim \mathcal{N}(\mu_y + \Sigma_{yx}\Sigma_{xx}^{-1}(\mathbf{x} - \mu_x), \Sigma_{yy} - \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy})\)

Applying to our case:

\[\boxed{\mathbf{f}_* | X, \mathbf{y}, X_* \sim \mathcal{N}(\mu_*, \Sigma_*)}\]
\[\boxed{\mu_* = K_*^T (K + \sigma_n^2 I)^{-1} \mathbf{y}}\]
\[\boxed{\Sigma_* = K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*}\]

Interpretation:

  • Mean \(\mu_*\): Weighted combination of training labels

    • Weights depend on similarity (kernel) between test and training points

  • Variance \(\Sigma_*\): Uncertainty decreases near training data

    • \(K_{**}\): Prior uncertainty (far from data)

    • Second term: Reduction due to observations
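The boxed posterior formulas can be sanity-checked numerically with a one-point toy example; all names below are local to this sketch (pure NumPy, nothing from the notebook's later cells):

```python
import numpy as np

# Toy check of the posterior formulas: one training point, test point on top of it
k = lambda a, b: np.exp(-(a - b) ** 2 / 2.0)   # RBF with unit length scale and variance

x_tr, y_tr = 0.0, 1.0        # single observation
x_te = 0.0                   # test input exactly at the training input
sigma_n2 = 1e-8              # near-noiseless

K, K_s, K_ss = k(x_tr, x_tr), k(x_tr, x_te), k(x_te, x_te)
mu_star = K_s * y_tr / (K + sigma_n2)       # mean: pulled to the observed label
var_star = K_ss - K_s**2 / (K + sigma_n2)   # variance: collapses near data
```

At the observed input the mean reproduces the label and the variance drops to (almost) zero, matching the interpretation bullets above.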

Kernel Functions: Mathematical Properties

Definition: A kernel \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is valid if its Gram matrix is positive semi-definite (PSD) for any finite set of points.

Mercer’s Theorem: A continuous, symmetric, PSD kernel \(k(x, x')\) can be expressed as:

\[k(x, x') = \sum_{i=1}^\infty \lambda_i \phi_i(x) \phi_i(x')\]

where \(\lambda_i \geq 0\) are eigenvalues and \(\phi_i\) are eigenfunctions.
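The PSD requirement is easy to check empirically for a given kernel: the eigenvalues of any Gram matrix should be non-negative up to round-off. A minimal NumPy check for the RBF kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))

# RBF Gram matrix on 50 random points (unit length scale)
d2 = (X - X.T) ** 2
K = np.exp(-d2 / 2.0)

eigvals = np.linalg.eigvalsh(K)   # symmetric eigenvalue solver
min_eig = eigvals.min()           # should be >= 0 up to floating-point error
```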

Popular Kernels:

  1. Squared Exponential (RBF):

\[k_{\text{SE}}(x, x') = \sigma^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)\]
  • Properties: Infinitely differentiable, stationary, isotropic

  • Spectral density: Gaussian

  • Use: Smooth functions

  2. Matérn:

\[k_{\text{Mat}}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}r}{\ell}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}r}{\ell}\right)\]

where \(r = \|x - x'\|\) and \(K_\nu\) is the modified Bessel function of the second kind.

  • Special cases:

    • \(\nu = 1/2\): Exponential kernel (not differentiable)

    • \(\nu = 3/2\): Once differentiable

    • \(\nu = 5/2\): Twice differentiable

    • \(\nu \to \infty\): RBF

  3. Periodic:

\[k_{\text{Per}}(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi|x-x'|/p)}{\ell^2}\right)\]
  • Period \(p\), length scale \(\ell\)

  • For periodic data (seasons, daily patterns)

  4. Rational Quadratic (mixture of RBF):

\[k_{\text{RQ}}(x, x') = \sigma^2 \left(1 + \frac{\|x-x'\|^2}{2\alpha\ell^2}\right)^{-\alpha}\]
  • Scale mixture of SE kernels with different length scales

  • \(\alpha \to \infty\): Becomes SE

Kernel Algebra: Combine kernels to build complex ones!

  • Sum: \(k_1 + k_2\) β†’ Models different patterns

  • Product: \(k_1 \cdot k_2\) β†’ Conjunction of properties

  • Composition: \(k_1(f(x), f(x'))\) β†’ Feature transformation

Example combination: \(k(x, x') = k_{\text{trend}}(x, x') + k_{\text{periodic}}(x, x') + k_{\text{noise}}(x, x')\)
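A quick sketch of kernel algebra in NumPy (the `rbf`/`periodic` helpers below are local illustrations, not the notebook's functions): summing a slow trend kernel, a periodic kernel, and a small-scale term still yields a PSD Gram matrix.

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    # squared-exponential kernel on 1-D inputs
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-d2 / (2 * ls**2))

def periodic(X1, X2, p=1.0, ls=1.0):
    # standard periodic kernel with period p
    d = np.abs(X1[:, None] - X2[None, :])
    return np.exp(-2 * np.sin(np.pi * d / p) ** 2 / ls**2)

x = np.linspace(0.0, 4.0, 60)

# trend (long length scale) + periodic pattern + small-scale "noise-like" term
K = rbf(x, x, ls=5.0) + periodic(x, x) + 0.1 * rbf(x, x, ls=0.05)

# Sums of valid kernels are valid: eigenvalues stay non-negative (up to round-off)
min_eig = np.linalg.eigvalsh(K).min()
```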

Computational Complexity Analysis

Training (computing \(K^{-1}\)):

  • Naive inversion: \(O(n^3)\)

  • Cholesky decomposition: \(O(n^3)\) (numerically stable)

Prediction (for \(m\) test points):

  • Mean: \(O(nm + n^2)\)

  • Variance: \(O(m^2n)\) (full covariance) or \(O(mn)\) (diagonal only)

Storage:

  • Kernel matrix: \(O(n^2)\)

Bottleneck: For \(n > 10,000\), standard GP becomes impractical!

Numerical Stability

Problem: Direct matrix inversion is unstable.

Solution: Cholesky decomposition

\[K + \sigma_n^2 I = L L^T\]

where \(L\) is lower triangular.

Efficient computation:

# Instead of explicitly forming (K + sigma_n^2 I)^{-1} and multiplying by y:
L = cholesky(K + sigma_n**2 * I)
alpha = solve(L.T, solve(L, y))  # alpha = (K + sigma_n^2 I)^{-1} y

# For the predictive variance:
v = solve(L, K_s)                # K_s  = k(X, X_*)
var = K_ss - v.T @ v             # K_ss = k(X_*, X_*)

Time: Still \(O(n^3)\), but numerically stable!

Jitter: Add a small diagonal term for numerical stability: \(K \leftarrow K + 10^{-6} I\)
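The same pattern with SciPy's Cholesky helpers, as one possible concrete rendering of the pseudocode above (the data here is synthetic and the variable names are ours):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * 0.3**2))          # RBF kernel matrix
y = np.sin(3 * X).ravel()
sigma_n2 = 1e-2

c, low = cho_factor(K + sigma_n2 * np.eye(len(X)))   # K + sigma_n^2 I = L L^T
alpha = cho_solve((c, low), y)                        # alpha = (K + sigma_n^2 I)^{-1} y

# Same solution as explicit inversion, without ever forming the inverse
alpha_naive = np.linalg.inv(K + sigma_n2 * np.eye(len(X))) @ y
```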

Log Marginal Likelihood (Model Selection)

To optimize hyperparameters \(\theta = \{\ell, \sigma, \sigma_n\}\):

\[\log p(\mathbf{y} | X, \theta) = -\frac{1}{2}\mathbf{y}^T K_y^{-1} \mathbf{y} - \frac{1}{2}\log|K_y| - \frac{n}{2}\log(2\pi)\]

where \(K_y = K + \sigma_n^2 I\).

Three terms:

  1. Data fit: \(-\frac{1}{2}\mathbf{y}^T K_y^{-1} \mathbf{y}\) (how well model explains data)

  2. Complexity penalty: \(-\frac{1}{2}\log|K_y|\) (Occam’s razor)

  3. Normalization: \(-\frac{n}{2}\log(2\pi)\)

Gradient (for optimization):

\[\frac{\partial}{\partial \theta} \log p(\mathbf{y}|X,\theta) = \frac{1}{2}\text{tr}\left((K_y^{-1}\mathbf{y}\mathbf{y}^T K_y^{-1} - K_y^{-1})\frac{\partial K_y}{\partial \theta}\right)\]

Can optimize with gradient descent!
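A compact implementation of this objective for the RBF kernel, using a Cholesky factor for both the solve and the log-determinant (the function name and defaults are illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood(X, y, ls=1.0, var=1.0, noise=1e-2):
    """log p(y | X, theta) for an RBF kernel, via a Cholesky factor."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Ky = var * np.exp(-d2 / (2 * ls**2)) + noise * np.eye(len(X))
    c, low = cho_factor(Ky)
    alpha = cho_solve((c, low), y)                 # K_y^{-1} y
    logdet = 2.0 * np.log(np.diag(c)).sum()        # log|K_y| from the Cholesky diagonal
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * len(X) * np.log(2 * np.pi)

# One point, unit kernel, zero noise: log p = -1/2 log(2*pi)
lml_single = log_marginal_likelihood(np.zeros((1, 1)), np.zeros(1), noise=0.0)
```

The three terms of the expression map line-by-line onto the data-fit, complexity, and normalization terms above.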

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """
    RBF/Squared Exponential kernel.
    k(x,x') = variance * exp(-||x-x'||^2 / (2*length_scale^2))
    """
    dists = cdist(X1, X2, 'sqeuclidean')
    K = variance * np.exp(-dists / (2 * length_scale**2))
    return K

def matern_kernel(X1, X2, length_scale=1.0, nu=1.5):
    """
    Matérn kernel, fixed at nu=1.5 (once-differentiable sample paths):
    k(x,x') = (1 + sqrt(3)*r/l) * exp(-sqrt(3)*r/l),  r = ||x - x'||
    """
    r = np.sqrt(3) * cdist(X1, X2) / length_scale
    K = (1 + r) * np.exp(-r)
    return K

print("Kernels defined")

GP Regression

Gaussian Process regression places a prior distribution directly over functions: \(f \sim \mathcal{GP}(m(x), k(x, x'))\), where \(m\) is the mean function and \(k\) is the kernel (covariance function) that encodes assumptions about function smoothness and length scale. Given observed data, the posterior is also a Gaussian Process, yielding an exact predictive mean (the best estimate) and predictive variance (the uncertainty) at every test point. The closed-form posterior is computed via matrix operations involving the kernel matrix \(K\), training targets \(y\), and noise variance \(\sigma_n^2\): \(\mu_* = K_*^T (K + \sigma_n^2 I)^{-1} y\), \(\Sigma_* = K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*\). This non-parametric approach provides well-calibrated uncertainty estimates out of the box.

class GaussianProcess:
    """GP regression with RBF kernel."""
    
    def __init__(self, length_scale=1.0, variance=1.0, noise=1e-2):
        self.length_scale = length_scale
        self.variance = variance
        self.noise = noise
        
        self.X_train = None
        self.y_train = None
        self.K_inv = None
    
    def fit(self, X, y):
        """Fit GP to training data."""
        self.X_train = X
        self.y_train = y
        
        # Compute kernel matrix
        K = rbf_kernel(X, X, self.length_scale, self.variance)
        K += self.noise * np.eye(len(X))
        
        # Invert (with Cholesky for stability)
        L = np.linalg.cholesky(K)
        self.K_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(len(X))))
    
    def predict(self, X_test, return_std=True):
        """Predict with uncertainty."""
        # K_*
        K_s = rbf_kernel(self.X_train, X_test, self.length_scale, self.variance)
        
        # Mean
        mu = K_s.T @ self.K_inv @ self.y_train
        
        if return_std:
            # K_**
            K_ss = rbf_kernel(X_test, X_test, self.length_scale, self.variance)
            
            # Variance
            var = np.diag(K_ss - K_s.T @ self.K_inv @ K_s)
            std = np.sqrt(np.maximum(var, 0))
            
            return mu, std
        
        return mu

print("GaussianProcess defined")

Generate Test Data

We create a synthetic regression dataset with a known ground-truth function (e.g., a sinusoidal with noise) and a set of sparse observation points. The gap between observations tests the GP’s interpolation capability, while the region beyond the observations tests extrapolation. GPs excel at interpolation (the uncertainty shrinks near observed points) but honestly report high uncertainty in extrapolation regions – a crucial property for safety-critical applications where knowing what you do not know is as important as making predictions.

# True function
def f(x):
    return np.sin(3*x) + 0.3*np.cos(9*x)

# Training data
np.random.seed(42)
X_train = np.random.uniform(-1, 1, 20).reshape(-1, 1)
y_train = f(X_train).ravel() + 0.1 * np.random.randn(20)

# Test data
X_test = np.linspace(-1.5, 1.5, 200).reshape(-1, 1)
y_test = f(X_test).ravel()

print(f"Train: {X_train.shape}, Test: {X_test.shape}")

Fit GP and Predict

Fitting a GP involves computing the kernel matrix between all training points, adding observation noise to the diagonal, and solving the resulting linear system. Prediction at new test points requires computing the cross-kernel between test and training points, then applying the posterior formulas. The computational cost scales as \(O(N^3)\) for \(N\) training points (due to matrix inversion), which limits standard GPs to a few thousand observations. For larger datasets, sparse approximations (inducing points) or scalable GP methods reduce this to \(O(NM^2)\) where \(M \ll N\) is the number of inducing points.

# Fit GP
gp = GaussianProcess(length_scale=0.3, variance=1.0, noise=0.01)
gp.fit(X_train, y_train)

# Predict
mu, std = gp.predict(X_test, return_std=True)

print(f"Predictions: mean {mu.shape}, std {std.shape}")

Visualize GP Regression

The GP regression plot shows the predictive mean as a line, the 95% confidence interval as a shaded region, and the training points as markers. Near observed data, the confidence interval narrows (the GP is certain); far from observations, it widens (the GP is uncertain). This uncertainty visualization is the defining advantage of GPs over standard neural networks, which typically produce point predictions without calibrated confidence. In applications like Bayesian optimization, active learning, and autonomous systems, these uncertainty estimates guide where to collect new data or when to defer to a human.

plt.figure(figsize=(12, 6))

# True function
plt.plot(X_test, y_test, 'k--', label='True function', linewidth=2)

# GP mean
plt.plot(X_test, mu, 'b-', label='GP mean', linewidth=2)

# Uncertainty (2 std)
plt.fill_between(X_test.ravel(), mu - 2*std, mu + 2*std, alpha=0.3, label='±2σ')

# Training data
plt.scatter(X_train, y_train, c='r', s=50, zorder=10, label='Training data')

plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Gaussian Process Regression', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Prior Samples

Sampling functions from the GP prior (before observing any data) illustrates what the kernel encodes about function properties. An RBF kernel produces smooth functions; a Matern-1/2 kernel produces rough, non-differentiable functions; a periodic kernel produces periodic functions. The kernel’s length scale controls how quickly the function varies: short length scales produce rapidly oscillating samples, while long length scales produce slowly varying ones. Visualizing prior samples is an important step in GP modeling because it verifies that the chosen kernel matches prior beliefs about the function being modeled.

# Sample from GP prior
X_prior = np.linspace(-2, 2, 100).reshape(-1, 1)
K_prior = rbf_kernel(X_prior, X_prior, length_scale=0.5, variance=1.0)
K_prior += 1e-6 * np.eye(len(X_prior))

# Draw samples
samples = np.random.multivariate_normal(np.zeros(len(X_prior)), K_prior, size=5)

# Plot
plt.figure(figsize=(12, 6))
for i in range(5):
    plt.plot(X_prior, samples[i], alpha=0.7, linewidth=2)

plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('GP Prior Samples', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Kernel Comparison

Comparing different kernels on the same regression task reveals how kernel choice affects both the fit quality and the uncertainty estimates. The RBF (squared exponential) kernel assumes infinite smoothness and works well for smooth functions. The Matern family provides a controllable smoothness parameter \(\nu\): \(\nu = 1/2\) gives the Ornstein-Uhlenbeck process (rough), \(\nu = 3/2\) is once-differentiable, and \(\nu \to \infty\) recovers the RBF. Kernel hyperparameters (length scale, variance) are optimized by maximizing the marginal likelihood \(p(y | X, \theta)\), which automatically balances fit quality with model complexity – an elegant form of Bayesian model selection.

7.5. Sparse Gaussian Processes: Scaling to Large Data

The Scalability Problem

Standard GP: \(O(n^3)\) time, \(O(n^2)\) memory

Not feasible for \(n > 10,000\)!

Solution 1: Subset of Data (Subset of Regressors)

Use only \(m \ll n\) "inducing points" \(Z = \{z_1, \ldots, z_m\}\).

Approximation:

\[p(f_*|X, \mathbf{y}) \approx p(f_*|Z, \mathbf{u})\]

where \(\mathbf{u} = f(Z)\) are function values at inducing points.

Complexity: \(O(nm^2)\) instead of \(O(n^3)\)

Typically \(m \approx 100-1000\).

Solution 2: Variational Sparse GP (SVGP)

Key idea: Variational inference with inducing points.

True posterior: \(p(\mathbf{u}|X, \mathbf{y}) = \frac{p(\mathbf{y}|X, \mathbf{u})\,p(\mathbf{u})}{p(\mathbf{y}|X)}\)

Approximate with: \(q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, \mathbf{S})\)

ELBO (variational bound):

\[\mathcal{L} = \sum_{i=1}^n \mathbb{E}_{q(f_i)}[\log p(y_i|f_i)] - \text{KL}(q(\mathbf{u}) \| p(\mathbf{u}))\]

Optimize \(\{Z, \mathbf{m}, \mathbf{S}\}\) to maximize \(\mathcal{L}\).

Advantages:

  • \(O(nm^2)\) complexity

  • Works with mini-batches!

  • Can handle millions of points

Stochastic optimization:

\[\mathcal{L}_{\text{batch}} = \frac{n}{|B|}\sum_{i \in B} \mathbb{E}_{q(f_i)}[\log p(y_i|f_i)] - \text{KL}(q(\mathbf{u}) \| p(\mathbf{u}))\]

Solution 3: Structured Kernels

For grid data (images, time series):

Use Kronecker structure or Toeplitz structure.

Example (2D grid):

\[K = K_x \otimes K_y\]

where \(\otimes\) is Kronecker product.

Matrix-vector multiplication: \(O(n)\) instead of \(O(n^2)\) using FFT!

Applications:

  • Spatiotemporal data

  • Image processing

  • Efficient GP on grids
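The Kronecker matrix-vector trick can be sketched in a few lines of NumPy: for a 2D grid, multiplying by \(K_x \otimes K_y\) reduces to two small matrix products (grid sizes and length scale below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 20, 30

def rbf(g, ls=0.5):
    d2 = (g[:, None] - g[None, :]) ** 2
    return np.exp(-d2 / (2 * ls**2))

Kx, Ky = rbf(np.linspace(0, 1, nx)), rbf(np.linspace(0, 1, ny))
v = rng.standard_normal(nx * ny)

# Naive: materialize the full (nx*ny) x (nx*ny) Kronecker product
naive = np.kron(Kx, Ky) @ v

# Structured: with row-major flattening, (Kx kron Ky) v = (Kx @ V @ Ky.T).ravel()
V = v.reshape(nx, ny)
fast = (Kx @ V @ Ky.T).ravel()
```

The structured version never forms the full matrix, dropping a matrix-vector product from \(O(n^2)\) to \(O(n(n_x + n_y))\) work and \(O(n_x^2 + n_y^2)\) storage.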

Comparison of Sparse Methods

Method     | Complexity      | Accuracy    | Notes
Full GP    | \(O(n^3)\)      | Exact       | \(n < 10K\)
Subset     | \(O(nm^2)\)     | Approximate | Simple, fast
FITC       | \(O(nm^2)\)     | Better      | Fixed inducing points
SVGP       | \(O(nm^2)\)     | Best        | Optimized inducing points
Structured | \(O(n \log n)\) | Exact       | Grid data only

When to use:

  • \(n < 10K\): Full GP

  • \(10K < n < 1M\): SVGP with \(m \approx 500\)

  • Grid data: Structured kernels

  • \(n > 1M\): Deep GPs or Neural Processes

Inducing Point Selection

Strategies:

  1. K-means clustering:

    • Cluster data, use centroids as \(Z\)

    • Simple, fast

    • Doesn’t account for GP structure

  2. Greedy selection:

    • Iteratively add points that maximize marginal likelihood

    • Better coverage

    • \(O(nm^3)\) cost

  3. Variational optimization:

    • Treat \(Z\) as parameters, optimize with gradient descent

    • Best performance

    • Used in SVGP

  4. Pseudo-inputs:

    • Not constrained to training data

    • Can be anywhere in input space

    • Most flexible

Practical: Start with K-means, refine with gradients.
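A minimal sketch of strategy 1, plain Lloyd's k-means in NumPy to pick initial inducing locations (the helper name is made up for illustration; in practice sklearn's KMeans or gradient refinement per strategy 3 would follow):

```python
import numpy as np

def kmeans_inducing(X, m, iters=20, seed=0):
    """Pick m inducing points as k-means centroids of the training inputs."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), size=m, replace=False)].copy()
    for _ in range(iters):
        # assign each input to its nearest current centroid
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for j in range(m):
            members = X[assign == j]
            if len(members):
                Z[j] = members.mean(axis=0)   # move centroid to cluster mean
    return Z

X = np.random.default_rng(1).normal(size=(500, 2))
Z = kmeans_inducing(X, m=10)
```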

Code Example: SVGP

import gpytorch

class SVGPModel(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution,
            learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )
    
    def forward(self, x):
        mean = self.mean_module(x)
        covar = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, covar)

# Convert training data to tensors; subsample inputs as initial inducing
# locations (use m ~ 500 for large datasets)
X_t = torch.as_tensor(X_train, dtype=torch.float32)
y_t = torch.as_tensor(y_train, dtype=torch.float32)
model = SVGPModel(X_t[::10].clone())
likelihood = gpytorch.likelihoods.GaussianLikelihood()

# Train with stochastic optimization (full batch here; use a DataLoader
# with mini-batches when n is large)
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(likelihood.parameters()), lr=0.01
)
elbo = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=len(X_t))

model.train(); likelihood.train()
for epoch in range(100):
    optimizer.zero_grad()
    output = model(X_t)
    loss = -elbo(output, y_t)
    loss.backward()
    optimizer.step()

# Kernel comparison: different length scales
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

length_scales = [0.1, 0.3, 1.0]

for idx, ls in enumerate(length_scales):
    gp_test = GaussianProcess(length_scale=ls, variance=1.0, noise=0.01)
    gp_test.fit(X_train, y_train)
    mu_test, std_test = gp_test.predict(X_test)
    
    axes[idx].plot(X_test, y_test, 'k--', label='True', linewidth=2)
    axes[idx].plot(X_test, mu_test, 'b-', label='GP mean', linewidth=2)
    axes[idx].fill_between(X_test.ravel(), mu_test - 2*std_test, mu_test + 2*std_test, alpha=0.3)
    axes[idx].scatter(X_train, y_train, c='r', s=40, zorder=10)
    
    axes[idx].set_title(f'Length scale = {ls}', fontsize=12)
    axes[idx].set_xlabel('x', fontsize=11)
    axes[idx].set_ylabel('y', fontsize=11)
    axes[idx].legend(fontsize=10)
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Summary

Gaussian Processes:

Key Concepts:

  1. Distribution over functions

  2. Specified by mean and kernel

  3. Exact Bayesian inference

  4. Uncertainty quantification

Kernels:

  • RBF: Smooth, infinitely differentiable

  • Matérn: Control smoothness via ν

  • Periodic: For periodic data

  • Linear: For linear trends

Hyperparameters:

  • Length scale: Controls smoothness

  • Variance: Controls amplitude

  • Noise: Observation noise level

Advantages:

  • Principled uncertainty

  • Kernel flexibility

  • Small data regime

  • Interpretable

Challenges:

  • O(n³) complexity

  • Kernel choice matters

  • Hyperparameter tuning

Applications:

  • Bayesian optimization

  • Time series forecasting

  • Calibration

  • Active learning

Extensions:

  • Sparse GPs (inducing points)

  • Deep GPs (composition)

  • Variational GPs (scalability)

  • Neural network GPs

Advanced GP Topics and Extensions

1. Deep Gaussian Processes

Idea: Compose multiple GP layers

\[f_1(x) \sim \mathcal{GP}(0, k_1)\]
\[f_2(f_1) \sim \mathcal{GP}(0, k_2)\]
\[\vdots\]
\[y | f_L \sim \mathcal{N}(f_L(x), \sigma^2)\]

Advantages:

  • More expressive than single GP

  • Hierarchical representations

  • Better for complex patterns

Challenges:

  • No closed-form posterior

  • Need approximate inference (variational)

  • Harder to train

Applications:

  • Image modeling

  • Hierarchical time series

  • Transfer learning

2. Neural Network Gaussian Processes

Infinite-width neural networks = Gaussian processes!

Neal’s Theorem (1996): As width \(\to \infty\), Bayesian NN converges to GP.

Kernel for single hidden layer:

\[k(x, x') = \mathbb{E}_{\theta}[\sigma(w^T x) \sigma(w^T x')]\]

where \(w \sim \mathcal{N}(0, I)\).

For ReLU: \(k(x, x') = \frac{\|x\|\|x'\|}{2\pi}(\sin\theta + (\pi - \theta)\cos\theta)\)

where \(\theta = \cos^{-1}\left(\frac{x^T x'}{\|x\|\|x'\|}\right)\)
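The ReLU kernel above is straightforward to evaluate in NumPy (a sketch; the helper name is ours, not a library function):

```python
import numpy as np

def relu_nngp_kernel(X1, X2):
    """Arc-cosine kernel: k(x,x') = ||x|| ||x'|| / (2*pi) * (sin(t) + (pi - t) cos(t))."""
    n1 = np.linalg.norm(X1, axis=1)
    n2 = np.linalg.norm(X2, axis=1)
    # clip guards against round-off pushing cosines slightly outside [-1, 1]
    cos_t = np.clip((X1 @ X2.T) / np.outer(n1, n2), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return np.outer(n1, n2) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

X = np.random.default_rng(0).normal(size=(4, 3))
K = relu_nngp_kernel(X, X)
# On the diagonal theta = 0, so k(x, x) = ||x||^2 / 2
```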

Neural Tangent Kernel (NTK):

Evolution of infinite-width NN during training = GP with NTK!

\[k_{\text{NTK}}(x, x') = \langle \nabla_\theta f(x; \theta_0), \nabla_\theta f(x'; \theta_0) \rangle\]

Practical: Can compute NN kernels, use with GP inference!

3. Multi-Output Gaussian Processes

Problem: Predict multiple correlated outputs.

Example: Robot arm - predict (x, y, z) positions simultaneously.

Solution: Multi-output GP with cross-output correlations.

Linear Model of Coregionalization (LMC):

\[f_i(x) = \sum_{q=1}^Q w_{iq} u_q(x)\]

where \(u_q \sim \mathcal{GP}(0, k_q)\).

Covariance between outputs \(i, j\) at locations \(x, x'\):

\[\text{Cov}[f_i(x), f_j(x')] = \sum_{q=1}^Q w_{iq} w_{jq} k_q(x, x')\]

Computational cost: \(O((Pn)^3)\) for \(P\) outputs!

Sparse approximation: Use inducing points per output.

4. Gaussian Process Latent Variable Models (GPLVM)

Problem: High-dimensional data, unknown latent structure.

Generative model:

\[\mathbf{x}_i \sim \mathcal{N}(0, I) \quad \text{(latent, low-dim)}\]
\[\mathbf{y}_i | \mathbf{x}_i \sim \mathcal{GP}(0, k(\mathbf{x}, \mathbf{x}'))\]

Learn: Latent coordinates \(X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\) and kernel hyperparameters.

Optimization:

\[\max_{X, \theta} \log p(\mathbf{Y} | X, \theta) = -\frac{D}{2}\log|K_X| - \frac{1}{2}\text{tr}(K_X^{-1}\mathbf{Y}\mathbf{Y}^T) + \text{const}\]

Applications:

  • Dimensionality reduction (like PCA but nonlinear)

  • Visualization

  • Motion capture analysis

Bayesian GPLVM: Treat \(X\) as random, use variational inference.

5. Warped Gaussian Processes

Problem: GP assumes Gaussian likelihood. What if data is not Gaussian?

Solution: Apply monotonic transformation \(g\):

\[y = g(f(x)) + \epsilon\]

where \(f \sim \mathcal{GP}\).

Example transformations:

  • Log: For positive data

  • Logit: For bounded [0,1] data

  • Box-Cox: \(g(y) = \frac{y^\lambda - 1}{\lambda}\)

Learn: Jointly optimize \(\lambda\) and GP hyperparameters.

6. Student-t Processes

Heavy-tailed alternative to GP:

\[f \sim \mathcal{TP}_\nu(m, k)\]

Student-t process with \(\nu\) degrees of freedom.

Marginals:

\[f(x_1), \ldots, f(x_n) \sim \text{Student-t}_\nu(0, K)\]

Robust to outliers!

Relationship to GP:

  • \(\nu = 1\): Cauchy process (very heavy tails)

  • \(\nu \to \infty\): Gaussian process

  • Typical: \(\nu = 3\) or \(\nu = 5\)

7. Heteroscedastic GPs

Standard GP: Constant noise \(\sigma_n^2\).

Heteroscedastic: Noise varies with input!

\[y = f(x) + \epsilon(x), \quad \epsilon(x) \sim \mathcal{N}(0, \sigma^2(x))\]

Model noise as another GP:

\[\log \sigma^2(x) \sim \mathcal{GP}(m_\sigma, k_\sigma)\]

Joint inference: Optimize both \(f\) and \(\sigma^2\).

Applications:

  • Financial data (volatility)

  • Sensor noise varies by location

  • More accurate uncertainty

8. Bayesian Optimization with GPs

Problem: Optimize expensive black-box function \(f(x)\).

Example: Tune hyperparameters (each evaluation = full training run).

Algorithm:

  1. GP prior over \(f\)

  2. Observe \(f(x_1), \ldots, f(x_n)\)

  3. Update GP posterior

  4. Acquisition function \(a(x)\) suggests next point:

    • UCB: \(a(x) = \mu(x) + \beta \sigma(x)\) (explore-exploit)

    • EI: Expected Improvement over current best

    • PI: Probability of Improvement

  5. Evaluate \(x_{n+1} = \arg\max a(x)\)

  6. Repeat

Expected Improvement:

\[\text{EI}(x) = \mathbb{E}[\max(f(x) - f^+, 0)]\]

where \(f^+ = \max_{i} f(x_i)\) is current best.

Closed form (if GP posterior is Gaussian):

\[\text{EI}(x) = (\mu(x) - f^+)\Phi(Z) + \sigma(x)\phi(Z)\]

where \(Z = \frac{\mu(x) - f^+}{\sigma(x)}\), \(\Phi\) is CDF, \(\phi\) is PDF.
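The closed form translates directly into a few lines of SciPy (maximization convention, as above; the function name is illustrative):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI(x) = (mu - f_best) * Phi(Z) + sigma * phi(Z), with Z = (mu - f_best) / sigma."""
    sigma = np.maximum(sigma, 1e-12)          # guard against zero predictive std
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

# High mean OR high uncertainty both raise EI (exploitation vs exploration)
mu = np.array([0.0, 0.0, 2.0])
sigma = np.array([0.1, 1.0, 0.1])
ei = expected_improvement(mu, sigma, f_best=1.0)
```

With `f_best = 1`, the confident-but-poor candidate scores near zero, the uncertain one scores higher (exploration), and the confident improvement scores highest.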

Why it works:

  • High \(\mu(x)\): Likely to improve (exploitation)

  • High \(\sigma(x)\): Uncertain, might be great (exploration)

Applications:

  • Hyperparameter tuning (Spearmint, GPyOpt)

  • Neural architecture search

  • Robotics (policy search)

  • Drug discovery

Comparison: GPs vs Neural Networks

Aspect           | Gaussian Processes                            | Neural Networks
Uncertainty      | ✅ Principled, calibrated                     | ⚠️ Requires special methods
Small data       | ✅ Excellent (\(n < 1000\))                   | ❌ Needs lots of data
Scalability      | ❌ \(O(n^3)\); sparse methods for large \(n\) | ✅ Scales well
Interpretability | ✅ Kernel choice interpretable                | ❌ Black box
Flexibility      | ⚠️ Limited by kernel                          | ✅ Universal approximator
Feature learning | ❌ Manual kernel design                       | ✅ Automatic
Optimization     | ✅ Convex (often)                             | ❌ Non-convex

When to use GPs:

  • Small/medium datasets (\(n < 10K\))

  • Need uncertainty quantification

  • Prior knowledge (kernel structure)

  • Safety-critical applications

  • Bayesian optimization

When to use NNs:

  • Large datasets (\(n > 100K\))

  • Complex patterns (images, text)

  • Raw features (no engineering)

  • Don’t need calibrated uncertainty

  • Computational resources available

Hybrid: Combine both! Deep kernel learning, Neural Processes.

Advanced Gaussian Processes Theory

1. Introduction to Gaussian Processes

1.1 From Bayesian Linear Regression to GPs

Bayesian Linear Regression:

f(x) = wᵀφ(x),  w ~ N(0, Σ_p)
  • Prior over weights w

  • Induces a prior over functions f

Gaussian Process: Generalization to infinitely many basis functions

f ~ GP(m(x), k(x, x'))
  • Mean function: m(x) = E[f(x)] (typically 0)

  • Covariance (kernel): k(x, x') = Cov[f(x), f(x')]

Key property: Any finite collection of function values is jointly Gaussian

[f(x₁), ..., f(xₙ)]ᵀ ~ N(m, K)
where Kᵢⱼ = k(xᵢ, xⱼ)

1.2 Why Gaussian Processes?

Advantages:

  1. Uncertainty quantification: Predictive variance σ²(x*)

  2. Non-parametric: Complexity grows with data (no fixed architecture)

  3. Interpretable: Kernel encodes prior assumptions

  4. Theoretical guarantees: Convergence, calibration

  5. Sample efficiency: Strong performance with small data

Disadvantages:

  1. Cubic complexity: O(N³) for N data points

  2. Kernel choice: Requires domain knowledge

  3. Scalability: Difficult for N > 10,000

  4. Categorical inputs: Less natural than for continuous

2. Kernel Functions

2.1 Properties of Valid Kernels

Positive semi-definite: K ⪰ 0 for any finite set x₁, …, xₙ

Mercer’s theorem: k(x, x') corresponds to an inner product in some Hilbert space

k(x, x') = ⟨φ(x), φ(x')⟩_H

Constructing new kernels:

  • Sum: k₁(x, x') + k₂(x, x')

  • Product: k₁(x, x') · k₂(x, x')

  • Scaling: c · k(x, x')

2.2 Common Kernel Functions

Squared Exponential (RBF):

k_SE(x, x') = σ² exp(-||x - x'||²/(2ℓ²))
  • σ²: Signal variance (vertical scale)

  • ℓ: Length scale (horizontal scale)

  • Properties: Infinitely differentiable, smooth functions

  • Use: Default choice, general-purpose

Matérn family:

k_Matérn(r) = (2^(1-ν)/Γ(ν)) (√(2ν)r/ℓ)^ν K_ν(√(2ν)r/ℓ)
  • ν: Smoothness parameter

    • ν = 1/2: Exponential kernel (continuous, not differentiable)

    • ν = 3/2: Once differentiable

    • ν = 5/2: Twice differentiable

    • ν → ∞: RBF kernel

  • Use: When roughness is important (e.g., sensor data)

Periodic kernel:

k_per(x, x') = σ² exp(-2sin²(π|x-x'|/p)/ℓ²)
  • p: Period

  • Use: Seasonal data, oscillations

Linear kernel:

k_lin(x, x') = σ_b² + σ_v² (x - c)(x' - c)
  • Use: Linear trends, baseline comparison

Rational Quadratic:

k_RQ(x, x') = σ² (1 + ||x-x'||²/(2αℓ²))^(-α)
  • α: Scale mixture parameter

  • Interpretation: Mixture of RBF kernels with different length scales

  • Use: Multi-scale phenomena

2.3 Kernel Design Patterns

Additive kernels (decomposition):

k(x, x') = k₁(x₁, x'₁) + k₂(x₂, x'₂)
  • Model independent contributions from different dimensions

Multiplicative kernels (interactions):

k(x, x') = k₁(x, x') · k₂(x, x')
  • Capture interactions between features

Spectral Mixture Kernel (Wilson & Adams, 2013):

k_SM(τ) = Σᵢ wᵢ exp(-2π²τ²μᵢ²) cos(2πτνᵢ)
  • Learn a mixture of spectral components

  • Highly expressive

3. GP Regression (Kriging)

3.1 Posterior Distribution

Setup:

  • Data: D = {(xᵢ, yᵢ)}ᵢ₌₁ᴺ

  • Model: y = f(x) + ε, ε ~ N(0, σ²_n)

  • Prior: f ~ GP(0, k)

Posterior predictive distribution at x*:

p(f*|x*, X, y) = N(μ*, σ²*)

μ* = k*ᵀ (K + σ²_n I)⁻¹ y
σ²* = k** - k*ᵀ (K + σ²_n I)⁻¹ k*

where:

  • K: N×N kernel matrix, Kᵢⱼ = k(xᵢ, xⱼ)

  • k*: N×1 vector, k*ᵢ = k(xᵢ, x*)

  • k**: scalar, k** = k(x*, x*)

Interpretation:

  • μ*: Weighted sum of observations (similar inputs → high weight)

  • σ²*: Reduces near data, high in unexplored regions

3.2 Marginal Likelihood (Evidence)

Log marginal likelihood:

log p(y|X, θ) = -½ yᵀ(K + σ²_n I)⁻¹y - ½ log|K + σ²_n I| - (N/2)log(2π)
                 \_____data fit______/    \___complexity___/   \__const_/

Hyperparameter optimization: θ* = argmax log p(y|X, θ)

  • Typically: θ = {ℓ, σ², σ²_n} (length scale, signal variance, noise)

  • Gradient-based optimization (L-BFGS, Adam)

Gradient (with respect to a kernel hyperparameter θ):

∂log p(y|X, θ)/∂θ = ½ tr[(ααᵀ - K_y⁻¹) ∂K_y/∂θ]
where K_y = K + σ²_n I and α = K_y⁻¹y

3.3 Computational Complexity

Naive implementation:

  • Inversion: O(N³)

  • Storage: O(N²)

Bottleneck: Solving (K + σ²_n I)⁻¹y

Cholesky decomposition (standard approach):

  1. Compute L: K + σ²_n I = LLᵀ (O(N³))

  2. Solve Lz = y via forward substitution (O(N²))

  3. Solve Lᵀα = z via backward substitution (O(N²)), giving α = (K + σ²_n I)⁻¹y

Total: O(N³) training, O(N²) prediction per test point

4. Scalable Gaussian Processes

4.1 Inducing Points (Sparse GPs)

Idea: Approximate the GP with M << N inducing points Z

Variational Free Energy (Titsias, 2009):

F = E_q[log p(y|f)] - KL(q(u) || p(u))

Predictive distribution:

q(f*|x*) = ∫ p(f*|u) q(u) du

where u = f(Z) are the inducing function values

Complexity: O(NM²) training, O(M²) prediction

SVGP (Stochastic Variational GP):

  • Minibatch training

  • Scales to millions of data points

  • Complexity: O(BM²) per iteration (B = batch size)

4.2 Structured Kernel Interpolation (SKI)

Idea: Approximate the kernel with grid interpolation

Steps:

  1. Place M inducing points on a regular grid

  2. Interpolate: k(x, x') ≈ wᵀK_uu w'

  3. Use FFT for fast matrix-vector products

Complexity: O(N + M log M) with FFT

KISS-GP (Kernel Interpolation for Scalable Structured GPs):

  • Combines SKI with minibatching

  • Scales to millions in 1D-3D

4.3 Random Features (Fourier Features)

Bochner’s theorem: Stationary kernel ↔ Fourier transform of a finite measure

For the RBF kernel:

k(x, x') = E_ω[cos(ωᵀx) cos(ωᵀx') + sin(ωᵀx) sin(ωᵀx')]
         ≈ (1/M) Σₘ φ(x; ωₘ)ᵀ φ(x'; ωₘ)

Random Fourier Features (Rahimi & Recht, 2007):

φ(x) = √(1/M) [cos(ω₁ᵀx), ..., cos(ωₘᵀx), sin(ω₁ᵀx), ..., sin(ωₘᵀx)]ᵀ
ωₘ ~ p(ω) (spectral density)

Approximation:

k(x, x') ≈ φ(x)ᵀφ(x')
f(x) ≈ φ(x)ᵀw,  w ~ N(0, I)

Complexity: O(NM) training (linear regression), O(M) prediction

Fastfood: Structured random features using the Hadamard transform (O(M log M))

4.4 Deep Kernel LearningΒΆ

Combine NN feature extractor with GP:

f(x) = GP(Ο†_NN(x))
k(x, x') = k_GP(Ο†_NN(x), Ο†_NN(x'))

Training: Jointly optimize NN parameters and GP hyperparameters

Advantages:

  • NN learns features (representation learning)

  • GP provides uncertainty (calibrated predictions)

Implementation (Wilson et al., 2016):

  • Use spectral mixture kernel on learned features

  • Inducing points for scalability

5. GP ClassificationΒΆ

5.1 Binary ClassificationΒΆ

Model:

f ~ GP(0, k)
y ∈ {-1, +1}
p(y=+1|f) = Οƒ(f)  (logistic or probit)

Challenge: Posterior p(f|X, y) is non-Gaussian (non-conjugate likelihood)

Approaches:

  1. Laplace approximation: Gaussian at mode

  2. Expectation Propagation (EP): Iterative refinement

  3. Variational inference: Lower bound

  4. MCMC: Sampling (expensive)

5.2 Laplace ApproximationΒΆ

Steps:

  1. Find MAP: fΜ‚ = argmax p(f|X, y)

  2. Compute Hessian: H = -βˆ‡Β²log p(f|X, y)|_{fΜ‚}

  3. Approximate: p(f|X, y) β‰ˆ N(fΜ‚, H⁻¹)

Predictive distribution:

p(y*=+1|x*, X, y) β‰ˆ ∫ Οƒ(f*) N(f*|ΞΌ*, σ²*) df*
  • Requires 1D integral (Gaussian quadrature or probit approximation)

Limitation: Underestimates uncertainty
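
The 1D integral above has a standard closed-form ("probit") approximation, ∫ Οƒ(f) N(f|ΞΌ, σ²) df β‰ˆ Οƒ(ΞΌ/√(1 + πσ²/8)). A sketch comparing it with a brute-force quadrature reference (function names are illustrative):

```python
import numpy as np

def probit_approx(mu, var):
    """MacKay-style approximation to the logistic-Gaussian integral."""
    return 1.0 / (1.0 + np.exp(-mu / np.sqrt(1.0 + np.pi * var / 8.0)))

def logistic_gaussian_quadrature(mu, var, n=4001):
    """Dense trapezoidal reference for the same 1D integral."""
    f = np.linspace(mu - 10 * np.sqrt(var), mu + 10 * np.sqrt(var), n)
    g = (1.0 / (1.0 + np.exp(-f))
         * np.exp(-(f - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var))
    return np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(f))
```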

5.3 Expectation PropagationΒΆ

Idea: Approximate each likelihood term with Gaussian

p(yα΅’|fα΅’) β‰ˆ tα΅’(fα΅’) = sα΅’ N(fα΅’|ΞΌΜƒα΅’, ΟƒΜƒα΅’Β²)

Algorithm:

  1. Initialize site approximations tα΅’

  2. For each factor i:

    • Remove tα΅’ from posterior (cavity distribution)

    • Compute new tα΅’ via moment matching

    • Update posterior

  3. Repeat until convergence

Advantage: More accurate uncertainty than Laplace

Complexity: O(NΒ³) (same as regression)

6. Multi-Output Gaussian ProcessesΒΆ

6.1 MotivationΒΆ

Goal: Model P correlated outputs jointly

  • Example: Multi-task learning, vector-valued functions

Naive approach: Independent GPs for each output

  • Ignores correlations

Multi-output GP: Joint distribution over all outputs

6.2 Linear Model of Coregionalization (LMC)ΒΆ

Idea: Each output is linear combination of Q latent GPs

fβ‚š(x) = Ξ£_q aβ‚šq gq(x)
where gq ~ GP(0, kq)

Kernel:

k((x, p), (x', p')) = Ξ£_q aβ‚šq aβ‚š'q kq(x, x')
                    = [AAα΅€]β‚šβ‚š' k(x, x')   (when all kq = k: the intrinsic coregionalization model)
  • A: PΓ—Q mixing matrix

  • Encodes output correlations

Complexity: O((NP)Β³) without approximations
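
In the shared-kernel special case (all kq = k), the joint covariance over the NΒ·P stacked outputs factors as a Kronecker product B βŠ— K with B = AAα΅€. A sketch (the helper name is illustrative):

```python
import numpy as np

def icm_kernel(K_x, A):
    """Intrinsic coregionalization: full covariance over P outputs at N inputs.

    K_x: [N, N] input kernel matrix; A: [P, Q] mixing matrix.
    Returns [N*P, N*P] covariance, ordered output-major.
    """
    B = A @ A.T          # [P, P] output covariance
    return np.kron(B, K_x)
```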

6.3 Convolutional GPsΒΆ

Idea: Outputs are convolutions of latent GPs

fβ‚š(x) = ∫ wβ‚š(x - z) g(z) dz

Advantage: Induces structured correlations

Applications: Spatial data, time series

7. Advanced TopicsΒΆ

7.1 Additive GPsΒΆ

Decomposition:

f(x) = f₁(x₁) + fβ‚‚(xβ‚‚) + ... + f_D(x_D)

Kernel:

k(x, x') = Ξ£_d k_d(x_d, x'_d)

Advantages:

  • Interpretability (feature importance)

  • Scalability (lower effective dimension)

  • ANOVA decomposition

SALSA (Structured Additive Sparse GPs):

  • Sparse approximation for each term

  • Complexity: O(D Β· MΒ²)

7.2 Warped GPsΒΆ

Problem: GP assumes Gaussian noise, but data may be skewed

Solution: Learn monotonic warp function g

y = g(f(x)) + Ξ΅

Training: Maximize marginal likelihood over f and g jointly

Applications: Positive-only outputs, heavy-tailed noise

7.3 Heteroscedastic GPsΒΆ

Problem: Noise variance σ²_n may vary with input

Model:

log σ²(x) ~ GP(m_Οƒ, k_Οƒ)
f(x) ~ GP(m, k)
y|f, σ² ~ N(f, σ²)

Inference: Variational or MCMC (non-conjugate)

7.4 Robust GPsΒΆ

Student-t likelihood (heavy-tailed):

p(y|f) = t_Ξ½(y|f, σ²)
  • Ξ½: Degrees of freedom (lower β†’ heavier tails)

  • Robust to outliers

Inference: Laplace or EP

7.5 Bayesian Optimization with GPsΒΆ

Goal: Optimize expensive black-box function f(x)

Acquisition functions:

  1. Upper Confidence Bound (UCB):

    UCB(x) = ΞΌ(x) + Ξ² Οƒ(x)
    
    • Ξ²: Exploration-exploitation trade-off

  2. Expected Improvement (EI):

    EI(x) = E[max(f(x) - f⁺, 0)]
           = (ΞΌ - f⁺)Ξ¦(Z) + σφ(Z)
    
    • f⁺: Current best

    • Z = (ΞΌ - f⁺)/Οƒ

  3. Probability of Improvement (PI):

    PI(x) = Ξ¦((ΞΌ - f⁺)/Οƒ)
    

Algorithm:

  1. Fit GP to observed data

  2. Maximize acquisition function β†’ x_next

  3. Evaluate f(x_next), add to data

  4. Repeat

Applications: Hyperparameter tuning, drug discovery, experimental design

8. Infinite-Width Neural Networks as GPsΒΆ

8.1 Neal’s Theorem (1996)ΒΆ

Single hidden layer NN:

f(x) = b + (1/√H) Ξ£β‚• vβ‚• Οƒ(wβ‚•α΅€x)
where wβ‚• ~ N(0, Ξ£_w), vβ‚• ~ N(0, 1)

Limit as H β†’ ∞:

f(x) ~ GP(0, K_NNGP(x, x'))

Kernel (for ReLU):

K(x, x') = (||x|| ||x'||)/(2Ο€) Β· [sin ΞΈ + (Ο€ - ΞΈ) cos ΞΈ]
where ΞΈ = arccos(xα΅€x'/(||x|| ||x'||))
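
This arc-cosine expression is directly computable. A small sketch (clipping the cosine guards against round-off; at x = x' it reduces to ||x||Β²/2, the second moment of a ReLU unit under standard normal weights):

```python
import numpy as np

def relu_nngp_kernel(x, xp):
    """NNGP kernel of an infinite-width single-hidden-layer ReLU network."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    cos_theta = np.clip(x @ xp / (nx * nxp), -1.0, 1.0)  # numerical safety
    theta = np.arccos(cos_theta)
    return (nx * nxp) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
```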

8.2 Neural Tangent Kernel (NTK)ΒΆ

Observation (Jacot et al., 2018): In infinite-width limit, NN training dynamics are linear!

NTK:

Θ(x, x') = E[βŸ¨βˆ‡_ΞΈf(x; ΞΈβ‚€), βˆ‡_ΞΈf(x'; ΞΈβ‚€)⟩]

Training dynamics (gradient descent):

df_t/dt = -η Θ(X, X)(f_t - y)
  • Closed-form solution (kernel ridge regression)

Prediction:

f_∞(x) = Θ(x, X)[Θ(X, X) + λI]⁻¹y
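
This prediction formula is exactly kernel ridge regression with the NTK as the kernel. A sketch (any PSD kernel matrix can stand in for Θ):

```python
import numpy as np

def ntk_predict(Theta_star, Theta_train, y, lam=1e-6):
    """f_∞(x) = Θ(x, X) [Θ(X, X) + λI]⁻¹ y — kernel ridge regression."""
    n = len(y)
    return Theta_star @ np.linalg.solve(Theta_train + lam * np.eye(n), y)
```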

Implications:

  • Infinite-width NNs don’t learn features (kernel fixed)

  • Finite-width NNs differ: feature learning, non-linearity

8.3 Deep GPs vs. Deep NNsΒΆ

Correspondence:

  • Composing GP layers (finite depth L): deep GP with hierarchical feature learning

  • Width H β†’ ∞ in every layer: collapses to a single (shallow) GP β€” no feature learning

Deep GP kernel (compositional):

k_β„“(x, x') = F(k_{β„“-1}(x, x'), k_{β„“-1}(x, x), k_{β„“-1}(x', x'))
i.e., each layer's kernel is a deterministic function of the previous layer's covariances

9. Practical ConsiderationsΒΆ

9.1 Numerical StabilityΒΆ

Problem: K may be nearly singular (ill-conditioned)

Solutions:

  1. Add jitter: K + Ξ΅I (Ξ΅ β‰ˆ 1e-6)

  2. Cholesky with pivoting: Permute rows for stability

  3. Preconditioned conjugate gradients: Iterative solver

9.2 Kernel Hyperparameter InitializationΒΆ

Length scale β„“: Start with median pairwise distance

β„“β‚€ = median({||xα΅’ - xβ±Ό|| : i β‰  j})

Signal variance σ²: Match data variance

σ²₀ = Var(y)

Noise variance σ²_n: Small fraction of σ²

σ²_n,0 = 0.1 Β· Var(y)
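
These three heuristics fit in a few lines. A sketch (the helper name is illustrative; `scipy.spatial.distance.pdist` supplies the pairwise distances):

```python
import numpy as np
from scipy.spatial.distance import pdist

def init_hyperparams(X, y):
    """Median-distance / variance heuristics for GP hyperparameter initialization."""
    ell0 = np.median(pdist(X))   # length scale: median pairwise distance
    sigma2_0 = np.var(y)         # signal variance: data variance
    noise0 = 0.1 * sigma2_0      # noise variance: small fraction of signal
    return ell0, sigma2_0, noise0
```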

9.3 Cross-ValidationΒΆ

Leave-one-out (LOO) via marginal likelihood:

-log p(yα΅’|X, yβ‚‹α΅’) = Β½ log(2πσ²₋ᡒ) + (yα΅’ - ΞΌβ‚‹α΅’)Β²/(2σ²₋ᡒ)

Can compute efficiently without retraining (Ξ± = K⁻¹y):

ΞΌβ‚‹α΅’ = yα΅’ - Ξ±α΅’/[K⁻¹]α΅’α΅’
σ²₋ᡒ = 1/[K⁻¹]α΅’α΅’
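
These identities can be verified against explicit retraining. A sketch (K here already includes the noise term σ²_n I; the function name is illustrative):

```python
import numpy as np

def loo_predictions(K, y):
    """Closed-form leave-one-out predictions from a single inversion of K."""
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ y
    diag = np.diag(K_inv)
    mu_loo = y - alpha / diag   # ΞΌβ‚‹α΅’ = yα΅’ - Ξ±α΅’/[K⁻¹]α΅’α΅’
    var_loo = 1.0 / diag        # σ²₋ᡒ = 1/[K⁻¹]α΅’α΅’
    return mu_loo, var_loo
```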

10. Computational Complexity SummaryΒΆ

Method | Training | Prediction | Memory | Approximation
--- | --- | --- | --- | ---
Exact GP | O(NΒ³) | O(NΒ²) | O(NΒ²) | None
Sparse GP (SVGP) | O(NMΒ²) | O(MΒ²) | O(NM) | Variational
SKI/KISS-GP | O(N + M log M) | O(M log M) | O(M) | Interpolation
Random Features | O(NM) | O(M) | O(NM) | Monte Carlo
Laplace (classify) | O(NΒ³) | O(NΒ²) | O(NΒ²) | Mode approx
Deep Kernel | O(NMΒ²) | O(MΒ²) | O(NM) | + NN features

  • N: Data points

  • M: Inducing points / features (M << N)

Rule of thumb:

  • N < 5,000: Exact GP

  • 5,000 < N < 100,000: Sparse GP

  • N > 100,000: Random features or deep kernel

11. Software LibrariesΒΆ

11.1 Python LibrariesΒΆ

  • GPy: General-purpose GP library

  • GPyTorch: Scalable GPs in PyTorch (GPU support)

  • scikit-learn: Basic GP regression/classification

  • GPflow: TensorFlow-based, variational inference

  • PyMC3: Bayesian inference with GP priors

  • BoTorch: Bayesian optimization

11.2 R LibrariesΒΆ

  • kernlab: Kernel methods

  • mlegp: Maximum likelihood GP

  • tgp: Treed Gaussian processes

12. Benchmarks and ResultsΒΆ

12.1 Regression TasksΒΆ

UCI datasets (avg. RMSE):

  • Linear regression: 0.62

  • Random forest: 0.54

  • GP (RBF): 0.48

  • Deep GP: 0.45

Sample efficiency (RMSE with 100 samples):

  • Neural network: 0.75

  • GP: 0.52

12.2 Bayesian OptimizationΒΆ

Hyperparameter tuning (CV score after T evaluations):

  • Random search (T=50): 0.82

  • Grid search (T=50): 0.85

  • GP-EI (T=50): 0.91

Speed: GP-EI finds optimum 3Γ— faster than random

13. Connections to Other MethodsΒΆ

13.1 GP vs. Kernel Ridge RegressionΒΆ

Kernel Ridge Regression (KRR):

f(x) = Ξ£α΅’ Ξ±α΅’ k(x, xα΅’)
α = (K + λI)⁻¹y

Relationship: KRR = GP posterior mean (maximum a posteriori)

  • GP additionally provides uncertainty σ²(x)

13.2 GP vs. SplinesΒΆ

Splines: Piecewise polynomials with continuity constraints

Connection: Certain kernels (e.g., integrated Brownian motion) β†’ natural cubic smoothing splines

GP advantage: Generalizes to higher dimensions, arbitrary kernels

13.3 GP vs. RKHS MethodsΒΆ

Reproducing Kernel Hilbert Space (RKHS):

  • Function space with inner product ⟨f, g⟩_k

  • GP is probabilistic view of RKHS

Representer theorem: Optimal f lies in span of kernel evaluations

  • Justifies finite-dimensional approximations

14. Recent Advances (2018-2024)ΒΆ

14.1 Scalable Exact GPsΒΆ

LOVE (Lanczos Variance Estimates):

  • Lanczos iterations give fast predictive (co)variances after training

  • O(N log N) with structured kernels

MVM (Matrix-Vector Multiplication):

  • Exploit Toeplitz/Kronecker structure

  • FFT-based O(N log N)

14.2 Neural Process FamilyΒΆ

Neural Processes (Garnelo et al., 2018):

  • Meta-learning for GP-like uncertainty

  • Amortized inference (encoder-decoder)

Variants:

  • Conditional NP: Deterministic path

  • Attentive NP: Attention mechanism

  • Transformer NP: Full Transformer architecture

14.3 Infinite-Width Limits Beyond MLPsΒΆ

Convolutional NNGP (Garriga-Alonso et al., 2019):

  • Kernel for infinite-width CNNs

  • Spatial structure preserved

Transformer NNGP (Hron et al., 2020):

  • Kernel for infinite-width Transformers

  • Self-attention in function space

14.4 Functional PriorsΒΆ

GP-LVM (GP Latent Variable Model):

  • Learn low-dimensional latent space

  • Non-linear dimensionality reduction

Deep GPs (Damianou & Lawrence, 2013):

  • Composition of GPs: f_L(f_{L-1}(…f_1(x)))

  • Hierarchical feature learning

15. Limitations and Open ProblemsΒΆ

15.1 Current ChallengesΒΆ

  1. Scalability: Still O(NΒ³) for general kernels

  2. High dimensions: Curse of dimensionality (requires many samples)

  3. Discrete/categorical inputs: Less principled than continuous

  4. Model selection: Kernel choice is critical but difficult

  5. Non-stationarity: Real data often has varying local properties

15.2 Open Research QuestionsΒΆ

  • Adaptive kernels: Learn kernel structure from data

  • Scalable classification: GP classification still expensive

  • Deep GPs: How many layers? How to train?

  • Uncertainty decomposition: Epistemic vs. aleatoric for GPs

  • Meta-learning: Transfer kernel knowledge across tasks

16. Key TakeawaysΒΆ

  1. GPs are distributions over functions: Fully Bayesian, non-parametric

  2. Kernel encodes assumptions: Smoothness, periodicity, additivity

  3. Uncertainty for free: Predictive variance σ²(x*) naturally quantifies epistemic uncertainty

  4. Scalability via approximations: Inducing points, random features, structure

  5. Strong with small data: Sample-efficient, ideal for Bayesian optimization

  6. Connection to NNs: Infinite-width limit β†’ NNGP, NTK

When to use GPs:

  • Small data (N < 10,000)

  • Need uncertainty quantification

  • Interpretability important (kernel structure)

  • Bayesian optimization, active learning

When NOT to use:

  • Large data (N > 100,000) without approximations

  • High-dimensional inputs (D > 50)

  • Need feature learning (use deep kernel or NN)

17. ReferencesΒΆ

Foundational:

  • Rasmussen & Williams (2006): Gaussian Processes for Machine Learning (book)

  • Neal (1996): Bayesian Learning for Neural Networks

Scalable GPs:

  • Titsias (2009): Variational Learning of Inducing Variables (SVGP)

  • Hensman et al. (2013): Scalable Variational GP Regression

  • Wilson et al. (2015): Kernel Interpolation for Scalable GPs (KISS-GP)

  • Rahimi & Recht (2007): Random Features for Large-Scale Kernel Machines

Deep Kernels:

  • Wilson et al. (2016): Deep Kernel Learning

  • Wilson & Izmailov (2020): Bayesian Deep Learning (survey)

Infinite-Width NNs:

  • Jacot et al. (2018): Neural Tangent Kernel

  • Lee et al. (2018): Deep Neural Networks as GPs

  • Garriga-Alonso et al. (2019): Deep Convolutional Networks as Shallow GPs

Applications:

  • Snoek et al. (2012): Practical Bayesian Optimization (hyperparameter tuning)

  • Frazier (2018): Bayesian Optimization Tutorial

18. Practical WorkflowΒΆ

Step 1: Exploratory AnalysisΒΆ

  • Plot data, check for trends, periodicity

  • Compute pairwise distances (length scale estimate)

Step 2: Kernel SelectionΒΆ

  • Start with RBF (default)

  • Add periodic if seasonality detected

  • Use MatΓ©rn for rough data

  • Combine kernels (additive/multiplicative)

Step 3: Hyperparameter OptimizationΒΆ

  • Initialize: β„“ = median distance, σ² = Var(y)

  • Maximize log marginal likelihood

  • Use gradient-based optimizer (L-BFGS)

Step 4: Model ValidationΒΆ

  • Cross-validation (LOO or k-fold)

  • Check calibration (predictive intervals)

  • Residual analysis

Step 5: PredictionΒΆ

  • Compute posterior mean and variance

  • Plot with confidence bands

  • Identify regions of high uncertainty (active learning)

"""
Complete Gaussian Process Implementations
==========================================
Includes: GP regression with multiple kernels, scalable approximations (sparse GP, random features),
GP classification (Laplace approximation), Bayesian optimization, deep kernel learning.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize
from scipy.stats import norm
from scipy.linalg import cholesky, solve_triangular

# ============================================================================
# 1. Kernel Functions
# ============================================================================

class Kernel:
    """Base class for kernels."""
    def __call__(self, X1, X2=None):
        """
        Compute kernel matrix.
        
        Args:
            X1: [N1, D]
            X2: [N2, D] or None (use X1)
        Returns:
            K: [N1, N2] kernel matrix
        """
        raise NotImplementedError


class RBFKernel(Kernel):
    """
    Radial Basis Function (Squared Exponential) kernel.
    
    k(x, x') = σ² exp(-||x - x'||Β²/(2β„“Β²))
    
    Args:
        length_scale: Length scale β„“ (controls smoothness)
        signal_variance: Signal variance σ² (vertical scale)
    """
    def __init__(self, length_scale=1.0, signal_variance=1.0):
        self.length_scale = length_scale
        self.signal_variance = signal_variance
    
    def __call__(self, X1, X2=None):
        if X2 is None:
            X2 = X1
        
        # Compute squared distances
        # ||x - x'||Β² = ||x||Β² + ||x'||Β² - 2xΒ·x'
        X1_norm = np.sum(X1**2, axis=1, keepdims=True)
        X2_norm = np.sum(X2**2, axis=1, keepdims=True)
        distances_sq = X1_norm + X2_norm.T - 2 * X1 @ X2.T
        
        # RBF kernel
        K = self.signal_variance * np.exp(-distances_sq / (2 * self.length_scale**2))
        return K
    
    def gradient(self, X):
        """Gradient w.r.t. hyperparameters for marginal likelihood optimization."""
        K = self(X)
        
        # βˆ‚K/βˆ‚β„“
        X_norm = np.sum(X**2, axis=1, keepdims=True)
        distances_sq = X_norm + X_norm.T - 2 * X @ X.T
        dK_dl = K * distances_sq / self.length_scale**3
        
        # βˆ‚K/βˆ‚ΟƒΒ²
        dK_dsigma2 = K / self.signal_variance
        
        return {'length_scale': dK_dl, 'signal_variance': dK_dsigma2}


class MaternKernel(Kernel):
    """
    MatΓ©rn kernel (Ξ½ = 5/2).
    
    k(r) = σ² (1 + √5 r/β„“ + 5rΒ²/(3β„“Β²)) exp(-√5 r/β„“)
    
    Twice differentiable (smoother than Ξ½=3/2, rougher than RBF).
    Note: only Ξ½ = 5/2 is implemented; the `nu` argument is stored for reference.
    """
    def __init__(self, length_scale=1.0, signal_variance=1.0, nu=2.5):
        self.length_scale = length_scale
        self.signal_variance = signal_variance
        self.nu = nu
    
    def __call__(self, X1, X2=None):
        if X2 is None:
            X2 = X1
        
        # Euclidean distances
        X1_norm = np.sum(X1**2, axis=1, keepdims=True)
        X2_norm = np.sum(X2**2, axis=1, keepdims=True)
        distances = np.sqrt(np.maximum(
            X1_norm + X2_norm.T - 2 * X1 @ X2.T, 0))
        
        # MatΓ©rn 5/2
        sqrt5_r = np.sqrt(5) * distances / self.length_scale
        K = self.signal_variance * (1 + sqrt5_r + sqrt5_r**2 / 3) * np.exp(-sqrt5_r)
        
        return K


class PeriodicKernel(Kernel):
    """
    Periodic kernel.
    
    k(x, x') = σ² exp(-2 sinΒ²(Ο€|x-x'|/p) / β„“Β²)
    
    Args:
        period: Period p
        length_scale: Length scale β„“
        signal_variance: Signal variance σ²
    """
    def __init__(self, period=1.0, length_scale=1.0, signal_variance=1.0):
        self.period = period
        self.length_scale = length_scale
        self.signal_variance = signal_variance
    
    def __call__(self, X1, X2=None):
        if X2 is None:
            X2 = X1
        
        # Distances
        diffs = X1[:, np.newaxis, :] - X2[np.newaxis, :, :]
        distances = np.sqrt(np.sum(diffs**2, axis=2))
        
        # Periodic kernel
        sin_term = np.sin(np.pi * distances / self.period)
        K = self.signal_variance * np.exp(-2 * sin_term**2 / self.length_scale**2)
        
        return K


class LinearKernel(Kernel):
    """
    Linear kernel.
    
    k(x, x') = Οƒ_bΒ² + Οƒ_vΒ² (x - c)(x' - c)
    """
    def __init__(self, variance=1.0, bias=0.0, center=0.0):
        self.variance = variance
        self.bias = bias
        self.center = center
    
    def __call__(self, X1, X2=None):
        if X2 is None:
            X2 = X1
        
        X1_centered = X1 - self.center
        X2_centered = X2 - self.center
        
        K = self.bias + self.variance * (X1_centered @ X2_centered.T)
        return K


class AdditiveKernel(Kernel):
    """Sum of kernels: k(x, x') = k₁(x, x') + kβ‚‚(x, x')"""
    def __init__(self, kernel1, kernel2):
        self.kernel1 = kernel1
        self.kernel2 = kernel2
    
    def __call__(self, X1, X2=None):
        return self.kernel1(X1, X2) + self.kernel2(X1, X2)


class ProductKernel(Kernel):
    """Product of kernels: k(x, x') = k₁(x, x') Β· kβ‚‚(x, x')"""
    def __init__(self, kernel1, kernel2):
        self.kernel1 = kernel1
        self.kernel2 = kernel2
    
    def __call__(self, X1, X2=None):
        return self.kernel1(X1, X2) * self.kernel2(X1, X2)


# ============================================================================
# 2. Gaussian Process Regression
# ============================================================================

class GaussianProcessRegressor:
    """
    Gaussian Process Regression.
    
    Args:
        kernel: Kernel function
        noise_variance: Observation noise σ²_n
        optimize_hyperparams: Whether to optimize kernel hyperparameters
    """
    def __init__(self, kernel, noise_variance=1e-2, optimize_hyperparams=True):
        self.kernel = kernel
        self.noise_variance = noise_variance
        self.optimize_hyperparams = optimize_hyperparams
        
        self.X_train = None
        self.y_train = None
        self.K_inv = None
        self.alpha = None
    
    def fit(self, X, y):
        """
        Fit GP to training data.
        
        Args:
            X: Training inputs [N, D]
            y: Training targets [N, 1]
        """
        self.X_train = X
        self.y_train = y
        
        # Optimize hyperparameters
        if self.optimize_hyperparams and hasattr(self.kernel, 'gradient'):
            self._optimize_hyperparameters()
        
        # Compute kernel matrix
        K = self.kernel(X) + self.noise_variance * np.eye(len(X))
        
        # Cholesky decomposition for numerical stability
        L = cholesky(K, lower=True)
        
        # Solve for α = K⁻¹ y
        self.alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))
        
        # Store inverse (for predictive variance)
        self.K_inv = solve_triangular(L.T, solve_triangular(L, np.eye(len(X)), lower=True))
    
    def predict(self, X_test, return_std=True):
        """
        Predict at test points.
        
        Args:
            X_test: Test inputs [N*, D]
            return_std: Whether to return predictive std
        
        Returns:
            mean: Predictive mean [N*]
            std: Predictive std [N*] (if return_std=True)
        """
        # Kernel between test and train
        K_star = self.kernel(X_test, self.X_train)
        
        # Predictive mean: ΞΌ* = k*α΅€ K⁻¹ y
        mean = K_star @ self.alpha
        
        if not return_std:
            return mean.ravel()
        
        # Predictive variance: σ²* = k** - k*α΅€ K⁻¹ k*
        K_star_star = self.kernel(X_test)
        variance = np.diag(K_star_star) - np.sum(K_star @ self.K_inv * K_star, axis=1)
        std = np.sqrt(np.maximum(variance, 0))  # Clip negatives (numerical errors)
        
        return mean.ravel(), std
    
    def log_marginal_likelihood(self):
        """
        Compute log marginal likelihood.
        
        log p(y|X) = -Β½ yα΅€K⁻¹y - Β½ log|K| - (N/2)log(2Ο€)
        """
        N = len(self.y_train)
        K = self.kernel(self.X_train) + self.noise_variance * np.eye(N)
        
        # Cholesky decomposition
        L = cholesky(K, lower=True)
        
        # Data fit: yα΅€K⁻¹y
        alpha = solve_triangular(L.T, solve_triangular(L, self.y_train, lower=True))
        data_fit = -0.5 * self.y_train.T @ alpha
        
        # Complexity penalty: -Β½ log|K| = -Ξ£α΅’ log Lα΅’α΅’
        complexity = -np.sum(np.log(np.diag(L)))
        
        # Constant
        constant = -0.5 * N * np.log(2 * np.pi)
        
        return (data_fit + complexity + constant).item()
    
    def _optimize_hyperparameters(self):
        """Optimize kernel hyperparameters via gradient descent."""
        
        def objective(params):
            # Update kernel parameters
            self.kernel.length_scale = params[0]
            self.kernel.signal_variance = params[1]
            self.noise_variance = params[2]
            
            # Negative log marginal likelihood
            return -self.log_marginal_likelihood()
        
        # Initial parameters
        x0 = np.array([self.kernel.length_scale, 
                      self.kernel.signal_variance,
                      self.noise_variance])
        
        # Optimize
        result = minimize(objective, x0, method='L-BFGS-B', 
                         bounds=[(1e-3, 10), (1e-3, 10), (1e-5, 1)])
        
        # Update parameters
        self.kernel.length_scale = result.x[0]
        self.kernel.signal_variance = result.x[1]
        self.noise_variance = result.x[2]


# ============================================================================
# 3. Sparse GP (Inducing Points)
# ============================================================================

class SparseGP:
    """
    Sparse Gaussian Process using inducing points.
    
    Complexity: O(NMΒ²) training, O(MΒ²) prediction
    
    Args:
        kernel: Kernel function
        num_inducing: Number of inducing points M
        noise_variance: Observation noise
    """
    def __init__(self, kernel, num_inducing=100, noise_variance=1e-2):
        self.kernel = kernel
        self.num_inducing = num_inducing
        self.noise_variance = noise_variance
        
        self.Z = None     # Inducing points
        self.L_mm = None  # Cholesky factor of K_mm
        self.L_B = None   # Cholesky factor of B = I + A Aα΅€
        self.c = None     # Whitened posterior mean over inducing values
    
    def fit(self, X, y):
        """Fit sparse GP."""
        N = len(X)
        
        # Select inducing points (k-means or random subset)
        indices = np.random.choice(N, self.num_inducing, replace=False)
        self.Z = X[indices]
        
        # Kernel matrices
        K_mm = self.kernel(self.Z) + 1e-6 * np.eye(self.num_inducing)  # Jitter
        K_nm = self.kernel(X, self.Z)
        
        # Whitened cross-term: A = L_mm⁻¹ K_mn / Οƒ_n
        L_mm = cholesky(K_mm, lower=True)
        A = solve_triangular(L_mm, K_nm.T, lower=True) / np.sqrt(self.noise_variance)
        
        # B = I + A Aα΅€ carries the data's effect on the inducing posterior
        B = np.eye(self.num_inducing) + A @ A.T
        L_B = cholesky(B, lower=True)
        
        # Whitened posterior mean over inducing values: c = L_B⁻¹ A y / Οƒ_n
        self.c = solve_triangular(L_B, A @ y, lower=True) / np.sqrt(self.noise_variance)
        self.L_mm = L_mm
        self.L_B = L_B
    
    def predict(self, X_test, return_std=True):
        """Predict using inducing points (SGPR predictive equations)."""
        # Kernel between test and inducing points
        K_star_m = self.kernel(X_test, self.Z)
        
        # tmp1 = L_mm⁻¹ K_m*,  tmp2 = L_B⁻¹ tmp1
        tmp1 = solve_triangular(self.L_mm, K_star_m.T, lower=True)
        tmp2 = solve_triangular(self.L_B, tmp1, lower=True)
        
        # Mean: ΞΌ* = tmp2α΅€ c = σ⁻² K*m (K_mm + σ⁻² K_mn K_nm)⁻¹ K_mn y
        mean = tmp2.T @ self.c
        
        if not return_std:
            return mean.ravel()
        
        # Variance: k** - NystrΓΆm reduction + inducing-posterior term
        K_star_star = self.kernel(X_test)
        variance = (np.diag(K_star_star)
                    - np.sum(tmp1**2, axis=0) + np.sum(tmp2**2, axis=0))
        std = np.sqrt(np.maximum(variance, 0))
        
        return mean.ravel(), std


# ============================================================================
# 4. Random Fourier Features
# ============================================================================

class RandomFourierFeatures:
    """
    Approximate kernel with random Fourier features.
    
    k(x, x') β‰ˆ Ο†(x)α΅€ Ο†(x')
    
    Args:
        num_features: Number of random features M
        kernel: Kernel to approximate (for sampling frequencies)
        length_scale: Length scale (for RBF kernel)
    """
    def __init__(self, num_features=100, length_scale=1.0):
        self.num_features = num_features
        self.length_scale = length_scale
        self.omega = None
        self.weights = None
    
    def fit(self, X, y):
        """
        Fit via linear regression in feature space.
        
        Ο†(x) = √(1/M) [cos(ω₁ᡀx), ..., cos(Ο‰β‚˜α΅€x), sin(ω₁ᡀx), ..., sin(Ο‰β‚˜α΅€x)]
        """
        D = X.shape[1]
        
        # Sample random frequencies from spectral density
        # For RBF: p(Ο‰) = N(0, 1/β„“Β²)
        self.omega = np.random.randn(self.num_features, D) / self.length_scale
        
        # Compute features
        Phi = self._compute_features(X)
        
        # Ridge regression: w = (Ξ¦α΅€Ξ¦ + Ξ»I)⁻¹ Ξ¦α΅€y
        lambda_reg = 1e-4
        self.weights = np.linalg.solve(
            Phi.T @ Phi + lambda_reg * np.eye(Phi.shape[1]),
            Phi.T @ y
        )
    
    def _compute_features(self, X):
        """Compute random Fourier features."""
        proj = X @ self.omega.T  # [N, M]
        
        # Concatenate cos and sin (2M features); the 1/√M scale makes
        # Ο†(x)α΅€Ο†(x') an unbiased Monte Carlo estimate of the RBF kernel
        features = np.concatenate([
            np.cos(proj), np.sin(proj)
        ], axis=1) * np.sqrt(1.0 / self.num_features)
        
        return features
    
    def predict(self, X_test):
        """Predict (no uncertainty, this is a point estimate)."""
        Phi_test = self._compute_features(X_test)
        return Phi_test @ self.weights


# ============================================================================
# 5. GP Classification (Laplace Approximation)
# ============================================================================

class GPClassifier:
    """
    Gaussian Process Classification using Laplace approximation.
    
    Binary classification with logistic likelihood.
    """
    def __init__(self, kernel, max_iter=20):
        self.kernel = kernel
        self.max_iter = max_iter
        
        self.X_train = None
        self.y_train = None
        self.f_map = None  # MAP estimate of latent function
    
    def fit(self, X, y):
        """
        Fit GP classifier.
        
        Args:
            X: Training inputs [N, D]
            y: Training labels [N] (0 or 1)
        """
        self.X_train = X
        self.y_train = y
        
        # Convert labels to {-1, +1}
        y_signed = 2 * y - 1
        
        # Compute kernel matrix
        K = self.kernel(X)
        
        # Find MAP via Newton's method
        f = np.zeros(len(X))  # Initialize at 0
        
        for iteration in range(self.max_iter):
            # Bernoulli targets t ∈ {0, 1} and logistic link Ο€ = Οƒ(f)
            pi = self._sigmoid(f)
            t = (y_signed + 1) / 2
            
            # W: diagonal of the negative log-likelihood Hessian (logistic)
            W = pi * (1 - pi)
            W_sqrt = np.sqrt(W)
            
            # B = I + W^Β½ K W^Β½ is well-conditioned (R&W Algorithm 3.1)
            B = np.eye(len(X)) + W_sqrt[:, None] * K * W_sqrt[None, :]
            L = cholesky(B, lower=True)
            
            # Newton step: b = W f + βˆ‡log p(y|f), with βˆ‡log p(y|f) = t - Ο€
            b = W * f + (t - pi)
            a = b - W_sqrt * solve_triangular(
                L.T, solve_triangular(L, W_sqrt * (K @ b), lower=True))
            f_new = K @ a
            
            # Check convergence
            if np.linalg.norm(f_new - f) < 1e-4:
                break
            
            f = f_new
        
        self.f_map = f
    
    def predict_proba(self, X_test):
        """
        Predict probabilities.
        
        Returns:
            probabilities: [N*, 2] (probability of class 0 and 1)
        """
        # Kernel between test and train
        K_star = self.kernel(X_test, self.X_train)
        
        # Mean of latent function
        f_star_mean = K_star @ np.linalg.solve(
            self.kernel(self.X_train) + 1e-8 * np.eye(len(self.X_train)), self.f_map)
        
        # Laplace approximation: ignore variance for simplicity
        # (Full version requires 1D integral)
        probs = self._sigmoid(f_star_mean)
        
        return np.column_stack([1 - probs, probs])
    
    def predict(self, X_test):
        """Predict class labels."""
        probs = self.predict_proba(X_test)
        return (probs[:, 1] > 0.5).astype(int)
    
    @staticmethod
    def _sigmoid(x):
        """Numerically stable sigmoid."""
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))


# ============================================================================
# 6. Bayesian Optimization
# ============================================================================

class BayesianOptimization:
    """
    Bayesian Optimization using GP and acquisition functions.
    
    Args:
        bounds: List of (min, max) for each dimension
        kernel: GP kernel
        acquisition: 'ei', 'ucb', or 'pi'
    """
    def __init__(self, bounds, kernel=None, acquisition='ei'):
        self.bounds = np.array(bounds)
        self.dim = len(bounds)
        self.acquisition = acquisition
        
        if kernel is None:
            kernel = RBFKernel(length_scale=1.0, signal_variance=1.0)
        
        self.gp = GaussianProcessRegressor(kernel, noise_variance=1e-4)
        
        self.X_observed = []
        self.y_observed = []
        self.best_y = -np.inf
    
    def suggest(self):
        """
        Suggest next point to evaluate.
        
        Returns:
            x_next: Next point to query [D]
        """
        if len(self.X_observed) == 0:
            # Random initialization
            return np.random.uniform(self.bounds[:, 0], self.bounds[:, 1])
        
        # Fit GP to observed data
        X = np.array(self.X_observed)
        y = np.array(self.y_observed).reshape(-1, 1)
        self.gp.fit(X, y)
        
        # Optimize acquisition function
        def objective(x):
            return -self._acquisition_function(x.reshape(1, -1))
        
        # Multi-start optimization
        best_acq = -np.inf
        best_x = None
        
        for _ in range(10):
            x0 = np.random.uniform(self.bounds[:, 0], self.bounds[:, 1])
            result = minimize(objective, x0, method='L-BFGS-B', bounds=self.bounds)
            
            if -result.fun > best_acq:
                best_acq = -result.fun
                best_x = result.x
        
        return best_x
    
    def observe(self, x, y):
        """
        Observe new data point.
        
        Args:
            x: Input [D]
            y: Output (scalar)
        """
        self.X_observed.append(x)
        self.y_observed.append(y)
        
        if y > self.best_y:
            self.best_y = y
    
    def _acquisition_function(self, x):
        """
        Compute acquisition function value.
        
        Args:
            x: Input [1, D]
        Returns:
            Acquisition value (scalar)
        """
        mean, std = self.gp.predict(x, return_std=True)
        mean, std = float(mean), float(std)  # single test point -> scalars
        
        if self.acquisition == 'ucb':
            # Upper Confidence Bound
            beta = 2.0
            return mean + beta * std
        
        elif self.acquisition == 'ei':
            # Expected Improvement
            if std == 0:
                return 0
            
            z = (mean - self.best_y) / std
            ei = (mean - self.best_y) * norm.cdf(z) + std * norm.pdf(z)
            return ei
        
        elif self.acquisition == 'pi':
            # Probability of Improvement
            if std == 0:
                return 0
            
            z = (mean - self.best_y) / std
            return norm.cdf(z)
        
        else:
            raise ValueError(f"Unknown acquisition: {self.acquisition}")

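# The closed-form EI used in `_acquisition_function` can be sanity-checked in
# isolation against a Monte Carlo estimate of E[max(f - y_best, 0)] for a
# Gaussian predictive distribution. A minimal standalone sketch (NumPy/SciPy
# only; the function name here is local to this check):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_y):
    """Closed-form EI for a Gaussian predictive N(mean, std^2)."""
    z = (mean - best_y) / std
    return (mean - best_y) * norm.cdf(z) + std * norm.pdf(z)

rng = np.random.default_rng(0)
mean, std, best_y = 1.2, 0.5, 1.0

# Monte Carlo estimate: E[max(f - best_y, 0)] with f ~ N(mean, std^2)
samples = rng.normal(mean, std, size=1_000_000)
mc = np.maximum(samples - best_y, 0.0).mean()

print(f"Closed form: {expected_improvement(mean, std, best_y):.4f}, MC: {mc:.4f}")
```

# The two estimates agree to roughly 1/sqrt(num_samples), which is a quick way
# to catch sign errors in the cdf/pdf terms.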

# ============================================================================
# 7. Demonstrations
# ============================================================================

def demo_gp_regression():
    """Demonstrate GP regression with multiple kernels."""
    print("="*70)
    print("GP Regression Demo")
    print("="*70)
    
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.uniform(-5, 5, (20, 1))
    y_train = np.sin(X_train) + 0.1 * np.random.randn(20, 1)
    
    X_test = np.linspace(-6, 6, 100).reshape(-1, 1)
    
    # Fit GP with different kernels
    kernels = {
        'RBF': RBFKernel(length_scale=1.0, signal_variance=1.0),
        'Matérn': MaternKernel(length_scale=1.0, signal_variance=1.0),
        'Periodic': PeriodicKernel(period=2*np.pi, length_scale=1.0),
    }
    
    for name, kernel in kernels.items():
        gp = GaussianProcessRegressor(kernel, noise_variance=0.01, 
                                      optimize_hyperparams=True)
        gp.fit(X_train, y_train)
        
        mean, std = gp.predict(X_test, return_std=True)
        
        print(f"\n{name} Kernel:")
        print(f"  Optimized length scale: {kernel.length_scale:.4f}")
        print(f"  Optimized signal variance: {kernel.signal_variance:.4f}")
        print(f"  Log marginal likelihood: {gp.log_marginal_likelihood():.4f}")
        print(f"  Mean prediction range: [{mean.min():.4f}, {mean.max():.4f}]")
        print(f"  Avg uncertainty: {std.mean():.4f}")
    
    print()

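# The posterior equations that `GaussianProcessRegressor` implements
# (mu* = K*^T (K + s^2 I)^-1 y, Sigma* = K** - K*^T (K + s^2 I)^-1 K*) can be
# written out directly. A minimal, self-contained Cholesky-based sketch (NumPy
# only; function names here are local to the sketch, not the classes above):

```python
import numpy as np

def rbf(A, B, ls=1.0, var=1.0):
    """RBF kernel matrix: k(a, b) = var * exp(-||a-b||^2 / (2 ls^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ls**2)

def gp_posterior(X, y, X_star, noise=0.01):
    """Exact GP posterior mean and covariance at test points X_star."""
    K = rbf(X, X) + noise * np.eye(len(X))
    K_s = rbf(X, X_star)
    L = np.linalg.cholesky(K)                              # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K + s^2 I)^-1 y
    mu = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = rbf(X_star, X_star) - v.T @ v
    return mu, cov

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (15, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=15)
mu, cov = gp_posterior(X, y, np.linspace(-3, 3, 50).reshape(-1, 1))
print(mu.shape, cov.shape)  # (50,) (50, 50)
```

# Solving against the Cholesky factor twice avoids forming the explicit inverse,
# which is both faster and numerically safer than np.linalg.inv.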

def demo_sparse_gp():
    """Demonstrate sparse GP with inducing points."""
    print("="*70)
    print("Sparse GP Demo")
    print("="*70)
    
    # Generate larger dataset
    np.random.seed(42)
    N = 500
    X_train = np.random.uniform(-10, 10, (N, 1))
    y_train = np.sin(X_train) + 0.1 * np.random.randn(N, 1)
    
    X_test = np.linspace(-12, 12, 200).reshape(-1, 1)
    
    # Compare exact vs sparse
    kernel = RBFKernel(length_scale=1.0, signal_variance=1.0)
    
    # Exact GP (slow for N=500)
    print("Exact GP...")
    gp_exact = GaussianProcessRegressor(kernel, noise_variance=0.01, 
                                        optimize_hyperparams=False)
    gp_exact.fit(X_train, y_train)
    mean_exact, std_exact = gp_exact.predict(X_test, return_std=True)
    
    # Sparse GP
    print("Sparse GP with M=50 inducing points...")
    gp_sparse = SparseGP(kernel, num_inducing=50, noise_variance=0.01)
    gp_sparse.fit(X_train, y_train)
    mean_sparse, std_sparse = gp_sparse.predict(X_test)
    
    # Compare
    print(f"\nPrediction difference (RMSE): {np.sqrt(np.mean((mean_exact - mean_sparse)**2)):.6f}")
    print(f"Exact GP complexity: O({N}³) = O({N**3:,})")
    print(f"Sparse GP complexity: O({N}×{50}²) = O({N * 50**2:,})")
    print(f"Speedup: ~{(N**3) / (N * 50**2):.1f}×")
    print()

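# The inducing-point idea behind `SparseGP` can be illustrated on its own: the
# Nyström approximation replaces the full kernel matrix with the low-rank form
# K ~= K_nm K_mm^-1 K_mn built from M inducing inputs. A minimal sketch (NumPy
# only; helper names are local to this example, not the SparseGP class):

```python
import numpy as np

def rbf(A, B, ls=1.0):
    """RBF kernel matrix with unit signal variance."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(0)
N, M = 500, 50
X = rng.uniform(-10, 10, (N, 1))
Z = X[rng.choice(N, M, replace=False)]     # inducing inputs: a random data subset

K_nn = rbf(X, X)                           # full N x N kernel (for comparison only)
K_nm = rbf(X, Z)                           # N x M cross-covariance
K_mm = rbf(Z, Z) + 1e-8 * np.eye(M)        # jitter for numerical stability
K_nystrom = K_nm @ np.linalg.solve(K_mm, K_nm.T)

err = np.linalg.norm(K_nn - K_nystrom) / np.linalg.norm(K_nn)
print(f"Relative Nystrom error with M={M}: {err:.4f}")
```

# A sparse GP never materializes K_nn; it works with K_nm and K_mm directly,
# which is where the O(NM^2) training cost in the demo comes from.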

def demo_gp_classification():
    """Demonstrate GP classification."""
    print("="*70)
    print("GP Classification Demo")
    print("="*70)
    
    # Generate synthetic binary classification data
    np.random.seed(42)
    X_train = np.random.randn(100, 2)
    y_train = (X_train[:, 0]**2 + X_train[:, 1]**2 > 1).astype(int)
    
    # Fit GP classifier
    kernel = RBFKernel(length_scale=1.0, signal_variance=1.0)
    gp_clf = GPClassifier(kernel)
    gp_clf.fit(X_train, y_train)
    
    # Test
    X_test = np.random.randn(20, 2)
    y_pred = gp_clf.predict(X_test)
    probs = gp_clf.predict_proba(X_test)
    
    print("Sample predictions:")
    for i in range(5):
        print(f"  Test {i+1}: P(y=0)={probs[i, 0]:.4f}, P(y=1)={probs[i, 1]:.4f}, "
              f"Predicted class={y_pred[i]}")
    
    print(f"\nTraining accuracy: {(gp_clf.predict(X_train) == y_train).mean():.4f}")
    print()


def demo_bayesian_optimization():
    """Demonstrate Bayesian optimization."""
    print("="*70)
    print("Bayesian Optimization Demo")
    print("="*70)
    
    # Define toy objective (unknown to optimizer)
    def objective(x):
        return -(x - 2)**2 * np.sin(5 * x)
    
    # Run Bayesian optimization
    bo = BayesianOptimization(bounds=[(-5, 5)], acquisition='ei')
    
    print("Optimizing objective with Expected Improvement...")
    for iteration in range(15):
        x_next = bo.suggest()
        y_next = objective(x_next)[0]
        bo.observe(x_next, y_next)
        
        if (iteration + 1) % 5 == 0:
            print(f"  Iteration {iteration+1}: Best y = {bo.best_y:.4f} at x = {bo.X_observed[np.argmax(bo.y_observed)]}")
    
    # True optimum via dense grid search (the maximum is near x ≈ -4.08, not x = 2)
    x_grid = np.linspace(-5, 5, 10001)
    y_opt_true = objective(x_grid).max()
    
    print(f"\nTrue optimum: y ≈ {y_opt_true:.4f}")
    print(f"BO found: y = {bo.best_y:.4f} (difference: {abs(bo.best_y - y_opt_true):.4f})")
    print()


def print_complexity_comparison():
    """Print complexity comparison."""
    print("="*70)
    print("GP Method Complexity Comparison")
    print("="*70)
    print()
    
    comparison = """
┌─────────────────────┬─────────────┬─────────────┬────────────┬──────────────┐
│ Method              │ Training    │ Prediction  │ Memory     │ Best For     │
├─────────────────────┼─────────────┼─────────────┼────────────┼──────────────┤
│ Exact GP            │ O(N³)       │ O(N²)       │ O(N²)      │ N < 5,000    │
├─────────────────────┼─────────────┼─────────────┼────────────┼──────────────┤
│ Sparse GP (SVGP)    │ O(NM²)      │ O(M²)       │ O(NM)      │ 5K < N < 100K│
│                     │             │             │            │ (M ~ 100-500)│
├─────────────────────┼─────────────┼─────────────┼────────────┼──────────────┤
│ Random Features     │ O(NM)       │ O(M)        │ O(NM)      │ N > 100K     │
│ (RFF)               │ (ridge reg) │             │            │ (M ~ 1000)   │
├─────────────────────┼─────────────┼─────────────┼────────────┼──────────────┤
│ SKI/KISS-GP         │ O(N+M log M)│ O(M log M)  │ O(M)       │ 1D-3D, huge  │
│                     │             │             │            │ structured   │
└─────────────────────┴─────────────┴─────────────┴────────────┴──────────────┘

**Scalability example (N = 10,000 data points):**

- Exact GP: ~(10⁴)³ = 10¹² operations (impractical)
- Sparse GP (M=200): ~10⁴ × 200² = 4×10⁸ operations (feasible)
- Random Features (M=500): ~10⁴ × 500 = 5×10⁶ operations (fast)

**When to use each method:**

1. **Exact GP**: Small data (N < 5,000), need exact inference
2. **Sparse GP**: Medium data (5K-100K), need uncertainty estimates
3. **Random Features**: Large data (>100K), speed critical
4. **Deep Kernel**: Complex patterns, have GPU, willing to train

**Kernel selection guide:**

- **RBF**: Default choice, smooth functions
- **Matérn (ν=3/2, 5/2)**: Rough/noisy data
- **Periodic**: Seasonal patterns
- **Linear**: Linear trends (baseline)
- **Sum/Product**: Combine multiple patterns
"""
    
    print(comparison)
    print()

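# The "Random Features" row in the table can be made concrete: random Fourier
# features (Rahimi & Recht) approximate the RBF kernel as k(x, x') ~=
# phi(x)^T phi(x') with an explicit M-dimensional feature map, so training
# reduces to linear/ridge regression in O(NM). A minimal sketch (NumPy only):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, ls = 2, 2000, 1.0                      # input dim, num features, length scale

W = rng.normal(0.0, 1.0 / ls, (M, D))        # frequencies ~ spectral density of RBF
b = rng.uniform(0.0, 2.0 * np.pi, M)         # random phases

def phi(X):
    """Random Fourier feature map: k(x, x') ~= phi(x) @ phi(x')."""
    return np.sqrt(2.0 / M) * np.cos(X @ W.T + b)

X = rng.normal(size=(100, D))
K_exact = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1) / ls**2)
K_rff = phi(X) @ phi(X).T

print("Max abs kernel error:", np.abs(K_exact - K_rff).max())
```

# The per-entry error shrinks as O(1/sqrt(M)); with M=2000 the approximation
# is already close enough for most regression uses.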

# ============================================================================
# Run Demonstrations
# ============================================================================

if __name__ == "__main__":
    np.random.seed(42)
    
    demo_gp_regression()
    demo_sparse_gp()
    demo_gp_classification()
    demo_bayesian_optimization()
    print_complexity_comparison()
    
    print("="*70)
    print("Gaussian Process Implementations Complete")
    print("="*70)
    print()
    print("Summary:")
    print("  • GP Regression: Exact inference with multiple kernels")
    print("  • Sparse GP: Inducing points for O(NM²) complexity")
    print("  • Random Features: O(NM) approximation via Fourier features")
    print("  • GP Classification: Laplace approximation for non-Gaussian likelihood")
    print("  • Bayesian Optimization: EI/UCB acquisition for black-box optimization")
    print()
    print("Key insight: GPs provide uncertainty quantification naturally")
    print("Trade-off: Computational cost vs. sample efficiency")
    print("Applications: Small data, Bayesian optimization, calibrated uncertainty")
    print()