import torch
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

1. Gaussian Process Theory

Definition:

\[f \sim \mathcal{GP}(m, k)\]

where \(m(x)\) is the mean function and \(k(x,x')\) is the kernel.

Posterior:

\[f_* | X, y, X_* \sim \mathcal{N}(\mu_*, \Sigma_*)\]
\[\mu_* = K_*^T (K + \sigma_n^2 I)^{-1} y\]
\[\Sigma_* = K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*\]

📚 Reference Materials:

1.5. Gaussian Process Theory: Deep Mathematical Foundations

What is a Gaussian Process?

Definition: A GP is a collection of random variables, any finite subset of which has a joint Gaussian distribution.

\[f \sim \mathcal{GP}(m(x), k(x, x'))\]

where:

  • Mean function: \(m(x) = \mathbb{E}[f(x)]\)

  • Covariance/Kernel function: \(k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]\)

Intuition: Instead of learning a single function, we learn a distribution over functions.

From Functions to Distributions

Traditional regression: \(y = f(x; \theta) + \epsilon\). Find the best parameters \(\theta\).

GP regression: \(f(x) \sim \mathcal{GP}(0, k(x, x'))\), with observations \(y = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma_n^2)\).

Every finite collection of function values is Gaussian:

\[\begin{split}\begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} k(x_1,x_1) & k(x_1,x_2) & \cdots \\ k(x_2,x_1) & k(x_2,x_2) & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}\right)\end{split}\]

Posterior Derivation: Step-by-Step

Setup:

  • Training: \(X = \{x_1, \ldots, x_n\}\), \(\mathbf{y} = \{y_1, \ldots, y_n\}\)

  • Test: \(X_* = \{x_{*1}, \ldots, x_{*m}\}\)

  • Want: \(p(f_* | X, \mathbf{y}, X_*)\)

Prior (before seeing data):

\[\begin{split}\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K + \sigma_n^2 I & K_* \\ K_*^T & K_{**} \end{bmatrix}\right)\end{split}\]

where:

  • \(K = k(X, X)\) is \(n \times n\) (training covariance)

  • \(K_* = k(X, X_*)\) is \(n \times m\) (cross-covariance)

  • \(K_{**} = k(X_*, X_*)\) is \(m \times m\) (test covariance)

Gaussian conditioning formula:

If \(\begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}\right)\)

Then: \(\mathbf{y} | \mathbf{x} \sim \mathcal{N}(\mu_y + \Sigma_{yx}\Sigma_{xx}^{-1}(\mathbf{x} - \mu_x), \Sigma_{yy} - \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy})\)

Applying to our case:

\[\boxed{\mathbf{f}_* | X, \mathbf{y}, X_* \sim \mathcal{N}(\mu_*, \Sigma_*)}\]
\[\boxed{\mu_* = K_*^T (K + \sigma_n^2 I)^{-1} \mathbf{y}}\]
\[\boxed{\Sigma_* = K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*}\]

Interpretation:

  • Mean \(\mu_*\): Weighted combination of training labels

    • Weights depend on similarity (kernel) between test and training points

  • Variance \(\Sigma_*\): Uncertainty decreases near training data

    • \(K_{**}\): Prior uncertainty (far from data)

    • Second term: Reduction due to observations
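The boxed posterior formulas can be sanity-checked numerically with a one-point toy example; all names below are local to this sketch (pure NumPy, nothing from the notebook's later cells):

```python
import numpy as np

# Toy check of the posterior formulas: one training point, test point on top of it
k = lambda a, b: np.exp(-(a - b) ** 2 / 2.0)   # RBF with unit length scale and variance

x_tr, y_tr = 0.0, 1.0        # single observation
x_te = 0.0                   # test input exactly at the training input
sigma_n2 = 1e-8              # near-noiseless

K, K_s, K_ss = k(x_tr, x_tr), k(x_tr, x_te), k(x_te, x_te)
mu_star = K_s * y_tr / (K + sigma_n2)       # mean: pulled to the observed label
var_star = K_ss - K_s**2 / (K + sigma_n2)   # variance: collapses near data
```

At the observed input the mean reproduces the label and the variance drops to (almost) zero, matching the interpretation bullets above.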

Kernel Functions: Mathematical Properties

Definition: A kernel \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is valid if its Gram matrix is positive semi-definite (PSD) for any finite set of points.

Mercer’s Theorem: A continuous, symmetric, PSD kernel \(k(x, x')\) can be expressed as:

\[k(x, x') = \sum_{i=1}^\infty \lambda_i \phi_i(x) \phi_i(x')\]

where \(\lambda_i \geq 0\) are eigenvalues and \(\phi_i\) are eigenfunctions.
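The PSD requirement is easy to check empirically for a given kernel: the eigenvalues of any Gram matrix should be non-negative up to round-off. A minimal NumPy check for the RBF kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))

# RBF Gram matrix on 50 random points (unit length scale)
d2 = (X - X.T) ** 2
K = np.exp(-d2 / 2.0)

eigvals = np.linalg.eigvalsh(K)   # symmetric eigenvalue solver
min_eig = eigvals.min()           # should be >= 0 up to floating-point error
```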

Popular Kernels:

  1. Squared Exponential (RBF):

\[k_{\text{SE}}(x, x') = \sigma^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)\]
  • Properties: Infinitely differentiable, stationary, isotropic

  • Spectral density: Gaussian

  • Use: Smooth functions

  2. Matérn:

\[k_{\text{Mat}}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}r}{\ell}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}r}{\ell}\right)\]

where \(r = \|x - x'\|\) and \(K_\nu\) is the modified Bessel function of the second kind.

  • Special cases:

    • \(\nu = 1/2\): Exponential kernel (not differentiable)

    • \(\nu = 3/2\): Once differentiable

    • \(\nu = 5/2\): Twice differentiable

    • \(\nu \to \infty\): RBF

  3. Periodic:

\[k_{\text{Per}}(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi|x-x'|/p)}{\ell^2}\right)\]
  • Period \(p\), length scale \(\ell\)

  • For periodic data (seasons, daily patterns)

  4. Rational Quadratic (mixture of RBF):

\[k_{\text{RQ}}(x, x') = \sigma^2 \left(1 + \frac{\|x-x'\|^2}{2\alpha\ell^2}\right)^{-\alpha}\]
  • Scale mixture of SE kernels with different length scales

  • \(\alpha \to \infty\): Becomes SE

Kernel Algebra: Combine kernels to build complex ones!

  • Sum: \(k_1 + k_2\) β†’ Models different patterns

  • Product: \(k_1 \cdot k_2\) β†’ Conjunction of properties

  • Composition: \(k_1(f(x), f(x'))\) β†’ Feature transformation

Example combination: \(k(x, x') = k_{\text{trend}}(x, x') + k_{\text{periodic}}(x, x') + k_{\text{noise}}(x, x')\)
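A quick sketch of kernel algebra in NumPy (the `rbf`/`periodic` helpers below are local illustrations, not the notebook's functions): summing a slow trend kernel, a periodic kernel, and a small-scale term still yields a PSD Gram matrix.

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    # squared-exponential kernel on 1-D inputs
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-d2 / (2 * ls**2))

def periodic(X1, X2, p=1.0, ls=1.0):
    # standard periodic kernel with period p
    d = np.abs(X1[:, None] - X2[None, :])
    return np.exp(-2 * np.sin(np.pi * d / p) ** 2 / ls**2)

x = np.linspace(0.0, 4.0, 60)

# trend (long length scale) + periodic pattern + small-scale "noise-like" term
K = rbf(x, x, ls=5.0) + periodic(x, x) + 0.1 * rbf(x, x, ls=0.05)

# Sums of valid kernels are valid: eigenvalues stay non-negative (up to round-off)
min_eig = np.linalg.eigvalsh(K).min()
```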

Computational Complexity Analysis

Training (computing \(K^{-1}\)):

  • Naive inversion: \(O(n^3)\)

  • Cholesky decomposition: \(O(n^3)\) (numerically stable)

Prediction (for \(m\) test points):

  • Mean: \(O(nm + n^2)\)

  • Variance: \(O(m^2n)\) (full covariance) or \(O(mn)\) (diagonal only)

Storage:

  • Kernel matrix: \(O(n^2)\)

Bottleneck: For \(n > 10,000\), standard GP becomes impractical!

Numerical Stability

Problem: Direct matrix inversion is unstable.

Solution: Cholesky decomposition

\[K + \sigma_n^2 I = L L^T\]

where \(L\) is lower triangular.

Efficient computation:

# Instead of explicitly forming (K + sigma_n^2 I)^{-1} and multiplying by y:
L = cholesky(K + sigma_n**2 * I)
alpha = solve(L.T, solve(L, y))  # alpha = (K + sigma_n^2 I)^{-1} y

# For the predictive variance:
v = solve(L, K_s)                # K_s  = k(X, X_*)
var = K_ss - v.T @ v             # K_ss = k(X_*, X_*)

Time: Still \(O(n^3)\), but numerically stable!

Jitter: Add a small diagonal term for numerical stability: \(K \leftarrow K + 10^{-6} I\)
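The same pattern with SciPy's Cholesky helpers, as one possible concrete rendering of the pseudocode above (the data here is synthetic and the variable names are ours):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * 0.3**2))          # RBF kernel matrix
y = np.sin(3 * X).ravel()
sigma_n2 = 1e-2

c, low = cho_factor(K + sigma_n2 * np.eye(len(X)))   # K + sigma_n^2 I = L L^T
alpha = cho_solve((c, low), y)                        # alpha = (K + sigma_n^2 I)^{-1} y

# Same solution as explicit inversion, without ever forming the inverse
alpha_naive = np.linalg.inv(K + sigma_n2 * np.eye(len(X))) @ y
```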

Log Marginal Likelihood (Model Selection)

To optimize hyperparameters \(\theta = \{\ell, \sigma, \sigma_n\}\):

\[\log p(\mathbf{y} | X, \theta) = -\frac{1}{2}\mathbf{y}^T K_y^{-1} \mathbf{y} - \frac{1}{2}\log|K_y| - \frac{n}{2}\log(2\pi)\]

where \(K_y = K + \sigma_n^2 I\).

Three terms:

  1. Data fit: \(-\frac{1}{2}\mathbf{y}^T K_y^{-1} \mathbf{y}\) (how well model explains data)

  2. Complexity penalty: \(-\frac{1}{2}\log|K_y|\) (Occam’s razor)

  3. Normalization: \(-\frac{n}{2}\log(2\pi)\)

Gradient (for optimization):

\[\frac{\partial}{\partial \theta} \log p(\mathbf{y}|X,\theta) = \frac{1}{2}\text{tr}\left((K_y^{-1}\mathbf{y}\mathbf{y}^T K_y^{-1} - K_y^{-1})\frac{\partial K_y}{\partial \theta}\right)\]

Can optimize with gradient descent!
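A compact implementation of this objective for the RBF kernel, using a Cholesky factor for both the solve and the log-determinant (the function name and defaults are illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood(X, y, ls=1.0, var=1.0, noise=1e-2):
    """log p(y | X, theta) for an RBF kernel, via a Cholesky factor."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Ky = var * np.exp(-d2 / (2 * ls**2)) + noise * np.eye(len(X))
    c, low = cho_factor(Ky)
    alpha = cho_solve((c, low), y)                 # K_y^{-1} y
    logdet = 2.0 * np.log(np.diag(c)).sum()        # log|K_y| from the Cholesky diagonal
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * len(X) * np.log(2 * np.pi)

# One point, unit kernel, zero noise: log p = -1/2 log(2*pi)
lml_single = log_marginal_likelihood(np.zeros((1, 1)), np.zeros(1), noise=0.0)
```

The three terms of the expression map line-by-line onto the data-fit, complexity, and normalization terms above.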

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """
    RBF/Squared Exponential kernel.
    k(x,x') = variance * exp(-||x-x'||^2 / (2*length_scale^2))
    """
    dists = cdist(X1, X2, 'sqeuclidean')
    K = variance * np.exp(-dists / (2 * length_scale**2))
    return K

def matern_kernel(X1, X2, length_scale=1.0, nu=1.5):
    """
    Matérn kernel, fixed at nu=1.5 (once-differentiable sample paths):
    k(x,x') = (1 + sqrt(3)*r/l) * exp(-sqrt(3)*r/l),  r = ||x - x'||
    """
    r = np.sqrt(3) * cdist(X1, X2) / length_scale
    K = (1 + r) * np.exp(-r)
    return K

print("Kernels defined")

GP Regression

Gaussian Process regression places a prior distribution directly over functions: \(f \sim \mathcal{GP}(m(x), k(x, x'))\), where \(m\) is the mean function and \(k\) is the kernel (covariance function) that encodes assumptions about function smoothness and length scale. Given observed data, the posterior is also a Gaussian Process, yielding an exact predictive mean (the best estimate) and predictive variance (the uncertainty) at every test point. The closed-form posterior is computed via matrix operations involving the kernel matrix \(K\), training targets \(y\), and noise variance \(\sigma_n^2\): \(\mu_* = K_*^T (K + \sigma_n^2 I)^{-1} y\), \(\Sigma_* = K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*\). This non-parametric approach provides well-calibrated uncertainty estimates out of the box.

class GaussianProcess:
    """GP regression with RBF kernel."""
    
    def __init__(self, length_scale=1.0, variance=1.0, noise=1e-2):
        self.length_scale = length_scale
        self.variance = variance
        self.noise = noise
        
        self.X_train = None
        self.y_train = None
        self.K_inv = None
    
    def fit(self, X, y):
        """Fit GP to training data."""
        self.X_train = X
        self.y_train = y
        
        # Compute kernel matrix
        K = rbf_kernel(X, X, self.length_scale, self.variance)
        K += self.noise * np.eye(len(X))
        
        # Invert (with Cholesky for stability)
        L = np.linalg.cholesky(K)
        self.K_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(len(X))))
    
    def predict(self, X_test, return_std=True):
        """Predict with uncertainty."""
        # K_*
        K_s = rbf_kernel(self.X_train, X_test, self.length_scale, self.variance)
        
        # Mean
        mu = K_s.T @ self.K_inv @ self.y_train
        
        if return_std:
            # K_**
            K_ss = rbf_kernel(X_test, X_test, self.length_scale, self.variance)
            
            # Variance
            var = np.diag(K_ss - K_s.T @ self.K_inv @ K_s)
            std = np.sqrt(np.maximum(var, 0))
            
            return mu, std
        
        return mu

print("GaussianProcess defined")

Generate Test Data

We create a synthetic regression dataset with a known ground-truth function (e.g., a sinusoidal with noise) and a set of sparse observation points. The gap between observations tests the GP’s interpolation capability, while the region beyond the observations tests extrapolation. GPs excel at interpolation (the uncertainty shrinks near observed points) but honestly report high uncertainty in extrapolation regions – a crucial property for safety-critical applications where knowing what you do not know is as important as making predictions.

# True function
def f(x):
    return np.sin(3*x) + 0.3*np.cos(9*x)

# Training data
np.random.seed(42)
X_train = np.random.uniform(-1, 1, 20).reshape(-1, 1)
y_train = f(X_train).ravel() + 0.1 * np.random.randn(20)

# Test data
X_test = np.linspace(-1.5, 1.5, 200).reshape(-1, 1)
y_test = f(X_test).ravel()

print(f"Train: {X_train.shape}, Test: {X_test.shape}")

Fit GP and Predict

Fitting a GP involves computing the kernel matrix between all training points, adding observation noise to the diagonal, and solving the resulting linear system. Prediction at new test points requires computing the cross-kernel between test and training points, then applying the posterior formulas. The computational cost scales as \(O(N^3)\) for \(N\) training points (due to matrix inversion), which limits standard GPs to a few thousand observations. For larger datasets, sparse approximations (inducing points) or scalable GP methods reduce this to \(O(NM^2)\) where \(M \ll N\) is the number of inducing points.

# Fit GP
gp = GaussianProcess(length_scale=0.3, variance=1.0, noise=0.01)
gp.fit(X_train, y_train)

# Predict
mu, std = gp.predict(X_test, return_std=True)

print(f"Predictions: mean {mu.shape}, std {std.shape}")

Visualize GP Regression

The GP regression plot shows the predictive mean as a line, the 95% confidence interval as a shaded region, and the training points as markers. Near observed data, the confidence interval narrows (the GP is certain); far from observations, it widens (the GP is uncertain). This uncertainty visualization is the defining advantage of GPs over standard neural networks, which typically produce point predictions without calibrated confidence. In applications like Bayesian optimization, active learning, and autonomous systems, these uncertainty estimates guide where to collect new data or when to defer to a human.

plt.figure(figsize=(12, 6))

# True function
plt.plot(X_test, y_test, 'k--', label='True function', linewidth=2)

# GP mean
plt.plot(X_test, mu, 'b-', label='GP mean', linewidth=2)

# Uncertainty (2 std)
plt.fill_between(X_test.ravel(), mu - 2*std, mu + 2*std, alpha=0.3, label='±2σ')

# Training data
plt.scatter(X_train, y_train, c='r', s=50, zorder=10, label='Training data')

plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Gaussian Process Regression', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Prior Samples

Sampling functions from the GP prior (before observing any data) illustrates what the kernel encodes about function properties. An RBF kernel produces smooth functions; a Matern-1/2 kernel produces rough, non-differentiable functions; a periodic kernel produces periodic functions. The kernel’s length scale controls how quickly the function varies: short length scales produce rapidly oscillating samples, while long length scales produce slowly varying ones. Visualizing prior samples is an important step in GP modeling because it verifies that the chosen kernel matches prior beliefs about the function being modeled.

# Sample from GP prior
X_prior = np.linspace(-2, 2, 100).reshape(-1, 1)
K_prior = rbf_kernel(X_prior, X_prior, length_scale=0.5, variance=1.0)
K_prior += 1e-6 * np.eye(len(X_prior))

# Draw samples
samples = np.random.multivariate_normal(np.zeros(len(X_prior)), K_prior, size=5)

# Plot
plt.figure(figsize=(12, 6))
for i in range(5):
    plt.plot(X_prior, samples[i], alpha=0.7, linewidth=2)

plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('GP Prior Samples', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Kernel Comparison

Comparing different kernels on the same regression task reveals how kernel choice affects both the fit quality and the uncertainty estimates. The RBF (squared exponential) kernel assumes infinite smoothness and works well for smooth functions. The Matern family provides a controllable smoothness parameter \(\nu\): \(\nu = 1/2\) gives the Ornstein-Uhlenbeck process (rough), \(\nu = 3/2\) is once-differentiable, and \(\nu \to \infty\) recovers the RBF. Kernel hyperparameters (length scale, variance) are optimized by maximizing the marginal likelihood \(p(y | X, \theta)\), which automatically balances fit quality with model complexity – an elegant form of Bayesian model selection.

7.5. Sparse Gaussian Processes: Scaling to Large Data

The Scalability Problem

Standard GP: \(O(n^3)\) time, \(O(n^2)\) memory

Not feasible for \(n > 10,000\)!

Solution 1: Subset of Data (Subset of Regressors)

Use only \(m \ll n\) "inducing points" \(Z = \{z_1, \ldots, z_m\}\).

Approximation:

\[p(f_*|X, \mathbf{y}) \approx p(f_*|Z, \mathbf{u})\]

where \(\mathbf{u} = f(Z)\) are function values at inducing points.

Complexity: \(O(nm^2)\) instead of \(O(n^3)\)

Typically \(m \approx 100-1000\).

Solution 2: Variational Sparse GP (SVGP)

Key idea: Variational inference with inducing points.

True posterior: \(p(\mathbf{u}|X, \mathbf{y}) = \frac{p(\mathbf{y}|X, \mathbf{u})\,p(\mathbf{u})}{p(\mathbf{y}|X)}\)

Approximate with: \(q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, \mathbf{S})\)

ELBO (variational bound):

\[\mathcal{L} = \sum_{i=1}^n \mathbb{E}_{q(f_i)}[\log p(y_i|f_i)] - \text{KL}(q(\mathbf{u}) \| p(\mathbf{u}))\]

Optimize \(\{Z, \mathbf{m}, \mathbf{S}\}\) to maximize \(\mathcal{L}\).

Advantages:

  • \(O(nm^2)\) complexity

  • Works with mini-batches!

  • Can handle millions of points

Stochastic optimization:

\[\mathcal{L}_{\text{batch}} = \frac{n}{|B|}\sum_{i \in B} \mathbb{E}_{q(f_i)}[\log p(y_i|f_i)] - \text{KL}(q(\mathbf{u}) \| p(\mathbf{u}))\]

Solution 3: Structured Kernels

For grid data (images, time series):

Use Kronecker structure or Toeplitz structure.

Example (2D grid):

\[K = K_x \otimes K_y\]

where \(\otimes\) is Kronecker product.

Matrix-vector multiplication: \(O(n)\) instead of \(O(n^2)\) using FFT!

Applications:

  • Spatiotemporal data

  • Image processing

  • Efficient GP on grids
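The Kronecker matrix-vector trick can be sketched in a few lines of NumPy: for a 2D grid, multiplying by \(K_x \otimes K_y\) reduces to two small matrix products (grid sizes and length scale below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 20, 30

def rbf(g, ls=0.5):
    d2 = (g[:, None] - g[None, :]) ** 2
    return np.exp(-d2 / (2 * ls**2))

Kx, Ky = rbf(np.linspace(0, 1, nx)), rbf(np.linspace(0, 1, ny))
v = rng.standard_normal(nx * ny)

# Naive: materialize the full (nx*ny) x (nx*ny) Kronecker product
naive = np.kron(Kx, Ky) @ v

# Structured: with row-major flattening, (Kx kron Ky) v = (Kx @ V @ Ky.T).ravel()
V = v.reshape(nx, ny)
fast = (Kx @ V @ Ky.T).ravel()
```

The structured version never forms the full matrix, dropping a matrix-vector product from \(O(n^2)\) to \(O(n(n_x + n_y))\) work and \(O(n_x^2 + n_y^2)\) storage.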

Comparison of Sparse Methods

Method     | Complexity      | Accuracy    | Notes
Full GP    | \(O(n^3)\)      | Exact       | \(n < 10K\)
Subset     | \(O(nm^2)\)     | Approximate | Simple, fast
FITC       | \(O(nm^2)\)     | Better      | Fixed inducing points
SVGP       | \(O(nm^2)\)     | Best        | Optimized inducing points
Structured | \(O(n \log n)\) | Exact       | Grid data only

When to use:

  • \(n < 10K\): Full GP

  • \(10K < n < 1M\): SVGP with \(m \approx 500\)

  • Grid data: Structured kernels

  • \(n > 1M\): Deep GPs or Neural Processes

Inducing Point Selection

Strategies:

  1. K-means clustering:

    • Cluster data, use centroids as \(Z\)

    • Simple, fast

    • Doesn’t account for GP structure

  2. Greedy selection:

    • Iteratively add points that maximize marginal likelihood

    • Better coverage

    • \(O(nm^3)\) cost

  3. Variational optimization:

    • Treat \(Z\) as parameters, optimize with gradient descent

    • Best performance

    • Used in SVGP

  4. Pseudo-inputs:

    • Not constrained to training data

    • Can be anywhere in input space

    • Most flexible

Practical: Start with K-means, refine with gradients.
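A minimal sketch of strategy 1, plain Lloyd's k-means in NumPy to pick initial inducing locations (the helper name is made up for illustration; in practice sklearn's KMeans or gradient refinement per strategy 3 would follow):

```python
import numpy as np

def kmeans_inducing(X, m, iters=20, seed=0):
    """Pick m inducing points as k-means centroids of the training inputs."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), size=m, replace=False)].copy()
    for _ in range(iters):
        # assign each input to its nearest current centroid
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for j in range(m):
            members = X[assign == j]
            if len(members):
                Z[j] = members.mean(axis=0)   # move centroid to cluster mean
    return Z

X = np.random.default_rng(1).normal(size=(500, 2))
Z = kmeans_inducing(X, m=10)
```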

Code Example: SVGP

import gpytorch

class SVGPModel(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution,
            learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )
    
    def forward(self, x):
        mean = self.mean_module(x)
        covar = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, covar)

# Convert training data to tensors; subsample inputs as initial inducing
# locations (use m ~ 500 for large datasets)
X_t = torch.as_tensor(X_train, dtype=torch.float32)
y_t = torch.as_tensor(y_train, dtype=torch.float32)
model = SVGPModel(X_t[::10].clone())
likelihood = gpytorch.likelihoods.GaussianLikelihood()

# Train with stochastic optimization (full batch here; use a DataLoader
# with mini-batches when n is large)
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(likelihood.parameters()), lr=0.01
)
elbo = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=len(X_t))

model.train(); likelihood.train()
for epoch in range(100):
    optimizer.zero_grad()
    output = model(X_t)
    loss = -elbo(output, y_t)
    loss.backward()
    optimizer.step()

# Kernel comparison: different length scales
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

length_scales = [0.1, 0.3, 1.0]

for idx, ls in enumerate(length_scales):
    gp_test = GaussianProcess(length_scale=ls, variance=1.0, noise=0.01)
    gp_test.fit(X_train, y_train)
    mu_test, std_test = gp_test.predict(X_test)
    
    axes[idx].plot(X_test, y_test, 'k--', label='True', linewidth=2)
    axes[idx].plot(X_test, mu_test, 'b-', label='GP mean', linewidth=2)
    axes[idx].fill_between(X_test.ravel(), mu_test - 2*std_test, mu_test + 2*std_test, alpha=0.3)
    axes[idx].scatter(X_train, y_train, c='r', s=40, zorder=10)
    
    axes[idx].set_title(f'Length scale = {ls}', fontsize=12)
    axes[idx].set_xlabel('x', fontsize=11)
    axes[idx].set_ylabel('y', fontsize=11)
    axes[idx].legend(fontsize=10)
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Summary

Gaussian Processes:

Key Concepts:

  1. Distribution over functions

  2. Specified by mean and kernel

  3. Exact Bayesian inference

  4. Uncertainty quantification

Kernels:

  • RBF: Smooth, infinitely differentiable

  • Matérn: Control smoothness via ν

  • Periodic: For periodic data

  • Linear: For linear trends

Hyperparameters:

  • Length scale: Controls smoothness

  • Variance: Controls amplitude

  • Noise: Observation noise level

Advantages:

  • Principled uncertainty

  • Kernel flexibility

  • Small data regime

  • Interpretable

Challenges:

  • O(n³) complexity

  • Kernel choice matters

  • Hyperparameter tuning

Applications:

  • Bayesian optimization

  • Time series forecasting

  • Calibration

  • Active learning

Extensions:

  • Sparse GPs (inducing points)

  • Deep GPs (composition)

  • Variational GPs (scalability)

  • Neural network GPs

Advanced GP Topics and Extensions

1. Deep Gaussian Processes

Idea: Compose multiple GP layers

\[f_1(x) \sim \mathcal{GP}(0, k_1)\]
\[f_2(f_1) \sim \mathcal{GP}(0, k_2)\]
\[\vdots\]
\[y | f_L \sim \mathcal{N}(f_L(x), \sigma^2)\]

Advantages:

  • More expressive than single GP

  • Hierarchical representations

  • Better for complex patterns

Challenges:

  • No closed-form posterior

  • Need approximate inference (variational)

  • Harder to train

Applications:

  • Image modeling

  • Hierarchical time series

  • Transfer learning

2. Neural Network Gaussian Processes

Infinite-width neural networks = Gaussian processes!

Neal’s Theorem (1996): As width \(\to \infty\), Bayesian NN converges to GP.

Kernel for single hidden layer:

\[k(x, x') = \mathbb{E}_{\theta}[\sigma(w^T x) \sigma(w^T x')]\]

where \(w \sim \mathcal{N}(0, I)\).

For ReLU: \(k(x, x') = \frac{\|x\|\|x'\|}{2\pi}(\sin\theta + (\pi - \theta)\cos\theta)\)

where \(\theta = \cos^{-1}\left(\frac{x^T x'}{\|x\|\|x'\|}\right)\)
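The ReLU kernel above is straightforward to evaluate in NumPy (a sketch; the helper name is ours, not a library function):

```python
import numpy as np

def relu_nngp_kernel(X1, X2):
    """Arc-cosine kernel: k(x,x') = ||x|| ||x'|| / (2*pi) * (sin(t) + (pi - t) cos(t))."""
    n1 = np.linalg.norm(X1, axis=1)
    n2 = np.linalg.norm(X2, axis=1)
    # clip guards against round-off pushing cosines slightly outside [-1, 1]
    cos_t = np.clip((X1 @ X2.T) / np.outer(n1, n2), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return np.outer(n1, n2) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

X = np.random.default_rng(0).normal(size=(4, 3))
K = relu_nngp_kernel(X, X)
# On the diagonal theta = 0, so k(x, x) = ||x||^2 / 2
```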

Neural Tangent Kernel (NTK):

Evolution of infinite-width NN during training = GP with NTK!

\[k_{\text{NTK}}(x, x') = \langle \nabla_\theta f(x; \theta_0), \nabla_\theta f(x'; \theta_0) \rangle\]

Practical: Can compute NN kernels, use with GP inference!

3. Multi-Output Gaussian Processes

Problem: Predict multiple correlated outputs.

Example: Robot arm - predict (x, y, z) positions simultaneously.

Solution: Multi-output GP with cross-output correlations.

Linear Model of Coregionalization (LMC):

\[f_i(x) = \sum_{q=1}^Q w_{iq} u_q(x)\]

where \(u_q \sim \mathcal{GP}(0, k_q)\).

Covariance between outputs \(i, j\) at locations \(x, x'\):

\[\text{Cov}[f_i(x), f_j(x')] = \sum_{q=1}^Q w_{iq} w_{jq} k_q(x, x')\]

Computational cost: \(O((Pn)^3)\) for \(P\) outputs!

Sparse approximation: Use inducing points per output.

4. Gaussian Process Latent Variable Models (GPLVM)

Problem: High-dimensional data, unknown latent structure.

Generative model:

\[\mathbf{x}_i \sim \mathcal{N}(0, I) \quad \text{(latent, low-dim)}\]
\[\mathbf{y}_i | \mathbf{x}_i \sim \mathcal{GP}(0, k(\mathbf{x}, \mathbf{x}'))\]

Learn: Latent coordinates \(X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\) and kernel hyperparameters.

Optimization:

\[\max_{X, \theta} \log p(\mathbf{Y} | X, \theta) = -\frac{D}{2}\log|K_X| - \frac{1}{2}\text{tr}(K_X^{-1}\mathbf{Y}\mathbf{Y}^T) + \text{const}\]

Applications:

  • Dimensionality reduction (like PCA but nonlinear)

  • Visualization

  • Motion capture analysis

Bayesian GPLVM: Treat \(X\) as random, use variational inference.

5. Warped Gaussian Processes

Problem: GP assumes Gaussian likelihood. What if data is not Gaussian?

Solution: Apply monotonic transformation \(g\):

\[y = g(f(x)) + \epsilon\]

where \(f \sim \mathcal{GP}\).

Example transformations:

  • Log: For positive data

  • Logit: For bounded [0,1] data

  • Box-Cox: \(g(y) = \frac{y^\lambda - 1}{\lambda}\)

Learn: Jointly optimize \(\lambda\) and GP hyperparameters.

6. Student-t Processes

Heavy-tailed alternative to GP:

\[f \sim \mathcal{TP}_\nu(m, k)\]

Student-t process with \(\nu\) degrees of freedom.

Marginals:

\[f(x_1), \ldots, f(x_n) \sim \text{Student-t}_\nu(0, K)\]

Robust to outliers!

Relationship to GP:

  • \(\nu = 1\): Cauchy process (very heavy tails)

  • \(\nu \to \infty\): Gaussian process

  • Typical: \(\nu = 3\) or \(\nu = 5\)

7. Heteroscedastic GPs

Standard GP: Constant noise \(\sigma_n^2\).

Heteroscedastic: Noise varies with input!

\[y = f(x) + \epsilon(x), \quad \epsilon(x) \sim \mathcal{N}(0, \sigma^2(x))\]

Model noise as another GP:

\[\log \sigma^2(x) \sim \mathcal{GP}(m_\sigma, k_\sigma)\]

Joint inference: Optimize both \(f\) and \(\sigma^2\).

Applications:

  • Financial data (volatility)

  • Sensor noise varies by location

  • More accurate uncertainty

8. Bayesian Optimization with GPs

Problem: Optimize expensive black-box function \(f(x)\).

Example: Tune hyperparameters (each evaluation = full training run).

Algorithm:

  1. GP prior over \(f\)

  2. Observe \(f(x_1), \ldots, f(x_n)\)

  3. Update GP posterior

  4. Acquisition function \(a(x)\) suggests next point:

    • UCB: \(a(x) = \mu(x) + \beta \sigma(x)\) (explore-exploit)

    • EI: Expected Improvement over current best

    • PI: Probability of Improvement

  5. Evaluate \(x_{n+1} = \arg\max a(x)\)

  6. Repeat

Expected Improvement:

\[\text{EI}(x) = \mathbb{E}[\max(f(x) - f^+, 0)]\]

where \(f^+ = \max_{i} f(x_i)\) is current best.

Closed form (if GP posterior is Gaussian):

\[\text{EI}(x) = (\mu(x) - f^+)\Phi(Z) + \sigma(x)\phi(Z)\]

where \(Z = \frac{\mu(x) - f^+}{\sigma(x)}\), \(\Phi\) is CDF, \(\phi\) is PDF.
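The closed form translates directly into a few lines of SciPy (maximization convention, as above; the function name is illustrative):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI(x) = (mu - f_best) * Phi(Z) + sigma * phi(Z), with Z = (mu - f_best) / sigma."""
    sigma = np.maximum(sigma, 1e-12)          # guard against zero predictive std
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

# High mean OR high uncertainty both raise EI (exploitation vs exploration)
mu = np.array([0.0, 0.0, 2.0])
sigma = np.array([0.1, 1.0, 0.1])
ei = expected_improvement(mu, sigma, f_best=1.0)
```

With `f_best = 1`, the confident-but-poor candidate scores near zero, the uncertain one scores higher (exploration), and the confident improvement scores highest.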

Why it works:

  • High \(\mu(x)\): Likely to improve (exploitation)

  • High \(\sigma(x)\): Uncertain, might be great (exploration)

Applications:

  • Hyperparameter tuning (Spearmint, GPyOpt)

  • Neural architecture search

  • Robotics (policy search)

  • Drug discovery

Comparison: GPs vs Neural Networks

Aspect           | Gaussian Processes                            | Neural Networks
Uncertainty      | ✅ Principled, calibrated                     | ⚠️ Requires special methods
Small data       | ✅ Excellent (\(n < 1000\))                   | ❌ Needs lots of data
Scalability      | ❌ \(O(n^3)\); sparse methods for large \(n\) | ✅ Scales well
Interpretability | ✅ Kernel choice interpretable                | ❌ Black box
Flexibility      | ⚠️ Limited by kernel                          | ✅ Universal approximator
Feature learning | ❌ Manual kernel design                       | ✅ Automatic
Optimization     | ✅ Convex (often)                             | ❌ Non-convex

When to use GPs:

  • Small/medium datasets (\(n < 10K\))

  • Need uncertainty quantification

  • Prior knowledge (kernel structure)

  • Safety-critical applications

  • Bayesian optimization

When to use NNs:

  • Large datasets (\(n > 100K\))

  • Complex patterns (images, text)

  • Raw features (no engineering)

  • Don’t need calibrated uncertainty

  • Computational resources available

Hybrid: Combine both! Deep kernel learning, Neural Processes.

Advanced Gaussian Processes Theory

1. Introduction to Gaussian Processes

1.1 From Bayesian Linear Regression to GPs

Bayesian Linear Regression:

f(x) = wᵀφ(x),  w ~ N(0, Σ_p)
  • Prior over weights w

  • Induces a prior over functions f

Gaussian Process: Generalization to infinitely many basis functions

f ~ GP(m(x), k(x, x'))
  • Mean function: m(x) = E[f(x)] (typically 0)

  • Covariance (kernel): k(x, x') = Cov[f(x), f(x')]

Key property: Any finite collection of function values is jointly Gaussian

[f(x₁), ..., f(xₙ)]ᵀ ~ N(m, K)
where Kᵢⱼ = k(xᵢ, xⱼ)

1.2 Why Gaussian Processes?

Advantages:

  1. Uncertainty quantification: Predictive variance σ²(x*)

  2. Non-parametric: Complexity grows with data (no fixed architecture)

  3. Interpretable: Kernel encodes prior assumptions

  4. Theoretical guarantees: Convergence, calibration

  5. Sample efficiency: Strong performance with small data

Disadvantages:

  1. Cubic complexity: O(N³) for N data points

  2. Kernel choice: Requires domain knowledge

  3. Scalability: Difficult for N > 10,000

  4. Categorical inputs: Less natural than for continuous

2. Kernel Functions

2.1 Properties of Valid Kernels

Positive semi-definite: K ⪰ 0 for any finite set x₁, …, xₙ

Mercer’s theorem: k(x, x') corresponds to an inner product in some Hilbert space

k(x, x') = ⟨φ(x), φ(x')⟩_H

Constructing new kernels:

  • Sum: k₁(x, x') + k₂(x, x')

  • Product: k₁(x, x') · k₂(x, x')

  • Scaling: c · k(x, x')

2.2 Common Kernel Functions

Squared Exponential (RBF):

k_SE(x, x') = σ² exp(-||x - x'||²/(2ℓ²))
  • σ²: Signal variance (vertical scale)

  • ℓ: Length scale (horizontal scale)

  • Properties: Infinitely differentiable, smooth functions

  • Use: Default choice, general-purpose

Matérn family:

k_Matérn(r) = (2^(1-ν)/Γ(ν)) (√(2ν)r/ℓ)^ν K_ν(√(2ν)r/ℓ)
  • ν: Smoothness parameter

    • ν = 1/2: Exponential kernel (continuous, not differentiable)

    • ν = 3/2: Once differentiable

    • ν = 5/2: Twice differentiable

    • ν → ∞: RBF kernel

  • Use: When roughness is important (e.g., sensor data)

Periodic kernel:

k_per(x, x') = σ² exp(-2sin²(π|x-x'|/p)/ℓ²)
  • p: Period

  • Use: Seasonal data, oscillations

Linear kernel:

k_lin(x, x') = σ_b² + σ_v² (x - c)(x' - c)
  • Use: Linear trends, baseline comparison

Rational Quadratic:

k_RQ(x, x') = σ² (1 + ||x-x'||²/(2αℓ²))^(-α)
  • α: Scale mixture parameter

  • Interpretation: Mixture of RBF kernels with different length scales

  • Use: Multi-scale phenomena

2.3 Kernel Design Patterns

Additive kernels (decomposition):

k(x, x') = k₁(x₁, x'₁) + k₂(x₂, x'₂)
  • Model independent contributions from different dimensions

Multiplicative kernels (interactions):

k(x, x') = k₁(x, x') · k₂(x, x')
  • Capture interactions between features

Spectral Mixture Kernel (Wilson & Adams, 2013):

k_SM(τ) = Σᵢ wᵢ exp(-2π²τ²μᵢ²) cos(2πτνᵢ)
  • Learn a mixture of spectral components

  • Highly expressive

3. GP Regression (Kriging)

3.1 Posterior Distribution

Setup:

  • Data: D = {(xᵢ, yᵢ)}ᵢ₌₁ᴺ

  • Model: y = f(x) + ε, ε ~ N(0, σ²_n)

  • Prior: f ~ GP(0, k)

Posterior predictive distribution at x*:

p(f*|x*, X, y) = N(μ*, σ²*)

μ* = k*ᵀ (K + σ²_n I)⁻¹ y
σ²* = k** - k*ᵀ (K + σ²_n I)⁻¹ k*

where:

  • K: N×N kernel matrix, Kᵢⱼ = k(xᵢ, xⱼ)

  • k*: N×1 vector, k*ᵢ = k(xᵢ, x*)

  • k**: scalar, k** = k(x*, x*)

Interpretation:

  • μ*: Weighted sum of observations (similar inputs → high weight)

  • σ²*: Reduces near data, high in unexplored regions

3.2 Marginal Likelihood (Evidence)

Log marginal likelihood:

log p(y|X, θ) = -½ yᵀ(K + σ²_n I)⁻¹y - ½ log|K + σ²_n I| - (N/2)log(2π)
                 \_____data fit______/    \___complexity___/   \__const_/

Hyperparameter optimization: θ* = argmax log p(y|X, θ)

  • Typically: θ = {ℓ, σ², σ²_n} (length scale, signal variance, noise)

  • Gradient-based optimization (L-BFGS, Adam)

Gradient (with respect to a kernel hyperparameter θ):

∂log p(y|X, θ)/∂θ = ½ tr[(ααᵀ - K_y⁻¹) ∂K_y/∂θ]
where K_y = K + σ²_n I and α = K_y⁻¹y

3.3 Computational Complexity

Naive implementation:

  • Inversion: O(N³)

  • Storage: O(N²)

Bottleneck: Solving (K + σ²_n I)⁻¹y

Cholesky decomposition (standard approach):

  1. Compute L: K + σ²_n I = LLᵀ (O(N³))

  2. Solve Lz = y via forward substitution (O(N²))

  3. Solve Lᵀα = z via backward substitution (O(N²)), giving α = (K + σ²_n I)⁻¹y

Total: O(N³) training, O(N²) prediction per test point

4. Scalable Gaussian Processes

4.1 Inducing Points (Sparse GPs)

Idea: Approximate the GP with M << N inducing points Z

Variational Free Energy (Titsias, 2009):

F = E_q[log p(y|f)] - KL(q(u) || p(u))

Predictive distribution:

q(f*|x*) = ∫ p(f*|u) q(u) du

where u = f(Z) are the inducing function values

Complexity: O(NM²) training, O(M²) prediction

SVGP (Stochastic Variational GP):

  • Minibatch training

  • Scales to millions of data points

  • Complexity: O(BM²) per iteration (B = batch size)

4.2 Structured Kernel Interpolation (SKI)

Idea: Approximate the kernel with grid interpolation

Steps:

  1. Place M inducing points on a regular grid

  2. Interpolate: k(x, x') ≈ wᵀK_uu w'

  3. Use FFT for fast matrix-vector products

Complexity: O(N + M log M) with FFT

KISS-GP (Kernel Interpolation for Scalable Structured GPs):

  • Combines SKI with minibatching

  • Scales to millions in 1D-3D

4.3 Random Features (Fourier Features)

Bochner’s theorem: Stationary kernel ↔ Fourier transform of a finite measure

For the RBF kernel:

k(x, x') = E_ω[cos(ωᵀx) cos(ωᵀx') + sin(ωᵀx) sin(ωᵀx')]
         ≈ (1/M) Σₘ φ(x; ωₘ)ᵀ φ(x'; ωₘ)

Random Fourier Features (Rahimi & Recht, 2007):

φ(x) = √(1/M) [cos(ω₁ᵀx), ..., cos(ωₘᵀx), sin(ω₁ᵀx), ..., sin(ωₘᵀx)]ᵀ
ωₘ ~ p(ω) (spectral density)

Approximation:

k(x, x') ≈ φ(x)ᵀφ(x')
f(x) ≈ φ(x)ᵀw,  w ~ N(0, I)

Complexity: O(NM) training (linear regression), O(M) prediction

Fastfood: Structured random features using the Hadamard transform (O(M log M))

4.4 Deep Kernel LearningΒΆ

Combine NN feature extractor with GP:

f(x) = GP(Ο†_NN(x))
k(x, x') = k_GP(Ο†_NN(x), Ο†_NN(x'))

Training: Jointly optimize NN parameters and GP hyperparameters

Advantages:

  • NN learns features (representation learning)

  • GP provides uncertainty (calibrated predictions)

Implementation (Wilson et al., 2016):

  • Use spectral mixture kernel on learned features

  • Inducing points for scalability

5. GP ClassificationΒΆ

5.1 Binary ClassificationΒΆ

Model:

f ~ GP(0, k)
y ∈ {-1, +1}
p(y=+1|f) = Οƒ(f)  (logistic or probit)

Challenge: Posterior p(f|X, y) is non-Gaussian (non-conjugate likelihood)

Approaches:

  1. Laplace approximation: Gaussian at mode

  2. Expectation Propagation (EP): Iterative refinement

  3. Variational inference: Lower bound

  4. MCMC: Sampling (expensive)

5.2 Laplace ApproximationΒΆ

Steps:

  1. Find MAP: fΜ‚ = argmax p(f|X, y)

  2. Compute Hessian: H = -βˆ‡Β²log p(f|X, y)|_{fΜ‚}

  3. Approximate: p(f|X, y) β‰ˆ N(fΜ‚, H⁻¹)

Predictive distribution:

p(y*=+1|x*, X, y) β‰ˆ ∫ Οƒ(f*) N(f*|ΞΌ*, σ²*) df*
  • Requires 1D integral (Gaussian quadrature or probit approximation)

Limitation: Underestimates uncertainty
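
The 1D integral above has a standard closed-form ("probit") approximation, ∫ Οƒ(f) N(f|ΞΌ, σ²) df β‰ˆ Οƒ(ΞΌ/√(1 + πσ²/8)). A sketch comparing it with a brute-force quadrature reference (function names are illustrative):

```python
import numpy as np

def probit_approx(mu, var):
    """MacKay-style approximation to the logistic-Gaussian integral."""
    return 1.0 / (1.0 + np.exp(-mu / np.sqrt(1.0 + np.pi * var / 8.0)))

def logistic_gaussian_quadrature(mu, var, n=4001):
    """Dense trapezoidal reference for the same 1D integral."""
    f = np.linspace(mu - 10 * np.sqrt(var), mu + 10 * np.sqrt(var), n)
    g = (1.0 / (1.0 + np.exp(-f))
         * np.exp(-(f - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var))
    return np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(f))
```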

5.3 Expectation PropagationΒΆ

Idea: Approximate each likelihood term with Gaussian

p(yα΅’|fα΅’) β‰ˆ tα΅’(fα΅’) = sα΅’ N(fα΅’|ΞΌΜƒα΅’, ΟƒΜƒα΅’Β²)

Algorithm:

  1. Initialize site approximations tα΅’

  2. For each factor i:

    • Remove tα΅’ from posterior (cavity distribution)

    • Compute new tα΅’ via moment matching

    • Update posterior

  3. Repeat until convergence

Advantage: More accurate uncertainty than Laplace

Complexity: O(NΒ³) (same as regression)

6. Multi-Output Gaussian ProcessesΒΆ

6.1 MotivationΒΆ

Goal: Model P correlated outputs jointly

  • Example: Multi-task learning, vector-valued functions

Naive approach: Independent GPs for each output

  • Ignores correlations

Multi-output GP: Joint distribution over all outputs

6.2 Linear Model of Coregionalization (LMC)ΒΆ

Idea: Each output is linear combination of Q latent GPs

fβ‚š(x) = Ξ£_q aβ‚šq gq(x)
where gq ~ GP(0, kq)

Kernel:

k((x, p), (x', p')) = Ξ£_q aβ‚šq aβ‚š'q kq(x, x')
                    = [AAα΅€]β‚šβ‚š' k(x, x')   (when all kq = k: the intrinsic coregionalization model)
  • A: PΓ—Q mixing matrix

  • Encodes output correlations

Complexity: O((NP)Β³) without approximations
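
In the shared-kernel special case (all kq = k), the joint covariance over the NΒ·P stacked outputs factors as a Kronecker product B βŠ— K with B = AAα΅€. A sketch (the helper name is illustrative):

```python
import numpy as np

def icm_kernel(K_x, A):
    """Intrinsic coregionalization: full covariance over P outputs at N inputs.

    K_x: [N, N] input kernel matrix; A: [P, Q] mixing matrix.
    Returns [N*P, N*P] covariance, ordered output-major.
    """
    B = A @ A.T          # [P, P] output covariance
    return np.kron(B, K_x)
```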

6.3 Convolutional GPsΒΆ

Idea: Outputs are convolutions of latent GPs

fβ‚š(x) = ∫ wβ‚š(x - z) g(z) dz

Advantage: Induces structured correlations

Applications: Spatial data, time series

7. Advanced TopicsΒΆ

7.1 Additive GPsΒΆ

Decomposition:

f(x) = f₁(x₁) + fβ‚‚(xβ‚‚) + ... + f_D(x_D)

Kernel:

k(x, x') = Ξ£_d k_d(x_d, x'_d)

Advantages:

  • Interpretability (feature importance)

  • Scalability (lower effective dimension)

  • ANOVA decomposition

SALSA (Structured Additive Sparse GPs):

  • Sparse approximation for each term

  • Complexity: O(D Β· MΒ²)

7.2 Warped GPsΒΆ

Problem: GP assumes Gaussian noise, but data may be skewed

Solution: Learn monotonic warp function g

y = g(f(x)) + Ξ΅

Training: Maximize marginal likelihood over f and g jointly

Applications: Positive-only outputs, heavy-tailed noise

7.3 Heteroscedastic GPsΒΆ

Problem: Noise variance σ²_n may vary with input

Model:

log σ²(x) ~ GP(m_Οƒ, k_Οƒ)
f(x) ~ GP(m, k)
y|f, σ² ~ N(f, σ²)

Inference: Variational or MCMC (non-conjugate)

7.4 Robust GPsΒΆ

Student-t likelihood (heavy-tailed):

p(y|f) = t_Ξ½(y|f, σ²)
  • Ξ½: Degrees of freedom (lower β†’ heavier tails)

  • Robust to outliers

Inference: Laplace or EP

7.5 Bayesian Optimization with GPsΒΆ

Goal: Optimize expensive black-box function f(x)

Acquisition functions:

  1. Upper Confidence Bound (UCB):

    UCB(x) = ΞΌ(x) + Ξ² Οƒ(x)
    
    • Ξ²: Exploration-exploitation trade-off

  2. Expected Improvement (EI):

    EI(x) = E[max(f(x) - f⁺, 0)]
           = (ΞΌ - f⁺)Ξ¦(Z) + σφ(Z)
    
    • f⁺: Current best

    • Z = (ΞΌ - f⁺)/Οƒ

  3. Probability of Improvement (PI):

    PI(x) = Ξ¦((ΞΌ - f⁺)/Οƒ)
    

Algorithm:

  1. Fit GP to observed data

  2. Maximize acquisition function β†’ x_next

  3. Evaluate f(x_next), add to data

  4. Repeat

Applications: Hyperparameter tuning, drug discovery, experimental design

8. Infinite-Width Neural Networks as GPsΒΆ

8.1 Neal’s Theorem (1996)ΒΆ

Single hidden layer NN:

f(x) = b + (1/√H) Ξ£β‚• vβ‚• Οƒ(wβ‚•α΅€x)
where wβ‚• ~ N(0, Ξ£_w), vβ‚• ~ N(0, 1)

Limit as H β†’ ∞:

f(x) ~ GP(0, K_NNGP(x, x'))

Kernel (for ReLU):

K(x, x') = (||x|| ||x'||)/(2Ο€) Β· [sin ΞΈ + (Ο€ - ΞΈ) cos ΞΈ]
where ΞΈ = arccos(xα΅€x'/(||x|| ||x'||))
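
This arc-cosine expression is directly computable. A small sketch (clipping the cosine guards against round-off; at x = x' it reduces to ||x||Β²/2, the second moment of a ReLU unit under standard normal weights):

```python
import numpy as np

def relu_nngp_kernel(x, xp):
    """NNGP kernel of an infinite-width single-hidden-layer ReLU network."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    cos_theta = np.clip(x @ xp / (nx * nxp), -1.0, 1.0)  # numerical safety
    theta = np.arccos(cos_theta)
    return (nx * nxp) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
```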

8.2 Neural Tangent Kernel (NTK)ΒΆ

Observation (Jacot et al., 2018): In infinite-width limit, NN training dynamics are linear!

NTK:

Θ(x, x') = E[βŸ¨βˆ‡_ΞΈf(x; ΞΈβ‚€), βˆ‡_ΞΈf(x'; ΞΈβ‚€)⟩]

Training dynamics (gradient descent):

df_t/dt = -η Θ(X, X)(f_t - y)
  • Closed-form solution (kernel ridge regression)

Prediction:

f_∞(x) = Θ(x, X)[Θ(X, X) + λI]⁻¹y
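
This prediction formula is exactly kernel ridge regression with the NTK as the kernel. A sketch (any PSD kernel matrix can stand in for Θ):

```python
import numpy as np

def ntk_predict(Theta_star, Theta_train, y, lam=1e-6):
    """f_∞(x) = Θ(x, X) [Θ(X, X) + λI]⁻¹ y — kernel ridge regression."""
    n = len(y)
    return Theta_star @ np.linalg.solve(Theta_train + lam * np.eye(n), y)
```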

Implications:

  • Infinite-width NNs don’t learn features (kernel fixed)

  • Finite-width NNs differ: feature learning, non-linearity

8.3 Deep GPs vs. Deep NNsΒΆ

Correspondence:

  • Composing GP layers (finite depth L): deep GP with hierarchical feature learning

  • Width H β†’ ∞ in every layer: collapses to a single (shallow) GP β€” no feature learning

Deep GP kernel (compositional):

k_β„“(x, x') = F(k_{β„“-1}(x, x'), k_{β„“-1}(x, x), k_{β„“-1}(x', x'))
i.e., each layer's kernel is a deterministic function of the previous layer's covariances

9. Practical ConsiderationsΒΆ

9.1 Numerical StabilityΒΆ

Problem: K may be nearly singular (ill-conditioned)

Solutions:

  1. Add jitter: K + Ξ΅I (Ξ΅ β‰ˆ 1e-6)

  2. Cholesky with pivoting: Permute rows for stability

  3. Preconditioned conjugate gradients: Iterative solver

9.2 Kernel Hyperparameter InitializationΒΆ

Length scale β„“: Start with median pairwise distance

β„“β‚€ = median({||xα΅’ - xβ±Ό|| : i β‰  j})

Signal variance σ²: Match data variance

σ²₀ = Var(y)

Noise variance σ²_n: Small fraction of σ²

σ²_n,0 = 0.1 Β· Var(y)
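
These three heuristics fit in a few lines. A sketch (the helper name is illustrative; `scipy.spatial.distance.pdist` supplies the pairwise distances):

```python
import numpy as np
from scipy.spatial.distance import pdist

def init_hyperparams(X, y):
    """Median-distance / variance heuristics for GP hyperparameter initialization."""
    ell0 = np.median(pdist(X))   # length scale: median pairwise distance
    sigma2_0 = np.var(y)         # signal variance: data variance
    noise0 = 0.1 * sigma2_0      # noise variance: small fraction of signal
    return ell0, sigma2_0, noise0
```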

9.3 Cross-ValidationΒΆ

Leave-one-out (LOO) via marginal likelihood:

-log p(yα΅’|X, yβ‚‹α΅’) = Β½ log(2πσ²₋ᡒ) + (yα΅’ - ΞΌβ‚‹α΅’)Β²/(2σ²₋ᡒ)

Can compute efficiently without retraining (Ξ± = K⁻¹y):

ΞΌβ‚‹α΅’ = yα΅’ - Ξ±α΅’/[K⁻¹]α΅’α΅’
σ²₋ᡒ = 1/[K⁻¹]α΅’α΅’
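
These identities can be verified against explicit retraining. A sketch (K here already includes the noise term σ²_n I; the function name is illustrative):

```python
import numpy as np

def loo_predictions(K, y):
    """Closed-form leave-one-out predictions from a single inversion of K."""
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ y
    diag = np.diag(K_inv)
    mu_loo = y - alpha / diag   # ΞΌβ‚‹α΅’ = yα΅’ - Ξ±α΅’/[K⁻¹]α΅’α΅’
    var_loo = 1.0 / diag        # σ²₋ᡒ = 1/[K⁻¹]α΅’α΅’
    return mu_loo, var_loo
```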

10. Computational Complexity SummaryΒΆ

Method | Training | Prediction | Memory | Approximation
--- | --- | --- | --- | ---
Exact GP | O(NΒ³) | O(NΒ²) | O(NΒ²) | None
Sparse GP (SVGP) | O(NMΒ²) | O(MΒ²) | O(NM) | Variational
SKI/KISS-GP | O(N + M log M) | O(M log M) | O(M) | Interpolation
Random Features | O(NM) | O(M) | O(NM) | Monte Carlo
Laplace (classify) | O(NΒ³) | O(NΒ²) | O(NΒ²) | Mode approx
Deep Kernel | O(NMΒ²) | O(MΒ²) | O(NM) | + NN features

  • N: Data points

  • M: Inducing points / features (M << N)

Rule of thumb:

  • N < 5,000: Exact GP

  • 5,000 < N < 100,000: Sparse GP

  • N > 100,000: Random features or deep kernel

11. Software LibrariesΒΆ

11.1 Python LibrariesΒΆ

  • GPy: General-purpose GP library

  • GPyTorch: Scalable GPs in PyTorch (GPU support)

  • scikit-learn: Basic GP regression/classification

  • GPflow: TensorFlow-based, variational inference

  • PyMC3: Bayesian inference with GP priors

  • BoTorch: Bayesian optimization

11.2 R LibrariesΒΆ

  • kernlab: Kernel methods

  • mlegp: Maximum likelihood GP

  • tgp: Treed Gaussian processes

12. Benchmarks and ResultsΒΆ

12.1 Regression TasksΒΆ

UCI datasets (avg. RMSE):

  • Linear regression: 0.62

  • Random forest: 0.54

  • GP (RBF): 0.48

  • Deep GP: 0.45

Sample efficiency (RMSE with 100 samples):

  • Neural network: 0.75

  • GP: 0.52

12.2 Bayesian OptimizationΒΆ

Hyperparameter tuning (CV score after T evaluations):

  • Random search (T=50): 0.82

  • Grid search (T=50): 0.85

  • GP-EI (T=50): 0.91

Speed: GP-EI finds optimum 3Γ— faster than random

13. Connections to Other MethodsΒΆ

13.1 GP vs. Kernel Ridge RegressionΒΆ

Kernel Ridge Regression (KRR):

f(x) = Ξ£α΅’ Ξ±α΅’ k(x, xα΅’)
α = (K + λI)⁻¹y

Relationship: KRR = GP posterior mean (maximum a posteriori)

  • GP additionally provides uncertainty σ²(x)

13.2 GP vs. SplinesΒΆ

Splines: Piecewise polynomials with continuity constraints

Connection: Certain kernels (e.g., integrated Brownian motion) β†’ natural cubic smoothing splines

GP advantage: Generalizes to higher dimensions, arbitrary kernels

13.3 GP vs. RKHS MethodsΒΆ

Reproducing Kernel Hilbert Space (RKHS):

  • Function space with inner product ⟨f, g⟩_k

  • GP is probabilistic view of RKHS

Representer theorem: Optimal f lies in span of kernel evaluations

  • Justifies finite-dimensional approximations

14. Recent Advances (2018-2024)ΒΆ

14.1 Scalable Exact GPsΒΆ

LOVE (Lanczos Variance Estimates):

  • Lanczos iterations give fast predictive (co)variances after training

  • O(N log N) with structured kernels

MVM (Matrix-Vector Multiplication):

  • Exploit Toeplitz/Kronecker structure

  • FFT-based O(N log N)

14.2 Neural Process FamilyΒΆ

Neural Processes (Garnelo et al., 2018):

  • Meta-learning for GP-like uncertainty

  • Amortized inference (encoder-decoder)

Variants:

  • Conditional NP: Deterministic path

  • Attentive NP: Attention mechanism

  • Transformer NP: Full Transformer architecture

14.3 Infinite-Width Limits Beyond MLPsΒΆ

Convolutional NNGP (Garriga-Alonso et al., 2019):

  • Kernel for infinite-width CNNs

  • Spatial structure preserved

Transformer NNGP (Hron et al., 2020):

  • Kernel for infinite-width Transformers

  • Self-attention in function space

14.4 Functional PriorsΒΆ

GP-LVM (GP Latent Variable Model):

  • Learn low-dimensional latent space

  • Non-linear dimensionality reduction

Deep GPs (Damianou & Lawrence, 2013):

  • Composition of GPs: f_L(f_{L-1}(…f_1(x)))

  • Hierarchical feature learning

15. Limitations and Open ProblemsΒΆ

15.1 Current ChallengesΒΆ

  1. Scalability: Still O(NΒ³) for general kernels

  2. High dimensions: Curse of dimensionality (requires many samples)

  3. Discrete/categorical inputs: Less principled than continuous

  4. Model selection: Kernel choice is critical but difficult

  5. Non-stationarity: Real data often has varying local properties

15.2 Open Research QuestionsΒΆ

  • Adaptive kernels: Learn kernel structure from data

  • Scalable classification: GP classification still expensive

  • Deep GPs: How many layers? How to train?

  • Uncertainty decomposition: Epistemic vs. aleatoric for GPs

  • Meta-learning: Transfer kernel knowledge across tasks

16. Key TakeawaysΒΆ

  1. GPs are distributions over functions: Fully Bayesian, non-parametric

  2. Kernel encodes assumptions: Smoothness, periodicity, additivity

  3. Uncertainty for free: Predictive variance σ²(x*) naturally quantifies epistemic uncertainty

  4. Scalability via approximations: Inducing points, random features, structure

  5. Strong with small data: Sample-efficient, ideal for Bayesian optimization

  6. Connection to NNs: Infinite-width limit β†’ NNGP, NTK

When to use GPs:

  • Small data (N < 10,000)

  • Need uncertainty quantification

  • Interpretability important (kernel structure)

  • Bayesian optimization, active learning

When NOT to use:

  • Large data (N > 100,000) without approximations

  • High-dimensional inputs (D > 50)

  • Need feature learning (use deep kernel or NN)

17. ReferencesΒΆ

Foundational:

  • Rasmussen & Williams (2006): Gaussian Processes for Machine Learning (book)

  • Neal (1996): Bayesian Learning for Neural Networks

Scalable GPs:

  • Titsias (2009): Variational Learning of Inducing Variables (SVGP)

  • Hensman et al. (2013): Scalable Variational GP Regression

  • Wilson et al. (2015): Kernel Interpolation for Scalable GPs (KISS-GP)

  • Rahimi & Recht (2007): Random Features for Large-Scale Kernel Machines

Deep Kernels:

  • Wilson et al. (2016): Deep Kernel Learning

  • Wilson & Izmailov (2020): Bayesian Deep Learning (survey)

Infinite-Width NNs:

  • Jacot et al. (2018): Neural Tangent Kernel

  • Lee et al. (2018): Deep Neural Networks as GPs

  • Garriga-Alonso et al. (2019): Deep Convolutional Networks as Shallow GPs

Applications:

  • Snoek et al. (2012): Practical Bayesian Optimization (hyperparameter tuning)

  • Frazier (2018): Bayesian Optimization Tutorial

18. Practical WorkflowΒΆ

Step 1: Exploratory AnalysisΒΆ

  • Plot data, check for trends, periodicity

  • Compute pairwise distances (length scale estimate)

Step 2: Kernel SelectionΒΆ

  • Start with RBF (default)

  • Add periodic if seasonality detected

  • Use MatΓ©rn for rough data

  • Combine kernels (additive/multiplicative)

Step 3: Hyperparameter OptimizationΒΆ

  • Initialize: β„“ = median distance, σ² = Var(y)

  • Maximize log marginal likelihood

  • Use gradient-based optimizer (L-BFGS)

Step 4: Model ValidationΒΆ

  • Cross-validation (LOO or k-fold)

  • Check calibration (predictive intervals)

  • Residual analysis

Step 5: PredictionΒΆ

  • Compute posterior mean and variance

  • Plot with confidence bands

  • Identify regions of high uncertainty (active learning)

"""
Complete Gaussian Process Implementations
==========================================
Includes: GP regression with multiple kernels, scalable approximations (sparse GP, random features),
GP classification (Laplace approximation), Bayesian optimization, deep kernel learning.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize
from scipy.stats import norm
from scipy.linalg import cholesky, solve_triangular

# ============================================================================
# 1. Kernel Functions
# ============================================================================

class Kernel:
    """Base class for kernels."""
    def __call__(self, X1, X2=None):
        """
        Compute kernel matrix.
        
        Args:
            X1: [N1, D]
            X2: [N2, D] or None (use X1)
        Returns:
            K: [N1, N2] kernel matrix
        """
        raise NotImplementedError


class RBFKernel(Kernel):
    """
    Radial Basis Function (Squared Exponential) kernel.
    
    k(x, x') = σ² exp(-||x - x'||Β²/(2β„“Β²))
    
    Args:
        length_scale: Length scale β„“ (controls smoothness)
        signal_variance: Signal variance σ² (vertical scale)
    """
    def __init__(self, length_scale=1.0, signal_variance=1.0):
        self.length_scale = length_scale
        self.signal_variance = signal_variance
    
    def __call__(self, X1, X2=None):
        if X2 is None:
            X2 = X1
        
        # Compute squared distances
        # ||x - x'||Β² = ||x||Β² + ||x'||Β² - 2xΒ·x'
        X1_norm = np.sum(X1**2, axis=1, keepdims=True)
        X2_norm = np.sum(X2**2, axis=1, keepdims=True)
        distances_sq = X1_norm + X2_norm.T - 2 * X1 @ X2.T
        
        # RBF kernel
        K = self.signal_variance * np.exp(-distances_sq / (2 * self.length_scale**2))
        return K
    
    def gradient(self, X):
        """Gradient w.r.t. hyperparameters for marginal likelihood optimization."""
        K = self(X)
        
        # βˆ‚K/βˆ‚β„“
        X_norm = np.sum(X**2, axis=1, keepdims=True)
        distances_sq = X_norm + X_norm.T - 2 * X @ X.T
        dK_dl = K * distances_sq / self.length_scale**3
        
        # βˆ‚K/βˆ‚ΟƒΒ²
        dK_dsigma2 = K / self.signal_variance
        
        return {'length_scale': dK_dl, 'signal_variance': dK_dsigma2}


class MaternKernel(Kernel):
    """
    MatΓ©rn kernel (Ξ½ = 5/2).
    
    k(r) = σ² (1 + √5 r/β„“ + 5rΒ²/(3β„“Β²)) exp(-√5 r/β„“)
    
    Twice differentiable (smoother than Ξ½=3/2, rougher than RBF).
    Note: only Ξ½ = 5/2 is implemented; the `nu` argument is stored for reference.
    """
    def __init__(self, length_scale=1.0, signal_variance=1.0, nu=2.5):
        self.length_scale = length_scale
        self.signal_variance = signal_variance
        self.nu = nu
    
    def __call__(self, X1, X2=None):
        if X2 is None:
            X2 = X1
        
        # Euclidean distances
        X1_norm = np.sum(X1**2, axis=1, keepdims=True)
        X2_norm = np.sum(X2**2, axis=1, keepdims=True)
        distances = np.sqrt(np.maximum(
            X1_norm + X2_norm.T - 2 * X1 @ X2.T, 0))
        
        # MatΓ©rn 5/2
        sqrt5_r = np.sqrt(5) * distances / self.length_scale
        K = self.signal_variance * (1 + sqrt5_r + sqrt5_r**2 / 3) * np.exp(-sqrt5_r)
        
        return K


class PeriodicKernel(Kernel):
    """
    Periodic kernel.
    
    k(x, x') = σ² exp(-2 sinΒ²(Ο€|x-x'|/p) / β„“Β²)
    
    Args:
        period: Period p
        length_scale: Length scale β„“
        signal_variance: Signal variance σ²
    """
    def __init__(self, period=1.0, length_scale=1.0, signal_variance=1.0):
        self.period = period
        self.length_scale = length_scale
        self.signal_variance = signal_variance
    
    def __call__(self, X1, X2=None):
        if X2 is None:
            X2 = X1
        
        # Distances
        diffs = X1[:, np.newaxis, :] - X2[np.newaxis, :, :]
        distances = np.sqrt(np.sum(diffs**2, axis=2))
        
        # Periodic kernel
        sin_term = np.sin(np.pi * distances / self.period)
        K = self.signal_variance * np.exp(-2 * sin_term**2 / self.length_scale**2)
        
        return K


class LinearKernel(Kernel):
    """
    Linear kernel.
    
    k(x, x') = Οƒ_bΒ² + Οƒ_vΒ² (x - c)(x' - c)
    """
    def __init__(self, variance=1.0, bias=0.0, center=0.0):
        self.variance = variance
        self.bias = bias
        self.center = center
    
    def __call__(self, X1, X2=None):
        if X2 is None:
            X2 = X1
        
        X1_centered = X1 - self.center
        X2_centered = X2 - self.center
        
        K = self.bias + self.variance * (X1_centered @ X2_centered.T)
        return K


class AdditiveKernel(Kernel):
    """Sum of kernels: k(x, x') = k₁(x, x') + kβ‚‚(x, x')"""
    def __init__(self, kernel1, kernel2):
        self.kernel1 = kernel1
        self.kernel2 = kernel2
    
    def __call__(self, X1, X2=None):
        return self.kernel1(X1, X2) + self.kernel2(X1, X2)


class ProductKernel(Kernel):
    """Product of kernels: k(x, x') = k₁(x, x') Β· kβ‚‚(x, x')"""
    def __init__(self, kernel1, kernel2):
        self.kernel1 = kernel1
        self.kernel2 = kernel2
    
    def __call__(self, X1, X2=None):
        return self.kernel1(X1, X2) * self.kernel2(X1, X2)


# ============================================================================
# 2. Gaussian Process Regression
# ============================================================================

class GaussianProcessRegressor:
    """
    Gaussian Process Regression.
    
    Args:
        kernel: Kernel function
        noise_variance: Observation noise σ²_n
        optimize_hyperparams: Whether to optimize kernel hyperparameters
    """
    def __init__(self, kernel, noise_variance=1e-2, optimize_hyperparams=True):
        self.kernel = kernel
        self.noise_variance = noise_variance
        self.optimize_hyperparams = optimize_hyperparams
        
        self.X_train = None
        self.y_train = None
        self.K_inv = None
        self.alpha = None
    
    def fit(self, X, y):
        """
        Fit GP to training data.
        
        Args:
            X: Training inputs [N, D]
            y: Training targets [N, 1]
        """
        self.X_train = X
        self.y_train = y
        
        # Optimize hyperparameters
        if self.optimize_hyperparams and hasattr(self.kernel, 'gradient'):
            self._optimize_hyperparameters()
        
        # Compute kernel matrix
        K = self.kernel(X) + self.noise_variance * np.eye(len(X))
        
        # Cholesky decomposition for numerical stability
        L = cholesky(K, lower=True)
        
        # Solve for α = K⁻¹ y
        self.alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))
        
        # Store inverse (for predictive variance)
        self.K_inv = solve_triangular(L.T, solve_triangular(L, np.eye(len(X)), lower=True))
    
    def predict(self, X_test, return_std=True):
        """
        Predict at test points.
        
        Args:
            X_test: Test inputs [N*, D]
            return_std: Whether to return predictive std
        
        Returns:
            mean: Predictive mean [N*]
            std: Predictive std [N*] (if return_std=True)
        """
        # Kernel between test and train
        K_star = self.kernel(X_test, self.X_train)
        
        # Predictive mean: ΞΌ* = k*α΅€ K⁻¹ y
        mean = K_star @ self.alpha
        
        if not return_std:
            return mean.ravel()
        
        # Predictive variance: σ²* = k** - k*α΅€ K⁻¹ k*
        K_star_star = self.kernel(X_test)
        variance = np.diag(K_star_star) - np.sum(K_star @ self.K_inv * K_star, axis=1)
        std = np.sqrt(np.maximum(variance, 0))  # Clip negatives (numerical errors)
        
        return mean.ravel(), std
    
    def log_marginal_likelihood(self):
        """
        Compute log marginal likelihood.
        
        log p(y|X) = -Β½ yα΅€K⁻¹y - Β½ log|K| - (N/2)log(2Ο€)
        """
        N = len(self.y_train)
        K = self.kernel(self.X_train) + self.noise_variance * np.eye(N)
        
        # Cholesky decomposition
        L = cholesky(K, lower=True)
        
        # Data fit: yα΅€K⁻¹y
        alpha = solve_triangular(L.T, solve_triangular(L, self.y_train, lower=True))
        data_fit = -0.5 * self.y_train.T @ alpha
        
        # Complexity penalty: -Β½ log|K| = -Ξ£α΅’ log Lα΅’α΅’
        complexity = -np.sum(np.log(np.diag(L)))
        
        # Constant
        constant = -0.5 * N * np.log(2 * np.pi)
        
        return (data_fit + complexity + constant).item()
    
    def _optimize_hyperparameters(self):
        """Optimize kernel hyperparameters via gradient descent."""
        
        def objective(params):
            # Update kernel parameters
            self.kernel.length_scale = params[0]
            self.kernel.signal_variance = params[1]
            self.noise_variance = params[2]
            
            # Negative log marginal likelihood
            return -self.log_marginal_likelihood()
        
        # Initial parameters
        x0 = np.array([self.kernel.length_scale, 
                      self.kernel.signal_variance,
                      self.noise_variance])
        
        # Optimize
        result = minimize(objective, x0, method='L-BFGS-B', 
                         bounds=[(1e-3, 10), (1e-3, 10), (1e-5, 1)])
        
        # Update parameters
        self.kernel.length_scale = result.x[0]
        self.kernel.signal_variance = result.x[1]
        self.noise_variance = result.x[2]


# ============================================================================
# 3. Sparse GP (Inducing Points)
# ============================================================================

class SparseGP:
    """
    Sparse Gaussian Process using inducing points.
    
    Complexity: O(NMΒ²) training, O(MΒ²) prediction
    
    Args:
        kernel: Kernel function
        num_inducing: Number of inducing points M
        noise_variance: Observation noise
    """
    def __init__(self, kernel, num_inducing=100, noise_variance=1e-2):
        self.kernel = kernel
        self.num_inducing = num_inducing
        self.noise_variance = noise_variance
        
        self.Z = None     # Inducing points
        self.L_mm = None  # Cholesky factor of K_mm
        self.L_B = None   # Cholesky factor of B = I + A Aα΅€
        self.c = None     # Whitened posterior mean over inducing values
    
    def fit(self, X, y):
        """Fit sparse GP."""
        N = len(X)
        
        # Select inducing points (k-means or random subset)
        indices = np.random.choice(N, self.num_inducing, replace=False)
        self.Z = X[indices]
        
        # Kernel matrices
        K_mm = self.kernel(self.Z) + 1e-6 * np.eye(self.num_inducing)  # Jitter
        K_nm = self.kernel(X, self.Z)
        
        # Whitened cross-term: A = L_mm⁻¹ K_mn / Οƒ_n
        L_mm = cholesky(K_mm, lower=True)
        A = solve_triangular(L_mm, K_nm.T, lower=True) / np.sqrt(self.noise_variance)
        
        # B = I + A Aα΅€ carries the data's effect on the inducing posterior
        B = np.eye(self.num_inducing) + A @ A.T
        L_B = cholesky(B, lower=True)
        
        # Whitened posterior mean over inducing values: c = L_B⁻¹ A y / Οƒ_n
        self.c = solve_triangular(L_B, A @ y, lower=True) / np.sqrt(self.noise_variance)
        self.L_mm = L_mm
        self.L_B = L_B
    
    def predict(self, X_test, return_std=True):
        """Predict using inducing points (SGPR predictive equations)."""
        # Kernel between test and inducing points
        K_star_m = self.kernel(X_test, self.Z)
        
        # tmp1 = L_mm⁻¹ K_m*,  tmp2 = L_B⁻¹ tmp1
        tmp1 = solve_triangular(self.L_mm, K_star_m.T, lower=True)
        tmp2 = solve_triangular(self.L_B, tmp1, lower=True)
        
        # Mean: ΞΌ* = tmp2α΅€ c = σ⁻² K*m (K_mm + σ⁻² K_mn K_nm)⁻¹ K_mn y
        mean = tmp2.T @ self.c
        
        if not return_std:
            return mean.ravel()
        
        # Variance: k** - NystrΓΆm reduction + inducing-posterior term
        K_star_star = self.kernel(X_test)
        variance = (np.diag(K_star_star)
                    - np.sum(tmp1**2, axis=0) + np.sum(tmp2**2, axis=0))
        std = np.sqrt(np.maximum(variance, 0))
        
        return mean.ravel(), std


# ============================================================================
# 4. Random Fourier Features
# ============================================================================

class RandomFourierFeatures:
    """
    Approximate kernel with random Fourier features.
    
    k(x, x') β‰ˆ Ο†(x)α΅€ Ο†(x')
    
    Args:
        num_features: Number of random features M
        kernel: Kernel to approximate (for sampling frequencies)
        length_scale: Length scale (for RBF kernel)
    """
    def __init__(self, num_features=100, length_scale=1.0):
        self.num_features = num_features
        self.length_scale = length_scale
        self.omega = None
        self.weights = None
    
    def fit(self, X, y):
        """
        Fit via linear regression in feature space.
        
        Ο†(x) = √(1/M) [cos(ω₁ᡀx), ..., cos(Ο‰β‚˜α΅€x), sin(ω₁ᡀx), ..., sin(Ο‰β‚˜α΅€x)]
        """
        D = X.shape[1]
        
        # Sample random frequencies from spectral density
        # For RBF: p(Ο‰) = N(0, 1/β„“Β²)
        self.omega = np.random.randn(self.num_features, D) / self.length_scale
        
        # Compute features
        Phi = self._compute_features(X)
        
        # Ridge regression: w = (Ξ¦α΅€Ξ¦ + Ξ»I)⁻¹ Ξ¦α΅€y
        lambda_reg = 1e-4
        self.weights = np.linalg.solve(
            Phi.T @ Phi + lambda_reg * np.eye(Phi.shape[1]),
            Phi.T @ y
        )
    
    def _compute_features(self, X):
        """Compute random Fourier features."""
        proj = X @ self.omega.T  # [N, M]
        
        # Concatenate cos and sin (2M features); the 1/√M scale makes
        # Ο†(x)α΅€Ο†(x') an unbiased Monte Carlo estimate of the RBF kernel
        features = np.concatenate([
            np.cos(proj), np.sin(proj)
        ], axis=1) * np.sqrt(1.0 / self.num_features)
        
        return features
    
    def predict(self, X_test):
        """Predict (no uncertainty, this is a point estimate)."""
        Phi_test = self._compute_features(X_test)
        return Phi_test @ self.weights


# ============================================================================
# 5. GP Classification (Laplace Approximation)
# ============================================================================

class GPClassifier:
    """
    Gaussian Process Classification using Laplace approximation.
    
    Binary classification with logistic likelihood.
    """
    def __init__(self, kernel, max_iter=20):
        self.kernel = kernel
        self.max_iter = max_iter
        
        self.X_train = None
        self.y_train = None
        self.f_map = None  # MAP estimate of latent function
    
    def fit(self, X, y):
        """
        Fit GP classifier.
        
        Args:
            X: Training inputs [N, D]
            y: Training labels [N] (0 or 1)
        """
        self.X_train = X
        self.y_train = y
        
        # Convert labels to {-1, +1}
        y_signed = 2 * y - 1
        
        # Compute kernel matrix
        K = self.kernel(X)
        
        # Find MAP via Newton's method
        f = np.zeros(len(X))  # Initialize at 0
        
        for iteration in range(self.max_iter):
            # Bernoulli targets t ∈ {0, 1} and logistic link Ο€ = Οƒ(f)
            pi = self._sigmoid(f)
            t = (y_signed + 1) / 2
            
            # W: diagonal of the negative log-likelihood Hessian (logistic)
            W = pi * (1 - pi)
            W_sqrt = np.sqrt(W)
            
            # B = I + W^Β½ K W^Β½ is well-conditioned (R&W Algorithm 3.1)
            B = np.eye(len(X)) + W_sqrt[:, None] * K * W_sqrt[None, :]
            L = cholesky(B, lower=True)
            
            # Newton step: b = W f + βˆ‡log p(y|f), with βˆ‡log p(y|f) = t - Ο€
            b = W * f + (t - pi)
            a = b - W_sqrt * solve_triangular(
                L.T, solve_triangular(L, W_sqrt * (K @ b), lower=True))
            f_new = K @ a
            
            # Check convergence
            if np.linalg.norm(f_new - f) < 1e-4:
                break
            
            f = f_new
        
        self.f_map = f
    
    def predict_proba(self, X_test):
        """
        Predict probabilities.
        
        Returns:
            probabilities: [N*, 2] (probability of class 0 and 1)
        """
        # Kernel between test and train
        K_star = self.kernel(X_test, self.X_train)
        
        # Mean of latent function
        f_star_mean = K_star @ np.linalg.solve(
            self.kernel(self.X_train) + 1e-8 * np.eye(len(self.X_train)), self.f_map)
        
        # Laplace approximation: ignore variance for simplicity
        # (Full version requires 1D integral)
        probs = self._sigmoid(f_star_mean)
        
        return np.column_stack([1 - probs, probs])
    
    def predict(self, X_test):
        """Predict class labels."""
        probs = self.predict_proba(X_test)
        return (probs[:, 1] > 0.5).astype(int)
    
    @staticmethod
    def _sigmoid(x):
        """Numerically stable sigmoid."""
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))


# ============================================================================
# 6. Bayesian Optimization
# ============================================================================

class BayesianOptimization:
    """
    Bayesian Optimization using GP and acquisition functions.
    
    Args:
        bounds: List of (min, max) for each dimension
        kernel: GP kernel
        acquisition: 'ei', 'ucb', or 'pi'
    """
    def __init__(self, bounds, kernel=None, acquisition='ei'):
        self.bounds = np.array(bounds)
        self.dim = len(bounds)
        self.acquisition = acquisition
        
        if kernel is None:
            kernel = RBFKernel(length_scale=1.0, signal_variance=1.0)
        
        self.gp = GaussianProcessRegressor(kernel, noise_variance=1e-4)
        
        self.X_observed = []
        self.y_observed = []
        self.best_y = -np.inf
    
    def suggest(self):
        """
        Suggest next point to evaluate.
        
        Returns:
            x_next: Next point to query [D]
        """
        if len(self.X_observed) == 0:
            # Random initialization
            return np.random.uniform(self.bounds[:, 0], self.bounds[:, 1])
        
        # Fit GP to observed data
        X = np.array(self.X_observed)
        y = np.array(self.y_observed).reshape(-1, 1)
        self.gp.fit(X, y)
        
        # Optimize acquisition function
        def objective(x):
            return -self._acquisition_function(x.reshape(1, -1))
        
        # Multi-start optimization
        best_acq = -np.inf
        best_x = None
        
        for _ in range(10):
            x0 = np.random.uniform(self.bounds[:, 0], self.bounds[:, 1])
            result = minimize(objective, x0, method='L-BFGS-B', bounds=self.bounds)
            
            if -result.fun > best_acq:
                best_acq = -result.fun
                best_x = result.x
        
        return best_x
    
    def observe(self, x, y):
        """
        Observe new data point.
        
        Args:
            x: Input [D]
            y: Output (scalar)
        """
        self.X_observed.append(x)
        self.y_observed.append(y)
        
        if y > self.best_y:
            self.best_y = y
    
    def _acquisition_function(self, x):
        """
        Compute acquisition function value.
        
        Args:
            x: Input [1, D]
        Returns:
            Acquisition value (scalar)
        """
        mean, std = self.gp.predict(x, return_std=True)
        mean, std = float(mean), float(std)  # single test point -> scalars
        
        if self.acquisition == 'ucb':
            # Upper Confidence Bound
            beta = 2.0
            return mean + beta * std
        
        elif self.acquisition == 'ei':
            # Expected Improvement
            if std == 0:
                return 0
            
            z = (mean - self.best_y) / std
            ei = (mean - self.best_y) * norm.cdf(z) + std * norm.pdf(z)
            return ei
        
        elif self.acquisition == 'pi':
            # Probability of Improvement
            if std == 0:
                return 0
            
            z = (mean - self.best_y) / std
            return norm.cdf(z)
        
        else:
            raise ValueError(f"Unknown acquisition: {self.acquisition}")

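# The closed-form EI used in `_acquisition_function` can be sanity-checked in
# isolation against a Monte Carlo estimate of E[max(f - y_best, 0)] for a
# Gaussian predictive distribution. A minimal standalone sketch (NumPy/SciPy
# only; the function name here is local to this check):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_y):
    """Closed-form EI for a Gaussian predictive N(mean, std^2)."""
    z = (mean - best_y) / std
    return (mean - best_y) * norm.cdf(z) + std * norm.pdf(z)

rng = np.random.default_rng(0)
mean, std, best_y = 1.2, 0.5, 1.0

# Monte Carlo estimate: E[max(f - best_y, 0)] with f ~ N(mean, std^2)
samples = rng.normal(mean, std, size=1_000_000)
mc = np.maximum(samples - best_y, 0.0).mean()

print(f"Closed form: {expected_improvement(mean, std, best_y):.4f}, MC: {mc:.4f}")
```

# The two estimates agree to roughly 1/sqrt(num_samples), which is a quick way
# to catch sign errors in the cdf/pdf terms.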

# ============================================================================
# 7. Demonstrations
# ============================================================================

def demo_gp_regression():
    """Demonstrate GP regression with multiple kernels."""
    print("="*70)
    print("GP Regression Demo")
    print("="*70)
    
    # Generate synthetic data
    np.random.seed(42)
    X_train = np.random.uniform(-5, 5, (20, 1))
    y_train = np.sin(X_train) + 0.1 * np.random.randn(20, 1)
    
    X_test = np.linspace(-6, 6, 100).reshape(-1, 1)
    
    # Fit GP with different kernels
    kernels = {
        'RBF': RBFKernel(length_scale=1.0, signal_variance=1.0),
        'Matérn': MaternKernel(length_scale=1.0, signal_variance=1.0),
        'Periodic': PeriodicKernel(period=2*np.pi, length_scale=1.0),
    }
    
    for name, kernel in kernels.items():
        gp = GaussianProcessRegressor(kernel, noise_variance=0.01, 
                                      optimize_hyperparams=True)
        gp.fit(X_train, y_train)
        
        mean, std = gp.predict(X_test, return_std=True)
        
        print(f"\n{name} Kernel:")
        print(f"  Optimized length scale: {kernel.length_scale:.4f}")
        print(f"  Optimized signal variance: {kernel.signal_variance:.4f}")
        print(f"  Log marginal likelihood: {gp.log_marginal_likelihood():.4f}")
        print(f"  Mean prediction range: [{mean.min():.4f}, {mean.max():.4f}]")
        print(f"  Avg uncertainty: {std.mean():.4f}")
    
    print()

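# The posterior equations that `GaussianProcessRegressor` implements
# (mu* = K*^T (K + s^2 I)^-1 y, Sigma* = K** - K*^T (K + s^2 I)^-1 K*) can be
# written out directly. A minimal, self-contained Cholesky-based sketch (NumPy
# only; function names here are local to the sketch, not the classes above):

```python
import numpy as np

def rbf(A, B, ls=1.0, var=1.0):
    """RBF kernel matrix: k(a, b) = var * exp(-||a-b||^2 / (2 ls^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ls**2)

def gp_posterior(X, y, X_star, noise=0.01):
    """Exact GP posterior mean and covariance at test points X_star."""
    K = rbf(X, X) + noise * np.eye(len(X))
    K_s = rbf(X, X_star)
    L = np.linalg.cholesky(K)                              # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K + s^2 I)^-1 y
    mu = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = rbf(X_star, X_star) - v.T @ v
    return mu, cov

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (15, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=15)
mu, cov = gp_posterior(X, y, np.linspace(-3, 3, 50).reshape(-1, 1))
print(mu.shape, cov.shape)  # (50,) (50, 50)
```

# Solving against the Cholesky factor twice avoids forming the explicit inverse,
# which is both faster and numerically safer than np.linalg.inv.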

def demo_sparse_gp():
    """Demonstrate sparse GP with inducing points."""
    print("="*70)
    print("Sparse GP Demo")
    print("="*70)
    
    # Generate larger dataset
    np.random.seed(42)
    N = 500
    X_train = np.random.uniform(-10, 10, (N, 1))
    y_train = np.sin(X_train) + 0.1 * np.random.randn(N, 1)
    
    X_test = np.linspace(-12, 12, 200).reshape(-1, 1)
    
    # Compare exact vs sparse
    kernel = RBFKernel(length_scale=1.0, signal_variance=1.0)
    
    # Exact GP (slow for N=500)
    print("Exact GP...")
    gp_exact = GaussianProcessRegressor(kernel, noise_variance=0.01, 
                                        optimize_hyperparams=False)
    gp_exact.fit(X_train, y_train)
    mean_exact, std_exact = gp_exact.predict(X_test, return_std=True)
    
    # Sparse GP
    print("Sparse GP with M=50 inducing points...")
    gp_sparse = SparseGP(kernel, num_inducing=50, noise_variance=0.01)
    gp_sparse.fit(X_train, y_train)
    mean_sparse, std_sparse = gp_sparse.predict(X_test)
    
    # Compare
    print(f"\nPrediction difference (RMSE): {np.sqrt(np.mean((mean_exact - mean_sparse)**2)):.6f}")
    print(f"Exact GP complexity: O({N}³) = O({N**3:,})")
    print(f"Sparse GP complexity: O({N}×{50}²) = O({N * 50**2:,})")
    print(f"Speedup: ~{(N**3) / (N * 50**2):.1f}×")
    print()

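# The inducing-point idea behind `SparseGP` can be illustrated on its own: the
# Nyström approximation replaces the full kernel matrix with the low-rank form
# K ~= K_nm K_mm^-1 K_mn built from M inducing inputs. A minimal sketch (NumPy
# only; helper names are local to this example, not the SparseGP class):

```python
import numpy as np

def rbf(A, B, ls=1.0):
    """RBF kernel matrix with unit signal variance."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(0)
N, M = 500, 50
X = rng.uniform(-10, 10, (N, 1))
Z = X[rng.choice(N, M, replace=False)]     # inducing inputs: a random data subset

K_nn = rbf(X, X)                           # full N x N kernel (for comparison only)
K_nm = rbf(X, Z)                           # N x M cross-covariance
K_mm = rbf(Z, Z) + 1e-8 * np.eye(M)        # jitter for numerical stability
K_nystrom = K_nm @ np.linalg.solve(K_mm, K_nm.T)

err = np.linalg.norm(K_nn - K_nystrom) / np.linalg.norm(K_nn)
print(f"Relative Nystrom error with M={M}: {err:.4f}")
```

# A sparse GP never materializes K_nn; it works with K_nm and K_mm directly,
# which is where the O(NM^2) training cost in the demo comes from.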

def demo_gp_classification():
    """Demonstrate GP classification."""
    print("="*70)
    print("GP Classification Demo")
    print("="*70)
    
    # Generate synthetic binary classification data
    np.random.seed(42)
    X_train = np.random.randn(100, 2)
    y_train = (X_train[:, 0]**2 + X_train[:, 1]**2 > 1).astype(int)
    
    # Fit GP classifier
    kernel = RBFKernel(length_scale=1.0, signal_variance=1.0)
    gp_clf = GPClassifier(kernel)
    gp_clf.fit(X_train, y_train)
    
    # Test
    X_test = np.random.randn(20, 2)
    y_pred = gp_clf.predict(X_test)
    probs = gp_clf.predict_proba(X_test)
    
    print("Sample predictions:")
    for i in range(5):
        print(f"  Test {i+1}: P(y=0)={probs[i, 0]:.4f}, P(y=1)={probs[i, 1]:.4f}, "
              f"Predicted class={y_pred[i]}")
    
    print(f"\nTraining accuracy: {(gp_clf.predict(X_train) == y_train).mean():.4f}")
    print()


def demo_bayesian_optimization():
    """Demonstrate Bayesian optimization."""
    print("="*70)
    print("Bayesian Optimization Demo")
    print("="*70)
    
    # Define toy objective (unknown to optimizer)
    def objective(x):
        return -(x - 2)**2 * np.sin(5 * x)
    
    # Run Bayesian optimization
    bo = BayesianOptimization(bounds=[(-5, 5)], acquisition='ei')
    
    print("Optimizing objective with Expected Improvement...")
    for iteration in range(15):
        x_next = bo.suggest()
        y_next = objective(x_next)[0]
        bo.observe(x_next, y_next)
        
        if (iteration + 1) % 5 == 0:
            print(f"  Iteration {iteration+1}: Best y = {bo.best_y:.4f} at x = {bo.X_observed[np.argmax(bo.y_observed)]}")
    
    # True optimum via dense grid search (the maximum is near x ≈ -4.08, not x = 2)
    x_grid = np.linspace(-5, 5, 10001)
    y_opt_true = objective(x_grid).max()
    
    print(f"\nTrue optimum: y ≈ {y_opt_true:.4f}")
    print(f"BO found: y = {bo.best_y:.4f} (difference: {abs(bo.best_y - y_opt_true):.4f})")
    print()


def print_complexity_comparison():
    """Print complexity comparison."""
    print("="*70)
    print("GP Method Complexity Comparison")
    print("="*70)
    print()
    
    comparison = """
┌─────────────────────┬─────────────┬─────────────┬────────────┬──────────────┐
│ Method              │ Training    │ Prediction  │ Memory     │ Best For     │
├─────────────────────┼─────────────┼─────────────┼────────────┼──────────────┤
│ Exact GP            │ O(N³)       │ O(N²)       │ O(N²)      │ N < 5,000    │
├─────────────────────┼─────────────┼─────────────┼────────────┼──────────────┤
│ Sparse GP (SVGP)    │ O(NM²)      │ O(M²)       │ O(NM)      │ 5K < N < 100K│
│                     │             │             │            │ (M ~ 100-500)│
├─────────────────────┼─────────────┼─────────────┼────────────┼──────────────┤
│ Random Features     │ O(NM)       │ O(M)        │ O(NM)      │ N > 100K     │
│ (RFF)               │ (ridge reg) │             │            │ (M ~ 1000)   │
├─────────────────────┼─────────────┼─────────────┼────────────┼──────────────┤
│ SKI/KISS-GP         │ O(N+M log M)│ O(M log M)  │ O(M)       │ 1D-3D, huge  │
│                     │             │             │            │ structured   │
└─────────────────────┴─────────────┴─────────────┴────────────┴──────────────┘

**Scalability example (N = 10,000 data points):**

- Exact GP: ~(10⁴)³ = 10¹² operations (impractical)
- Sparse GP (M=200): ~10⁴ × 200² = 4×10⁸ operations (feasible)
- Random Features (M=500): ~10⁴ × 500 = 5×10⁶ operations (fast)

**When to use each method:**

1. **Exact GP**: Small data (N < 5,000), need exact inference
2. **Sparse GP**: Medium data (5K-100K), need uncertainty estimates
3. **Random Features**: Large data (>100K), speed critical
4. **Deep Kernel**: Complex patterns, have GPU, willing to train

**Kernel selection guide:**

- **RBF**: Default choice, smooth functions
- **Matérn (ν=3/2, 5/2)**: Rough/noisy data
- **Periodic**: Seasonal patterns
- **Linear**: Linear trends (baseline)
- **Sum/Product**: Combine multiple patterns
"""
    
    print(comparison)
    print()

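# The "Random Features" row in the table can be made concrete: random Fourier
# features (Rahimi & Recht) approximate the RBF kernel as k(x, x') ~=
# phi(x)^T phi(x') with an explicit M-dimensional feature map, so training
# reduces to linear/ridge regression in O(NM). A minimal sketch (NumPy only):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, ls = 2, 2000, 1.0                      # input dim, num features, length scale

W = rng.normal(0.0, 1.0 / ls, (M, D))        # frequencies ~ spectral density of RBF
b = rng.uniform(0.0, 2.0 * np.pi, M)         # random phases

def phi(X):
    """Random Fourier feature map: k(x, x') ~= phi(x) @ phi(x')."""
    return np.sqrt(2.0 / M) * np.cos(X @ W.T + b)

X = rng.normal(size=(100, D))
K_exact = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1) / ls**2)
K_rff = phi(X) @ phi(X).T

print("Max abs kernel error:", np.abs(K_exact - K_rff).max())
```

# The per-entry error shrinks as O(1/sqrt(M)); with M=2000 the approximation
# is already close enough for most regression uses.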

# ============================================================================
# Run Demonstrations
# ============================================================================

if __name__ == "__main__":
    np.random.seed(42)
    
    demo_gp_regression()
    demo_sparse_gp()
    demo_gp_classification()
    demo_bayesian_optimization()
    print_complexity_comparison()
    
    print("="*70)
    print("Gaussian Process Implementations Complete")
    print("="*70)
    print()
    print("Summary:")
    print("  • GP Regression: Exact inference with multiple kernels")
    print("  • Sparse GP: Inducing points for O(NM²) complexity")
    print("  • Random Features: O(NM) approximation via Fourier features")
    print("  • GP Classification: Laplace approximation for non-Gaussian likelihood")
    print("  • Bayesian Optimization: EI/UCB acquisition for black-box optimization")
    print()
    print("Key insight: GPs provide uncertainty quantification naturally")
    print("Trade-off: Computational cost vs. sample efficiency")
    print("Applications: Small data, Bayesian optimization, calibrated uncertainty")
    print()