import torch
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")
1. Gaussian Process Theory¶
Definition:¶
\(f(x) \sim \mathcal{GP}(m(x), k(x, x'))\)
where \(m(x)\) is the mean function and \(k(x, x')\) is the kernel.
Posterior:¶
\(p(f_* \mid X, \mathbf{y}, X_*) = \mathcal{N}(\mu_*, \Sigma_*)\)
📚 Reference Materials:
gp_nn.pdf - Gp Nn
bayesian.pdf - Bayesian
1.5. Gaussian Process Theory: Deep Mathematical Foundations¶
What is a Gaussian Process?¶
Definition: A GP is a collection of random variables, any finite subset of which has a joint Gaussian distribution.
\(f(x) \sim \mathcal{GP}(m(x), k(x, x'))\)
where:
Mean function: \(m(x) = \mathbb{E}[f(x)]\)
Covariance/Kernel function: \(k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]\)
Intuition: Instead of learning a single function, we learn a distribution over functions.
From Functions to Distributions¶
Traditional regression: \(y = f(x; \theta) + \epsilon\). Find the best parameters \(\theta\).
GP regression: \(f(x) \sim \mathcal{GP}(0, k(x, x'))\), \(y = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma_n^2)\)
Every finite collection of function values is Gaussian:
\([f(x_1), \ldots, f(x_n)]^T \sim \mathcal{N}(\mathbf{m}, K), \quad K_{ij} = k(x_i, x_j)\)
Posterior Derivation: Step-by-Step¶
Setup:
Training: \(X = \{x_1, \ldots, x_n\}\), \(\mathbf{y} = \{y_1, \ldots, y_n\}\)
Test: \(X_* = \{x_{*1}, \ldots, x_{*m}\}\)
Want: \(p(f_* | X, \mathbf{y}, X_*)\)
Prior (before seeing data):
\(\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K & K_* \\ K_*^T & K_{**} \end{bmatrix}\right)\)
where:
\(K = k(X, X)\) is \(n \times n\) (training covariance)
\(K_* = k(X, X_*)\) is \(n \times m\) (cross-covariance)
\(K_{**} = k(X_*, X_*)\) is \(m \times m\) (test covariance)
Gaussian conditioning formula:
If \(\begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}\right)\)
Then: \(\mathbf{y} | \mathbf{x} \sim \mathcal{N}(\mu_y + \Sigma_{yx}\Sigma_{xx}^{-1}(\mathbf{x} - \mu_x), \Sigma_{yy} - \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy})\)
Applying to our case:
\(\mu_* = K_*^T (K + \sigma_n^2 I)^{-1} \mathbf{y}\)
\(\Sigma_* = K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*\)
Interpretation:
Mean \(\mu_*\): Weighted combination of training labels
Weights depend on similarity (kernel) between test and training points
Variance \(\Sigma_*\): Uncertainty decreases near training data
\(K_{**}\): Prior uncertainty (far from data)
Second term: Reduction due to observations
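As a quick numerical check of the conditioning formula, here is a minimal NumPy sketch; the `rbf` helper, the toy points, and the noise level are illustrative, not part of the notebook's later code:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # Squared-exponential kernel between two sets of 1-D points
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * ell**2))

# Toy setup: 3 training points, 2 test points
X = np.array([-1.0, 0.0, 1.0])
y = np.sin(X)
Xs = np.array([0.5, 2.0])
sn2 = 1e-4  # noise variance

K = rbf(X, X) + sn2 * np.eye(3)               # K + sigma_n^2 I
Ks = rbf(X, Xs)                                # K_*  (n x m)
Kss = rbf(Xs, Xs)                              # K_** (m x m)

mu = Ks.T @ np.linalg.solve(K, y)              # posterior mean
Sigma = Kss - Ks.T @ np.linalg.solve(K, Ks)    # posterior covariance

# Uncertainty grows with distance from the training set:
# x = 2.0 is far from the data, x = 0.5 sits between points
assert Sigma[1, 1] > Sigma[0, 0]
```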
Kernel Functions: Mathematical Properties¶
Definition: A kernel \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is valid if its Gram matrix is positive semi-definite (PSD) for any finite set of points.
Mercer's Theorem: A symmetric PSD function \(k(x, x')\) can be expressed as:
\(k(x, x') = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(x')\)
where \(\lambda_i > 0\) are eigenvalues and \(\phi_i\) are eigenfunctions.
Popular Kernels:
Squared Exponential (RBF):
\(k(x, x') = \sigma^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)\)
Properties: Infinitely differentiable, stationary, isotropic
Spectral density: Gaussian
Use: Smooth functions
Matérn:
\(k(r) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_\nu\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)\)
where \(r = \|x - x'\|\), \(K_\nu\) is the modified Bessel function.
Special cases:
\(\nu = 1/2\): Exponential kernel (not differentiable)
\(\nu = 3/2\): Once differentiable
\(\nu = 5/2\): Twice differentiable
\(\nu \to \infty\): RBF
Periodic:
\(k(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi |x - x'| / p)}{\ell^2}\right)\)
Period \(p\), length scale \(\ell\)
For periodic data (seasons, daily patterns)
Rational Quadratic (mixture of RBF):
\(k(x, x') = \sigma^2 \left(1 + \frac{\|x - x'\|^2}{2\alpha\ell^2}\right)^{-\alpha}\)
Scale mixture of SE kernels with different length scales
\(\alpha \to \infty\): Becomes SE
Kernel Algebra: Combine kernels to build complex ones!
Sum: \(k_1 + k_2\) → Models different patterns
Product: \(k_1 \cdot k_2\) → Conjunction of properties
Composition: \(k_1(f(x), f(x'))\) → Feature transformation
Example combination: \(k(x, x') = k_{\text{trend}}(x, x') + k_{\text{periodic}}(x, x') + k_{\text{noise}}(x, x')\)
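The closure properties above are easy to verify numerically. A small sketch (the kernel helpers and the test grid are hypothetical) checks that sums and products of valid kernels keep the Gram matrix positive semi-definite:

```python
import numpy as np

def rbf(X, ell=1.0):
    # Squared-exponential Gram matrix on a 1-D grid
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-d2 / (2 * ell**2))

def periodic(X, p=1.0, ell=1.0):
    # Periodic Gram matrix on a 1-D grid
    d = np.abs(X[:, None] - X[None, :])
    return np.exp(-2 * np.sin(np.pi * d / p) ** 2 / ell**2)

X = np.linspace(0, 3, 30)
K_sum = rbf(X, ell=2.0) + periodic(X)   # trend + seasonality
K_prod = rbf(X, ell=2.0) * periodic(X)  # locally periodic

# Sums and products of valid kernels stay PSD (up to float round-off)
for K in (K_sum, K_prod):
    assert np.linalg.eigvalsh(K).min() > -1e-8
```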
Computational Complexity Analysis¶
Training (computing \(K^{-1}\)):
Naive inversion: \(O(n^3)\)
Cholesky decomposition: \(O(n^3)\) (numerically stable)
Prediction (for \(m\) test points):
Mean: \(O(nm + n^2)\)
Variance: \(O(m^2n)\) (full covariance) or \(O(mn)\) (diagonal only)
Storage:
Kernel matrix: \(O(n^2)\)
Bottleneck: For \(n > 10,000\), standard GP becomes impractical!
Numerical Stability¶
Problem: Direct matrix inversion is unstable.
Solution: Cholesky decomposition
\(K + \sigma_n^2 I = L L^T\)
where \(L\) is lower triangular.
Efficient computation:
# Instead of: K_inv @ y
L = cholesky(K + σₙ²·I)
α = solve(L.T, solve(L, y))  # α = (K + σₙ²·I)⁻¹ y
# For variance:
v = solve(L, K_*)
var = K_** - v.T @ v
Time: Still \(O(n^3)\) but more stable!
Jitter: Add a small diagonal term for numerical stability: \(K \leftarrow K + 10^{-6} I\)
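A runnable sketch of the same recipe with SciPy's triangular solvers; the toy dataset and variable names are illustrative:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 15)
y = np.sin(3 * X)

# K + sigma_n^2 I for an RBF kernel with length scale 0.3
d2 = (X[:, None] - X[None, :]) ** 2
K = np.exp(-d2 / (2 * 0.3**2)) + 1e-2 * np.eye(15)

# alpha = (K + sigma_n^2 I)^{-1} y via two triangular solves, no explicit inverse
L = cholesky(K, lower=True)
alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))
assert np.allclose(alpha, np.linalg.solve(K, y))

# Predictive variance: diag(K_**) minus squared column norms of v = L^{-1} K_*
Xs = np.linspace(-1, 1, 5)
Ks = np.exp(-(X[:, None] - Xs[None, :]) ** 2 / (2 * 0.3**2))
v = solve_triangular(L, Ks, lower=True)
var = 1.0 - np.sum(v ** 2, axis=0)  # k(x, x) = 1 for this kernel
assert (var > 0).all()
```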
Log Marginal Likelihood (Model Selection)¶
To optimize hyperparameters \(\theta = \{\ell, \sigma, \sigma_n\}\):
\(\log p(\mathbf{y} \mid X, \theta) = -\frac{1}{2}\mathbf{y}^T K_y^{-1} \mathbf{y} - \frac{1}{2}\log|K_y| - \frac{n}{2}\log(2\pi)\)
where \(K_y = K + \sigma_n^2 I\).
Three terms:
Data fit: \(-\frac{1}{2}\mathbf{y}^T K_y^{-1} \mathbf{y}\) (how well the model explains the data)
Complexity penalty: \(-\frac{1}{2}\log|K_y|\) (Occam's razor)
Normalization: \(-\frac{n}{2}\log(2\pi)\)
Gradient (for optimization):
\(\frac{\partial}{\partial \theta_j} \log p(\mathbf{y} \mid X, \theta) = \frac{1}{2}\operatorname{tr}\left[(\boldsymbol{\alpha}\boldsymbol{\alpha}^T - K_y^{-1})\frac{\partial K_y}{\partial \theta_j}\right], \quad \boldsymbol{\alpha} = K_y^{-1}\mathbf{y}\)
Can optimize with gradient descent!
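The three terms can be assembled in a few lines. A hedged sketch (the RBF kernel, the toy data, and the helper name `log_marginal_likelihood` are assumptions, not notebook code) also checks that a sensible length scale scores higher than a wildly wrong one:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def log_marginal_likelihood(X, y, ell, sigma_f, sigma_n):
    """log p(y | X, theta) for a zero-mean GP with an RBF kernel."""
    n = len(X)
    d2 = (X[:, None] - X[None, :]) ** 2
    Ky = sigma_f**2 * np.exp(-d2 / (2 * ell**2)) + sigma_n**2 * np.eye(n)
    L = cholesky(Ky, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))  # equals -0.5 log|Ky|
    return data_fit + complexity - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 30)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(30)

# A length scale near the data's variation scale beats a near-constant model
lml_good = log_marginal_likelihood(X, y, ell=0.5, sigma_f=1.0, sigma_n=0.1)
lml_bad = log_marginal_likelihood(X, y, ell=50.0, sigma_f=1.0, sigma_n=0.1)
assert lml_good > lml_bad
```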
def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """
    RBF/Squared Exponential kernel.
    k(x,x') = variance * exp(-||x-x'||^2 / (2*length_scale^2))
    """
    dists = cdist(X1, X2, 'sqeuclidean')
    K = variance * np.exp(-dists / (2 * length_scale**2))
    return K

def matern_kernel(X1, X2, length_scale=1.0, nu=1.5):
    """
    Matérn kernel (nu=1.5).
    k(r) = (1 + sqrt(3)*r/l) * exp(-sqrt(3)*r/l)
    """
    dists = cdist(X1, X2)
    dists = np.sqrt(3) * dists / length_scale
    K = (1 + dists) * np.exp(-dists)
    return K

print("Kernels defined")
GP Regression¶
Gaussian Process regression places a prior distribution directly over functions: \(f \sim \mathcal{GP}(m(x), k(x, x'))\), where \(m\) is the mean function and \(k\) is the kernel (covariance function) that encodes assumptions about function smoothness and length scale. Given observed data, the posterior is also a Gaussian Process, yielding an exact predictive mean (the best estimate) and predictive variance (the uncertainty) at every test point. The closed-form posterior is computed via matrix operations involving the kernel matrix \(K\), training targets \(y\), and noise variance \(\sigma_n^2\): \(\mu_* = K_*^T (K + \sigma_n^2 I)^{-1} y\), \(\Sigma_* = K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*\). This non-parametric approach provides well-calibrated uncertainty estimates out of the box.
class GaussianProcess:
    """GP regression with RBF kernel."""

    def __init__(self, length_scale=1.0, variance=1.0, noise=1e-2):
        self.length_scale = length_scale
        self.variance = variance
        self.noise = noise
        self.X_train = None
        self.y_train = None
        self.K_inv = None

    def fit(self, X, y):
        """Fit GP to training data."""
        self.X_train = X
        self.y_train = y
        # Compute kernel matrix
        K = rbf_kernel(X, X, self.length_scale, self.variance)
        K += self.noise * np.eye(len(X))
        # Invert (with Cholesky for stability)
        L = np.linalg.cholesky(K)
        self.K_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(len(X))))

    def predict(self, X_test, return_std=True):
        """Predict with uncertainty."""
        # K_*
        K_s = rbf_kernel(self.X_train, X_test, self.length_scale, self.variance)
        # Mean
        mu = K_s.T @ self.K_inv @ self.y_train
        if return_std:
            # K_**
            K_ss = rbf_kernel(X_test, X_test, self.length_scale, self.variance)
            # Variance
            var = np.diag(K_ss - K_s.T @ self.K_inv @ K_s)
            std = np.sqrt(np.maximum(var, 0))
            return mu, std
        return mu
print("GaussianProcess defined")
Generate Test Data¶
We create a synthetic regression dataset with a known ground-truth function (e.g., a sinusoid with noise) and a set of sparse observation points. The gap between observations tests the GP's interpolation capability, while the region beyond the observations tests extrapolation. GPs excel at interpolation (the uncertainty shrinks near observed points) but honestly report high uncertainty in extrapolation regions, a crucial property for safety-critical applications where knowing what you do not know is as important as making predictions.
# True function
def f(x):
    return np.sin(3*x) + 0.3*np.cos(9*x)
# Training data
np.random.seed(42)
X_train = np.random.uniform(-1, 1, 20).reshape(-1, 1)
y_train = f(X_train).ravel() + 0.1 * np.random.randn(20)
# Test data
X_test = np.linspace(-1.5, 1.5, 200).reshape(-1, 1)
y_test = f(X_test).ravel()
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
Fit GP and Predict¶
Fitting a GP involves computing the kernel matrix between all training points, adding observation noise to the diagonal, and solving the resulting linear system. Prediction at new test points requires computing the cross-kernel between test and training points, then applying the posterior formulas. The computational cost scales as \(O(N^3)\) for \(N\) training points (due to matrix inversion), which limits standard GPs to a few thousand observations. For larger datasets, sparse approximations (inducing points) or scalable GP methods reduce this to \(O(NM^2)\) where \(M \ll N\) is the number of inducing points.
# Fit GP
gp = GaussianProcess(length_scale=0.3, variance=1.0, noise=0.01)
gp.fit(X_train, y_train)
# Predict
mu, std = gp.predict(X_test, return_std=True)
print(f"Predictions: mean {mu.shape}, std {std.shape}")
Visualize GP Regression¶
The GP regression plot shows the predictive mean as a line, the 95% confidence interval as a shaded region, and the training points as markers. Near observed data, the confidence interval narrows (the GP is certain); far from observations, it widens (the GP is uncertain). This uncertainty visualization is the defining advantage of GPs over standard neural networks, which typically produce point predictions without calibrated confidence. In applications like Bayesian optimization, active learning, and autonomous systems, these uncertainty estimates guide where to collect new data or when to defer to a human.
plt.figure(figsize=(12, 6))
# True function
plt.plot(X_test, y_test, 'k--', label='True function', linewidth=2)
# GP mean
plt.plot(X_test, mu, 'b-', label='GP mean', linewidth=2)
# Uncertainty (2 std)
plt.fill_between(X_test.ravel(), mu - 2*std, mu + 2*std, alpha=0.3, label='±2σ')
# Training data
plt.scatter(X_train, y_train, c='r', s=50, zorder=10, label='Training data')
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Gaussian Process Regression', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Prior Samples¶
Sampling functions from the GP prior (before observing any data) illustrates what the kernel encodes about function properties. An RBF kernel produces smooth functions; a Matérn-1/2 kernel produces rough, non-differentiable functions; a periodic kernel produces periodic functions. The kernel's length scale controls how quickly the function varies: short length scales produce rapidly oscillating samples, while long length scales produce slowly varying ones. Visualizing prior samples is an important step in GP modeling because it verifies that the chosen kernel matches prior beliefs about the function being modeled.
# Sample from GP prior
X_prior = np.linspace(-2, 2, 100).reshape(-1, 1)
K_prior = rbf_kernel(X_prior, X_prior, length_scale=0.5, variance=1.0)
K_prior += 1e-6 * np.eye(len(X_prior))
# Draw samples
samples = np.random.multivariate_normal(np.zeros(len(X_prior)), K_prior, size=5)
# Plot
plt.figure(figsize=(12, 6))
for i in range(5):
    plt.plot(X_prior, samples[i], alpha=0.7, linewidth=2)
plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('GP Prior Samples', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Kernel Comparison¶
Comparing different kernels on the same regression task reveals how kernel choice affects both the fit quality and the uncertainty estimates. The RBF (squared exponential) kernel assumes infinite smoothness and works well for smooth functions. The Matérn family provides a controllable smoothness parameter \(\nu\): \(\nu = 1/2\) gives the Ornstein-Uhlenbeck process (rough), \(\nu = 3/2\) is once-differentiable, and \(\nu \to \infty\) recovers the RBF. Kernel hyperparameters (length scale, variance) are optimized by maximizing the marginal likelihood \(p(y | X, \theta)\), which automatically balances fit quality with model complexity, an elegant form of Bayesian model selection.
7.5. Sparse Gaussian Processes: Scaling to Large Data¶
The Scalability Problem¶
Standard GP: \(O(n^3)\) time, \(O(n^2)\) memory
Not feasible for \(n > 10,000\)!
Solution 1: Subset of Data (Subset of Regressors)¶
Use only \(m \ll n\) "inducing points" \(Z = \{z_1, \ldots, z_m\}\).
Approximation:
\(f(x) \approx k(x, Z) K_{mm}^{-1} \mathbf{u}\)
where \(\mathbf{u} = f(Z)\) are function values at inducing points.
Complexity: \(O(nm^2)\) instead of \(O(n^3)\)
Typically \(m \approx 100\)-\(1000\).
Solution 2: Variational Sparse GP (SVGP)¶
Key idea: Variational inference with inducing points.
True posterior: \(p(\mathbf{u} \mid X, \mathbf{y}) = \frac{p(\mathbf{y} \mid X, \mathbf{u})\, p(\mathbf{u})}{p(\mathbf{y} \mid X)}\)
Approximate with: \(q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, \mathbf{S})\)
ELBO (variational bound):
\(\mathcal{L} = \sum_{i=1}^{n} \mathbb{E}_{q(f_i)}[\log p(y_i \mid f_i)] - \text{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u}))\)
Optimize \(\{Z, \mathbf{m}, \mathbf{S}\}\) to maximize \(\mathcal{L}\).
Advantages:
\(O(nm^2)\) complexity
Works with mini-batches!
Can handle millions of points
Stochastic optimization: estimate the sum over data points from a random mini-batch \(B\), rescaled by \(n/|B|\).
Solution 3: Structured Kernels¶
For grid data (images, time series):
Use Kronecker structure or Toeplitz structure.
Example (2D grid):
\(K = K_1 \otimes K_2\)
where \(\otimes\) is the Kronecker product.
Matrix-vector multiplication: \(O(n \log n)\) instead of \(O(n^2)\) using the FFT!
Applications:
Spatiotemporal data
Image processing
Efficient GP on grids
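The Kronecker trick rests on the identity \((K_1 \otimes K_2)\,\mathrm{vec}(V) = \mathrm{vec}(K_1 V K_2^T)\), which avoids ever forming the full matrix. A sketch (grid sizes and per-axis kernels are illustrative):

```python
import numpy as np

# 2-D grid: 30 points on one axis, 40 on the other
x1 = np.linspace(0, 1, 30)
x2 = np.linspace(0, 1, 40)

def rbf(g, ell=0.2):
    d2 = (g[:, None] - g[None, :]) ** 2
    return np.exp(-d2 / (2 * ell**2))

K1, K2 = rbf(x1), rbf(x2)  # per-axis kernel matrices
v = np.random.default_rng(0).standard_normal(30 * 40)

# (K1 kron K2) @ v without forming the 1200x1200 matrix
V = v.reshape(30, 40)          # row-major reshape matches np.kron ordering
result = (K1 @ V @ K2.T).ravel()

full = np.kron(K1, K2) @ v     # naive check against the explicit Kronecker product
assert np.allclose(result, full)
```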
Comparison of Sparse Methods¶

| Method | Complexity | Accuracy | Notes |
|---|---|---|---|
| Full GP | \(O(n^3)\) | Exact | \(n < 10K\) |
| Subset | \(O(nm^2)\) | Approximate | Simple, fast |
| FITC | \(O(nm^2)\) | Better | Fixed inducing points |
| SVGP | \(O(nm^2)\) | Best | Optimize inducing |
| Structured | \(O(n \log n)\) | Exact | Grid data only |
When to use:
\(n < 10K\): Full GP
\(10K < n < 1M\): SVGP with \(m \approx 500\)
Grid data: Structured kernels
\(n > 1M\): Deep GPs or Neural Processes
Inducing Point Selection¶
Strategies:
K-means clustering:
Cluster data, use centroids as \(Z\)
Simple, fast
Doesn't account for GP structure
Greedy selection:
Iteratively add points that maximize marginal likelihood
Better coverage
\(O(nm^3)\) cost
Variational optimization:
Treat \(Z\) as parameters, optimize with gradient descent
Best performance
Used in SVGP
Pseudo-inputs:
Not constrained to training data
Can be anywhere in input space
Most flexible
Practical: Start with K-means, refine with gradients.
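A minimal sketch of the K-means initialization step using SciPy; the data and the choice of m = 50 inducing points are illustrative:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 2))  # hypothetical 2-D training inputs

m = 50
# k-means++ initialization, centroids become the initial inducing points Z
Z, labels = kmeans2(X, m, minit='++', seed=0)

assert Z.shape == (m, 2)
# Cluster centroids lie inside the data's bounding box
assert (Z.min(0) >= X.min(0)).all() and (Z.max(0) <= X.max(0)).all()
```

From here, Z would typically be refined by gradient-based optimization of the ELBO, as the text suggests.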
Code Example: SVGP¶
import gpytorch

class SVGPModel(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean = self.mean_module(x)
        covar = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, covar)

# GPyTorch expects tensors
X_train_t = torch.as_tensor(X_train, dtype=torch.float32)
y_train_t = torch.as_tensor(y_train, dtype=torch.float32)

# Initialize inducing points by subsampling the training inputs
# (for large datasets, m ~ 500 points chosen by K-means is typical)
inducing_points = X_train_t[::10]
model = SVGPModel(inducing_points)
likelihood = gpytorch.likelihoods.GaussianLikelihood()

# Train with stochastic optimization (full batch here, since n is tiny)
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(likelihood.parameters()), lr=0.01
)
elbo = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=len(X_train))
for epoch in range(100):
    optimizer.zero_grad()
    output = model(X_train_t)
    loss = -elbo(output, y_train_t)
    loss.backward()
    optimizer.step()
# Different length scales
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
length_scales = [0.1, 0.3, 1.0]
for idx, ls in enumerate(length_scales):
    gp_test = GaussianProcess(length_scale=ls, variance=1.0, noise=0.01)
    gp_test.fit(X_train, y_train)
    mu_test, std_test = gp_test.predict(X_test)
    axes[idx].plot(X_test, y_test, 'k--', label='True', linewidth=2)
    axes[idx].plot(X_test, mu_test, 'b-', label='GP mean', linewidth=2)
    axes[idx].fill_between(X_test.ravel(), mu_test - 2*std_test, mu_test + 2*std_test, alpha=0.3)
    axes[idx].scatter(X_train, y_train, c='r', s=40, zorder=10)
    axes[idx].set_title(f'Length scale = {ls}', fontsize=12)
    axes[idx].set_xlabel('x', fontsize=11)
    axes[idx].set_ylabel('y', fontsize=11)
    axes[idx].legend(fontsize=10)
    axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Summary¶
Gaussian Processes:¶
Key Concepts:
Distribution over functions
Specified by mean and kernel
Exact Bayesian inference
Uncertainty quantification
Kernels:¶
RBF: Smooth, infinitely differentiable
Matérn: Control smoothness via ν
Periodic: For periodic data
Linear: For linear trends
Hyperparameters:¶
Length scale: Controls smoothness
Variance: Controls amplitude
Noise: Observation noise level
Advantages:¶
Principled uncertainty
Kernel flexibility
Small data regime
Interpretable
Challenges:¶
O(n³) complexity
Kernel choice matters
Hyperparameter tuning
Applications:¶
Bayesian optimization
Time series forecasting
Calibration
Active learning
Extensions:¶
Sparse GPs (inducing points)
Deep GPs (composition)
Variational GPs (scalability)
Neural network GPs
Advanced GP Topics and Extensions¶
1. Deep Gaussian Processes¶
Idea: Compose multiple GP layers: \(f(x) = f_L(f_{L-1}(\cdots f_1(x) \cdots))\), with each layer \(f_i \sim \mathcal{GP}\)
Advantages:
More expressive than single GP
Hierarchical representations
Better for complex patterns
Challenges:
No closed-form posterior
Need approximate inference (variational)
Harder to train
Applications:
Image modeling
Hierarchical time series
Transfer learning
2. Neural Network Gaussian Processes¶
Infinite-width neural networks = Gaussian processes!
Neal's Theorem (1996): As width \(\to \infty\), a Bayesian NN converges to a GP.
Kernel for a single hidden layer with activation \(\sigma\):
\(k(x, x') = \mathbb{E}_{w}[\sigma(w^T x)\, \sigma(w^T x')]\)
where \(w \sim \mathcal{N}(0, I)\).
For ReLU: \(k(x, x') = \frac{\|x\|\|x'\|}{2\pi}(\sin\theta + (\pi - \theta)\cos\theta)\)
where \(\theta = \cos^{-1}\left(\frac{x^T x'}{\|x\|\|x'\|}\right)\)
Neural Tangent Kernel (NTK):
Evolution of infinite-width NN during training = GP with NTK!
Practical: Can compute NN kernels, use with GP inference!
3. Multi-Output Gaussian Processes¶
Problem: Predict multiple correlated outputs.
Example: Robot arm - predict (x, y, z) positions simultaneously.
Solution: Multi-output GP with cross-output correlations.
Linear Model of Coregionalization (LMC):
\(f_p(x) = \sum_{q=1}^{Q} a_{pq} u_q(x)\)
where \(u_q \sim \mathcal{GP}(0, k_q)\).
Covariance between outputs \(i, j\) at locations \(x, x'\):
\(\text{Cov}[f_i(x), f_j(x')] = \sum_{q} a_{iq} a_{jq} k_q(x, x')\)
Computational cost: \(O((Pn)^3)\) for \(P\) outputs!
Sparse approximation: Use inducing points per output.
4. Gaussian Process Latent Variable Models (GPLVM)¶
Problem: High-dimensional data, unknown latent structure.
Generative model:
\(y_{nd} = f_d(\mathbf{x}_n) + \epsilon, \quad f_d \sim \mathcal{GP}(0, k), \quad \epsilon \sim \mathcal{N}(0, \sigma^2)\)
Learn: Latent coordinates \(X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\) and kernel hyperparameters.
Optimization: maximize \(\log p(Y \mid X, \theta) = \sum_{d} \log \mathcal{N}(\mathbf{y}_{:,d} \mid \mathbf{0}, K_X + \sigma^2 I)\) over \(X\) and \(\theta\).
Applications:
Dimensionality reduction (like PCA but nonlinear)
Visualization
Motion capture analysis
Bayesian GPLVM: Treat \(X\) as random, use variational inference.
5. Warped Gaussian Processes¶
Problem: GP assumes a Gaussian likelihood. What if the data are not Gaussian?
Solution: Apply a monotonic transformation \(g\):
\(g(y) = f(x) + \epsilon\)
where \(f \sim \mathcal{GP}\).
Example transformations:
Log: For positive data
Logit: For bounded [0,1] data
Box-Cox: \(g(y) = \frac{y^\lambda - 1}{\lambda}\)
Learn: Jointly optimize \(\lambda\) and GP hyperparameters.
6. Student-t Processes¶
Heavy-tailed alternative to GP:
\(f \sim \mathcal{TP}(\nu, m(x), k(x, x'))\)
a Student-t process with \(\nu\) degrees of freedom.
Marginals: every finite collection of function values follows a multivariate Student-t distribution.
Robust to outliers!
Relationship to GP:
\(\nu = 1\): Cauchy process (very heavy tails)
\(\nu \to \infty\): Gaussian process
Typical: \(\nu = 3\) or \(\nu = 5\)
7. Heteroscedastic GPs¶
Standard GP: Constant noise \(\sigma_n^2\).
Heteroscedastic: Noise varies with input!
Model the noise as another GP:
\(y = f(x) + \epsilon(x), \quad \epsilon(x) \sim \mathcal{N}(0, \sigma^2(x)), \quad \log \sigma^2(x) \sim \mathcal{GP}(0, k_\sigma)\)
Joint inference: Optimize both \(f\) and \(\sigma^2\).
Applications:
Financial data (volatility)
Sensor noise varies by location
More accurate uncertainty
8. Bayesian Optimization with GPs¶
Problem: Optimize an expensive black-box function \(f(x)\).
Example: Tune hyperparameters (each evaluation = a full training run).
Algorithm:
GP prior over \(f\)
Observe \(f(x_1), \ldots, f(x_n)\)
Update GP posterior
Acquisition function \(a(x)\) suggests next point:
UCB: \(a(x) = \mu(x) + \beta \sigma(x)\) (explore-exploit)
EI: Expected Improvement over current best
PI: Probability of Improvement
Evaluate \(x_{n+1} = \arg\max a(x)\)
Repeat
Expected Improvement:
\(\text{EI}(x) = \mathbb{E}[\max(f(x) - f^+, 0)]\)
where \(f^+ = \max_{i} f(x_i)\) is the current best.
Closed form (since the GP posterior is Gaussian):
\(\text{EI}(x) = (\mu(x) - f^+)\,\Phi(Z) + \sigma(x)\,\phi(Z)\)
where \(Z = \frac{\mu(x) - f^+}{\sigma(x)}\), \(\Phi\) is the standard normal CDF, \(\phi\) is the PDF.
Why it works:
High \(\mu(x)\): Likely to improve (exploitation)
High \(\sigma(x)\): Uncertain, might be great (exploration)
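The closed-form EI is a few lines of NumPy/SciPy. A sketch (the `expected_improvement` helper and the toy posterior values are illustrative) also shows the exploration effect: the same mean with more uncertainty gets a higher score:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI for maximization, given the GP posterior mean/std at candidate points."""
    sigma = np.maximum(sigma, 1e-12)  # guard against zero variance
    Z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(Z) + sigma * norm.pdf(Z)

# Three hypothetical candidates: clearly worse, slightly worse but certain,
# slightly worse but very uncertain
mu = np.array([0.0, 0.5, 0.5])
sigma = np.array([0.1, 0.1, 1.0])
f_best = 0.6

ei = expected_improvement(mu, sigma, f_best)
assert ei[2] > ei[1]   # same mean, more uncertainty -> more expected improvement
assert ei[0] < ei[1]   # a clearly worse point has near-zero EI
```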
Applications:
Hyperparameter tuning (Spearmint, GPyOpt)
Neural architecture search
Robotics (policy search)
Drug discovery
Comparison: GPs vs Neural Networks¶

| Aspect | Gaussian Processes | Neural Networks |
|---|---|---|
| Uncertainty | ✅ Principled, calibrated | ⚠️ Requires special methods |
| Small data | ✅ Excellent (\(n < 1000\)) | ❌ Needs lots of data |
| Scalability | ❌ \(O(n^3)\), sparse for large \(n\) | ✅ Scales well |
| Interpretability | ✅ Kernel choice interpretable | ❌ Black box |
| Flexibility | ⚠️ Limited by kernel | ✅ Universal approximator |
| Feature learning | ❌ Manual kernel design | ✅ Automatic |
| Optimization | ✅ Convex (often) | ❌ Non-convex |
When to use GPs:
Small/medium datasets (\(n < 10K\))
Need uncertainty quantification
Prior knowledge (kernel structure)
Safety-critical applications
Bayesian optimization
When to use NNs:
Large datasets (\(n > 100K\))
Complex patterns (images, text)
Raw features (no engineering)
Don't need calibrated uncertainty
Computational resources available
Hybrid: Combine both! Deep kernel learning, Neural Processes.
Advanced Gaussian Processes Theory¶
1. Introduction to Gaussian Processes¶
1.1 From Bayesian Linear Regression to GPs¶
Bayesian Linear Regression:
f(x) = wᵀφ(x), w ~ N(0, Σ_p)
Prior over weights w
Induces prior over functions f
Gaussian Process: Generalization to infinite basis functions
f ~ GP(m(x), k(x, x'))
Mean function: m(x) = E[f(x)] (typically 0)
Covariance (kernel): k(x, x') = Cov[f(x), f(x')]
Key property: Any finite collection of function values is jointly Gaussian
[f(x₁), ..., f(x_N)]ᵀ ~ N(m, K)
where Kᵢⱼ = k(xᵢ, xⱼ)
1.2 Why Gaussian Processes?¶
Advantages:
Uncertainty quantification: Predictive variance σ²(x*)
Non-parametric: Complexity grows with data (no fixed architecture)
Interpretable: Kernel encodes prior assumptions
Theoretical guarantees: Convergence, calibration
Sample efficiency: Strong performance with small data
Disadvantages:
Cubic complexity: O(N³) for N data points
Kernel choice: Requires domain knowledge
Scalability: Difficult for N > 10,000
Categorical inputs: Less natural than for continuous
2. Kernel Functions¶
2.1 Properties of Valid Kernels¶
Positive semi-definite: K ⪰ 0 for all x₁, …, x_N
Mercer's theorem: k(x, x') corresponds to an inner product in some Hilbert space
k(x, x') = ⟨φ(x), φ(x')⟩_H
Constructing new kernels:
Sum: k₁(x, x') + k₂(x, x')
Product: k₁(x, x') · k₂(x, x')
Scaling: c · k(x, x')
2.2 Common Kernel Functions¶
Squared Exponential (RBF):
k_SE(x, x') = σ² exp(-||x - x'||²/(2ℓ²))
σ²: Signal variance (vertical scale)
ℓ: Length scale (horizontal scale)
Properties: Infinitely differentiable, smooth functions
Use: Default choice, general-purpose
Matérn family:
k_Matérn(r) = (2^(1-ν)/Γ(ν)) (√(2ν)r/ℓ)^ν K_ν(√(2ν)r/ℓ)
ν: Smoothness parameter
ν = 1/2: Exponential kernel (continuous, not differentiable)
ν = 3/2: Once differentiable
ν = 5/2: Twice differentiable
ν → ∞: RBF kernel
Use: When roughness is important (e.g., sensor data)
Periodic kernel:
k_per(x, x') = σ² exp(-2sin²(π|x-x'|/p)/ℓ²)
p: Period
Use: Seasonal data, oscillations
Linear kernel:
k_lin(x, x') = σ_b² + σ_v² (x - c)(x' - c)
Use: Linear trends, baseline comparison
Rational Quadratic:
k_RQ(x, x') = σ² (1 + ||x-x'||²/(2αℓ²))^(-α)
α: Scale mixture parameter
Interpretation: Mixture of RBF kernels with different length scales
Use: Multi-scale phenomena
2.3 Kernel Design Patterns¶
Additive kernels (decomposition):
k(x, x') = k₁(x₁, x'₁) + k₂(x₂, x'₂)
Model independent contributions from different dimensions
Multiplicative kernels (interactions):
k(x, x') = k₁(x, x') · k₂(x, x')
Capture interactions between features
Spectral Mixture Kernel (Wilson & Adams, 2013):
k_SM(τ) = Σᵢ wᵢ exp(-2π²τ²μᵢ²) cos(2πτνᵢ)
Learn mixture of spectral components
Highly expressive
3. GP Regression (Kriging)¶
3.1 Posterior Distribution¶
Setup:
Data: D = {(xᵢ, yᵢ)}ᵢ₌₁ᴺ
Model: y = f(x) + ε, ε ~ N(0, σ²_n)
Prior: f ~ GP(0, k)
Posterior predictive distribution at x*:
p(f*|x*, X, y) = N(μ*, σ²*)
μ* = k*ᵀ (K + σ²_n I)⁻¹ y
σ²* = k** - k*ᵀ (K + σ²_n I)⁻¹ k*
where:
K: N×N kernel matrix, Kᵢⱼ = k(xᵢ, xⱼ)
k*: N×1 vector, (k*)ᵢ = k(xᵢ, x*)
k**: scalar, k** = k(x*, x*)
Interpretation:
μ*: Weighted sum of observations (similar inputs → high weight)
σ²*: Reduces near data, high in unexplored regions
3.2 Marginal Likelihood (Evidence)¶
Log marginal likelihood:
log p(y|X, θ) = -½ yᵀ(K + σ²_n I)⁻¹y - ½ log|K + σ²_n I| - (N/2)log(2π)
                \______data fit______/  \____complexity____/  \___const___/
Hyperparameter optimization: θ* = argmax log p(y|X, θ)
Typically: θ = {ℓ, σ², σ²_n} (length scale, signal variance, noise)
Gradient-based optimization (L-BFGS, Adam)
Gradient (for the RBF kernel):
∂log p(y|X, θ)/∂θ = ½ tr[(ααᵀ - K⁻¹) ∂K/∂θ]
where α = K⁻¹y (with K here denoting the noisy matrix K + σ²_n I)
3.3 Computational Complexity¶
Naive implementation:
Inversion: O(N³)
Storage: O(N²)
Bottleneck: Solving (K + σ²_n I)⁻¹y
Cholesky decomposition (standard approach):
Compute L: K = LLᵀ (O(N³))
Solve Lα = y via forward substitution (O(N²))
Solve Lᵀv = α via backward substitution (O(N²))
Total: O(N³) training, O(N²) prediction per test point
4. Scalable Gaussian Processes¶
4.1 Inducing Points (Sparse GPs)¶
Idea: Approximate the GP with M << N inducing points Z
Variational Free Energy (Titsias, 2009):
F = E_q[log p(y|f)] - KL(q(u) || p(u))
Predictive distribution:
q(f*|x*) = ∫ p(f*|u) q(u) du
where u = f(Z) are inducing function values
Complexity: O(NM²) training, O(M²) prediction
SVGP (Stochastic Variational GP):
Minibatch training
Scales to millions of data points
Complexity: O(BM²) per iteration (B = batch size)
4.2 Structured Kernel Interpolation (SKI)¶
Idea: Approximate the kernel with grid interpolation
Steps:
Place M inducing points on a regular grid
Interpolate: k(x, x') ≈ w_xᵀ K_uu w_x'
Use FFT for fast matrix-vector products
Complexity: O(N + M log M) with FFT
KISS-GP (Kernel Interpolation for Scalable Structured GPs):
Combines SKI with minibatching
Scales to millions in 1D-3D
4.3 Random Features (Fourier Features)¶
Bochner's theorem: Stationary kernel ↔ Fourier transform of a nonnegative measure
For the RBF kernel:
k(x, x') = E_ω[cos(ωᵀx) cos(ωᵀx') + sin(ωᵀx) sin(ωᵀx')]
         ≈ (1/M) Σₘ φ(x; ωₘ)ᵀ φ(x'; ωₘ)
Random Fourier Features (Rahimi & Recht, 2007):
φ(x) = √(1/M) [cos(ω₁ᵀx), ..., cos(ω_Mᵀx), sin(ω₁ᵀx), ..., sin(ω_Mᵀx)]ᵀ
ωₘ ~ p(ω) (spectral density)
Approximation:
k(x, x') ≈ φ(x)ᵀφ(x')
f(x) ≈ φ(x)ᵀw, w ~ N(0, I)
Complexity: O(NM) training (linear regression), O(M) prediction
Fastfood: Structured random features using the Hadamard transform (O(M log M))
4.4 Deep Kernel Learning¶
Combine an NN feature extractor with a GP:
f(x) = GP(φ_NN(x))
k(x, x') = k_GP(φ_NN(x), φ_NN(x'))
Training: Jointly optimize NN parameters and GP hyperparameters
Advantages:
NN learns features (representation learning)
GP provides uncertainty (calibrated predictions)
Implementation (Wilson et al., 2016):
Use spectral mixture kernel on learned features
Inducing points for scalability
5. GP Classification¶
5.1 Binary Classification¶
Model:
f ~ GP(0, k)
y ∈ {-1, +1}
p(y=+1|f) = σ(f) (logistic or probit)
Challenge: Posterior p(f|X, y) is non-Gaussian (non-conjugate likelihood)
Approaches:
Laplace approximation: Gaussian at the mode
Expectation Propagation (EP): Iterative refinement
Variational inference: Lower bound
MCMC: Sampling (expensive)
5.2 Laplace Approximation¶
Steps:
Find MAP: f̂ = argmax p(f|X, y)
Compute Hessian: H = -∇²log p(f|X, y)|_{f̂}
Approximate: p(f|X, y) ≈ N(f̂, H⁻¹)
Predictive distribution:
p(y*=+1|x*, X, y) ≈ ∫ σ(f*) N(f*|μ*, σ²*) df*
Requires a 1D integral (Gaussian quadrature or probit approximation)
Limitation: Underestimates uncertainty
5.3 Expectation Propagation¶
Idea: Approximate each likelihood term with a Gaussian
p(yᵢ|fᵢ) ≈ tᵢ(fᵢ) = sᵢ N(fᵢ|μ̃ᵢ, σ̃ᵢ²)
Algorithm:
Initialize tilted distributions
For each factor i:
Remove tᵢ from the posterior (cavity distribution)
Compute new tᵢ via moment matching
Update posterior
Repeat until convergence
Advantage: More accurate uncertainty than Laplace
Complexity: O(N³) (same as regression)
6. Multi-Output Gaussian Processes¶
6.1 Motivation¶
Goal: Model P correlated outputs jointly
Example: Multi-task learning, vector-valued functions
Naive approach: Independent GPs for each output
Ignores correlations
Multi-output GP: Joint distribution over all outputs
6.2 Linear Model of Coregionalization (LMC)¶
Idea: Each output is a linear combination of Q latent GPs
f_p(x) = Σ_q a_pq g_q(x)
where g_q ~ GP(0, k_q)
Kernel:
k((x, p), (x', p')) = Σ_q a_pq a_p'q k_q(x, x')
                    = [AAᵀ]_pp' k(x, x')   (when all k_q = k)
A: P×Q mixing matrix
Encodes output correlations
Complexity: O((NP)³) without approximations
6.3 Convolutional GPs¶
Idea: Outputs are convolutions of latent GPs
f_p(x) = ∫ w_p(x - z) g(z) dz
Advantage: Induces structured correlations
Applications: Spatial data, time series
7. Advanced Topics¶
7.1 Additive GPs¶
Decomposition:
f(x) = f₁(x₁) + f₂(x₂) + ... + f_D(x_D)
Kernel:
k(x, x') = Σ_d k_d(x_d, x'_d)
Advantages:
Interpretability (feature importance)
Scalability (lower effective dimension)
ANOVA decomposition
SALSA (Structured Additive Sparse GPs):
Sparse approximation for each term
Complexity: O(D · M²)
7.2 Warped GPs¶
Problem: GP assumes Gaussian noise, but data may be skewed
Solution: Learn a monotonic warp function g
y = g(f(x)) + ε
Training: Maximize the marginal likelihood over f and g jointly
Applications: Positive-only outputs, heavy-tailed noise
7.3 Heteroscedastic GPs¶
Problem: Noise variance σ²_n may vary with input
Model:
log σ²(x) ~ GP(m_σ, k_σ)
f(x) ~ GP(m, k)
y|f, σ² ~ N(f, σ²)
Inference: Variational or MCMC (non-conjugate)
7.4 Robust GPs¶
Student-t likelihood (heavy-tailed):
p(y|f) = t_ν(y|f, σ²)
ν: Degrees of freedom (lower → heavier tails)
Robust to outliers
Inference: Laplace or EP
7.5 Bayesian Optimization with GPs¶
Goal: Optimize an expensive black-box function f(x)
Acquisition functions:
Upper Confidence Bound (UCB):
UCB(x) = μ(x) + β σ(x)
β: Exploration-exploitation trade-off
Expected Improvement (EI):
EI(x) = E[max(f(x) - f⁺, 0)] = (μ - f⁺)Φ(Z) + σφ(Z)
f⁺: Current best
Z = (μ - f⁺)/σ
Probability of Improvement (PI):
PI(x) = Φ((μ - f⁺)/σ)
Algorithm:
Fit GP to observed data
Maximize acquisition function → x_next
Evaluate f(x_next), add to data
Repeat
Applications: Hyperparameter tuning, drug discovery, experimental design
8. Infinite-Width Neural Networks as GPs¶
8.1 Neal's Theorem (1996)¶
Single hidden layer NN:
f(x) = (b/√H) Σₕ vₕ σ(wₕᵀx)
where wₕ ~ N(0, Σ_w), vₕ ~ N(0, 1)
Limit as H → ∞:
f(x) ~ GP(0, K_NNGP(x, x'))
Kernel (for ReLU):
K(x, x') = (||x|| ||x'||)/(2π) · [sin θ + (π - θ) cos θ]
where θ = arccos(xᵀx'/(||x|| ||x'||))
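The ReLU NNGP kernel is straightforward to implement directly from this formula; a sketch (function name and test points are illustrative) also checks two known properties: the Gram matrix is PSD, and the diagonal equals ||x||²/2:

```python
import numpy as np

def relu_nngp_kernel(X):
    """Infinite-width ReLU NNGP kernel (arc-cosine kernel of order 1)."""
    norms = np.linalg.norm(X, axis=1)
    # Clip for numerical safety before arccos
    cos_t = np.clip((X @ X.T) / np.outer(norms, norms), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return np.outer(norms, norms) / (2 * np.pi) * (
        np.sin(theta) + (np.pi - theta) * np.cos(theta)
    )

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
K = relu_nngp_kernel(X)

assert np.allclose(K, K.T)                          # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-8          # PSD (up to round-off)
# Diagonal: theta = 0 gives K(x, x) = ||x||^2 / 2
assert np.allclose(np.diag(K), np.linalg.norm(X, axis=1) ** 2 / 2)
```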
8.2 Neural Tangent Kernel (NTK)¶
Observation (Jacot et al., 2018): In the infinite-width limit, NN training dynamics are linear!
NTK:
Θ(x, x') = E[⟨∇_θ f(x; θ₀), ∇_θ f(x'; θ₀)⟩]
Training dynamics (gradient descent):
df_t/dt = -η Θ(X, X)(f_t - y)
Closed-form solution (kernel ridge regression)
Prediction:
f_∞(x) = Θ(x, X)[Θ(X, X) + λI]⁻¹y
Implications:
Infinite-width NNs don't learn features (kernel fixed)
Finite-width NNs differ: feature learning, non-linearity
8.3 Deep GPs vs. Deep NNs¶
Correspondence:
Depth L → ∞: Deep GP
Width H → ∞: Shallow GP (no feature learning)
Deep GP kernel (compositional):
k_L(x, x') = k(k_{L-1}(x, ·), k_{L-1}(x', ·))
9. Practical Considerations
9.1 Numerical Stability
Problem: K may be nearly singular (ill-conditioned)
Solutions:
Add jitter: K + εI (ε ≈ 1e-6)
Cholesky with pivoting: permute rows for stability
Preconditioned conjugate gradients: iterative solver
9.2 Kernel Hyperparameter Initialization
Length scale ℓ: start with the median pairwise distance
ℓ₀ = median({||xᵢ - xⱼ|| : i ≠ j})
Signal variance σ²: match the data variance
σ²₀ = Var(y)
Noise variance σ²_n: small fraction of σ²
σ²_{n,0} = 0.1 · Var(y)
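These initialization heuristics are one-liners in practice; a sketch on synthetic data (the dataset itself is arbitrary):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.uniform(-5.0, 5.0, (200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

ell_0 = np.median(pdist(X))      # ℓ₀: median pairwise distance
sigma2_0 = np.var(y)             # σ²₀: match the data variance
sigma2_n0 = 0.1 * sigma2_0       # σ²_{n,0}: noise starts at 10% of the signal

print(f"ell_0={ell_0:.3f}, sigma2_0={sigma2_0:.3f}, sigma2_n0={sigma2_n0:.4f}")
```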
9.3 Cross-Validation
Leave-one-out (LOO) via marginal likelihood:
-log p(yᵢ | X, y₋ᵢ) = ½ log(2π σ²₋ᵢ) + (yᵢ - μ₋ᵢ)² / (2σ²₋ᵢ)
Can be computed efficiently without retraining (with α = K⁻¹y):
μ₋ᵢ = yᵢ - αᵢ / [K⁻¹]ᵢᵢ
σ²₋ᵢ = 1 / [K⁻¹]ᵢᵢ
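The closed-form LOO identities can be verified against brute-force retraining on a toy problem; note that K here must include the noise variance:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
X = rng.uniform(-3.0, 3.0, (N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)

def rbf(A, B, ell=1.0, s2=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return s2 * np.exp(-d2 / (2 * ell**2))

sn2 = 0.01
K = rbf(X, X) + sn2 * np.eye(N)          # noisy-target covariance
K_inv = np.linalg.inv(K)
alpha = K_inv @ y

# Closed-form LOO mean and variance
mu_loo = y - alpha / np.diag(K_inv)
var_loo = 1.0 / np.diag(K_inv)

# Brute force: retrain without point i, predict y_i
for i in range(N):
    m = np.arange(N) != i
    Ki_inv = np.linalg.inv(rbf(X[m], X[m]) + sn2 * np.eye(N - 1))
    k = rbf(X[m], X[i:i+1])[:, 0]
    mu = k @ Ki_inv @ y[m]
    var = rbf(X[i:i+1], X[i:i+1])[0, 0] + sn2 - k @ Ki_inv @ k
    assert abs(mu - mu_loo[i]) < 1e-8 and abs(var - var_loo[i]) < 1e-8
print("closed-form LOO matches brute-force retraining")
```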
10. Computational Complexity Summary

| Method | Training | Prediction | Memory | Approximation |
|---|---|---|---|---|
| Exact GP | O(N³) | O(N²) | O(N²) | None |
| Sparse GP (SVGP) | O(NM²) | O(M²) | O(NM) | Variational |
| SKI/KISS-GP | O(N + M log M) | O(M log M) | O(M) | Interpolation |
| Random Features | O(NM) | O(M) | O(NM) | Monte Carlo |
| Laplace (classification) | O(N³) | O(N²) | O(N²) | Mode approximation |
| Deep Kernel | O(NM²) | O(M²) | O(NM) | + NN features |

N: Data points
M: Inducing points / features (M ≪ N)
Rule of thumb:
N < 5,000: Exact GP
5,000 < N < 100,000: Sparse GP
N > 100,000: Random features or deep kernel
11. Software Libraries
11.1 Python Libraries
GPy: General-purpose GP library
GPyTorch: Scalable GPs in PyTorch (GPU support)
scikit-learn: Basic GP regression/classification
GPflow: TensorFlow-based, variational inference
PyMC3: Bayesian inference with GP priors
BoTorch: Bayesian optimization
11.2 R Libraries
kernlab: Kernel methods
mlegp: Maximum likelihood GP
tgp: Treed Gaussian processes
12. Benchmarks and Results
12.1 Regression Tasks
UCI datasets (avg. RMSE):
Linear regression: 0.62
Random forest: 0.54
GP (RBF): 0.48
Deep GP: 0.45
Sample efficiency (RMSE with 100 samples):
Neural network: 0.75
GP: 0.52
12.2 Bayesian Optimization
Hyperparameter tuning (CV score after T evaluations):
Random search (T=50): 0.82
Grid search (T=50): 0.85
GP-EI (T=50): 0.91
Speed: GP-EI finds the optimum 3× faster than random search
13. Connections to Other Methods
13.1 GP vs. Kernel Ridge Regression
Kernel Ridge Regression (KRR):
f(x) = Σᵢ αᵢ k(x, xᵢ)
α = (K + λI)⁻¹ y
Relationship: with λ = σ²_n, the KRR prediction equals the GP posterior mean (the maximum a posteriori estimate)
The GP additionally provides the predictive uncertainty σ²(x)
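A quick numerical check of the correspondence (with the ridge parameter λ set equal to the GP noise variance σ²_n; kernel and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2.0, 2.0, (15, 1))
y = np.cos(X[:, 0])
Xs = np.linspace(-2.0, 2.0, 50).reshape(-1, 1)

def rbf(A, B):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2)

sn2 = 0.1                                        # GP noise variance = KRR ridge λ
K = rbf(X, X)
alpha_krr = np.linalg.solve(K + sn2 * np.eye(len(X)), y)
f_krr = rbf(Xs, X) @ alpha_krr                   # KRR prediction

# GP posterior: same mean, plus a predictive variance that KRR does not give
f_gp = rbf(Xs, X) @ np.linalg.solve(K + sn2 * np.eye(len(X)), y)
v_gp = 1.0 - np.sum(
    rbf(Xs, X) * np.linalg.solve(K + sn2 * np.eye(len(X)), rbf(X, Xs)).T, axis=1)

print(np.allclose(f_krr, f_gp), v_gp.min() > 0)
```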
13.2 GP vs. Splines
Splines: piecewise polynomials with continuity constraints
Connection: certain kernels (e.g., Brownian motion) yield natural cubic splines
GP advantage: generalizes to higher dimensions, arbitrary kernels
13.3 GP vs. RKHS Methods
Reproducing Kernel Hilbert Space (RKHS):
Function space with inner product ⟨f, g⟩_k
A GP is the probabilistic view of the RKHS
Representer theorem: the optimal f lies in the span of kernel evaluations
Justifies finite-dimensional approximations
14. Recent Advances (2018-2024)
14.1 Scalable Exact GPs
LOVE (Lanczos Variance Estimates):
Use Lanczos algorithm for log-determinant
O(N log N) with structured kernels
MVM (Matrix-Vector Multiplication):
Exploit Toeplitz/Kronecker structure
FFT-based O(N log N)
14.2 Neural Process Family
Neural Processes (Garnelo et al., 2018):
Meta-learning for GP-like uncertainty
Amortized inference (encoder-decoder)
Variants:
Conditional NP: Deterministic path
Attentive NP: Attention mechanism
Transformer NP: Full Transformer architecture
14.3 Infinite-Width Limits Beyond MLPs
Convolutional NNGP (Garriga-Alonso et al., 2019):
Kernel for infinite-width CNNs
Spatial structure preserved
Transformer NNGP (Hron et al., 2020):
Kernel for infinite-width Transformers
Self-attention in function space
14.4 Functional Priors
GP-LVM (GP Latent Variable Model):
Learn low-dimensional latent space
Non-linear dimensionality reduction
Deep GPs (Damianou & Lawrence, 2013):
Composition of GPs: f_L(f_{L-1}(…f_1(x)))
Hierarchical feature learning
15. Limitations and Open Problems
15.1 Current Challenges
Scalability: still O(N³) for general kernels
High dimensions: Curse of dimensionality (requires many samples)
Discrete/categorical inputs: Less principled than continuous
Model selection: Kernel choice is critical but difficult
Non-stationarity: Real data often has varying local properties
15.2 Open Research Questions
Adaptive kernels: Learn kernel structure from data
Scalable classification: GP classification still expensive
Deep GPs: How many layers? How to train?
Uncertainty decomposition: Epistemic vs. aleatoric for GPs
Meta-learning: Transfer kernel knowledge across tasks
16. Key Takeaways
GPs are distributions over functions: fully Bayesian, non-parametric
Kernel encodes assumptions: smoothness, periodicity, additivity
Uncertainty for free: the predictive variance σ²(x*) naturally quantifies epistemic uncertainty
Scalability via approximations: inducing points, random features, structure
Strong with small data: sample-efficient, ideal for Bayesian optimization
Connection to NNs: infinite-width limit → NNGP, NTK
When to use GPs:
Small data (N < 10,000)
Need uncertainty quantification
Interpretability important (kernel structure)
Bayesian optimization, active learning
When NOT to use:
Large data (N > 100,000) without approximations
High-dimensional inputs (D > 50)
Need feature learning (use deep kernel or NN)
17. References
Foundational:
Rasmussen & Williams (2006): Gaussian Processes for Machine Learning (book)
Neal (1996): Bayesian Learning for Neural Networks
Scalable GPs:
Titsias (2009): Variational Learning of Inducing Variables (SVGP)
Hensman et al. (2013): Scalable Variational GP Regression
Wilson et al. (2015): Kernel Interpolation for Scalable GPs (KISS-GP)
Rahimi & Recht (2007): Random Features for Large-Scale Kernel Machines
Deep Kernels:
Wilson et al. (2016): Deep Kernel Learning
Wilson & Izmailov (2020): Bayesian Deep Learning (survey)
Infinite-Width NNs:
Jacot et al. (2018): Neural Tangent Kernel
Lee et al. (2018): Deep Neural Networks as GPs
Garriga-Alonso et al. (2019): Deep Convolutional Networks as Shallow GPs
Applications:
Snoek et al. (2012): Practical Bayesian Optimization (hyperparameter tuning)
Frazier (2018): Bayesian Optimization Tutorial
18. Practical Workflow
Step 1: Exploratory Analysis
Plot data, check for trends, periodicity
Compute pairwise distances (length scale estimate)
Step 2: Kernel Selection
Start with RBF (default)
Add periodic if seasonality detected
Use Matérn for rough data
Combine kernels (additive/multiplicative)
Step 3: Hyperparameter Optimization
Initialize: ℓ = median distance, σ² = Var(y)
Maximize log marginal likelihood
Use gradient-based optimizer (L-BFGS)
Step 4: Model Validation
Cross-validation (LOO or k-fold)
Check calibration (predictive intervals)
Residual analysis
Step 5: Prediction
Compute posterior mean and variance
Plot with confidence bands
Identify regions of high uncertainty (active learning)
"""
Complete Gaussian Process Implementations
==========================================
Includes: GP regression with multiple kernels, scalable approximations (sparse GP, random features),
GP classification (Laplace approximation), Bayesian optimization, deep kernel learning.
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize
from scipy.stats import norm
from scipy.linalg import cholesky, solve_triangular
# ============================================================================
# 1. Kernel Functions
# ============================================================================
class Kernel:
"""Base class for kernels."""
def __call__(self, X1, X2=None):
"""
Compute kernel matrix.
Args:
X1: [N1, D]
X2: [N2, D] or None (use X1)
Returns:
K: [N1, N2] kernel matrix
"""
raise NotImplementedError
class RBFKernel(Kernel):
"""
Radial Basis Function (Squared Exponential) kernel.
    k(x, x') = σ² exp(-||x - x'||² / (2ℓ²))
    Args:
        length_scale: Length scale ℓ (controls smoothness)
        signal_variance: Signal variance σ² (vertical scale)
"""
def __init__(self, length_scale=1.0, signal_variance=1.0):
self.length_scale = length_scale
self.signal_variance = signal_variance
def __call__(self, X1, X2=None):
if X2 is None:
X2 = X1
# Compute squared distances
        # ||x - x'||² = ||x||² + ||x'||² - 2x·x'
X1_norm = np.sum(X1**2, axis=1, keepdims=True)
X2_norm = np.sum(X2**2, axis=1, keepdims=True)
distances_sq = X1_norm + X2_norm.T - 2 * X1 @ X2.T
# RBF kernel
K = self.signal_variance * np.exp(-distances_sq / (2 * self.length_scale**2))
return K
def gradient(self, X):
"""Gradient w.r.t. hyperparameters for marginal likelihood optimization."""
K = self(X)
        # ∂K/∂ℓ
X_norm = np.sum(X**2, axis=1, keepdims=True)
distances_sq = X_norm + X_norm.T - 2 * X @ X.T
dK_dl = K * distances_sq / self.length_scale**3
        # ∂K/∂σ²
dK_dsigma2 = K / self.signal_variance
return {'length_scale': dK_dl, 'signal_variance': dK_dsigma2}
class MaternKernel(Kernel):
"""
    Matérn kernel (ν = 5/2).
    k(r) = σ² (1 + √5 r/ℓ + 5r²/(3ℓ²)) exp(-√5 r/ℓ)
    Twice differentiable (smoother than ν=3/2, rougher than RBF).
"""
def __init__(self, length_scale=1.0, signal_variance=1.0, nu=2.5):
self.length_scale = length_scale
self.signal_variance = signal_variance
self.nu = nu
def __call__(self, X1, X2=None):
if X2 is None:
X2 = X1
# Euclidean distances
X1_norm = np.sum(X1**2, axis=1, keepdims=True)
X2_norm = np.sum(X2**2, axis=1, keepdims=True)
distances = np.sqrt(np.maximum(
X1_norm + X2_norm.T - 2 * X1 @ X2.T, 0))
        # Matérn 5/2
sqrt5_r = np.sqrt(5) * distances / self.length_scale
K = self.signal_variance * (1 + sqrt5_r + sqrt5_r**2 / 3) * np.exp(-sqrt5_r)
return K
class PeriodicKernel(Kernel):
"""
Periodic kernel.
    k(x, x') = σ² exp(-2 sin²(π|x - x'| / p) / ℓ²)
    Args:
        period: Period p
        length_scale: Length scale ℓ
        signal_variance: Signal variance σ²
"""
def __init__(self, period=1.0, length_scale=1.0, signal_variance=1.0):
self.period = period
self.length_scale = length_scale
self.signal_variance = signal_variance
def __call__(self, X1, X2=None):
if X2 is None:
X2 = X1
# Distances
diffs = X1[:, np.newaxis, :] - X2[np.newaxis, :, :]
distances = np.sqrt(np.sum(diffs**2, axis=2))
# Periodic kernel
sin_term = np.sin(np.pi * distances / self.period)
K = self.signal_variance * np.exp(-2 * sin_term**2 / self.length_scale**2)
return K
class LinearKernel(Kernel):
"""
Linear kernel.
    k(x, x') = σ_b² + σ_v² (x - c)ᵀ(x' - c)
"""
def __init__(self, variance=1.0, bias=0.0, center=0.0):
self.variance = variance
self.bias = bias
self.center = center
def __call__(self, X1, X2=None):
if X2 is None:
X2 = X1
X1_centered = X1 - self.center
X2_centered = X2 - self.center
K = self.bias + self.variance * (X1_centered @ X2_centered.T)
return K
class AdditiveKernel(Kernel):
"""Sum of kernels: k(x, x') = kβ(x, x') + kβ(x, x')"""
def __init__(self, kernel1, kernel2):
self.kernel1 = kernel1
self.kernel2 = kernel2
def __call__(self, X1, X2=None):
return self.kernel1(X1, X2) + self.kernel2(X1, X2)
class ProductKernel(Kernel):
"""Product of kernels: k(x, x') = kβ(x, x') Β· kβ(x, x')"""
def __init__(self, kernel1, kernel2):
self.kernel1 = kernel1
self.kernel2 = kernel2
def __call__(self, X1, X2=None):
return self.kernel1(X1, X2) * self.kernel2(X1, X2)
# ============================================================================
# 2. Gaussian Process Regression
# ============================================================================
class GaussianProcessRegressor:
"""
Gaussian Process Regression.
Args:
kernel: Kernel function
        noise_variance: Observation noise σ²_n
optimize_hyperparams: Whether to optimize kernel hyperparameters
"""
def __init__(self, kernel, noise_variance=1e-2, optimize_hyperparams=True):
self.kernel = kernel
self.noise_variance = noise_variance
self.optimize_hyperparams = optimize_hyperparams
self.X_train = None
self.y_train = None
self.K_inv = None
self.alpha = None
def fit(self, X, y):
"""
Fit GP to training data.
Args:
X: Training inputs [N, D]
y: Training targets [N, 1]
"""
self.X_train = X
self.y_train = y
# Optimize hyperparameters
if self.optimize_hyperparams and hasattr(self.kernel, 'gradient'):
self._optimize_hyperparameters()
# Compute kernel matrix
K = self.kernel(X) + self.noise_variance * np.eye(len(X))
# Cholesky decomposition for numerical stability
L = cholesky(K, lower=True)
        # Solve for α = K⁻¹ y
self.alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))
# Store inverse (for predictive variance)
self.K_inv = solve_triangular(L.T, solve_triangular(L, np.eye(len(X)), lower=True))
def predict(self, X_test, return_std=True):
"""
Predict at test points.
Args:
X_test: Test inputs [N*, D]
return_std: Whether to return predictive std
Returns:
mean: Predictive mean [N*]
std: Predictive std [N*] (if return_std=True)
"""
# Kernel between test and train
K_star = self.kernel(X_test, self.X_train)
        # Predictive mean: μ* = k*ᵀ K⁻¹ y
mean = K_star @ self.alpha
if not return_std:
return mean.ravel()
        # Predictive variance: σ²* = k** - k*ᵀ K⁻¹ k*
K_star_star = self.kernel(X_test)
variance = np.diag(K_star_star) - np.sum(K_star @ self.K_inv * K_star, axis=1)
std = np.sqrt(np.maximum(variance, 0)) # Clip negatives (numerical errors)
return mean.ravel(), std
def log_marginal_likelihood(self):
"""
Compute log marginal likelihood.
        log p(y|X) = -½ yᵀK⁻¹y - ½ log|K| - (N/2) log(2π)
"""
N = len(self.y_train)
K = self.kernel(self.X_train) + self.noise_variance * np.eye(N)
# Cholesky decomposition
L = cholesky(K, lower=True)
        # Data fit: yᵀK⁻¹y
alpha = solve_triangular(L.T, solve_triangular(L, self.y_train, lower=True))
data_fit = -0.5 * self.y_train.T @ alpha
        # Complexity penalty: -½ log|K| = -Σᵢ log Lᵢᵢ
        complexity = -np.sum(np.log(np.diag(L)))
# Constant
constant = -0.5 * N * np.log(2 * np.pi)
return (data_fit + complexity + constant).item()
def _optimize_hyperparameters(self):
"""Optimize kernel hyperparameters via gradient descent."""
def objective(params):
# Update kernel parameters
self.kernel.length_scale = params[0]
self.kernel.signal_variance = params[1]
self.noise_variance = params[2]
# Negative log marginal likelihood
return -self.log_marginal_likelihood()
# Initial parameters
x0 = np.array([self.kernel.length_scale,
self.kernel.signal_variance,
self.noise_variance])
# Optimize
result = minimize(objective, x0, method='L-BFGS-B',
bounds=[(1e-3, 10), (1e-3, 10), (1e-5, 1)])
# Update parameters
self.kernel.length_scale = result.x[0]
self.kernel.signal_variance = result.x[1]
self.noise_variance = result.x[2]
# ============================================================================
# 3. Sparse GP (Inducing Points)
# ============================================================================
class SparseGP:
"""
Sparse Gaussian Process using inducing points.
    Complexity: O(NM²) training, O(M²) prediction
Args:
kernel: Kernel function
num_inducing: Number of inducing points M
noise_variance: Observation noise
"""
def __init__(self, kernel, num_inducing=100, noise_variance=1e-2):
self.kernel = kernel
self.num_inducing = num_inducing
self.noise_variance = noise_variance
self.Z = None # Inducing points
self.Qaa = None # Covariance at inducing points
def fit(self, X, y):
"""Fit sparse GP."""
N = len(X)
# Select inducing points (k-means or random subset)
indices = np.random.choice(N, self.num_inducing, replace=False)
self.Z = X[indices]
# Kernel matrices
K_mm = self.kernel(self.Z) + 1e-6 * np.eye(self.num_inducing) # Jitter
K_nm = self.kernel(X, self.Z)
        # Woodbury-style solve: with A = L_mm⁻¹ K_mn / σ_n and B = I + A Aᵀ,
        # the posterior mean at the inducing points is
        #   m_u = σ_n⁻¹ K_mm L_mm⁻ᵀ B⁻¹ A y
        L_mm = cholesky(K_mm, lower=True)
        A = solve_triangular(L_mm, K_nm.T, lower=True) / np.sqrt(self.noise_variance)
        B = np.eye(self.num_inducing) + A @ A.T
        L_B = cholesky(B, lower=True)
        # Mean at inducing points (two triangular solves instead of inverses)
        c = solve_triangular(L_B, A @ y, lower=True) / np.sqrt(self.noise_variance)
        self.m_u = K_mm @ solve_triangular(
            L_mm.T, solve_triangular(L_B.T, c))
def predict(self, X_test, return_std=True):
"""Predict using inducing points."""
# Kernel between test and inducing
K_star_m = self.kernel(X_test, self.Z)
# Mean
K_mm = self.kernel(self.Z)
K_mm_inv = np.linalg.inv(K_mm + 1e-6 * np.eye(self.num_inducing))
mean = K_star_m @ K_mm_inv @ self.m_u
if not return_std:
return mean.ravel()
        # Variance (Nyström-style approximation; omits the variational correction term)
K_star_star = self.kernel(X_test)
variance = np.diag(K_star_star) - np.sum(
K_star_m @ K_mm_inv * K_star_m, axis=1)
std = np.sqrt(np.maximum(variance, 0))
return mean.ravel(), std
# ============================================================================
# 4. Random Fourier Features
# ============================================================================
class RandomFourierFeatures:
"""
Approximate kernel with random Fourier features.
    k(x, x') ≈ φ(x)ᵀ φ(x')
Args:
num_features: Number of random features M
kernel: Kernel to approximate (for sampling frequencies)
length_scale: Length scale (for RBF kernel)
"""
def __init__(self, num_features=100, length_scale=1.0):
self.num_features = num_features
self.length_scale = length_scale
self.omega = None
self.weights = None
    def fit(self, X, y):
        """
        Fit via linear regression in feature space.
        φ(x) = √(1/M) [cos(ω₁ᵀx), ..., cos(ω_Mᵀx), sin(ω₁ᵀx), ..., sin(ω_Mᵀx)]
        """
        D = X.shape[1]
        # Sample random frequencies from the kernel's spectral density
        # For RBF with length scale ℓ: p(ω) = N(0, ℓ⁻² I)
        self.omega = np.random.randn(self.num_features, D) / self.length_scale
        # Compute features
        Phi = self._compute_features(X)
        # Ridge regression: w = (ΦᵀΦ + λI)⁻¹ Φᵀy
        lambda_reg = 1e-4
        self.weights = np.linalg.solve(
            Phi.T @ Phi + lambda_reg * np.eye(Phi.shape[1]),
            Phi.T @ y
        )
    def _compute_features(self, X):
        """Compute random Fourier features."""
        proj = X @ self.omega.T  # [N, M]
        # Concatenate cos and sin; the √(1/M) scaling makes φ(x)ᵀφ(x') a
        # Monte Carlo estimate of the kernel (√(2/M) belongs to the
        # cos-with-random-phase variant instead)
        features = np.concatenate([
            np.cos(proj), np.sin(proj)
        ], axis=1) * np.sqrt(1.0 / self.num_features)
        return features
def predict(self, X_test):
"""Predict (no uncertainty, this is a point estimate)."""
Phi_test = self._compute_features(X_test)
return Phi_test @ self.weights
# ============================================================================
# 5. GP Classification (Laplace Approximation)
# ============================================================================
class GPClassifier:
"""
Gaussian Process Classification using Laplace approximation.
Binary classification with logistic likelihood.
"""
def __init__(self, kernel, max_iter=20):
self.kernel = kernel
self.max_iter = max_iter
self.X_train = None
self.y_train = None
self.f_map = None # MAP estimate of latent function
def fit(self, X, y):
"""
Fit GP classifier.
Args:
X: Training inputs [N, D]
y: Training labels [N] (0 or 1)
"""
self.X_train = X
self.y_train = y
# Convert labels to {-1, +1}
y_signed = 2 * y - 1
# Compute kernel matrix
K = self.kernel(X)
        # Find the MAP of p(f | X, y) via Newton's method (GPML Algorithm 3.1)
        f = np.zeros(len(X))  # Initialize at 0
        for iteration in range(self.max_iter):
            pi = self._sigmoid(f)  # π = σ(f)
            # Gradient of the log likelihood: ∇ log p(y|f) = y σ(-y f)
            grad_log_lik = y_signed * self._sigmoid(-y_signed * f)
            # W = -∇∇ log p(y|f) = π(1 - π) (diagonal)
            W = pi * (1 - pi)
            W_sqrt = np.sqrt(W)
            B = np.eye(len(X)) + W_sqrt[:, None] * K * W_sqrt[None, :]
            L = cholesky(B, lower=True)
            # Newton step: b = W f + ∇ log p(y|f); f_new = K (b - W^½ B⁻¹ W^½ K b)
            b = W * f + grad_log_lik
            a = b - W_sqrt * solve_triangular(
                L.T, solve_triangular(L, W_sqrt * (K @ b), lower=True))
            f_new = K @ a
            # Check convergence
            if np.linalg.norm(f_new - f) < 1e-4:
                break
            f = f_new
self.f_map = f
def predict_proba(self, X_test):
"""
Predict probabilities.
Returns:
probabilities: [N*, 2] (probability of class 0 and 1)
"""
        # Kernel between test and train
        K_star = self.kernel(X_test, self.X_train)
        # Mean of latent function (jitter added for numerical stability)
        K = self.kernel(self.X_train) + 1e-6 * np.eye(len(self.X_train))
        f_star_mean = K_star @ np.linalg.solve(K, self.f_map)
# Laplace approximation: ignore variance for simplicity
# (Full version requires 1D integral)
probs = self._sigmoid(f_star_mean)
return np.column_stack([1 - probs, probs])
def predict(self, X_test):
"""Predict class labels."""
probs = self.predict_proba(X_test)
return (probs[:, 1] > 0.5).astype(int)
@staticmethod
def _sigmoid(x):
"""Numerically stable sigmoid."""
return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
# ============================================================================
# 6. Bayesian Optimization
# ============================================================================
class BayesianOptimization:
"""
Bayesian Optimization using GP and acquisition functions.
Args:
bounds: List of (min, max) for each dimension
kernel: GP kernel
acquisition: 'ei', 'ucb', or 'pi'
"""
def __init__(self, bounds, kernel=None, acquisition='ei'):
self.bounds = np.array(bounds)
self.dim = len(bounds)
self.acquisition = acquisition
if kernel is None:
kernel = RBFKernel(length_scale=1.0, signal_variance=1.0)
self.gp = GaussianProcessRegressor(kernel, noise_variance=1e-4)
self.X_observed = []
self.y_observed = []
self.best_y = -np.inf
def suggest(self):
"""
Suggest next point to evaluate.
Returns:
x_next: Next point to query [D]
"""
if len(self.X_observed) == 0:
# Random initialization
return np.random.uniform(self.bounds[:, 0], self.bounds[:, 1])
# Fit GP to observed data
X = np.array(self.X_observed)
y = np.array(self.y_observed).reshape(-1, 1)
self.gp.fit(X, y)
# Optimize acquisition function
def objective(x):
return -self._acquisition_function(x.reshape(1, -1))
# Multi-start optimization
best_acq = -np.inf
best_x = None
for _ in range(10):
x0 = np.random.uniform(self.bounds[:, 0], self.bounds[:, 1])
result = minimize(objective, x0, method='L-BFGS-B', bounds=self.bounds)
if -result.fun > best_acq:
best_acq = -result.fun
best_x = result.x
return best_x
def observe(self, x, y):
"""
Observe new data point.
Args:
x: Input [D]
y: Output (scalar)
"""
self.X_observed.append(x)
self.y_observed.append(y)
if y > self.best_y:
self.best_y = y
def _acquisition_function(self, x):
"""
Compute acquisition function value.
Args:
x: Input [1, D]
Returns:
Acquisition value (scalar)
"""
        mean, std = self.gp.predict(x, return_std=True)
        # Reduce to floats so scipy.optimize.minimize receives a scalar objective
        mean, std = float(mean[0]), float(std[0])
if self.acquisition == 'ucb':
# Upper Confidence Bound
beta = 2.0
return mean + beta * std
elif self.acquisition == 'ei':
# Expected Improvement
if std == 0:
return 0
z = (mean - self.best_y) / std
ei = (mean - self.best_y) * norm.cdf(z) + std * norm.pdf(z)
return ei
elif self.acquisition == 'pi':
# Probability of Improvement
if std == 0:
return 0
z = (mean - self.best_y) / std
return norm.cdf(z)
else:
raise ValueError(f"Unknown acquisition: {self.acquisition}")
# ============================================================================
# 7. Demonstrations
# ============================================================================
def demo_gp_regression():
"""Demonstrate GP regression with multiple kernels."""
print("="*70)
print("GP Regression Demo")
print("="*70)
# Generate synthetic data
np.random.seed(42)
X_train = np.random.uniform(-5, 5, (20, 1))
y_train = np.sin(X_train) + 0.1 * np.random.randn(20, 1)
X_test = np.linspace(-6, 6, 100).reshape(-1, 1)
# Fit GP with different kernels
kernels = {
'RBF': RBFKernel(length_scale=1.0, signal_variance=1.0),
        'Matérn': MaternKernel(length_scale=1.0, signal_variance=1.0),
'Periodic': PeriodicKernel(period=2*np.pi, length_scale=1.0),
}
for name, kernel in kernels.items():
gp = GaussianProcessRegressor(kernel, noise_variance=0.01,
optimize_hyperparams=True)
gp.fit(X_train, y_train)
mean, std = gp.predict(X_test, return_std=True)
print(f"\n{name} Kernel:")
print(f" Optimized length scale: {kernel.length_scale:.4f}")
print(f" Optimized signal variance: {kernel.signal_variance:.4f}")
print(f" Log marginal likelihood: {gp.log_marginal_likelihood():.4f}")
print(f" Mean prediction range: [{mean.min():.4f}, {mean.max():.4f}]")
print(f" Avg uncertainty: {std.mean():.4f}")
print()
def demo_sparse_gp():
"""Demonstrate sparse GP with inducing points."""
print("="*70)
print("Sparse GP Demo")
print("="*70)
# Generate larger dataset
np.random.seed(42)
N = 500
X_train = np.random.uniform(-10, 10, (N, 1))
y_train = np.sin(X_train) + 0.1 * np.random.randn(N, 1)
X_test = np.linspace(-12, 12, 200).reshape(-1, 1)
# Compare exact vs sparse
kernel = RBFKernel(length_scale=1.0, signal_variance=1.0)
# Exact GP (slow for N=500)
print("Exact GP...")
gp_exact = GaussianProcessRegressor(kernel, noise_variance=0.01,
optimize_hyperparams=False)
gp_exact.fit(X_train, y_train)
mean_exact, std_exact = gp_exact.predict(X_test)
# Sparse GP
print("Sparse GP with M=50 inducing points...")
gp_sparse = SparseGP(kernel, num_inducing=50, noise_variance=0.01)
gp_sparse.fit(X_train, y_train)
mean_sparse, std_sparse = gp_sparse.predict(X_test)
# Compare
print(f"\nPrediction difference (RMSE): {np.sqrt(np.mean((mean_exact - mean_sparse)**2)):.6f}")
print(f"Exact GP complexity: O({N}Β³) = O({N**3:,})")
print(f"Sparse GP complexity: O({N}Γ{50}Β²) = O({N * 50**2:,})")
print(f"Speedup: ~{(N**3) / (N * 50**2):.1f}Γ")
print()
def demo_gp_classification():
"""Demonstrate GP classification."""
print("="*70)
print("GP Classification Demo")
print("="*70)
# Generate synthetic binary classification data
np.random.seed(42)
X_train = np.random.randn(100, 2)
y_train = (X_train[:, 0]**2 + X_train[:, 1]**2 > 1).astype(int)
# Fit GP classifier
kernel = RBFKernel(length_scale=1.0, signal_variance=1.0)
gp_clf = GPClassifier(kernel)
gp_clf.fit(X_train, y_train)
# Test
X_test = np.random.randn(20, 2)
y_pred = gp_clf.predict(X_test)
probs = gp_clf.predict_proba(X_test)
print("Sample predictions:")
for i in range(5):
print(f" Test {i+1}: P(y=0)={probs[i, 0]:.4f}, P(y=1)={probs[i, 1]:.4f}, "
f"Predicted class={y_pred[i]}")
print(f"\nTraining accuracy: {(gp_clf.predict(X_train) == y_train).mean():.4f}")
print()
def demo_bayesian_optimization():
"""Demonstrate Bayesian optimization."""
print("="*70)
print("Bayesian Optimization Demo")
print("="*70)
# Define toy objective (unknown to optimizer)
def objective(x):
return -(x - 2)**2 * np.sin(5 * x)
# Run Bayesian optimization
bo = BayesianOptimization(bounds=[(-5, 5)], acquisition='ei')
print("Optimizing objective with Expected Improvement...")
for iteration in range(15):
x_next = bo.suggest()
y_next = objective(x_next)[0]
bo.observe(x_next, y_next)
if (iteration + 1) % 5 == 0:
print(f" Iteration {iteration+1}: Best y = {bo.best_y:.4f} at x = {bo.X_observed[np.argmax(bo.y_observed)]}")
    # True optimum via dense grid search over the same bounds
    # (x = 2 is a root of the objective, not its maximum)
    xs = np.linspace(-5, 5, 100001)
    y_opt_true = objective(xs).max()
    print(f"\nTrue optimum: y ≈ {y_opt_true:.4f}")
    print(f"BO found: y = {bo.best_y:.4f} (difference: {abs(bo.best_y - y_opt_true):.4f})")
print()
def print_complexity_comparison():
"""Print complexity comparison."""
print("="*70)
print("GP Method Complexity Comparison")
print("="*70)
print()
comparison = """
βββββββββββββββββββββββ¬ββββββββββββββ¬ββββββββββββββ¬βββββββββββββ¬βββββββββββββββ
β Method β Training β Prediction β Memory β Best For β
βββββββββββββββββββββββΌββββββββββββββΌββββββββββββββΌβββββββββββββΌβββββββββββββββ€
β Exact GP β O(NΒ³) β O(NΒ²) β O(NΒ²) β N < 5,000 β
βββββββββββββββββββββββΌββββββββββββββΌββββββββββββββΌβββββββββββββΌβββββββββββββββ€
β Sparse GP (SVGP) β O(NMΒ²) β O(MΒ²) β O(NM) β 5K < N < 100Kβ
β β β β β (M ~ 100-500)β
βββββββββββββββββββββββΌββββββββββββββΌββββββββββββββΌβββββββββββββΌβββββββββββββββ€
β Random Features β O(NM) β O(M) β O(NM) β N > 100K β
β (RFF) β (ridge reg) β β β (M ~ 1000) β
βββββββββββββββββββββββΌββββββββββββββΌββββββββββββββΌβββββββββββββΌβββββββββββββββ€
β SKI/KISS-GP β O(N+M log M)β O(M log M) β O(M) β 1D-3D, huge β
β β β β β structured β
βββββββββββββββββββββββ΄ββββββββββββββ΄ββββββββββββββ΄βββββββββββββ΄βββββββββββββββ
**Scalability example (N = 10,000 data points):**
- Exact GP: ~10Β³ Γ 10Β³ Γ 10Β³ = 10βΉ operations (impractical)
- Sparse GP (M=200): ~10β΄ Γ 200Β² = 4Γ10βΈ operations (feasible)
- Random Features (M=500): ~10β΄ Γ 500 = 5Γ10βΆ operations (fast)
**When to use each method:**
1. **Exact GP**: Small data (N < 5,000), need exact inference
2. **Sparse GP**: Medium data (5K-100K), need uncertainty
3. **Random Features**: Large data (>100K), speed critical
4. **Deep Kernel**: Complex patterns, have GPU, willing to train
**Kernel selection guide:**
- **RBF**: Default choice, smooth functions
- **MatΓ©rn (Ξ½=3/2, 5/2)**: Rough/noisy data
- **Periodic**: Seasonal patterns
- **Linear**: Linear trends (baseline)
- **Sum/Product**: Combine multiple patterns
"""
print(comparison)
print()
# ============================================================================
# Run Demonstrations
# ============================================================================
if __name__ == "__main__":
np.random.seed(42)
demo_gp_regression()
demo_sparse_gp()
demo_gp_classification()
demo_bayesian_optimization()
print_complexity_comparison()
print("="*70)
print("Gaussian Process Implementations Complete")
print("="*70)
print()
print("Summary:")
print(" β’ GP Regression: Exact inference with multiple kernels")
print(" β’ Sparse GP: Inducing points for O(NMΒ²) complexity")
print(" β’ Random Features: O(NM) approximation via Fourier features")
print(" β’ GP Classification: Laplace approximation for non-Gaussian likelihood")
print(" β’ Bayesian Optimization: EI/UCB acquisition for black-box optimization")
print()
print("Key insight: GPs provide uncertainty quantification naturally")
print("Trade-off: Computational cost vs. sample efficiency")
print("Applications: Small data, Bayesian optimization, calibrated uncertainty")
print()