Gaussian Processes & RKHS
1. The Infinite Dimensional Gaussian
The Object: A Gaussian Process (GP) is a collection of random variables $\{f(x)\}_{x \in \mathcal{X}}$ indexed by a set $\mathcal{X}$, such that every finite linear combination $\sum_i c_i f(x_i)$ is normally distributed. This sounds simple, but it requires the Kolmogorov Extension Theorem to be well-defined. We specify the process by:
- Mean Function: $m(x) = \mathbb{E}[f(x)]$.
- Covariance Kernel: $k(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big]$.
For consistency (to define a measure on $\mathbb{R}^{\mathcal{X}}$), the kernel must be Positive Semi-Definite (PSD). For any set of points $x_1, \dots, x_n \in \mathcal{X}$, the Gram matrix $K_{ij} = k(x_i, x_j)$ must satisfy $\mathbf{c}^\top K \mathbf{c} \ge 0$ for all $\mathbf{c} \in \mathbb{R}^n$.
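As a quick numerical illustration of the PSD requirement (a minimal sketch; the `rbf_kernel` helper is my own, not part of the text), build a Gram matrix for the RBF kernel and check that its spectrum is non-negative up to floating-point error:
import jax.numpy as jnp
from jax import random

def rbf_kernel(x1, x2, lengthscale=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 l^2)): a standard PSD kernel
    d2 = jnp.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * d2 / lengthscale**2)

X = random.normal(random.PRNGKey(0), (50, 2))   # 50 arbitrary points in R^2
K = rbf_kernel(X, X)                            # Gram matrix K_ij = k(x_i, x_j)
print(jnp.linalg.eigvalsh(K).min())             # ~0 (>= -1e-6): c^T K c >= 0 for all c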
2. The Reproducing Kernel Hilbert Space (RKHS)
Associated with every PSD kernel $k$ is a unique Hilbert space of functions $\mathcal{H}_k$. This space is “small”—it contains the “skeleton” of the random process, but arguably not the sample paths themselves.
Construction:
- Define the pre-Hilbert space $\mathcal{H}_0 = \operatorname{span}\{k(\cdot, x) : x \in \mathcal{X}\}$.
- Define the inner product on basis elements: $\langle k(\cdot, x), k(\cdot, x') \rangle := k(x, x')$.
- Extend by linearity and complete the space to get $\mathcal{H}_k$.
The Reproducing Property: For any $f \in \mathcal{H}_k$ and $x \in \mathcal{X}$:
$$\langle f, k(\cdot, x) \rangle_{\mathcal{H}_k} = f(x).$$
This implies that the evaluation functional $\delta_x : f \mapsto f(x)$ is a continuous (bounded) linear operator.
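To make the boundedness explicit, Cauchy–Schwarz applied to the reproducing property gives the pointwise bound
$$|f(x)| = \big|\langle f, k(\cdot, x) \rangle_{\mathcal{H}_k}\big| \le \|f\|_{\mathcal{H}_k} \, \|k(\cdot, x)\|_{\mathcal{H}_k} = \|f\|_{\mathcal{H}_k} \sqrt{k(x, x)},$$
so $\delta_x$ has operator norm at most $\sqrt{k(x, x)}$.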
Mercer's Theorem: Let $\mathcal{X}$ be compact and $k$ be continuous. The integral operator $T_k : L^2(\mathcal{X}) \to L^2(\mathcal{X})$,
$$(T_k f)(x) = \int_{\mathcal{X}} k(x, x') \, f(x') \, dx',$$
is compact, positive, and self-adjoint. There exists an orthonormal basis of eigenfunctions $\{\phi_i\}_{i \ge 1}$ and eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge 0$ such that:
$$k(x, x') = \sum_{i=1}^{\infty} \lambda_i \, \phi_i(x) \, \phi_i(x').$$
The RKHS norm can be characterized spectrally: writing $f = \sum_i f_i \phi_i$,
$$\|f\|_{\mathcal{H}_k}^2 = \sum_{i=1}^{\infty} \frac{f_i^2}{\lambda_i}.$$
This explicitly penalizes high-frequency components (those with small $\lambda_i$). The smaller $\lambda_i$, the “harder” it is for a function to have energy in that mode. Sample paths of the GP typically have infinite RKHS norm because $\sum_i f_i^2/\lambda_i$ diverges (where $f_i = \langle f, \phi_i \rangle_{L^2}$). This is Driscoll's zero-one law: $\mathbb{P}(f \in \mathcal{H}_k) \in \{0, 1\}$, and the probability is $0$ whenever $\mathcal{H}_k$ is infinite-dimensional. The sample paths live in a slightly larger space (usually a fractional Sobolev space).
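A one-line heuristic for the divergence: in the Karhunen–Loève expansion $f = \sum_i f_i \phi_i$ the coefficients are independent with $f_i \sim \mathcal{N}(0, \lambda_i)$, so
$$\mathbb{E}\big[\|f\|_{\mathcal{H}_k}^2\big] = \sum_{i=1}^{\infty} \frac{\mathbb{E}[f_i^2]}{\lambda_i} = \sum_{i=1}^{\infty} \frac{\lambda_i}{\lambda_i} = \infty$$
whenever infinitely many eigenvalues are nonzero.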
3. The Geometry of the Posterior
We observe noisy data $y_i = f(x_i) + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. We wish to compute the posterior $p\big(f(x_*) \mid \mathbf{y}\big)$. Algebraically, this is just Gaussian conditioning. Geometrically, it is an Orthogonal Projection.
Consider the subspace (of the Hilbert space of square-integrable random variables) spanned by the observations $y_1, \dots, y_n$. Since all variables are jointly Gaussian, the conditional expectation $\mathbb{E}[f(x_*) \mid \mathbf{y}]$ is simply the orthogonal projection of the random variable $f(x_*)$ onto the space spanned by the observations.
This coincides with the Representer Theorem. The posterior mean function lies exactly in the subspace spanned by the kernel functions: $\mu(x_*) = \sum_{i=1}^{n} \alpha_i \, k(x_*, x_i)$.
The posterior variance is the squared norm of the “residual” vector (the part of $f(x_*)$ orthogonal to the observations).
This geometric view explains why the variance shrinks near data points: the angle between $f(x_*)$ and the observed variables is small, so the projection captures most of the norm.
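Carrying out the Gaussian conditioning explicitly gives the closed form of this projection:
$$\mu(x_*) = \mathbf{k}_*^\top (K + \sigma^2 I)^{-1} \mathbf{y}, \qquad \sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^\top (K + \sigma^2 I)^{-1} \mathbf{k}_*,$$
where $[\mathbf{k}_*]_i = k(x_*, x_i)$ and $K_{ij} = k(x_i, x_j)$. The coefficients $\boldsymbol{\alpha} = (K + \sigma^2 I)^{-1} \mathbf{y}$ are exactly the representer weights above.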
4. Spectral Kernels and Bochner’s Theorem
How do we design valid kernels? Theorem (Bochner): A continuous, shift-invariant function $k(x, x') = k(x - x')$ on $\mathbb{R}^D$ is positive definite if and only if it is the Fourier transform of a finite non-negative measure $S$:
$$k(\tau) = \int_{\mathbb{R}^D} e^{2\pi i \, \omega^\top \tau} \, S(\omega) \, d\omega.$$
Since $k$ is real, $S$ must be symmetric ($S(\omega) = S(-\omega)$). $S(\omega)$ is called the Spectral Density.
Spectral Mixture Kernels (Wilson & Adams, 2013): Most standard kernels are merely low-pass filters.
- RBF: $S(\omega) \propto \exp\!\big(-2\pi^2 \ell^2 \|\omega\|^2\big)$ (Gaussian).
- Matern: $S(\omega) \propto \big(\lambda^2 + 4\pi^2 \|\omega\|^2\big)^{-(\nu + D/2)}$ (Student-t / Cauchy-like). These spectral densities are centered at 0 and decay monotonically. They cannot model negative covariances or periodicity.
Theorem: The set of Gaussian mixtures is dense in the set of all distribution functions (in the weak topology). We therefore model the spectral density as a symmetrized mixture of $Q$ Gaussians,
$$S(\omega) = \sum_{q=1}^{Q} \frac{w_q}{2} \Big[\mathcal{N}(\omega; \boldsymbol{\mu}_q, \Sigma_q) + \mathcal{N}(\omega; -\boldsymbol{\mu}_q, \Sigma_q)\Big].$$
Taking the inverse Fourier transform of this density yields the Spectral Mixture (SM) Kernel:
$$k_{\mathrm{SM}}(\tau) = \sum_{q=1}^{Q} w_q \, \exp\!\big(-2\pi^2 \tau^\top \Sigma_q \tau\big) \, \cos\!\big(2\pi \boldsymbol{\mu}_q^\top \tau\big).$$
This kernel is interpretable:
- $w_q$: Contribution of the $q$-th mode.
- $\Sigma_q$: Inverse lengthscale (frequency variance).
- $\boldsymbol{\mu}_q$: Frequency of the periodicity (setting it to zero recovers a pure RBF-type component; see the special case below).
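For intuition, take a single component ($Q = 1$) with zero spectral mean, $\boldsymbol{\mu}_1 = \mathbf{0}$ and $\Sigma_1 = \sigma^2 I$:
$$k(\tau) = w_1 \exp\!\big(-2\pi^2 \sigma^2 \|\tau\|^2\big) = w_1 \exp\!\left(-\frac{\|\tau\|^2}{2\ell^2}\right), \qquad \ell = \frac{1}{2\pi\sigma},$$
so the RBF kernel is recovered as the zero-frequency special case.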
By learning the weights, means, and variances of the spectral density, a GP can effectively perform a Non-parametric Fourier Analysis while handling noise and gaps in the data robustly. This eliminates the “Kernel Engineering” step. The kernel is learned from data.
5. Scaling: Inducing Points (Sparse GPs)
To beat the $\mathcal{O}(N^3)$ bottleneck, we need to reduce the rank of the kernel matrix. Nystrom approximations can be numerically unstable because they simply subsample the matrix. The modern approach, the Variational Sparse GP (Titsias, 2009), treats the inducing points as variational parameters rather than model parameters.
The Augmented Model: Introduce inducing inputs $Z = \{z_m\}_{m=1}^{M}$ and inducing variables $\mathbf{u} = f(Z)$. The joint prior is:
$$p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u}) \, p(\mathbf{u}),$$
where $p(\mathbf{u}) = \mathcal{N}(\mathbf{0}, K_{uu})$ and $p(\mathbf{f} \mid \mathbf{u}) = \mathcal{N}\big(K_{fu} K_{uu}^{-1} \mathbf{u}, \; K_{ff} - K_{fu} K_{uu}^{-1} K_{uf}\big)$, with $[K_{fu}]_{nm} = k(x_n, z_m)$.
Variational Inference: We approximate the true posterior $p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})$ with a variational distribution $q(\mathbf{f}, \mathbf{u})$. Key assumption: $q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u}) \, q(\mathbf{u})$. We preserve the exact conditional $p(\mathbf{f} \mid \mathbf{u})$ but approximate the inducing distribution $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, S)$. We minimize the KL divergence $\mathrm{KL}\big(q(\mathbf{f}, \mathbf{u}) \,\|\, p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})\big)$, which is equivalent to maximizing the Evidence Lower Bound (ELBO):
$$\mathcal{L} = \sum_{n=1}^{N} \mathbb{E}_{q(f_n)}\big[\log p(y_n \mid f_n)\big] - \mathrm{KL}\big(q(\mathbf{u}) \,\|\, p(\mathbf{u})\big).$$
The difficult $p(\mathbf{f} \mid \mathbf{u})$ terms cancel out!
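The cancellation is worth seeing once. With the factorized $q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u}) \, q(\mathbf{u})$,
$$\mathcal{L} = \mathbb{E}_{q(\mathbf{f}, \mathbf{u})}\!\left[\log \frac{p(\mathbf{y} \mid \mathbf{f}) \, p(\mathbf{f} \mid \mathbf{u}) \, p(\mathbf{u})}{p(\mathbf{f} \mid \mathbf{u}) \, q(\mathbf{u})}\right] = \mathbb{E}_{q}\big[\log p(\mathbf{y} \mid \mathbf{f})\big] - \mathrm{KL}\big(q(\mathbf{u}) \,\|\, p(\mathbf{u})\big),$$
so no term ever requires evaluating an intractable density over the full latent function.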
The Bound Terms:
- Expected Log Likelihood: Since $q(f_n)$ is Gaussian (the marginal of $q(\mathbf{f}) = \int p(\mathbf{f} \mid \mathbf{u}) \, q(\mathbf{u}) \, d\mathbf{u}$), this term is analytically tractable for Gaussian noise. It involves only the marginal means and the diagonal variances of $q(\mathbf{f})$, never the full $N \times N$ covariance.
- KL Regularizer: $\mathrm{KL}\big(q(\mathbf{u}) \,\|\, p(\mathbf{u})\big)$ penalizes the inducing distribution for deviating from the prior.
Complexity: Computing the ELBO requires inverting $K_{uu}$ ($\mathcal{O}(M^3)$) and multiplying by $K_{uf}$ ($\mathcal{O}(NM^2)$). Total complexity: $\mathcal{O}(NM^2 + M^3)$ instead of $\mathcal{O}(N^3)$. This enables GPs to scale to millions of points (using mini-batching on the expected log-likelihood sum).
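For Gaussian likelihoods the bound can even be collapsed analytically by substituting the optimal $q(\mathbf{u})$ in closed form. The sketch below is a minimal, illustrative JAX implementation of that collapsed bound under choices of my own (a unit-variance RBF kernel, hard-coded noise and jitter, and the helper names `rbf` / `collapsed_elbo`); production implementations such as GPflow's SGPR use the same algebra with more numerical safeguards.
import jax.numpy as jnp
from jax.scipy.linalg import cholesky, solve_triangular

def rbf(x1, x2, lengthscale=1.0, variance=1.0):
    d2 = jnp.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return variance * jnp.exp(-0.5 * d2 / lengthscale**2)

def collapsed_elbo(X, y, Z, noise=0.1, jitter=1e-6):
    """Collapsed Titsias (2009) bound for GP regression, computed in O(N M^2)."""
    N, M = X.shape[0], Z.shape[0]
    Kuu = rbf(Z, Z) + jitter * jnp.eye(M)
    Kuf = rbf(Z, X)                                                # (M, N)
    kff_diag = jnp.full(N, 1.0)                                    # unit-variance RBF: k(x, x) = 1
    Luu = cholesky(Kuu, lower=True)
    A = solve_triangular(Luu, Kuf, lower=True) / jnp.sqrt(noise)   # (M, N)
    B = jnp.eye(M) + A @ A.T
    LB = cholesky(B, lower=True)
    c = solve_triangular(LB, A @ y, lower=True) / jnp.sqrt(noise)  # (M,)
    # log N(y; 0, Qff + noise*I) via the matrix inversion lemma
    logdet = 2.0 * jnp.sum(jnp.log(jnp.diag(LB))) + N * jnp.log(noise)
    quad = y @ y / noise - c @ c
    log_marg = -0.5 * (logdet + quad + N * jnp.log(2.0 * jnp.pi))
    # Trace correction: -(1 / (2 noise)) * tr(Kff - Qff)
    trace_term = -0.5 * (jnp.sum(kff_diag) - noise * jnp.sum(A * A)) / noise
    return log_marg + trace_term
Maximizing this quantity with respect to the kernel hyperparameters, the noise, and the inducing locations Z reproduces the Titsias procedure; the $\mathcal{O}(NM^2)$ cost is visible in the triangular solve against the $(M, N)$ cross-covariance.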
6. Deep Kernel Learning and Implementation
Usually GPs have fixed kernel families. Deep Kernel Learning (DKL): $k_{\mathrm{DKL}}(x, x') = k_{\mathrm{base}}\big(g_\theta(x), g_\theta(x')\big)$, where $g_\theta$ is a neural network feature extractor. We optimize $\theta$ jointly with the kernel hyperparameters by maximizing the Marginal Likelihood.
Gradient computation requires backpropagating through the Cholesky decomposition used to evaluate the log marginal likelihood
$$\log p(\mathbf{y} \mid \theta) = -\tfrac{1}{2} \mathbf{y}^\top K_\theta^{-1} \mathbf{y} - \tfrac{1}{2} \log\big|K_\theta\big| - \tfrac{N}{2} \log 2\pi,$$
where $K_\theta$ is the noise-augmented kernel matrix. Modern frameworks (JAX, GPyTorch) handle this automatically.
import jax
import jax.numpy as jnp
from jax import random, jit, grad
from jax.example_libraries import optimizers  # Adam (formerly jax.experimental.optimizers)
import matplotlib.pyplot as plt
# -------------------------------------------------------------
# 1. Spectral Mixture Kernel Implementation
# -------------------------------------------------------------
def spectral_mixture_kernel(x1, x2, weights, means, variances):
    """
    Computes SM Kernel: k(tau) = sum_q w_q * exp(-2 pi^2 tau^T Sigma_q tau) * cos(2 pi mu_q^T tau)
    x1: (N, D), x2: (M, D)
    weights: (Q,)
    means: (Q, D)
    variances: (Q, D)
    """
    # Tau: pairwise difference vectors
    tau = x1[:, None, :] - x2[None, :, :]  # (N, M, D)
    tau_sq = tau ** 2

    # We sum over Q components; vectorized computation for speed
    def component_func(w, m, v):
        # exp term: exp(-2 pi^2 * sum_d tau_d^2 * v_d)
        exp_term = jnp.exp(-2.0 * jnp.pi**2 * jnp.sum(tau_sq * v, axis=-1))
        # cos term: cos(2 pi * sum_d mu_d * tau_d)
        cos_term = jnp.cos(2.0 * jnp.pi * jnp.sum(tau * m, axis=-1))
        return w * exp_term * cos_term

    # vmap over the Q mixture components
    components = jax.vmap(component_func)(weights, means, variances)
    return jnp.sum(components, axis=0)
# -------------------------------------------------------------
# 2. Exact GP Implementation
# -------------------------------------------------------------
class ExactGP:
    def __init__(self, kernel_fn):
        self.kernel_fn = kernel_fn

    def loss_fn(self, params, X, y):
        # Unpack params
        noise = jnp.exp(params['log_noise'])
        # Kernel matrix with noise on the diagonal
        K = self.kernel_fn(X, X, params) + noise * jnp.eye(X.shape[0])
        # Cholesky: K = L L^T
        L = jnp.linalg.cholesky(K)
        # Solve K alpha = y
        alpha = jax.scipy.linalg.cho_solve((L, True), y)
        # Log marginal likelihood (MLL)
        term1 = -0.5 * jnp.dot(y.flatten(), alpha.flatten())
        term2 = -jnp.sum(jnp.log(jnp.diag(L)))  # -0.5 * log|K|
        term3 = -0.5 * len(y) * jnp.log(2 * jnp.pi)
        return -(term1 + term2 + term3)  # Return Negative Log Likelihood

    def predict(self, params, X_train, y_train, X_test):
        noise = jnp.exp(params['log_noise'])
        K_xx = self.kernel_fn(X_train, X_train, params) + noise * jnp.eye(X_train.shape[0])
        K_xs = self.kernel_fn(X_train, X_test, params)
        K_ss = self.kernel_fn(X_test, X_test, params)
        L = jnp.linalg.cholesky(K_xx)
        alpha = jax.scipy.linalg.cho_solve((L, True), y_train)
        mu = K_xs.T @ alpha
        v = jax.scipy.linalg.solve_triangular(L, K_xs, lower=True)
        var = K_ss - v.T @ v
        return mu, jnp.diag(var)
# -------------------------------------------------------------
# 3. Training Loop
# -------------------------------------------------------------
def train_sm_gp(X, y, num_components=4, steps=1000):
    key = random.PRNGKey(42)
    key_means, key_vars = random.split(key)
    D = X.shape[1]
    # Initialize parameters randomly (crucial for the SM kernel):
    # mixture weights, frequencies, lengthscales
    params = {
        'log_weights': jnp.log(jnp.ones(num_components) / num_components),
        'log_means': random.normal(key_means, (num_components, D)),
        'log_vars': random.normal(key_vars, (num_components, D)) - 2.0,  # Start with small bandwidth
        'log_noise': jnp.log(0.1)
    }

    # Wrapper to unpack dict to args for the kernel
    def wrapped_kernel(x1, x2, p):
        return spectral_mixture_kernel(
            x1, x2,
            jnp.exp(p['log_weights']),
            jnp.exp(p['log_means']),
            jnp.exp(p['log_vars'])
        )

    gp = ExactGP(wrapped_kernel)
    # jax.example_libraries.optimizers returns an (init, update, get_params) triple
    opt_init, opt_update, get_params = optimizers.adam(1e-2)

    @jit
    def update(i, opt_state):
        p = get_params(opt_state)
        grads = grad(gp.loss_fn)(p, X, y)
        return opt_update(i, grads, opt_state)

    opt_state = opt_init(params)
    for i in range(steps):
        opt_state = update(i, opt_state)

    final_params = get_params(opt_state)
    return gp, final_params
# The SM Kernel allows the GP to extrapolate patterns (like cardiac rhythm or airline passengers)
# far beyond the training window, unlike the RBF kernel which reverts to the mean.
7. Approximate Inference for Classification
So far we assumed Gaussian noise, $y_n = f(x_n) + \varepsilon_n$ with $\varepsilon_n \sim \mathcal{N}(0, \sigma^2)$. For Binary Classification, $y_n \in \{-1, +1\}$, the likelihood $p(y_n \mid f_n) = \sigma(y_n f_n)$ is non-Gaussian (Bernoulli/Logistic).
The posterior $p(\mathbf{f} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \mathbf{f}) \, p(\mathbf{f})$ is no longer Gaussian. The marginal likelihood integral $\int p(\mathbf{y} \mid \mathbf{f}) \, p(\mathbf{f}) \, d\mathbf{f}$ is intractable. We need approximations.
7.1 The Laplace Approximation: We approximate the posterior with a Gaussian centered at the Maximum A Posteriori (MAP) mode.
- Find the mode $\hat{\mathbf{f}} = \arg\max_{\mathbf{f}} \big[\log p(\mathbf{y} \mid \mathbf{f}) + \log p(\mathbf{f})\big]$ (typically by Newton's method).
- Compute the Hessian of the negative log posterior at the mode: $A = K^{-1} + W$, where $W = -\nabla\nabla_{\mathbf{f}} \log p(\mathbf{y} \mid \mathbf{f}) \big|_{\hat{\mathbf{f}}}$ is a diagonal matrix of second derivatives of the likelihood (e.g. $W_{nn} = \pi_n(1 - \pi_n)$ for the logistic link).
- Use $q(\mathbf{f}) = \mathcal{N}(\hat{\mathbf{f}}, A^{-1})$ for predictions. The variance accounts for the curvature of the log posterior at the mode. A minimal sketch of the mode-finding step follows this list.
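The sketch below illustrates the Newton iteration for the mode under a logistic likelihood; the `laplace_mode` helper and its fixed iteration count are my own choices, and Rasmussen & Williams (Algorithm 3.1) give the numerically stabilized $W^{1/2}$ variant used in practice.
import jax
import jax.numpy as jnp

def laplace_mode(K, y, num_iters=20):
    """Newton iterations for f_hat = argmax [log p(y|f) + log p(f)], labels y in {-1, +1}."""
    f = jnp.zeros_like(y)
    for _ in range(num_iters):
        pi = jax.nn.sigmoid(y * f)      # sigma(y_n f_n)
        grad_ll = y * (1.0 - pi)        # d/df_n log p(y_n | f_n)
        W = pi * (1.0 - pi)             # -d^2/df_n^2 log p(y_n | f_n), diagonal
        b = W * f + grad_ll
        # Newton step: f_new = (K^{-1} + W)^{-1} b = (I + K W)^{-1} K b
        A = jnp.eye(len(y)) + K * W[None, :]
        f = jnp.linalg.solve(A, K @ b)
    return f
The returned mode and the diagonal $W$ evaluated there give the Gaussian approximation $\mathcal{N}\big(\hat{\mathbf{f}}, (K^{-1} + W)^{-1}\big)$ from the bullet above.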
7.2 Expectation Propagation (EP): Laplace is a local approximation; EP is global. We approximate each likelihood term with an unnormalized Gaussian site function $t_n(f_n) = \tilde{Z}_n \, \mathcal{N}(f_n; \tilde{\mu}_n, \tilde{\sigma}_n^2)$.
We iteratively update the site parameters by moment-matching the approximate marginals against the tilted distributions formed from the cavity and the exact likelihood. EP generally works better than Laplace for GP classification but is more expensive per iteration.
8. Conclusion: The Gold Standard of Uncertainty
Gaussian Processes represent the “Bayesian Ideal” for regression. They are mathematically elegant, handling uncertainty, smoothness, and finite-sample corrections exactly. While the cost was historically prohibitive, modern methods (Sparse Variational GPs, Spectral Kernels, Deep Kernel Learning) have made them scalable to millions of data points. Perhaps most importantly, they serve as the rigorous limit for Deep Neural Networks: when we train a massive Neural Network, we are (in the infinite-width limit) implicitly performing Kernel Regression in a complex RKHS. Understanding GPs is therefore not just about “old school” statistics; it is about understanding the fundamental limits of Learning Theory itself.
Historical Timeline
| Year | Event | Significance |
|---|---|---|
| 1900 | Karl Pearson | Invents PCA (Linear Latent Variable Model). |
| 1940s | Kolmogorov/Wiener | Develop optimal linear filtering (Kriging). |
| 1994 | Radford Neal | Proves limit of infinite Neural Networks is a GP. |
| 2006 | Rasmussen & Williams | Publish “Gaussian Processes for Machine Learning”. |
| 2009 | Michalis Titsias | Sparse Variational GPs (Inducing Points). |
| 2013 | Wilson & Adams | Spectral Mixture Kernels. |
| 2016 | Wilson et al. | Deep Kernel Learning. |
| 2018 | Jacot et al. | Neural Tangent Kernel (NTK). |
Appendix A: Deriving the Matern Spectral Density
The Matern kernel arises not from arbitrary engineering, but from the Green's function of a Stochastic Partial Differential Equation (SPDE). Consider the SPDE driven by white noise $\mathcal{W}(x)$:
$$\big(\kappa^2 - \Delta\big)^{\alpha/2} f(x) = \mathcal{W}(x), \qquad \alpha = \nu + D/2.$$
This is a fractional Helmholtz equation. Taking the Fourier Transform of both sides:
$$\big(\kappa^2 + 4\pi^2 \|\omega\|^2\big)^{\alpha/2} \hat{f}(\omega) = \hat{\mathcal{W}}(\omega).$$
Since $\mathcal{W}$ is white noise, its spectral density is constant. The power spectral density of the solution is the square of the transfer function modulus:
$$S_f(\omega) \propto \big(\kappa^2 + 4\pi^2 \|\omega\|^2\big)^{-(\nu + D/2)}.$$
This matches the spectral density of the Matern class perfectly. To get the spatial kernel $k(\tau)$, we take the inverse Fourier transform. Using Schwinger parametrizations and Bessel function identities, this integral yields:
$$k_\nu(\tau) = \sigma^2 \, \frac{2^{1-\nu}}{\Gamma(\nu)} \big(\kappa \|\tau\|\big)^{\nu} K_\nu\big(\kappa \|\tau\|\big), \qquad \kappa = \frac{\sqrt{2\nu}}{\ell},$$
where $K_\nu$ is the modified Bessel function of the second kind.
Physical Interpretation:
- $\nu = 1/2$ (in one dimension): The SPDE is first order and the process is the Ornstein-Uhlenbeck process, the physics of Brownian motion with friction.
- $\nu \to \infty$: The operator converges to the Heat Operator limit, yielding the Diffusion (Heat) Kernel, which is the Gaussian RBF.
Appendix B: The Neural Tangent Kernel (NTK) Link
Why do GPs matter for Deep Learning? Neal's Theorem (1994): A single-hidden-layer Bayesian Neural Network with infinite width converges to a Gaussian Process. Let $f(x) = \frac{1}{\sqrt{H}} \sum_{j=1}^{H} v_j \, \sigma(w_j^\top x)$ with i.i.d. random weights. The Central Limit Theorem implies that for fixed $x$, $f(x)$ is a normalized sum of i.i.d. variables, hence Gaussian as the width $H \to \infty$.
The Modern View (Jacot et al., 2018): Consider the training dynamics of an infinite-width network under Gradient Descent. The network evolves as a linear model with respect to a fixed kernel
$$\Theta(x, x') = \nabla_\theta f(x; \theta)^\top \nabla_\theta f(x'; \theta),$$
called the Neural Tangent Kernel (NTK).
In the infinite-width limit, $\Theta$ stays constant during training. For squared loss (and zero initial function), the training dynamics are exactly given by the GP / Kernel Ridge Regression solution:
$$f_t(x_*) = \Theta(x_*, X)\, \Theta(X, X)^{-1} \big(I - e^{-\eta\, \Theta(X, X)\, t}\big)\, \mathbf{y} \;\longrightarrow\; \Theta(x_*, X)\, \Theta(X, X)^{-1}\, \mathbf{y} \quad (t \to \infty).$$
This proves that “training a massive Neural Network” is, in this limit, functionally equivalent to “performing Kernel Regression with a specific architecture-dependent kernel”. We can compute these kernels analytically for ReLUs, CNNs, etc. For a ReLU network, the “Arc-Cosine Kernel” emerges:
$$k(x, x') = \frac{\|x\| \|x'\|}{2\pi} \big(\sin\theta + (\pi - \theta)\cos\theta\big),$$
where $\theta = \cos^{-1}\!\big(\tfrac{x^\top x'}{\|x\| \|x'\|}\big)$ is the angle between the inputs. This gives us a closed-form way to understand what “inductive bias” a Neural Network architecture imposes: it specifies a specific RKHS norm.
Derivation of the Arc-Cosine Kernel: Consider a single neuron with weights $w \sim \mathcal{N}(0, I_D)$. The kernel is $k(x, x') = \mathbb{E}_w\big[\sigma(w^\top x)\, \sigma(w^\top x')\big]$. Let $\sigma(u) = \max(0, u)$ (ReLU). By rotational invariance, we can align $x$ with the first axis and place $x'$ in the plane spanned by the first two axes: $x = \|x\|(1, 0, \dots)$, $x' = \|x'\|(\cos\theta, \sin\theta, 0, \dots)$. Only the two weight components $(w_1, w_2)$ in this plane matter, and the activations become $\|x\| \max(0, w_1)$ and $\|x'\| \max(0, w_1 \cos\theta + w_2 \sin\theta)$. We need to compute the integral:
$$k = \|x\| \|x'\| \int_{\mathbb{R}^2} \max(0, w_1)\, \max(0, w_1 \cos\theta + w_2 \sin\theta)\, \frac{e^{-(w_1^2 + w_2^2)/2}}{2\pi}\, dw_1\, dw_2.$$
The integration domain is the sector where both arguments are positive. Using polar coordinates $w_1 = r\cos\phi$, $w_2 = r\sin\phi$, the integral factorizes:
$$k = \frac{\|x\| \|x'\|}{2\pi} \int_0^{\infty} r^3 e^{-r^2/2}\, dr \int_{\theta - \pi/2}^{\pi/2} \cos\phi \, \cos(\phi - \theta)\, d\phi.$$
The radial integral gives $\int_0^\infty r^3 e^{-r^2/2}\, dr = 2$. The angular integral yields the factor $\tfrac{1}{2}\big(\sin\theta + (\pi - \theta)\cos\theta\big)$, recovering the closed form above. This result (Cho & Saul, 2009) connects the geometric angle between data points directly to the correlation structure of the Deep Network. It implies that ReLU networks measure similarity based on angle, not Euclidean distance (unlike the RBF). This explains their superior performance on high-dimensional image data, where angular distance is more meaningful than radial distance due to concentration of measure.
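As a quick numerical sanity check on this derivation (illustrative code; the helper names `arccos_kernel` and `mc_relu_kernel` are my own), the closed form can be compared against a Monte Carlo estimate of $\mathbb{E}_w[\max(0, w^\top x)\max(0, w^\top x')]$:
import jax.numpy as jnp
from jax import random

def arccos_kernel(x1, x2):
    # Closed form: ||x|| ||x'|| / (2 pi) * (sin(theta) + (pi - theta) cos(theta))
    nx, ny = jnp.linalg.norm(x1), jnp.linalg.norm(x2)
    cos_t = jnp.clip(jnp.dot(x1, x2) / (nx * ny), -1.0, 1.0)
    theta = jnp.arccos(cos_t)
    return nx * ny / (2 * jnp.pi) * (jnp.sin(theta) + (jnp.pi - theta) * jnp.cos(theta))

def mc_relu_kernel(x1, x2, num_samples=200_000, seed=0):
    # Monte Carlo estimate of E_w[relu(w.x) relu(w.x')] with w ~ N(0, I)
    W = random.normal(random.PRNGKey(seed), (num_samples, x1.shape[0]))
    return jnp.mean(jnp.maximum(W @ x1, 0.0) * jnp.maximum(W @ x2, 0.0))

x1 = jnp.array([1.0, 0.5, -0.3])
x2 = jnp.array([0.2, 1.0, 0.4])
print(arccos_kernel(x1, x2), mc_relu_kernel(x1, x2))  # should agree to ~2 decimal places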
Appendix C: Gaussian Process Latent Variable Models (GPLVM)
Standard PCA assumes data $Y \in \mathbb{R}^{N \times D}$ comes from a linear mapping of latent variables $X \in \mathbb{R}^{N \times Q}$: $y_n = W x_n + \varepsilon_n$. Dual Probabilistic PCA (Lawrence, 2004) marginalizes the mapping $W$ and optimizes the latents $X$. This yields a Gaussian Process mapping from latent space to data space: $p(Y \mid X) = \prod_{d=1}^{D} \mathcal{N}\big(y_{:,d};\, \mathbf{0},\, K_{XX} + \sigma^2 I\big)$. The GPLVM optimizes the latent locations $X$ to maximize the likelihood of the data $Y$.
The Objective:
$$\log p(Y \mid X) = -\frac{D}{2}\log\big|K_{XX} + \sigma^2 I\big| - \frac{1}{2}\operatorname{tr}\!\big((K_{XX} + \sigma^2 I)^{-1} Y Y^\top\big) + \text{const}.$$
Since $X$ sits inside the non-linear kernel $K_{XX}$, this is a non-convex optimization problem. The GPLVM learns the manifold structure. If we use a linear kernel, we recover PCA. If we use an RBF kernel, we get a non-linear embedding that preserves local distances (similar to t-SNE, but probabilistic).
Bayesian GPLVM (Titsias & Lawrence, 2010): The standard GPLVM treats $X$ as point parameters (MAP). We can be fully Bayesian: place a prior $p(X)$ and approximate the posterior with $q(X)$. This requires a Variational Lower Bound slightly more complex than the Sparse GP one, because the kernel matrix itself is now a random variable (since it depends on the stochastic $X$). Titsias and Lawrence derived closed forms for the required kernel expectations, the $\Psi$-statistics:
$$\Psi_0 = \sum_{n} \mathbb{E}_{q(x_n)}\big[k(x_n, x_n)\big], \qquad \Psi_1 = \mathbb{E}_{q(X)}\big[K_{XZ}\big], \qquad \Psi_2 = \sum_{n} \mathbb{E}_{q(x_n)}\big[k(Z, x_n)\, k(x_n, Z)\big].$$
For the RBF kernel, these expectations are analytically tractable (they are Gaussian convolutions). This allows for Automatic Relevance Determination (ARD) on the latent dimensions, effectively learning the intrinsic dimensionality of the data automatically.
Appendix D: Glossary of Definitions
- Covariance Kernel: Encodes the similarity between points. Must be PSD.
- Inducing Points: Pseudo-points used to summarize the infinite process for scalability ($M \ll N$).
- Marginal Likelihood: The probability of data given hyperparameters, integrating out the function. The objective for model selection.
- Posterior Contraction: How the uncertainty shrinks near observed data.
- RKHS: Hilbert space defined by the kernel. Contains the “smooth” functions.
- Spectral Density: Fourier transform of the kernel (Bochner’s Theorem).
References
1. Rasmussen, C. E., & Williams, C. K. I. (2006). “Gaussian Processes for Machine Learning”. The standard text. Covers regression, classification, covariance functions, and theoretical properties.
2. Wilson, A. G., & Adams, R. P. (2013). “Gaussian Process Kernels for Pattern Discovery and Extrapolation”. Introduced Spectral Mixture Kernels. Showed that standard kernels are just local averages; spectral kernels can model long-range structure and negative covariance.
3. Titsias, M. (2009). “Variational Learning of Inducing Variables in Sparse Gaussian Processes”. The breakthrough that made GPs scalable. Instead of Nystrom (which ignores data), Titsias treated inducing points as variational parameters to be optimized.
4. Jacot, A. et al. (2018). “Neural Tangent Kernel: Convergence and Generalization in Neural Networks”. The landmark paper connecting infinite width Neural Networks to Kernel methods, providing exact solvability for Deep Learning dynamics.
5. Lawrence, N. (2004). “Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data”. Introduced the GPLVM framework, flipping the GP regression problem to optimize inputs rather than weights.
6. MacKay, D. J. C. (1998). “Introduction to Gaussian Processes”. A classic tutorial by the late Sir David MacKay. Connects GPs to Neural Networks (early) and provides a Bayesian perspective on regularization.
7. Quinonero-Candela, J., & Rasmussen, C. E. (2005). “A Unifying View of Sparse Approximate Gaussian Process Regression”. The paper that organized the confusing zoo of sparse approximations (FITC, DTC, PITC) into a single effective theory.