Gaussian Processes & RKHS
1. The Infinite-Dimensional Gaussian
A Gaussian Process is a big pile of random variables $\{f(x)\}_{x \in \mathcal{X}}$ indexed by a set $\mathcal{X}$, and every finite linear combination of them drops out normal. The definition is short and clean, but making it rigorous leans on the Kolmogorov Extension Theorem to stitch together a consistent measure on $\mathbb{R}^{\mathcal{X}}$. The whole process sits on two pieces and that is all you need.
- Mean function: $m(x) = \mathbb{E}[f(x)]$.
- Covariance kernel: $k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]$.
For the finite-dimensional distributions to hang together the kernel has to be positive semi-definite: for any set of points $\{x_1, \dots, x_n\}$ the Gram matrix $K_{ij} = k(x_i, x_j)$ has to satisfy $v^\top K v \ge 0$ for every $v \in \mathbb{R}^n$. And that is the whole constraint.
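The PSD constraint is easy to probe numerically. A minimal sketch (an RBF kernel and random 1-D inputs are stand-in choices): build a Gram matrix on arbitrary points and confirm its eigenvalues never dip meaningfully below zero.

```python
import jax.numpy as jnp
from jax import random

def rbf_gram(x, lengthscale=1.0):
    """Gram matrix K_ij = exp(-(x_i - x_j)^2 / (2 l^2)) for 1-D inputs."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return jnp.exp(-0.5 * d2 / lengthscale**2)

x = random.normal(random.PRNGKey(0), (20,))
K = rbf_gram(x)
# PSD means v^T K v >= 0 for every v, i.e. no genuinely negative
# eigenvalues (a tiny negative floor is floating-point error).
eigvals = jnp.linalg.eigvalsh(K)
```

Any function failing this check on some point set cannot be the covariance of any stochastic process.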
2. The Reproducing Kernel Hilbert Space (RKHS)
Every PSD kernel $k$ carries with it a unique Hilbert space of functions $\mathcal{H}_k$, and this space is tiny. It holds the deterministic skeleton of the random process, while the sample paths themselves fall outside it with probability one.
You build it in a few steps and each one is short.
- Define the pre-Hilbert space $\mathcal{H}_0 = \operatorname{span}\{k(\cdot, x) : x \in \mathcal{X}\}$.
- Define the inner product on basis elements: $\langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y)$.
- Extend by linearity and complete the space to get $\mathcal{H}_k$.
The Reproducing Property. For any $f \in \mathcal{H}_k$ you get $\langle f, k(\cdot, x) \rangle_{\mathcal{H}_k} = f(x)$.
The evaluation functional $\delta_x : f \mapsto f(x)$ is a continuous linear operator and this is what makes point evaluation behave.
Mercer’s Theorem. Take $\mathcal{X}$ compact and $k$ continuous and look at the integral operator $(T_k f)(x) = \int_{\mathcal{X}} k(x, y)\, f(y)\, d\mu(y)$.
This operator is compact, positive, and self-adjoint, and it comes with an orthonormal basis of eigenfunctions $\{\phi_i\}$ and eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge 0$, and these let you write the kernel as a sum: $k(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y)$.
The RKHS norm then drops into a clean spectral form: $\|f\|_{\mathcal{H}_k}^2 = \sum_{i=1}^{\infty} \frac{\langle f, \phi_i \rangle_{L^2}^2}{\lambda_i}$.
The RKHS norm punishes high-frequency pieces where $\lambda_i$ is small, and the smaller the eigenvalue the more painful it is for a function to carry energy in that mode. Sample paths of the GP blow up the RKHS norm because $\sum_i \langle f, \phi_i \rangle^2 / \lambda_i$ diverges almost surely, and this is the Driscoll Zero-One Law, which says $P(f \in \mathcal{H}_k) \in \{0, 1\}$ and equals zero whenever $\mathcal{H}_k$ is infinite-dimensional. The sample paths actually sit in a slightly bigger space, usually a fractional Sobolev space whose smoothness is set by how fast the spectrum of the kernel drops off.
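Mercer's expansion can be watched directly in code. A sketch under stated assumptions (an RBF kernel on $[0,1]$, the integral operator discretized on a uniform grid): the eigenpairs of the discretized operator rebuild the kernel from a modest number of modes.

```python
import jax.numpy as jnp

# Discretize (T_k f)(x) = \int_0^1 k(x, y) f(y) dy on a uniform grid.
n = 200
x = jnp.linspace(0.0, 1.0, n)
dx = x[1] - x[0]
K = jnp.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1**2)  # RBF, l = 0.1

# Eigenpairs of the discretized operator (K * dx is symmetric PSD).
evals, evecs = jnp.linalg.eigh(K * dx)
evals, evecs = evals[::-1], evecs[:, ::-1]   # sort descending
phis = evecs / jnp.sqrt(dx)                  # L2-orthonormal eigenfunctions

# Mercer: k(x, y) = sum_i lambda_i phi_i(x) phi_i(y); truncate at m modes.
m = 30
K_hat = (phis[:, :m] * evals[:m]) @ phis[:, :m].T
err = jnp.max(jnp.abs(K - K_hat))
```

The fast spectral decay that makes this truncation accurate is the same decay that makes the RKHS norm of a rough sample path diverge.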
3. The Geometry of the Posterior
You see noisy data $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ and you want the posterior $p(f(x_*) \mid \mathbf{y})$. On paper this boils down to Gaussian conditioning and geometrically it is an orthogonal projection sitting inside the space of random variables.
Look at the subspace spanned by the observations $y_1, \dots, y_n$. The variables $(f(x_*), y_1, \dots, y_n)$ are jointly Gaussian and so the conditional expectation $\mathbb{E}[f(x_*) \mid \mathbf{y}]$ drops out as the orthogonal projection of $f(x_*)$ onto the subspace spanned by the $y_i$.
This lines up with the Representer Theorem: the posterior mean sits inside the subspace spanned by the kernel sections and nowhere else, $\mu_*(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i)$ with $\alpha = (K + \sigma^2 I)^{-1} \mathbf{y}$.
The posterior variance is the squared norm of the residual, $\sigma_*^2(x) = k(x, x) - k_x^\top (K + \sigma^2 I)^{-1} k_x$, and it picks out the piece of $f(x)$ that sits orthogonal to the observed subspace.
The geometry here is easy to see. Variance shrinks near data points because the angle between $f(x_*)$ and the observed variables gets small, and the projection grabs almost all of the norm and leaves almost nothing behind.
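A tiny worked example of the projection picture (the RBF kernel, three observations of $\sin$, and the test locations are all stand-in choices): on top of a data point the projection captures essentially everything, and far away the residual carries nearly the full prior variance.

```python
import jax.numpy as jnp

def rbf(a, b, l=0.5):
    return jnp.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / l**2)

X = jnp.array([-1.0, 0.0, 1.0])     # training inputs
y = jnp.sin(X)                      # nearly noiseless targets
noise = 1e-4

K = rbf(X, X) + noise * jnp.eye(3)
Xs = jnp.array([0.0, 3.0])          # one point on the data, one far away
Ks = rbf(X, Xs)                     # (3, 2) cross-covariances
Kss = rbf(Xs, Xs)

alpha = jnp.linalg.solve(K, y)
mu = Ks.T @ alpha                   # representer form: sum_i alpha_i k(., x_i)
var = jnp.diag(Kss - Ks.T @ jnp.linalg.solve(K, Ks))
# var[0] ~ 0 at the observed input; var[1] ~ 1, the full prior variance.
```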
4. Spectral Kernels and Bochner’s Theorem
Theorem (Bochner). A continuous and shift-invariant function $k(\tau)$ on $\mathbb{R}^D$ is positive definite exactly when it is the Fourier transform of a finite non-negative measure: $k(\tau) = \int_{\mathbb{R}^D} e^{2\pi i \omega^\top \tau}\, dS(\omega)$.
Since $k$ is real the density has to be symmetric, $S(\omega) = S(-\omega)$, and this is what people call the spectral density.
Spectral Mixture Kernels (Wilson & Adams, 2013) come from a simple observation. Standard kernels match up with single-bump spectral densities centered at the origin and that is all they can do.
- RBF: $S(\omega) \propto \exp(-2\pi^2 \ell^2 \omega^2)$ (Gaussian).
- Matern: $S(\omega) \propto (\kappa^2 + 4\pi^2 \omega^2)^{-(\nu + D/2)}$ (Student-t / Cauchy-like). These densities fall off monotonically and they cannot hold negative covariances and they cannot hold periodic structure.
The set of Gaussian mixtures is dense in the space of distribution functions under weak convergence and so you can approximate any spectral density this way.
Taking the inverse Fourier transform hands you the Spectral Mixture kernel (written here in 1D): $k(\tau) = \sum_{q=1}^{Q} w_q \exp(-2\pi^2 \tau^2 v_q) \cos(2\pi \mu_q \tau)$.
The parameters are easy to read off the expression.
- $w_q$: Contribution of the $q$-th mode.
- $v_q$: Inverse lengthscale (frequency variance).
- $\mu_q$: Frequency of the periodicity.
By learning the weights and means and variances of the spectral density a GP ends up doing non-parametric Fourier analysis and it handles noise and missing data at the same time. Learning the spectral density swaps out manual kernel selection for kernel learning straight from the data and this is a big deal.
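A quick sanity check on the spectral picture (one mixture component, with parameters chosen purely for illustration): pinning the spectral Gaussian at zero frequency collapses the SM kernel back to an RBF, while a nonzero frequency mean produces negative covariances that no monotone kernel can express.

```python
import jax.numpy as jnp

tau = jnp.linspace(-3.0, 3.0, 101)
v = 0.05  # frequency variance of the single component

# mu = 0: k(tau) = exp(-2 pi^2 v tau^2) is an RBF with l^2 = 1 / (4 pi^2 v).
sm_zero = jnp.exp(-2.0 * jnp.pi**2 * v * tau**2) * jnp.cos(2.0 * jnp.pi * 0.0 * tau)
l2 = 1.0 / (4.0 * jnp.pi**2 * v)
rbf = jnp.exp(-0.5 * tau**2 / l2)
max_gap = jnp.max(jnp.abs(sm_zero - rbf))

# mu = 0.5: the cosine factor drives the covariance negative at tau = 1.
sm_half = jnp.exp(-2.0 * jnp.pi**2 * v * 1.0**2) * jnp.cos(2.0 * jnp.pi * 0.5 * 1.0)
```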
5. Scaling: Inducing Points (Sparse GPs)
The $O(N^3)$ cost of exact inference pushes you toward low-rank approximations of the kernel matrix. Nystrom approximations that just subsample the matrix are numerically flaky, and the modern fix is the Variational Sparse GP (Titsias, 2009), which treats the inducing points as variational parameters and tunes them.
The augmented model starts by dropping in inducing inputs $Z = \{z_1, \dots, z_M\}$ and inducing variables $u = f(Z)$, and the joint prior factors cleanly: $p(f, u) = p(f \mid u)\, p(u)$,
with $p(u) = \mathcal{N}(0, K_{mm})$ and $p(f \mid u) = \mathcal{N}(K_{nm} K_{mm}^{-1} u,\; K_{nn} - Q_{nn})$, where $Q_{nn} = K_{nm} K_{mm}^{-1} K_{mn}$.
For variational inference you approximate the true posterior $p(f, u \mid \mathbf{y})$ with a variational distribution $q(f, u)$, and the key move is to set $q(f, u) = p(f \mid u)\, q(u)$. You hold the exact conditional fixed and only approximate the inducing distribution, and then you minimize the KL divergence, which is the same thing as maximizing the Evidence Lower Bound.
The $p(f \mid u)$ terms cancel top and bottom and you are left with this: $\mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(f_i)}[\log p(y_i \mid f_i)] - \mathrm{KL}\big(q(u)\,\|\,p(u)\big)$.
The bound splits into two pieces and both are clean.
- Expected log likelihood. Since $q(f_i)$ is Gaussian, as the marginal of $p(f \mid u)\, q(u)$, this one is analytically tractable and it only touches the marginal means and diagonal variances and nothing else.
- KL regularizer. $\mathrm{KL}(q(u) \| p(u))$ pulls the inducing distribution back toward the prior whenever it drifts away.
Computing the ELBO forces you to invert $K_{mm}$, which costs $O(M^3)$, and multiply by $K_{nm}$, which costs $O(NM^2)$, and the total comes out to $O(NM^2 + M^3)$. This stretches GPs out to millions of points once you mini-batch the expected log-likelihood sum.
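For a Gaussian likelihood the bound above collapses to a closed form, and that collapsed version fits in a few lines. A minimal sketch (the RBF kernel, the toy sine data, and the inducing-point choice are all stand-ins): the bound never exceeds the exact log marginal likelihood, and setting $Z = X$ recovers it.

```python
import jax.numpy as jnp
from jax.scipy.stats import multivariate_normal as mvn

def rbf(a, b, l=1.0):
    return jnp.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / l**2)

def exact_lml(X, y, noise):
    K = rbf(X, X) + noise * jnp.eye(X.shape[0])
    return mvn.logpdf(y, jnp.zeros_like(y), K)

def titsias_bound(X, y, Z, noise):
    # Collapsed ELBO: log N(y | 0, Q_nn + noise I) - tr(K_nn - Q_nn) / (2 noise)
    Kmm = rbf(Z, Z) + 1e-6 * jnp.eye(Z.shape[0])   # jitter for stability
    Knm = rbf(X, Z)
    Qnn = Knm @ jnp.linalg.solve(Kmm, Knm.T)
    fit = mvn.logpdf(y, jnp.zeros_like(y), Qnn + noise * jnp.eye(X.shape[0]))
    return fit - jnp.trace(rbf(X, X) - Qnn) / (2.0 * noise)

X = jnp.linspace(-3.0, 3.0, 40)
y = jnp.sin(X)
noise = 0.1
full = exact_lml(X, y, noise)
sparse = titsias_bound(X, y, X[::4], noise)   # M = 10 inducing points
tight = titsias_bound(X, y, X, noise)         # Z = X: the bound becomes exact
```

The trace penalty is the part that punishes inducing points for failing to explain the training inputs, which is what drives them to spread over the data.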
6. Deep Kernel Learning and Implementation
Standard GPs run with fixed kernel families. Deep Kernel Learning glues a base kernel onto a learned feature map, so $k_{\text{deep}}(x, x') = k_{\text{base}}(g_\theta(x), g_\theta(x'))$, where $g_\theta$ is a neural network and the parameters get tuned by pushing up the marginal likelihood.
Computing the gradient forces you to differentiate through the Cholesky decomposition and use $\frac{\partial}{\partial \theta} \log|K_\theta| = \operatorname{tr}\!\big(K_\theta^{-1} \frac{\partial K_\theta}{\partial \theta}\big)$, and modern frameworks like JAX and GPyTorch handle this for you.
import jax
import jax.numpy as jnp
import jax.scipy.linalg
from jax import random, jit, grad
from jax.example_libraries import optimizers  # moved here from jax.experimental
import matplotlib.pyplot as plt

# -------------------------------------------------------------
# 1. Spectral Mixture Kernel Implementation
# -------------------------------------------------------------
def spectral_mixture_kernel(x1, x2, weights, means, variances):
    """
    Computes SM Kernel: k(tau) = sum_q w_q * exp(-2 pi^2 tau^T Sigma_q tau) * cos(2 pi mu_q^T tau)
    x1, x2: (N, D)
    weights: (Q,)
    means: (Q, D)
    variances: (Q, D)
    """
    # Tau: pairwise distance vectors
    tau = x1[:, None, :] - x2[None, :, :]  # (N, N, D)
    tau_sq = tau ** 2

    # One mixture component, vectorized over all pairs
    def component_func(w, m, v):
        # exp term: exp(-2 pi^2 * tau^2 * v)
        exp_term = jnp.exp(-2.0 * jnp.pi**2 * jnp.sum(tau_sq * v, axis=-1))
        # cos term: cos(2 pi * mu * tau)
        cos_term = jnp.cos(2.0 * jnp.pi * jnp.sum(tau * m, axis=-1))
        return w * exp_term * cos_term

    # vmap over the Q mixture components, then sum them
    components = jax.vmap(component_func)(weights, means, variances)
    return jnp.sum(components, axis=0)

# -------------------------------------------------------------
# 2. Exact GP Implementation
# -------------------------------------------------------------
class ExactGP:
    def __init__(self, kernel_fn):
        self.kernel_fn = kernel_fn

    def loss_fn(self, params, X, y):
        # Unpack params
        noise = jnp.exp(params['log_noise'])
        # Kernel matrix with noise on the diagonal
        K = self.kernel_fn(X, X, params) + noise * jnp.eye(X.shape[0])
        # Cholesky factorization
        L = jnp.linalg.cholesky(K)
        # Solve K alpha = y
        alpha = jax.scipy.linalg.cho_solve((L, True), y)
        # Marginal log likelihood terms
        term1 = -0.5 * jnp.dot(y.flatten(), alpha.flatten())
        term2 = -jnp.sum(jnp.log(jnp.diag(L)))
        term3 = -0.5 * len(y) * jnp.log(2 * jnp.pi)
        return -(term1 + term2 + term3)  # Return Negative Log Likelihood

    def predict(self, params, X_train, y_train, X_test):
        noise = jnp.exp(params['log_noise'])
        K_xx = self.kernel_fn(X_train, X_train, params) + noise * jnp.eye(X_train.shape[0])
        K_xs = self.kernel_fn(X_train, X_test, params)
        K_ss = self.kernel_fn(X_test, X_test, params)
        L = jnp.linalg.cholesky(K_xx)
        alpha = jax.scipy.linalg.cho_solve((L, True), y_train)
        mu = K_xs.T @ alpha
        v = jax.scipy.linalg.solve_triangular(L, K_xs, lower=True)
        var = K_ss - v.T @ v
        return mu, jnp.diag(var)

# -------------------------------------------------------------
# 3. Training Loop
# -------------------------------------------------------------
def train_sm_gp(X, y, num_components=4, steps=1000):
    key_means, key_vars = random.split(random.PRNGKey(42))
    D = X.shape[1]
    # Initialize parameters randomly (crucial for the SM kernel):
    # mixture weights, frequencies, bandwidths
    params = {
        'log_weights': jnp.log(jnp.ones(num_components) / num_components),
        'log_means': random.normal(key_means, (num_components, D)),
        'log_vars': random.normal(key_vars, (num_components, D)) - 2.0,  # Start with small bandwidths
        'log_noise': jnp.log(0.1),
    }

    # Wrapper to unpack the param dict into kernel arguments
    def wrapped_kernel(x1, x2, p):
        return spectral_mixture_kernel(
            x1, x2,
            jnp.exp(p['log_weights']),
            jnp.exp(p['log_means']),
            jnp.exp(p['log_vars']),
        )

    gp = ExactGP(wrapped_kernel)
    opt_init, opt_update, get_params = optimizers.adam(1e-2)

    @jit
    def update(i, opt_state):
        p = get_params(opt_state)
        grads = grad(gp.loss_fn)(p, X, y)
        return opt_update(i, grads, opt_state)

    opt_state = opt_init(params)
    for i in range(steps):
        opt_state = update(i, opt_state)
    return gp, get_params(opt_state)

# The SM Kernel allows the GP to extrapolate patterns (like cardiac rhythm or airline passengers)
# far beyond the training window, unlike the RBF kernel which reverts to the mean.
7. Approximate Inference for Classification
Everything above assumed Gaussian observation noise, $p(y \mid f) = \mathcal{N}(y; f, \sigma^2)$. For binary classification with $y \in \{-1, +1\}$ the likelihood $p(y \mid f) = \sigma(y f)$ stops being Gaussian.
The posterior stops being Gaussian and the marginal likelihood turns intractable and you have to fall back on approximate inference.
7.1 The Laplace Approximation. You approximate the posterior with a Gaussian centered at the MAP estimate $\hat{f}$ and that is the whole idea.
- Find the mode $\hat{f} = \arg\max_f \big[\log p(\mathbf{y} \mid f) - \tfrac{1}{2} f^\top K^{-1} f\big]$.
- Compute the Hessian of the negative log posterior at the mode: $A = K^{-1} + W$, where $W$ is the diagonal matrix of second derivatives of the negative log-likelihood, and for the logistic likelihood you get $W_{ii} = \pi_i (1 - \pi_i)$ with $\pi_i = \sigma(\hat{f}_i)$ and nothing messier than that.
- Use $q(f \mid \mathbf{y}) = \mathcal{N}\big(\hat{f}, (K^{-1} + W)^{-1}\big)$ for predictions. The predictive variance soaks up the curvature through $W$ and that is where the uncertainty lives.
7.2 Expectation Propagation. The Laplace approximation is local and only grabs the curvature at the mode, while EP gives you a global approximation by matching moments. Each likelihood factor $p(y_i \mid f_i)$ gets replaced by an unnormalized Gaussian site function $\tilde{t}_i(f_i) = \tilde{Z}_i\, \mathcal{N}(f_i; \tilde{\mu}_i, \tilde{\sigma}_i^2)$.
You iterate on the site parameters by matching the first and second moments of the cavity distributions. EP usually gives better calibrated predictive probabilities than the Laplace approximation for GP classification and you pay for it with extra iterations and more bookkeeping.
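The Laplace mode-finding step is a short Newton iteration. A minimal sketch in the style of Rasmussen & Williams' Algorithm 3.1 (the RBF kernel and toy sign-labeled data are stand-in choices):

```python
import jax.numpy as jnp

def sigmoid(z):
    return 1.0 / (1.0 + jnp.exp(-z))

def laplace_mode(K, y, iters=20):
    """Newton iteration for the MAP latent f of a GP classifier,
    logistic likelihood, labels y in {-1, +1}."""
    n = y.shape[0]
    f = jnp.zeros(n)
    for _ in range(iters):
        pi = sigmoid(f)
        W = pi * (1.0 - pi)          # -d^2/df^2 log p(y|f), the diagonal of W
        g = (y + 1.0) / 2.0 - pi     # d/df log p(y|f)
        b = W * f + g
        # Newton step: f <- (K^{-1} + W)^{-1} (W f + g) = K (I + W K)^{-1} b
        f = K @ jnp.linalg.solve(jnp.eye(n) + W[:, None] * K, b)
    return f

x = jnp.linspace(-2.0, 2.0, 10)
y = jnp.sign(x)                      # labels follow the sign of the input
K = jnp.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
f_hat = laplace_mode(K, y)
# The latent mode separates the classes: negative on the left, positive on the right.
```

The matrix $I + WK$ that the solve factors is the same curvature $K^{-1} + W$ from the bullet points, rearranged to avoid inverting $K$ directly.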
8. The Bayesian Ideal for Uncertainty
Gaussian Processes give you a principled Bayesian framework for regression and function estimation and they handle uncertainty and smoothness and finite-sample corrections in a mathematically exact way. The cost used to be a blocker and sparse variational GPs and spectral kernels and deep kernel learning have stretched them out to datasets with millions of observations.
The deeper point is the link to neural networks. Training a neural network in the infinite-width limit is the same thing as kernel regression in an architecture-dependent RKHS, and so understanding GPs is a prerequisite for understanding the inductive biases and the generalization behavior of modern deep learning.
Timeline
| Year | Event | Significance |
|---|---|---|
| 1901 | Karl Pearson | Invents PCA (Linear Latent Variable Model). |
| 1940s | Kolmogorov/Wiener | Develop optimal linear filtering (Kriging). |
| 1994 | Radford Neal | Proves limit of infinite Neural Networks is a GP. |
| 2006 | Rasmussen & Williams | Publish “Gaussian Processes for Machine Learning”. |
| 2009 | Michalis Titsias | Sparse Variational GPs (Inducing Points). |
| 2013 | Wilson & Adams | Spectral Mixture Kernels. |
| 2016 | Wilson et al. | Deep Kernel Learning. |
| 2018 | Jacot et al. | Neural Tangent Kernel (NTK). |
Deriving the Matern Spectral Density
The Matern kernel pops out as the Green’s function of a stochastic partial differential equation. Take the SPDE $(\kappa^2 - \Delta)^{\alpha/2} f(x) = \mathcal{W}(x)$, with $\alpha = \nu + D/2$, driven by white noise $\mathcal{W}$, and watch what happens.
This is a fractional Helmholtz equation and taking the Fourier transform of both sides untangles it right away: $(\kappa^2 + 4\pi^2 |\omega|^2)^{\alpha/2}\, \hat{f}(\omega) = \hat{\mathcal{W}}(\omega)$.
Since $\mathcal{W}$ is white noise its spectral density is constant, $S_{\mathcal{W}}(\omega) \equiv 1$, and the power spectral density of the solution drops out as the squared modulus of the transfer function: $S_f(\omega) = (\kappa^2 + 4\pi^2 |\omega|^2)^{-\alpha} = (\kappa^2 + 4\pi^2 |\omega|^2)^{-(\nu + D/2)}$.
This is exactly the spectral density of the Matern class and the spatial kernel comes back by running the inverse Fourier transform: $k(\tau) = \int_{\mathbb{R}^D} S_f(\omega)\, e^{2\pi i \omega^\top \tau}\, d\omega$.
Using Schwinger parametrization and Bessel function identities this integral collapses down to a clean form: $k(r) \propto (\kappa r)^{\nu} K_{\nu}(\kappa r)$, the Matern kernel with $\kappa = \sqrt{2\nu}/\ell$.
The physics reads simply. At $\nu = 1/2$ the SPDE is first-order and you get the Ornstein-Uhlenbeck process, which is Brownian motion with friction. As $\nu \to \infty$ the operator slides toward the heat operator and the kernel drifts toward the Gaussian RBF, which is the diffusion kernel.
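The $\alpha = 1$ slice of this derivation can be checked numerically in 1D (the value of $\kappa$ and the quadrature grid below are arbitrary choices): the inverse Fourier transform of $S(\omega) = (\kappa^2 + 4\pi^2\omega^2)^{-1}$ should land on the Ornstein-Uhlenbeck covariance $e^{-\kappa|\tau|} / (2\kappa)$.

```python
import jax.numpy as jnp

kappa = 2.0
w = jnp.linspace(-200.0, 200.0, 400001)        # frequency quadrature grid
dw = w[1] - w[0]
S = 1.0 / (kappa**2 + 4.0 * jnp.pi**2 * w**2)  # Matern density, alpha = 1

def k_numeric(tau):
    # k(tau) = \int S(w) e^{2 pi i w tau} dw; the cosine keeps the real part
    return jnp.sum(S * jnp.cos(2.0 * jnp.pi * w * tau)) * dw

def k_exact(tau):
    return jnp.exp(-kappa * jnp.abs(tau)) / (2.0 * kappa)

err0 = jnp.abs(k_numeric(0.0) - k_exact(0.0))  # total-variance check
err1 = jnp.abs(k_numeric(1.3) - k_exact(1.3))  # off-origin check
```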
The Neural Tangent Kernel (NTK)
Neal’s Theorem (1994). A single-layer Bayesian neural network at infinite width turns into a Gaussian Process. Let $f(x) = \frac{1}{\sqrt{H}} \sum_{j=1}^{H} v_j\, \phi(w_j^\top x)$, and by the Central Limit Theorem, at fixed $x$ the output is a sum of i.i.d. terms and drops to Gaussian in the limit $H \to \infty$.
The Modern Extension (Jacot et al., 2018). Look at the training dynamics of an infinite-width network under gradient descent. The network moves as a linear model against a fixed kernel, which is the Neural Tangent Kernel: $\Theta(x, x') = \langle \nabla_\theta f(x; \theta),\, \nabla_\theta f(x'; \theta) \rangle$.
In the infinite-width limit $\Theta$ holds constant all the way through training and the training dynamics drop out exactly as the kernel regression solution: for squared loss, trained to convergence, $f_\infty(x_*) = \Theta(x_*, X)\, \Theta(X, X)^{-1}\, \mathbf{y}$.
So training an infinite-width network is really kernel regression with an architecture-dependent kernel and you can work these kernels out analytically for specific activations and architectures like ReLU networks and CNNs. For a ReLU network the arc-cosine kernel falls out: $k(x, x') = \frac{\|x\|\,\|x'\|}{2\pi}\big(\sin\theta + (\pi - \theta)\cos\theta\big)$,
with $\theta = \arccos\frac{x^\top x'}{\|x\|\,\|x'\|}$ the angle between inputs. The kernel nails down the inductive bias of the architecture by picking out the RKHS norm that the network implicitly regularizes.
To derive the arc-cosine kernel look at a single neuron with $w \sim \mathcal{N}(0, I)$. The kernel is $k(x, x') = \mathbb{E}_w[\phi(w^\top x)\, \phi(w^\top x')]$ and you take $\phi(u) = \max(0, u)$ for ReLU. By rotational invariance you can spin $x$ onto the first coordinate axis and drop $x'$ into the plane spanned by the first two axes, so $w^\top x = \|x\|\, w_1$ and $w^\top x' = \|x'\|\,(w_1 \cos\theta + w_2 \sin\theta)$. Writing out the product of activations turns the expectation into a concrete integral: $k = \frac{\|x\|\,\|x'\|}{2\pi} \iint_{D} w_1 (w_1 \cos\theta + w_2 \sin\theta)\, e^{-(w_1^2 + w_2^2)/2}\, dw_1\, dw_2$, where $D$ is the region where both activations fire.
The integration domain is the sector where both arguments come out positive, and passing to polar coordinates $w_1 = r\cos\varphi$, $w_2 = r\sin\varphi$ splits it wide open: $k = \frac{\|x\|\,\|x'\|}{2\pi} \left( \int_0^{\infty} r^3 e^{-r^2/2}\, dr \right) \int_{\theta - \pi/2}^{\pi/2} \cos\varphi\, \cos(\varphi - \theta)\, d\varphi$.
The radial integral gives $\int_0^\infty r^3 e^{-r^2/2}\, dr = 2$ and the angular integral hands back the factor $\frac{1}{2}\big(\sin\theta + (\pi - \theta)\cos\theta\big)$, so $k(x, x') = \frac{\|x\|\,\|x'\|}{2\pi}\big(\sin\theta + (\pi - \theta)\cos\theta\big)$. This result (Cho & Saul, 2009) says ReLU networks measure similarity by the angle between data points and not by Euclidean distance the way the RBF kernel does. This angular dependence is part of why they do so well on high-dimensional data like images, where Euclidean distances lose information because of concentration of measure.
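The closed form is easy to validate by Monte Carlo over the weight distribution (dimension, seed, and sample count below are arbitrary choices for illustration):

```python
import jax.numpy as jnp
from jax import random

def arccos_kernel(x1, x2):
    """Closed form of E_w[relu(w^T x) relu(w^T x')], w ~ N(0, I)."""
    n1, n2 = jnp.linalg.norm(x1), jnp.linalg.norm(x2)
    cos_t = jnp.clip(jnp.dot(x1, x2) / (n1 * n2), -1.0, 1.0)
    theta = jnp.arccos(cos_t)
    return n1 * n2 / (2.0 * jnp.pi) * (jnp.sin(theta) + (jnp.pi - theta) * jnp.cos(theta))

kw, k1, k2 = random.split(random.PRNGKey(0), 3)
d = 5
x1 = random.normal(k1, (d,))
x2 = random.normal(k2, (d,))
W = random.normal(kw, (500_000, d))   # Monte Carlo draws of one neuron's weights

relu = lambda z: jnp.maximum(z, 0.0)
mc = jnp.mean(relu(W @ x1) * relu(W @ x2))
exact = arccos_kernel(x1, x2)
rel_err = jnp.abs(mc - exact) / exact
```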
Gaussian Process Latent Variable Models (GPLVM)
Standard PCA assumes data comes from a linear mapping of latent variables, $y_n = W z_n + \epsilon_n$ with $z_n \in \mathbb{R}^Q$, $Q < D$. Dual Probabilistic PCA (Lawrence, 2004) integrates out $W$ and optimizes $Z$ instead, and this gives you a Gaussian Process mapping from latent space to data space, $p(Y \mid Z) = \prod_{d=1}^{D} \mathcal{N}(y_{:,d} \mid 0,\, K_{ZZ} + \sigma^2 I)$. The GPLVM tunes the latent locations $Z$ to push up the likelihood of the data $Y$.
The objective looks like this and it is short: $\log p(Y \mid Z) = -\frac{D}{2}\log|K| - \frac{1}{2}\operatorname{tr}(K^{-1} Y Y^\top) + \text{const}$, with $K = K_{ZZ} + \sigma^2 I$.
Since $Z$ enters through the non-linear kernel this optimization problem is not convex. The kernel you pick sets the class of manifold structures you can recover: a linear kernel recovers PCA exactly, and an RBF kernel gives you a non-linear embedding that holds onto local distances, and it behaves a lot like t-SNE but sitting inside a probabilistic framework.
Bayesian GPLVM (Titsias & Lawrence, 2010) goes further. The standard GPLVM treats $Z$ as point estimates at the MAP, and the Bayesian version drops a prior $p(Z)$ on it and approximates the posterior variationally. This forces you to build a variational lower bound that accounts for the kernel matrix being a random variable because it depends on the stochastic inputs $Z$. Titsias worked out closed-form expressions for the required $\Psi$-statistics.
For the RBF kernel these expectations, $\Psi_1 = \mathbb{E}_{q(Z)}[K_{nm}]$ and its relatives, are analytically tractable and they collapse to Gaussian convolutions. This framework opens the door to Automatic Relevance Determination on the latent dimensions and it picks out the intrinsic dimensionality of the data on its own.
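The MAP objective fits in a few lines of JAX, and the gradient with respect to the latent locations comes for free. A minimal sketch (RBF kernel, toy dimensions, and fixed hyperparameters are all stand-ins), cross-checked against the per-dimension multivariate-normal log-density it should equal:

```python
import jax.numpy as jnp
from jax import grad, random
from jax.scipy.linalg import cho_solve
from jax.scipy.stats import multivariate_normal as mvn

def kernel(Z, l=1.0, noise=0.1):
    d2 = jnp.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * d2 / l**2) + noise * jnp.eye(Z.shape[0])

def gplvm_nll(Z, Y):
    """-log p(Y|Z) = D/2 log|K| + 1/2 tr(K^{-1} Y Y^T) + ND/2 log 2pi."""
    N, D = Y.shape
    K = kernel(Z)
    L = jnp.linalg.cholesky(K)
    logdet = 2.0 * jnp.sum(jnp.log(jnp.diag(L)))
    quad = jnp.sum(Y * cho_solve((L, True), Y))   # tr(K^{-1} Y Y^T)
    return 0.5 * (D * logdet + quad + N * D * jnp.log(2.0 * jnp.pi))

N, D, Q = 30, 4, 2
kz, ky = random.split(random.PRNGKey(1))
Z = random.normal(kz, (N, Q))
Y = random.normal(ky, (N, D))

nll = gplvm_nll(Z, Y)
# Cross-check: p(Y|Z) factorizes over output dimensions as independent GPs.
direct = -sum(mvn.logpdf(Y[:, j], jnp.zeros(N), kernel(Z)) for j in range(D))
dZ = grad(gplvm_nll)(Z, Y)   # gradient for optimizing the latent locations
```

Feeding `dZ` to any gradient-based optimizer gives the standard MAP GPLVM fit; the Bayesian version replaces the point `Z` with a variational distribution.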
Terminology
| Term | Meaning |
|---|---|
| Covariance kernel | A positive semi-definite function that specifies the second-order structure of the GP |
| Gram matrix | The matrix $K_{ij} = k(x_i, x_j)$ evaluated at training points |
| Kolmogorov extension | The theorem guaranteeing a consistent measure on $\mathbb{R}^{\mathcal{X}}$ from finite-dimensional marginals |
| Marginal likelihood | $p(\mathbf{y} \mid X, \theta) = \int p(\mathbf{y} \mid f)\, p(f \mid \theta)\, df$, the objective for hyperparameter optimization |
| Mercer expansion | The eigendecomposition $k(x, y) = \sum_i \lambda_i \phi_i(x) \phi_i(y)$ |
| Posterior mean | $\mathbb{E}[f_* \mid \mathbf{y}] = k_*^\top (K + \sigma^2 I)^{-1} \mathbf{y}$, the optimal prediction under squared loss |
| Spectral density | The Fourier transform of a stationary kernel, which determines frequency content |
| Inducing points | A set of pseudo-inputs used to construct a low-rank approximation to the full GP |
| Driscoll zero-one law | GP sample paths have infinite RKHS norm with probability one |
| NTK | The Neural Tangent Kernel, which governs training dynamics of wide neural networks |
References
1. Rasmussen, C. E., & Williams, C. K. I. (2006). “Gaussian Processes for Machine Learning”. The standard text and it covers regression and classification and covariance functions and theoretical properties.
2. Wilson, A. G., & Adams, R. P. (2013). “Gaussian Process Kernels for Pattern Discovery and Extrapolation”. Introduced Spectral Mixture Kernels and showed that standard kernels are just local averages while spectral kernels can hold long-range structure and negative covariance.
3. Titsias, M. (2009). “Variational Learning of Inducing Variables in Sparse Gaussian Processes”. The breakthrough that made GPs scalable. Instead of Nystrom which ignores data Titsias treated inducing points as variational parameters to be tuned.
4. Jacot, A. et al. (2018). “Neural Tangent Kernel: Convergence and Generalization in Neural Networks”. The landmark paper linking infinite width Neural Networks to Kernel methods and giving exact solvability for Deep Learning dynamics.
5. Lawrence, N. (2004). “Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data”. Introduced the GPLVM framework and flipped the GP regression problem to tune inputs rather than weights.
6. MacKay, D. J. C. (1998). “Introduction to Gaussian Processes”. A classic tutorial by the late Sir David MacKay and it connects GPs to Neural Networks early on and gives a Bayesian perspective on regularization.
7. Quinonero-Candela, J., & Rasmussen, C. E. (2005). “A Unifying View of Sparse Approximate Gaussian Process Regression”. The paper that organized the confusing zoo of sparse approximations like FITC and DTC and PITC into a single effective theory.