Gaussian Processes & RKHS

1. The Infinite-Dimensional Gaussian

A Gaussian Process is a big pile of random variables indexed by a set $X$ such that every finite linear combination of them is Gaussian. The definition is short and clean, but making it rigorous leans on the Kolmogorov Extension Theorem to stitch together a consistent measure on $\mathbb{R}^X$. The whole process sits on two pieces and that is all you need.

  1. Mean function: $m(x) = \mathbb{E}[f(x)]$.
  2. Covariance kernel: $k(x, x') = \mathbb{E}[(f(x)-m(x))(f(x')-m(x'))]$.

For the finite-dimensional distributions to hang together the kernel $k$ has to be positive semi-definite: for any set of points $\{x_i\}$ the Gram matrix $K_{ij} = k(x_i, x_j)$ has to satisfy $v^T K v \ge 0$ for every $v \in \mathbb{R}^n$. And that is the whole constraint.
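
As a concrete check, here is a minimal JAX sketch (the RBF kernel is just an illustrative choice) that builds a Gram matrix, confirms it is PSD up to floating-point error, and draws a finite-dimensional sample of the process:

```python
import jax.numpy as jnp
from jax import random

def rbf_kernel(x1, x2, lengthscale=1.0):
    # Example PSD kernel: k(x, x') = exp(-(x - x')^2 / (2 l^2))
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return jnp.exp(-0.5 * d2 / lengthscale**2)

key = random.PRNGKey(0)
x = jnp.linspace(-3.0, 3.0, 100)
K = rbf_kernel(x, x)

# PSD check: smallest eigenvalue should be >= 0 up to numerical error
print(jnp.min(jnp.linalg.eigvalsh(K)))

# Finite-dimensional marginal: f ~ N(0, K), sampled via Cholesky with jitter
L = jnp.linalg.cholesky(K + 1e-8 * jnp.eye(x.shape[0]))
f_sample = L @ random.normal(key, (x.shape[0],))
```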

2. The Reproducing Kernel Hilbert Space (RKHS)

Every PSD kernel $k$ carries with it a unique Hilbert space $\mathcal{H}_k$ of functions $f: X \to \mathbb{R}$, and this space is surprisingly small. It holds the deterministic skeleton of the random process, yet with probability one it fails to contain the sample paths themselves.

You build it in a few steps and each one is short.

  1. Define the pre-Hilbert space $\mathcal{H}_0 = \text{span}\{ k(\cdot, x) \mid x \in X \}$.
  2. Define the inner product on basis elements: $\langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}_k} = k(x, y)$.
  3. Extend by linearity and complete the space to get $\mathcal{H}_k$.

The Reproducing Property. For any $f \in \mathcal{H}_k$ you get

$$\langle f, k(\cdot, x) \rangle_{\mathcal{H}_k} = f(x)$$

The evaluation functional $\delta_x: f \mapsto f(x)$ is a bounded, hence continuous, linear functional, and this is what makes point evaluation behave.

$$|f(x)| = |\langle f, k_x \rangle| \le \|f\|_{\mathcal{H}_k} \|k_x\|_{\mathcal{H}_k} = \|f\|_{\mathcal{H}_k} \sqrt{k(x, x)}$$
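
A small numerical illustration (a sketch; the RBF kernel and the specific points are arbitrary choices): take $f = \sum_i \alpha_i k(\cdot, x_i)$ in the span, evaluate it at a new point through the reproducing property, and confirm the Cauchy-Schwarz bound above.

```python
import jax.numpy as jnp
from jax import random

def rbf_kernel(x1, x2, lengthscale=1.0):
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

key = random.PRNGKey(1)
xs = jnp.array([-1.0, 0.0, 2.0])             # centers of the kernel sections
alpha = random.normal(key, (3,))              # f = sum_i alpha_i k(., x_i)

K = rbf_kernel(xs, xs)
f_norm = jnp.sqrt(alpha @ K @ alpha)          # ||f||_{H_k} = sqrt(alpha^T K alpha)

x_new = jnp.array([0.7])
f_at_x = (alpha @ rbf_kernel(xs, x_new))[0]   # <f, k(., x)> = f(x)

# |f(x)| <= ||f|| * sqrt(k(x, x))
print(jnp.abs(f_at_x), f_norm * jnp.sqrt(rbf_kernel(x_new, x_new)[0, 0]))
```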

Mercer’s Theorem. Take $X$ compact and $k$ continuous and look at the integral operator $T_k: L^2(X) \to L^2(X)$,

$$(T_k f)(x) = \int_X k(x, y) f(y) \, d\mu(y)$$

This operator is compact, positive, and self-adjoint, and it comes with an orthonormal basis of eigenfunctions $\{\phi_i\}$ and eigenvalues $\lambda_i \ge 0$, and these let you write the kernel as a sum.

$$k(x, y) = \sum_{i=1}^\infty \lambda_i \phi_i(x) \phi_i(y)$$

The RKHS norm then drops into a clean spectral form.

$$\|f\|_{\mathcal{H}_k}^2 = \sum_{i=1}^\infty \frac{\langle f, \phi_i \rangle_{L^2}^2}{\lambda_i}$$

The RKHS norm punishes high-frequency pieces where $\lambda_i \to 0$: the smaller the eigenvalue, the more painful it is for a function to carry energy in that mode. Sample paths of the GP blow up the RKHS norm almost surely, because a draw can be written in the Karhunen-Loeve form $f = \sum_i \xi_i \sqrt{\lambda_i}\, \phi_i$ with $\xi_i \sim \mathcal{N}(0, 1)$, so $\|f\|_{\mathcal{H}_k}^2 = \sum_i \frac{\xi_i^2 \lambda_i}{\lambda_i} = \sum_i \xi_i^2$ diverges. This is the Driscoll Zero-One Law, which says $P(f \in \mathcal{H}_k) = 0$. The sample paths actually sit in a slightly bigger space, usually a fractional Sobolev space whose smoothness is set by how fast the spectrum of the kernel drops off.
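
A quick numerical look at the spectrum (a sketch, discretizing the integral operator on a grid over $[0, 1]$ with an RBF kernel as the example): the eigenvalues of $\frac{1}{n} K$ approximate the Mercer eigenvalues $\lambda_i$, and their rapid decay is what makes the spectral form of the RKHS norm so restrictive.

```python
import jax.numpy as jnp

n = 500
x = jnp.linspace(0.0, 1.0, n)
K = jnp.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.2**2)

# Eigenvalues of (1/n) K approximate the Mercer eigenvalues of T_k on L^2([0, 1])
lam = jnp.sort(jnp.linalg.eigvalsh(K / n))[::-1]
print(lam[:8])   # near-geometric decay for the RBF kernel

# Heuristic behind the zero-one law: a sample path has N(0, 1) coefficients in the
# Karhunen-Loeve basis, so its squared RKHS norm behaves like sum_i xi_i^2, which
# grows without bound as more modes are included.
```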


3. The Geometry of the Posterior

You see noisy data $y_i = f(x_i) + \epsilon_i$ and you want the posterior $f_* \mid y$. On paper this boils down to Gaussian conditioning, and geometrically it is an orthogonal projection sitting inside the $L^2(\Omega)$ space of random variables.

Look at the subspace spanned by the observations, $V_n = \text{span}\{y_1, \dots, y_n\}$. The variables are jointly Gaussian, so the conditional expectation $\mathbb{E}[f(x_*) \mid y]$ drops out as the orthogonal projection of $f(x_*)$ onto $V_n$.

$$\hat{f}(x_*) = \text{Proj}_{y} f(x_*) = \text{Cov}(f(x_*), y) \, \text{Var}(y)^{-1} y = k_{*n} (K_{nn} + \sigma_n^2 I)^{-1} y$$

This lines up with the Representer Theorem: the posterior mean sits inside the subspace spanned by the kernel sections $\{k(\cdot, x_i)\}$ and nowhere else.

$$\mu_{\text{post}}(x) = \sum_{i=1}^n \alpha_i k(x_i, x), \quad \alpha = (K + \sigma^2 I)^{-1} y$$

The posterior variance is the squared norm of the residual: it picks out the piece of $f(x_*)$ that sits orthogonal to the observed subspace.

$$\mathbb{V}[f_* \mid y] = \| f(x_*) \|^2 - \| \text{Proj}_{y} f(x_*) \|^2 = k(x_*, x_*) - k_{*n} (K + \sigma^2 I)^{-1} k_{n*}$$

The geometry here is easy to see. Variance shrinks near data points because the angle between $f(x_*)$ and the observed variables $f(x_i)$ gets small, so the projection grabs almost all of the norm and leaves almost nothing behind.
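
The two formulas above, written out as a minimal sketch (the RBF kernel is just a stand-in for whatever kernel you use):

```python
import jax.numpy as jnp

def rbf(x1, x2, lengthscale=1.0):
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise=0.1):
    K = rbf(x_train, x_train) + noise * jnp.eye(x_train.shape[0])
    K_star = rbf(x_test, x_train)                          # k_{*n}
    K_ss = rbf(x_test, x_test)
    mean = K_star @ jnp.linalg.solve(K, y_train)           # projection onto the data
    cov = K_ss - K_star @ jnp.linalg.solve(K, K_star.T)    # residual (unexplained) part
    return mean, jnp.diag(cov)
```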

4. Spectral Kernels and Bochner’s Theorem

Theorem (Bochner). A continuous, shift-invariant function $k(\tau)$ on $\mathbb{R}^d$ is positive definite exactly when it is the Fourier transform of a finite non-negative measure $S(\omega)$.

$$k(\tau) = \int_{\mathbb{R}^d} e^{2\pi i \omega^T \tau} S(\omega) \, d\omega$$

Since $k$ is real the density $S(\omega)$ has to be symmetric, $S(\omega) = S(-\omega)$, and this $S(\omega)$ is what people call the spectral density.

Spectral Mixture Kernels (Wilson & Adams, 2013) come from a simple observation. Standard kernels match up with single-bump spectral densities centered at the origin and that is all they can do.

  • RBF: $S(\omega) \propto e^{-\|\omega\|^2}$ (Gaussian).
  • Matern: $S(\omega) \propto (1+\|\omega\|^2)^{-\nu-d/2}$ (Student-t / Cauchy-like). These densities fall off monotonically, so they cannot hold negative covariances and they cannot hold periodic structure.

The set of Gaussian mixtures is dense in the space of distribution functions under weak convergence and so you can approximate any spectral density this way.

$$S(\omega) \approx \sum_{q=1}^Q w_q \, \mathcal{N}(\omega; \mu_q, \Sigma_q)$$

Taking the inverse Fourier transform hands you the Spectral Mixture kernel.

$$k_{SM}(\tau) = \sum_{q=1}^Q w_q \exp(-2\pi^2 \tau^T \Sigma_q \tau) \cos(2\pi \mu_q^T \tau)$$

The parameters are easy to read off the expression.

  • $w_q$: Contribution of the $q$-th mode.
  • $\Sigma_q$: Inverse lengthscale (frequency variance).
  • $\mu_q$: Frequency of the periodicity.

By learning the weights and means and variances of the spectral density a GP ends up doing non-parametric Fourier analysis and it handles noise and missing data at the same time. Learning the spectral density swaps out manual kernel selection for kernel learning straight from the data and this is a big deal.

5. Scaling: Inducing Points (Sparse GPs)

The $O(N^3)$ cost of exact inference pushes you toward low-rank approximations of the kernel matrix. Nystrom approximations that just subsample the matrix are numerically flaky, and the modern fix is the Variational Sparse GP (Titsias, 2009), which treats the inducing points as variational parameters and tunes them.

The augmented model starts by dropping in $M \ll N$ inducing inputs $Z = \{z_1, \dots, z_M\}$ and inducing variables $u = f(Z)$, and the joint prior factors cleanly.

$$p(f, u) = p(f \mid u) \, p(u)$$

with $p(u) = \mathcal{N}(0, K_{uu})$ and $p(f \mid u) = \mathcal{N}(K_{fu} K_{uu}^{-1} u, \, K_{ff} - Q_{ff})$ where $Q_{ff} = K_{fu} K_{uu}^{-1} K_{uf}$.

For variational inference you approximate the true posterior $p(f, u \mid y)$ with a variational distribution $q(f, u)$, and the key move is to set $q(f, u) = p(f \mid u) \, q(u)$. You hold the exact conditional $p(f \mid u)$ fixed and only approximate the inducing distribution $q(u) = \mathcal{N}(m, S)$, and then you minimize the KL divergence $\text{KL}(q \,\|\, p)$, which is the same thing as maximizing the Evidence Lower Bound.

$$\mathcal{L} = \mathbb{E}_{q(f, u)} \left[ \log \frac{p(y \mid f) \, p(f \mid u) \, p(u)}{p(f \mid u) \, q(u)} \right]$$

The $p(f \mid u)$ terms cancel top and bottom and you are left with this.

$$\mathcal{L} = \mathbb{E}_{q(f)} [\log p(y \mid f)] - \text{KL}(q(u) \,\|\, p(u))$$

The bound splits into two pieces and both are clean.

  1. Expected log likelihood: $\sum_{i=1}^N \mathbb{E}_{q(f_i)} [\log \mathcal{N}(y_i \mid f_i, \sigma^2)]$. Since $q(f_i)$ is Gaussian, being a marginal of $q(f)$, this term is analytically tractable, and it only needs the marginal means and variances of $q(f_i)$, which touch the diagonal entries $k(x_i, x_i)$ and nothing else.
  2. KL regularizer: $\text{KL}(\mathcal{N}(m, S) \,\|\, \mathcal{N}(0, K_{uu}))$. This pulls the inducing distribution back toward the prior whenever it drifts away.

Computing the ELBO forces you to invert $K_{uu}$, which costs $O(M^3)$, and multiply by $K_{uf}$, which costs $O(NM^2)$, so the total comes out to $O(NM^2)$. This stretches GPs out to millions of points once you mini-batch the expected log-likelihood sum.
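
A minimal sketch of the uncollapsed bound above, assuming a Gaussian likelihood and an RBF kernel with unit variance (the helper names and hyperparameters are illustrative, not a reference implementation):

```python
import jax.numpy as jnp
from jax.scipy.linalg import cho_solve

def rbf(x1, x2, lengthscale=1.0):
    d2 = jnp.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * d2 / lengthscale**2)

def svgp_elbo(X, y, Z, m, S, noise, jitter=1e-6):
    """E_q[log p(y|f)] - KL(q(u) || p(u)) for q(u) = N(m, S)."""
    M = Z.shape[0]
    Kuu = rbf(Z, Z) + jitter * jnp.eye(M)
    Kuf = rbf(Z, X)                                  # (M, N)
    kff_diag = jnp.ones(X.shape[0])                  # diag of Kff for a unit-variance RBF
    Luu = jnp.linalg.cholesky(Kuu)
    A = cho_solve((Luu, True), Kuf)                  # Kuu^{-1} Kuf

    # Marginals q(f_i): only means and diagonal variances are needed
    mu_f = A.T @ m
    var_f = kff_diag - jnp.sum(Kuf * A, axis=0) + jnp.sum(A * (S @ A), axis=0)

    # Expected Gaussian log likelihood, term by term
    ell = jnp.sum(-0.5 * jnp.log(2 * jnp.pi * noise)
                  - 0.5 * ((y - mu_f) ** 2 + var_f) / noise)

    # KL( N(m, S) || N(0, Kuu) )
    logdet_Kuu = 2.0 * jnp.sum(jnp.log(jnp.diag(Luu)))
    kl = 0.5 * (jnp.trace(cho_solve((Luu, True), S)) + m @ cho_solve((Luu, True), m)
                - M + logdet_Kuu - jnp.linalg.slogdet(S)[1])
    return ell - kl
```

Maximizing this bound with respect to $m$, $S$, $Z$, and the kernel hyperparameters (for example with `jax.grad` and any optimizer) gives the sparse posterior, and the expected log-likelihood sum can be mini-batched as noted above.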

6. Deep Kernel Learning and Implementation

Standard GPs run with fixed kernel families. Deep Kernel Learning glues a base kernel onto a learned feature map so $k(x, y) = k_{\text{base}}(\phi_\theta(x), \phi_\theta(y))$, where $\phi_\theta$ is a neural network and the parameters $\theta$ get tuned by pushing up the marginal likelihood.

$$\log p(y) = -\frac{1}{2} y^T K_\theta^{-1} y - \frac{1}{2} \log |K_\theta| - \frac{n}{2} \log 2\pi$$

Computing the gradient forces you to differentiate through the Cholesky decomposition and use $\partial K^{-1} = -K^{-1} (\partial K) K^{-1}$, and modern frameworks like JAX and GPyTorch handle this for you.
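
A minimal sketch of a deep kernel (the two-layer feature map and its initialization are illustrative assumptions): a thin wrapper in the style of `wrapped_kernel` below would adapt the signature so it can be plugged into the marginal-likelihood loss and trained end to end with `jax.grad`.

```python
import jax
import jax.numpy as jnp

def init_features(key, sizes=(1, 32, 32, 2)):
    # Illustrative MLP parameters for phi_theta: list of (W, b) pairs
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in),
                       jnp.zeros(d_out)))
    return params

def phi(params, x):
    h = x
    for W, b in params[:-1]:
        h = jnp.tanh(h @ W + b)
    W, b = params[-1]
    return h @ W + b

def deep_rbf_kernel(params, x1, x2, lengthscale=1.0):
    # k(x, y) = k_base(phi(x), phi(y)) with an RBF base kernel
    z1, z2 = phi(params, x1), phi(params, x2)
    d2 = jnp.sum((z1[:, None, :] - z2[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * d2 / lengthscale**2)
```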

```python
import jax
import jax.numpy as jnp
import jax.scipy.linalg
from jax import random, jit, grad
from jax.example_libraries import optimizers

# -------------------------------------------------------------
# 1. Spectral Mixture Kernel Implementation
# -------------------------------------------------------------
def spectral_mixture_kernel(x1, x2, weights, means, variances):
    """
    Computes the SM kernel:
        k(tau) = sum_q w_q * exp(-2 pi^2 tau^T Sigma_q tau) * cos(2 pi mu_q^T tau)
    x1, x2:    (N, D)
    weights:   (Q,)
    means:     (Q, D)
    variances: (Q, D)
    """
    # Tau: pairwise distance vectors
    tau = x1[:, None, :] - x2[None, :, :]   # (N, M, D)
    tau_sq = tau ** 2

    # One mixture component; vectorized over Q with vmap below
    def component_func(w, m, v):
        # Envelope: exp(-2 pi^2 * sum_d tau_d^2 * v_d)
        exp_term = jnp.exp(-2.0 * jnp.pi**2 * jnp.sum(tau_sq * v, axis=-1))
        # Oscillation: cos(2 pi * sum_d mu_d * tau_d)
        cos_term = jnp.cos(2.0 * jnp.pi * jnp.sum(tau * m, axis=-1))
        return w * exp_term * cos_term

    components = jax.vmap(component_func)(weights, means, variances)
    return jnp.sum(components, axis=0)

# -------------------------------------------------------------
# 2. Exact GP Implementation
# -------------------------------------------------------------
class ExactGP:
    def __init__(self, kernel_fn):
        self.kernel_fn = kernel_fn

    def loss_fn(self, params, X, y):
        # Negative log marginal likelihood via Cholesky
        noise = jnp.exp(params['log_noise'])
        K = self.kernel_fn(X, X, params) + noise * jnp.eye(X.shape[0])
        L = jnp.linalg.cholesky(K)
        alpha = jax.scipy.linalg.cho_solve((L, True), y)
        term1 = -0.5 * jnp.dot(y.flatten(), alpha.flatten())
        term2 = -jnp.sum(jnp.log(jnp.diag(L)))          # equals -0.5 log|K|
        term3 = -0.5 * len(y) * jnp.log(2 * jnp.pi)
        return -(term1 + term2 + term3)                  # return negative log likelihood

    def predict(self, params, X_train, y_train, X_test):
        noise = jnp.exp(params['log_noise'])
        K_xx = self.kernel_fn(X_train, X_train, params) + noise * jnp.eye(X_train.shape[0])
        K_xs = self.kernel_fn(X_train, X_test, params)
        K_ss = self.kernel_fn(X_test, X_test, params)
        L = jnp.linalg.cholesky(K_xx)
        alpha = jax.scipy.linalg.cho_solve((L, True), y_train)
        mu = K_xs.T @ alpha
        v = jax.scipy.linalg.solve_triangular(L, K_xs, lower=True)
        var = K_ss - v.T @ v
        return mu, jnp.diag(var)

# -------------------------------------------------------------
# 3. Training Loop
# -------------------------------------------------------------
def train_sm_gp(X, y, num_components=4, steps=1000):
    key_means, key_vars = random.split(random.PRNGKey(42))
    D = X.shape[1]

    # Initialize parameters randomly (crucial for the SM kernel):
    # mixture weights, frequencies, lengthscales
    params = {
        'log_weights': jnp.log(jnp.ones(num_components) / num_components),
        'log_means': random.normal(key_means, (num_components, D)),
        'log_vars': random.normal(key_vars, (num_components, D)) - 2.0,  # start with small bandwidth
        'log_noise': jnp.log(0.1),
    }

    # Wrapper to unpack the parameter dict into kernel arguments
    def wrapped_kernel(x1, x2, p):
        return spectral_mixture_kernel(
            x1, x2,
            jnp.exp(p['log_weights']),
            jnp.exp(p['log_means']),
            jnp.exp(p['log_vars']),
        )

    gp = ExactGP(wrapped_kernel)
    optimizer = optimizers.adam(1e-2)

    @jit
    def update(i, opt_state):
        p = optimizer.params_fn(opt_state)
        grads = grad(gp.loss_fn)(p, X, y)
        return optimizer.update_fn(i, grads, opt_state)

    opt_state = optimizer.init_fn(params)
    for i in range(steps):
        opt_state = update(i, opt_state)

    final_params = optimizer.params_fn(opt_state)
    return gp, final_params

# The SM kernel allows the GP to extrapolate patterns (like cardiac rhythm or airline
# passengers) far beyond the training window, unlike the RBF kernel, which reverts to
# the mean.
```
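
A minimal usage sketch, assuming a 1-D regression task (the synthetic data and settings here are only illustrative); `train_sm_gp` expects `X` with shape `(N, 1)`:

```python
import jax.numpy as jnp
from jax import random

# Synthetic signal with two frequencies plus noise
key = random.PRNGKey(0)
X = jnp.linspace(0.0, 10.0, 200).reshape(-1, 1)
y = jnp.sin(2 * jnp.pi * 0.5 * X[:, 0]) + 0.5 * jnp.sin(2 * jnp.pi * 1.3 * X[:, 0])
y = y + 0.1 * random.normal(key, y.shape)

gp, params = train_sm_gp(X, y, num_components=4, steps=1000)

# Extrapolate beyond the training window
X_test = jnp.linspace(10.0, 15.0, 100).reshape(-1, 1)
mu, var = gp.predict(params, X, y, X_test)
```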

7. Approximate Inference for Classification

Everything above assumed Gaussian observation noise $y = f(x) + \epsilon$. For binary classification with $y \in \{-1, 1\}$ the likelihood stops being Gaussian.

$$p(y_i \mid f_i) = \sigma(y_i f_i)$$

The posterior $p(f \mid y) \propto p(y \mid f) \, p(f)$ stops being Gaussian, the marginal likelihood $\int p(y \mid f) \, p(f) \, df$ turns intractable, and you have to fall back on approximate inference.

7.1 The Laplace Approximation. You approximate the posterior with a Gaussian $q(f) = \mathcal{N}(\hat{f}, A^{-1})$ centered at the MAP estimate, and that is the whole idea.

  1. Find the mode $\hat{f} = \text{argmax}_f \, (\log p(y \mid f) + \log p(f))$.
  2. Compute the Hessian of the negative log posterior at the mode: $A = -\nabla \nabla \log p(f \mid y) |_{\hat{f}} = K^{-1} + W$, where $W$ is the diagonal matrix of negative second derivatives of the log-likelihood; for the logistic likelihood you get $W_{ii} = \pi_i(1-\pi_i)$ with $\pi_i = \sigma(\hat{f}_i)$ and nothing messier than that.
  3. Use $q(f)$ for predictions: $\mathbb{E}[f_*] = k_*^T K^{-1} \hat{f}$. The predictive variance soaks up the curvature through $W$, and that is where the uncertainty lives (a sketch of the mode-finding step follows this list).
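
A minimal sketch of step 1, the Newton iteration for the Laplace mode under the logistic likelihood (labels in $\{-1, +1\}$; the update uses $(K^{-1} + W)^{-1} = (I + KW)^{-1} K$ to avoid inverting $K$ directly):

```python
import jax
import jax.numpy as jnp

def laplace_mode(K, y, num_iters=20):
    """Newton iterations for the MAP latent function of a GP classifier."""
    N = y.shape[0]
    f = jnp.zeros(N)
    for _ in range(num_iters):
        pi = jax.nn.sigmoid(f)            # pi_i = sigma(f_i)
        grad_ll = (y + 1) / 2 - pi        # d/df_i log p(y_i | f_i)
        W = pi * (1 - pi)                 # negative second derivatives (diagonal of W)
        b = W * f + grad_ll
        # f <- (K^{-1} + W)^{-1} (W f + grad) = (I + K W)^{-1} K (W f + grad)
        f = jnp.linalg.solve(jnp.eye(N) + K * W[None, :], K @ b)
    return f, W
```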

7.2 Expectation Propagation. The Laplace approximation is local and only grabs the curvature at the mode, while EP gives you a global approximation by matching moments. Each likelihood factor $p(y_i \mid f_i)$ gets replaced by an unnormalized Gaussian site function $t_i(f_i)$.

$$p(y \mid f) \, p(f) \approx \prod_i t_i(f_i) \, p(f)$$

You iterate on the site parameters by moment matching: remove one site to form the cavity distribution, then match the first and second moments of the tilted distribution (cavity times exact likelihood factor) with the approximation. EP usually gives better calibrated predictive probabilities than the Laplace approximation for GP classification, and you pay for it with extra iterations and more bookkeeping.


8. The Bayesian Ideal for Uncertainty

Gaussian Processes give you a principled Bayesian framework for regression and function estimation, and they handle uncertainty, smoothness, and finite-sample corrections in a mathematically exact way. The $O(N^3)$ cost used to be a blocker, and sparse variational GPs, spectral kernels, and deep kernel learning have stretched them out to datasets with millions of observations.

The deeper point is the link to neural networks. Training a neural network in the infinite-width limit is the same thing as kernel regression in an architecture-dependent RKHS, so understanding GPs is a prerequisite for understanding the inductive biases and the generalization behavior of modern deep learning.


Timeline

| Year | Event | Significance |
|------|-------|--------------|
| 1900 | Karl Pearson | Invents PCA (linear latent variable model). |
| 1940s | Kolmogorov / Wiener | Develop optimal linear filtering (Kriging). |
| 1994 | Radford Neal | Proves the limit of infinite neural networks is a GP. |
| 2006 | Rasmussen & Williams | Publish “Gaussian Processes for Machine Learning”. |
| 2009 | Michalis Titsias | Sparse variational GPs (inducing points). |
| 2013 | Wilson & Adams | Spectral Mixture kernels. |
| 2016 | Wilson et al. | Deep Kernel Learning. |
| 2018 | Jacot et al. | Neural Tangent Kernel (NTK). |

Deriving the Matern Spectral Density

The Matern kernel pops out as the covariance of the stationary solution of a stochastic partial differential equation. Take the SPDE driven by white noise $W(x)$ and watch what happens.

$$\left( \frac{2\nu}{\ell^2} - \Delta \right)^{\frac{\nu}{2} + \frac{d}{4}} f(x) = W(x)$$

This is a fractional Helmholtz equation and taking the Fourier transform of both sides untangles it right away.

$$\left( \frac{2\nu}{\ell^2} + \|\omega\|^2 \right)^{\frac{\nu}{2} + \frac{d}{4}} \hat{f}(\omega) = \hat{W}(\omega)$$

Since $W$ is white noise its spectral density is constant, $S_W(\omega) = \text{const}$, and the power spectral density $S_f(\omega)$ of the solution drops out as the squared modulus of the transfer function.

$$S_f(\omega) \propto \left| \left( \frac{2\nu}{\ell^2} + \|\omega\|^2 \right)^{-(\frac{\nu}{2} + \frac{d}{4})} \right|^2 = \left( \frac{2\nu}{\ell^2} + \|\omega\|^2 \right)^{-\nu - \frac{d}{2}}$$

This is exactly the spectral density of the Matern class, and the spatial kernel $k(r)$ comes back by running the inverse Fourier transform, with $\lambda^2 = 2\nu/\ell^2$ and $\alpha = \nu + d/2$:

$$k(r) = \int_{\mathbb{R}^d} e^{i \omega \cdot r} \frac{1}{(\lambda^2 + \|\omega\|^2)^\alpha} \, d\omega$$

Using Schwinger parametrization and Bessel function identities this integral collapses down to a clean form.

$$k_\nu(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{\ell} \right)^\nu K_\nu\!\left( \frac{\sqrt{2\nu}\, r}{\ell} \right)$$

The physics reads simply. At $\nu = 1/2$ the SPDE is first-order and you get the Ornstein-Uhlenbeck process, which is Brownian motion with friction. As $\nu \to \infty$ the operator slides toward the heat operator and the kernel drifts toward the Gaussian RBF, which is the diffusion kernel.
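
A quick numerical check of the closed form (plain NumPy/SciPy for the Bessel function; the grid of distances is arbitrary): at $\nu = 1/2$ the formula should collapse to the exponential OU kernel $e^{-r/\ell}$.

```python
import numpy as np
from scipy.special import kv, gamma

def matern(r, nu, lengthscale=1.0):
    # k_nu(r) = 2^{1-nu}/Gamma(nu) * (sqrt(2 nu) r / l)^nu * K_nu(sqrt(2 nu) r / l)
    r = np.asarray(r, dtype=float)
    z = np.sqrt(2.0 * nu) * r / lengthscale
    k = (2.0 ** (1.0 - nu) / gamma(nu)) * (z ** nu) * kv(nu, z)
    return np.where(r == 0.0, 1.0, k)   # the limit at r = 0 is 1

r = np.linspace(1e-6, 3.0, 5)
print(matern(r, nu=0.5))   # should match ...
print(np.exp(-r))          # ... the OU kernel exp(-r / l) with l = 1
```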


The Neural Tangent Kernel (NTK)

Neal’s Theorem (1994). A single-layer Bayesian neural network at infinite width turns into a Gaussian Process. Let $f(x) = \frac{1}{\sqrt{H}} \sum_{j=1}^H v_j \sigma(w_j^T x + b_j)$; by the Central Limit Theorem, at fixed $x$ the output $f(x)$ is a normalized sum of i.i.d. terms and drops to Gaussian in the limit.

The Modern Extension (Jacot et al., 2018). Look at the training dynamics of an infinite-width network under gradient descent. The network moves as a linear model against a fixed kernel $\Theta(x, x')$, which is the Neural Tangent Kernel.

$$\Theta(x, x') = \langle \nabla_\theta f(x), \nabla_\theta f(x') \rangle$$

In the infinite-width limit $\Theta$ holds constant all the way through training, and the training dynamics drop out exactly as the kernel regression solution.

$$f_t(x) = \Theta(x, X) \, \Theta(X, X)^{-1} \left( I - e^{-\eta \Theta(X, X) t} \right) y$$

So training an infinite-width network is really kernel regression with an architecture-dependent kernel and you can work these kernels out analytically for specific activations and architectures like ReLU networks and CNNs. For a ReLU network the arc-cosine kernel falls out.

$$k_{\text{ReLU}}(x, x') \propto \|x\| \|x'\| \left( \sin \phi + (\pi - \phi) \cos \phi \right)$$

with $\phi$ the angle between inputs. The kernel nails down the inductive bias of the architecture by picking out the RKHS norm that the network implicitly regularizes.
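
For finite networks you can compute the empirical tangent kernel straight from the definition $\Theta(x, x') = \langle \nabla_\theta f(x), \nabla_\theta f(x') \rangle$. A minimal sketch with JAX autodiff (the MLP, its width, and its initialization are illustrative assumptions):

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_mlp(key, sizes):
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in),
                       jnp.zeros(d_out)))
    return params

def mlp(params, x):
    h = x
    for W, b in params[:-1]:
        h = jax.nn.relu(h @ W + b)
    W, b = params[-1]
    return (h @ W + b)[0]        # scalar output

def empirical_ntk(params, x1, x2):
    # Theta(x, x') = <grad_theta f(x), grad_theta f(x')>
    g1, _ = ravel_pytree(jax.grad(mlp)(params, x1))
    g2, _ = ravel_pytree(jax.grad(mlp)(params, x2))
    return jnp.dot(g1, g2)

params = init_mlp(jax.random.PRNGKey(0), [3, 256, 256, 1])
print(empirical_ntk(params, jnp.ones(3), jnp.arange(3.0)))
```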

To derive the arc-cosine kernel, look at a single neuron $h(x) = \sigma(w^T x)$ with $w \sim \mathcal{N}(0, I)$. The kernel is $k(x, y) = \mathbb{E}_w [\sigma(w^T x) \sigma(w^T y)]$, and you take $\sigma(z) = \max(0, z)$ for ReLU. By rotational invariance you can spin $x$ onto the first coordinate axis and drop $y$ into the plane spanned by the first two axes, so $x = \|x\| e_1$ and $y = \|y\| (\cos \phi \, e_1 + \sin \phi \, e_2)$. Writing $w = (w_1, w_2, \dots)$, the product of activations turns into $\max(0, \|x\| w_1) \max(0, \|y\| (w_1 \cos \phi + w_2 \sin \phi))$, and pulling out $\|x\| \|y\|$ leaves the two-dimensional Gaussian expectation that comes next.

$$J(\phi) = \frac{1}{2\pi} \int\!\!\int \max(0, u) \max(0, u \cos \phi + v \sin \phi) \, e^{-(u^2+v^2)/2} \, du \, dv$$

The integration domain is the sector where both arguments come out positive, and passing to polar coordinates $(u, v) = r(\cos \theta, \sin \theta)$ splits it wide open.

$$J(\phi) = \frac{1}{2\pi} \int_0^\infty r^3 e^{-r^2/2} \, dr \int_{\phi - \pi/2}^{\pi/2} \cos \theta \cos(\theta - \phi) \, d\theta$$

The radial integral gives $\int_0^\infty r^3 e^{-r^2/2} \, dr = 2$ and the angular integral gives $\frac{1}{2} (\sin \phi + (\pi - \phi) \cos \phi)$, so $J(\phi) = \frac{1}{2\pi} (\sin \phi + (\pi - \phi) \cos \phi)$. This result (Cho & Saul, 2009) says ReLU networks measure similarity by the angle between data points and not by Euclidean distance the way the RBF kernel does. This angular dependence is part of why they do so well on high-dimensional data like images, where Euclidean distances lose information because of concentration of measure.
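
A Monte Carlo sanity check of the closed form (the test vectors and sample count are arbitrary): estimate $\mathbb{E}_w[\max(0, w^T x) \max(0, w^T y)]$ by sampling and compare it with $\frac{1}{2\pi} \|x\| \|y\| (\sin \phi + (\pi - \phi) \cos \phi)$.

```python
import jax.numpy as jnp
from jax import random

def arc_cosine_closed_form(x, y):
    cos_phi = jnp.dot(x, y) / (jnp.linalg.norm(x) * jnp.linalg.norm(y))
    phi = jnp.arccos(jnp.clip(cos_phi, -1.0, 1.0))
    return (jnp.linalg.norm(x) * jnp.linalg.norm(y) / (2 * jnp.pi)
            * (jnp.sin(phi) + (jnp.pi - phi) * jnp.cos(phi)))

def arc_cosine_monte_carlo(key, x, y, num_samples=200_000):
    W = random.normal(key, (num_samples, x.shape[0]))   # w ~ N(0, I)
    return jnp.mean(jnp.maximum(W @ x, 0.0) * jnp.maximum(W @ y, 0.0))

x = jnp.array([1.0, 0.5, -0.3])
y = jnp.array([0.2, 1.0, 0.4])
print(arc_cosine_closed_form(x, y), arc_cosine_monte_carlo(random.PRNGKey(1), x, y))
```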


Gaussian Process Latent Variable Models (GPLVM)

Standard PCA assumes data $Y$ comes from a linear mapping of latent variables $X$, $Y = WX + \epsilon$. Dual Probabilistic PCA (Lawrence, 2004) integrates $W$ out and optimizes $X$ instead, and this gives you a Gaussian Process mapping from latent space to data space with $Y \sim \mathcal{N}(0, K(X, X) + \sigma^2 I)$. The GPLVM tunes the latent locations $X$ to push up the likelihood of the data $Y$.

The objective looks like this and it is short.

$$\log p(Y \mid X) = -\frac{D}{2} \log |K| - \frac{1}{2} \text{Tr}(K^{-1} Y Y^T) - \text{const}$$

Since $X$ enters through the non-linear kernel $K$, this optimization problem is not convex. The kernel you pick sets the class of manifold structures you can recover: a linear kernel recovers PCA exactly, and an RBF kernel gives you a non-linear embedding that holds onto local distances and behaves a lot like t-SNE but sitting inside a probabilistic framework.
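
The MAP objective in code, as a minimal sketch (RBF kernel on the latent locations; the noise level and lengthscale are placeholder values):

```python
import jax
import jax.numpy as jnp

def rbf(X, lengthscale=1.0, jitter=1e-6):
    d2 = jnp.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * d2 / lengthscale**2) + jitter * jnp.eye(X.shape[0])

def gplvm_neg_log_lik(X_latent, Y, noise=0.1):
    # -log p(Y | X) = D/2 log|K| + 1/2 Tr(K^{-1} Y Y^T) + const
    N, D = Y.shape
    K = rbf(X_latent) + noise * jnp.eye(N)
    L = jnp.linalg.cholesky(K)
    logdet = 2.0 * jnp.sum(jnp.log(jnp.diag(L)))
    Kinv_Y = jax.scipy.linalg.cho_solve((L, True), Y)
    trace_term = jnp.sum(Y * Kinv_Y)                 # Tr(K^{-1} Y Y^T)
    return 0.5 * D * logdet + 0.5 * trace_term + 0.5 * N * D * jnp.log(2 * jnp.pi)

# Gradient descent on the latent locations (kernel hyperparameters can be
# optimized jointly the same way):
#   X_latent -= lr * jax.grad(gplvm_neg_log_lik)(X_latent, Y)
```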

Bayesian GPLVM (Titsias & Lawrence, 2010) goes further. The standard GPLVM treats $X$ as point estimates at the MAP, and the Bayesian version drops a prior $X \sim \mathcal{N}(0, I)$ on it and approximates the posterior $q(X)$ variationally. This forces you to build a variational lower bound that accounts for the kernel matrix $K_{ff}$ being a random variable, because it depends on the stochastic inputs $X$. Titsias worked out closed-form expressions for the required $\Psi$-statistics.

$$\Psi_0 = \text{Tr}(\mathbb{E}_q[K_{ff}]), \qquad \Psi_1 = \mathbb{E}_q[K_{fu}], \qquad \Psi_2 = \mathbb{E}_q[K_{uf} K_{fu}]$$

For the RBF kernel these expectations are analytically tractable and they collapse to Gaussian convolutions. This framework opens the door to Automatic Relevance Determination on the latent dimensions and it picks out the intrinsic dimensionality of the data on its own.


Terminology

| Term | Meaning |
|------|---------|
| Covariance kernel | A positive semi-definite function $k(x, x')$ that specifies the second-order structure of the GP |
| Gram matrix | The $n \times n$ matrix $K_{ij} = k(x_i, x_j)$ evaluated at training points |
| Kolmogorov extension | The theorem guaranteeing a consistent measure on $\mathbb{R}^X$ from finite-dimensional marginals |
| Marginal likelihood | $p(y \mid X) = \int p(y \mid f) \, p(f) \, df$, the objective for hyperparameter optimization |
| Mercer expansion | The eigendecomposition $k(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x')$ |
| Posterior mean | $\mu_*(x) = k_*^T (K + \sigma^2 I)^{-1} y$, the optimal prediction under squared loss |
| Spectral density | The Fourier transform $S(\omega)$ of a stationary kernel, which determines frequency content |
| Inducing points | A set of $M \ll N$ pseudo-inputs used to construct a low-rank approximation to the full GP |
| Driscoll zero-one law | GP sample paths have infinite RKHS norm with probability one |
| NTK | The Neural Tangent Kernel, which governs training dynamics of wide neural networks |

References

1. Rasmussen, C. E., & Williams, C. K. I. (2006). “Gaussian Processes for Machine Learning”. The standard text and it covers regression and classification and covariance functions and theoretical properties.

2. Wilson, A. G., & Adams, R. P. (2013). “Gaussian Process Kernels for Pattern Discovery and Extrapolation”. Introduced Spectral Mixture Kernels and showed that standard kernels are just local averages while spectral kernels can hold long-range structure and negative covariance.

3. Titsias, M. (2009). “Variational Learning of Inducing Variables in Sparse Gaussian Processes”. The breakthrough that made GPs scalable. Instead of Nystrom which ignores data Titsias treated inducing points as variational parameters to be tuned.

4. Jacot, A. et al. (2018). “Neural Tangent Kernel: Convergence and Generalization in Neural Networks”. The landmark paper linking infinite width Neural Networks to Kernel methods and giving exact solvability for Deep Learning dynamics.

5. Lawrence, N. (2004). “Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data”. Introduced the GPLVM framework and flipped the GP regression problem to tune inputs rather than weights.

6. MacKay, D. J. C. (1998). “Introduction to Gaussian Processes”. A classic tutorial by the late Sir David MacKay and it connects GPs to Neural Networks early on and gives a Bayesian perspective on regularization.

7. Quinonero-Candela, J., & Rasmussen, C. E. (2005). “A Unifying View of Sparse Approximate Gaussian Process Regression”. The paper that organized the confusing zoo of sparse approximations like FITC and DTC and PITC into a single effective theory.