Gaussian Processes & RKHS

1. The Infinite Dimensional Gaussian

The Object: A Gaussian Process (GP) is a collection of random variables indexed by a set $X$, such that every finite linear combination is normally distributed. This sounds simple, but it requires the Kolmogorov Extension Theorem to be well-defined. We specify the process by:

  1. Mean Function: $m(x) = \mathbb{E}[f(x)]$.
  2. Covariance Kernel: $k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]$.

For consistency (to define a measure on $\mathbb{R}^X$), the kernel $k$ must be Positive Semi-Definite (PSD). For any set of points $\{x_i\}$, the Gram matrix $K_{ij} = k(x_i, x_j)$ must satisfy $v^T K v \ge 0$ for all $v \in \mathbb{R}^n$.
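As a quick numerical illustration of the PSD requirement, the sketch below builds a Gram matrix for a 1-D RBF kernel and checks that its eigenvalues are non-negative; the helper name `rbf_gram` and the test points are assumptions of this sketch, not from the text.

```python
import jax.numpy as jnp

def rbf_gram(x, ls=1.0):
    # Gram matrix K_ij = k(x_i, x_j) for a 1-D RBF kernel (illustrative choice)
    return jnp.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ls**2)

x = jnp.array([-1.3, 0.0, 0.4, 2.1, 3.7])
K = rbf_gram(x)
eigvals = jnp.linalg.eigvalsh(K)
print(eigvals)                          # all >= 0 up to round-off: K is PSD
print(bool(jnp.all(eigvals > -1e-9)))   # hence v^T K v >= 0 for every v
```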

2. The Reproducing Kernel Hilbert Space (RKHS)

Associated with every PSD kernel $k$ is a unique Hilbert space $\mathcal{H}_k$ of functions $f: X \to \mathbb{R}$. This space is “small”—it contains the “skeleton” of the random process, but, as we will see, typically not the sample paths themselves.

Construction:

  1. Define the pre-Hilbert space $\mathcal{H}_0 = \operatorname{span}\{k(\cdot, x) \mid x \in X\}$.
  2. Define the inner product on basis elements: $\langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}_k} = k(x, y)$.
  3. Extend by linearity and complete the space to get $\mathcal{H}_k$.

The Reproducing Property: For any $f \in \mathcal{H}_k$:

$$\langle f, k(\cdot, x) \rangle_{\mathcal{H}_k} = f(x)$$

This implies that the evaluation functional $\delta_x: f \mapsto f(x)$ is a bounded (hence continuous) linear functional:

$$|f(x)| = |\langle f, k_x \rangle| \le \|f\|_{\mathcal{H}_k} \|k_x\|_{\mathcal{H}_k} = \|f\|_{\mathcal{H}_k} \sqrt{k(x, x)}$$
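A minimal numerical check of this bound, for a function built explicitly from kernel sections (so $\|f\|_{\mathcal{H}_k}^2 = \alpha^T K \alpha$); the RBF kernel, coefficients, and points below are illustrative assumptions.

```python
import jax.numpy as jnp

def rbf(x1, x2, ls=1.0):
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ls**2)

# f = sum_i alpha_i k(., x_i) lies in H_k by construction
xs = jnp.array([-1.0, 0.0, 2.0])
alpha = jnp.array([0.5, -1.0, 2.0])
K = rbf(xs, xs)
rkhs_norm = jnp.sqrt(alpha @ K @ alpha)            # ||f||_{H_k}

x_star = jnp.array([0.7])
f_x = (alpha @ rbf(xs, x_star))[0]                 # reproducing property: <f, k_x> = f(x)
bound = rkhs_norm * jnp.sqrt(rbf(x_star, x_star)[0, 0])
print(float(abs(f_x)), float(bound))               # |f(x)| <= ||f|| sqrt(k(x, x))
```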

Mercer’s Theorem: Let $X$ be compact and $k$ be continuous. The integral operator $T_k: L^2(X) \to L^2(X)$,

$$(T_k f)(x) = \int_X k(x, y)\, f(y)\, d\mu(y)$$

is compact, positive, and self-adjoint. There exists an orthonormal basis of eigenfunctions $\{\phi_i\}$ and eigenvalues $\lambda_i \ge 0$ such that:

$$k(x, y) = \sum_{i=1}^\infty \lambda_i\, \phi_i(x)\, \phi_i(y)$$

The RKHS norm can be characterized spectrally:

$$\|f\|_{\mathcal{H}_k}^2 = \sum_{i=1}^\infty \frac{\langle f, \phi_i \rangle_{L^2}^2}{\lambda_i}$$

This explicitly penalizes high-frequency components (where $\lambda_i \to 0$). The smaller $\lambda_i$, the “harder” it is for a function to have energy in that mode. Sample paths of the GP typically have infinite RKHS norm: writing a draw via the Karhunen-Loève expansion $f = \sum_i \sqrt{\lambda_i}\, \xi_i\, \phi_i$ with $\xi_i \sim \mathcal{N}(0, 1)$ i.i.d., the spectral norm formula gives $\sum_i \frac{\lambda_i \xi_i^2}{\lambda_i} = \sum_i \xi_i^2$, which diverges almost surely. This is the Driscoll zero-one law: $P(f \in \mathcal{H}_k) = 0$. The sample paths live in a slightly larger space (usually a fractional Sobolev space).
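The divergence argument can be seen numerically with a Nyström (grid) approximation of the Mercer eigenpairs; the grid, kernel, and variable names below are assumptions of this sketch.

```python
import jax.numpy as jnp
from jax import random

# Nystrom-style approximation of Mercer eigenpairs of an RBF kernel on a grid
x = jnp.linspace(-3.0, 3.0, 200)
K = jnp.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
lam, phi = jnp.linalg.eigh(K)                      # eigenvalues in ascending order
lam, phi = lam[::-1], phi[:, ::-1]                 # sort descending

# Karhunen-Loeve draw: f = sum_i sqrt(lam_i) xi_i phi_i, xi_i ~ N(0, 1)
xi = random.normal(random.PRNGKey(0), (x.shape[0],))
f_sample = phi @ (jnp.sqrt(jnp.clip(lam, 0.0)) * xi)

# Truncated RKHS norm of the draw is sum_i xi_i^2 -- it keeps growing with every mode
partial_norm = jnp.cumsum(xi ** 2)
print(float(partial_norm[9]), float(partial_norm[49]), float(partial_norm[199]))
```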


3. The Geometry of the Posterior

We observe noisy data $y_i = f(x_i) + \epsilon_i$. We wish to compute the posterior $f_* \mid y$. Algebraically, this is just Gaussian conditioning. Geometrically, it is an Orthogonal Projection.

Consider the subspace spanned by the observations, $V_n = \operatorname{span}\{f(x_1), \dots, f(x_n)\}$. Since the variables are jointly Gaussian, the conditional expectation $\mathbb{E}[f(x_*) \mid y]$ is simply the orthogonal projection of the random variable $f(x_*)$ onto the space spanned by the observations $y$.

$$\hat{f}(x_*) = \operatorname{Proj}_{y} f(x_*) = \operatorname{Cov}(f(x_*), y)\, \operatorname{Var}(y)^{-1} y = k_{*n} (K_{nn} + \sigma_n^2 I)^{-1} y$$

This coincides with the Representer Theorem. The mean function lies exactly in the subspace spanned by the kernel functions $\{k(\cdot, x_i)\}$:

$$\mu_{\text{post}}(x) = \sum_{i=1}^n \alpha_i\, k(x_i, x), \qquad \alpha = (K + \sigma^2 I)^{-1} y$$

The posterior variance is the squared norm of the “residual” vector (the part of $f(x_*)$ orthogonal to the observations):

$$\mathbb{V}[f_* \mid y] = \|f(x_*)\|^2 - \|\operatorname{Proj}_{y} f(x_*)\|^2 = k(x_*, x_*) - k_{*n} (K + \sigma^2 I)^{-1} k_{n*}$$

This geometric view explains why the variance shrinks near data points: the angle between $f(x_*)$ and the observed variables $f(x_i)$ is small, so the projection captures most of the norm.
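The conditioning formulas above fit in a few lines; the toy data and 1-D RBF kernel below are assumptions of this sketch (a fuller implementation appears in Section 6).

```python
import jax.numpy as jnp

def rbf(x1, x2, ls=1.0):
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ls**2)

X = jnp.array([-2.0, -0.5, 1.0, 2.5])              # training inputs
y = jnp.sin(X)                                     # toy targets
X_star = jnp.linspace(-3.0, 3.0, 5)
noise = 1e-2                                       # sigma^2

K = rbf(X, X) + noise * jnp.eye(X.shape[0])        # K + sigma^2 I
k_star = rbf(X, X_star)                            # K_{n*}
alpha = jnp.linalg.solve(K, y)                     # (K + sigma^2 I)^{-1} y
mu_post = k_star.T @ alpha                         # projection onto the data
cov_post = rbf(X_star, X_star) - k_star.T @ jnp.linalg.solve(K, k_star)
print(mu_post, jnp.diag(cov_post))                 # variance shrinks near the data
```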

4. Spectral Kernels and Bochner’s Theorem

How do we design valid kernels? Theorem (Bochner): A continuous, shift-invariant function $k(\tau)$ on $\mathbb{R}^d$ is positive definite if and only if it is the Fourier transform of a finite non-negative measure $S(\omega)$:

$$k(\tau) = \int_{\mathbb{R}^d} e^{2\pi i \omega^T \tau}\, S(\omega)\, d\omega$$

Since $k$ is real, $S(\omega)$ must be symmetric ($S(\omega) = S(-\omega)$). $S(\omega)$ is called the Spectral Density.
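A one-dimensional sanity check of Bochner's theorem: numerically integrating a Gaussian spectral density against $\cos(2\pi\omega\tau)$ recovers the kernel $\exp(-2\pi^2 v \tau^2)$. The grid and the spectral variance $v$ are assumptions of this sketch.

```python
import jax.numpy as jnp

v = 0.25                                            # spectral variance
omega = jnp.linspace(-10.0, 10.0, 4001)
domega = omega[1] - omega[0]
S = jnp.exp(-omega**2 / (2 * v)) / jnp.sqrt(2 * jnp.pi * v)   # N(omega; 0, v)

tau = jnp.linspace(0.0, 3.0, 7)
# k(tau) = int e^{2 pi i omega tau} S(omega) d omega  (real part = cosine transform)
k_numeric = jnp.sum(S[None, :] * jnp.cos(2 * jnp.pi * omega[None, :] * tau[:, None]),
                    axis=1) * domega
k_closed = jnp.exp(-2 * jnp.pi**2 * v * tau**2)
print(float(jnp.max(jnp.abs(k_numeric - k_closed))))          # ~ quadrature error only
```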

Spectral Mixture Kernels (Wilson & Adams, 2013): Most standard kernels are merely low-pass filters.

  • RBF: $S(\omega) \propto e^{-\|\omega\|^2}$ (Gaussian).
  • Matern: $S(\omega) \propto (1 + \|\omega\|^2)^{-\nu - d/2}$ (Student-t / Cauchy-like). These spectral densities are centered at 0 and decay monotonically. They cannot model negative covariances or periodicity.

Theorem: The set of Gaussian Mixtures is dense in the set of all distribution functions (in the weak topology).

$$S(\omega) \approx \sum_{q=1}^Q w_q\, \mathcal{N}(\omega; \mu_q, \Sigma_q)$$

Taking the inverse Fourier transform of this density yields the Spectral Mixture (SM) Kernel:

$$k_{SM}(\tau) = \sum_{q=1}^Q w_q \exp(-2\pi^2 \tau^T \Sigma_q \tau) \cos(2\pi \mu_q^T \tau)$$

This kernel is interpretable:

  • $w_q$: Contribution of the $q$-th mode.
  • $\Sigma_q$: Frequency variance (inverse lengthscale).
  • $\mu_q$: Frequency of the periodicity.

By learning the weights, means, and variances of the spectral density, a GP can effectively perform a Non-parametric Fourier Analysis while handling noise and gaps in the data robustly. This eliminates the “Kernel Engineering” step. The kernel is learned from data.

5. Scaling: Inducing Points (Sparse GPs)

To beat the $O(N^3)$ bottleneck, we need to reduce the rank of the kernel matrix. Nyström approximations can be numerically unstable because they simply subsample the matrix. The modern approach, the Variational Sparse GP (Titsias, 2009), treats the inducing points as distinct variational parameters.

The Augmented Model: Introduce $M \ll N$ inducing inputs $Z = \{z_1, \dots, z_M\}$ and inducing variables $u = f(Z)$. The joint prior is:

$$p(f, u) = p(f \mid u)\, p(u)$$

where $p(u) = \mathcal{N}(0, K_{uu})$ and $p(f \mid u) = \mathcal{N}(K_{fu} K_{uu}^{-1} u,\; K_{ff} - Q_{ff})$ with $Q_{ff} = K_{fu} K_{uu}^{-1} K_{uf}$.

Variational Inference: We approximate the true posterior $p(f, u \mid y)$ with a variational distribution $q(f, u)$. Key assumption: $q(f, u) = p(f \mid u)\, q(u)$. We preserve the exact conditional $p(f \mid u)$ but approximate the inducing distribution by $q(u) = \mathcal{N}(m, S)$. We minimize the KL divergence $\mathrm{KL}(q \,\|\, p)$, which is equivalent to maximizing the Evidence Lower Bound (ELBO):

$$\mathcal{L} = \mathbb{E}_{q(f, u)} \left[ \log \frac{p(y \mid f)\, p(f \mid u)\, p(u)}{p(f \mid u)\, q(u)} \right]$$

The difficult functional terms cancel out!

$$\mathcal{L} = \mathbb{E}_{q(f)} [\log p(y \mid f)] - \mathrm{KL}(q(u) \,\|\, p(u))$$

The Bound Terms:

  1. Expected Log-Likelihood: $\sum_{i=1}^N \mathbb{E}_{q(f_i)}[\log \mathcal{N}(y_i \mid f_i, \sigma^2)]$. Since $q(f_i)$ is Gaussian (a marginal of $q(f)$), this is analytically tractable; it involves only the marginal (diagonal) variances at the training inputs $x_i$.
  2. KL Regularizer: $\mathrm{KL}(\mathcal{N}(m, S) \,\|\, \mathcal{N}(0, K_{uu}))$. This penalizes the inducing distribution for deviating from the prior.

Complexity: Computing the ELBO requires inverting $K_{uu}$ ($O(M^3)$) and multiplying by $K_{uf}$ ($O(NM^2)$). Total complexity: $O(NM^2)$. This enables GPs to scale to millions of points (using mini-batching on the expected log-likelihood sum).
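For concreteness, here is a minimal sketch of Titsias' collapsed bound, $\log \mathcal{N}(y \mid 0, Q_{ff} + \sigma^2 I) - \frac{1}{2\sigma^2}\operatorname{Tr}(K_{ff} - Q_{ff})$, which is what the ELBO above reduces to once the optimal $q(u)$ is plugged in. The 1-D unit-variance RBF helper and the dense $N \times N$ formulation (chosen for clarity, not the $O(NM^2)$ implementation) are assumptions of this sketch.

```python
import jax.numpy as jnp
from jax.scipy.linalg import cholesky, cho_solve, solve_triangular

def rbf(x1, x2, ls=1.0):
    # 1-D RBF kernel with unit signal variance (illustrative assumption)
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ls**2)

def collapsed_elbo(X, y, Z, noise_var=0.1, jitter=1e-6):
    """Titsias (2009) collapsed bound, written with dense matrices for clarity."""
    N, M = X.shape[0], Z.shape[0]
    Kuu = rbf(Z, Z) + jitter * jnp.eye(M)
    Kuf = rbf(Z, X)
    Luu = cholesky(Kuu, lower=True)
    A = solve_triangular(Luu, Kuf, lower=True)       # so Q_ff = A^T A
    Qff = A.T @ A
    cov = Qff + noise_var * jnp.eye(N)
    Lc = cholesky(cov, lower=True)
    alpha = cho_solve((Lc, True), y)
    log_marg = (-0.5 * y @ alpha
                - jnp.sum(jnp.log(jnp.diag(Lc)))
                - 0.5 * N * jnp.log(2 * jnp.pi))
    # Tr(K_ff) = N for a unit-variance kernel
    trace_term = -0.5 / noise_var * (N - jnp.trace(Qff))
    return log_marg + trace_term

# Toy usage: with more inducing points the bound approaches the exact log-marginal
X = jnp.linspace(-3.0, 3.0, 50); y = jnp.sin(X); Z = jnp.linspace(-3.0, 3.0, 10)
print(float(collapsed_elbo(X, y, Z)))
```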

6. Deep Kernel Learning and Implementation

Usually GPs have fixed kernels. Deep Kernel Learning (DKL) sets $k(x, y) = k_{\text{base}}(\phi_\theta(x), \phi_\theta(y))$, where $\phi_\theta$ is a neural network. We optimize $\theta$ by maximizing the Marginal Likelihood:

$$\log p(y) = -\frac{1}{2} y^T K_\theta^{-1} y - \frac{1}{2} \log |K_\theta| - \frac{n}{2} \log 2\pi$$

Gradient computation requires backpropagating through the Cholesky decomposition, using identities such as $\partial K^{-1} = -K^{-1} (\partial K) K^{-1}$. Modern frameworks (JAX, GPyTorch) handle this automatically.
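Before the full spectral-mixture example below, here is a minimal sketch of the deep-kernel construction itself: an MLP feature map $\phi_\theta$ feeding an RBF base kernel. The architecture, initialization, and names are assumptions of this sketch.

```python
import jax.numpy as jnp
from jax import random

def mlp_features(params, x):
    # Two-layer tanh feature extractor phi_theta(x); x has shape (N, d_in)
    h = jnp.tanh(x @ params['W1'] + params['b1'])
    return jnp.tanh(h @ params['W2'] + params['b2'])

def deep_rbf_kernel(params, x1, x2, lengthscale=1.0):
    # k(x, y) = k_rbf(phi_theta(x), phi_theta(y))
    z1, z2 = mlp_features(params, x1), mlp_features(params, x2)
    sq = jnp.sum((z1[:, None, :] - z2[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * sq / lengthscale**2)

def init_dkl_params(key, d_in, d_hidden=32, d_feat=8):
    k1, k2 = random.split(key)
    return {'W1': random.normal(k1, (d_in, d_hidden)) / jnp.sqrt(d_in),
            'b1': jnp.zeros(d_hidden),
            'W2': random.normal(k2, (d_hidden, d_feat)) / jnp.sqrt(d_hidden),
            'b2': jnp.zeros(d_feat)}
```

Plugging `deep_rbf_kernel` into a marginal-likelihood objective like `ExactGP.loss_fn` below and differentiating with respect to `params` trains the feature map jointly with the kernel hyperparameters.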

```python
import jax
import jax.numpy as jnp
from jax import random, jit, grad
from jax.example_libraries import optimizers

# -------------------------------------------------------------
# 1. Spectral Mixture Kernel Implementation
# -------------------------------------------------------------
def spectral_mixture_kernel(x1, x2, weights, means, variances):
    """
    Computes the SM kernel:
        k(tau) = sum_q w_q * exp(-2 pi^2 tau^T Sigma_q tau) * cos(2 pi mu_q^T tau)
    x1, x2:    (N, D)
    weights:   (Q,)
    means:     (Q, D)
    variances: (Q, D)
    """
    # Tau: pairwise distance vectors
    tau = x1[:, None, :] - x2[None, :, :]   # (N, N, D)
    tau_sq = tau ** 2

    # One mixture component; vectorized over the Q components with vmap
    def component_func(w, m, v):
        # exp term: exp(-2 pi^2 * tau^2 * v)
        exp_term = jnp.exp(-2.0 * jnp.pi**2 * jnp.sum(tau_sq * v, axis=-1))
        # cos term: cos(2 pi * mu * tau)
        cos_term = jnp.cos(2.0 * jnp.pi * jnp.sum(tau * m, axis=-1))
        return w * exp_term * cos_term

    components = jax.vmap(component_func)(weights, means, variances)
    return jnp.sum(components, axis=0)

# -------------------------------------------------------------
# 2. Exact GP Implementation
# -------------------------------------------------------------
class ExactGP:
    def __init__(self, kernel_fn):
        self.kernel_fn = kernel_fn

    def loss_fn(self, params, X, y):
        noise = jnp.exp(params['log_noise'])
        K = self.kernel_fn(X, X, params) + noise * jnp.eye(X.shape[0])
        L = jnp.linalg.cholesky(K)
        alpha = jax.scipy.linalg.cho_solve((L, True), y)
        # Marginal log-likelihood
        term1 = -0.5 * jnp.dot(y.flatten(), alpha.flatten())
        term2 = -jnp.sum(jnp.log(jnp.diag(L)))
        term3 = -0.5 * len(y) * jnp.log(2 * jnp.pi)
        return -(term1 + term2 + term3)   # return the Negative Log-Likelihood

    def predict(self, params, X_train, y_train, X_test):
        noise = jnp.exp(params['log_noise'])
        K_xx = self.kernel_fn(X_train, X_train, params) + noise * jnp.eye(X_train.shape[0])
        K_xs = self.kernel_fn(X_train, X_test, params)
        K_ss = self.kernel_fn(X_test, X_test, params)
        L = jnp.linalg.cholesky(K_xx)
        alpha = jax.scipy.linalg.cho_solve((L, True), y_train)
        mu = K_xs.T @ alpha
        v = jax.scipy.linalg.solve_triangular(L, K_xs, lower=True)
        var = K_ss - v.T @ v
        return mu, jnp.diag(var)

# -------------------------------------------------------------
# 3. Training Loop
# -------------------------------------------------------------
def train_sm_gp(X, y, num_components=4, steps=1000):
    key = random.PRNGKey(42)
    D = X.shape[1]
    # Initialize parameters randomly (crucial for the SM kernel)
    params = {
        'log_weights': jnp.log(jnp.ones(num_components) / num_components),
        'log_means': random.normal(key, (num_components, D)),
        'log_vars': random.normal(key, (num_components, D)) - 2.0,  # small initial bandwidth
        'log_noise': jnp.log(0.1)
    }

    # Wrapper to unpack the parameter dict into kernel arguments
    def wrapped_kernel(x1, x2, p):
        return spectral_mixture_kernel(
            x1, x2,
            jnp.exp(p['log_weights']),
            jnp.exp(p['log_means']),
            jnp.exp(p['log_vars'])
        )

    gp = ExactGP(wrapped_kernel)
    opt_init, opt_update, get_params = optimizers.adam(1e-2)

    @jit
    def update(i, opt_state):
        p = get_params(opt_state)
        grads = grad(gp.loss_fn)(p, X, y)
        return opt_update(i, grads, opt_state)

    opt_state = opt_init(params)
    for i in range(steps):
        opt_state = update(i, opt_state)
    return gp, get_params(opt_state)

# The SM kernel allows the GP to extrapolate patterns (like cardiac rhythm or airline
# passenger counts) far beyond the training window, unlike the RBF kernel, which
# reverts to the mean.
```

7. Approximate Inference for Classification

So far we assumed $y = f(x) + \epsilon$ (Gaussian noise). For Binary Classification, $y \in \{-1, 1\}$, the likelihood is non-Gaussian (Bernoulli/logistic):

$$p(y_i \mid f_i) = \sigma(y_i f_i)$$

The posterior $p(f \mid y) \propto p(y \mid f)\, p(f)$ is no longer Gaussian. The integral $\int p(y \mid f)\, p(f)\, df$ is intractable. We need approximations.

7.1 The Laplace Approximation: We approximate the posterior with a Gaussian $q(f) = \mathcal{N}(\hat{f}, A^{-1})$ centered at the Maximum A Posteriori (MAP) mode.

  1. Find the mode $\hat{f} = \operatorname{argmax}_f \left( \log p(y \mid f) + \log p(f) \right)$, typically by Newton's method (a sketch follows this list).
  2. Compute the Hessian of the negative log posterior at the mode: $A = -\nabla\nabla \log p(f \mid y)\big|_{\hat{f}} = K^{-1} + W$, where $W$ is the diagonal matrix of negative second derivatives of the log-likelihood (e.g. $\pi_i(1 - \pi_i)$ for the logistic).
  3. Use $q(f)$ for predictions: $\mathbb{E}[f_*] = k_*^T K^{-1} \hat{f}$. The variance accounts for the curvature $W$.
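A minimal sketch of the Newton iteration in step 1 for the logistic likelihood; the plain solve below (rather than the numerically stabler $B = I + W^{1/2} K W^{1/2}$ form in Rasmussen & Williams) is an assumption of this sketch.

```python
import jax.numpy as jnp

def laplace_mode(K, y, num_iters=20):
    """Newton iteration for the MAP mode of a GP classifier with logistic
    likelihood p(y_i | f_i) = sigma(y_i f_i), y in {-1, +1}."""
    N = K.shape[0]
    f = jnp.zeros(N)
    for _ in range(num_iters):
        pi = 1.0 / (1.0 + jnp.exp(-f))       # pi_i = sigma(f_i)
        W = pi * (1.0 - pi)                  # diag of W = -d^2/df^2 log p(y|f)
        grad_ll = (y + 1) / 2 - pi           # d/df log p(y|f)
        b = W * f + grad_ll
        # Newton step: f <- (K^{-1} + W)^{-1} b = K (I + W K)^{-1} b
        f = K @ jnp.linalg.solve(jnp.eye(N) + W[:, None] * K, b)
    return f, W
```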

7.2 Expectation Propagation (EP): Laplace is a local approximation; EP is global. We approximate each likelihood term $p(y_i \mid f_i)$ with an unnormalized Gaussian site function $t_i(f_i)$:

$$p(y \mid f)\, p(f) \approx \prod_i t_i(f_i)\, p(f)$$

We iteratively update the site parameters by moment-matching the marginals against the cavity distributions. EP generally works better than Laplace for GP classification, but each iteration is more expensive.


8. Conclusion: The Gold Standard of Uncertainty

Gaussian Processes represent the “Bayesian Ideal” for regression. They are mathematically elegant, handling uncertainty, smoothness, and finite-sample corrections exactly. While the $O(N^3)$ cost was historically prohibitive, modern methods (Sparse Variational GPs, Spectral Kernels, Deep Kernel Learning) have made them scalable to millions of data points. Perhaps most importantly, they serve as a rigorous infinite-width limit for Deep Neural Networks: in that regime, training a massive Neural Network is implicitly Kernel Regression in a complex RKHS. Understanding GPs is therefore not just about “old school” statistics; it is about understanding the fundamental limits of Learning Theory itself.


Historical Timeline

| Year | Event | Significance |
|------|-------|--------------|
| 1900 | Karl Pearson | Invents PCA (Linear Latent Variable Model). |
| 1940s | Kolmogorov / Wiener | Develop optimal linear filtering (Kriging). |
| 1994 | Radford Neal | Proves the infinite-width limit of a Neural Network is a GP. |
| 2006 | Rasmussen & Williams | Publish “Gaussian Processes for Machine Learning”. |
| 2009 | Michalis Titsias | Sparse Variational GPs (Inducing Points). |
| 2013 | Wilson & Adams | Spectral Mixture Kernels. |
| 2016 | Wilson et al. | Deep Kernel Learning. |
| 2018 | Jacot et al. | Neural Tangent Kernel (NTK). |

Appendix A: Deriving the Matern Spectral Density

The Matern kernel arises not from arbitrary engineering, but from the Green’s function of a Stochastic Partial Differential Equation (SPDE). Consider the SPDE driven by white noise $W(x)$:

$$\left( \frac{2\nu}{\ell^2} - \Delta \right)^{\frac{\nu}{2} + \frac{d}{4}} f(x) = W(x)$$

This is a fractional Helmholtz equation. Taking the Fourier Transform of both sides:

$$\left( \frac{2\nu}{\ell^2} + \|\omega\|^2 \right)^{\frac{\nu}{2} + \frac{d}{4}} \hat{f}(\omega) = \hat{W}(\omega)$$

Since $W$ is white noise, its spectral density is constant, $S_W(\omega) = \text{const}$. The power spectral density $S_f(\omega)$ of the solution is the squared modulus of the transfer function:

$$S_f(\omega) \propto \left| \left( \frac{2\nu}{\ell^2} + \|\omega\|^2 \right)^{-\left(\frac{\nu}{2} + \frac{d}{4}\right)} \right|^2 \propto \left( \frac{2\nu}{\ell^2} + \|\omega\|^2 \right)^{-\nu - \frac{d}{2}}$$

This matches the spectral density of the Matern class exactly. To get the spatial kernel $k(r)$, we take the inverse Fourier transform:

$$k(r) = \int_{\mathbb{R}^d} e^{i \omega \cdot r} \frac{1}{(\lambda^2 + \|\omega\|^2)^\alpha}\, d\omega, \qquad \lambda^2 = \frac{2\nu}{\ell^2},\quad \alpha = \nu + \frac{d}{2}$$

Using Schwinger parametrizations and Bessel function identities, this integral yields:

$$k_\nu(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{\ell} \right)^\nu K_\nu\!\left( \frac{\sqrt{2\nu}\, r}{\ell} \right)$$

Physical Interpretation:

  • $\nu = 1/2$: The SPDE is first order (Ornstein-Uhlenbeck). The physics of Brownian motion with friction.
  • $\nu \to \infty$: The operator converges to the Heat Operator limit, yielding the Diffusion (Heat) Kernel, which is the Gaussian RBF.

Appendix B: Why GPs Matter for Deep Learning

Neal’s Theorem (1994): A single-hidden-layer Bayesian Neural Network with infinite width converges to a Gaussian Process. Let $f(x) = \frac{1}{\sqrt{H}} \sum_{j=1}^H v_j\, \sigma(w_j^T x + b_j)$. For fixed $x$, $f(x)$ is a normalized sum of i.i.d. random variables, so the Central Limit Theorem implies it becomes Gaussian as $H \to \infty$.

The Modern View (Jacot et al., 2018): Consider the training dynamics of an infinite-width network under Gradient Descent. The network evolves as a linear model with respect to a fixed kernel $\Theta(x, x')$, called the Neural Tangent Kernel (NTK):

$$\Theta(x, x') = \langle \nabla_\theta f(x), \nabla_\theta f(x') \rangle$$

In the infinite-width limit, $\Theta$ stays constant during training. The training dynamics are exactly given by the GP / Kernel Ridge Regression solution:

$$f_t(x) = \Theta(x, X)\, \Theta(X, X)^{-1} \left( I - e^{-\eta \Theta(X, X) t} \right) y$$

This shows that, in the infinite-width regime, “training a massive Neural Network” is functionally equivalent to “performing Kernel Regression with a specific architecture-dependent kernel”. We can compute these kernels analytically for ReLUs, CNNs, etc. For a ReLU network, the “Arc-Cosine Kernel” emerges:

$$k_{\text{ReLU}}(x, x') \propto \|x\|\,\|x'\| \left( \sin \phi + (\pi - \phi) \cos \phi \right)$$

where $\phi$ is the angle between the inputs. This gives us a closed-form way to understand what “inductive bias” a Neural Network architecture imposes: it specifies a specific RKHS norm.
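The NTK can also be estimated empirically for a finite network by taking the inner product of parameter gradients; the two-layer ReLU network below is a hypothetical example, not an architecture from the text. As the width $H$ grows, this empirical value concentrates around its deterministic infinite-width limit.

```python
import jax
import jax.numpy as jnp
from jax import random

def mlp(params, x):
    # One hidden ReLU layer, scalar output
    h = jnp.maximum(x @ params['W1'] + params['b1'], 0.0)
    return (h @ params['W2'])[0]

def empirical_ntk(params, x1, x2):
    # Theta(x, x') = <grad_theta f(x), grad_theta f(x')>
    g1 = jax.grad(mlp)(params, x1)
    g2 = jax.grad(mlp)(params, x2)
    return sum(jnp.vdot(g1[k], g2[k]) for k in g1)

key = random.PRNGKey(0)
k1, k2 = random.split(key)
H = 512
params = {'W1': random.normal(k1, (3, H)) / jnp.sqrt(3),
          'b1': jnp.zeros(H),
          'W2': random.normal(k2, (H, 1)) / jnp.sqrt(H)}
x, xp = jnp.ones(3), jnp.array([1.0, -1.0, 0.5])
print(float(empirical_ntk(params, x, xp)))
```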

Derivation of the Arc-Cosine Kernel: Consider a single neuron $h(x) = \sigma(w^T x)$ with $w \sim \mathcal{N}(0, I)$. The kernel is $k(x, y) = \mathbb{E}_w[\sigma(w^T x)\, \sigma(w^T y)]$. Let $\sigma(z) = \max(0, z)$ (ReLU). By rotational invariance, we can align $x$ with the first axis and place $y$ in the plane spanned by the first two axes. Input vectors: $x = \|x\| e_1$, $y = \|y\|(\cos\phi\, e_1 + \sin\phi\, e_2)$. Weights: $w = (w_1, w_2, \dots)$. The product of activations becomes $\max(0, \|x\| w_1)\, \max(0, \|y\|(w_1 \cos\phi + w_2 \sin\phi))$, so up to the factor $\|x\|\,\|y\|$ we need the two-dimensional Gaussian expectation:

$$J(\phi) = \frac{1}{2\pi}\iint \max(0, u)\,\max(0,\, u\cos\phi + v\sin\phi)\, e^{-(u^2+v^2)/2}\, du\, dv$$

The integration domain is the sector where both arguments are positive. Using polar coordinates $(u, v) = r(\cos\theta, \sin\theta)$:

$$J(\phi) = \frac{1}{2\pi}\int_0^\infty r^3 e^{-r^2/2}\, dr \int_{-\pi/2 + \phi}^{\pi/2} \cos\theta\, \cos(\theta - \phi)\, d\theta$$

The radial integral is $\int_0^\infty r^3 e^{-r^2/2}\, dr = 2$ and the angular integral is $\tfrac{1}{2}\left(\sin\phi + (\pi - \phi)\cos\phi\right)$, so $J(\phi) = \frac{1}{2\pi}\left(\sin\phi + (\pi - \phi)\cos\phi\right)$ and hence $k(x, y) = \frac{\|x\|\,\|y\|}{2\pi}\left(\sin\phi + (\pi - \phi)\cos\phi\right)$. This result (Cho & Saul, 2009) connects the geometric angle between data points directly to the correlation structure of the Deep Network. It implies that ReLU networks measure similarity based on angle, not Euclidean distance (unlike RBF). This explains their superior performance on high-dimensional image data, where angular distance is more meaningful than radial distance due to concentration of measure.
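The closed form can be checked against a Monte Carlo estimate of $\mathbb{E}_w[\mathrm{ReLU}(w^T x)\,\mathrm{ReLU}(w^T y)]$; the sample size and test vectors are assumptions of this sketch.

```python
import jax.numpy as jnp
from jax import random

def arccos_relu_kernel(x, y):
    """Closed-form E_w[relu(w.x) relu(w.y)] for w ~ N(0, I)."""
    nx, ny = jnp.linalg.norm(x), jnp.linalg.norm(y)
    cos_phi = jnp.clip(jnp.dot(x, y) / (nx * ny), -1.0, 1.0)
    phi = jnp.arccos(cos_phi)
    return nx * ny * (jnp.sin(phi) + (jnp.pi - phi) * jnp.cos(phi)) / (2 * jnp.pi)

key = random.PRNGKey(0)
x = jnp.array([1.0, 0.5, -0.2])
y = jnp.array([0.3, 1.0, 0.8])
W = random.normal(key, (200_000, 3))                          # Monte Carlo weights
mc = jnp.mean(jnp.maximum(W @ x, 0) * jnp.maximum(W @ y, 0))
print(float(arccos_relu_kernel(x, y)), float(mc))             # should agree closely
```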


Appendix C: Gaussian Process Latent Variable Models (GPLVM)

Standard PCA assumes data $Y$ comes from a linear mapping of latent variables $X$: $Y = WX + \epsilon$. Dual Probabilistic PCA (Lawrence, 2004) marginalizes $W$ and optimizes $X$. This yields a Gaussian Process mapping from latent space to data space: each output dimension of $Y$ is distributed as $\mathcal{N}(0, K(X, X) + \sigma^2 I)$. The GPLVM optimizes the latent locations $X$ to maximize the likelihood of the data $Y$.

The Objective:

$$\log p(Y \mid X) = -\frac{D}{2} \log |K| - \frac{1}{2} \operatorname{Tr}(K^{-1} Y Y^T) - \text{const}$$

Since $X$ is inside the non-linear kernel $K$, this is a non-convex optimization problem. The GPLVM learns the manifold structure. If we use a linear kernel, we recover PCA. If we use an RBF kernel, we get a non-linear embedding that preserves local distances (similar to t-SNE, but probabilistic).
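A minimal sketch of the (non-Bayesian) GPLVM objective with plain gradient descent on the latent locations; the RBF kernel, toy data, learning rate, and initialization are assumptions of this sketch.

```python
import jax.numpy as jnp
from jax import grad, random

def rbf(X, ls=1.0):
    sq = jnp.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * sq / ls**2)

def gplvm_neg_log_lik(X_latent, Y, noise_var=0.1):
    """-log p(Y|X) = D/2 log|K| + 1/2 Tr(K^{-1} Y Y^T) + N D/2 log(2 pi)."""
    N, D = Y.shape
    K = rbf(X_latent) + noise_var * jnp.eye(N)
    L = jnp.linalg.cholesky(K)
    logdet = 2.0 * jnp.sum(jnp.log(jnp.diag(L)))
    Kinv_Y = jnp.linalg.solve(K, Y)
    return 0.5 * D * logdet + 0.5 * jnp.sum(Y * Kinv_Y) + 0.5 * N * D * jnp.log(2 * jnp.pi)

# Gradient descent directly on the latent locations X (non-convex)
key = random.PRNGKey(0)
Y = random.normal(key, (20, 5))                     # toy high-dimensional data
X_latent = 0.1 * random.normal(key, (20, 2))        # 2-D latent initialization
for _ in range(100):
    X_latent = X_latent - 1e-2 * grad(gplvm_neg_log_lik)(X_latent, Y)
```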

Bayesian GPLVM (Titsias & Lawrence, 2010): The standard GPLVM treats $X$ as point parameters (MAP). We can be fully Bayesian: place a prior $X \sim \mathcal{N}(0, I)$ and approximate the posterior $q(X)$. This requires a variational lower bound slightly more complex than the sparse-GP one, because the kernel matrix $K_{ff}$ itself is now a random variable (since it depends on the stochastic $X$). Titsias derived a closed form for the $\Psi$-statistics:

$$\Psi_0 = \operatorname{Tr}(\mathbb{E}_q[K_{ff}]), \qquad \Psi_1 = \mathbb{E}_q[K_{fu}], \qquad \Psi_2 = \mathbb{E}_q[K_{uf} K_{fu}]$$

For the RBF kernel, these expectations are analytically tractable (involving Gaussian convolution). This allows for “Automatic Relevance Determination” (ARD) on the latent dimensions, effectively learning the intrinsic dimensionality of the data automatically.


Appendix D: Glossary of Definitions

  • Covariance Kernel: Encodes the similarity between points. Must be PSD.
  • Inducing Points: Pseudopoints used to summarize the infinite process for scalability ($O(NM^2)$).
  • Marginal Likelihood: The probability of data given hyperparameters, integrating out the function. The objective for model selection.
  • Posterior Contraction: How the uncertainty shrinks near observed data.
  • RKHS: Hilbert space defined by the kernel. Contains the “smooth” functions.
  • Spectral Density: Fourier transform of the kernel (Bochner’s Theorem).

References

1. Rasmussen, C. E., & Williams, C. K. I. (2006). “Gaussian Processes for Machine Learning”. The standard text. Covers regression, classification, covariance functions, and theoretical properties.

2. Wilson, A. G., & Adams, R. P. (2013). “Gaussian Process Kernels for Pattern Discovery and Extrapolation”. Introduced Spectral Mixture Kernels. Showed that standard kernels are just local averages; spectral kernels can model long-range structure and negative covariance.

3. Titsias, M. (2009). “Variational Learning of Inducing Variables in Sparse Gaussian Processes”. The breakthrough that made GPs scalable. Instead of Nystrom (which ignores data), Titsias treated inducing points as variational parameters to be optimized.

4. Jacot, A. et al. (2018). “Neural Tangent Kernel: Convergence and Generalization in Neural Networks”. The landmark paper connecting infinite width Neural Networks to Kernel methods, providing exact solvability for Deep Learning dynamics.

5. Lawrence, N. (2004). “Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data”. Introduced the GPLVM framework, flipping the GP regression problem to optimize inputs rather than weights.

6. MacKay, D. J. C. (1998). “Introduction to Gaussian Processes”. A classic tutorial by the late Sir David MacKay. Connects GPs to Neural Networks (early) and provides a Bayesian perspective on regularization.

7. Quinonero-Candela, J., & Rasmussen, C. E. (2005). “A Unifying View of Sparse Approximate Gaussian Process Regression”. The paper that organized the confusing zoo of sparse approximations (FITC, DTC, PITC) into a single effective theory.