Gaussian Processes & RKHS

1. The Infinite Dimensional Gaussian

The Object: A Gaussian Process (GP) is a collection of random variables indexed by a set $X$, such that every finite linear combination is normally distributed. This sounds simple, but it requires the Kolmogorov Extension Theorem to be well-defined. We specify the process by:

  1. Mean Function: $m(x) = \mathbb{E}[f(x)]$.
  2. Covariance Kernel: $k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]$.

For consistency (to define a measure on $\mathbb{R}^X$), the kernel $k$ must be Positive Semi-Definite (PSD). For any set of points $\{x_i\}$, the Gram matrix $K_{ij} = k(x_i, x_j)$ must satisfy $v^T K v \ge 0$ for all $v \in \mathbb{R}^n$.
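As a quick numerical illustration of the PSD requirement, the sketch below builds a Gram matrix for a 1-D RBF kernel and checks that its eigenvalues are non-negative; the helper name `rbf_gram` and the test points are assumptions of this sketch, not from the text.

```python
import jax.numpy as jnp

def rbf_gram(x, ls=1.0):
    # Gram matrix K_ij = k(x_i, x_j) for a 1-D RBF kernel (illustrative choice)
    return jnp.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ls**2)

x = jnp.array([-1.3, 0.0, 0.4, 2.1, 3.7])
K = rbf_gram(x)
eigvals = jnp.linalg.eigvalsh(K)
print(eigvals)                          # all >= 0 up to round-off: K is PSD
print(bool(jnp.all(eigvals > -1e-9)))   # hence v^T K v >= 0 for every v
```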

2. The Reproducing Kernel Hilbert Space (RKHS)

Associated with every PSD kernel $k$ is a unique Hilbert space $\mathcal{H}_k$ of functions $f: X \to \mathbb{R}$. This space is “small”—it contains the “skeleton” of the random process, but, as we will see, typically not the sample paths themselves.

Construction:

  1. Define the pre-Hilbert space $\mathcal{H}_0 = \operatorname{span}\{k(\cdot, x) \mid x \in X\}$.
  2. Define the inner product on basis elements: $\langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}_k} = k(x, y)$.
  3. Extend by linearity and complete the space to get $\mathcal{H}_k$.

The Reproducing Property: For any $f \in \mathcal{H}_k$:

$$\langle f, k(\cdot, x) \rangle_{\mathcal{H}_k} = f(x)$$

This implies that the evaluation functional $\delta_x: f \mapsto f(x)$ is a bounded (hence continuous) linear functional:

$$|f(x)| = |\langle f, k_x \rangle| \le \|f\|_{\mathcal{H}_k} \|k_x\|_{\mathcal{H}_k} = \|f\|_{\mathcal{H}_k} \sqrt{k(x, x)}$$
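A minimal numerical check of this bound, for a function built explicitly from kernel sections (so $\|f\|_{\mathcal{H}_k}^2 = \alpha^T K \alpha$); the RBF kernel, coefficients, and points below are illustrative assumptions.

```python
import jax.numpy as jnp

def rbf(x1, x2, ls=1.0):
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ls**2)

# f = sum_i alpha_i k(., x_i) lies in H_k by construction
xs = jnp.array([-1.0, 0.0, 2.0])
alpha = jnp.array([0.5, -1.0, 2.0])
K = rbf(xs, xs)
rkhs_norm = jnp.sqrt(alpha @ K @ alpha)            # ||f||_{H_k}

x_star = jnp.array([0.7])
f_x = (alpha @ rbf(xs, x_star))[0]                 # reproducing property: <f, k_x> = f(x)
bound = rkhs_norm * jnp.sqrt(rbf(x_star, x_star)[0, 0])
print(float(abs(f_x)), float(bound))               # |f(x)| <= ||f|| sqrt(k(x, x))
```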

Mercer’s Theorem: Let $X$ be compact and $k$ be continuous. The integral operator $T_k: L^2(X) \to L^2(X)$,

$$(T_k f)(x) = \int_X k(x, y)\, f(y)\, d\mu(y)$$

is compact, positive, and self-adjoint. There exists an orthonormal basis of eigenfunctions $\{\phi_i\}$ and eigenvalues $\lambda_i \ge 0$ such that:

$$k(x, y) = \sum_{i=1}^\infty \lambda_i\, \phi_i(x)\, \phi_i(y)$$

The RKHS norm can be characterized spectrally:

$$\|f\|_{\mathcal{H}_k}^2 = \sum_{i=1}^\infty \frac{\langle f, \phi_i \rangle_{L^2}^2}{\lambda_i}$$

This explicitly penalizes high-frequency components (where $\lambda_i \to 0$). The smaller $\lambda_i$, the “harder” it is for a function to have energy in that mode. Sample paths of the GP typically have infinite RKHS norm: writing a draw via the Karhunen-Loève expansion $f = \sum_i \sqrt{\lambda_i}\, \xi_i\, \phi_i$ with $\xi_i \sim \mathcal{N}(0, 1)$ i.i.d., the spectral norm formula gives $\sum_i \frac{\lambda_i \xi_i^2}{\lambda_i} = \sum_i \xi_i^2$, which diverges almost surely. This is the Driscoll zero-one law: $P(f \in \mathcal{H}_k) = 0$. The sample paths live in a slightly larger space (usually a fractional Sobolev space).
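The divergence argument can be seen numerically with a Nyström (grid) approximation of the Mercer eigenpairs; the grid, kernel, and variable names below are assumptions of this sketch.

```python
import jax.numpy as jnp
from jax import random

# Nystrom-style approximation of Mercer eigenpairs of an RBF kernel on a grid
x = jnp.linspace(-3.0, 3.0, 200)
K = jnp.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
lam, phi = jnp.linalg.eigh(K)                      # eigenvalues in ascending order
lam, phi = lam[::-1], phi[:, ::-1]                 # sort descending

# Karhunen-Loeve draw: f = sum_i sqrt(lam_i) xi_i phi_i, xi_i ~ N(0, 1)
xi = random.normal(random.PRNGKey(0), (x.shape[0],))
f_sample = phi @ (jnp.sqrt(jnp.clip(lam, 0.0)) * xi)

# Truncated RKHS norm of the draw is sum_i xi_i^2 -- it keeps growing with every mode
partial_norm = jnp.cumsum(xi ** 2)
print(float(partial_norm[9]), float(partial_norm[49]), float(partial_norm[199]))
```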


3. The Geometry of the Posterior

We observe noisy data $y_i = f(x_i) + \epsilon_i$. We wish to compute the posterior $f_* \mid y$. Algebraically, this is just Gaussian conditioning. Geometrically, it is an Orthogonal Projection.

Consider the subspace spanned by the observations, $V_n = \operatorname{span}\{f(x_1), \dots, f(x_n)\}$. Since the variables are jointly Gaussian, the conditional expectation $\mathbb{E}[f(x_*) \mid y]$ is simply the orthogonal projection of the random variable $f(x_*)$ onto the space spanned by the observations $y$.

$$\hat{f}(x_*) = \operatorname{Proj}_{y} f(x_*) = \operatorname{Cov}(f(x_*), y)\, \operatorname{Var}(y)^{-1} y = k_{*n} (K_{nn} + \sigma_n^2 I)^{-1} y$$

This coincides with the Representer Theorem. The mean function lies exactly in the subspace spanned by the kernel functions $\{k(\cdot, x_i)\}$:

$$\mu_{\text{post}}(x) = \sum_{i=1}^n \alpha_i\, k(x_i, x), \qquad \alpha = (K + \sigma^2 I)^{-1} y$$

The posterior variance is the squared norm of the “residual” vector (the part of $f(x_*)$ orthogonal to the observations):

$$\mathbb{V}[f_* \mid y] = \|f(x_*)\|^2 - \|\operatorname{Proj}_{y} f(x_*)\|^2 = k(x_*, x_*) - k_{*n} (K + \sigma^2 I)^{-1} k_{n*}$$

This geometric view explains why the variance shrinks near data points: the angle between $f(x_*)$ and the observed variables $f(x_i)$ is small, so the projection captures most of the norm.
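The conditioning formulas above fit in a few lines; the toy data and 1-D RBF kernel below are assumptions of this sketch (a fuller implementation appears in Section 6).

```python
import jax.numpy as jnp

def rbf(x1, x2, ls=1.0):
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ls**2)

X = jnp.array([-2.0, -0.5, 1.0, 2.5])              # training inputs
y = jnp.sin(X)                                     # toy targets
X_star = jnp.linspace(-3.0, 3.0, 5)
noise = 1e-2                                       # sigma^2

K = rbf(X, X) + noise * jnp.eye(X.shape[0])        # K + sigma^2 I
k_star = rbf(X, X_star)                            # K_{n*}
alpha = jnp.linalg.solve(K, y)                     # (K + sigma^2 I)^{-1} y
mu_post = k_star.T @ alpha                         # projection onto the data
cov_post = rbf(X_star, X_star) - k_star.T @ jnp.linalg.solve(K, k_star)
print(mu_post, jnp.diag(cov_post))                 # variance shrinks near the data
```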

4. Spectral Kernels and Bochner’s Theorem

How do we design valid kernels? Theorem (Bochner): A continuous, shift-invariant function $k(\tau)$ on $\mathbb{R}^d$ is positive definite if and only if it is the Fourier transform of a finite non-negative measure $S(\omega)$:

$$k(\tau) = \int_{\mathbb{R}^d} e^{2\pi i \omega^T \tau}\, S(\omega)\, d\omega$$

Since $k$ is real, $S(\omega)$ must be symmetric ($S(\omega) = S(-\omega)$). $S(\omega)$ is called the Spectral Density.
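A one-dimensional sanity check of Bochner's theorem: numerically integrating a Gaussian spectral density against $\cos(2\pi\omega\tau)$ recovers the kernel $\exp(-2\pi^2 v \tau^2)$. The grid and the spectral variance $v$ are assumptions of this sketch.

```python
import jax.numpy as jnp

v = 0.25                                            # spectral variance
omega = jnp.linspace(-10.0, 10.0, 4001)
domega = omega[1] - omega[0]
S = jnp.exp(-omega**2 / (2 * v)) / jnp.sqrt(2 * jnp.pi * v)   # N(omega; 0, v)

tau = jnp.linspace(0.0, 3.0, 7)
# k(tau) = int e^{2 pi i omega tau} S(omega) d omega  (real part = cosine transform)
k_numeric = jnp.sum(S[None, :] * jnp.cos(2 * jnp.pi * omega[None, :] * tau[:, None]),
                    axis=1) * domega
k_closed = jnp.exp(-2 * jnp.pi**2 * v * tau**2)
print(float(jnp.max(jnp.abs(k_numeric - k_closed))))          # ~ quadrature error only
```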

Spectral Mixture Kernels (Wilson & Adams, 2013): Most standard kernels are merely low-pass filters.

  • RBF: $S(\omega) \propto e^{-\|\omega\|^2}$ (Gaussian).
  • Matern: $S(\omega) \propto (1 + \|\omega\|^2)^{-\nu - d/2}$ (Student-t / Cauchy-like). These spectral densities are centered at 0 and decay monotonically. They cannot model negative covariances or periodicity.

Theorem: The set of Gaussian Mixtures is dense in the set of all distribution functions (in the weak topology).

$$S(\omega) \approx \sum_{q=1}^Q w_q\, \mathcal{N}(\omega; \mu_q, \Sigma_q)$$

Taking the inverse Fourier transform of this density yields the Spectral Mixture (SM) Kernel:

$$k_{SM}(\tau) = \sum_{q=1}^Q w_q \exp(-2\pi^2 \tau^T \Sigma_q \tau) \cos(2\pi \mu_q^T \tau)$$

This kernel is interpretable:

  • $w_q$: Contribution of the $q$-th mode.
  • $\Sigma_q$: Frequency variance (inverse lengthscale).
  • $\mu_q$: Frequency of the periodicity.

By learning the weights, means, and variances of the spectral density, a GP can effectively perform a Non-parametric Fourier Analysis while handling noise and gaps in the data robustly. This eliminates the “Kernel Engineering” step. The kernel is learned from data.

5. Scaling: Inducing Points (Sparse GPs)

To beat the $O(N^3)$ bottleneck, we need to reduce the rank of the kernel matrix. Nyström approximations can be numerically unstable because they simply subsample the matrix. The modern approach, the Variational Sparse GP (Titsias, 2009), treats the inducing points as distinct variational parameters.

The Augmented Model: Introduce $M \ll N$ inducing inputs $Z = \{z_1, \dots, z_M\}$ and inducing variables $u = f(Z)$. The joint prior is:

$$p(f, u) = p(f \mid u)\, p(u)$$

where $p(u) = \mathcal{N}(0, K_{uu})$ and $p(f \mid u) = \mathcal{N}(K_{fu} K_{uu}^{-1} u,\; K_{ff} - Q_{ff})$ with $Q_{ff} = K_{fu} K_{uu}^{-1} K_{uf}$.

Variational Inference: We approximate the true posterior $p(f, u \mid y)$ with a variational distribution $q(f, u)$. Key assumption: $q(f, u) = p(f \mid u)\, q(u)$. We preserve the exact conditional $p(f \mid u)$ but approximate the inducing distribution by $q(u) = \mathcal{N}(m, S)$. We minimize the KL divergence $\mathrm{KL}(q \,\|\, p)$, which is equivalent to maximizing the Evidence Lower Bound (ELBO):

$$\mathcal{L} = \mathbb{E}_{q(f, u)} \left[ \log \frac{p(y \mid f)\, p(f \mid u)\, p(u)}{p(f \mid u)\, q(u)} \right]$$

The difficult functional terms cancel out!

$$\mathcal{L} = \mathbb{E}_{q(f)} [\log p(y \mid f)] - \mathrm{KL}(q(u) \,\|\, p(u))$$

The Bound Terms:

  1. Expected Log-Likelihood: $\sum_{i=1}^N \mathbb{E}_{q(f_i)}[\log \mathcal{N}(y_i \mid f_i, \sigma^2)]$. Since $q(f_i)$ is Gaussian (a marginal of $q(f)$), this is analytically tractable; it involves only the marginal (diagonal) variances at the training inputs $x_i$.
  2. KL Regularizer: $\mathrm{KL}(\mathcal{N}(m, S) \,\|\, \mathcal{N}(0, K_{uu}))$. This penalizes the inducing distribution for deviating from the prior.

Complexity: Computing the ELBO requires inverting $K_{uu}$ ($O(M^3)$) and multiplying by $K_{uf}$ ($O(NM^2)$). Total complexity: $O(NM^2)$. This enables GPs to scale to millions of points (using mini-batching on the expected log-likelihood sum).
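For concreteness, here is a minimal sketch of Titsias' collapsed bound, $\log \mathcal{N}(y \mid 0, Q_{ff} + \sigma^2 I) - \frac{1}{2\sigma^2}\operatorname{Tr}(K_{ff} - Q_{ff})$, which is what the ELBO above reduces to once the optimal $q(u)$ is plugged in. The 1-D unit-variance RBF helper and the dense $N \times N$ formulation (chosen for clarity, not the $O(NM^2)$ implementation) are assumptions of this sketch.

```python
import jax.numpy as jnp
from jax.scipy.linalg import cholesky, cho_solve, solve_triangular

def rbf(x1, x2, ls=1.0):
    # 1-D RBF kernel with unit signal variance (illustrative assumption)
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ls**2)

def collapsed_elbo(X, y, Z, noise_var=0.1, jitter=1e-6):
    """Titsias (2009) collapsed bound, written with dense matrices for clarity."""
    N, M = X.shape[0], Z.shape[0]
    Kuu = rbf(Z, Z) + jitter * jnp.eye(M)
    Kuf = rbf(Z, X)
    Luu = cholesky(Kuu, lower=True)
    A = solve_triangular(Luu, Kuf, lower=True)       # so Q_ff = A^T A
    Qff = A.T @ A
    cov = Qff + noise_var * jnp.eye(N)
    Lc = cholesky(cov, lower=True)
    alpha = cho_solve((Lc, True), y)
    log_marg = (-0.5 * y @ alpha
                - jnp.sum(jnp.log(jnp.diag(Lc)))
                - 0.5 * N * jnp.log(2 * jnp.pi))
    # Tr(K_ff) = N for a unit-variance kernel
    trace_term = -0.5 / noise_var * (N - jnp.trace(Qff))
    return log_marg + trace_term

# Toy usage: with more inducing points the bound approaches the exact log-marginal
X = jnp.linspace(-3.0, 3.0, 50); y = jnp.sin(X); Z = jnp.linspace(-3.0, 3.0, 10)
print(float(collapsed_elbo(X, y, Z)))
```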

6. Deep Kernel Learning and Implementation

Usually GPs have fixed kernels. Deep Kernel Learning (DKL) sets $k(x, y) = k_{\text{base}}(\phi_\theta(x), \phi_\theta(y))$, where $\phi_\theta$ is a neural network. We optimize $\theta$ by maximizing the Marginal Likelihood:

$$\log p(y) = -\frac{1}{2} y^T K_\theta^{-1} y - \frac{1}{2} \log |K_\theta| - \frac{n}{2} \log 2\pi$$

Gradient computation requires backpropagating through the Cholesky decomposition, using identities such as $\partial K^{-1} = -K^{-1} (\partial K) K^{-1}$. Modern frameworks (JAX, GPyTorch) handle this automatically.
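Before the full spectral-mixture example below, here is a minimal sketch of the deep-kernel construction itself: an MLP feature map $\phi_\theta$ feeding an RBF base kernel. The architecture, initialization, and names are assumptions of this sketch.

```python
import jax.numpy as jnp
from jax import random

def mlp_features(params, x):
    # Two-layer tanh feature extractor phi_theta(x); x has shape (N, d_in)
    h = jnp.tanh(x @ params['W1'] + params['b1'])
    return jnp.tanh(h @ params['W2'] + params['b2'])

def deep_rbf_kernel(params, x1, x2, lengthscale=1.0):
    # k(x, y) = k_rbf(phi_theta(x), phi_theta(y))
    z1, z2 = mlp_features(params, x1), mlp_features(params, x2)
    sq = jnp.sum((z1[:, None, :] - z2[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * sq / lengthscale**2)

def init_dkl_params(key, d_in, d_hidden=32, d_feat=8):
    k1, k2 = random.split(key)
    return {'W1': random.normal(k1, (d_in, d_hidden)) / jnp.sqrt(d_in),
            'b1': jnp.zeros(d_hidden),
            'W2': random.normal(k2, (d_hidden, d_feat)) / jnp.sqrt(d_hidden),
            'b2': jnp.zeros(d_feat)}
```

Plugging `deep_rbf_kernel` into a marginal-likelihood objective like `ExactGP.loss_fn` below and differentiating with respect to `params` trains the feature map jointly with the kernel hyperparameters.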

```python
import jax
import jax.numpy as jnp
from jax import random, jit, grad
from jax.example_libraries import optimizers

# -------------------------------------------------------------
# 1. Spectral Mixture Kernel Implementation
# -------------------------------------------------------------
def spectral_mixture_kernel(x1, x2, weights, means, variances):
    """
    Computes the SM kernel:
        k(tau) = sum_q w_q * exp(-2 pi^2 tau^T Sigma_q tau) * cos(2 pi mu_q^T tau)
    x1, x2:    (N, D)
    weights:   (Q,)
    means:     (Q, D)
    variances: (Q, D)
    """
    # Tau: pairwise distance vectors
    tau = x1[:, None, :] - x2[None, :, :]   # (N, N, D)
    tau_sq = tau ** 2

    # One mixture component; vectorized over the Q components with vmap
    def component_func(w, m, v):
        # exp term: exp(-2 pi^2 * tau^2 * v)
        exp_term = jnp.exp(-2.0 * jnp.pi**2 * jnp.sum(tau_sq * v, axis=-1))
        # cos term: cos(2 pi * mu * tau)
        cos_term = jnp.cos(2.0 * jnp.pi * jnp.sum(tau * m, axis=-1))
        return w * exp_term * cos_term

    components = jax.vmap(component_func)(weights, means, variances)
    return jnp.sum(components, axis=0)

# -------------------------------------------------------------
# 2. Exact GP Implementation
# -------------------------------------------------------------
class ExactGP:
    def __init__(self, kernel_fn):
        self.kernel_fn = kernel_fn

    def loss_fn(self, params, X, y):
        noise = jnp.exp(params['log_noise'])
        K = self.kernel_fn(X, X, params) + noise * jnp.eye(X.shape[0])
        L = jnp.linalg.cholesky(K)
        alpha = jax.scipy.linalg.cho_solve((L, True), y)
        # Marginal log-likelihood
        term1 = -0.5 * jnp.dot(y.flatten(), alpha.flatten())
        term2 = -jnp.sum(jnp.log(jnp.diag(L)))
        term3 = -0.5 * len(y) * jnp.log(2 * jnp.pi)
        return -(term1 + term2 + term3)   # return the Negative Log-Likelihood

    def predict(self, params, X_train, y_train, X_test):
        noise = jnp.exp(params['log_noise'])
        K_xx = self.kernel_fn(X_train, X_train, params) + noise * jnp.eye(X_train.shape[0])
        K_xs = self.kernel_fn(X_train, X_test, params)
        K_ss = self.kernel_fn(X_test, X_test, params)
        L = jnp.linalg.cholesky(K_xx)
        alpha = jax.scipy.linalg.cho_solve((L, True), y_train)
        mu = K_xs.T @ alpha
        v = jax.scipy.linalg.solve_triangular(L, K_xs, lower=True)
        var = K_ss - v.T @ v
        return mu, jnp.diag(var)

# -------------------------------------------------------------
# 3. Training Loop
# -------------------------------------------------------------
def train_sm_gp(X, y, num_components=4, steps=1000):
    key = random.PRNGKey(42)
    D = X.shape[1]
    # Initialize parameters randomly (crucial for the SM kernel)
    params = {
        'log_weights': jnp.log(jnp.ones(num_components) / num_components),
        'log_means': random.normal(key, (num_components, D)),
        'log_vars': random.normal(key, (num_components, D)) - 2.0,  # small initial bandwidth
        'log_noise': jnp.log(0.1)
    }

    # Wrapper to unpack the parameter dict into kernel arguments
    def wrapped_kernel(x1, x2, p):
        return spectral_mixture_kernel(
            x1, x2,
            jnp.exp(p['log_weights']),
            jnp.exp(p['log_means']),
            jnp.exp(p['log_vars'])
        )

    gp = ExactGP(wrapped_kernel)
    opt_init, opt_update, get_params = optimizers.adam(1e-2)

    @jit
    def update(i, opt_state):
        p = get_params(opt_state)
        grads = grad(gp.loss_fn)(p, X, y)
        return opt_update(i, grads, opt_state)

    opt_state = opt_init(params)
    for i in range(steps):
        opt_state = update(i, opt_state)
    return gp, get_params(opt_state)

# The SM kernel allows the GP to extrapolate patterns (like cardiac rhythm or airline
# passenger counts) far beyond the training window, unlike the RBF kernel, which
# reverts to the mean.
```

7. Approximate Inference for Classification

So far we assumed $y = f(x) + \epsilon$ (Gaussian noise). For Binary Classification, $y \in \{-1, 1\}$, the likelihood is non-Gaussian (Bernoulli/logistic):

$$p(y_i \mid f_i) = \sigma(y_i f_i)$$

The posterior $p(f \mid y) \propto p(y \mid f)\, p(f)$ is no longer Gaussian. The integral $\int p(y \mid f)\, p(f)\, df$ is intractable. We need approximations.

7.1 The Laplace Approximation: We approximate the posterior with a Gaussian $q(f) = \mathcal{N}(\hat{f}, A^{-1})$ centered at the Maximum A Posteriori (MAP) mode.

  1. Find the mode $\hat{f} = \operatorname{argmax}_f \left( \log p(y \mid f) + \log p(f) \right)$, typically by Newton's method (a sketch follows this list).
  2. Compute the Hessian of the negative log posterior at the mode: $A = -\nabla\nabla \log p(f \mid y)\big|_{\hat{f}} = K^{-1} + W$, where $W$ is the diagonal matrix of negative second derivatives of the log-likelihood (e.g. $\pi_i(1 - \pi_i)$ for the logistic).
  3. Use $q(f)$ for predictions: $\mathbb{E}[f_*] = k_*^T K^{-1} \hat{f}$. The variance accounts for the curvature $W$.
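A minimal sketch of the Newton iteration in step 1 for the logistic likelihood; the plain solve below (rather than the numerically stabler $B = I + W^{1/2} K W^{1/2}$ form in Rasmussen & Williams) is an assumption of this sketch.

```python
import jax.numpy as jnp

def laplace_mode(K, y, num_iters=20):
    """Newton iteration for the MAP mode of a GP classifier with logistic
    likelihood p(y_i | f_i) = sigma(y_i f_i), y in {-1, +1}."""
    N = K.shape[0]
    f = jnp.zeros(N)
    for _ in range(num_iters):
        pi = 1.0 / (1.0 + jnp.exp(-f))       # pi_i = sigma(f_i)
        W = pi * (1.0 - pi)                  # diag of W = -d^2/df^2 log p(y|f)
        grad_ll = (y + 1) / 2 - pi           # d/df log p(y|f)
        b = W * f + grad_ll
        # Newton step: f <- (K^{-1} + W)^{-1} b = K (I + W K)^{-1} b
        f = K @ jnp.linalg.solve(jnp.eye(N) + W[:, None] * K, b)
    return f, W
```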

7.2 Expectation Propagation (EP): Laplace is a local approximation; EP is global. We approximate each likelihood term $p(y_i \mid f_i)$ with an unnormalized Gaussian site function $t_i(f_i)$:

$$p(y \mid f)\, p(f) \approx \prod_i t_i(f_i)\, p(f)$$

We iteratively update the site parameters by moment-matching the marginals against the cavity distributions. EP generally works better than Laplace for GP classification, but each iteration is more expensive.


8. Conclusion: The Gold Standard of Uncertainty

Gaussian Processes represent the “Bayesian Ideal” for regression. They are mathematically elegant, handling uncertainty, smoothness, and finite-sample corrections exactly. While the $O(N^3)$ cost was historically prohibitive, modern methods (Sparse Variational GPs, Spectral Kernels, Deep Kernel Learning) have made them scalable to millions of data points. Perhaps most importantly, they serve as a rigorous infinite-width limit for Deep Neural Networks: in that regime, training a massive Neural Network is implicitly Kernel Regression in a complex RKHS. Understanding GPs is therefore not just about “old school” statistics; it is about understanding the fundamental limits of Learning Theory itself.


Historical Timeline

| Year | Event | Significance |
|------|-------|--------------|
| 1900 | Karl Pearson | Invents PCA (Linear Latent Variable Model). |
| 1940s | Kolmogorov / Wiener | Develop optimal linear filtering (Kriging). |
| 1994 | Radford Neal | Proves the infinite-width limit of a Neural Network is a GP. |
| 2006 | Rasmussen & Williams | Publish “Gaussian Processes for Machine Learning”. |
| 2009 | Michalis Titsias | Sparse Variational GPs (Inducing Points). |
| 2013 | Wilson & Adams | Spectral Mixture Kernels. |
| 2016 | Wilson et al. | Deep Kernel Learning. |
| 2018 | Jacot et al. | Neural Tangent Kernel (NTK). |

Appendix A: Deriving the Matern Spectral Density

The Matern kernel arises not from arbitrary engineering, but from the Green’s function of a Stochastic Partial Differential Equation (SPDE). Consider the SPDE driven by white noise $W(x)$:

$$\left( \frac{2\nu}{\ell^2} - \Delta \right)^{\frac{\nu}{2} + \frac{d}{4}} f(x) = W(x)$$

This is a fractional Helmholtz equation. Taking the Fourier Transform of both sides:

$$\left( \frac{2\nu}{\ell^2} + \|\omega\|^2 \right)^{\frac{\nu}{2} + \frac{d}{4}} \hat{f}(\omega) = \hat{W}(\omega)$$

Since $W$ is white noise, its spectral density is constant, $S_W(\omega) = \text{const}$. The power spectral density $S_f(\omega)$ of the solution is the squared modulus of the transfer function:

$$S_f(\omega) \propto \left| \left( \frac{2\nu}{\ell^2} + \|\omega\|^2 \right)^{-\left(\frac{\nu}{2} + \frac{d}{4}\right)} \right|^2 \propto \left( \frac{2\nu}{\ell^2} + \|\omega\|^2 \right)^{-\nu - \frac{d}{2}}$$

This matches the spectral density of the Matern class exactly. To get the spatial kernel $k(r)$, we take the inverse Fourier transform:

$$k(r) = \int_{\mathbb{R}^d} e^{i \omega \cdot r} \frac{1}{(\lambda^2 + \|\omega\|^2)^\alpha}\, d\omega, \qquad \lambda^2 = \frac{2\nu}{\ell^2},\quad \alpha = \nu + \frac{d}{2}$$

Using Schwinger parametrizations and Bessel function identities, this integral yields:

$$k_\nu(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{\ell} \right)^\nu K_\nu\!\left( \frac{\sqrt{2\nu}\, r}{\ell} \right)$$

Physical Interpretation:

  • $\nu = 1/2$: The SPDE is first order (Ornstein-Uhlenbeck). The physics of Brownian motion with friction.
  • $\nu \to \infty$: The operator converges to the Heat Operator limit, yielding the Diffusion (Heat) Kernel, which is the Gaussian RBF.

Appendix B: Why GPs Matter for Deep Learning

Neal’s Theorem (1994): A single-hidden-layer Bayesian Neural Network with infinite width converges to a Gaussian Process. Let $f(x) = \frac{1}{\sqrt{H}} \sum_{j=1}^H v_j\, \sigma(w_j^T x + b_j)$. For fixed $x$, $f(x)$ is a normalized sum of i.i.d. random variables, so the Central Limit Theorem implies it becomes Gaussian as $H \to \infty$.

The Modern View (Jacot et al., 2018): Consider the training dynamics of an infinite-width network under Gradient Descent. The network evolves as a linear model with respect to a fixed kernel $\Theta(x, x')$, called the Neural Tangent Kernel (NTK):

$$\Theta(x, x') = \langle \nabla_\theta f(x), \nabla_\theta f(x') \rangle$$

In the infinite-width limit, $\Theta$ stays constant during training. The training dynamics are exactly given by the GP / Kernel Ridge Regression solution:

$$f_t(x) = \Theta(x, X)\, \Theta(X, X)^{-1} \left( I - e^{-\eta \Theta(X, X) t} \right) y$$

This shows that, in the infinite-width regime, “training a massive Neural Network” is functionally equivalent to “performing Kernel Regression with a specific architecture-dependent kernel”. We can compute these kernels analytically for ReLUs, CNNs, etc. For a ReLU network, the “Arc-Cosine Kernel” emerges:

$$k_{\text{ReLU}}(x, x') \propto \|x\|\,\|x'\| \left( \sin \phi + (\pi - \phi) \cos \phi \right)$$

where $\phi$ is the angle between the inputs. This gives us a closed-form way to understand what “inductive bias” a Neural Network architecture imposes: it specifies a specific RKHS norm.
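The NTK can also be estimated empirically for a finite network by taking the inner product of parameter gradients; the two-layer ReLU network below is a hypothetical example, not an architecture from the text. As the width $H$ grows, this empirical value concentrates around its deterministic infinite-width limit.

```python
import jax
import jax.numpy as jnp
from jax import random

def mlp(params, x):
    # One hidden ReLU layer, scalar output
    h = jnp.maximum(x @ params['W1'] + params['b1'], 0.0)
    return (h @ params['W2'])[0]

def empirical_ntk(params, x1, x2):
    # Theta(x, x') = <grad_theta f(x), grad_theta f(x')>
    g1 = jax.grad(mlp)(params, x1)
    g2 = jax.grad(mlp)(params, x2)
    return sum(jnp.vdot(g1[k], g2[k]) for k in g1)

key = random.PRNGKey(0)
k1, k2 = random.split(key)
H = 512
params = {'W1': random.normal(k1, (3, H)) / jnp.sqrt(3),
          'b1': jnp.zeros(H),
          'W2': random.normal(k2, (H, 1)) / jnp.sqrt(H)}
x, xp = jnp.ones(3), jnp.array([1.0, -1.0, 0.5])
print(float(empirical_ntk(params, x, xp)))
```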

Derivation of the Arc-Cosine Kernel: Consider a single neuron $h(x) = \sigma(w^T x)$ with $w \sim \mathcal{N}(0, I)$. The kernel is $k(x, y) = \mathbb{E}_w[\sigma(w^T x)\, \sigma(w^T y)]$. Let $\sigma(z) = \max(0, z)$ (ReLU). By rotational invariance, we can align $x$ with the first axis and place $y$ in the plane spanned by the first two axes. Input vectors: $x = \|x\| e_1$, $y = \|y\|(\cos\phi\, e_1 + \sin\phi\, e_2)$. Weights: $w = (w_1, w_2, \dots)$. The product of activations becomes $\max(0, \|x\| w_1)\, \max(0, \|y\|(w_1 \cos\phi + w_2 \sin\phi))$, so up to the factor $\|x\|\,\|y\|$ we need the two-dimensional Gaussian expectation:

$$J(\phi) = \frac{1}{2\pi}\iint \max(0, u)\,\max(0,\, u\cos\phi + v\sin\phi)\, e^{-(u^2+v^2)/2}\, du\, dv$$

The integration domain is the sector where both arguments are positive. Using polar coordinates $(u, v) = r(\cos\theta, \sin\theta)$:

$$J(\phi) = \frac{1}{2\pi}\int_0^\infty r^3 e^{-r^2/2}\, dr \int_{-\pi/2 + \phi}^{\pi/2} \cos\theta\, \cos(\theta - \phi)\, d\theta$$

The radial integral is $\int_0^\infty r^3 e^{-r^2/2}\, dr = 2$ and the angular integral is $\tfrac{1}{2}\left(\sin\phi + (\pi - \phi)\cos\phi\right)$, so $J(\phi) = \frac{1}{2\pi}\left(\sin\phi + (\pi - \phi)\cos\phi\right)$ and hence $k(x, y) = \frac{\|x\|\,\|y\|}{2\pi}\left(\sin\phi + (\pi - \phi)\cos\phi\right)$. This result (Cho & Saul, 2009) connects the geometric angle between data points directly to the correlation structure of the Deep Network. It implies that ReLU networks measure similarity based on angle, not Euclidean distance (unlike RBF). This explains their superior performance on high-dimensional image data, where angular distance is more meaningful than radial distance due to concentration of measure.
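The closed form can be checked against a Monte Carlo estimate of $\mathbb{E}_w[\mathrm{ReLU}(w^T x)\,\mathrm{ReLU}(w^T y)]$; the sample size and test vectors are assumptions of this sketch.

```python
import jax.numpy as jnp
from jax import random

def arccos_relu_kernel(x, y):
    """Closed-form E_w[relu(w.x) relu(w.y)] for w ~ N(0, I)."""
    nx, ny = jnp.linalg.norm(x), jnp.linalg.norm(y)
    cos_phi = jnp.clip(jnp.dot(x, y) / (nx * ny), -1.0, 1.0)
    phi = jnp.arccos(cos_phi)
    return nx * ny * (jnp.sin(phi) + (jnp.pi - phi) * jnp.cos(phi)) / (2 * jnp.pi)

key = random.PRNGKey(0)
x = jnp.array([1.0, 0.5, -0.2])
y = jnp.array([0.3, 1.0, 0.8])
W = random.normal(key, (200_000, 3))                          # Monte Carlo weights
mc = jnp.mean(jnp.maximum(W @ x, 0) * jnp.maximum(W @ y, 0))
print(float(arccos_relu_kernel(x, y)), float(mc))             # should agree closely
```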


Appendix C: Gaussian Process Latent Variable Models (GPLVM)

Standard PCA assumes data $Y$ comes from a linear mapping of latent variables $X$: $Y = WX + \epsilon$. Dual Probabilistic PCA (Lawrence, 2004) marginalizes $W$ and optimizes $X$. This yields a Gaussian Process mapping from latent space to data space: each output dimension of $Y$ is distributed as $\mathcal{N}(0, K(X, X) + \sigma^2 I)$. The GPLVM optimizes the latent locations $X$ to maximize the likelihood of the data $Y$.

The Objective:

$$\log p(Y \mid X) = -\frac{D}{2} \log |K| - \frac{1}{2} \operatorname{Tr}(K^{-1} Y Y^T) - \text{const}$$

Since $X$ is inside the non-linear kernel $K$, this is a non-convex optimization problem. The GPLVM learns the manifold structure. If we use a linear kernel, we recover PCA. If we use an RBF kernel, we get a non-linear embedding that preserves local distances (similar to t-SNE, but probabilistic).
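A minimal sketch of the (non-Bayesian) GPLVM objective with plain gradient descent on the latent locations; the RBF kernel, toy data, learning rate, and initialization are assumptions of this sketch.

```python
import jax.numpy as jnp
from jax import grad, random

def rbf(X, ls=1.0):
    sq = jnp.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * sq / ls**2)

def gplvm_neg_log_lik(X_latent, Y, noise_var=0.1):
    """-log p(Y|X) = D/2 log|K| + 1/2 Tr(K^{-1} Y Y^T) + N D/2 log(2 pi)."""
    N, D = Y.shape
    K = rbf(X_latent) + noise_var * jnp.eye(N)
    L = jnp.linalg.cholesky(K)
    logdet = 2.0 * jnp.sum(jnp.log(jnp.diag(L)))
    Kinv_Y = jnp.linalg.solve(K, Y)
    return 0.5 * D * logdet + 0.5 * jnp.sum(Y * Kinv_Y) + 0.5 * N * D * jnp.log(2 * jnp.pi)

# Gradient descent directly on the latent locations X (non-convex)
key = random.PRNGKey(0)
Y = random.normal(key, (20, 5))                     # toy high-dimensional data
X_latent = 0.1 * random.normal(key, (20, 2))        # 2-D latent initialization
for _ in range(100):
    X_latent = X_latent - 1e-2 * grad(gplvm_neg_log_lik)(X_latent, Y)
```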

Bayesian GPLVM (Titsias & Lawrence, 2010): The standard GPLVM treats $X$ as point parameters (MAP). We can be fully Bayesian: place a prior $X \sim \mathcal{N}(0, I)$ and approximate the posterior $q(X)$. This requires a variational lower bound slightly more complex than the sparse-GP one, because the kernel matrix $K_{ff}$ itself is now a random variable (since it depends on the stochastic $X$). Titsias derived a closed form for the $\Psi$-statistics:

$$\Psi_0 = \operatorname{Tr}(\mathbb{E}_q[K_{ff}]), \qquad \Psi_1 = \mathbb{E}_q[K_{fu}], \qquad \Psi_2 = \mathbb{E}_q[K_{uf} K_{fu}]$$

For the RBF kernel, these expectations are analytically tractable (involving Gaussian convolution). This allows for “Automatic Relevance Determination” (ARD) on the latent dimensions, effectively learning the intrinsic dimensionality of the data automatically.


Appendix D: Glossary of Definitions

  • Covariance Kernel: Encodes the similarity between points. Must be PSD.
  • Inducing Points: Pseudopoints used to summarize the infinite process for scalability ($O(NM^2)$).
  • Marginal Likelihood: The probability of data given hyperparameters, integrating out the function. The objective for model selection.
  • Posterior Contraction: How the uncertainty shrinks near observed data.
  • RKHS: Hilbert space defined by the kernel. Contains the “smooth” functions.
  • Spectral Density: Fourier transform of the kernel (Bochner’s Theorem).

References

1. Rasmussen, C. E., & Williams, C. K. I. (2006). “Gaussian Processes for Machine Learning”. The standard text. Covers regression, classification, covariance functions, and theoretical properties.

2. Wilson, A. G., & Adams, R. P. (2013). “Gaussian Process Kernels for Pattern Discovery and Extrapolation”. Introduced Spectral Mixture Kernels. Showed that standard kernels are just local averages; spectral kernels can model long-range structure and negative covariance.

3. Titsias, M. (2009). “Variational Learning of Inducing Variables in Sparse Gaussian Processes”. The breakthrough that made GPs scalable. Instead of Nystrom (which ignores data), Titsias treated inducing points as variational parameters to be optimized.

4. Jacot, A. et al. (2018). “Neural Tangent Kernel: Convergence and Generalization in Neural Networks”. The landmark paper connecting infinite width Neural Networks to Kernel methods, providing exact solvability for Deep Learning dynamics.

5. Lawrence, N. (2004). “Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data”. Introduced the GPLVM framework, flipping the GP regression problem to optimize inputs rather than weights.

6. MacKay, D. J. C. (1998). “Introduction to Gaussian Processes”. A classic tutorial by the late Sir David MacKay. Connects GPs to Neural Networks (early) and provides a Bayesian perspective on regularization.

7. Quinonero-Candela, J., & Rasmussen, C. E. (2005). “A Unifying View of Sparse Approximate Gaussian Process Regression”. The paper that organized the confusing zoo of sparse approximations (FITC, DTC, PITC) into a single effective theory.