RKHS & The Representer Theorem

The standard introduction to kernel methods starts with a feature map $\phi$ into some high-dimensional space $\mathcal{F}$ and then defines

$$k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{F}}$$

as a computational shortcut. The functional analysis view flips this around: the kernel comes first. A positive definite kernel $K$ picks out a unique Hilbert space $\mathcal{H}_K$ where pointwise evaluation is continuous. The constraint $|f(x)| \le C_x \|f\|_{\mathcal{H}}$ forces function values at points to be well-defined, unlike $L^2$, where functions are only defined up to measure-zero sets.

The cost is generality and the payoff is regularity. This trade-off is the foundation of non-parametric statistics.

It ties together three things that look unrelated: positive definite kernels, reproducing kernel Hilbert spaces, and Green's functions of differential operators.


Moore-Aronszajn

Let $\mathcal{H}$ be a Hilbert space of real-valued functions on a set $\mathcal{X}$. Call it an RKHS when $\delta_x: f \mapsto f(x)$ is bounded for every $x \in \mathcal{X}$.

Riesz representation then hands you a unique $K_x \in \mathcal{H}$ for each $x$ with

$$f(x) = \langle f, K_x \rangle_\mathcal{H} \quad \forall f \in \mathcal{H}$$

Set $K(x, y) = \langle K_x, K_y \rangle_\mathcal{H}$ and the reproducing property falls out right away.

$$K(x, \cdot) = K_x \implies \langle K(\cdot, x), K(\cdot, y) \rangle_\mathcal{H} = K(x, y)$$

and $\|K_x\|^2 = K(x, x) \ge 0$.

Theorem (Moore-Aronszajn, 1950): Every symmetric positive definite kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ has a unique RKHS $\mathcal{H}_k$ with $k$ as its reproducing kernel.

The proof is constructive. Take all finite linear combinations of kernel sections

$$\mathcal{H}_0 = \left\{ f(\cdot) = \sum_{i=1}^n \alpha_i k(x_i, \cdot) : n \in \mathbb{N},\ \alpha_i \in \mathbb{R},\ x_i \in \mathcal{X} \right\}$$

Give it the inner product, for $f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$ and $g = \sum_{j=1}^m \beta_j k(y_j, \cdot)$,

$$\langle f, g \rangle_{\mathcal{H}_0} = \sum_{i=1}^n \sum_{j=1}^m \alpha_i \beta_j k(x_i, y_j)$$

which is well-defined because $k$ is positive definite. Complete the space. Since $|f(x)| \le \sqrt{k(x,x)}\, \|f\|$, Cauchy sequences converge pointwise to actual functions and not abstract equivalence classes. The completion is made of functions and evaluation continuity carries over to the closure. $\square$

Bochner and shift-invariant kernels

On $\mathcal{X} = \mathbb{R}^d$ a shift-invariant kernel has the form $k(x, y) = \psi(x - y)$. Bochner's theorem says $\psi$ is positive definite iff it is the Fourier transform of a finite non-negative measure $\mu$.

$$\psi(\delta) = \int_{\mathbb{R}^d} e^{i \omega^T \delta} \, d\mu(\omega)$$

The Gaussian (RBF) kernel has a Gaussian spectral measure with full-frequency support, which is what gives it universal approximation. A band-limited spectral measure gives a kernel that only sees certain frequencies; the sinc kernel is the standard example.

Rahimi and Recht in 2007 exploited this directly by sampling frequencies $\omega \sim \mu$ to get Random Fourier Features.


The Representer Theorem

SVM and kernel ridge regression solutions are always kernel expansions centered at the data points. The representer theorem explains why.

Take any regularized risk minimization problem

$$J(f) = \sum_{i=1}^n L(y_i, f(x_i)) + \lambda \Omega(\|f\|_\mathcal{H})$$

where $L$ is an arbitrary loss and $\Omega$ is strictly increasing, usually just $\Omega(z) = z^2$.

Theorem (Scholkopf, Herbrich, Smola 2001): The minimizer $f^*$ satisfies

$$f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$$

Proof. Split $f = f_{||} + f_{\perp}$ where $f_{||}$ lives in $V = \text{span}\{k_{x_1}, \dots, k_{x_n}\}$ and $f_{\perp} \perp V$.

At the data points the orthogonal part vanishes.

$$f(x_i) = \langle f, k_{x_i} \rangle = \langle f_{||}, k_{x_i} \rangle$$

So the loss term only depends on $f_{||}$. But the norm splits as follows.

$$\|f\|^2 = \|f_{||}\|^2 + \|f_{\perp}\|^2$$

Any nonzero $f_{\perp}$ pumps up the penalty without reducing the loss. So $f^* = f_{||}$. $\square$

The whole argument boils down to orthogonal projection.

SVMs as a special case

Plug in hinge loss $L(y, f(x)) = \max(0, 1 - y f(x))$ and you get

$$\min_{f \in \mathcal{H}} \sum_{i=1}^n \max(0, 1 - y_i f(x_i)) + \frac{\lambda}{2} \|f\|_\mathcal{H}^2$$

The representer theorem gives $f(x) = \sum_i \alpha_i k(x_i, x)$. Sparsity, with most $\alpha_i = 0$, comes from the loss and not the kernel. The hinge loss is flat for correctly-classified points far from the margin, so their Lagrange multipliers vanish. The kernel expansion comes from the regularizer and the sparsity comes from the loss. Two separate mechanisms.
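
To see the two mechanisms separately, here is a minimal sketch, assuming scikit-learn is available (the library choice is mine, not part of the text): fit an SVM on a precomputed RBF Gram matrix and count how many $\alpha_i$ survive as support vectors.

```python
import numpy as np
from sklearn.svm import SVC  # assumed dependency for this sketch

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # a roughly separable toy problem

# Precomputed RBF Gram matrix: the form of the expansion is fixed by the representer theorem.
gamma = 1.0
K = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

svm = SVC(C=1.0, kernel="precomputed").fit(K, y)

# Sparsity comes from the hinge loss: only support vectors keep a nonzero alpha_i
# in f(x) = sum_i alpha_i k(x_i, x) + b.
print(f"nonzero coefficients: {len(svm.support_)} of {len(X)}")
```

Swapping the hinge loss for squared error, as in the kernel ridge section below, keeps the same expansion but gives up the sparsity.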


Kernel Ridge Regression

Consider squared error loss

$$J(f) = \|y - \mathbf{f}\|_2^2 + \lambda \|f\|_\mathcal{H}^2, \qquad \mathbf{f} = (f(x_1), \dots, f(x_n))$$

Substitute $f(x) = \sum_j \alpha_j k(x_j, x)$. The data fit becomes $\|y - K\alpha\|_2^2$ and the regularizer becomes $\alpha^T K \alpha$.

$$J(\alpha) = (y - K\alpha)^T (y - K\alpha) + \lambda \alpha^T K \alpha$$

Differentiate, set to zero, and you get

$$(K + \lambda I)\alpha = y \quad\Longrightarrow\quad \alpha = (K + \lambda I)^{-1} y$$

The prediction at $x_*$ is then $f(x_*) = k_*^T (K + \lambda I)^{-1} y$, where $k_* = (k(x_1, x_*), \dots, k(x_n, x_*))$.

Connection to Gaussian process regression

This is the same expression as the GP posterior mean. A GP prior $f \sim \mathcal{GP}(0, k)$ with noise variance $\sigma_n^2$ gives a posterior mean of $k_*^T (K + \sigma_n^2 I)^{-1} y$ and this matches the kernel ridge prediction when $\lambda = \sigma_n^2$.

The negative log-posterior under a Gaussian likelihood is

$$\frac{1}{2\sigma_n^2} \sum_i (y_i - f(x_i))^2 + \frac{1}{2} \|f\|_\mathcal{H}^2$$

which is the ridge objective up to a constant. Ridge regularization is Bayesian inference with a Gaussian prior and $\lambda$ is the noise-to-signal ratio.
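
A quick numerical check of that identity, as a sketch that assumes scikit-learn's GP implementation is available (not something the text itself uses): kernel ridge with $\lambda = \sigma_n^2$ and the GP posterior mean should agree to machine precision.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor  # assumed dependency
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X).ravel() + 0.1 * rng.normal(size=30)
X_star = np.linspace(-1, 1, 50).reshape(-1, 1)

length_scale, noise_var = 0.2, 0.01          # lambda = sigma_n^2 = 0.01

# Kernel ridge regression with k(x, y) = exp(-||x - y||^2 / (2 l^2))
def k(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * length_scale**2))

alpha = np.linalg.solve(k(X, X) + noise_var * np.eye(len(X)), y)
krr_mean = k(X_star, X) @ alpha

# GP posterior mean with the same kernel and noise variance (hyperparameters held fixed).
gp = GaussianProcessRegressor(kernel=RBF(length_scale), alpha=noise_var, optimizer=None).fit(X, y)
gp_mean = gp.predict(X_star)

print(np.allclose(krr_mean, gp_mean))        # True: same formula, two derivations
```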


Green’s Functions and Differential Operators

RKHS theory ties into PDEs through Green’s functions.

Given a linear differential operator $P$, its Green's function $G$ solves $P G(\cdot, y) = \delta(\cdot - y)$. If $k(x,y)$ is the Green's function of $L = P^* P$ then the RKHS norm becomes

$$\|f\|_\mathcal{H}^2 = \|P f\|_{L^2}^2 = \int (Pf(x))^2 \, dx$$

Cubic splines are a concrete example. The kernel $k(x, y) = \min(x, y)^2 \,(3\max(x, y) - \min(x, y))$ produces functions that minimize

$$\|f\|_\mathcal{H}^2 = \int (f''(x))^2 \, dx$$

The operator is $P = d^2/dx^2$ and the kernel is the Green's function of $d^4/dx^4$. Spline interpolation solves a variational problem and finds the interpolant of minimum curvature.
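
A minimal interpolation sketch with the spline kernel quoted above (toy data of my choosing; functions in this RKHS vanish with zero slope at the origin, so the points are kept inside $(0, 1]$, and the kernel's overall scaling cancels in the solve):

```python
import numpy as np

def spline_kernel(a, b):
    # k(x, y) = min(x, y)^2 (3 max(x, y) - min(x, y)), the Green's-function kernel above
    lo, hi = np.minimum.outer(a, b), np.maximum.outer(a, b)
    return lo**2 * (3 * hi - lo)

x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y = np.sin(2 * np.pi * x)

alpha = np.linalg.solve(spline_kernel(x, x), y)     # K alpha = y: exact interpolation

xs = np.linspace(0, 1, 200)
f = spline_kernel(xs, x) @ alpha                    # f = sum_i alpha_i k(x_i, .)
print(np.allclose(spline_kernel(x, x) @ alpha, y))  # hits every data point
# Among interpolants in this RKHS, f minimizes the curvature penalty int (f'')^2 dx.
```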


Neural Tangent Kernel

Infinite-width networks at initialization converge to GPs. The question is which kernel governs training.

Taylor expand $f(x; \theta)$ around initialization $\theta_0$

$$f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^T (\theta - \theta_0)$$

This is a linear model with feature map $\phi(x) = \nabla_\theta f(x; \theta_0)$. The NTK is

$$\Theta(x, y) = \lim_{m \to \infty} \sum_{p=1}^P \partial_{\theta_p} f(x) \, \partial_{\theta_p} f(y)$$

For ReLU networks you can compute this analytically layer by layer.

Two regimes matter. In the lazy regime the network is very wide, the weights barely move, $\Theta_t \approx \Theta_0$, and training reduces to kernel ridge regression with a fixed kernel. The representer theorem applies directly. In the rich regime the network has finite width, the kernel evolves during training, and the network learns features and not just coefficients. Standard kernel theory no longer applies.
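
The kernel itself is easy to estimate at finite width. Below is a sketch (architecture and scalings are my choices, not the text's) of the empirical NTK $\sum_p \partial_{\theta_p} f(x)\, \partial_{\theta_p} f(y)$ for a one-hidden-layer ReLU network $f(x) = \tfrac{1}{\sqrt{m}} \sum_j a_j\,\mathrm{relu}(w_j^T x)$:

```python
import numpy as np

def empirical_ntk(X, m=20000, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(m, X.shape[1]))          # hidden weights w_j ~ N(0, I)
    a = rng.normal(size=m)                        # output weights a_j ~ N(0, 1)
    pre = X @ W.T                                 # (n, m) pre-activations w_j . x

    # d f / d a_j = relu(w_j . x) / sqrt(m)
    g_a = np.maximum(pre, 0.0) / np.sqrt(m)
    # d f / d w_j = a_j * 1[w_j . x > 0] * x / sqrt(m); its Gram contribution factors
    # into an indicator/output-weight term times x . x', so the gradients are never stored.
    mask = (pre > 0) * a / np.sqrt(m)

    return g_a @ g_a.T + (mask @ mask.T) * (X @ X.T)

X = np.random.default_rng(1).normal(size=(4, 3))
print(np.round(empirical_ntk(X), 3))   # stabilizes as m grows (lazy regime); fluctuates at small m
```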


Mercer’s Theorem

Let $X$ be compact and $k$ continuous. The integral operator

$$T_k f(x) = \int_X k(x, y) f(y) \, d\mu(y)$$

is compact, self-adjoint, and positive. The spectral theorem gives

$$k(x, y) = \sum_{j=1}^\infty \lambda_j e_j(x) e_j(y)$$

with $\lambda_j \ge 0$ and $\{e_j\}$ orthonormal in $L^2(\mu)$.

The RKHS norm in this basis is

$$\|f\|_\mathcal{H}^2 = \sum_{j=1}^\infty \frac{\langle f, e_j \rangle_{L^2}^2}{\lambda_j}$$

High-frequency eigenfunctions usually have small $\lambda_j$. For a finite RKHS norm the high-frequency coefficients have to decay accordingly. This pins down exactly the smoothness the kernel imposes.
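
A finite-sample way to see this decay is to eigendecompose an RBF Gram matrix on a grid; scaled by $1/n$, its eigenvalues approximate the leading Mercer eigenvalues (a rough sketch, grid and length scale are arbitrary choices):

```python
import numpy as np

n = 400
x = np.linspace(0.0, 1.0, n).reshape(-1, 1)
K = np.exp(-((x - x.T) ** 2) / (2 * 0.1**2))   # RBF kernel, length scale 0.1

lam = np.linalg.eigvalsh(K)[::-1] / n          # descending; lam_j approximates the Mercer spectrum
print(lam[:8])                                  # fast, near-geometric decay
print(lam[40] / lam[0])                         # tiny tail: rough functions have enormous RKHS norm
```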


Implementation

```python
import numpy as np
import matplotlib.pyplot as plt

def rbf_kernel(X1, X2, gamma=10.0):
    diffs = X1[:, np.newaxis, :] - X2[np.newaxis, :, :]
    dists_sq = np.sum(diffs**2, axis=2)
    return np.exp(-gamma * dists_sq)

def kernel_ridge_regression(X_train, y_train, X_test, lambd=1e-5):
    # K matrix
    K = rbf_kernel(X_train, X_train)
    n = len(X_train)
    # Solve (K + lambda I) alpha = y
    alpha = np.linalg.solve(K + lambd * np.eye(n), y_train)
    # Predict
    K_star = rbf_kernel(X_test, X_train)
    f_star = K_star @ alpha
    return f_star

def demo_representer():
    # Data: noisy sine
    np.random.seed(42)
    X = np.sort(np.random.rand(20, 1) * 2 - 1, axis=0)
    y = np.sin(3 * np.pi * X).ravel() + 0.1 * np.random.randn(20)
    X_test = np.linspace(-1, 1, 200).reshape(-1, 1)

    # Fit models with different regularization
    preds_tight = kernel_ridge_regression(X, y, X_test, lambd=1e-5)  # interpolates
    preds_loose = kernel_ridge_regression(X, y, X_test, lambd=1.0)   # smooth

    plt.figure(figsize=(10, 6))
    plt.scatter(X, y, c='r', label='Data')
    plt.plot(X_test, preds_tight, label='Lambda=1e-5 (Interp)')
    plt.plot(X_test, preds_loose, label='Lambda=1.0 (Smooth)')
    plt.title('Representer Theorem in Action')
    plt.legend()
    # plt.show()

    # Observation:
    # Small lambda -> function passes through the points (noise interpolation, high norm).
    # Large lambda -> function is flatter (low norm).
    # Both are just sums of Gaussians centered at the data points.

if __name__ == "__main__":
    demo_representer()
```

Kernel Mean Embeddings and MMD

You can embed whole distributions into the RKHS. The mean embedding of $P$ is

$$\mu_P = \mathbb{E}_{x \sim P} [k(x, \cdot)] = \int_\mathcal{X} k(x, \cdot) \, dP(x)$$

If the kernel is characteristic (the Gaussian kernel qualifies), this embedding is injective and $\mu_P$ pins down $P$ uniquely.

This gives you a natural distance between distributions.

$$\text{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|_\mathcal{H}^2$$

Expanding it gives

$$\text{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}[k(x, x')] - 2\,\mathbb{E}_{x \sim P, y \sim Q}[k(x, y)] + \mathbb{E}_{y, y' \sim Q}[k(y, y')]$$

This is useful for two-sample testing and for generative models like MMD-GANs. The unbiased estimator is closed-form and differentiable and that is rare for distribution distances.
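
A sketch of the unbiased estimator in numpy (toy Gaussians of my choosing; the within-sample averages drop the diagonal terms, which is what makes the estimator unbiased):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))

def mmd2_unbiased(X, Y, gamma=1.0):
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf(X, X, gamma), rbf(Y, Y, gamma), rbf(X, Y, gamma)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))   # E[k(x, x')] with x != x'
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(0)
P, P2 = rng.normal(size=(500, 2)), rng.normal(size=(500, 2))
Q = rng.normal(size=(500, 2)) + 0.5
print(mmd2_unbiased(P, P2))   # near zero: same distribution
print(mmd2_unbiased(P, Q))    # clearly positive: the embedding separates P and Q
```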


Ill-posedness and Scalability

An infinite-dimensional RKHS like the RBF one means $\lambda_j \to 0$ and $T_k^{-1}$ is unbounded. Without regularization, small perturbations in the observations fire back as large oscillations in $f$; this is the standard ill-posed inverse problem.

Computation is a separate issue. The Gram matrix is $n \times n$ and needs $O(n^2)$ storage and $O(n^3)$ to invert. For $n$ in the millions, direct methods are infeasible.

Random Fourier Features from Rahimi and Recht handle this by Monte Carlo approximation of the Bochner integral.

$$\phi(x) = \sqrt{\frac{1}{D}}\left[\cos(\omega_1^T x), \sin(\omega_1^T x), \dots, \cos(\omega_D^T x), \sin(\omega_D^T x)\right]$$

This reduces a non-parametric problem to linear regression in a random feature space. The Nystrom method is the other main approach and it uses a low-rank approximation of $K$ through landmark points.
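
A sketch of the RFF construction for the RBF kernel $\exp(-\gamma\|x-y\|^2)$, whose spectral measure is $\mathcal{N}(0, 2\gamma I)$ (the feature layout here stacks all cosines then all sines; the inner product is the same as the interleaved form above):

```python
import numpy as np

def rff_features(X, D=500, gamma=10.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, X.shape[1]))   # omega_j ~ spectral measure
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

X = np.random.default_rng(1).uniform(-1, 1, size=(300, 2))
Z = rff_features(X)                                                   # (n, 2D) explicit feature map
K_exact = np.exp(-10.0 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
print(np.abs(Z @ Z.T - K_exact).max())    # shrinks at the Monte Carlo rate as D grows
```

Ridge regression on these features is then an ordinary linear solve, roughly $O(nD^2)$ instead of $O(n^3)$.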


RKHS-Sobolev equivalence

Theorem: For the Matern kernel $k_\nu$ the space $\mathcal{H}_k$ is isomorphic to the Sobolev space $H^s(\mathbb{R}^d)$ with $s = \nu + d/2$.

The Sobolev norm in the Fourier domain is

$$\|f\|_{H^s}^2 = \int (1 + \|\omega\|^2)^s |\hat{f}(\omega)|^2 \, d\omega$$

For a shift-invariant kernel with spectrum $S(\omega)$ the RKHS norm is

$$\|f\|_\mathcal{H}^2 = \int \frac{|\hat{f}(\omega)|^2}{S(\omega)} \, d\omega$$

Matern has $S(\omega) \propto (1 + \|\omega\|^2)^{-(\nu+d/2)}$ and substituting gives

$$\|f\|_\mathcal{H}^2 = \int |\hat{f}(\omega)|^2 (1 + \|\omega\|^2)^{\nu+d/2} \, d\omega$$

which is exactly the $H^{\nu+d/2}$ norm.

Corollary: RKHS functions are continuous when $s > d/2$ by Sobolev embedding. Since $s = \nu + d/2$ you need $\nu > 0$. All Matern kernels with $\nu > 0$ give continuous functions.


The Nystrom approximation

Approximate $K$ with $m \ll n$ landmark points. Let $K_{mm}$ be the landmark kernel matrix and $K_{nm}$ the cross-kernel matrix. Then

$$\tilde{K} = K_{nm} K_{mm}^{-1} K_{nm}^T$$

Woodbury gives you the regularized inverse in $O(nm^2)$.

$$(\tilde{K} + \lambda I)^{-1} = \frac{1}{\lambda} \left( I - K_{nm} (\lambda K_{mm} + K_{nm}^T K_{nm})^{-1} K_{nm}^T \right)$$

This is linear in $n$ instead of cubic. Pick the landmarks by K-means or leverage score sampling.
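
A sketch of Nystrom kernel ridge regression using exactly the Woodbury identity above (random landmarks and a small jitter on $K_{mm}$ are my choices):

```python
import numpy as np

def rbf(A, B, gamma=10.0):
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
n, m, lam = 2000, 100, 1e-2
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * np.pi * X).ravel() + 0.1 * rng.normal(size=n)

idx = rng.choice(n, size=m, replace=False)            # landmark points
K_nm = rbf(X, X[idx])
K_mm = rbf(X[idx], X[idx]) + 1e-8 * np.eye(m)         # jitter for numerical stability

# alpha = (K_tilde + lam I)^{-1} y via Woodbury: only an m x m system is solved.
inner = np.linalg.solve(lam * K_mm + K_nm.T @ K_nm, K_nm.T @ y)
alpha = (y - K_nm @ inner) / lam

X_test = np.linspace(-1, 1, 200).reshape(-1, 1)
f_test = rbf(X_test, X) @ alpha                       # kernel expansion with the approximate alpha
```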


Why the RKHS norm measures smoothness

For $k(x, y) = \psi(x-y)$ shift-invariant, the inner product comes from Plancherel's theorem.

$$\langle f, g \rangle_\mathcal{H} = \int \frac{\hat{f}(\omega) \overline{\hat{g}(\omega)}}{\hat{\psi}(\omega)} \, d\omega$$

For the Gaussian kernel $\hat{\psi}(\omega) = e^{-\sigma^2 \omega^2}$. The inverse $e^{\sigma^2 \omega^2}$ grows exponentially with frequency. So $\|f\|_\mathcal{H}^2$ slams high frequencies hard and the RKHS functions come out analytic.

More generally the regularization problem $\min \|y - T_K \alpha\|^2 + \lambda \langle \alpha, T_K \alpha \rangle$ reduces to

$$(T_K + \lambda I) \alpha = y$$

which is Tikhonov regularization for recovering a function from noisy point evaluations and this ties RKHS directly to classical regularization theory.


Kernels and deep learning

Three facts fit together.

First, infinite-width networks at initialization are GPs; Neal showed this in 1996. Second, under gradient descent the network evolves as a kernel predictor.

$$\partial_t f_t(x) = - \eta \sum_{i=1}^n \Theta_t(x, x_i) (f_t(x_i) - y_i)$$

In the lazy regime where the network is very wide, $\Theta_t \approx \Theta_0$ stays constant and training reduces to kernel ridge regression with the NTK (simulated in the sketch after these three facts).

Third, at finite width $\Theta_t$ changes during training. The kernel lines up with the target function; this is a form of feature learning that fixed-kernel methods cannot replicate. How and why the kernel evolves is still a central open problem in deep learning theory.
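
The lazy-regime dynamics above are easy to simulate on the training set alone: with a frozen kernel, discrete gradient steps on $f_t(x_i)$ follow the ODE and the residual decays along the kernel's eigendirections. A sketch with a stand-in RBF Gram matrix playing the role of $\Theta_0$ (not an actual network NTK):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 1))
y = np.sin(3 * X).ravel()
Theta = np.exp(-5.0 * (X - X.T) ** 2)     # frozen kernel standing in for Theta_0

f = np.zeros(len(y))                       # predictions at the training points, zero init
eta = 0.05
residuals = []
for _ in range(2000):
    f -= eta * Theta @ (f - y)             # Euler step of  d f / dt = -eta Theta (f - y)
    residuals.append(np.linalg.norm(f - y))

print(residuals[0], residuals[-1])         # monotone decrease; slow modes are small kernel eigenvalues
```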


Kernels and their spectra

The spectral measure from Bochner's theorem is what picks out which functions live in the RKHS.

Gaussian (RBF): $\psi(x) = e^{-\|x\|^2 / 2\sigma^2}$. Gaussian spectrum and full-frequency support. The RKHS is dense in $C(\mathcal{X})$ and functions are $C^\infty$.

Laplacian: $\psi(x) = e^{-\|x\|_1 / \sigma}$. Cauchy spectrum with heavy tails. High frequencies are tolerated. The kernel is not differentiable at $x = y$, so the RKHS admits much rougher functions. Well-suited to sharp transitions.

Matern: Generalized Student-t spectrum $(1 + \|\omega\|^2)^{-\nu - d/2}$. The parameter $\nu$ controls differentiability and $\nu \to \infty$ recovers the RBF kernel.

NTK (ReLU): Polynomial spectral decay, roughly $O(\|\omega\|^{-d-1})$. This explains why neural networks handle non-smooth targets better than RBF kernels: the RBF prior forces too much smoothness.


How RKHS theory developed

The story starts with James Mercer's 1909 proof of the spectral decomposition of positive integral operators, a result in pure analysis with no tie to machine learning. Two decades later, in the 1930s, Sergei Sobolev introduced distributions and generalized functions and laid the functional-analytic groundwork. The decisive step came in 1950 when Nachman Aronszajn formalized RKHS theory, proved the Moore-Aronszajn theorem, and pinned down the one-to-one correspondence between positive definite kernels and reproducing kernel Hilbert spaces. Grace Wahba, through the 1970s, connected spline smoothing to RKHS theory, introduced Generalized Cross-Validation, and made the theory practical for statistics.

The machine learning era kicked off in 1992 when Boser, Guyon, and Vapnik introduced the kernel trick for support vector machines and turned RKHS from a theoretical curiosity into a computational tool. Radford Neal showed in 1996 that infinite neural networks converge to Gaussian processes, hinting at the deep connection between kernels and neural architectures. The representer theorem was generalized to arbitrary monotonic regularizers by Scholkopf et al. in 2001. Rahimi and Recht introduced Random Fourier Features in 2007 and made kernel methods scalable by leaning on Bochner's theorem. Most recently, Jacot et al. in 2018 introduced the Neural Tangent Kernel and closed the circle by showing that wide neural networks are, in a precise sense, kernel machines.


Key terms

  • Characteristic Kernel: A kernel whose mean embedding is injective, meaning it uniquely identifies distributions.
  • Gram Matrix: The $n \times n$ matrix $K_{ij} = k(x_i, x_j)$ formed by evaluating the kernel at all pairs of data points.
  • Green’s Function: The impulse response of a differential operator. When the kernel is a Green’s function, the RKHS norm penalizes derivatives.
  • Mercer’s Theorem: The eigenfunction expansion $k(x,y) = \sum_j \lambda_j e_j(x) e_j(y)$, which decomposes a kernel into its spectral components.
  • MMD: Maximum Mean Discrepancy, a distance between distributions computed via their mean embeddings in an RKHS.
  • Nystrom Method: A low-rank approximation of the Gram matrix using a subset of landmark points, reducing $O(n^3)$ to $O(nm^2)$.
  • Positive Definite Function: A function $k$ such that the Gram matrix is PSD for every finite point set.
  • RKHS: A Hilbert space of functions in which pointwise evaluation $f \mapsto f(x)$ is a continuous linear functional.
  • Shift-Invariant: A kernel of the form $k(x,y) = \psi(x - y)$, depending only on the difference.
  • Spectral Measure: The Fourier transform of a shift-invariant kernel, which determines which frequencies the RKHS can represent.

References

1. Scholkopf, B., & Smola, A. J. (2002). “Learning with Kernels”. The comprehensive reference for SVMs and Regularization Networks.

2. Aronszajn, N. (1950). “Theory of reproducing kernels”. The foundational math paper. Proves Moore-Aronszajn.

3. Jacot, A. et al. (2018). “Neural Tangent Kernel”. Proved that infinite-width networks are kernel machines. Sparked the modern era of deep learning theory.

4. Rahimi, A., & Recht, B. (2007). “Random Features for Large-Scale Kernel Machines”. Showed that explicit feature maps sampled from the Fourier transform can approximate kernels efficiently. $O(n)$ instead of $O(n^3)$.

5. Berlinet, A., & Thomas-Agnan, C. (2011). “Reproducing Kernel Hilbert Spaces in Probability and Statistics”. Rigorous treatment of the functional analysis aspects.

6. Wahba, G. (1990). “Spline Models for Observational Data”. The bible of splines and thin-plate splines as RKHS problems.