RKHS & The Representer Theorem
The standard introduction to kernel methods starts with a feature map $\phi : \mathcal{X} \to \mathcal{H}$ into some high-dimensional space and then defines $k(x, x') = \langle \phi(x), \phi(x') \rangle$ as a computational shortcut. The functional analysis view flips this around: the kernel comes first. A positive definite kernel picks out a unique Hilbert space where pointwise evaluation is continuous. That constraint forces function values at individual points to be well-defined, unlike in $L^2$, where functions are only defined up to measure-zero sets.
The cost is generality and the payoff is regularity. This trade-off is the foundation of non-parametric statistics.
It ties together three things that look unrelated: positive definite kernels, reproducing kernel Hilbert spaces, and Green’s functions of differential operators.
Moore-Aronszajn
Let $\mathcal{H}$ be a Hilbert space of real-valued functions on a set $\mathcal{X}$. Call it an RKHS when the evaluation functional $\delta_x : f \mapsto f(x)$ is bounded for every $x \in \mathcal{X}$.
Riesz representation then hands you a unique $k_x \in \mathcal{H}$ for each $x$ with
$$f(x) = \langle f, k_x \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}.$$
Set $k(x, x') = k_x(x')$ and the reproducing property falls out right away:
$$k(x, \cdot) \in \mathcal{H} \quad \text{and} \quad \langle f, k(x, \cdot) \rangle_{\mathcal{H}} = f(x).$$
Theorem (Moore-Aronszajn, 1950): Every symmetric positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ has a unique RKHS with $k$ as its reproducing kernel.
The proof is constructive. Take all finite linear combinations of kernel sections
$$f = \sum_{i=1}^{n} \alpha_i \, k(x_i, \cdot).$$
Give it the inner product
$$\Big\langle \sum_i \alpha_i k(x_i, \cdot), \; \sum_j \beta_j k(x_j, \cdot) \Big\rangle = \sum_{i,j} \alpha_i \beta_j \, k(x_i, x_j),$$
which is well-defined because $k$ is positive definite. Complete the space. Because evaluation is continuous, Cauchy sequences converge pointwise to actual functions rather than abstract equivalence classes: the completion is made of functions, and evaluation continuity carries over to the closure.
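A quick numerical sanity check of the construction (a NumPy sketch; the point set, $\gamma$, and tolerances are arbitrary choices): the Gram matrix of any finite point set is PSD, the norm of $f = \sum_i \alpha_i k(x_i, \cdot)$ is $\alpha^\top K \alpha$, and evaluation is bounded by the norm via Cauchy-Schwarz, $|f(x)| \le \|f\|_{\mathcal{H}} \sqrt{k(x, x)}$.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 2))
gamma = 0.5

def k(a, b):
    return np.exp(-gamma * np.sum((a - b) ** 2))

K = np.array([[k(xi, xj) for xj in X] for xi in X])
print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to round-off

alpha = rng.standard_normal(10)
f_norm = np.sqrt(alpha @ K @ alpha)                     # ||f||_H^2 = alpha^T K alpha
x = rng.standard_normal(2)
f_x = sum(a * k(xi, x) for a, xi in zip(alpha, X))
print(abs(f_x) <= f_norm * np.sqrt(k(x, x)) + 1e-12)    # True: evaluation is bounded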
Bochner and shift-invariant kernels
On $\mathbb{R}^d$ a shift-invariant kernel has the form $k(x, x') = \kappa(x - x')$. Bochner’s theorem says $\kappa$ is positive definite iff it is the Fourier transform of a finite non-negative measure $\mu$:
$$\kappa(\delta) = \int_{\mathbb{R}^d} e^{i \omega^\top \delta} \, d\mu(\omega).$$
The Gaussian (RBF) kernel has a Gaussian spectral measure with full-frequency support, and that is what gives it universal approximation. A band-limited spectral measure gives a kernel that only sees certain frequencies; the sinc kernel is the standard example.
Rahimi and Recht in 2007 exploited this directly by sampling frequencies to get Random Fourier Features.
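A Monte Carlo check of Bochner’s theorem for the RBF kernel $e^{-\gamma \|\delta\|^2}$, whose spectral measure is the Gaussian $\mathcal{N}(0, 2\gamma I)$ (a sketch; the dimension, $\gamma$, and sample count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
d, gamma, n_freqs = 3, 0.5, 200_000
omega = rng.normal(scale=np.sqrt(2 * gamma), size=(n_freqs, d))  # spectral samples

delta = rng.standard_normal(d)
mc_estimate = np.mean(np.cos(omega @ delta))   # E[cos(w . delta)] over the spectrum
exact = np.exp(-gamma * np.sum(delta ** 2))
print(mc_estimate, exact)                      # agree to a few decimal places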
The Representer Theorem
SVM and kernel ridge regression solutions are always kernel expansions centered at the data points. The representer theorem explains why.
Take any regularized risk minimization
$$\min_{f \in \mathcal{H}} \; \sum_{i=1}^{n} L(y_i, f(x_i)) + \Omega(\|f\|_{\mathcal{H}}),$$
where $L$ is whatever loss you want and $\Omega$ is strictly increasing, usually just $\Omega(t) = \lambda t^2$.
Theorem (Schölkopf, Herbrich, Smola 2001): The minimizer satisfies
$$f^\star = \sum_{i=1}^{n} \alpha_i \, k(x_i, \cdot).$$
Proof. Split $f = f_\parallel + f_\perp$, where $f_\parallel$ lives in $\mathrm{span}\{k(x_1, \cdot), \dots, k(x_n, \cdot)\}$ and $f_\perp$ is orthogonal to that span.
At the data points the orthogonal part vanishes:
$$f_\perp(x_j) = \langle f_\perp, k(x_j, \cdot) \rangle = 0.$$
So the loss term only depends on $f_\parallel$. But the norm splits as
$$\|f\|_{\mathcal{H}}^2 = \|f_\parallel\|_{\mathcal{H}}^2 + \|f_\perp\|_{\mathcal{H}}^2.$$
Any nonzero $f_\perp$ pumps up the penalty without reducing the loss. So $f_\perp = 0$.
The whole argument boils down to orthogonal projection.
SVMs as a special case
Plug in hinge loss and you get
$$\min_{f \in \mathcal{H}} \; \sum_{i=1}^{n} \max(0, 1 - y_i f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2.$$
The representer theorem gives $f^\star = \sum_i \alpha_i k(x_i, \cdot)$. Sparsity, with most $\alpha_i = 0$, comes from the loss and not the kernel: the hinge loss is flat for correctly-classified points far from the margin, so their Lagrange multipliers vanish. The kernel expansion comes from the regularizer and the sparsity comes from the loss. Two separate mechanisms.
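A quick empirical check, assuming scikit-learn is available: fit an RBF SVM on overlapping classes and count how many training points survive as support vectors.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
y = np.sign(X[:, 0] + 0.3 * rng.standard_normal(500))  # noisy linear labels

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
# Only points near the margin get nonzero alpha_i
print(f"{len(clf.support_)} support vectors out of {len(X)} points")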
Kernel Ridge Regression
Consider squared error loss:
$$\min_{f \in \mathcal{H}} \; \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2.$$
Substitute $f = \sum_j \alpha_j k(x_j, \cdot)$. The data fit becomes $\|y - K\alpha\|^2$ and the regularizer becomes $\lambda \alpha^\top K \alpha$.
Differentiate and set to zero and you get
$$\alpha = (K + \lambda I)^{-1} y.$$
The prediction at a new point $x_\star$ is then $f(x_\star) = k_\star^\top (K + \lambda I)^{-1} y$, where $(k_\star)_i = k(x_\star, x_i)$. The Implementation section below turns this formula directly into code.
Connection to Gaussian process regression
This is the same expression as the GP posterior mean. A GP prior $f \sim \mathcal{GP}(0, k)$ with noise variance $\sigma^2$ gives a posterior mean of $k_\star^\top (K + \sigma^2 I)^{-1} y$, and this matches the kernel ridge prediction when $\lambda = \sigma^2$.
The negative log-posterior under a Gaussian likelihood is
$$-\log p(f \mid y) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \frac{1}{2} \|f\|_{\mathcal{H}}^2 + \text{const},$$
which is the ridge objective up to a constant. Ridge regularization is Bayesian inference with a Gaussian prior, and $\lambda$ is the noise-to-signal ratio.
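A numerical check of the equivalence (a sketch assuming scikit-learn; its RBF kernel is $e^{-\|x - x'\|^2 / (2\ell^2)}$, so $\gamma = 1/(2\ell^2)$): the closed-form kernel ridge prediction with $\lambda = \sigma^2$ should match the GP posterior mean to machine precision.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X).ravel() + 0.1 * rng.standard_normal(30)
X_test = np.linspace(-1, 1, 50).reshape(-1, 1)
sigma2 = 0.1

# Kernel ridge in closed form: f = k_*^T (K + sigma^2 I)^{-1} y
def rbf(A, B, l=1.0):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2 * l**2))

alpha = np.linalg.solve(rbf(X, X) + sigma2 * np.eye(30), y)
krr_mean = rbf(X_test, X) @ alpha

# GP posterior mean with the same kernel and noise variance
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                               alpha=sigma2, optimizer=None).fit(X, y)
print(np.allclose(krr_mean, gpr.predict(X_test)))  # True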
Green’s Functions and Differential Operators
RKHS theory ties into PDEs through Green’s functions.
Given a linear differential operator $L$, its Green’s function $G$ solves $L G(\cdot, x') = \delta(\cdot - x')$. If $k$ is the Green’s function of $L^* L$ then the RKHS norm becomes
$$\|f\|_{\mathcal{H}}^2 = \int |L f(x)|^2 \, dx.$$
Cubic splines are a concrete example. The kernel produces functions that minimize
$$\int (f''(x))^2 \, dx.$$
The operator is $L = d^2/dx^2$ and the kernel is the Green’s function of $L^* L = d^4/dx^4$. Spline interpolation solves a variational problem: it finds the interpolant of minimum curvature.
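A rough numerical illustration of the variational property, assuming SciPy is available: among smooth interpolants of the same points, the natural cubic spline should have the smallest bending energy $\int (f'')^2$. Here it is compared against the exact polynomial interpolant through the same points.

import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 8))
y = np.sin(2 * np.pi * x)

spline = CubicSpline(x, y, bc_type="natural")
poly = np.polynomial.Polynomial.fit(x, y, deg=len(x) - 1)  # degree-7 interpolant

grid = np.linspace(x[0], x[-1], 2000)
length = x[-1] - x[0]
energy_spline = np.mean(spline(grid, 2) ** 2) * length     # integral of (f'')^2
energy_poly = np.mean(poly.deriv(2)(grid) ** 2) * length
print(energy_spline < energy_poly)   # True: the spline bends less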
Neural Tangent Kernel
Infinite-width networks converge to GPs. The question is which kernel.
Taylor expand the network around its initialization $\theta_0$:
$$f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0).$$
This is a linear model with feature map $\phi(x) = \nabla_\theta f(x; \theta_0)$. The NTK is
$$\Theta(x, x') = \langle \nabla_\theta f(x; \theta_0), \, \nabla_\theta f(x'; \theta_0) \rangle.$$
For ReLU networks you can compute this analytically layer by layer.
Two regimes matter. In the lazy regime the network is very wide, the weights barely move, and training reduces to kernel ridge regression with a fixed kernel; the representer theorem applies directly. In the rich regime the network has finite width, the kernel evolves during training, and the network learns features rather than just coefficients. Standard kernel theory no longer applies.
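To make this concrete, here is a sketch that computes the empirical NTK of a toy one-hidden-layer ReLU network $f(x) = \frac{1}{\sqrt{m}} \sum_j a_j \, \mathrm{relu}(w_j^\top x)$ from its explicit parameter gradients; as the width $m$ grows, the Gram matrix at initialization stabilizes, which is the lazy-regime picture. (All sizes and seeds are arbitrary choices.)

import numpy as np

def ntk_gram(X, m, seed=0):
    # Empirical NTK Gram matrix of f(x) = (1/sqrt(m)) * sum_j a_j relu(w_j . x)
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, X.shape[1]))   # hidden weights
    a = rng.standard_normal(m)                 # output weights
    pre = X @ W.T                              # (n, m) pre-activations
    act = np.maximum(pre, 0.0)                 # relu(w_j . x)
    ind = (pre > 0).astype(float)              # relu'(w_j . x)
    # df/da_j = act_j / sqrt(m);  df/dw_j = a_j * ind_j * x / sqrt(m)
    K_a = (act @ act.T) / m                            # output-weight gradients
    K_w = ((ind * a) @ (ind * a).T) * (X @ X.T) / m    # hidden-weight gradients
    return K_a + K_w

X = np.random.default_rng(1).standard_normal((5, 3))
for m in (100, 10_000, 100_000):
    print(m, np.round(ntk_gram(X, m)[0], 3))   # first row stabilizes with width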
Mercer’s Theorem
Let $\mathcal{X}$ be compact and $k$ continuous. The integral operator
$$(T_k f)(x) = \int_{\mathcal{X}} k(x, x') f(x') \, dx'$$
is compact, self-adjoint, and positive. The spectral theorem gives
$$k(x, x') = \sum_{i=1}^{\infty} \lambda_i \, \phi_i(x) \phi_i(x'),$$
with $\lambda_i \ge 0$ and $\{\phi_i\}$ orthonormal in $L^2(\mathcal{X})$.
The RKHS norm in this basis is
$$\|f\|_{\mathcal{H}}^2 = \sum_{i=1}^{\infty} \frac{\langle f, \phi_i \rangle_{L^2}^2}{\lambda_i}.$$
High-frequency eigenfunctions usually have small $\lambda_i$. For a finite RKHS norm the high-frequency coefficients have to decay accordingly. This pins down exactly the smoothness the kernel imposes.
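The decay is easy to see numerically: the eigenvalues of an RBF Gram matrix, a finite-sample proxy for the Mercer spectrum, drop off nearly geometrically. A sketch:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
K = np.exp(-10.0 * (X - X.T) ** 2)             # RBF Gram matrix

eigvals = np.linalg.eigvalsh(K)[::-1]          # sorted descending
print(eigvals[:8] / eigvals[0])                # near-geometric decay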
Implementation
import numpy as np
import matplotlib.pyplot as plt

def rbf_kernel(X1, X2, gamma=10.0):
    # Pairwise squared distances, then exp(-gamma * ||x - x'||^2)
    diffs = X1[:, np.newaxis, :] - X2[np.newaxis, :, :]
    dists_sq = np.sum(diffs**2, axis=2)
    return np.exp(-gamma * dists_sq)

def kernel_ridge_regression(X_train, y_train, X_test, lambd=1e-5):
    # Gram matrix of the training points
    K = rbf_kernel(X_train, X_train)
    n = len(X_train)
    # Solve (K + lambda I) alpha = y
    alpha = np.linalg.solve(K + lambd * np.eye(n), y_train)
    # Predict: f(x*) = k*^T alpha
    K_star = rbf_kernel(X_test, X_train)
    return K_star @ alpha

def demo_representer():
    # Data: noisy sine
    np.random.seed(42)
    X = np.sort(np.random.rand(20, 1) * 2 - 1, axis=0)
    y = np.sin(3 * np.pi * X).ravel() + 0.1 * np.random.randn(20)
    X_test = np.linspace(-1, 1, 200).reshape(-1, 1)

    # Fit models with different regularization strengths
    preds_tight = kernel_ridge_regression(X, y, X_test, lambd=1e-5)  # interpolates
    preds_loose = kernel_ridge_regression(X, y, X_test, lambd=1.0)   # smooth

    plt.figure(figsize=(10, 6))
    plt.scatter(X, y, c='r', label='Data')
    plt.plot(X_test, preds_tight, label='Lambda=1e-5 (Interp)')
    plt.plot(X_test, preds_loose, label='Lambda=1.0 (Smooth)')
    plt.title('Representer Theorem in Action')
    plt.legend()
    # plt.show()

# Observation:
# Small lambda -> the function passes through the points (noise interpolation, high norm).
# Large lambda -> the function is flatter (low norm).
# Both fits are just sums of Gaussians centered at the data points.

Kernel Mean Embeddings and MMD
You can embed whole distributions into the RKHS. The mean embedding of a distribution $P$ is
$$\mu_P = \mathbb{E}_{x \sim P}[k(x, \cdot)] \in \mathcal{H}.$$
If the kernel is characteristic, and the Gaussian kernel qualifies, then this embedding is injective and pins down $P$ uniquely.
This gives you a natural distance between distributions:
$$\mathrm{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}.$$
Expanding the square gives
$$\mathrm{MMD}^2(P, Q) = \mathbb{E}[k(x, x')] - 2\,\mathbb{E}[k(x, y)] + \mathbb{E}[k(y, y')], \qquad x, x' \sim P, \quad y, y' \sim Q.$$
This is useful for two-sample testing and for generative models like MMD-GANs. The unbiased estimator is closed-form and differentiable, which is rare for distribution distances.
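A minimal NumPy sketch of the unbiased estimator under an RBF kernel ($\gamma$ and the sample sizes are arbitrary choices):

import numpy as np

def rbf_gram(A, B, gamma=1.0):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-gamma * d2)

def mmd2_unbiased(X, Y, gamma=1.0):
    Kxx = rbf_gram(X, X, gamma)
    Kyy = rbf_gram(Y, Y, gamma)
    Kxy = rbf_gram(X, Y, gamma)
    n, m = len(X), len(Y)
    # Drop the diagonals to remove the E[k(x, x)] bias
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.standard_normal((300, 2)),
                     rng.standard_normal((300, 2)))
diff = mmd2_unbiased(rng.standard_normal((300, 2)),
                     rng.standard_normal((300, 2)) + 1.0)
print(same, diff)   # near zero vs clearly positive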
Ill-posedness and Scalability
An infinite-dimensional RKHS like the RBF one means the eigenvalues of the integral operator accumulate at zero, so its inverse is unbounded. Without regularization, small perturbations in the observations fire back as large oscillations in $f$; this is the standard ill-posed inverse problem.
Computation is a separate issue. The Gram matrix is $n \times n$, which means $O(n^2)$ storage and $O(n^3)$ to invert. For $n$ in the millions direct methods are infeasible.
Random Fourier Features from Rahimi and Recht handle this by Monte Carlo approximation of the Bochner integral:
$$k(x, x') = \int e^{i \omega^\top (x - x')} \, d\mu(\omega) \approx z(x)^\top z(x'), \qquad z(x) = \sqrt{2/D} \, \big[\cos(\omega_j^\top x + b_j)\big]_{j=1}^{D}.$$
This reduces a non-parametric problem to linear regression in a $D$-dimensional random feature space. The Nyström method is the other main approach; it uses a low-rank approximation of $K$ through landmark points.
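A minimal RFF sketch for the RBF kernel $e^{-\gamma \|x - x'\|^2}$: sample $\omega_j \sim \mathcal{N}(0, 2\gamma I)$ and $b_j \sim \mathrm{Unif}[0, 2\pi]$, build $z(x) = \sqrt{2/D}\,\cos(Wx + b)$, and run ordinary linear ridge regression on the features ($D$, $\gamma$, and $\lambda$ are arbitrary choices):

import numpy as np

def rff_features(X, D=500, gamma=10.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)   # z(x) . z(x') ~= k(x, x')

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (2000, 1))
y = np.sin(3 * np.pi * X).ravel() + 0.1 * rng.standard_normal(2000)

Z = rff_features(X)                               # (n, D): cost linear in n
w = np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(Z.shape[1]), Z.T @ y)
X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
print(rff_features(X_test) @ w)                   # ~ sin(3 pi x) at the test points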
RKHS-Sobolev equivalence
Theorem: For the Matérn kernel $k_\nu$ on $\mathbb{R}^d$ the RKHS is isomorphic to the Sobolev space $H^s(\mathbb{R}^d)$ with $s = \nu + d/2$.
The Sobolev norm in the Fourier domain is
$$\|f\|_{H^s}^2 = \int_{\mathbb{R}^d} (1 + \|\omega\|^2)^s \, |\hat{f}(\omega)|^2 \, d\omega.$$
For a shift-invariant kernel with spectrum $\hat{\kappa}(\omega)$ the RKHS norm is
$$\|f\|_{\mathcal{H}}^2 = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \frac{|\hat{f}(\omega)|^2}{\hat{\kappa}(\omega)} \, d\omega.$$
The Matérn spectrum satisfies $\hat{\kappa}_\nu(\omega) \propto (1 + \|\omega\|^2)^{-(\nu + d/2)}$ up to a rescaling of frequency, and substituting gives
$$\|f\|_{\mathcal{H}}^2 \propto \int_{\mathbb{R}^d} (1 + \|\omega\|^2)^{\nu + d/2} \, |\hat{f}(\omega)|^2 \, d\omega,$$
which is exactly the $H^{\nu + d/2}$ norm.
Corollary: RKHS functions are continuous when $s > d/2$ by Sobolev embedding. Since $s = \nu + d/2$, you need $\nu > 0$. All Matérn kernels with $\nu > 0$ give continuous functions.
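The Sobolev picture is visible directly in GP samples. A sketch using the closed forms for $\nu = 1/2$ (exponential kernel, rough paths) and $\nu = 3/2$ (visibly smoother); the roughness proxy is just the mean absolute increment on a fixed grid, an arbitrary illustrative choice:

import numpy as np

def matern(r, nu):
    if nu == 0.5:
        return np.exp(-r)                                  # exponential kernel
    if nu == 1.5:
        return (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)
    raise ValueError("only nu = 1/2 and 3/2 implemented here")

x = np.linspace(0, 5, 400)
r = np.abs(x[:, None] - x[None, :])
rng = np.random.default_rng(0)
for nu in (0.5, 1.5):
    K = matern(r, nu) + 1e-8 * np.eye(len(x))              # jitter for Cholesky
    sample = np.linalg.cholesky(K) @ rng.standard_normal(len(x))
    print(nu, np.mean(np.abs(np.diff(sample))))            # rougher for small nu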
The Nyström approximation
Approximate $K$ using $m \ll n$ landmark points. Let $K_{mm}$ be the landmark kernel matrix and $K_{nm}$ the cross-kernel matrix between data and landmarks. Then
$$K \approx K_{nm} K_{mm}^{-1} K_{nm}^\top.$$
Woodbury gives you the regularized inverse in $O(n m^2)$.
This is linear in $n$ instead of cubic. Pick the landmarks by k-means or leverage score sampling.
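A sketch of Nyström-approximated kernel ridge regression (landmark count, $\gamma$, and $\lambda$ are arbitrary choices): restricting $f$ to the span of $m$ landmark sections turns the $O(n^3)$ solve into the $m \times m$ system $(K_{nm}^\top K_{nm} + \lambda K_{mm})\beta = K_{nm}^\top y$, which costs $O(nm^2)$ to form.

import numpy as np

def nystrom_krr(X, y, X_test, m=50, gamma=10.0, lam=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=m, replace=False)]

    def rbf(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
        return np.exp(-gamma * d2)

    K_nm = rbf(X, landmarks)            # (n, m) cross-kernel
    K_mm = rbf(landmarks, landmarks)    # (m, m) landmark kernel
    # Normal equations of min ||y - K_nm b||^2 + lam * b^T K_mm b
    A = K_nm.T @ K_nm + lam * K_mm + 1e-10 * np.eye(m)   # jitter for stability
    beta = np.linalg.solve(A, K_nm.T @ y)
    return rbf(X_test, landmarks) @ beta

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (5000, 1))
y = np.sin(3 * np.pi * X).ravel() + 0.1 * rng.standard_normal(5000)
print(nystrom_krr(X, y, np.array([[0.0], [0.25]])))   # ~ [0, 0.707]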
Why the RKHS norm measures smoothness
For shift-invariant $k$ the inner product comes from Plancherel’s theorem:
$$\langle f, g \rangle_{\mathcal{H}} = \frac{1}{(2\pi)^d} \int \frac{\hat{f}(\omega) \, \overline{\hat{g}(\omega)}}{\hat{\kappa}(\omega)} \, d\omega.$$
For the Gaussian kernel $\hat{\kappa}(\omega) \propto e^{-\|\omega\|^2 / (4\gamma)}$, so $1/\hat{\kappa}(\omega)$ grows exponentially with frequency. The norm slams high frequencies hard, and the RKHS functions come out analytic.
More generally the regularization problem reduces to
$$\min_f \; \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int \frac{|\hat{f}(\omega)|^2}{\hat{\kappa}(\omega)} \, d\omega,$$
which is Tikhonov regularization for recovering a function from noisy point evaluations. This ties RKHS directly to classical regularization theory.
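The damping is observable directly, reusing kernel_ridge_regression from the Implementation section above: fit the same noisy data with small and large $\lambda$ and compare the high-frequency energy of the fitted function (the split at frequency index 10 is an arbitrary illustrative choice).

import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-1, 1, 200).reshape(-1, 1)    # uniform grid so the FFT is meaningful
y = np.sin(3 * np.pi * X).ravel() + 0.3 * rng.standard_normal(200)

for lam in (1e-6, 1.0):
    f_hat = np.fft.rfft(kernel_ridge_regression(X, y, X, lambd=lam))
    high_freq_energy = np.sum(np.abs(f_hat[10:]) ** 2)
    print(lam, high_freq_energy)   # drops sharply as lambda grows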
Kernels and deep learning
Three facts fit together.
First, infinite-width networks at initialization are GPs; Neal showed this in 1996. Second, under gradient descent the network evolves as a kernel predictor: gradient flow on the parameters induces
$$\frac{d f(x)}{dt} = -\sum_{i=1}^{n} \Theta(x, x_i) \, \frac{\partial L}{\partial f(x_i)}.$$
In the lazy regime, where the network is very wide, $\Theta$ stays constant and training reduces to kernel ridge regression with the NTK.
Third, at finite width $\Theta$ changes during training. The kernel lines up with the target function, and this is a form of feature learning that fixed-kernel methods cannot replicate. How and why the kernel evolves is still a central open problem in deep learning theory.
Kernels and their spectra
The spectral measure from Bochner’s theorem is what picks out which functions live in the RKHS.
Gaussian (RBF): $k(x, x') = e^{-\gamma \|x - x'\|^2}$. Gaussian spectrum with full-frequency support. The RKHS is dense in $C(\mathcal{X})$ for compact $\mathcal{X}$ (universality) and its functions are $C^\infty$.
Laplacian: $k(x, x') = e^{-\|x - x'\| / \sigma}$. Cauchy spectrum with heavy tails, so high frequencies are tolerated. Functions are Lipschitz but not differentiable, mirroring the kernel’s kink at $\delta = 0$. Well-suited to sharp transitions.
Matérn: generalized Student-t spectrum $\hat{\kappa}_\nu(\omega) \propto (1 + \|\omega\|^2 / (2\nu))^{-(\nu + d/2)}$. The parameter $\nu$ controls differentiability, and $\nu \to \infty$ recovers the RBF.
NTK (ReLU): polynomial spectral decay, power-law rather than exponential. This explains why neural networks handle non-smooth targets better than RBF kernels: the exponentially decaying RBF spectrum forces too much smoothness.
How RKHS theory developed
The story starts with James Mercer’s 1909 proof of the spectral decomposition of positive integral operators, a result in pure analysis with no tie to machine learning. Two decades later, in the 1930s, Sergei Sobolev introduced distributions and generalized functions and laid the functional-analytic groundwork. The decisive step came in 1950 when Nachman Aronszajn formalized RKHS theory, proved the Moore-Aronszajn theorem, and pinned down the one-to-one correspondence between positive definite kernels and reproducing kernel Hilbert spaces. Grace Wahba, through the 1970s, connected spline smoothing to RKHS theory, introduced Generalized Cross-Validation, and made the theory practical for statistics.
The machine learning era kicked off in 1992 when Boser, Guyon, and Vapnik introduced the kernel trick for support vector machines and turned RKHS from a theoretical curiosity into a computational tool. Radford Neal showed in 1996 that infinite neural networks converge to Gaussian processes, hinting at the deep connection between kernels and neural architectures. The representer theorem was generalized to arbitrary monotonic regularizers by Schölkopf et al. in 2001. Rahimi and Recht introduced Random Fourier Features in 2007 and made kernel methods scalable by leaning on Bochner’s theorem. Most recently, Jacot et al. in 2018 introduced the Neural Tangent Kernel and closed the circle by showing that wide neural networks are, in a precise sense, kernel machines.
Key terms
- Characteristic Kernel: A kernel whose mean embedding is injective, meaning it uniquely identifies distributions.
- Gram Matrix: The matrix formed by evaluating the kernel at all pairs of data points.
- Green’s Function: The impulse response of a differential operator. When the kernel is a Green’s function, the RKHS norm penalizes derivatives.
- Mercer’s Theorem: The eigenfunction expansion $k(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x')$, which decomposes a kernel into its spectral components.
- MMD: Maximum Mean Discrepancy, a distance between distributions computed via their mean embeddings in an RKHS.
- Nyström Method: A low-rank approximation of the Gram matrix using a subset of $m$ landmark points, reducing $O(n^3)$ to $O(n m^2)$.
- Positive Definite Function: A function $k$ such that the Gram matrix $[k(x_i, x_j)]_{ij}$ is PSD for every finite point set.
- RKHS: A Hilbert space of functions in which pointwise evaluation is a continuous linear functional.
- Shift-Invariant: A kernel of the form $k(x, x') = \kappa(x - x')$, depending only on the difference $x - x'$.
- Spectral Measure: The Fourier transform of a shift-invariant kernel, which determines which frequencies the RKHS can represent.
References
1. Schölkopf, B., & Smola, A. J. (2002). “Learning with Kernels”. The comprehensive reference for SVMs and regularization networks.
2. Aronszajn, N. (1950). “Theory of reproducing kernels”. The foundational math paper. Proves Moore-Aronszajn.
3. Jacot, A., et al. (2018). “Neural Tangent Kernel: Convergence and Generalization in Neural Networks”. Proved that infinite-width networks are kernel machines. Sparked the “modern era” of deep learning theory.
4. Rahimi, A., & Recht, B. (2007). “Random Features for Large-Scale Kernel Machines”. Showed that explicit feature maps sampled from the Fourier transform approximate kernels efficiently, replacing $O(n^2)$ kernel evaluations with $O(nD)$ feature computations.
5. Berlinet, A., & Thomas-Agnan, C. (2011). “Reproducing Kernel Hilbert Spaces in Probability and Statistics”. Rigorous treatment of the functional analysis aspects.
6. Wahba, G. (1990). “Spline Models for Observational Data”. The bible of splines and thin-plate splines as RKHS problems.