RKHS & The Representer Theorem
Introduction to Machine Learning often treats “Kernels” as a computational trick:

$$k(x, y) = \langle \phi(x), \phi(y) \rangle.$$

This implies the feature map $\phi$ comes first, mapping data to some high-dimensional feature space $\mathcal{F}$. Functional Analysis takes the opposite view: the Kernel comes first. The Kernel defines a unique Hilbert space of functions $\mathcal{H}_k$, the Reproducing Kernel Hilbert Space (RKHS), where evaluation is continuous. This continuity condition ($|f(x)| \le M_x \|f\|_{\mathcal{H}}$ for some constant $M_x$) is a severe constraint. It implies that “Dirac deltas” are bounded functionals, meaning we can talk about the value of a function at a point rigorously (unlike in $L^2$, where functions are only defined almost everywhere). The trade-off here, limiting the infinite-dimensional space to gain regularity, is the foundation of Non-Parametric Statistics. It also establishes the equivalence between Positive Definite Kernels, Reproducing Kernel Hilbert Spaces, and Green’s Functions of Differential Operators.
1. The Moore-Aronszajn Theorem
1.1 Reproducing Kernels
Let $\mathcal{H}$ be a Hilbert space of real-valued functions on a non-empty set $\mathcal{X}$. $\mathcal{H}$ is called a Reproducing Kernel Hilbert Space (RKHS) if the evaluation functional $\delta_x : f \mapsto f(x)$ is a bounded linear operator for all $x \in \mathcal{X}$. By the Riesz Representation Theorem, for every bounded linear functional there exists a unique element in the Hilbert space that represents it. Thus, for each $x \in \mathcal{X}$, there exists a unique function $k_x \in \mathcal{H}$ such that:

$$f(x) = \langle f, k_x \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}.$$

The Reproducing Kernel is defined as $k(x, y) = \langle k_x, k_y \rangle_{\mathcal{H}}$, so that $k(\cdot, x) = k_x$. From this definition, we derive the Reproducing Property:

$$f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}.$$

Specifically, the norm of the evaluator is $\|\delta_x\| = \|k_x\|_{\mathcal{H}} = \sqrt{k(x, x)}$.
1.2 Theorem Statement & Proof Construction
Theorem (Moore-Aronszajn, 1950): To every symmetric, positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, there corresponds a unique RKHS $\mathcal{H}_k$ of functions on $\mathcal{X}$ with $k$ as its reproducing kernel.
Proof Construction:
- Pre-Hilbert Space: Consider the set of all finite linear combinations of the kernel centered at points in $\mathcal{X}$:
  $$\mathcal{H}_0 = \Big\{ f = \sum_{i=1}^{n} \alpha_i\, k(\cdot, x_i) \;:\; n \in \mathbb{N},\ \alpha_i \in \mathbb{R},\ x_i \in \mathcal{X} \Big\}.$$
- Inner Product: Define the inner product between $f = \sum_i \alpha_i k(\cdot, x_i)$ and $g = \sum_j \beta_j k(\cdot, y_j)$ as:
  $$\langle f, g \rangle = \sum_{i}\sum_{j} \alpha_i \beta_j\, k(x_i, y_j).$$
  This is well-defined and positive definite because $k$ is a positive definite kernel.
- Completion: The space $\mathcal{H}_0$ might not be complete (i.e., not a Hilbert space). We define $\mathcal{H}_k$ as the completion of $\mathcal{H}_0$ with respect to the norm $\|f\| = \sqrt{\langle f, f \rangle}$. Crucially, because evaluations are bounded ($|f(x)| \le \sqrt{k(x, x)}\,\|f\|$), Cauchy sequences in the norm converge pointwise to well-defined functions. Thus, the completion consists of functions, not abstract equivalence classes. (A numerical sketch of this construction follows the list.)
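A tiny numerical sketch of the construction above, using a scalar Gaussian RBF kernel as a stand-in for a generic $k$ (the helper `rbf` and the sampled points are purely illustrative): the inner product of two elements of $\mathcal{H}_0$ is read off the Gram matrix, and the boundedness of evaluation is just Cauchy–Schwarz.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    # Gaussian RBF kernel on scalars (an arbitrary illustrative choice of k)
    return np.exp(-gamma * (x - y) ** 2)

rng = np.random.default_rng(0)
X_pts = rng.uniform(-1, 1, size=5)          # centers x_i of f = sum_i a_i k(., x_i)
Y_pts = rng.uniform(-1, 1, size=4)          # centers y_j of g = sum_j b_j k(., y_j)
a, b = rng.normal(size=5), rng.normal(size=4)

K_xy = rbf(X_pts[:, None], Y_pts[None, :])  # k(x_i, y_j)
K_xx = rbf(X_pts[:, None], X_pts[None, :])  # k(x_i, x_j)

inner_fg = a @ K_xy @ b                     # <f, g> = sum_ij a_i b_j k(x_i, y_j)
norm_f = np.sqrt(a @ K_xx @ a)              # ||f|| = sqrt(<f, f>)

# Boundedness of evaluation: |f(x)| <= sqrt(k(x, x)) * ||f|| for any x (Cauchy-Schwarz)
x = 0.3
f_x = np.sum(a * rbf(X_pts, x))
print(abs(f_x) <= np.sqrt(rbf(x, x)) * norm_f)   # True
```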
1.3 Bochner’s Theorem (Shift-Invariant Kernels)
For $\mathcal{X} = \mathbb{R}^d$, a kernel is shift-invariant if $k(x, y) = \kappa(x - y)$. Bochner’s Theorem states that a continuous shift-invariant function $\kappa$ is positive definite if and only if it is the Fourier transform of a finite non-negative measure $\mu$ (the spectral measure):

$$\kappa(\delta) = \int_{\mathbb{R}^d} e^{i\,\omega^\top \delta}\, d\mu(\omega).$$
This connects RKHS theory to Harmonic Analysis.
- The Gaussian (RBF) Kernel corresponds to a Gaussian spectral measure. Since the Gaussian has support everywhere, the RBF kernel can approximate any continuous function (Universal Approximation).
- A band-limited spectral measure corresponds to a kernel that only contains certain frequencies (e.g., the Sinc kernel). This theorem is the basis of Random Fourier Features (Rahimi & Recht, 2007), which approximate kernels by sampling frequencies $\omega_i \sim \mu$; a sketch follows this list.
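A minimal sketch of that construction for the RBF kernel: its spectral measure is Gaussian, so sampling $\omega \sim \mathcal{N}(0, 2\gamma I)$ and random phases gives an explicit feature map whose inner products approximate the kernel. The helper name `rff_features`, $D$, and $\gamma$ are illustrative choices.

```python
import numpy as np

def rff_features(X, D=500, gamma=10.0, seed=0):
    """Random Fourier Features z(x) with E[z(x) . z(y)] = exp(-gamma * ||x - y||^2).

    For the RBF kernel, Bochner's spectral measure is Gaussian, so we sample
    frequencies omega ~ N(0, 2*gamma*I) and phases b ~ Uniform[0, 2*pi].
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    omega = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ omega + b)

# Sanity check: the explicit feature inner product approximates the kernel.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(5, 2))
Z = rff_features(X, D=20000)
K_exact = np.exp(-10.0 * np.sum((X[:, None, :] - X[None, :, :])**2, axis=2))
K_approx = Z @ Z.T
print(np.max(np.abs(K_exact - K_approx)))   # small, and it shrinks as D grows
```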
2. The Representer Theorem
Why do Support Vector Machines (SVMs) and Kernel Ridge Regression solutions always look like sums of kernels centered at the data points? This is not a coincidence; it is a fundamental result of the geometry of Hilbert spaces. Consider the problem of minimizing a regularized risk functional:

$$\min_{f \in \mathcal{H}_k} \; \sum_{i=1}^{n} L\big(y_i, f(x_i)\big) + \Omega\big(\|f\|_{\mathcal{H}_k}\big),$$

where $L$ is an arbitrary loss function (e.g., squared error, hinge loss) and $\Omega$ is a strictly increasing function (usually $\Omega(t) = \lambda t^2$).
Theorem (Schölkopf, Herbrich, Smola 2001): The minimizer $f^*$ of the regularized risk admits the representation:

$$f^*(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i)$$

for some coefficients $\alpha_1, \dots, \alpha_n \in \mathbb{R}$.
Proof (Orthogonal Projection): The key insight is that any function $f \in \mathcal{H}_k$ can be decomposed into a component lying in the subspace spanned by the data representers and a component orthogonal to it. Let $S = \operatorname{span}\{k(\cdot, x_1), \dots, k(\cdot, x_n)\}$. Decompose $f = f_\parallel + f_\perp$, where $f_\parallel \in S$ and $f_\perp \perp S$.
- Data Consistency: The evaluation of $f$ on the data points depends only on the parallel component:
  $$f(x_i) = \langle f, k(\cdot, x_i) \rangle = \langle f_\parallel, k(\cdot, x_i) \rangle.$$
  This is because $\langle f_\perp, k(\cdot, x_i) \rangle = 0$ by orthogonality. Thus, the loss term is unaffected by $f_\perp$.
- Regularization: The norm satisfies:
  $$\|f\|^2_{\mathcal{H}_k} = \|f_\parallel\|^2 + \|f_\perp\|^2 \;\ge\; \|f_\parallel\|^2.$$
  Since $\Omega$ is strictly increasing, adding any non-zero orthogonal component strictly increases the penalty cost without improving the data fit.
- Conclusion: The minimizer must have $f_\perp = 0$. Therefore, $f^* \in S$, i.e., $f^*(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i)$.
2.5 Case Study: Support Vector Machines (SVM)
The classic SVM is simply the application of the Representer Theorem to the Hinge Loss:

$$L\big(y, f(x)\big) = \max\big(0,\; 1 - y\, f(x)\big).$$

The primal problem is:

$$\min_{f \in \mathcal{H}_k} \; \frac{1}{n}\sum_{i=1}^{n} \max\big(0,\; 1 - y_i f(x_i)\big) + \lambda \|f\|^2_{\mathcal{H}_k}.$$

By the Representer Theorem, $f^*(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i)$. Why is the solution “sparse” (i.e., many $\alpha_i = 0$)? This is explained by the dual problem. The non-zero $\alpha_i$ identify the Support Vectors, the points that lie on or inside the margin boundary. Points that are correctly classified and far from the boundary have zero influence on the solution (their Lagrange multipliers are zero). This shows that sparsity is a property of the Loss Function (Hinge), while the kernel expansion is a property of the Regularizer (the squared RKHS norm). A short demonstration follows.
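A quick illustration of this sparsity, assuming scikit-learn is available (it is not used elsewhere in this post); the toy data set and hyperparameters are arbitrary. The point is only that the fitted expansion keeps far fewer than $n$ non-zero coefficients.

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(X[:, 0] + X[:, 1])            # a simple separable toy problem

clf = SVC(kernel='rbf', gamma=10.0, C=1.0).fit(X, y)

# The Representer Theorem writes f(x) = sum_i alpha_i k(x, x_i) over all training points,
# but the hinge loss makes most alpha_i exactly zero: only the support vectors remain.
print(f"{len(clf.support_)} support vectors out of {len(X)} training points")
```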
3. Kernel Ridge Regression (Derivation)
Let’s apply the Representer Theorem to the Squared Error Loss:

$$L\big(y_i, f(x_i)\big) = \big(y_i - f(x_i)\big)^2.$$

The objective functional becomes:

$$J[f] = \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \lambda \|f\|^2_{\mathcal{H}_k}.$$

Substituting $f(x) = \sum_{j=1}^{n} \alpha_j\, k(x, x_j)$, we have:
- Data Fit: $f(x_i) = \sum_{j} \alpha_j k(x_i, x_j) = (K\alpha)_i$, where $K_{ij} = k(x_i, x_j)$ is the Gram matrix. Thus the data-fit term is $\|y - K\alpha\|^2$.
- Regularizer: $\|f\|^2_{\mathcal{H}_k} = \alpha^\top K \alpha$.

The optimization problem is now a finite-dimensional quadratic in $\alpha$:

$$J(\alpha) = \|y - K\alpha\|^2 + \lambda\, \alpha^\top K \alpha.$$

Taking the gradient w.r.t. $\alpha$:

$$\nabla_\alpha J = -2K(y - K\alpha) + 2\lambda K \alpha.$$

Setting it to zero (assuming $K$ is invertible):

$$K\big((K + \lambda I)\alpha - y\big) = 0 \;\Longrightarrow\; (K + \lambda I)\alpha = y.$$

Yielding the closed-form solution:

$$\alpha = (K + \lambda I)^{-1} y.$$

The prediction at a new point $x_*$ is:

$$f^*(x_*) = \sum_{i=1}^{n} \alpha_i\, k(x_*, x_i) = k_*^\top (K + \lambda I)^{-1} y, \qquad k_* = \big[k(x_*, x_1), \dots, k(x_*, x_n)\big]^\top.$$
3.5 Connection to Gaussian Processes
This result is identical to the Posterior Mean of Gaussian Process Regression with noise variance $\sigma^2$. Why? Consider a GP prior $f \sim \mathcal{GP}(0, k)$. The negative log-posterior is proportional to the sum of two terms:
- Likelihood: $-\log p(y \mid f) \propto \frac{1}{2\sigma^2} \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2$.
- Prior: The RKHS norm corresponds to the “energy” of the function under the prior. Formally, $-\log p(f) \propto \frac{1}{2}\|f\|^2_{\mathcal{H}_k}$.

Minimizing the negative log-posterior is equivalent to minimizing:

$$\frac{1}{2\sigma^2} \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \frac{1}{2}\|f\|^2_{\mathcal{H}_k}.$$

Multiplying through by $2\sigma^2$, we recover the Ridge Regression objective with $\lambda = \sigma^2$. Thus, Ridge Regularization is Bayesian Inference with a Gaussian Prior. The parameter $\lambda$ controls the trade-off between the noise floor and the prior signal variance.
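A small numerical check of this equivalence, assuming SciPy is available for the Cholesky solve (data and hyperparameters are arbitrary): the GP posterior mean $k_*^\top (K + \sigma^2 I)^{-1} y$ and the kernel ridge prediction with $\lambda = \sigma^2$ are the same closed form, computed here through two different code paths.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve   # assumes SciPy is available

def rbf_kernel(X1, X2, gamma=10.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :])**2, axis=2)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, size=(15, 1)), axis=0)
y = np.sin(3 * np.pi * X).ravel() + 0.1 * rng.normal(size=15)
X_test = np.linspace(-1, 1, 50).reshape(-1, 1)

sigma2 = 0.25                                    # GP noise variance; plays the role of lambda
K, K_star = rbf_kernel(X, X), rbf_kernel(X_test, X)

# GP posterior mean: k_*^T (K + sigma^2 I)^{-1} y, via a Cholesky factorization.
chol = cho_factor(K + sigma2 * np.eye(len(X)))
gp_mean = K_star @ cho_solve(chol, y)

# Kernel ridge prediction with lambda = sigma^2 (same formula, different solver).
krr_pred = K_star @ np.linalg.solve(K + sigma2 * np.eye(len(X)), y)

print(np.allclose(gp_mean, krr_pred))            # True
```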
4. Green’s Functions & Differential Operators
RKHSs are often solution spaces of differential equations. Consider a linear differential operator $L$. The Green’s function $G(x, y)$ is the impulse response:

$$L\, G(x, y) = \delta(x - y).$$

Duality Theorem: If $k(x, y)$ is the Green’s function of the self-adjoint operator $L^* L$, then the RKHS $\mathcal{H}_k$ corresponds to the space of functions penalized by the differential operator $L$:

$$\|f\|^2_{\mathcal{H}_k} = \int \big|L f(x)\big|^2\, dx = \|L f\|^2_{L^2}.$$
Example: Cubic Splines
Consider the Cubic Spline Kernel on $[0, 1]$ (for functions with $f(0) = f'(0) = 0$):

$$k(s, t) = \frac{\min(s, t)^2 \max(s, t)}{2} - \frac{\min(s, t)^3}{6}.$$

This generates the space of cubic splines, minimizing the curvature energy:

$$\|f\|^2_{\mathcal{H}_k} = \int \big(f''(x)\big)^2\, dx.$$

Here, the operator is the second derivative $L = \frac{d^2}{dx^2}$, and the kernel is the Green’s function of the biharmonic operator $\frac{d^4}{dx^4}$. This explains why “spline interpolation” is optimal: it minimizes the curvature energy among all interpolating functions.
5. Neural Tangent Kernel (NTK)
Deep networks with infinite width converge to GPs. What is the kernel? Let $f(x; \theta)$ be the network, take the limit of large width $m \to \infty$, and let $\theta_0$ be the initialization. Taylor expand around initialization:

$$f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0).$$

This is a linear model with feature map $\phi(x) = \nabla_\theta f(x; \theta_0)$. The corresponding kernel is the Neural Tangent Kernel:

$$\Theta(x, x') = \big\langle \nabla_\theta f(x; \theta_0),\; \nabla_\theta f(x'; \theta_0) \big\rangle.$$

For ReLU networks, $\Theta$ can be computed analytically via a layer-wise recursion. Regimes of training:
- Lazy Regime: Weights don’t move much. The network behaves like a fixed kernel machine with kernel $\Theta$.
- Rich Regime: Weights move significantly. Features adapt and the kernel evolves. The Representer Theorem applies strictly only in the Lazy regime. In the Rich regime, we are doing “Feature Learning”: learning the kernel itself. A minimal empirical-NTK sketch follows this list.
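To make the definition concrete, here is a hedged toy sketch (not the analytic recursion mentioned above) that computes the empirical NTK of a two-layer ReLU network $f(x) = \tfrac{1}{\sqrt{m}}\, v^\top \mathrm{relu}(Wx)$ at a random initialization; the helpers `grad_f` and `empirical_ntk` and the sizes are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def grad_f(x, W, v):
    """Gradient of f(x) = v^T relu(W x) / sqrt(m) w.r.t. all parameters, flattened."""
    m = len(v)
    pre = W @ x
    dv = relu(pre) / np.sqrt(m)                      # df/dv_i
    dW = np.outer(v * (pre > 0), x) / np.sqrt(m)     # df/dW_{ij} = v_i 1[pre_i > 0] x_j / sqrt(m)
    return np.concatenate([dW.ravel(), dv])

def empirical_ntk(X, W, v):
    """Theta(x, x') = <grad f(x), grad f(x')> at the current parameters."""
    G = np.stack([grad_f(x, W, v) for x in X])
    return G @ G.T

rng = np.random.default_rng(0)
d, m = 2, 4096                                       # large width: the empirical NTK concentrates
W = rng.normal(size=(m, d))
v = rng.normal(size=m)
X = rng.normal(size=(5, d))
print(np.round(empirical_ntk(X, W, v), 2))           # a 5 x 5 kernel matrix
```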
6. Mercer’s Theorem (Spectral Decomposition)
If $\mathcal{X}$ is compact and $k$ is continuous, define the integral operator:

$$(T_k f)(x) = \int_{\mathcal{X}} k(x, y)\, f(y)\, dy.$$

Since $k$ is PSD, $T_k$ is a compact, self-adjoint, positive operator. By the Spectral Theorem for Compact Operators, it has the eigendecomposition:

$$k(x, y) = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(y),$$

where $\lambda_1 \ge \lambda_2 \ge \dots \ge 0$ and the $\phi_i$ are orthonormal eigenfunctions in $L^2(\mathcal{X})$. The RKHS Norm is:

$$\|f\|^2_{\mathcal{H}_k} = \sum_{i=1}^{\infty} \frac{\langle f, \phi_i \rangle^2_{L^2}}{\lambda_i}.$$

Only functions whose coefficients decay fast enough (faster than $\sqrt{\lambda_i}$) are in the RKHS. This explains the smoothness: high-frequency eigenfunctions usually have small eigenvalues. To have finite norm, $f$ must have very little energy in high frequencies. The short computation below illustrates this decay numerically.
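A quick numerical illustration: the eigenvalues of an RBF Gram matrix on a grid stand in for the Mercer eigenvalues $\lambda_i$ (the grid size and $\gamma$ are arbitrary choices), and they decay very rapidly.

```python
import numpy as np

# Eigenvalues of an RBF Gram matrix on a grid decay rapidly, mirroring Mercer's lambda_i.
x = np.linspace(-1, 1, 200).reshape(-1, 1)
K = np.exp(-10.0 * (x - x.T)**2)
eigvals = np.linalg.eigvalsh(K)[::-1]     # sorted descending
print(eigvals[:8] / eigvals[0])           # rapid, roughly geometric decay
```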
7. Numerical Implementation (Kernel Ridge)
```python
import numpy as np
import matplotlib.pyplot as plt

def rbf_kernel(X1, X2, gamma=10.0):
    diffs = X1[:, np.newaxis, :] - X2[np.newaxis, :, :]
    dists_sq = np.sum(diffs**2, axis=2)
    return np.exp(-gamma * dists_sq)

def kernel_ridge_regression(X_train, y_train, X_test, lambd=1e-5):
    # K matrix
    K = rbf_kernel(X_train, X_train)
    n = len(X_train)
    # Solve (K + lambda I) alpha = y
    alpha = np.linalg.solve(K + lambd * np.eye(n), y_train)
    # Predict
    K_star = rbf_kernel(X_test, X_train)
    f_star = K_star @ alpha
    return f_star

def demo_representer():
    # Data: noisy sine
    np.random.seed(42)
    X = np.sort(np.random.rand(20, 1) * 2 - 1, axis=0)
    y = np.sin(3 * np.pi * X).ravel() + 0.1 * np.random.randn(20)
    X_test = np.linspace(-1, 1, 200).reshape(-1, 1)

    # Fit models with different regularization
    preds_tight = kernel_ridge_regression(X, y, X_test, lambd=1e-5)  # Interpolates
    preds_loose = kernel_ridge_regression(X, y, X_test, lambd=1.0)   # Smooth

    plt.figure(figsize=(10, 6))
    plt.scatter(X, y, c='r', label='Data')
    plt.plot(X_test, preds_tight, label='Lambda=1e-5 (Interp)')
    plt.plot(X_test, preds_loose, label='Lambda=1.0 (Smooth)')
    plt.title('Representer Theorem in Action')
    plt.legend()
    # plt.show()

    # Observation:
    # Small lambda -> the function passes through the points (noise interpolation, high norm).
    # Large lambda -> the function is flatter (low norm).
    # Both are just sums of Gaussians centered at the data points.
```

8. Kernel Mean Embeddings and MMD
Can we map entire probability distributions into the RKHS? Yes. The Kernel Mean Embedding of a distribution $P$ is defined as:

$$\mu_P = \mathbb{E}_{x \sim P}\big[k(\cdot, x)\big] \in \mathcal{H}_k.$$

If the kernel is Characteristic (e.g., Gaussian), this mapping is injective. The mean embedding $\mu_P$ uniquely identifies the distribution $P$. This allows us to define a distance between distributions $P$ and $Q$ purely in the RKHS:

Maximum Mean Discrepancy (MMD):

$$\mathrm{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}_k}.$$

Expanded via the inner product:

$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}\big[k(x, x')\big] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}\big[k(x, y)\big] + \mathbb{E}_{y, y' \sim Q}\big[k(y, y')\big].$$
This is a powerful metric for Two-Sample Testing (checking if two datasets come from the same distribution) and training Generative Models (MMD-GANs), as it has a closed-form unbiased estimator that is differentiable.
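A minimal sketch of that unbiased estimator with an RBF kernel (the function name `mmd2_unbiased`, the bandwidth, and the toy distributions are illustrative choices): diagonal terms are dropped so the estimator is unbiased, and the value is near zero for two samples from the same distribution.

```python
import numpy as np

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimator of MMD^2 between samples X ~ P and Y ~ Q (RBF kernel)."""
    def k(A, B):
        return np.exp(-gamma * np.sum((A[:, None, :] - B[None, :, :])**2, axis=2))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    # Drop the diagonal terms for unbiasedness.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(300, 1)), rng.normal(size=(300, 1)))
differ = mmd2_unbiased(rng.normal(size=(300, 1)), rng.normal(loc=1.0, size=(300, 1)))
print(f"same distribution: {same:.4f}, shifted distribution: {differ:.4f}")
```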
9. Pathologies of Infinite Dimensions
If the RKHS is infinite-dimensional (like the RBF’s), the Mercer spectrum satisfies $\lambda_i \to 0$. The integral operator $T_k$ is bounded, but its inverse is unbounded. This makes the inverse problem Ill-Posed: small noise in the observations can lead to massive errors in the recovered $f$ if we don’t regularize ($\lambda \to 0$ leads to overfitting). There is also “The Curse of Kernelization”: for large $n$, computing $K$ is $O(n^2)$ and inverting it is $O(n^3)$. We need approximations (Nyström, Random Features). Random Features (Rahimi-Recht): approximate the integral defining $\kappa$ (Bochner) with Monte Carlo sampling of frequencies.
This turns a Non-Parametric problem back into a Parametric one (Linear Regression in Random Feature space).
10. Conclusion
The theory of Reproducing Kernel Hilbert Spaces gives rigorous mathematical footing to the intuition that “similarity” drives prediction. By mapping data into infinite-dimensional spaces where evaluation is continuous, we gain the ability to apply powerful tools from linear algebra (like projections and spectral decompositions) to non-linear functions.
From the classical foundations of Moore-Aronszajn and Mercer to the modern revelations of the Neural Tangent Kernel, the duality between Kernels (Data) and Operators (Functionals) remains central. Whether implicitly in a Neural Network or explicitly in a Gaussian Process, the Kernel defines the inductive bias of the model—determining which functions are “simple” and which are “complex.”
Historical Timeline
| Year | Event | Significance |
|---|---|---|
| 1909 | James Mercer | Proves the spectral decomposition of positive integral operators. |
| 1930s | Sergei Sobolev | Introduces distributions and generalized functions. |
| 1950 | Nachman Aronszajn | Formalizes RKHS theory and proves the Moore-Aronszajn theorem. |
| 1970s | Grace Wahba | Connects Splines to RKHS and introduces Generalized Cross-Validation. |
| 1992 | Boser, Guyon, Vapnik | Introduce the “Kernel Trick” for Support Vector Machines. |
| 1996 | Radford Neal | Shows infinite neural networks converge to Gaussian Processes. |
| 2001 | Representer Theorem | Generalized to arbitrary monotonic regularizers by Schölkopf et al. |
| 2007 | Rahimi & Recht | Introduce Random Fourier Features (via Bochner’s Theorem). |
| 2018 | Jacot et al. | Introduce the Neural Tangent Kernel (NTK). |
Appendix A: Proof of RKHS Equivalence to Sobolev Spaces
Theorem: For the Matern kernel with smoothness parameter $\nu$ on $\mathbb{R}^d$, $\mathcal{H}_k$ is norm-equivalent to the Sobolev space $H^s(\mathbb{R}^d)$ with $s = \nu + d/2$. Proof: The Sobolev norm is defined in the Fourier domain:

$$\|f\|^2_{H^s} = \int_{\mathbb{R}^d} \big(1 + \|\omega\|^2\big)^{s}\, |\hat{f}(\omega)|^2\, d\omega.$$

For a shift-invariant kernel with spectral density $\hat{\kappa}(\omega)$, the RKHS norm is:

$$\|f\|^2_{\mathcal{H}_k} = \int_{\mathbb{R}^d} \frac{|\hat{f}(\omega)|^2}{\hat{\kappa}(\omega)}\, d\omega$$

(this is Mercer’s theorem in the continuous domain). For the Matern kernel, $\hat{\kappa}(\omega) \propto \big(1 + \|\omega\|^2\big)^{-(\nu + d/2)}$ up to rescaling of the lengthscale. Substituting this into the RKHS norm:

$$\|f\|^2_{\mathcal{H}_k} \;\propto\; \int_{\mathbb{R}^d} \big(1 + \|\omega\|^2\big)^{\nu + d/2}\, |\hat{f}(\omega)|^2\, d\omega.$$

This is exactly the Sobolev norm! Corollary: RKHS functions are continuous if $s > d/2$ (Sobolev Embedding Theorem). Since $s = \nu + d/2$, we need $\nu > 0$. Matern kernels with $\nu > 0$ always yield continuous functions.
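To visualize the smoothness scale that $\nu$ controls, here is a hedged sketch that draws one Gaussian-process sample path per kernel, using the standard closed forms for $\nu = 1/2$ and $\nu = 3/2$; the lengthscale, grid, jitter, and the crude roughness summary are arbitrary illustrative choices.

```python
import numpy as np

# One GP sample path per kernel: nu = 1/2 is rough, larger nu is smoother,
# and nu -> infinity recovers the RBF kernel.
x = np.linspace(0, 1, 300)
r = np.abs(x[:, None] - x[None, :])
ell = 0.3   # arbitrary lengthscale

kernels = {
    "Matern 1/2": np.exp(-r / ell),
    "Matern 3/2": (1 + np.sqrt(3) * r / ell) * np.exp(-np.sqrt(3) * r / ell),
    "RBF (nu -> inf)": np.exp(-r**2 / (2 * ell**2)),
}

rng = np.random.default_rng(0)
for name, K in kernels.items():
    sample = rng.multivariate_normal(np.zeros(len(x)), K + 1e-6 * np.eye(len(x)))
    # Mean absolute first difference as a crude roughness summary.
    print(name, "roughness:", round(np.abs(np.diff(sample)).mean(), 4))
```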
Appendix B: The Nyström Method
The “Curse of Kernelization” is that the kernel matrix $K$ is $n \times n$:
- Storage: $O(n^2)$.
- Inversion: $O(n^3)$.

For $n \sim 10^6$ (e.g. ImageNet scale), this is intractable. The Nyström Method approximates $K$ using a low-rank decomposition based on a subset of $m \ll n$ landmark points. Let $K_{mm}$ be the kernel matrix of the landmarks, and $K_{nm}$ the kernel evaluations between all points and the landmarks. We approximate:

$$K \approx K_{nm}\, K_{mm}^{-1}\, K_{nm}^\top.$$

This is a good rank-$m$ approximation when the landmarks are chosen via K-means clustering or leverage-score sampling. Using the Woodbury Matrix Identity, we can invert the regularized approximate kernel in $O(n m^2)$ time:

$$\big(K_{nm} K_{mm}^{-1} K_{nm}^\top + \lambda I\big)^{-1} = \frac{1}{\lambda}\Big[\, I - K_{nm}\big(\lambda K_{mm} + K_{nm}^\top K_{nm}\big)^{-1} K_{nm}^\top \Big].$$

This reduces the complexity from cubic to linear in $n$, making kernel methods scalable to millions of data points; a minimal sketch follows.
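A minimal sketch of the Nyström + Woodbury solve, compared against the exact kernel ridge solution on a problem small enough to solve exactly. The landmark choice here is plain grid placement rather than K-means or leverage scores, and the function names and hyperparameters are illustrative.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=10.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :])**2, axis=2)
    return np.exp(-gamma * d2)

def nystrom_woodbury_solve(X, y, landmarks, lambd=1e-1, jitter=1e-8):
    """Solve (K_tilde + lambda I) alpha = y with K_tilde = K_nm K_mm^{-1} K_nm^T.

    Only an m x m system is factorized (Woodbury), so the cost is O(n m^2)."""
    K_nm = rbf_kernel(X, landmarks)                                          # n x m
    K_mm = rbf_kernel(landmarks, landmarks) + jitter * np.eye(len(landmarks))
    inner = lambd * K_mm + K_nm.T @ K_nm                                     # m x m
    return (y - K_nm @ np.linalg.solve(inner, K_nm.T @ y)) / lambd

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 1))
y = np.sin(3 * np.pi * X).ravel() + 0.1 * rng.normal(size=500)
landmarks = np.linspace(-1, 1, 25).reshape(-1, 1)      # evenly spaced landmarks for simplicity

alpha_nys = nystrom_woodbury_solve(X, y, landmarks, lambd=1e-1)

# Compare against exact kernel ridge (feasible only because n is small here).
K = rbf_kernel(X, X)
alpha_exact = np.linalg.solve(K + 1e-1 * np.eye(len(X)), y)

X_test = np.linspace(-1, 1, 200).reshape(-1, 1)
K_test = rbf_kernel(X_test, X)
gap = np.max(np.abs(K_test @ alpha_nys - K_test @ alpha_exact))
print(f"max prediction gap between Nystrom and exact ridge: {gap:.5f}")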
Appendix C: Regularization Operators and Inverse Problems
Why does the RKHS norm correspond to smoothness? Generally, for a shift-invariant kernel $k(x, y) = \kappa(x - y)$, the inner product can be written using Fourier transforms (Plancherel’s Theorem), up to normalizing constants:

$$\langle f, g \rangle_{\mathcal{H}_k} = \int \frac{\hat{f}(\omega)\, \overline{\hat{g}(\omega)}}{\hat{\kappa}(\omega)}\, d\omega.$$

For the Gaussian kernel, $\hat{\kappa}(\omega) \propto e^{-\|\omega\|^2 / (4\gamma)}$. The reciprocal $1 / \hat{\kappa}(\omega)$ grows exponentially with frequency, so the norm penalizes high frequencies exponentially fast. This means functions in the RKHS are analytic (extremely smooth). More generally, we can view the kernel as an integral operator $(T_k f)(x) = \int k(x, y) f(y)\, dy$. The regularization problem is equivalent to solving the well-posed operator equation

$$(T_k + \lambda I)\, f_\lambda = g$$

in place of the ill-posed first-kind equation $T_k f = g$. This is Tikhonov Regularization for the ill-posed inverse problem of recovering a function from noisy pointwise evaluations.
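A concrete finite-dimensional way to see the stabilization, stated here in terms of the Gram matrix rather than the operator $T_k$ (a standard identity, added for completeness): with the eigendecomposition $K = \sum_i d_i\, u_i u_i^\top$, the fitted values at the training points are

$$\hat{f} = K\,(K + \lambda I)^{-1} y = \sum_{i=1}^{n} \frac{d_i}{d_i + \lambda}\,\langle u_i, y\rangle\, u_i.$$

Directions with $d_i \gg \lambda$ pass through nearly unchanged, while directions with $d_i \ll \lambda$ are damped toward zero; this filtering is exactly what tames the unbounded inverse.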
Appendix D: The Kernel Trick in Deep Learning
The link between Neural Networks and Kernels is deep.
- Infinite Width: As shown in Section 5, infinite width networks at initialization coincide with Gaussian Processes (Neal, 1996).
- GD Dynamics: Under Gradient Descent, the network evolves as a Kernel Predictor with the Neural Tangent Kernel (NTK).
In the “Lazy Regime” (very large width), $\Theta_t \approx \Theta_0$ (the kernel stays essentially constant during training). The network is just doing Kernel Ridge Regression with the NTK.
- Feature Learning: In realistic regimes (finite width), $\Theta_t$ evolves. The kernel “aligns” with the target function. This is something standard kernel machines cannot do; they have a fixed prior. Understanding how the kernel evolves is the frontier of Deep Learning Theory (e.g., the Maximal Update Parametrization, $\mu$P).
Appendix E: Common Kernels and their Spectral Measures
Understanding the spectral properties of kernels (via Bochner’s Theorem) gives insight into the “texture” of functions in the RKHS.
- Gaussian (RBF): $k(x, y) = \exp(-\gamma \|x - y\|^2)$. The spectrum is also Gaussian. Since it has support on all frequencies, the RKHS is dense in the space of continuous functions on compact sets (universality). Functions in the RKHS are $C^\infty$.
- Laplacian: $k(x, y) = \exp(-\gamma \|x - y\|)$. The spectrum is Cauchy ($\hat{\kappa}(\omega) \propto \frac{1}{\gamma^2 + \|\omega\|^2}$ in one dimension). Decay is slow (heavy tails), meaning high-frequency components are allowed. This leads to non-smooth functions (the kernel is Lipschitz but not differentiable at $x = y$). Used in modeling shock waves or sharp transitions.
- Matern Class: The spectrum is a generalized Student-t density, $\hat{\kappa}(\omega) \propto \big(1 + \|\omega\|^2\big)^{-(\nu + d/2)}$ up to lengthscale. The parameter $\nu$ controls the rate of spectral decay and thus the differentiability of the functions. As $\nu \to \infty$, the Matern kernel converges to the RBF.
- Neural Tangent Kernel (ReLU): The spectrum decays only polynomially in frequency. This explains why neural networks can fit non-smooth data better than RBF kernels (which are too smooth).
Appendix F: Glossary of Terms
- Characteristic Kernel: A kernel whose mean embedding is injective (maps distributions uniquely).
- Gram Matrix: The matrix $K_{ij} = k(x_i, x_j)$ of pairwise kernel evaluations on the data.
- Green’s Function: The impulse response of a differential operator.
- Mercer’s Theorem: Representation of a kernel as a sum of eigenfunctions.
- MMD (Maximum Mean Discrepancy): A distance between distributions based on the difference of their mean embeddings.
- Nyström Method: A low-rank approximation technique for large kernel matrices.
- Positive Definite Function: A function that ensures the Gram matrix is always PSD.
- RKHS: A Hilbert space where point evaluation is a continuous linear functional.
- Shift-Invariant: A kernel that depends only on the difference $x - y$ (stationary).
- Spectral Measure: The Fourier transform of a shift-invariant kernel (Bochner).
References
1. Schölkopf, B., & Smola, A. J. (2002). “Learning with Kernels”. The comprehensive reference for SVMs and Regularization Networks.
2. Aronszajn, N. (1950). “Theory of reproducing kernels”. The foundational math paper. Proves Moore-Aronszajn.
3. Jacot, A. et al. (2018). “Neural Tangent Kernel”. Proved that infinite width networks are Kernel Machines. Sparked the “Modern Era” of deep learning theory.
4. Rahimi, A., & Recht, B. (2007). “Random Features for Large-Scale Kernel Machines”. Showed that explicit feature maps sampled from the Fourier transform can approximate kernels efficiently, at a cost linear rather than cubic in the number of samples.
5. Berlinet, A., & Thomas-Agnan, C. (2011). “Reproducing Kernel Hilbert Spaces in Probability and Statistics”. Rigorous treatment of the functional analysis aspects.
6. Wahba, G. (1990). “Spline Models for Observational Data”. The bible of splines and thin-plate splines as RKHS problems.