RKHS & The Representer Theorem

Introductory machine learning courses often treat kernels as a computational trick:

$$k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{F}}$$

This implies the feature map $\phi$ comes first, mapping data into some high-dimensional feature space $\mathcal{F}$. Functional Analysis takes the opposite view: the kernel comes first. The kernel $K$ defines a unique Hilbert space of functions $\mathcal{H}_K$, the Reproducing Kernel Hilbert Space (RKHS), in which evaluation is continuous. This continuity condition ($|f(x)| \le C_x \|f\|_{\mathcal{H}}$) is a severe constraint: it means the "Dirac deltas" $\delta_x$ are bounded functionals, so we can talk rigorously about the value of a function at a point $x$ (unlike in $L^2$, where functions are only defined almost everywhere). The trade-off, restricting the infinite-dimensional space to gain regularity, is the foundation of Non-Parametric Statistics. It also establishes the equivalence between Positive Definite Kernels, Reproducing Kernel Hilbert Spaces, and Green's Functions of Differential Operators.


1. The Moore-Aronszajn Theorem

1.1 Reproducing Kernels

Let $\mathcal{H}$ be a Hilbert space of real-valued functions on a non-empty set $\mathcal{X}$. $\mathcal{H}$ is called a Reproducing Kernel Hilbert Space (RKHS) if the evaluation functional $\delta_x: f \mapsto f(x)$ is a bounded linear functional for every $x \in \mathcal{X}$. By the Riesz Representation Theorem, every bounded linear functional is represented by a unique element of the Hilbert space. Thus, for each $x \in \mathcal{X}$, there exists a unique function $K_x \in \mathcal{H}$ such that:

$$f(x) = \langle f, K_x \rangle_\mathcal{H} \quad \forall f \in \mathcal{H}$$

The Reproducing Kernel is defined as $K(x, y) = \langle K_x, K_y \rangle_\mathcal{H}$. From this definition, we derive the Reproducing Property:

$$K(x, \cdot) = K_x \implies \langle K(\cdot, x), K(\cdot, y) \rangle_\mathcal{H} = K(x, y)$$

Specifically, the norm of the evaluator is $\|K_x\|^2 = K(x, x) \ge 0$.

1.2 Theorem Statement & Proof Construction

Theorem (Moore-Aronszajn, 1950): To every symmetric, positive definite kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, there corresponds a unique RKHS $\mathcal{H}_k$ with $k$ as its reproducing kernel.

Proof Construction:

  1. Pre-Hilbert Space: Consider the set of all finite linear combinations of the kernel centered at points in $\mathcal{X}$: $\mathcal{H}_0 = \left\{ f(\cdot) = \sum_{i=1}^n \alpha_i k(x_i, \cdot) : n \in \mathbb{N},\ \alpha_i \in \mathbb{R},\ x_i \in \mathcal{X} \right\}$
  2. Inner Product: Define the inner product between $f = \sum_i \alpha_i k_{x_i}$ and $g = \sum_j \beta_j k_{y_j}$ as $\langle f, g \rangle_{\mathcal{H}_0} = \sum_{i=1}^n \sum_{j=1}^m \alpha_i \beta_j k(x_i, y_j)$. This is well-defined and positive definite because $k$ is a positive definite kernel (see the sketch after this list).
  3. Completion: The space $\mathcal{H}_0$ might not be complete (i.e., not a Hilbert space). We define $\mathcal{H}_k$ as the completion of $\mathcal{H}_0$ with respect to the norm $\|f\| = \sqrt{\langle f, f \rangle}$. Crucially, because evaluations are bounded ($|f(x)| \le \sqrt{k(x,x)}\, \|f\|$), Cauchy sequences in this norm converge pointwise to well-defined functions. Thus, the completion consists of functions, not abstract equivalence classes.
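
A quick numerical sanity check of this construction, using the Gaussian kernel as a stand-in for a generic positive definite $k$ (the kernel choice, coefficients, and centers below are arbitrary illustrations): elements of $\mathcal{H}_0$ are coefficient vectors paired with centers, the inner product is computed from the Gram values, and evaluating $f$ at a point agrees with taking the inner product against $k_t$.

```python
import numpy as np

def k(x, y, gamma=1.0):
    # A positive definite kernel (Gaussian); any PD kernel would do here.
    return np.exp(-gamma * (x - y) ** 2)

# Two elements of H_0:  f = sum_i alpha_i k(x_i, .)   and   g = sum_j beta_j k(y_j, .)
alpha, xs = np.array([1.0, -0.5, 2.0]), np.array([0.0, 0.3, 1.0])
beta,  ys = np.array([0.7, 1.2]),       np.array([-0.2, 0.5])

# <f, g>_{H_0} = sum_ij alpha_i beta_j k(x_i, y_j), and ||f||^2 = alpha^T K alpha >= 0
inner_fg  = alpha @ k(xs[:, None], ys[None, :]) @ beta
norm_f_sq = alpha @ k(xs[:, None], xs[None, :]) @ alpha

# Reproducing property inside H_0: f(t) equals <f, k_t> (k_t has a single coefficient 1 at t)
t = 0.4
f_at_t      = float(alpha @ k(xs, t))
f_via_inner = float(alpha @ k(xs[:, None], np.array([t])[None, :]) @ np.array([1.0]))
print(inner_fg, norm_f_sq, np.isclose(f_at_t, f_via_inner))
```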

1.3 Bochner’s Theorem (Shift-Invariant Kernels)

For $\mathcal{X} = \mathbb{R}^d$, a kernel is shift-invariant if $k(x, y) = \psi(x - y)$. Bochner's Theorem states that a continuous shift-invariant kernel $k(x, y) = \psi(x-y)$ is positive definite if and only if $\psi$ is the Fourier transform of a finite non-negative measure $\mu$ (the spectral measure).

$$\psi(\delta) = \int_{\mathbb{R}^d} e^{i \omega^T \delta}\, d\mu(\omega)$$

This connects RKHS theory to Harmonic Analysis.

  • The Gaussian (RBF) Kernel corresponds to a Gaussian spectral measure. Since this measure has full support, the RBF kernel can approximate any continuous function on a compact domain (Universal Approximation).
  • A band-limited spectral measure corresponds to a kernel that contains only certain frequencies (e.g., the Sinc kernel). Bochner's Theorem is also the basis of Random Fourier Features (Rahimi & Recht, 2007), which approximate kernels by sampling frequencies $\omega \sim \mu$; see the sketch below.
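
A minimal sketch of Random Fourier Features for the Gaussian kernel $k(x, y) = e^{-\gamma\|x-y\|^2}$, whose spectral measure is $\mathcal{N}(0, 2\gamma I)$. The function names and parameter choices are illustrative, not taken from any reference implementation.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=10.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * d2)

def random_fourier_features(X, D=20000, gamma=10.0, seed=0):
    """phi(x) = sqrt(2/D) cos(Wx + b), with rows of W drawn from the spectral measure."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, X.shape[1]))  # omega ~ N(0, 2*gamma I)
    b = rng.uniform(0, 2 * np.pi, size=D)                           # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

X = np.random.default_rng(1).uniform(-1, 1, size=(5, 2))
Phi = random_fourier_features(X)
print(np.max(np.abs(rbf_kernel(X, X) - Phi @ Phi.T)))  # small; shrinks as D grows
```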

2. The Representer Theorem

Why do Support Vector Machines (SVMs) and Kernel Ridge Regression solutions always look like sums of kernels centered at the data points? This is not a coincidence; it is a fundamental result of the geometry of Hilbert spaces. Consider the problem of minimizing a regularized risk functional:

$$J(f) = \sum_{i=1}^n L(y_i, f(x_i)) + \lambda\, \Omega(\|f\|_\mathcal{H})$$

where $L$ is an arbitrary loss function (e.g., squared error, hinge loss) and $\Omega$ is a strictly increasing function (usually $\Omega(z) = z^2$).

Theorem (Schölkopf, Herbrich, Smola 2001): Any minimizer $f^* \in \mathcal{H}$ of the regularized risk admits the representation:

$$f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$$

for some coefficients $\alpha_i \in \mathbb{R}$.

Proof (Orthogonal Projection): The key insight is that any function $f \in \mathcal{H}$ can be decomposed into a component lying in the subspace spanned by the data representers and a component orthogonal to it. Let $V = \operatorname{span}\{k_{x_1}, \dots, k_{x_n}\}$. Decompose $f = f_{\parallel} + f_{\perp}$, where $f_{\parallel} \in V$ and $f_{\perp} \perp V$.

  1. Data Consistency: The evaluation of $f$ at the data points depends only on the parallel component: $f(x_i) = \langle f, k_{x_i} \rangle = \langle f_{\parallel} + f_{\perp}, k_{x_i} \rangle = \langle f_{\parallel}, k_{x_i} \rangle$, because $\langle f_{\perp}, k_{x_i} \rangle = 0$ by orthogonality. Thus, the loss term $\sum_i L(y_i, f(x_i))$ is unaffected by $f_{\perp}$.
  2. Regularization: The norm satisfies $\|f\|^2 = \|f_{\parallel}\|^2 + \|f_{\perp}\|^2$. Since $\Omega$ is strictly increasing, adding any non-zero orthogonal component strictly increases the penalty cost without improving the data fit.
  3. Conclusion: The minimizer must have $f_{\perp} = 0$. Therefore, $f^* = f_{\parallel} = \sum_i \alpha_i k(x_i, \cdot)$. $\square$ A numerical check of this projection argument follows.
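
Below is a small sketch of that check (kernel, centers, and coefficients are arbitrary choices for illustration): take an $f$ built from kernels at both data points and extra centers, project it onto $V = \operatorname{span}\{k_{x_i}\}$ by solving $K_{xx} c = (f(x_1), \dots, f(x_n))^T$, and confirm that the data evaluations are unchanged while the norm can only shrink.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=5.0):
    return np.exp(-gamma * (X1[:, None] - X2[None, :]) ** 2)

rng = np.random.default_rng(0)
x_data = rng.uniform(-1, 1, size=6)                              # training inputs x_1, ..., x_n
centers = np.concatenate([x_data, rng.uniform(-1, 1, size=4)])   # extra centers: f has a part outside V
coef = rng.standard_normal(len(centers))                         # f = sum_p coef_p k(center_p, .)

K_all = rbf_kernel(centers, centers)             # Gram matrix of all centers
K_xx  = rbf_kernel(x_data, x_data)               # Gram matrix of the data representers
f_at_data = rbf_kernel(x_data, centers) @ coef   # f(x_1), ..., f(x_n)

# Orthogonal projection onto V: f_par = sum_i c_i k(x_i, .) with K_xx c = (f(x_i))_i
c = np.linalg.solve(K_xx + 1e-12 * np.eye(len(x_data)), f_at_data)

print(np.allclose(K_xx @ c, f_at_data))          # 1. the loss term sees no difference
print(c @ K_xx @ c, "<=", coef @ K_all @ coef)   # 2. ||f_par||^2 <= ||f||^2
```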

2.5 Case Study: Support Vector Machines (SVM)

The classic SVM is simply the application of the Representer Theorem to the Hinge Loss:

$$L(y, f(x)) = \max(0, 1 - y f(x))$$

The primal problem is:

$$\min_{f \in \mathcal{H}}\ \sum_{i=1}^n \max(0, 1 - y_i f(x_i)) + \frac{\lambda}{2} \|f\|_\mathcal{H}^2$$

By the Representer Theorem, $f(x) = \sum_i \alpha_i k(x_i, x)$. Why is the solution "sparse" (i.e., many $\alpha_i = 0$)? This is explained by the dual problem: the non-zero $\alpha_i$ identify the Support Vectors, the points that lie on or inside the margin. Points that are correctly classified and far from the boundary have zero influence on the solution (their Lagrange multipliers are zero). This shows that sparsity is a property of the Loss Function (hinge), while the kernel expansion is a property of the Regularizer (the squared RKHS norm). A short demonstration follows.
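
A short illustration of this sparsity using scikit-learn's SVC (this assumes scikit-learn is installed; the synthetic data and hyperparameters are arbitrary): after fitting, only the support vectors carry non-zero dual coefficients, so the learned $f$ is a kernel expansion over a small subset of the training points.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1.0, 1, -1)   # circular decision boundary

clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

# f(x) = sum_{i in SV} (alpha_i y_i) k(x_i, x) + b: only support vectors appear in the expansion
print(f"{len(clf.support_)} support vectors out of {len(X)} training points")
print("non-zero dual coefficients:", clf.dual_coef_.shape)
```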


3. Kernel Ridge Regression (Derivation)

Let’s apply the Representer Theorem to Squared Error Loss:

$$L(y, f(x)) = (y - f(x))^2$$

The objective functional becomes:

$$J(f) = \|y - \mathbf{f}\|_2^2 + \lambda \|f\|_\mathcal{H}^2$$

where $\mathbf{f} = (f(x_1), \dots, f(x_n))^T$ is the vector of evaluations at the training points.

Substituting $f(x) = \sum_j \alpha_j k(x_j, x)$, we have:

  1. Data Fit: $f(x_i) = [K \alpha]_i$. Thus $\|y - \mathbf{f}\|_2^2 = \|y - K\alpha\|_2^2$.
  2. Regularizer: $\|f\|_\mathcal{H}^2 = \langle \sum_i \alpha_i k_{x_i}, \sum_j \alpha_j k_{x_j} \rangle = \alpha^T K \alpha$.

The optimization problem is now a finite-dimensional quadratic in $\alpha$:

$$J(\alpha) = (y - K\alpha)^T (y - K\alpha) + \lambda\, \alpha^T K \alpha$$

Taking the gradient w.r.t. $\alpha$:

$$\nabla_\alpha J = -2K(y - K\alpha) + 2\lambda K \alpha$$

Setting this to zero and cancelling the common factor of $K$ (assuming $K$ is invertible):

$$K(K\alpha + \lambda \alpha - y) = 0 \implies (K + \lambda I)\alpha = y$$

Yielding the closed-form solution:

$$\alpha = (K + \lambda I)^{-1} y$$

The prediction at a new point $x_*$ is:

$$f(x_*) = k_*^T \alpha = k_*^T (K + \lambda I)^{-1} y$$

where $[k_*]_i = k(x_i, x_*)$.

3.5 Connection to Gaussian Processes

This result is identical to the Posterior Mean of Gaussian Process Regression with noise variance $\sigma_n^2 = \lambda$. Why? Consider a GP prior $f \sim \mathcal{GP}(0, k)$. The log-posterior is proportional to:

$$\log P(f \mid y) \propto \log P(y \mid f) + \log P(f)$$
  • Likelihood: $P(y \mid f) \propto \exp\left(-\frac{1}{2\sigma_n^2} \sum_i (y_i - f(x_i))^2\right)$.
  • Prior: The RKHS norm $\|f\|_\mathcal{H}^2$ corresponds to the "energy" of the function under the prior. Formally, $P(f) \propto \exp\left(-\frac{1}{2} \|f\|_\mathcal{H}^2\right)$.

Minimizing the negative log-posterior is equivalent to minimizing:

$$\frac{1}{2\sigma_n^2} \sum_i (y_i - f(x_i))^2 + \frac{1}{2} \|f\|_\mathcal{H}^2$$

Multiplying by $2\sigma_n^2$, we recover the Ridge Regression objective with $\lambda = \sigma_n^2$. Thus, Ridge Regularization is Bayesian Inference with a Gaussian prior. The parameter $\lambda$ controls the trade-off between the noise level and the prior signal variance. A numerical comparison follows.
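
A compact numpy check of this equivalence (the RBF kernel, noise level, and data below are arbitrary): the kernel ridge prediction with $\lambda = \sigma_n^2$ coincides with the GP posterior mean, while the GP additionally provides a predictive variance that ridge regression does not.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=10.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X).ravel() + 0.1 * rng.standard_normal(30)
X_star = np.linspace(-1, 1, 100).reshape(-1, 1)

noise_var = 0.01                                   # sigma_n^2, playing the role of lambda
K, K_star = rbf_kernel(X, X), rbf_kernel(X_star, X)
A = K + noise_var * np.eye(len(X))

# Kernel ridge prediction k_*^T (K + lambda I)^{-1} y and the GP posterior mean
# k_*^T (K + sigma_n^2 I)^{-1} y are the same expression when lambda = sigma_n^2.
f_mean = K_star @ np.linalg.solve(A, y)
# What the GP adds on top: a posterior variance (k(x*, x*) = 1 for the RBF kernel).
f_var = 1.0 - np.einsum("ij,ji->i", K_star, np.linalg.solve(A, K_star.T))
print(f_mean[:3], f_var[:3])
```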


4. Green’s Functions & Differential Operators

RKHSs are often solution spaces to differential equations. Consider the linear differential operator $P$. The Green's function $G(x, y)$ is the impulse response:

$$P\, G(\cdot, y) = \delta(\cdot - y)$$

Duality Theorem: If $k(x, y)$ is the Green's function of the self-adjoint operator $L = P^* P$, then the RKHS $\mathcal{H}_k$ corresponds to the space of functions penalized by the differential operator $P$:

$$\|f\|_\mathcal{H}^2 = \|P f\|_{L^2}^2 = \int (Pf(x))^2\, dx$$

Example: Cubic Splines

Consider the Cubic Spline Kernel $k(x, y) = \min(x, y)^2\,\bigl(3\max(x, y) - \min(x, y)\bigr)$. This generates the space of cubic splines, minimizing the curvature energy:

$$\|f\|_\mathcal{H}^2 = \int (f''(x))^2\, dx$$

Here, the operator is the second derivative $P = \frac{d^2}{dx^2}$, and the kernel is the Green's function of the biharmonic operator $\frac{d^4}{dx^4}$ (with suitable boundary conditions). This explains why spline interpolation is optimal: it minimizes the curvature energy among all interpolating functions. A minimal interpolation sketch follows.
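
A minimal interpolation sketch with the spline kernel exactly as stated above, assuming scalar inputs in $(0, 1]$ (with this form of the kernel the fitted function also satisfies $f(0) = f'(0) = 0$, a boundary condition baked into the kernel; the data and jitter are illustrative choices).

```python
import numpy as np

def cubic_spline_kernel(x1, x2):
    """k(x, y) = min(x, y)^2 (3 max(x, y) - min(x, y)) for scalar inputs."""
    a, b = np.minimum.outer(x1, x2), np.maximum.outer(x1, x2)
    return a**2 * (3 * b - a)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.05, 1.0, size=8))
y = np.sin(2 * np.pi * x)

K = cubic_spline_kernel(x, x)
alpha = np.linalg.solve(K + 1e-10 * np.eye(len(x)), y)   # tiny jitter for conditioning

x_grid = np.linspace(0.0, 1.0, 200)
f_grid = cubic_spline_kernel(x_grid, x) @ alpha          # the piecewise-cubic interpolant
print(np.allclose(cubic_spline_kernel(x, x) @ alpha, y, atol=1e-6), f_grid[:3])
```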


5. Neural Tangent Kernel (NTK)

Deep networks with infinite width converge to GPs. What is the corresponding kernel? Let $f(x; \theta)$ be the network, consider the limit of large width $m \to \infty$, and let $\theta_0$ denote the initialization. Taylor-expanding around the initialization:

$$f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^T (\theta - \theta_0)$$

This is a linear model with feature map $\phi(x) = \nabla_\theta f(x; \theta_0)$. The corresponding kernel is the Neural Tangent Kernel:

$$\Theta(x, y) = \lim_{m \to \infty} \sum_{p=1}^P \partial_{\theta_p} f(x)\, \partial_{\theta_p} f(y)$$

For ReLU networks, we can compute $\Theta$ analytically via a layer-wise recursion. Two training regimes emerge:

  1. Lazy Regime: The weights barely move. The network behaves like a fixed kernel machine with kernel $\Theta$.
  2. Rich Regime: The weights move significantly, features adapt, and the kernel evolves. The Representer Theorem applies strictly only in the Lazy regime; in the Rich regime we are doing "Feature Learning", learning the kernel itself. An empirical NTK sketch follows.
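
A sketch of the empirical NTK for a one-hidden-layer ReLU network $f(x) = \tfrac{1}{\sqrt m}\, a^T \mathrm{relu}(Wx)$ at random initialization (the architecture and the points on the unit circle are illustrative choices): the gradients with respect to both layers are written out by hand, and the resulting kernel stabilizes as the width grows, which is the lazy regime.

```python
import numpy as np

def empirical_ntk(X, width, seed=0):
    """Theta(x, x') = sum_p d_p f(x) d_p f(x') for f(x) = (1/sqrt(m)) a^T relu(W x)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((width, X.shape[1]))
    a = rng.standard_normal(width)
    Z = X @ W.T                                            # (n, m) pre-activations
    act = np.maximum(Z, 0.0)                               # relu(Z) = sqrt(m) * df/da_j
    ind = (Z > 0).astype(float)                            # relu'(Z)
    theta_a = act @ act.T / width                          # contribution of output weights a
    theta_w = (X @ X.T) * ((ind * a**2) @ ind.T) / width   # contribution of hidden weights W
    return theta_a + theta_w

angles = np.linspace(0, np.pi, 5)
X = np.stack([np.cos(angles), np.sin(angles)], axis=1)     # points on the unit circle
print(np.round(empirical_ntk(X, width=100), 2))
print(np.round(empirical_ntk(X, width=100_000), 2))        # nearly width-independent: lazy regime
```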

6. Mercer’s Theorem (Spectral Decomposition)

If $\mathcal{X}$ is compact and $k$ is continuous, define the integral operator on $L^2(\mu)$:

$$T_k f(x) = \int_\mathcal{X} k(x, y) f(y)\, d\mu(y)$$

Since $k$ is PSD, $T_k$ is a compact, self-adjoint, positive operator. By the Spectral Theorem for compact operators, it has the eigendecomposition:

$$k(x, y) = \sum_{j=1}^\infty \lambda_j e_j(x) e_j(y)$$

where $\lambda_j \ge 0$ and the $e_j$ are orthonormal eigenfunctions in $L^2(\mu)$. The RKHS norm is:

$$\|f\|_\mathcal{H}^2 = \sum_{j=1}^\infty \frac{\langle f, e_j \rangle_{L^2}^2}{\lambda_j}$$

Only functions whose coefficients decay fast enough (their squares must decay faster than $\lambda_j$) are in the RKHS. This explains the smoothness: high-frequency eigenfunctions usually have small eigenvalues, so to have finite norm, $f$ must have very little energy at high frequencies. A numerical illustration follows.
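
The decay can be seen numerically: the eigenvalues of the Gram matrix, scaled by $1/n$, approximate the spectrum of the integral operator under the empirical measure. A short sketch comparing a Gaussian and a Laplacian kernel on a grid (the bandwidths are arbitrary):

```python
import numpy as np

n = 300
x = np.linspace(-1, 1, n).reshape(-1, 1)

K_rbf = np.exp(-10.0 * (x - x.T) ** 2)        # Gaussian kernel: very smooth functions
K_lap = np.exp(-10.0 * np.abs(x - x.T))       # Laplacian kernel: merely Lipschitz functions

ev_rbf = np.sort(np.linalg.eigvalsh(K_rbf / n))[::-1]   # approximate lambda_j
ev_lap = np.sort(np.linalg.eigvalsh(K_lap / n))[::-1]

for j in (0, 4, 9, 19, 39):
    print(f"lambda_{j+1:2d}:  RBF {ev_rbf[j]:.2e}   Laplacian {ev_lap[j]:.2e}")
# The Gaussian spectrum collapses (near-)exponentially; the Laplacian decays only polynomially.
```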


7. Numerical Implementation (Kernel Ridge)

```python
import numpy as np
import matplotlib.pyplot as plt

def rbf_kernel(X1, X2, gamma=10.0):
    diffs = X1[:, np.newaxis, :] - X2[np.newaxis, :, :]
    dists_sq = np.sum(diffs**2, axis=2)
    return np.exp(-gamma * dists_sq)

def kernel_ridge_regression(X_train, y_train, X_test, lambd=1e-5):
    # Gram matrix of the training data
    K = rbf_kernel(X_train, X_train)
    n = len(X_train)
    # Solve (K + lambda I) alpha = y
    alpha = np.linalg.solve(K + lambd * np.eye(n), y_train)
    # Predict: f(x*) = k_*^T alpha
    K_star = rbf_kernel(X_test, X_train)
    f_star = K_star @ alpha
    return f_star

def demo_representer():
    # Data: noisy sine
    np.random.seed(42)
    X = np.sort(np.random.rand(20, 1) * 2 - 1, axis=0)
    y = np.sin(3 * np.pi * X).ravel() + 0.1 * np.random.randn(20)
    X_test = np.linspace(-1, 1, 200).reshape(-1, 1)

    # Fit models with different regularization strengths
    preds_tight = kernel_ridge_regression(X, y, X_test, lambd=1e-5)  # interpolates
    preds_loose = kernel_ridge_regression(X, y, X_test, lambd=1.0)   # smooth

    plt.figure(figsize=(10, 6))
    plt.scatter(X, y, c='r', label='Data')
    plt.plot(X_test, preds_tight, label='Lambda=1e-5 (Interp)')
    plt.plot(X_test, preds_loose, label='Lambda=1.0 (Smooth)')
    plt.title('Representer Theorem in Action')
    plt.legend()
    # plt.show()

    # Observation:
    # Small lambda -> the function passes through the points (noise interpolation, high norm).
    # Large lambda -> the function is flatter (low norm).
    # Both are just sums of Gaussians centered at the data points.
```

8. Kernel Mean Embeddings and MMD

Can we map entire probability distributions into the RKHS? Yes. The Kernel Mean Embedding of a distribution $P$ is defined as:

$$\mu_P = \mathbb{E}_{x \sim P} [k(x, \cdot)] = \int_\mathcal{X} k(x, \cdot)\, dP(x)$$

If the kernel is Characteristic (e.g., Gaussian), this mapping is injective: the mean embedding $\mu_P$ uniquely identifies the distribution $P$. This allows us to define a distance between distributions $P$ and $Q$ purely in the RKHS:

Maximum Mean Discrepancy (MMD)

$$\mathrm{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|_\mathcal{H}^2$$

Expanded via the inner product:

$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}[k(x, x')] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}[k(x, y)] + \mathbb{E}_{y, y' \sim Q}[k(y, y')]$$

This is a powerful metric for Two-Sample Testing (checking whether two datasets come from the same distribution) and for training Generative Models (MMD-GANs), as it has a closed-form unbiased estimator that is differentiable. A sketch of the estimator follows.
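
A sketch of the unbiased estimator with a Gaussian kernel (the bandwidth and the synthetic distributions are arbitrary): the statistic is near zero when both samples come from the same distribution and clearly positive when the means differ.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * d2)

def mmd2_unbiased(X, Y, gamma=0.5):
    """Unbiased estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    Kxx, Kyy, Kxy = rbf_kernel(X, X, gamma), rbf_kernel(Y, Y, gamma), rbf_kernel(X, Y, gamma)
    n, m = len(X), len(Y)
    e_pp = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))   # E[k(x, x')], diagonal excluded
    e_qq = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return e_pp + e_qq - 2 * Kxy.mean()

rng = np.random.default_rng(0)
P1 = rng.normal(0.0, 1.0, size=(500, 1))
P2 = rng.normal(0.0, 1.0, size=(500, 1))   # same distribution as P1
Q  = rng.normal(0.5, 1.0, size=(500, 1))   # shifted mean

print("same distribution:", mmd2_unbiased(P1, P2))   # close to 0
print("shifted mean:     ", mmd2_unbiased(P1, Q))    # clearly positive
```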


9. Pathologies of Infinite Dimensions

If the RKHS is infinite-dimensional (like the RBF kernel's), the spectrum satisfies $\lambda_j \to 0$. The regularized inverse $(T_k + \lambda I)^{-1}$ is bounded, but $T_k^{-1}$ is unbounded. This makes the inverse problem Ill-Posed: small noise in the observations can lead to massive errors in $f$ if we don't regularize ($\lambda = 0$ leads to overfitting). There is also the "Curse of Kernelization": for large $n$, computing $K$ is $O(n^2)$ and inverting it is $O(n^3)$, so we need approximations (Nyström, Random Features). Random Fourier Features (Rahimi-Recht) approximate the integral defining $k$ (Bochner) with Monte Carlo:

$$\phi(x) = \sqrt{\tfrac{2}{D}}\,\bigl[\cos(\omega_1^T x + b_1), \dots, \cos(\omega_D^T x + b_D)\bigr], \qquad \omega_i \sim \mu,\ b_i \sim \mathrm{Unif}[0, 2\pi]$$

This turns a Non-Parametric problem back into a Parametric one (Linear Regression in Random Feature space).


10. Conclusion

The theory of Reproducing Kernel Hilbert Spaces gives rigorous mathematical footing to the intuition that “similarity” drives prediction. By mapping data into infinite-dimensional spaces where evaluation is continuous, we gain the ability to apply powerful tools from linear algebra (like projections and spectral decompositions) to non-linear functions.

From the classical foundations of Moore-Aronszajn and Mercer to the modern revelations of the Neural Tangent Kernel, the duality between Kernels (Data) and Operators (Functionals) remains central. Whether implicitly in a Neural Network or explicitly in a Gaussian Process, the Kernel defines the inductive bias of the model—determining which functions are “simple” and which are “complex.”


Historical Timeline

| Year | Event | Significance |
| --- | --- | --- |
| 1909 | James Mercer | Proves the spectral decomposition of positive integral operators. |
| 1930s | Sergei Sobolev | Introduces distributions and generalized functions. |
| 1950 | Nachman Aronszajn | Formalizes RKHS theory and proves the Moore-Aronszajn theorem. |
| 1970s | Grace Wahba | Connects Splines to RKHS and introduces Generalized Cross-Validation. |
| 1992 | Boser, Guyon, Vapnik | Introduce the "Kernel Trick" for Support Vector Machines. |
| 1996 | Radford Neal | Shows infinite neural networks converge to Gaussian Processes. |
| 2001 | Representer Theorem | Generalized to arbitrary monotonic regularizers by Schölkopf et al. |
| 2007 | Rahimi & Recht | Introduce Random Fourier Features (via Bochner's Theorem). |
| 2018 | Jacot et al. | Introduce the Neural Tangent Kernel (NTK). |

Appendix A: Proof of RKHS Equivalence to Sobolev Spaces

Theorem: For the Matérn kernel $k_\nu$, $\mathcal{H}_k$ is isomorphic to the Sobolev space $H^s(\mathbb{R}^d)$ with $s = \nu + d/2$. Proof: The Sobolev norm is defined in the Fourier domain:

$$\|f\|_{H^s}^2 = \int (1 + \|\omega\|^2)^s\, |\hat{f}(\omega)|^2\, d\omega$$

For a shift-invariant kernel $k$ with spectrum $S(\omega)$, the RKHS norm is:

$$\|f\|_\mathcal{H}^2 = \int \frac{|\hat{f}(\omega)|^2}{S(\omega)}\, d\omega$$

(This is Mercer's theorem in the continuous domain.) For the Matérn kernel, $S(\omega) \propto (1 + \|\omega\|^2)^{-(\nu+d/2)}$. Substitute this into the RKHS norm:

$$\|f\|_\mathcal{H}^2 \propto \int |\hat{f}(\omega)|^2 (1 + \|\omega\|^2)^{\nu+d/2}\, d\omega$$

This is exactly the Sobolev $H^{\nu+d/2}$ norm. Corollary: RKHS functions are continuous if $s > d/2$ (Sobolev Embedding Theorem). Since $s = \nu + d/2$, we need $\nu + d/2 > d/2 \implies \nu > 0$. Matérn kernels with $\nu > 0$ therefore always yield continuous functions.


Appendix B: The Nyström Method

The "Curse of Kernelization" is that the kernel matrix $K$ is $n \times n$:

  • Storage: $O(n^2)$
  • Inversion: $O(n^3)$

For $n = 1{,}000{,}000$ (e.g., ImageNet scale), this is intractable. The Nyström Method approximates $K$ using a low-rank decomposition based on a subset of $m \ll n$ landmark points. Let $K_{mm}$ be the kernel matrix of the landmarks, and $K_{nm}$ the kernel evaluations between all points and the landmarks. We approximate:
$$\tilde{K} = K_{nm} K_{mm}^{-1} K_{nm}^T$$

This comes close to the optimal rank-$m$ approximation when the landmarks are chosen well, e.g., via k-means clustering or leverage score sampling. Using the Woodbury Matrix Identity, we can invert the regularized approximate kernel in $O(nm^2)$ time:

$$(\tilde{K} + \lambda I)^{-1} = \frac{1}{\lambda} \left( I - K_{nm} (\lambda K_{mm} + K_{nm}^T K_{nm})^{-1} K_{nm}^T \right)$$

This reduces the complexity from cubic to linear in $n$, making Kernel Methods scalable to millions of data points. A sketch follows.
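
A sketch of Nyström kernel ridge regression using the Woodbury formula above (uniform landmark sampling, the hyperparameters, and a small jitter on $K_{mm}$ are implementation choices, not part of the identity itself): only $m \times m$ systems are solved, so the cost is $O(nm^2)$.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=10.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * d2)

def nystrom_krr(X, y, X_test, n_landmarks=50, lambd=1e-2, gamma=10.0, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_landmarks, replace=False)    # landmark points
    K_nm = rbf_kernel(X, X[idx], gamma)                          # (n, m)
    K_mm = rbf_kernel(X[idx], X[idx], gamma)                     # (m, m)
    # alpha = (K_tilde + lambda I)^{-1} y via the Woodbury identity: only m x m solves
    A = lambd * K_mm + K_nm.T @ K_nm
    alpha = (y - K_nm @ np.linalg.solve(A, K_nm.T @ y)) / lambd
    # Predict with the same low-rank kernel approximation
    jitter = 1e-8 * np.eye(n_landmarks)
    return rbf_kernel(X_test, X[idx], gamma) @ np.linalg.solve(K_mm + jitter, K_nm.T @ alpha)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(2000, 1))
y = np.sin(4 * X).ravel() + 0.1 * rng.standard_normal(2000)
print(nystrom_krr(X, y, np.linspace(-1, 1, 5).reshape(-1, 1)))
```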


Appendix C: Regularization Operators and Inverse Problems

Why does the RKHS norm correspond to smoothness? Generally, for a shift-invariant kernel $k(x, y) = \psi(x-y)$, the inner product can be written using Fourier transforms (Plancherel's Theorem):

$$\langle f, g \rangle_\mathcal{H} = \int \frac{\hat{f}(\omega)\, \overline{\hat{g}(\omega)}}{\hat{\psi}(\omega)}\, d\omega$$

For the Gaussian kernel, $\hat{\psi}(\omega) \propto e^{-\sigma^2 \omega^2}$. The reciprocal, $1/\hat{\psi}(\omega) \propto e^{\sigma^2 \omega^2}$, grows exponentially with frequency. Thus, the norm $\|f\|_\mathcal{H}^2$ penalizes high frequencies exponentially fast, which implies that functions in the Gaussian RKHS are Analytic (extremely smooth). Generally, we can view the kernel $K$ as an integral operator $T_K$. The regularization problem $\min_\alpha \|y - T_K \alpha\|^2 + \lambda \langle \alpha, T_K \alpha \rangle$ is equivalent to solving the operator equation:

$$(T_K + \lambda I)\, \alpha = y$$

This is Tikhonov Regularization for the ill-posed inverse problem of recovering a function from noisy pointwise evaluations.


Appendix D: The Kernel Trick in Deep Learning

The link between Neural Networks and Kernels is deep.

  • Infinite Width: As shown in Section 5, infinite-width networks at initialization coincide with Gaussian Processes (Neal, 1996).
  • GD Dynamics: Under Gradient Descent, the network evolves as a Kernel Predictor with the Neural Tangent Kernel (NTK):
$$\partial_t f_t(x) = - \eta \sum_{i=1}^n \Theta_t(x, x_i)\, (f_t(x_i) - y_i)$$

In the "Lazy Regime" (very large width), $\Theta_t \approx \Theta_0$ (constant kernel). The network is just doing Kernel Ridge Regression with the NTK.

  • Feature Learning: In realistic regimes (finite width), $\Theta_t$ evolves: the kernel "aligns" with the target function. This is something standard kernel machines cannot do, since they have a fixed prior. Understanding how the kernel evolves is the frontier of Deep Learning Theory (e.g., the Maximal Update Parametrization, $\mu$P).

Appendix E: Common Kernels and their Spectral Measures

Understanding the spectral properties of kernels (via Bochner’s Theorem) gives insight into the “texture” of functions in the RKHS.

  • Gaussian (RBF): $\psi(x) = e^{-\|x\|^2 / 2\sigma^2}$. The spectrum is also Gaussian. Since it has support on all frequencies, the RKHS is dense in $C(\mathcal{X})$ for compact $\mathcal{X}$. Functions are $C^\infty$.
  • Laplacian: $\psi(x) = e^{-\|x\|_1 / \sigma}$. The spectrum is Cauchy ($\propto 1/(1+\omega^2)$ per dimension). The decay is slow (heavy tails), meaning high-frequency components are allowed. This leads to non-smooth functions (Lipschitz but not differentiable at $x = y$). Used in modeling shock waves or sharp transitions.
  • Matérn Class: The spectrum is a generalized Student-t density, $\propto (1 + \|\omega\|^2)^{-\nu - d/2}$. The parameter $\nu$ controls the rate of spectral decay and thus the differentiability of the functions. As $\nu \to \infty$, the Matérn kernel converges to the RBF.
  • Neural Tangent Kernel (ReLU): The spectrum decays polynomially, roughly as $O(\|\omega\|^{-d-1})$. This helps explain why neural networks can fit non-smooth data better than RBF kernels (which are too smooth).

Appendix F: Glossary of Terms

  • Characteristic Kernel: A kernel whose mean embedding is injective (maps distributions uniquely).
  • Gram Matrix: The $n \times n$ matrix $K_{ij} = k(x_i, x_j)$.
  • Green’s Function: The impulse response of a differential operator.
  • Mercer’s Theorem: Representation of a kernel as a sum of eigenfunctions.
  • MMD (Maximum Mean Discrepancy): A distance between distributions based on the difference of their mean embeddings.
  • Nyström Method: A low-rank approximation technique for large kernel matrices.
  • Positive Definite Function: A function that ensures the Gram matrix is always PSD.
  • RKHS: A Hilbert space where point evaluation is a continuous linear functional.
  • Shift-Invariant: A kernel that depends only on the difference $x - y$ (stationary).
  • Spectral Measure: The Fourier transform of a shift-invariant kernel (Bochner).

References

1. Schölkopf, B., & Smola, A. J. (2002). “Learning with Kernels”. The comprehensive reference for SVMs and Regularization Networks.

2. Aronszajn, N. (1950). “Theory of reproducing kernels”. The foundational math paper. Proves Moore-Aronszajn.

3. Jacot, A. et al. (2018). “Neural Tangent Kernel”. Proved that infinite width networks are Kernel Machines. Sparked the “Modern Era” of deep learning theory.

4. Rahimi, A., & Recht, B. (2007). "Random Features for Large-Scale Kernel Machines". Showed that explicit feature maps sampled from the Fourier transform can approximate kernels efficiently: $O(n)$ instead of $O(n^3)$.

5. Berlinet, A., & Thomas-Agnan, C. (2011). “Reproducing Kernel Hilbert Spaces in Probability and Statistics”. Rigorous treatment of the functional analysis aspects.

6. Wahba, G. (1990). “Spline Models for Observational Data”. The bible of splines and thin-plate splines as RKHS problems.