RKHS & The Representer Theorem

The standard introduction to kernel methods starts with a feature map $\phi$ into some high-dimensional space $\mathcal{F}$ and then defines

$$k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{F}}$$

as a computational shortcut. The functional analysis view flips this around: the kernel comes first. A positive definite kernel $K$ picks out a unique Hilbert space $\mathcal{H}_K$ where pointwise evaluation is continuous. The constraint $|f(x)| \le C_x \|f\|_{\mathcal{H}}$ forces function values at points to be well-defined, unlike $L^2$, where functions are only defined up to measure-zero sets.

The cost is generality and the payoff is regularity. This trade-off is the foundation of non-parametric statistics.

It ties together three things that look unrelated: positive definite kernels, reproducing kernel Hilbert spaces, and Green's functions of differential operators.


Moore-Aronszajn

Let $\mathcal{H}$ be a Hilbert space of real-valued functions on a set $\mathcal{X}$. Call it an RKHS when $\delta_x: f \mapsto f(x)$ is bounded for every $x \in \mathcal{X}$.

Riesz representation then hands you a unique $K_x \in \mathcal{H}$ for each $x$ with

$$f(x) = \langle f, K_x \rangle_\mathcal{H} \quad \forall f \in \mathcal{H}$$

Set $K(x, y) = \langle K_x, K_y \rangle_\mathcal{H}$ and the reproducing property falls out right away.

$$K(x, \cdot) = K_x \implies \langle K(\cdot, x), K(\cdot, y) \rangle_\mathcal{H} = K(x, y)$$

and $\|K_x\|^2 = K(x, x) \ge 0$.

Theorem (Moore-Aronszajn, 1950): Every symmetric positive definite kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ has a unique RKHS $\mathcal{H}_k$ with $k$ as its reproducing kernel.

The proof is constructive. Take all finite linear combinations of kernel sections

$$\mathcal{H}_0 = \left\{ f(\cdot) = \sum_{i=1}^n \alpha_i k(x_i, \cdot) : n \in \mathbb{N},\ \alpha_i \in \mathbb{R},\ x_i \in \mathcal{X} \right\}$$

Give it the inner product, for $f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$ and $g = \sum_{j=1}^m \beta_j k(y_j, \cdot)$,

$$\langle f, g \rangle_{\mathcal{H}_0} = \sum_{i=1}^n \sum_{j=1}^m \alpha_i \beta_j k(x_i, y_j)$$

which is well-defined because $k$ is positive definite. Complete the space. Since $|f(x)| \le \sqrt{k(x,x)}\, \|f\|$, Cauchy sequences converge pointwise to actual functions and not abstract equivalence classes. The completion is made of functions and evaluation continuity carries over to the closure. $\square$

Bochner and shift-invariant kernels

On $\mathcal{X} = \mathbb{R}^d$ a shift-invariant kernel has the form $k(x, y) = \psi(x - y)$. Bochner's theorem says $\psi$ is positive definite iff it is the Fourier transform of a finite non-negative measure $\mu$.

$$\psi(\delta) = \int_{\mathbb{R}^d} e^{i \omega^T \delta} \, d\mu(\omega)$$

The Gaussian (RBF) kernel has a Gaussian spectral measure with full-frequency support, which is what gives it universal approximation. A band-limited spectral measure gives a kernel that only sees certain frequencies; the sinc kernel is the standard example.

Rahimi and Recht in 2007 exploited this directly by sampling frequencies $\omega \sim \mu$ to get Random Fourier Features.


The Representer Theorem

SVM and kernel ridge regression solutions are always kernel expansions centered at the data points. The representer theorem explains why.

Take any regularized risk minimization problem

$$J(f) = \sum_{i=1}^n L(y_i, f(x_i)) + \lambda \Omega(\|f\|_\mathcal{H})$$

where $L$ is an arbitrary loss and $\Omega$ is strictly increasing, usually just $\Omega(z) = z^2$.

Theorem (Scholkopf, Herbrich, Smola 2001): The minimizer $f^*$ satisfies

$$f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$$

Proof. Split $f = f_{||} + f_{\perp}$ where $f_{||}$ lives in $V = \text{span}\{k_{x_1}, \dots, k_{x_n}\}$ and $f_{\perp} \perp V$.

At the data points the orthogonal part vanishes.

$$f(x_i) = \langle f, k_{x_i} \rangle = \langle f_{||}, k_{x_i} \rangle$$

So the loss term only depends on $f_{||}$. But the norm splits as follows.

$$\|f\|^2 = \|f_{||}\|^2 + \|f_{\perp}\|^2$$

Any nonzero $f_{\perp}$ pumps up the penalty without reducing the loss. So $f^* = f_{||}$. $\square$

The whole argument boils down to orthogonal projection.

SVMs as a special case

Plug in hinge loss $L(y, f(x)) = \max(0, 1 - y f(x))$ and you get

$$\min_{f \in \mathcal{H}} \sum_{i=1}^n \max(0, 1 - y_i f(x_i)) + \frac{\lambda}{2} \|f\|_\mathcal{H}^2$$

The representer theorem gives $f(x) = \sum_i \alpha_i k(x_i, x)$. Sparsity, with most $\alpha_i = 0$, comes from the loss and not the kernel. The hinge loss is flat for correctly-classified points far from the margin, so their Lagrange multipliers vanish. The kernel expansion comes from the regularizer and the sparsity comes from the loss. Two separate mechanisms.
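
To see the two mechanisms separately, here is a minimal sketch, assuming scikit-learn is available (the library choice is mine, not part of the text): fit an SVM on a precomputed RBF Gram matrix and count how many $\alpha_i$ survive as support vectors.

```python
import numpy as np
from sklearn.svm import SVC  # assumed dependency for this sketch

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # a roughly separable toy problem

# Precomputed RBF Gram matrix: the form of the expansion is fixed by the representer theorem.
gamma = 1.0
K = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

svm = SVC(C=1.0, kernel="precomputed").fit(K, y)

# Sparsity comes from the hinge loss: only support vectors keep a nonzero alpha_i
# in f(x) = sum_i alpha_i k(x_i, x) + b.
print(f"nonzero coefficients: {len(svm.support_)} of {len(X)}")
```

Swapping the hinge loss for squared error, as in the kernel ridge section below, keeps the same expansion but gives up the sparsity.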


Kernel Ridge Regression

Consider squared error loss

$$J(f) = \|y - \mathbf{f}\|_2^2 + \lambda \|f\|_\mathcal{H}^2, \qquad \mathbf{f} = (f(x_1), \dots, f(x_n))$$

Substitute $f(x) = \sum_j \alpha_j k(x_j, x)$. The data fit becomes $\|y - K\alpha\|_2^2$ and the regularizer becomes $\alpha^T K \alpha$.

$$J(\alpha) = (y - K\alpha)^T (y - K\alpha) + \lambda \alpha^T K \alpha$$

Differentiate, set to zero, and you get

$$(K + \lambda I)\alpha = y \quad\Longrightarrow\quad \alpha = (K + \lambda I)^{-1} y$$

The prediction at $x_*$ is then $f(x_*) = k_*^T (K + \lambda I)^{-1} y$, where $k_* = (k(x_1, x_*), \dots, k(x_n, x_*))$.

Connection to Gaussian process regression

This is the same expression as the GP posterior mean. A GP prior $f \sim \mathcal{GP}(0, k)$ with noise variance $\sigma_n^2$ gives a posterior mean of $k_*^T (K + \sigma_n^2 I)^{-1} y$ and this matches the kernel ridge prediction when $\lambda = \sigma_n^2$.

The negative log-posterior under a Gaussian likelihood is

$$\frac{1}{2\sigma_n^2} \sum_i (y_i - f(x_i))^2 + \frac{1}{2} \|f\|_\mathcal{H}^2$$

which is the ridge objective up to a constant. Ridge regularization is Bayesian inference with a Gaussian prior and $\lambda$ is the noise-to-signal ratio.
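
A quick numerical check of that identity, as a sketch that assumes scikit-learn's GP implementation is available (not something the text itself uses): kernel ridge with $\lambda = \sigma_n^2$ and the GP posterior mean should agree to machine precision.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor  # assumed dependency
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X).ravel() + 0.1 * rng.normal(size=30)
X_star = np.linspace(-1, 1, 50).reshape(-1, 1)

length_scale, noise_var = 0.2, 0.01          # lambda = sigma_n^2 = 0.01

# Kernel ridge regression with k(x, y) = exp(-||x - y||^2 / (2 l^2))
def k(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * length_scale**2))

alpha = np.linalg.solve(k(X, X) + noise_var * np.eye(len(X)), y)
krr_mean = k(X_star, X) @ alpha

# GP posterior mean with the same kernel and noise variance (hyperparameters held fixed).
gp = GaussianProcessRegressor(kernel=RBF(length_scale), alpha=noise_var, optimizer=None).fit(X, y)
gp_mean = gp.predict(X_star)

print(np.allclose(krr_mean, gp_mean))        # True: same formula, two derivations
```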


Green’s Functions and Differential Operators

RKHS theory ties into PDEs through Green’s functions.

Given a linear differential operator $P$, its Green's function $G$ solves $P G(\cdot, y) = \delta(\cdot - y)$. If $k(x,y)$ is the Green's function of $L = P^* P$ then the RKHS norm becomes

$$\|f\|_\mathcal{H}^2 = \|P f\|_{L^2}^2 = \int (Pf(x))^2 \, dx$$

Cubic splines are a concrete example. The kernel $k(x, y) = \min(x, y)^2 \,(3\max(x, y) - \min(x, y))$ produces functions that minimize

$$\|f\|_\mathcal{H}^2 = \int (f''(x))^2 \, dx$$

The operator is $P = d^2/dx^2$ and the kernel is the Green's function of $d^4/dx^4$. Spline interpolation solves a variational problem and finds the interpolant of minimum curvature.
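
A minimal interpolation sketch with the spline kernel quoted above (toy data of my choosing; functions in this RKHS vanish with zero slope at the origin, so the points are kept inside $(0, 1]$, and the kernel's overall scaling cancels in the solve):

```python
import numpy as np

def spline_kernel(a, b):
    # k(x, y) = min(x, y)^2 (3 max(x, y) - min(x, y)), the Green's-function kernel above
    lo, hi = np.minimum.outer(a, b), np.maximum.outer(a, b)
    return lo**2 * (3 * hi - lo)

x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y = np.sin(2 * np.pi * x)

alpha = np.linalg.solve(spline_kernel(x, x), y)     # K alpha = y: exact interpolation

xs = np.linspace(0, 1, 200)
f = spline_kernel(xs, x) @ alpha                    # f = sum_i alpha_i k(x_i, .)
print(np.allclose(spline_kernel(x, x) @ alpha, y))  # hits every data point
# Among interpolants in this RKHS, f minimizes the curvature penalty int (f'')^2 dx.
```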


Neural Tangent Kernel

Infinite-width networks at initialization converge to GPs. The question is which kernel governs training.

Taylor expand $f(x; \theta)$ around initialization $\theta_0$

$$f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^T (\theta - \theta_0)$$

This is a linear model with feature map $\phi(x) = \nabla_\theta f(x; \theta_0)$. The NTK is

$$\Theta(x, y) = \lim_{m \to \infty} \sum_{p=1}^P \partial_{\theta_p} f(x) \, \partial_{\theta_p} f(y)$$

For ReLU networks you can compute this analytically layer by layer.

Two regimes matter. In the lazy regime the network is very wide, the weights barely move, $\Theta_t \approx \Theta_0$, and training reduces to kernel ridge regression with a fixed kernel. The representer theorem applies directly. In the rich regime the network has finite width, the kernel evolves during training, and the network learns features and not just coefficients. Standard kernel theory no longer applies.
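
The kernel itself is easy to estimate at finite width. Below is a sketch (architecture and scalings are my choices, not the text's) of the empirical NTK $\sum_p \partial_{\theta_p} f(x)\, \partial_{\theta_p} f(y)$ for a one-hidden-layer ReLU network $f(x) = \tfrac{1}{\sqrt{m}} \sum_j a_j\,\mathrm{relu}(w_j^T x)$:

```python
import numpy as np

def empirical_ntk(X, m=20000, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(m, X.shape[1]))          # hidden weights w_j ~ N(0, I)
    a = rng.normal(size=m)                        # output weights a_j ~ N(0, 1)
    pre = X @ W.T                                 # (n, m) pre-activations w_j . x

    # d f / d a_j = relu(w_j . x) / sqrt(m)
    g_a = np.maximum(pre, 0.0) / np.sqrt(m)
    # d f / d w_j = a_j * 1[w_j . x > 0] * x / sqrt(m); its Gram contribution factors
    # into an indicator/output-weight term times x . x', so the gradients are never stored.
    mask = (pre > 0) * a / np.sqrt(m)

    return g_a @ g_a.T + (mask @ mask.T) * (X @ X.T)

X = np.random.default_rng(1).normal(size=(4, 3))
print(np.round(empirical_ntk(X), 3))   # stabilizes as m grows (lazy regime); fluctuates at small m
```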


Mercer’s Theorem

Let $X$ be compact and $k$ continuous. The integral operator

$$T_k f(x) = \int_X k(x, y) f(y) \, d\mu(y)$$

is compact, self-adjoint, and positive. The spectral theorem gives

$$k(x, y) = \sum_{j=1}^\infty \lambda_j e_j(x) e_j(y)$$

with $\lambda_j \ge 0$ and $\{e_j\}$ orthonormal in $L^2(\mu)$.

The RKHS norm in this basis is

$$\|f\|_\mathcal{H}^2 = \sum_{j=1}^\infty \frac{\langle f, e_j \rangle_{L^2}^2}{\lambda_j}$$

High-frequency eigenfunctions usually have small $\lambda_j$. For a finite RKHS norm the high-frequency coefficients have to decay accordingly. This pins down exactly the smoothness the kernel imposes.
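
A finite-sample way to see this decay is to eigendecompose an RBF Gram matrix on a grid; scaled by $1/n$, its eigenvalues approximate the leading Mercer eigenvalues (a rough sketch, grid and length scale are arbitrary choices):

```python
import numpy as np

n = 400
x = np.linspace(0.0, 1.0, n).reshape(-1, 1)
K = np.exp(-((x - x.T) ** 2) / (2 * 0.1**2))   # RBF kernel, length scale 0.1

lam = np.linalg.eigvalsh(K)[::-1] / n          # descending; lam_j approximates the Mercer spectrum
print(lam[:8])                                  # fast, near-geometric decay
print(lam[40] / lam[0])                         # tiny tail: rough functions have enormous RKHS norm
```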


Implementation

```python
import numpy as np
import matplotlib.pyplot as plt

def rbf_kernel(X1, X2, gamma=10.0):
    diffs = X1[:, np.newaxis, :] - X2[np.newaxis, :, :]
    dists_sq = np.sum(diffs**2, axis=2)
    return np.exp(-gamma * dists_sq)

def kernel_ridge_regression(X_train, y_train, X_test, lambd=1e-5):
    # K matrix
    K = rbf_kernel(X_train, X_train)
    n = len(X_train)
    # Solve (K + lambda I) alpha = y
    alpha = np.linalg.solve(K + lambd * np.eye(n), y_train)
    # Predict
    K_star = rbf_kernel(X_test, X_train)
    f_star = K_star @ alpha
    return f_star

def demo_representer():
    # Data: noisy sine
    np.random.seed(42)
    X = np.sort(np.random.rand(20, 1) * 2 - 1, axis=0)
    y = np.sin(3 * np.pi * X).ravel() + 0.1 * np.random.randn(20)
    X_test = np.linspace(-1, 1, 200).reshape(-1, 1)

    # Fit models with different regularization
    preds_tight = kernel_ridge_regression(X, y, X_test, lambd=1e-5)  # interpolates
    preds_loose = kernel_ridge_regression(X, y, X_test, lambd=1.0)   # smooth

    plt.figure(figsize=(10, 6))
    plt.scatter(X, y, c='r', label='Data')
    plt.plot(X_test, preds_tight, label='Lambda=1e-5 (Interp)')
    plt.plot(X_test, preds_loose, label='Lambda=1.0 (Smooth)')
    plt.title('Representer Theorem in Action')
    plt.legend()
    # plt.show()

    # Observation:
    # Small lambda -> function passes through the points (noise interpolation, high norm).
    # Large lambda -> function is flatter (low norm).
    # Both are just sums of Gaussians centered at the data points.

if __name__ == "__main__":
    demo_representer()
```

Kernel Mean Embeddings and MMD

You can embed whole distributions into the RKHS. The mean embedding of $P$ is

$$\mu_P = \mathbb{E}_{x \sim P} [k(x, \cdot)] = \int_\mathcal{X} k(x, \cdot) \, dP(x)$$

If the kernel is characteristic (the Gaussian kernel qualifies), this embedding is injective and $\mu_P$ pins down $P$ uniquely.

This gives you a natural distance between distributions.

$$\text{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|_\mathcal{H}^2$$

Expanding it gives

$$\text{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}[k(x, x')] - 2\,\mathbb{E}_{x \sim P, y \sim Q}[k(x, y)] + \mathbb{E}_{y, y' \sim Q}[k(y, y')]$$

This is useful for two-sample testing and for generative models like MMD-GANs. The unbiased estimator is closed-form and differentiable and that is rare for distribution distances.
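
A sketch of the unbiased estimator in numpy (toy Gaussians of my choosing; the within-sample averages drop the diagonal terms, which is what makes the estimator unbiased):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))

def mmd2_unbiased(X, Y, gamma=1.0):
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf(X, X, gamma), rbf(Y, Y, gamma), rbf(X, Y, gamma)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))   # E[k(x, x')] with x != x'
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(0)
P, P2 = rng.normal(size=(500, 2)), rng.normal(size=(500, 2))
Q = rng.normal(size=(500, 2)) + 0.5
print(mmd2_unbiased(P, P2))   # near zero: same distribution
print(mmd2_unbiased(P, Q))    # clearly positive: the embedding separates P and Q
```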


Ill-posedness and Scalability

An infinite-dimensional RKHS like the RBF one means $\lambda_j \to 0$ and $T_k^{-1}$ is unbounded. Without regularization, small perturbations in the observations fire back as large oscillations in $f$; this is the standard ill-posed inverse problem.

Computation is a separate issue. The Gram matrix is $n \times n$ and needs $O(n^2)$ storage and $O(n^3)$ to invert. For $n$ in the millions, direct methods are infeasible.

Random Fourier Features from Rahimi and Recht handle this by Monte Carlo approximation of the Bochner integral.

$$\phi(x) = \sqrt{\frac{1}{D}}\left[\cos(\omega_1^T x), \sin(\omega_1^T x), \dots, \cos(\omega_D^T x), \sin(\omega_D^T x)\right]$$

This reduces a non-parametric problem to linear regression in a random feature space. The Nystrom method is the other main approach and it uses a low-rank approximation of $K$ through landmark points.
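
A sketch of the RFF construction for the RBF kernel $\exp(-\gamma\|x-y\|^2)$, whose spectral measure is $\mathcal{N}(0, 2\gamma I)$ (the feature layout here stacks all cosines then all sines; the inner product is the same as the interleaved form above):

```python
import numpy as np

def rff_features(X, D=500, gamma=10.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, X.shape[1]))   # omega_j ~ spectral measure
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

X = np.random.default_rng(1).uniform(-1, 1, size=(300, 2))
Z = rff_features(X)                                                   # (n, 2D) explicit feature map
K_exact = np.exp(-10.0 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
print(np.abs(Z @ Z.T - K_exact).max())    # shrinks at the Monte Carlo rate as D grows
```

Ridge regression on these features is then an ordinary linear solve, roughly $O(nD^2)$ instead of $O(n^3)$.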


RKHS-Sobolev equivalence

Theorem: For the Matern kernel $k_\nu$ the space $\mathcal{H}_k$ is isomorphic to the Sobolev space $H^s(\mathbb{R}^d)$ with $s = \nu + d/2$.

The Sobolev norm in the Fourier domain is

$$\|f\|_{H^s}^2 = \int (1 + \|\omega\|^2)^s |\hat{f}(\omega)|^2 \, d\omega$$

For a shift-invariant kernel with spectrum $S(\omega)$ the RKHS norm is

$$\|f\|_\mathcal{H}^2 = \int \frac{|\hat{f}(\omega)|^2}{S(\omega)} \, d\omega$$

Matern has $S(\omega) \propto (1 + \|\omega\|^2)^{-(\nu+d/2)}$ and substituting gives

$$\|f\|_\mathcal{H}^2 = \int |\hat{f}(\omega)|^2 (1 + \|\omega\|^2)^{\nu+d/2} \, d\omega$$

which is exactly the $H^{\nu+d/2}$ norm.

Corollary: RKHS functions are continuous when $s > d/2$ by Sobolev embedding. Since $s = \nu + d/2$ you need $\nu > 0$. All Matern kernels with $\nu > 0$ give continuous functions.


The Nystrom approximation

Approximate $K$ with $m \ll n$ landmark points. Let $K_{mm}$ be the landmark kernel matrix and $K_{nm}$ the cross-kernel matrix. Then

$$\tilde{K} = K_{nm} K_{mm}^{-1} K_{nm}^T$$

Woodbury gives you the regularized inverse in $O(nm^2)$.

$$(\tilde{K} + \lambda I)^{-1} = \frac{1}{\lambda} \left( I - K_{nm} (\lambda K_{mm} + K_{nm}^T K_{nm})^{-1} K_{nm}^T \right)$$

This is linear in $n$ instead of cubic. Pick the landmarks by K-means or leverage score sampling.
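
A sketch of Nystrom kernel ridge regression using exactly the Woodbury identity above (random landmarks and a small jitter on $K_{mm}$ are my choices):

```python
import numpy as np

def rbf(A, B, gamma=10.0):
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
n, m, lam = 2000, 100, 1e-2
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * np.pi * X).ravel() + 0.1 * rng.normal(size=n)

idx = rng.choice(n, size=m, replace=False)            # landmark points
K_nm = rbf(X, X[idx])
K_mm = rbf(X[idx], X[idx]) + 1e-8 * np.eye(m)         # jitter for numerical stability

# alpha = (K_tilde + lam I)^{-1} y via Woodbury: only an m x m system is solved.
inner = np.linalg.solve(lam * K_mm + K_nm.T @ K_nm, K_nm.T @ y)
alpha = (y - K_nm @ inner) / lam

X_test = np.linspace(-1, 1, 200).reshape(-1, 1)
f_test = rbf(X_test, X) @ alpha                       # kernel expansion with the approximate alpha
```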


Why the RKHS norm measures smoothness

For $k(x, y) = \psi(x-y)$ shift-invariant, the inner product comes from Plancherel's theorem.

$$\langle f, g \rangle_\mathcal{H} = \int \frac{\hat{f}(\omega) \overline{\hat{g}(\omega)}}{\hat{\psi}(\omega)} \, d\omega$$

For the Gaussian kernel $\hat{\psi}(\omega) = e^{-\sigma^2 \omega^2}$. The inverse $e^{\sigma^2 \omega^2}$ grows exponentially with frequency. So $\|f\|_\mathcal{H}^2$ slams high frequencies hard and the RKHS functions come out analytic.

More generally the regularization problem $\min \|y - T_K \alpha\|^2 + \lambda \langle \alpha, T_K \alpha \rangle$ reduces to

$$(T_K + \lambda I) \alpha = y$$

which is Tikhonov regularization for recovering a function from noisy point evaluations and this ties RKHS directly to classical regularization theory.


Kernels and deep learning

Three facts fit together.

First, infinite-width networks at initialization are GPs; Neal showed this in 1996. Second, under gradient descent the network evolves as a kernel predictor.

$$\partial_t f_t(x) = - \eta \sum_{i=1}^n \Theta_t(x, x_i) (f_t(x_i) - y_i)$$

In the lazy regime where the network is very wide, $\Theta_t \approx \Theta_0$ stays constant and training reduces to kernel ridge regression with the NTK (simulated in the sketch after these three facts).

Third, at finite width $\Theta_t$ changes during training. The kernel lines up with the target function; this is a form of feature learning that fixed-kernel methods cannot replicate. How and why the kernel evolves is still a central open problem in deep learning theory.
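
The lazy-regime dynamics above are easy to simulate on the training set alone: with a frozen kernel, discrete gradient steps on $f_t(x_i)$ follow the ODE and the residual decays along the kernel's eigendirections. A sketch with a stand-in RBF Gram matrix playing the role of $\Theta_0$ (not an actual network NTK):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 1))
y = np.sin(3 * X).ravel()
Theta = np.exp(-5.0 * (X - X.T) ** 2)     # frozen kernel standing in for Theta_0

f = np.zeros(len(y))                       # predictions at the training points, zero init
eta = 0.05
residuals = []
for _ in range(2000):
    f -= eta * Theta @ (f - y)             # Euler step of  d f / dt = -eta Theta (f - y)
    residuals.append(np.linalg.norm(f - y))

print(residuals[0], residuals[-1])         # monotone decrease; slow modes are small kernel eigenvalues
```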


Kernels and their spectra

The spectral measure from Bochner's theorem is what picks out which functions live in the RKHS.

Gaussian (RBF): $\psi(x) = e^{-\|x\|^2 / 2\sigma^2}$. Gaussian spectrum and full-frequency support. The RKHS is dense in $C(\mathcal{X})$ and functions are $C^\infty$.

Laplacian: $\psi(x) = e^{-\|x\|_1 / \sigma}$. Cauchy spectrum with heavy tails. High frequencies are tolerated. The kernel is not differentiable at $x = y$, so the RKHS admits much rougher functions. Well-suited to sharp transitions.

Matern: Generalized Student-t spectrum $(1 + \|\omega\|^2)^{-\nu - d/2}$. The parameter $\nu$ controls differentiability and $\nu \to \infty$ recovers the RBF kernel.

NTK (ReLU): Polynomial spectral decay, roughly $O(\|\omega\|^{-d-1})$. This explains why neural networks handle non-smooth targets better than RBF kernels: the RBF prior forces too much smoothness.


How RKHS theory developed

The story starts with James Mercer's 1909 proof of the spectral decomposition of positive integral operators, a result in pure analysis with no tie to machine learning. Two decades later, in the 1930s, Sergei Sobolev introduced distributions and generalized functions and laid the functional-analytic groundwork. The decisive step came in 1950 when Nachman Aronszajn formalized RKHS theory, proved the Moore-Aronszajn theorem, and pinned down the one-to-one correspondence between positive definite kernels and reproducing kernel Hilbert spaces. Grace Wahba, through the 1970s, connected spline smoothing to RKHS theory, introduced Generalized Cross-Validation, and made the theory practical for statistics.

The machine learning era kicked off in 1992 when Boser, Guyon, and Vapnik introduced the kernel trick for support vector machines and turned RKHS from a theoretical curiosity into a computational tool. Radford Neal showed in 1996 that infinite neural networks converge to Gaussian processes, hinting at the deep connection between kernels and neural architectures. The representer theorem was generalized to arbitrary monotonic regularizers by Scholkopf et al. in 2001. Rahimi and Recht introduced Random Fourier Features in 2007 and made kernel methods scalable by leaning on Bochner's theorem. Most recently, Jacot et al. in 2018 introduced the Neural Tangent Kernel and closed the circle by showing that wide neural networks are, in a precise sense, kernel machines.


Key terms

  • Characteristic Kernel: A kernel whose mean embedding is injective, meaning it uniquely identifies distributions.
  • Gram Matrix: The $n \times n$ matrix $K_{ij} = k(x_i, x_j)$ formed by evaluating the kernel at all pairs of data points.
  • Green’s Function: The impulse response of a differential operator. When the kernel is a Green’s function, the RKHS norm penalizes derivatives.
  • Mercer’s Theorem: The eigenfunction expansion $k(x,y) = \sum_j \lambda_j e_j(x) e_j(y)$, which decomposes a kernel into its spectral components.
  • MMD: Maximum Mean Discrepancy, a distance between distributions computed via their mean embeddings in an RKHS.
  • Nystrom Method: A low-rank approximation of the Gram matrix using a subset of landmark points, reducing $O(n^3)$ to $O(nm^2)$.
  • Positive Definite Function: A function $k$ such that the Gram matrix is PSD for every finite point set.
  • RKHS: A Hilbert space of functions in which pointwise evaluation $f \mapsto f(x)$ is a continuous linear functional.
  • Shift-Invariant: A kernel of the form $k(x,y) = \psi(x - y)$, depending only on the difference.
  • Spectral Measure: The Fourier transform of a shift-invariant kernel, which determines which frequencies the RKHS can represent.

References

1. Scholkopf, B., & Smola, A. J. (2002). “Learning with Kernels”. The comprehensive reference for SVMs and Regularization Networks.

2. Aronszajn, N. (1950). “Theory of reproducing kernels”. The foundational math paper. Proves Moore-Aronszajn.

3. Jacot, A. et al. (2018). “Neural Tangent Kernel”. Proved that infinite-width networks are kernel machines. Sparked the modern era of deep learning theory.

4. Rahimi, A., & Recht, B. (2007). “Random Features for Large-Scale Kernel Machines”. Showed that explicit feature maps sampled from the Fourier transform can approximate kernels efficiently. $O(n)$ instead of $O(n^3)$.

5. Berlinet, A., & Thomas-Agnan, C. (2011). “Reproducing Kernel Hilbert Spaces in Probability and Statistics”. Rigorous treatment of the functional analysis aspects.

6. Wahba, G. (1990). “Spline Models for Observational Data”. The bible of splines and thin-plate splines as RKHS problems.