Wasserstein Gradient Flows
Gradient descent minimizes a function by following the path of steepest descent in Euclidean space. But what if the object of our optimization is not a point in $\mathbb{R}^n$, but a Probability Distribution $\rho$? Standard metrics like the $L^2$ norm or the KL divergence fail to capture the physical reality that mass must be "moved" across space. For instance, the $L^2$ gradient flow of an energy often yields non-physical "pointwise" decay, ignoring the global configuration of the mass. This is akin to trying to rearrange the furniture in a room by turning some chairs invisible and hoping others appear in the right spots, rather than physically carrying them across the floor.

The Wasserstein-2 Metric ($W_2$) provides the necessary "horizontal" structure for this physical transport. Under this metric, the space of probability measures $\mathcal{P}_2(\mathbb{R}^d)$ becomes an infinite-dimensional Riemannian manifold, often called the Wasserstein Manifold. As discovered in the late 1990s by Jordan, Kinderlehrer, and Otto (JKO), many fundamental PDEs—including the Heat Equation, the Fokker-Planck equation, and the Porous Medium Equation—are simply gradient flows of physical functionals (Entropy, Internal Energy) in this Wasserstein geometry.

This realization has profound implications. It means that the mathematical language of optimization can be used to solve complex PDEs, and the geometric language of manifolds can be used to understand the stability of stochastic processes. This post explores Otto Calculus, deriving the geometric machinery required to perform optimization, calculus, and stability analysis on the manifold of measures.
1. The Dynamic Formulation: Benamou-Brenier
The traditional Kantorovich definition of Optimal Transport is static—it finds a coupling but says nothing about the path. In 2000, Jean-David Benamou and Yann Brenier revolutionized the field by viewing transport as a continuous Fluid Flow.
1.1 The Continuity Equation
Consider a density $\rho_t(x)$ evolving over time $t$. If the density is moved by a velocity field $v_t(x)$, the conservation of mass is governed by the Continuity Equation:

$$\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0.$$

This equation ensures that no mass is created or destroyed; any change in local density must be accounted for by the flux $\rho_t v_t$.
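As a quick numerical illustration (a minimal sketch, not part of the derivation), the snippet below advects a 1D density with a first-order upwind discretization of the continuity equation and checks that the total mass $\int \rho\, dx$ is conserved; the grid, time step, and constant velocity field are arbitrary choices made for this example.

```python
import numpy as np

# Minimal sketch: discretize d_t rho + d_x (rho * v) = 0 with a first-order
# upwind scheme (periodic boundary) and verify that the total mass is conserved.
nx, dx, dt, n_steps = 400, 0.02, 0.005, 400
x = np.arange(nx) * dx
rho = np.exp(-((x - 2.0) ** 2) / 0.1)        # initial bump
rho /= np.sum(rho) * dx                       # normalize to unit mass
v = 0.5                                       # constant rightward velocity (assumed)

mass_before = np.sum(rho) * dx
for _ in range(n_steps):
    flux = rho * v                            # mass flux rho * v
    rho = rho - dt / dx * (flux - np.roll(flux, 1))   # upwind difference for v > 0
mass_after = np.sum(rho) * dx
print(f"mass before: {mass_before:.12f}, after: {mass_after:.12f}")
```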
1.2 Action and the Riemannian Metric
The Benamou-Brenier theorem states that the squared $W_2$ distance is the minimum kinetic energy required to morph $\rho_0$ into $\rho_1$:

$$W_2^2(\rho_0, \rho_1) = \min_{(\rho_t, v_t)} \left\{ \int_0^1 \!\!\int |v_t(x)|^2\, \rho_t(x)\, dx\, dt \;:\; \partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0,\ \rho_{t=0} = \rho_0,\ \rho_{t=1} = \rho_1 \right\}.$$

This minimization reveals the Riemannian structure of $\mathcal{P}_2(\mathbb{R}^d)$. We can think of $\rho_t$ as a point on the manifold and $v_t$ as a vector in the Tangent Space $T_{\rho_t}\mathcal{P}_2$.
1.3 The Tangent Space and Helmholtz Decomposition
Not all velocity fields are "useful" for transport. If we add a component $w$ that is divergence-free with respect to the density (like a vortex, with $\nabla \cdot (\rho_t w) = 0$) to $v_t$, it does not affect the density evolution, but it does increase the kinetic energy. By the Helmholtz-Hodge Decomposition, the most efficient velocity field (the one that achieves the infimum) must be a Gradient Field:

$$v_t = \nabla \phi_t.$$
Thus, the tangent space at $\rho$ is identified with (the closure of) the set of gradients of smooth functions:

$$T_\rho \mathcal{P}_2 = \overline{\left\{ \nabla \phi \;:\; \phi \in C_c^\infty(\mathbb{R}^d) \right\}}^{\,L^2(\rho)}.$$

The Riemannian metric (inner product) between two tangent vectors $\nabla\phi_1$ and $\nabla\phi_2$ is:

$$\langle \nabla\phi_1, \nabla\phi_2 \rangle_\rho = \int \nabla\phi_1(x) \cdot \nabla\phi_2(x)\, \rho(x)\, dx.$$
This metric is “density-weighted,” reflecting that moving mass where there is no density costs nothing, while moving mass in high-density regions requires proportional energy.
2. Otto Calculus: Differentials and Gradients
Felix Otto (2001) formalized the idea of performing calculus on $\mathcal{P}_2$ by identifying how a functional $F[\rho]$ changes as the density evolves.
2.1 The Master Equation
Consider a path $\rho_t$ moving with velocity $v_t$. The rate of change of $F[\rho_t]$ is:

$$\frac{d}{dt} F[\rho_t] = \int \frac{\delta F}{\delta \rho}(x)\, \partial_t \rho_t(x)\, dx.$$

Substituting the continuity equation and integrating by parts (assuming $\rho_t$ decays at infinity):

$$\frac{d}{dt} F[\rho_t] = \int \nabla \frac{\delta F}{\delta \rho} \cdot v_t\, \rho_t\, dx = \left\langle \nabla \frac{\delta F}{\delta \rho},\, v_t \right\rangle_{\rho_t}.$$

By the Riesz Representation Theorem on the tangent space (with the $W_2$ metric), we identify the Wasserstein Gradient:

$$\operatorname{grad}_{W_2} F[\rho] = -\nabla \cdot \left( \rho\, \nabla \frac{\delta F}{\delta \rho} \right).$$

The gradient flow $\partial_t \rho_t = -\operatorname{grad}_{W_2} F[\rho_t]$ corresponds to the choice $v_t = -\nabla \frac{\delta F}{\delta \rho}$. Substituting this into the continuity equation yields the General Gradient Flow PDE:

$$\partial_t \rho = \nabla \cdot \left( \rho\, \nabla \frac{\delta F}{\delta \rho} \right).$$
2.2 The Wasserstein Hessian
Analogous to the gradient, we define the Hessian $\operatorname{Hess}_{W_2} F[\rho]$: a linear operator on the tangent space that describes the second-order change of the functional along geodesics. For the entropy $F[\rho] = \int \rho \log \rho\, dx$, the Hessian is tied to the Fisher Information

$$I(\rho) = \int |\nabla \log \rho(x)|^2\, \rho(x)\, dx,$$

which is exactly the rate at which the entropy dissipates along the heat flow.
This structure is used to prove the Log-Sobolev Inequality and determines the stability of the gradient flow. If the Hessian is bounded below by $\lambda > 0$ ($\lambda$-displacement convexity), the flow is contractive.
2.3 Bakry-Émery Gamma Calculus
The relationship between the gradient and the Hessian is formalized by the Iterated Gradient $\Gamma_2$:

$$\Gamma(f, g) = \tfrac{1}{2}\big( L(fg) - f\, Lg - g\, Lf \big), \qquad \Gamma_2(f) = \tfrac{1}{2} L\, \Gamma(f, f) - \Gamma(f, Lf),$$

where $L$ is the generator of the diffusion. In Wasserstein space, the condition $\Gamma_2(f) \ge \lambda\, \Gamma(f, f)$ is the geometric equivalent of saying the manifold has Ricci curvature bounded below by $\lambda$. This is the "Bochner identity" of optimal transport, providing a coordinate-free way to analyze diffusion on non-Euclidean spaces.
3. Physical Case Studies: Diffusion as Optimization
The “Master Equation” above explains why many physical processes behave the way they do: they represent the fastest paths to maximize entropy or minimize potential energy.
3.1 The Heat Equation (Pure Diffusion)
Let $F[\rho] = \int \rho \log \rho\, dx$ be the Negative Entropy. The first variation is $\frac{\delta F}{\delta \rho} = \log \rho + 1$. Its gradient is $\nabla \frac{\delta F}{\delta \rho} = \nabla \log \rho = \frac{\nabla \rho}{\rho}$. Plugging this into the Master Equation:

$$\partial_t \rho = \nabla \cdot \left( \rho\, \frac{\nabla \rho}{\rho} \right) = \Delta \rho.$$
This proves that the Heat Equation is the gradient flow of the entropy. Diffusion is the process of particles “transporting themselves” to fill space as efficiently as possible.
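As a sanity check (an illustrative finite-difference sketch with ad hoc grid and step sizes, and a small floor on $\rho$ to keep $\log\rho$ finite), we can evolve the same initial density once in the Otto form $\partial_t\rho = \partial_x\big(\rho\,\partial_x(\log\rho + 1)\big)$ and once with the plain heat equation $\partial_t\rho = \partial_x^2\rho$, and confirm the two stay close:

```python
import numpy as np

# Sketch: the Wasserstein gradient flow of entropy coincides with the heat equation.
nx, dx, dt, n_steps = 200, 0.05, 1e-4, 2000
x = (np.arange(nx) - nx / 2) * dx
rho0 = np.exp(-x**2)
rho0 /= np.sum(rho0) * dx

def ddx(f):
    # centered difference with periodic boundary (illustrative choice)
    return (np.roll(f, -1) - np.roll(f, 1)) / (2 * dx)

rho_otto, rho_heat = rho0.copy(), rho0.copy()
for _ in range(n_steps):
    # Otto form: d_t rho = d_x( rho * d_x( dF/drho ) ) with dF/drho = log(rho) + 1
    drift = ddx(np.log(np.maximum(rho_otto, 1e-12)) + 1.0)
    rho_otto = rho_otto + dt * ddx(rho_otto * drift)
    # Plain heat equation: d_t rho = d_x^2 rho (same stencil for a fair comparison)
    rho_heat = rho_heat + dt * ddx(ddx(rho_heat))

print("max |otto - heat| =", np.abs(rho_otto - rho_heat).max())
print("peak density       =", rho_heat.max())
```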
3.2 Fokker-Planck and the McKean-Vlasov Equation
Common in physics and biology is a free-energy functional with three components: Internal Energy, Potential Energy, and Interaction Energy,

$$F[\rho] = \int \rho \log \rho\, dx + \int V(x)\, \rho(x)\, dx + \frac{1}{2} \iint W(x - y)\, \rho(x)\, \rho(y)\, dx\, dy.$$

The resulting flow is the McKean-Vlasov Equation:

$$\partial_t \rho = \nabla \cdot \Big( \rho\, \nabla \big( \log \rho + V + W * \rho \big) \Big) = \Delta \rho + \nabla \cdot (\rho \nabla V) + \nabla \cdot \big( \rho\, \nabla (W * \rho) \big).$$

If the interaction kernel $W$ vanishes, the non-local term drops out and we recover the standard Fokker-Planck equation $\partial_t \rho = \Delta \rho + \nabla \cdot (\rho \nabla V)$. This equation is used to model everything from the training dynamics of Neural Networks (in the mean-field limit) to the swarming behavior of insects. In the Neural Network context, $V$ represents the loss surface and $W$ represents the interactions between neurons or particles in a particle-based optimizer.
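To make the interaction term concrete, here is a hedged Euler-Maruyama sketch of a McKean-Vlasov particle system. The confining potential $V(x) = x^2/2$ and the quadratic attraction kernel $W(x) = x^2/2$ are illustrative assumptions; with this choice the interaction force on particle $i$ is simply a pull toward the empirical mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def mckean_vlasov_step(X, dt, sigma=1.0):
    """One Euler-Maruyama step of the interacting particle system
    dX^i = -grad V(X^i) dt - (1/N) sum_j grad W(X^i - X^j) dt + sqrt(2 sigma) dB^i,
    with the illustrative choices V(x) = x^2/2 and W(x) = x^2/2."""
    grad_V = X                          # gradient of x^2 / 2
    grad_W_conv = X - X.mean()          # (1/N) sum_j (X^i - X^j) for W(x) = x^2/2
    noise = rng.normal(0.0, np.sqrt(2 * sigma * dt), size=X.shape)
    return X - (grad_V + grad_W_conv) * dt + noise

X = rng.normal(3.0, 1.0, size=5000)     # start the cloud away from the origin
for _ in range(2000):
    X = mckean_vlasov_step(X, dt=1e-3)
print(f"empirical mean {X.mean():.3f}, variance {X.var():.3f}")
```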
4. The JKO Scheme: Time Discretization
The continuous gradient flow is often difficult to solve directly. Jordan, Kinderlehrer, and Otto (1998) introduced a discrete-time approximation that captures the geometry of the flow.
4.1 The Proximal Point Interpretation
In Euclidean space, the implicit Euler step for $\dot{x} = -\nabla f(x)$ is $x_{k+1} = x_k - \tau \nabla f(x_{k+1})$. This is equivalent to the Proximal Operator:

$$x_{k+1} = \operatorname{prox}_{\tau f}(x_k) = \arg\min_x \left\{ f(x) + \frac{1}{2\tau} \|x - x_k\|^2 \right\}.$$

The JKO Scheme generalizes this to the space of measures by replacing the Euclidean distance with the $W_2$ distance:

$$\rho_{k+1} = \arg\min_{\rho} \left\{ F[\rho] + \frac{1}{2\tau}\, W_2^2(\rho, \rho_k) \right\}.$$

This variational problem balances the desire to decrease the functional against the "cost of moving" mass away from $\rho_k$. As $\tau \to 0$, the interpolation of the sequence $(\rho_k)$ converges to the solution of the gradient flow PDE. Beyond its use as a numerical method, this construction is the standard route for proving existence and uniqueness of solutions to these PDEs.
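The sketch below illustrates a single JKO step under simplifying assumptions that are not in the text: one dimension, equal-weight particles, and a functional containing only a potential-energy term $\int V \rho$ (the entropy is dropped to keep the inner problem smooth). In 1D the $W_2$ distance between two sorted, equal-size particle clouds is just the Euclidean distance between matched points, so each JKO step becomes a finite-dimensional proximal problem, solved here by a few gradient iterations.

```python
import numpy as np

def V(x):
    return (x - 2.0) ** 2          # confining potential (illustrative assumption)

def dV(x):
    return 2.0 * (x - 2.0)

def jko_step(x_prev, tau, inner_iters=500, lr=0.01):
    """One JKO step for equal-weight 1D particles.
    With both clouds sorted, W2^2(rho, rho_prev) = (1/n) * sum_i (x_i - x_prev_i)^2,
    so the step is argmin_x  mean(V(x)) + W2^2(x, x_prev) / (2 tau),
    solved here by plain gradient descent on the particle positions."""
    x_prev = np.sort(x_prev)
    x = x_prev.copy()
    for _ in range(inner_iters):
        grad = dV(x) + (x - x_prev) / tau    # common 1/n factor dropped
        x -= lr * grad
    return x

rng = np.random.default_rng(0)
x = rng.normal(-1.0, 0.3, size=500)          # initial particle cloud
for k in range(20):
    x = jko_step(x, tau=0.1)
print(f"particle mean after 20 JKO steps: {x.mean():.3f} (potential well at 2.0)")
```

Because $V$ is quadratic here, each inner problem even has a closed-form solution; the gradient loop is kept only to mirror the generic structure of a JKO solver.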
4.2 Particle and SDE Discretizations
For many physical functionals, the gradient flow corresponds to a Langevin SDE. The JKO scheme can be viewed as the "mean field" limit of a system of $N$ particles interacting via the potentials $V$ and $W$. By simulating the particles, we obtain a Monte-Carlo approximation of the Wasserstein flow:

$$dX_t^i = -\nabla V(X_t^i)\, dt - \frac{1}{N} \sum_{j \ne i} \nabla W(X_t^i - X_t^j)\, dt + \sqrt{2}\, dB_t^i,$$

where the $B_t^i$ are independent Brownian motions and the empirical measure $\frac{1}{N}\sum_i \delta_{X_t^i}$ approximates $\rho_t$.
5. Functional Inequalities and Convergence
The Riemannian structure of $\mathcal{P}_2$ allows us to translate global geometric properties (like curvature) into functional properties (like convergence rates).
5.1 Ricci Curvature and Displacement Convexity
We say a functional $F$ is $\lambda$-Displacement Convex if it is $\lambda$-convex along Wasserstein geodesics. For the entropy relative to a reference measure, this is equivalent to the base space having Ricci curvature bounded below by $\lambda$ (the Lott-Sturm-Villani characterization). If $F$ is $\lambda$-displacement convex with $\lambda > 0$, the gradient flow converges Exponentially Fast to the unique global minimizer $\rho^*$:

$$W_2(\rho_t, \rho^*) \le e^{-\lambda t}\, W_2(\rho_0, \rho^*), \qquad F[\rho_t] - F[\rho^*] \le e^{-2\lambda t}\, \big( F[\rho_0] - F[\rho^*] \big).$$
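A quick numerical illustration of the contraction (a sketch, assuming $V(x) = \lambda x^2/2$ with $\lambda = 1$, so the entropy-plus-potential functional is $1$-displacement convex): two particle clouds driven by Langevin dynamics with a shared noise sequence (a synchronous coupling) contract in $W_2$ at roughly the rate $e^{-\lambda t}$; in 1D the empirical $W_2$ is computed by sorting.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dt, lam = 5000, 1e-3, 1.0

def w2_empirical(x, y):
    # 1D W2 between equal-size empirical measures = L2 distance of sorted samples
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

X = rng.normal(-3.0, 0.5, n)                 # two different initial clouds
Y = rng.normal(2.0, 1.5, n)
w2_0 = w2_empirical(X, Y)
for step in range(1, 3001):
    noise = rng.normal(0.0, np.sqrt(2 * dt), n)     # synchronous coupling: shared noise
    X = X - lam * X * dt + noise                    # Langevin for V(x) = lam * x^2 / 2
    Y = Y - lam * Y * dt + noise
    if step % 1000 == 0:
        t = step * dt
        print(f"t = {t:.0f}:  W2 = {w2_empirical(X, Y):.4f},  "
              f"e^(-lam t) * W2(0) (theory) = {np.exp(-lam * t) * w2_0:.4f}")
```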
5.2 LSI, HWI, and Talagrand
These relationships link the relative Entropy ($H$), the Wasserstein distance ($W_2$), and the relative Fisher Information ($I$) with respect to a reference measure $\pi \propto e^{-V}$ with $\nabla^2 V \succeq \lambda > 0$.
- Log-Sobolev Inequality (LSI): $H(\rho \,\|\, \pi) \le \frac{1}{2\lambda}\, I(\rho \,\|\, \pi)$.
- Talagrand's Inequality: $W_2^2(\rho, \pi) \le \frac{2}{\lambda}\, H(\rho \,\|\, \pi)$.
- HWI Inequality: $H(\rho \,\|\, \pi) \le W_2(\rho, \pi)\, \sqrt{I(\rho \,\|\, \pi)} - \frac{\lambda}{2}\, W_2^2(\rho, \pi)$.

These inequalities imply that if you control the entropy (information), you control the physical distance mass must move. This is a powerful tool for proving the stability of Markov chain Monte Carlo (MCMC) sampling and for bounding generalization error in deep learning; a numerical sanity check for Gaussian measures follows below.
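The following sketch checks these three inequalities numerically for the simple case $\pi = \mathcal{N}(0,1)$ (so $\lambda = 1$) and a Gaussian $\rho = \mathcal{N}(m, s^2)$. The particular values of $m$ and $s$ are arbitrary assumptions; the entropy and Fisher information are computed by quadrature, and the 1D Gaussian $W_2$ formula is used in closed form.

```python
import numpy as np

# Sanity check of LSI, Talagrand, and HWI for pi = N(0, 1) (lambda = 1)
# and rho = N(m, s^2); the specific (m, s) values are arbitrary.
m, s = 0.8, 1.4
x = np.linspace(-12, 12, 20001)

rho = np.exp(-(x - m) ** 2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
pi  = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

log_ratio = np.log(rho) - np.log(pi)
H = np.trapz(rho * log_ratio, x)                  # relative entropy KL(rho || pi)
score_diff = np.gradient(log_ratio, x)            # d/dx log(rho / pi)
I = np.trapz(rho * score_diff**2, x)              # relative Fisher information
W2 = np.sqrt(m**2 + (s - 1.0) ** 2)               # closed form for 1D Gaussians

lam = 1.0
print(f"H = {H:.4f},  I = {I:.4f},  W2 = {W2:.4f}")
print(f"LSI:       {H:.4f} <= {I / (2 * lam):.4f}")
print(f"Talagrand: {W2**2:.4f} <= {2 * H / lam:.4f}")
print(f"HWI:       {H:.4f} <= {W2 * np.sqrt(I) - lam / 2 * W2**2:.4f}")
```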
6. Stein Variational Gradient Descent (SVGD)
Can we perform a Wasserstein gradient flow toward a target $\pi$ without ever evaluating the evolving density $\rho_t$? This is the motivation for Stein Variational Gradient Descent (SVGD). SVGD identifies a velocity field in a Reproducing Kernel Hilbert Space (RKHS) that maximizes the instantaneous rate of decrease of the KL divergence to the target.
6.1 The Stein Mapping
Instead of the full Wasserstein gradient (which requires the density), SVGD restricts the velocity field to the form:

$$\phi^*(x) = \mathbb{E}_{y \sim \rho}\big[ k(y, x)\, \nabla_y \log \pi(y) + \nabla_y k(y, x) \big],$$

where $k(\cdot,\cdot)$ is a positive-definite kernel (e.g., RBF). The bracketed expression is the Stein Operator applied to the kernel. The result is a set of particles that "push" each other apart (repulsive force from $\nabla_y k$) while being pulled toward regions of high target density (attractive force from $k\, \nabla \log \pi$). SVGD is effectively a gradient flow in a space where the metric is defined by the kernel. For infinite bandwidth, it recovers the $W_2$ flow; for narrow kernels, it behaves like independent Langevin chains.
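Below is a hedged numpy sketch of SVGD for the same double-well Gibbs target used in Section 11 ($\pi(x) \propto e^{-(x^2 - 1)^2}$, an assumption made for the example), with an RBF kernel and the median bandwidth heuristic. The update implements $\phi(x_i) = \frac{1}{n}\sum_j \big[k(x_j, x_i)\,\nabla\log\pi(x_j) + \nabla_{x_j} k(x_j, x_i)\big]$.

```python
import numpy as np

def score(x):
    # Score of the Gibbs target pi(x) ∝ exp(-(x^2 - 1)^2):  d/dx log pi = -4x(x^2 - 1)
    return -4.0 * x * (x**2 - 1.0)

def svgd(x, n_iters=1000, step=0.05):
    n = len(x)
    for _ in range(n_iters):
        diff = x[:, None] - x[None, :]                   # diff[i, j] = x_i - x_j
        # RBF kernel with the median bandwidth heuristic
        h = np.median(np.abs(diff)) / np.sqrt(2 * np.log(n + 1)) + 1e-8
        K = np.exp(-diff**2 / (2 * h**2))
        grad_K = diff / h**2 * K                         # grad_K[i, j] = d k(x_i, x_j) / d x_j
        # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) * score(x_j) + d/dx_j k(x_j, x_i) ]
        phi = (K @ score(x) + grad_K.sum(axis=1)) / n
        x = x + step * phi
    return x

rng = np.random.default_rng(0)
particles = svgd(rng.normal(0.0, 0.1, size=300))
print(f"fraction of particles in each well: "
      f"{np.mean(particles > 0):.2f} right, {np.mean(particles < 0):.2f} left")
```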
7. Optimal Transport and Mean Field Games
What happens when $N$ agents each move according to a Wasserstein flow to minimize their own cost? This leads to Mean Field Games (MFG). In an MFG, an agent at position $X_t$ chooses its velocity $v_t$ to minimize a cost of the form:

$$\mathbb{E}\left[ \int_0^T \Big( \tfrac{1}{2}\, |v_t|^2 + f(X_t, \rho_t) \Big)\, dt + g(X_T, \rho_T) \right],$$

where $\rho_t$ is the distribution of all other agents. The equilibrium is described by a system of two coupled PDEs:
- Hamilton-Jacobi-Bellman (HJB): Describes the value function of an individual agent.
- Fokker-Planck (FP): Describes the evolution of the population density $\rho_t$.

The optimal transport problem is a special "potential" case of an MFG where the agents are incentivized to move from $\rho_0$ to $\rho_1$ at minimum total cost. This connection allows us to use OT solvers to predict the behavior of large-scale decentralized systems, such as traffic flow or financial markets.
8. Neural Gradient Flows
Modern deep learning has embraced Wasserstein flows via Neural ODEs and Flow Matching. Instead of discrete updates, we parameterize the velocity field $v_\theta(x, t)$ with a neural network. The density then evolves by pushing a simple base distribution (e.g., a Gaussian) through the ODE:

$$\frac{dx_t}{dt} = v_\theta(x_t, t), \qquad x_0 \sim \rho_0.$$

By minimizing a divergence between the evolved density and the data distribution, we are effectively performing a gradient flow in the space of parameters $\theta$ that mimics the Wasserstein flow in the space of measures. This is the foundation of Continuous Normalizing Flows (CNFs) and the recent revolution in Diffusion Models.
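The sketch below is a toy continuous normalizing flow in one dimension. Instead of a trained network it uses the hand-picked velocity field $v(x, t) = -x$ (an assumption purely for illustration), pushes Gaussian base samples through the ODE with Euler steps, and tracks the log-density with the instantaneous change-of-variables formula $\frac{d}{dt}\log p_t(x_t) = -\nabla \cdot v(x_t, t)$; for this linear field the pushforward has a closed form to compare against.

```python
import numpy as np

# Toy continuous normalizing flow in 1D with a hand-chosen velocity field
# v(x, t) = -x (standing in for a neural network). Samples are pushed through
# dx/dt = v(x, t) and the log-density follows d/dt log p_t(x_t) = -div v(x_t, t).
def v(x, t):
    return -x

def div_v(x, t):
    return -np.ones_like(x)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)                 # base samples ~ N(0, 1)
logp = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)          # base log-density at the samples

n_steps, T = 1000, 1.0
dt = T / n_steps
for step in range(n_steps):
    t = step * dt
    logp = logp - div_v(x, t) * dt                    # accumulate -div v along the path
    x = x + v(x, t) * dt                              # Euler step of the flow ODE

# For v(x) = -x the pushforward of N(0, 1) at time T is N(0, e^{-2T}); compare.
logp_exact = -0.5 * x**2 / np.exp(-2 * T) - 0.5 * np.log(2 * np.pi * np.exp(-2 * T))
print("max |tracked - exact| log-density:", np.abs(logp - logp_exact).max())
```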
9. Wasserstein Proximal Operators and Optimization
In the machine learning context, we often want to minimize a loss $F[\rho_\theta]$ where $\theta$ parameterizes a distribution. The Wasserstein Proximal Operator is defined as:

$$\operatorname{prox}^{W_2}_{\tau F}(\mu) = \arg\min_{\rho} \left\{ F[\rho] + \frac{1}{2\tau}\, W_2^2(\rho, \mu) \right\}.$$
This is the building block of the JKO scheme. Modern solvers compute this by:
- Entropic Regularization: Using Sinkhorn iterations to approximate $W_2^2$ (as seen in the Optimal Transport post); a minimal Sinkhorn sketch is given after this list.
- Particle Flows: Moving particles along the gradient of the proximal objective.
- Kernel Methods: Using Reproducing Kernel Hilbert Spaces (RKHS) to smooth the density updates.

This operator is contractive under displacement convexity, making it a robust alternative to standard gradient descent in the space of measures. It is especially useful for Variational Inference, where we seek to approximate a complex posterior with a simpler distribution while respecting the geometry of the state space.
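As a concrete reference point for the Sinkhorn bullet above, here is a self-contained numpy sketch of the entropic approximation of $W_2^2$ between two discrete point clouds. The regularization strength, iteration count, and Gaussian clouds are arbitrary choices; a production solver would use a dedicated library (such as POT) with log-domain stabilization.

```python
import numpy as np

def sinkhorn_w2(x, y, eps=0.1, n_iters=500):
    """Entropic approximation of W2^2 between the empirical measures of two 1D
    point clouds via Sinkhorn iterations (minimal sketch, no log-domain tricks)."""
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    C = (x[:, None] - y[None, :]) ** 2            # squared-distance cost matrix
    K = np.exp(-C / eps)                          # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):                      # alternating marginal scalings
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]               # approximate optimal coupling
    return np.sum(P * C)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 400)
y = rng.normal(2.0, 1.0, 400)
w2_exact = np.mean((np.sort(x) - np.sort(y)) ** 2)   # exact 1D W2^2 via sorting
print(f"entropic estimate: {sinkhorn_w2(x, y):.3f}   exact (sorted): {w2_exact:.3f}")
```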
10. Mean Field Limits and Propagation of Chaos
Why does a system of discrete particles converge to a continuous PDE? This is the study of Mean Field Limits. Consider $N$ agents moving according to a coupled system of SDEs:

$$dX_t^i = -\nabla V(X_t^i)\, dt - \frac{1}{N} \sum_{j \ne i} \nabla W(X_t^i - X_t^j)\, dt + \sqrt{2}\, dB_t^i.$$

As $N \to \infty$, the influence of any single particle becomes negligible (of order $1/N$). Under the Propagation of Chaos assumption (Henry McKean, 1966), the joint distribution of any fixed number of particles converges to a tensor product of identical marginal distributions. Each particle effectively follows the "mean field" produced by the aggregate ensemble, and the limiting density satisfies the McKean-Vlasov equation derived in Section 3.2. This provides the mathematical justification for using particle filters, ensemble Kalman filters, and particle-based variational inference to solve high-dimensional transport problems.
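A rough empirical check of this convergence (a sketch with an illustrative quadratic confinement and attraction, and with a very large simulation standing in for the true mean-field limit): the 1D $W_2$ distance between the $N$-particle empirical measure at a fixed time and the large-$N$ reference shrinks as $N$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(N, n_steps=1000, dt=1e-2):
    """Interacting system dX^i = -X^i dt - (1/N) sum_j (X^i - X^j) dt + sqrt(2) dB^i
    (quadratic confinement and attraction, chosen only for illustration)."""
    X = rng.normal(3.0, 1.0, N)
    for _ in range(n_steps):
        drift = -X - (X - X.mean())
        X = X + drift * dt + rng.normal(0.0, np.sqrt(2 * dt), N)
    return X

def w2_1d(sample, reference):
    # 1D W2 between empirical measures via matched quantiles
    q = np.linspace(0.005, 0.995, 200)
    return np.sqrt(np.mean((np.quantile(sample, q) - np.quantile(reference, q)) ** 2))

reference = simulate(100_000)        # large-N run stands in for the mean-field limit
for N in (10, 100, 1000, 10_000):
    print(f"N = {N:6d}:  W2(empirical, mean-field reference) ~ {w2_1d(simulate(N), reference):.3f}")
```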
11. Implementation: Langevin Dynamics as JKO
We simulate the gradient flow of the free energy $F[\rho] = \int V\rho\, dx + \int \rho \log \rho\, dx$ for the double-well potential $V(x) = (x^2 - 1)^2$. The minimizer of $F$ is the Gibbs measure $\pi(x) \propto e^{-V(x)}$, so the flow is equivalent to sampling from $\pi$. The particles follow the discretized Langevin equation, the explicit-in-time counterpart of the JKO steps, which is a particle approximation of this Wasserstein gradient flow.
```python
import numpy as np
import matplotlib.pyplot as plt

def potential_v(x):
    # Double-well potential (non-convex, two modes)
    return (x**2 - 1)**2

def grad_v(x):
    # grad V = 4x(x^2 - 1)
    return 4 * x * (x**2 - 1)

def run_wasserstein_flow(n_particles=2000, n_steps=2000, dt=0.005, noise_scale=1.0,
                         snapshot_steps=(0, 100, 500, 1000, 2000)):
    # Start with a narrow Gaussian at 0 (the unstable saddle between the two wells)
    X = np.random.normal(0, 0.05, n_particles)
    history = []
    for t in range(n_steps + 1):
        if t in snapshot_steps:
            history.append((t, X.copy()))
        # Update via Langevin dynamics (discretized Wasserstein gradient flow)
        diffusion = np.random.normal(0, np.sqrt(2 * noise_scale * dt), n_particles)
        X = X - grad_v(X) * dt + diffusion
    return history

def visualize_flow(history):
    plt.figure(figsize=(12, 7))
    x_range = np.linspace(-2.5, 2.5, 200)
    # Plot the target Gibbs density pi(x) ∝ exp(-V(x))
    target = np.exp(-potential_v(x_range))
    target /= np.trapz(target, x_range)
    plt.plot(x_range, target, 'k--', lw=2, label='Target (Gibbs) Density')
    # Plot particle histograms at the recorded snapshots
    colors = plt.cm.viridis(np.linspace(0, 1, len(history)))
    for (step, snap), color in zip(history, colors):
        plt.hist(snap, bins=60, density=True, alpha=0.3, color=color, label=f'Step {step}')
    plt.title("Convergence of Particle System to Equilibrium (Wasserstein Flow)")
    plt.xlabel("State x")
    plt.ylabel("Density")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

visualize_flow(run_wasserstein_flow())
```

12. Conclusion: Optimization in the Space of Measures
The discovery that diffusion and Fokker-Planck equations are gradient flows has unified PDE theory, statistical mechanics, and optimization. By viewing the space of measures as a Riemannian manifold, we gain access to a suite of geometric tools—Hessians, curvature, and geodesics—that reveal the underlying structure of dissipative systems. Whether we are training neural networks, modeling cellular swarms, or sampling from complex posteriors, Wasserstein flows provide the most natural language for the evolution of probability.
Historical Timeline
| Year | Event | Significance |
|---|---|---|
| 1926 | Erwin Schrödinger | Relates Diffusion to Entropic Interpolation (Schrödinger Bridges). |
| 1966 | Henry McKean | Propagation of Chaos and McKean-Vlasov equations. |
| 1997 | Robert McCann | Displacement Convexity. |
| 1998 | Jordan, Kinderlehrer, Otto | The JKO Scheme for Fokker-Planck. |
| 2000 | Benamou & Brenier | Dynamic Formulation of OT (Fluid view). |
| 2001 | Felix Otto | Otto Calculus and Riemannian geometry of $\mathcal{P}_2$. |
| 2016 | Liu & Wang | Stein Variational Gradient Descent (SVGD). |
Appendix A: Displacement Convexity (McCann)
When is a functional $F$ convex on Wasserstein space? Standard (linear) convexity is insufficient because the linear interpolation $(1-t)\rho_0 + t\rho_1$ creates artificial multimodality. Displacement Convexity instead requires convexity along the Wasserstein geodesic $\rho_t = \big((1-t)\,\mathrm{id} + t\,T\big)_{\#}\rho_0$, where $T$ is the Brenier map from $\rho_0$ to $\rho_1$. Robert McCann proved that potential and interaction energies are displacement convex if their potentials $V$ and $W$ are convex, and that the internal energy $\int U(\rho)\,dx$ is displacement convex if $s \mapsto s^d\, U(s^{-d})$ is convex and non-increasing (McCann's condition).
Appendix B: The Porous Medium Equation
Consider the functional $F[\rho] = \frac{1}{m-1}\int \rho^m\, dx$ with $m > 1$. Following the Otto Calculus derivation, the gradient flow is $\partial_t \rho = \Delta(\rho^m)$. This is the Porous Medium Equation (PME). Unlike the Heat Equation (which has infinite propagation speed), the PME has Finite Propagation Speed: if the support of $\rho_0$ is compact, it remains compact for all time, modeling gas flow in porous rocks or biological dispersal.
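A hedged finite-difference sketch of the finite-speed phenomenon for $m = 2$ (the explicit scheme, grid, and compactly supported initial bump are ad hoc choices; negative round-off values are clipped): after evolving for a while, the support has grown only to a finite interval well inside the computational domain, whereas a heat-equation solution would be strictly positive everywhere immediately.

```python
import numpy as np

# Sketch: explicit finite differences for the PME d_t rho = d_xx (rho^m), m = 2,
# starting from a compactly supported bump. The support stays inside a finite
# interval (finite propagation speed), unlike the heat equation.
m = 2
nx, dx, dt, n_steps = 400, 0.05, 2e-4, 5000
x = (np.arange(nx) - nx / 2) * dx
rho = np.maximum(1.0 - x**2, 0.0)        # compact bump supported on [-1, 1]
rho /= np.sum(rho) * dx

def laplacian(f):
    return (np.roll(f, 1) - 2 * f + np.roll(f, -1)) / dx**2

for _ in range(n_steps):
    rho = rho + dt * laplacian(rho ** m)
    rho = np.maximum(rho, 0.0)           # clip tiny negative round-off at the front

support = x[rho > 1e-12]
print(f"support after t = {n_steps * dt:.1f}: [{support.min():.2f}, {support.max():.2f}]")
```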
Appendix C: Metric Graphs and Discrete Flows
Can we define Wasserstein flows on a discrete graph $G = (V, E)$? Yes, using the Maas-Mielke framework. The $W_2$-like distance between distributions on the vertices is defined through a discrete continuity equation. This leads to Discrete Ricci Curvature (Bakry-Émery curvature on graphs). If a graph has positive curvature, random walks converge exponentially fast to the stationary distribution.
Appendix D: The JKO Scheme Proof Sketch
To prove that the JKO scheme converges to the PDE, one uses the Euler-Lagrange Equation of the variational problem. The optimality condition relates the first variation of the functional to the potential of the optimal transport map. Taking the gradient and substituting into the continuity equation provides a consistent discretization of the physical flux.
Appendix E: Mean Field Limit of Particle Systems
The transition from $N$ particles to the PDE is rigorous. Under the Propagation of Chaos assumption, the joint distribution of the particles converges to the product of marginals. As $N \to \infty$, the empirical measure $\frac{1}{N}\sum_i \delta_{X_t^i}$ converges weakly to the solution of the McKean-Vlasov equation, with noise vanishing into collective pressure terms.
Appendix F: Talagrand’s Inequality and Information Theory
Talagrand's inequality states that for the standard Gaussian measure $\gamma$, $W_2^2(\rho, \gamma) \le 2\, \mathrm{KL}(\rho \,\|\, \gamma)$. This implies that if a distribution is close to the Gaussian in terms of information (low KL), it must also be physically close (small $W_2$). This is used to prove concentration of measure for Lipschitz functions.
Appendix G: Regularity Theory (Ambrosio-Gigli-Savaré)
The textbook Gradient Flows (2008) provides the rigorous metric-space framework for these flows. Key result: even when the functional is not smooth, if it is Lower Semi-Continuous and displacement convex, the JKO scheme defines a unique Curve of Maximal Slope, which is the unique solution to the evolution equation.
Appendix H: Quantum Optimal Transport and Lindblad Equations
How do Wasserstein flows generalize to the quantum world? In quantum mechanics, the state is a Density Matrix $\rho$. The entropy is the Von Neumann entropy $S(\rho) = -\operatorname{Tr}(\rho \log \rho)$. Gradient flows on the space of density matrices, under a non-commutative version of the $W_2$ metric (developed by Carlen and Maas), recover the Lindblad Master Equation. This shows that the decoherence and dissipation of quantum systems are also optimal transport processes, where the "mass" being moved is the probability flux between energy levels.
Appendix I: Information Geometry (Fisher-Rao) vs. Wasserstein
There are two primary ways to make the space of measures a manifold:
- Information Geometry (Fisher-Rao): The distance is based on the KL divergence locally (Fisher Information Metric). This metric is Vertical: it only cares about the change in probability at each point, ignoring the distance between the points themselves.
- Wasserstein Geometry: The distance is based on the cost of moving mass Horizontally. The gradient flow of entropy in Fisher-Rao geometry is a pointwise Exponential Decay of the density, while in Wasserstein geometry it is the Heat Equation $\partial_t \rho = \Delta \rho$. This distinction is crucial in optimization: natural gradient descent (Fisher) is good for parameter estimation, while Wasserstein gradient descent is better suited to data interpolation and generative modeling.
Appendix J: Glossary of Terms
- Continuity Equation: A PDE expressing the conservation of mass: $\partial_t \rho + \nabla \cdot (\rho v) = 0$.
- Displacement Convexity: Convexity of a functional along Wasserstein geodesics, ensuring unique minimizers and contractive flows.
- JKO Scheme: A variational time-discretization of gradient flows in the Wasserstein metric.
- McKean-Vlasov Equation: A non-linear PDE describing the evolution of a density under potential and interaction forces.
- Otto Calculus: A formal Riemannian framework for performing calculus on the space of probability measures.
- Propagation of Chaos: The property that a system of interacting particles becomes asymptotically independent as $N \to \infty$.
- Stein Operator: A differential operator used in SVGD to map the score function of a target distribution to a velocity field.
- Wasserstein Manifold: The formal infinite-dimensional manifold structure of $(\mathcal{P}_2(\mathbb{R}^d), W_2)$.
References
1. Jordan, R., Kinderlehrer, D., & Otto, F. (1998). "The Variational Formulation of the Fokker-Planck Equation."
2. Otto, F. (2001). "The Geometry of Dissipative Evolution Equations: The Porous Medium Equation."
3. McCann, R. J. (1997). "A Convexity Principle for Interacting Gases."
4. Ambrosio, L., Gigli, N., & Savaré, G. (2008). "Gradient Flows in Metric Spaces and in the Space of Probability Measures."
5. Villani, C. (2009). "Optimal Transport: Old and New."
6. Bakry, D., Gentil, I., & Ledoux, M. (2014). "Analysis and Geometry of Markov Diffusion Operators."
7. Liu, Q., & Wang, D. (2016). "Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm."
8. Benamou, J.-D., & Brenier, Y. (2000). "A Computational Fluid Mechanics Solution to the Monge-Kantorovich Mass Transfer Problem."