Wasserstein Gradient Flows

Gradient descent minimizes a function $F(x)$ by following the path of steepest descent in Euclidean space. But what if the object of our optimization is not a point in $\mathbb{R}^d$, but a Probability Distribution $\rho$? Standard metrics like the $L^2$ norm or KL divergence fail to capture the physical reality that mass must be “moved” across space. For instance, the $L^2$ gradient flow of an energy often yields non-physical “pointwise” decay, ignoring the global configuration of the mass. This is akin to trying to rearrange the furniture in a room by turning some chairs invisible and hoping others appear in the right spots, rather than physically carrying them across the floor. The Wasserstein-2 Metric ($W_2$) provides the necessary “horizontal” structure for this physical transport. Under this metric, the space of probability measures $\mathcal{P}_2(\mathbb{R}^d)$ becomes an infinite-dimensional Riemannian manifold, often called the Wasserstein Manifold.

As discovered in the late 1990s by Jordan, Kinderlehrer, and Otto (JKO), many fundamental PDEs—including the Heat Equation, the Fokker-Planck equation, and the Porous Medium Equation—are simply gradient flows of physical functionals (Entropy, Internal Energy) in this Wasserstein geometry. This realization has profound implications. It means that the mathematical language of optimization can be used to solve complex PDEs, and the geometric language of manifolds can be used to understand the stability of stochastic processes. This post explores the Otto Calculus, deriving the geometric machinery required to perform optimization, calculus, and stability analysis on the manifold of measures.


1. The Dynamic Formulation: Benamou-Brenier

The traditional Kantorovich definition of Optimal Transport is static—it finds a coupling $\gamma$ but says nothing about the path. In 2000, Jean-David Benamou and Yann Brenier revolutionized the field by viewing transport as a continuous Fluid Flow.

1.1 The Continuity Equation

Consider a density $\rho_t(x)$ evolving over time $t \in [0, 1]$. If the density is moved by a velocity field $v_t(x)$, the conservation of mass is governed by the Continuity Equation:

$$\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0$$

This equation ensures that no mass is created or destroyed; any change in local density must be accounted for by the flux $\rho v$.

1.2 Action and the Riemannian Metric

The Benamou-Brenier theorem states that the squared $W_2$ distance is the minimum kinetic energy required to morph $\rho_0$ into $\rho_1$:

$$W_2^2(\rho_0, \rho_1) = \inf_{(\rho, v)} \left\{ \int_0^1 \int_{\mathbb{R}^d} \rho_t(x) \, |v_t(x)|^2 \, dx \, dt \;:\; \partial_t \rho + \nabla \cdot (\rho v) = 0 \right\}$$

This minimization reveals the Riemannian structure of $\mathcal{P}_2$. We can think of $\rho$ as a point on the manifold and $v$ as a vector in the Tangent Space $T_\rho \mathcal{P}_2$.
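To see the theorem in action, consider one-dimensional Gaussians, where the optimal plan is the monotone rearrangement and $W_2^2$ has the closed form $(m_0 - m_1)^2 + (\sigma_0 - \sigma_1)^2$. The following Monte Carlo sketch (the sorting trick is valid only for equal-weight 1D samples) verifies that the kinetic energy of particles travelling on straight lines between matched samples reproduces this value:

```python
import numpy as np

# Benamou-Brenier check in 1D: under the monotone (sorted) coupling, each particle
# travels on a straight line with constant velocity v_i = x1_i - x0_i, so the
# kinetic energy  int_0^1 E|v_t|^2 dt  equals  E|x1 - x0|^2  =  W_2^2.
m0, s0 = 0.0, 1.0
m1, s1 = 3.0, 0.5
rng = np.random.default_rng(0)
x0 = np.sort(rng.normal(m0, s0, 100_000))
x1 = np.sort(rng.normal(m1, s1, 100_000))

kinetic_energy = np.mean((x1 - x0)**2)
closed_form = (m0 - m1)**2 + (s0 - s1)**2
print(kinetic_energy, closed_form)   # both approximately 9.25
```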

1.3 The Tangent Space and Helmholtz Decomposition

Not all velocity fields $v$ are “useful” for transport. If we add a divergence-free component (like a vortex $\nabla \times \psi$) to $v$, it does not affect the density evolution (since $\nabla \cdot (\nabla \times \psi) = 0$), but it increases the kinetic energy. By the Helmholtz-Hodge Decomposition, the most efficient velocity field (the one that achieves the infimum) must be a Gradient Field:

$$v = \nabla \phi$$

Thus, the tangent space at $\rho$ is identified with the set of gradients of smooth functions:

$$T_\rho \mathcal{P}_2 \cong \overline{\{ \nabla \phi : \phi \in C_c^\infty(\mathbb{R}^d) \}}^{L^2(\rho)}$$

The Riemannian metric (inner product) between two tangent vectors $\nabla \phi_1, \nabla \phi_2$ is:

$$g_\rho(\nabla \phi_1, \nabla \phi_2) = \int_{\mathbb{R}^d} \langle \nabla \phi_1(x), \nabla \phi_2(x) \rangle \, \rho(x) \, dx$$

This metric is “density-weighted,” reflecting that moving mass where there is no density costs nothing, while moving mass in high-density regions requires proportional energy.
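Since the metric is just an expectation under $\rho$, it can be estimated by sampling. A tiny sketch with $\rho = \mathcal{N}(0, I)$ in $\mathbb{R}^2$ and two illustrative potentials ($\phi_1(x) = |x|^2/2$ and $\phi_2(x) = x_1 x_2$, chosen here only as examples):

```python
import numpy as np

# Monte Carlo estimate of g_rho(grad phi_1, grad phi_2) = E_{x~rho} <grad phi_1, grad phi_2>
rng = np.random.default_rng(1)
x = rng.normal(size=(200_000, 2))                    # samples from rho = N(0, I)

grad_phi1 = x                                        # phi_1(x) = |x|^2 / 2
grad_phi2 = np.stack([x[:, 1], x[:, 0]], axis=1)     # phi_2(x) = x_1 * x_2

inner = np.sum(grad_phi1 * grad_phi2, axis=1)        # <grad phi_1, grad phi_2> = 2 x_1 x_2
print(inner.mean())                                  # ~0: the two directions are g_rho-orthogonal
```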


2. Otto Calculus: Differentials and Gradients

Felix Otto (2001) formalized the idea of performing calculus on $\mathcal{P}_2$ by identifying how a functional $\mathcal{F}(\rho)$ changes as the density $\rho$ evolves.

2.1 The Master Equation

Consider a path $\rho_t$ moving with velocity $v_t$. The rate of change of $\mathcal{F}$ is:

$$\frac{d}{dt} \mathcal{F}(\rho_t) = \int_{\mathbb{R}^d} \frac{\delta \mathcal{F}}{\delta \rho}(x) \, \partial_t \rho_t(x) \, dx$$

Substituting the continuity equation $\partial_t \rho = -\nabla \cdot (\rho v)$ and integrating by parts (assuming $\rho$ decays at infinity):

$$\frac{d}{dt} \mathcal{F}(\rho_t) = - \int \frac{\delta \mathcal{F}}{\delta \rho} \, \nabla \cdot (\rho v) \, dx = \int \nabla \left( \frac{\delta \mathcal{F}}{\delta \rho} \right) \cdot v \, \rho \, dx$$

By the Riesz Representation Theorem on the tangent space (with the $g_\rho$ metric), we identify the Wasserstein Gradient:

$$\operatorname{grad}_{W} \mathcal{F}(\rho) = \nabla \left( \frac{\delta \mathcal{F}}{\delta \rho} \right)$$

The gradient flow equation $\partial_t \rho = -\operatorname{grad}_{W} \mathcal{F}(\rho)$ corresponds to the choice $v = -\operatorname{grad}_{W} \mathcal{F}$. Substituting this into the continuity equation yields the General Gradient Flow PDE:

$$\partial_t \rho = \nabla \cdot \left( \rho \, \nabla \frac{\delta \mathcal{F}}{\delta \rho} \right)$$

2.2 The Wasserstein Hessian

Analogous to the gradient, we define the Hessian $\operatorname{Hess}_W \mathcal{F}$. It is a linear operator on the tangent space that describes the second-order change of the functional. For the entropy $\mathcal{F}(\rho) = \int \rho \log \rho \, dx$, the Hessian is related to the Fisher Information:

$$\langle \operatorname{Hess}_W \mathcal{F}(\rho) \, \nabla \phi, \nabla \phi \rangle_{g_\rho} = \int \sum_{ij} (\partial_{ij} \phi)^2 \, \rho \, dx$$

This structure is used to prove the Log-Sobolev Inequality and determines the stability of the gradient flow. If the Hessian is bounded below by $\lambda I$ (displacement convexity), the flow is contractive.

2.3 Bakry-Émery Gamma Calculus

The relationship between the gradient and the Hessian is formalized by the Iterated Gradient $\Gamma_2(\phi)$:

$$\Gamma_2(\phi) = \frac{1}{2} \Delta |\nabla \phi|^2 - \langle \nabla \phi, \nabla \Delta \phi \rangle$$

In Wasserstein space, the condition $\Gamma_2(\phi) \ge \lambda |\nabla \phi|^2$ is the geometric equivalent of saying the manifold has Ricci curvature $\ge \lambda$. This is the “Bochner identity” of optimal transport, providing a coordinate-free way to analyze diffusion on non-Euclidean spaces.
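In flat $\mathbb{R}^d$ the Bochner formula reduces $\Gamma_2$ to the squared Frobenius norm of the Hessian (the Ricci term vanishes), which is exactly the integrand of the Wasserstein Hessian of the entropy above. A quick symbolic sanity check, sketched with sympy for an arbitrary test function:

```python
import sympy as sp

x, y = sp.symbols('x y')
phi = sp.sin(x) * sp.exp(y) + x**2 * y          # arbitrary smooth test function

grad = [sp.diff(phi, v) for v in (x, y)]
lap = sum(sp.diff(phi, v, 2) for v in (x, y))

# Gamma_2(phi) = 1/2 * Delta|grad phi|^2 - <grad phi, grad(Delta phi)>
sq_norm = sum(g**2 for g in grad)
gamma2 = sp.Rational(1, 2) * sum(sp.diff(sq_norm, v, 2) for v in (x, y)) \
         - sum(g * sp.diff(lap, v) for g, v in zip(grad, (x, y)))

# Bochner in flat space: Gamma_2(phi) = ||Hess phi||_F^2 (zero Ricci curvature)
hess_sq = sum(sp.diff(phi, u, v)**2 for u in (x, y) for v in (x, y))

print(sp.simplify(gamma2 - hess_sq))            # 0
```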


3. Physical Case Studies: Diffusion as Optimization

The “Master Equation” above explains why many physical processes behave the way they do: they represent the fastest paths to maximize entropy or minimize potential energy.

3.1 The Heat Equation (Pure Diffusion)

Let $\mathcal{F}(\rho) = \int \rho \log \rho \, dx$ be the Negative Entropy. The first variation is $\frac{\delta \mathcal{F}}{\delta \rho} = \log \rho + 1$. Its gradient is $\nabla (\log \rho) = \frac{\nabla \rho}{\rho}$ (the constant drops out). Plugging this into the Master Equation:

$$\partial_t \rho = \nabla \cdot \left( \rho \, \frac{\nabla \rho}{\rho} \right) = \nabla \cdot (\nabla \rho) = \Delta \rho$$

This proves that the Heat Equation is the gradient flow of the entropy. Diffusion is the process of particles “transporting themselves” to fill space as efficiently as possible.
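A quick numerical sanity check of the gradient-flow interpretation: evolve a 1D density with an explicit finite-difference heat step and confirm that the negative entropy $\int \rho \log \rho \, dx$ only decreases. This is a sketch on a bounded grid, not a careful PDE solver:

```python
import numpy as np

# Evolve the 1D heat equation and track the negative entropy F(rho) = int rho log rho dx.
x = np.linspace(-5, 5, 400)
dx = x[1] - x[0]
rho = np.exp(-((x - 1.0)**2) / 0.1)               # narrow, off-center bump
rho /= rho.sum() * dx

dt = 0.4 * dx**2                                   # explicit-scheme stability limit
entropies = []
for _ in range(2000):
    entropies.append(np.sum(rho * np.log(rho + 1e-30)) * dx)
    lap = np.zeros_like(rho)
    lap[1:-1] = (rho[2:] - 2 * rho[1:-1] + rho[:-2]) / dx**2
    rho = np.clip(rho + dt * lap, 0, None)         # d rho/dt = Laplacian(rho)
    rho /= rho.sum() * dx                          # re-normalize boundary leakage

print(entropies[0], entropies[-1])                 # negative entropy strictly decreases
print(np.all(np.diff(entropies) < 0))              # True
```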

3.2 Fokker-Planck and the McKean-Vlasov Equation

Common in physics and biology is a functional with three components: Internal Energy, Potential Energy, and Interaction Energy.

$$\mathcal{F}(\rho) = \int U(\rho) \, dx + \int V(x) \, \rho(dx) + \frac{1}{2} \int (W * \rho) \, \rho(dx)$$

The resulting flow is the McKean-Vlasov Equation:

$$\partial_t \rho = \nabla \cdot \big( \rho \nabla U'(\rho) + \rho \nabla V + \rho \nabla (W * \rho) \big)$$

If $U(\rho) = \rho \log \rho$, then $U'(\rho) = \log \rho + 1$, and we recover the standard Fokker-Planck equation when $W = 0$. This equation is used to model everything from the training dynamics of Neural Networks (in the mean-field limit) to the swarming behavior of insects. In the Neural Network context, $V$ represents the loss surface, and $W$ represents the interactions between neurons or particles in a particle-based optimizer.
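For a concrete instance, take $U(\rho) = \rho \log \rho$, $W = 0$, and a quadratic potential $V(x) = x^2/2$ (an assumption made only for this illustration). The flow becomes the Fokker-Planck equation $\partial_t \rho = \nabla \cdot (\rho \nabla V) + \Delta \rho$, whose stationary point is the Gibbs density $\propto e^{-V}$. A minimal 1D finite-difference sketch:

```python
import numpy as np

# Fokker-Planck sketch: d rho/dt = d/dx(rho * V'(x)) + d^2 rho/dx^2, with V(x) = x^2/2.
x = np.linspace(-4, 4, 400)
dx = x[1] - x[0]
dV = x                                             # V'(x) for the quadratic potential
rho = np.exp(-((x + 2.0)**2) / 0.2)                # start far from equilibrium
rho /= rho.sum() * dx

dt = 0.2 * dx**2
for _ in range(40_000):
    drift = np.gradient(rho * dV, dx)              # transport toward the potential minimum
    lap = np.zeros_like(rho)
    lap[1:-1] = (rho[2:] - 2 * rho[1:-1] + rho[:-2]) / dx**2
    rho = np.clip(rho + dt * (drift + lap), 0, None)
    rho /= rho.sum() * dx

gibbs = np.exp(-0.5 * x**2)
gibbs /= gibbs.sum() * dx
print(np.max(np.abs(rho - gibbs)))                 # small: the flow has reached the Gibbs measure
```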


4. The JKO Scheme: Time Discretization

The continuous gradient flow $\partial_t \rho = -\operatorname{grad}_W \mathcal{F}(\rho)$ is often difficult to solve directly. Jordan, Kinderlehrer, and Otto (1998) introduced a discrete-time approximation that captures the geometry of the flow.

4.1 The Proximal Point Interpretation

In Euclidean space, the implicit Euler step for $x_{k+1}$ is $x_{k+1} = x_k - \tau \nabla F(x_{k+1})$. This is equivalent to the Proximal Operator:

$$x_{k+1} = \operatorname{argmin}_{x} \left\{ F(x) + \frac{1}{2\tau} \| x - x_k \|^2 \right\}$$

The JKO Scheme generalizes this to the space of measures by replacing the Euclidean distance with the $W_2$ distance:

$$\rho_{k+1} = \operatorname{argmin}_{\rho \in \mathcal{P}_2} \left\{ \mathcal{F}(\rho) + \frac{1}{2\tau} W_2^2(\rho, \rho_k) \right\}$$

This variational problem balances the desire to decrease the functional $\mathcal{F}$ with the “cost of moving” mass from $\rho_k$. As $\tau \to 0$, the sequence $\rho_k$ converges to the solution of the gradient flow PDE. The scheme is both a practical numerical method and the fundamental tool for proving existence and uniqueness of solutions to these PDEs.
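In one dimension the scheme is easy to sketch with particles, because $W_2^2$ between two equally weighted empirical measures is just the mean squared difference of their sorted positions. The toy example below runs JKO steps for the potential energy $\mathcal{F}(\rho) = \int V \, d\rho$ with an assumed quadratic $V$ (no entropy term), solving each inner proximal problem by plain gradient descent; it is a sketch of the idea, not a production solver:

```python
import numpy as np

def jko_step(x_prev, grad_V, tau, inner_steps=500, lr=0.01):
    """One JKO step for F(rho) = E_rho[V], with rho an empirical measure in 1D.

    The proximal objective per particle is V(x_i) + (x_i - y_i)^2 / (2*tau),
    where y_i are the sorted previous positions (the monotone coupling is
    optimal in 1D, so W_2^2 decouples across sorted particle pairs)."""
    y = np.sort(x_prev)
    x = y.copy()
    for _ in range(inner_steps):
        g = grad_V(x) + (x - y) / tau       # gradient of the proximal objective
        x -= lr * g
        x.sort()                            # keep the monotone coupling with y
    return x

# Example: flow toward the minimum of V(x) = (x - 2)^2, starting from N(0, 1).
grad_V = lambda x: 2.0 * (x - 2.0)
x = np.random.normal(0.0, 1.0, size=500)
for _ in range(20):
    x = jko_step(x, grad_V, tau=0.1)
print(x.mean(), x.std())    # mean drifts toward 2; spread shrinks (no entropy term)
```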

4.2 Particle and SDE Discretizations

For many physical functionals, the gradient flow corresponds to a Langevin SDE, $dX_t = -\nabla V(X_t) \, dt + \sqrt{2} \, dB_t$. The JKO scheme can be viewed as the “mean field” limit of a system of particles interacting via potentials. By simulating $N$ particles, we obtain a Monte Carlo approximation of the Wasserstein flow:

$$X_{k+1}^i = X_k^i - \tau \nabla V(X_k^i) + \sqrt{2\tau} \, \xi_k^i$$

where $\xi_k^i \sim \mathcal{N}(0, I)$.


5. Functional Inequalities and Convergence

The Riemannian structure of $\mathcal{P}_2$ allows us to translate global geometric properties (like curvature) into functional properties (like convergence rates).

5.1 Ricci Curvature and Displacement Convexity

We say a functional $\mathcal{F}$ is $\lambda$-Displacement Convex if it is $\lambda$-convex along Wasserstein geodesics. For the relative entropy, this is equivalent to the base space having Ricci curvature bounded below by $\lambda$. If $\mathcal{F}$ is $\lambda$-displacement convex with $\lambda > 0$, the gradient flow converges Exponentially Fast to the unique global minimizer $\rho_\infty$:

$$W_2(\rho_t, \rho_\infty) \le e^{-\lambda t} \, W_2(\rho_0, \rho_\infty)$$

5.2 LSI, HWI, and Talagrand

These relationships link Entropy ($H$), Wasserstein distance ($W_2$), and Fisher Information ($I$).

  1. Log-Sobolev Inequality (LSI): $H(\rho \,|\, \rho_\infty) \le \frac{1}{2\lambda} I(\rho \,|\, \rho_\infty)$.
  2. Talagrand’s Inequality: $W_2^2(\rho, \rho_\infty) \le \frac{2}{\lambda} H(\rho \,|\, \rho_\infty)$.
  3. HWI Inequality: $H(\rho \,|\, \rho_\infty) \le W_2(\rho, \rho_\infty) \sqrt{I(\rho \,|\, \rho_\infty)} - \frac{\lambda}{2} W_2^2(\rho, \rho_\infty)$.

These inequalities imply that if you control the entropy (information), you also control the physical distance mass must move; a numerical check for Gaussians follows below. This is a powerful tool for proving the stability of Markov chain Monte Carlo (MCMC) sampling and bounding generalization error in deep learning.
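For Gaussian measures all three quantities are available in closed form, so the inequalities can be verified numerically. The sketch below takes $\rho_\infty = \mathcal{N}(0, 1)$ (so $\lambda = 1$) and $\rho = \mathcal{N}(m, s^2)$:

```python
import numpy as np

def kl(m, s):        # H(rho | rho_inf) for rho = N(m, s^2), rho_inf = N(0, 1)
    return 0.5 * (s**2 + m**2 - 1.0 - np.log(s**2))

def fisher(m, s):    # I(rho | rho_inf) = E_rho |grad log(rho / rho_inf)|^2
    return (s**2 - 1.0)**2 / s**2 + m**2

def w2(m, s):        # W_2 distance between the two Gaussians
    return np.sqrt(m**2 + (s - 1.0)**2)

lam = 1.0            # N(0, 1) is 1-log-concave
for m in np.linspace(-2, 2, 9):
    for s in np.linspace(0.3, 3.0, 10):
        H, I, W = kl(m, s), fisher(m, s), w2(m, s)
        assert H <= I / (2 * lam) + 1e-12                          # Log-Sobolev
        assert W**2 <= 2 * H / lam + 1e-12                         # Talagrand
        assert H <= W * np.sqrt(I) - (lam / 2) * W**2 + 1e-12      # HWI
print("LSI, Talagrand, and HWI all hold on the test grid.")
```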

6. Stein Variational Gradient Descent (SVGD)

Can we perform a Wasserstein gradient flow without knowing the density $\rho$? This is the motivation for Stein Variational Gradient Descent (SVGD). SVGD identifies a velocity field $v$ in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}^d$ that maximizes the rate of decrease of the KL divergence to the target.

6.1 The Stein Mapping

Instead of the full Wasserstein gradient (which requires the density), SVGD restricts the velocity field to the form:

$$v(\cdot) = \mathbb{E}_{x \sim \rho} \left[ \nabla_x \log p(x) \, k(x, \cdot) + \nabla_x k(x, \cdot) \right]$$

where $k$ is a kernel (e.g., RBF). This expression is the kernelized Stein Operator applied to the score of the target. The result is a set of particles that “push” each other apart (repulsive force from $\nabla k$) while being pulled toward regions where the target density $p$ is high (attractive force from $\nabla \log p$). SVGD is effectively a gradient flow in a space where the metric is defined by the kernel: wide kernels approximate the full $W_2$ flow of the KL divergence, while very narrow kernels decouple the particles into independent gradient ascent on $\log p$.
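A minimal SVGD implementation needs only the score $\nabla \log p$ and a kernel. The sketch below uses an RBF kernel with the median heuristic and a standard Gaussian target (so the score is simply $-x$); the step size and iteration count are arbitrary choices for illustration:

```python
import numpy as np

def rbf_kernel(X, h=None):
    # Pairwise kernel matrix K and the gradients grad_{x_j} k(x_j, x_i).
    diff = X[:, None, :] - X[None, :, :]
    sq = np.sum(diff**2, axis=-1)
    if h is None:                                    # median heuristic bandwidth
        h = np.median(sq) / np.log(X.shape[0] + 1) + 1e-8
    K = np.exp(-sq / h)
    gradK = -2.0 / h * diff * K[:, :, None]          # gradK[j, i] = grad_{x_j} k(x_j, x_i)
    return K, gradK

def svgd(X, grad_log_p, n_iter=1000, step=0.1):
    """Sketch of SVGD: particles X of shape (n, d), score function grad_log_p."""
    for _ in range(n_iter):
        K, gradK = rbf_kernel(X)
        # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
        phi = (K @ grad_log_p(X) + gradK.sum(axis=0)) / X.shape[0]
        X = X + step * phi
    return X

# Target: standard 2D Gaussian, so grad log p(x) = -x.
X0 = np.random.normal(2.0, 1.0, size=(200, 2))       # initialize away from the target
X = svgd(X0, lambda X: -X)
print(X.mean(axis=0), X.std(axis=0))                  # roughly zero mean and unit spread
```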


7. Optimal Transport and Mean Field Games

What happens when $N$ agents each move according to a Wasserstein flow to minimize their own cost? This leads to Mean Field Games (MFG). In an MFG, an agent at position $x$ minimizes:

$$\inf_{v} \mathbb{E} \left[ \int_0^T L(x_t, v_t, \rho_t) \, dt + \Phi(x_T, \rho_T) \right]$$

where $\rho_t$ is the distribution of all other agents. The equilibrium is described by a system of two coupled PDEs:

  1. Hamilton-Jacobi-Bellman (HJB): Describes the value function $u(t, x)$ of an individual agent.
  2. Fokker-Planck (FP): Describes the evolution of the population density $\rho(t, x)$.

The optimal transport problem is a special “potential” case of an MFG where the agents are incentivized to move from $\rho_0$ to $\nu$ at minimum cost. This connection allows us to use OT solvers to predict the behavior of large-scale decentralized systems, such as traffic flow or financial markets.

8. Neural Gradient Flows

Modern deep learning has embraced Wasserstein flows via Neural ODEs and Flow Matching. Instead of discrete updates, we parameterize the velocity field $v_\theta(t, x)$ with a neural network. The density $\rho_t$ then evolves by pushing a simple base distribution (e.g., Gaussian) through the ODE:

$$\frac{dX_t}{dt} = v_\theta(t, X_t)$$

By minimizing the distance between the evolved density and the data distribution, we are effectively performing a gradient flow in the space of parameters that mimics the Wasserstein flow in the space of measures. This is the foundation of Continuous Normalizing Flows (CNF) and the recent revolution in Diffusion Models.


9. Wasserstein Proximal Operators and Optimization

In the machine learning context, we often want to minimize a loss $\mathcal{L}(\theta)$ where $\theta$ parameterizes a distribution. The Wasserstein Proximal Operator is defined as:

$$\operatorname{Prox}_{\tau \mathcal{L}}(\rho) = \operatorname{argmin}_{\nu} \left\{ \mathcal{L}(\nu) + \frac{1}{2\tau} W_2^2(\nu, \rho) \right\}$$

This is the building block of the JKO scheme. Modern solvers compute this by:

  1. Entropic Regularization: Using Sinkhorn iterations to approximate $W_2^2$ (as seen in the Optimal Transport post); a minimal Sinkhorn sketch appears after this list.
  2. Particle Flows: Moving particles $x_i$ along the gradient of the proximal objective.
  3. Kernel Methods: Using Reproducing Kernel Hilbert Spaces (RKHS) to smooth the density updates.

This operator is contractive under displacement convexity, making it a robust alternative to standard gradient descent in the space of measures. It is especially useful for Variational Inference, where we seek to approximate a complex posterior with a simpler distribution while respecting the geometry of the state space.
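A self-contained sketch of item 1: plain Sinkhorn iterations on a quadratic cost matrix, which approximate $W_2^2$ between two point clouds. (For very small regularization the scaling vectors under- or overflow and a log-domain implementation is needed; the moderate value below is assumed only for illustration.)

```python
import numpy as np

def sinkhorn_w2sq(x, y, reg=0.2, n_iter=500):
    """Entropy-regularized approximation of W_2^2 between equal-weight point clouds."""
    n, m = len(x), len(y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)           # uniform marginals
    C = np.sum((x[:, None, :] - y[None, :, :])**2, axis=-1)    # cost C_ij = |x_i - y_j|^2
    K = np.exp(-C / reg)                                        # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iter):                                     # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]                             # approximate optimal coupling
    return np.sum(P * C)

# Two 1D Gaussian clouds; exact W_2^2 = (m0 - m1)^2 + (s0 - s1)^2 = 4.25.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(400, 1))
y = rng.normal(2.0, 0.5, size=(400, 1))
print(sinkhorn_w2sq(x, y))   # close to 4.25, biased up by the entropic blur
```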

10. Mean Field Limits and Propagation of Chaos

Why does a system of $N$ discrete particles converge to a continuous PDE? This is the study of Mean Field Limits. Consider $N$ agents moving according to a coupled system of SDEs:

$$dX_t^i = \left( -\nabla V(X_t^i) - \frac{1}{N} \sum_{j \neq i} \nabla W(X_t^i - X_t^j) \right) dt + \sqrt{2} \, dB_t^i$$

As $N \to \infty$, the influence of any single particle becomes negligible (the coupling strength is $1/N$). Under the Propagation of Chaos assumption (Henry McKean, 1966), the joint distribution of all particles converges to a tensor product of identical marginal distributions. Each particle effectively follows the “mean field” produced by the aggregate ensemble. The limiting density $\rho_t$ satisfies the McKean-Vlasov equation derived in Section 3.2. This provides the mathematical justification for using particle filters, ensemble Kalman filters, and particle-based variational inference to solve high-dimensional transport problems.
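The convergence is easy to observe numerically. Below is a sketch of the interacting system with an assumed quadratic confinement $V(x) = x^2/2$ and quadratic interaction $W(z) = z^2/2$, so the pairwise sum collapses to an attraction toward the empirical mean; the empirical moments stabilize as $N$ grows:

```python
import numpy as np

def simulate(N, T=5.0, dt=0.01, seed=0):
    """Euler-Maruyama for dX^i = (-grad V(X^i) - (1/N) sum_j grad W(X^i - X^j)) dt + sqrt(2) dB^i,
    with V(x) = x^2/2 and W(z) = z^2/2, so grad V(x) = x and grad W(z) = z."""
    rng = np.random.default_rng(seed)
    X = rng.normal(2.0, 1.0, N)
    for _ in range(int(T / dt)):
        # (1/N) sum_j (X_i - X_j) = X_i - mean(X); the j = i term vanishes,
        # so including it in the mean is harmless.
        drift = -X - (X - X.mean())
        X = X + drift * dt + np.sqrt(2 * dt) * rng.normal(size=N)
    return X

for N in (10, 100, 1000, 10_000):
    X = simulate(N)
    print(N, X.mean().round(3), X.var().round(3))   # moments stabilize as N grows
```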


11. Implementation: Langevin Dynamics as JKO

We simulate the JKO flow of a potential plus entropy. This is equivalent to sampling from the Gibbs measure $p(x) \propto e^{-V(x)/\epsilon}$. The particles follow the discretized Langevin equation, which is the gradient flow of the free energy.

```python
import numpy as np
import matplotlib.pyplot as plt

def potential_v(x):
    # Double well potential (non-convex, multiple modes)
    return (x**2 - 1)**2

def grad_v(x):
    # grad V = 4x(x^2 - 1)
    return 4 * x * (x**2 - 1)

def run_wasserstein_flow(n_particles=2000, n_steps=2000, dt=0.005, noise_scale=1.0):
    # Start with a narrow Gaussian at 0 (unstable saddle of the double well)
    X = np.random.normal(0, 0.05, n_particles)
    snapshot_times = [0, 100, 500, 1000, 2000]
    history = []
    for t in range(n_steps + 1):
        if t in snapshot_times:
            history.append((t, X.copy()))
        # Update via Langevin dynamics (discretized Wasserstein gradient flow)
        diffusion = np.random.normal(0, np.sqrt(2 * noise_scale * dt), n_particles)
        X = X - grad_v(X) * dt + diffusion
    return history

def visualize_flow(history):
    plt.figure(figsize=(12, 7))
    x_range = np.linspace(-2.5, 2.5, 200)
    # Plot target Gibbs density
    target = np.exp(-potential_v(x_range))
    target /= np.trapz(target, x_range)
    plt.plot(x_range, target, 'k--', lw=2, label='Target (Gibbs) Density')
    # Plot particle histograms at different times
    colors = plt.cm.viridis(np.linspace(0, 1, len(history)))
    for (t, snap), color in zip(history, colors):
        plt.hist(snap, bins=60, density=True, alpha=0.3, color=color, label=f'T={t}')
    plt.title("Convergence of Particle System to Equilibrium (Wasserstein Flow)")
    plt.xlabel("State x")
    plt.ylabel("Density")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

if __name__ == "__main__":
    visualize_flow(run_wasserstein_flow())
```

12. Conclusion: Optimization in the Space of Measures

The discovery that diffusion and Fokker-Planck equations are gradient flows has unified PDE theory, statistical mechanics, and optimization. By viewing the space of measures as a Riemannian manifold, we gain access to a suite of geometric tools—Hessians, curvature, and geodesics—that reveal the underlying structure of dissipative systems. Whether we are training neural networks, modeling cellular swarms, or sampling from complex posteriors, Wasserstein flows provide the most natural language for the evolution of probability.


Historical Timeline

| Year | Event | Significance |
|------|-------|--------------|
| 1926 | Erwin Schrödinger | Relates diffusion to entropic interpolation (Schrödinger Bridges). |
| 1966 | Henry McKean | Propagation of Chaos and McKean-Vlasov equations. |
| 1997 | Robert McCann | Displacement Convexity. |
| 1998 | Jordan, Kinderlehrer, Otto | The JKO Scheme for Fokker-Planck. |
| 2000 | Benamou & Brenier | Dynamic formulation of OT (fluid view). |
| 2001 | Felix Otto | Otto Calculus and Riemannian geometry of $\mathcal{P}_2$. |
| 2016 | Liu & Wang | Stein Variational Gradient Descent (SVGD). |

Appendix A: Displacement Convexity (McCann)

When is $\mathcal{F}(\rho)$ convex on Wasserstein space? Standard convexity of $t \mapsto \mathcal{F}((1-t)\rho_0 + t\rho_1)$ is insufficient because linear interpolation creates multimodality instead of moving mass. Displacement Convexity requires convexity along the Wasserstein geodesic $\rho_t = ((1-t)\,\mathrm{Id} + tT)_\# \rho_0$, where $T$ is the Brenier map. Robert McCann proved that potential and interaction energies are displacement convex if their kernels are convex, and that internal energy is displacement convex if the pressure satisfies the $P'\rho - P \ge 0$ condition.


Appendix B: The Porous Medium Equation

Consider the functional $\mathcal{F}(\rho) = \frac{1}{m-1} \int \rho^m \, dx$. Following the Otto Calculus derivation, the gradient flow is $\partial_t \rho = \Delta(\rho^m)$. This is the Porous Medium Equation. Unlike the Heat Equation (which has infinite propagation speed), the PME has Finite Propagation Speed. If the support of $\rho_0$ is compact, it remains compact for all time, modeling gas flow in porous rocks or biological dispersal.
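The finite-speed claim is easy to see numerically. A rough explicit finite-difference sketch (with $m = 3$ assumed) starts from a compactly supported bump and tracks the width of the numerical support; unlike the heat equation, the support spreads only slowly and stays bounded:

```python
import numpy as np

# Porous Medium Equation sketch: d rho/dt = Laplacian(rho^m) in 1D with m = 3.
m = 3
x = np.linspace(-4, 4, 800)
dx = x[1] - x[0]
rho = np.where(np.abs(x) < 0.5, 1.0 - (2 * x)**2, 0.0)   # bump supported on [-0.5, 0.5]
rho /= rho.sum() * dx

dt = 0.05 * dx**2                 # small step: the diffusivity m * rho^(m-1) degenerates at 0
widths = []
for step in range(20_001):
    if step % 10_000 == 0:
        widths.append(np.sum(rho > 1e-12) * dx)          # width of the numerical support
    u = rho**m
    lap = np.zeros_like(rho)
    lap[1:-1] = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2
    rho = np.clip(rho + dt * lap, 0, None)

print(widths)    # support grows slowly and stays compact (finite propagation speed)
```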


Appendix C: Metric Graphs and Discrete Flows

Can we define Wasserstein flows on a discrete graph $G = (V, E)$? Yes, using the Maas-Mielke framework. The distance between distributions on vertices is defined by a discrete continuity equation. This leads to Discrete Ricci Curvature (Bakry-Émery curvature on graphs). If a graph has positive curvature, random walks converge exponentially fast to the stationary distribution.


Appendix D: The JKO Scheme Proof Sketch

To prove that the JKO scheme converges to the PDE, one uses the Euler-Lagrange Equation of the variational problem. The optimality condition relates the first variation of the functional to the potential of the optimal transport map. Taking the gradient and substituting into the continuity equation provides a consistent discretization of the physical flux.


Appendix E: Mean Field Limit of Particle Systems

The transition from $N$ particles to the PDE is rigorous. Under the Propagation of Chaos assumption, the joint distribution of $N$ particles converges to the product of marginals. As $N \to \infty$, the empirical measure converges weakly to the solution of the McKean-Vlasov equation, with noise vanishing into collective pressure terms.


Appendix F: Talagrand’s Inequality and Information Theory

Talagrand’s $T_2$ inequality states that for a Gaussian measure $\gamma$, $W_2^2(\rho, \gamma) \le 2\, \mathrm{KL}(\rho \,\|\, \gamma)$. This implies that if a distribution is close to Gaussian in terms of information (low KL), it must also be physically close (low $W_2$). This is used to prove the concentration of measure for Lipschitz functions.


Appendix G: Regularity Theory (Ambrosio-Gigli-Savaré)

The textbook Gradient Flows (2008) provides the rigorous metric-space framework for these flows. Key result: even when the functional $\mathcal{F}$ is not smooth, if it is Lower Semi-Continuous and displacement convex, the JKO scheme defines a unique Curve of Maximal Slope, which is the unique solution to the evolution equation.


Appendix H: Quantum Optimal Transport and Lindblad Equations

How do Wasserstein flows generalize to the quantum world? In quantum mechanics, the state is a Density Matrix $\sigma$. The entropy is the Von Neumann entropy $S(\sigma) = -\operatorname{Tr}(\sigma \log \sigma)$. Gradient flows on the space of density matrices, under a non-commutative version of the $W_2$ metric (developed by Carlen and Maas), recover the Lindblad Master Equation. This shows that the decoherence and dissipation of quantum systems are also optimal transport processes, where the “mass” being moved is the probability flux between energy levels.


Appendix I: Information Geometry (Fisher-Rao) vs. Wasserstein

There are two primary ways to make the space of measures a manifold:

  1. Information Geometry (Fisher-Rao): The distance is based on the KL divergence locally (Fisher Information Metric). This metric is Vertical: it only cares about the change in probability at each point, ignoring the distance between the points themselves.
  2. Wasserstein Geometry: The distance is based on the cost of moving mass Horizontally. The gradient flow of entropy in Fisher-Rao geometry is the exponential decay $\partial_t \rho = -\rho(\log \rho + 1)$, while in Wasserstein geometry it is the Heat Equation $\partial_t \rho = \Delta \rho$.

This distinction is crucial in optimization: natural gradient descent (Fisher) is good for parameter estimation, while Wasserstein gradient descent is better for data interpolation and generative modeling.

Appendix J: Glossary of Terms

  • Continuity Equation: A PDE expressing the conservation of mass: $\partial_t \rho + \operatorname{div}(\rho v) = 0$.
  • Displacement Convexity: Convexity of a functional along Wasserstein geodesics, ensuring unique minimizers and contractive flows.
  • JKO Scheme: A variational time-discretization of gradient flows in the Wasserstein metric.
  • McKean-Vlasov Equation: A non-linear PDE describing the evolution of a density under potential and interaction forces.
  • Otto Calculus: A formal Riemannian framework for performing calculus on the space of probability measures.
  • Propagation of Chaos: The property that a system of $N$ interacting particles becomes asymptotically independent as $N \to \infty$.
  • Stein Operator: A differential operator used in SVGD to map the score function of a target distribution to a velocity field.
  • Wasserstein Manifold: The formal infinite-dimensional manifold structure of $(\mathcal{P}_2, W_2)$.

References

1. Jordan, R., Kinderlehrer, D., & Otto, F. (1998). “The Variational Formulation of the Fokker-Planck Equation.”
2. Otto, F. (2001). “The Geometry of Dissipative Evolution Equations: The Porous Medium Equation.”
3. McCann, R. J. (1997). “A Convexity Principle for Interacting Gases.”
4. Ambrosio, L., Gigli, N., & Savaré, G. (2008). Gradient Flows in Metric Spaces and in the Space of Probability Measures.
5. Villani, C. (2009). Optimal Transport: Old and New.
6. Bakry, D., Gentil, I., & Ledoux, M. (2014). Analysis and Geometry of Markov Diffusion Operators.
7. Liu, Q., & Wang, D. (2016). “Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm.”
8. Benamou, J.-D., & Brenier, Y. (2000). “A Computational Fluid Mechanics Solution to the Monge-Kantorovich Mass Transfer Problem.”