Wasserstein Gradient Flows
Gradient descent minimizes $f(x)$ by chasing the steepest descent direction in Euclidean space, and the obvious next question is what happens when the thing you are optimizing is not a point in $\mathbb{R}^n$ but a probability distribution, and you want the optimizer to respect how mass actually moves.
Neither the $L^2$ norm nor KL divergence cares about the cost of physically moving mass: an $L^2$ gradient flow wipes out mass in one region and spawns it somewhere else with no regard for geometry, and the result is a path that makes no physical sense.
The Wasserstein metric $W_2$ fixes this by turning the space of probability measures $\mathcal{P}_2(\mathbb{R}^n)$ into an infinite-dimensional Riemannian manifold, and Jordan, Kinderlehrer, and Otto (1998) showed that a lot of classical PDEs, like the heat equation, Fokker-Planck, and the porous medium equation, are gradient flows of familiar functionals like entropy and internal energy sitting in this geometry.
This lets you point optimization language at PDEs and manifold geometry at stochastic processes. The rest of this piece builds Otto calculus and the tools for optimization and stability analysis on the space of measures.
1. Benamou-Brenier: Transport as Fluid Flow
Kantorovich's formulation is static: it picks out a coupling but says nothing about the path. Benamou and Brenier (2000) recast optimal transport as a fluid mechanics problem where you watch mass actually move.
A density $\rho_t$ moving under a velocity field $v_t$ satisfies the continuity equation
$$\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0.$$
This is conservation of mass: the total measure holds steady, the flux is carried by $\rho_t v_t$, and nothing leaks in or out.
Their theorem says the squared distance $W_2^2(\mu, \nu)$ equals the minimum kinetic energy you need to push $\mu$ into $\nu$:
$$W_2^2(\mu, \nu) = \min_{(\rho_t, v_t)} \left\{ \int_0^1 \int \|v_t(x)\|^2 \, d\rho_t(x) \, dt \;:\; \partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0, \ \rho_0 = \mu, \ \rho_1 = \nu \right\}.$$
This is where the Riemannian structure shows up: the density $\rho$ plays the role of a point on the manifold and the velocity field $v$ sits in the tangent space $T_\rho \mathcal{P}_2$.
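In one dimension both sides of this identity are easy to compute for empirical measures, so a quick numerical check is possible. The sketch below is plain NumPy with illustrative distributions and function names of my choosing: it pairs sorted samples, which is the optimal coupling in 1D, and compares the resulting $W_2^2$ to the kinetic energy of the straight-line particle interpolation. The two numbers agree because straight-line motion of optimally coupled particles is the Benamou-Brenier minimizer in this simple setting.

import numpy as np

def w2_squared_1d(x, y):
    # In 1D the optimal coupling of two equal-size samples is the sorted (monotone) matching.
    xs, ys = np.sort(x), np.sort(y)
    return np.mean((xs - ys) ** 2)

def kinetic_energy_of_interpolation(x, y, n_time=50):
    # Discretize the displacement interpolation: each sorted particle moves on a straight
    # line from its start to its target; estimate velocities by finite differences in t
    # and integrate the kinetic energy over the path.
    xs, ys = np.sort(x), np.sort(y)
    t_grid = np.linspace(0.0, 1.0, n_time)
    positions = (1 - t_grid[:, None]) * xs[None, :] + t_grid[:, None] * ys[None, :]
    dt = t_grid[1] - t_grid[0]
    velocities = np.diff(positions, axis=0) / dt          # (n_time - 1, n_particles)
    ke_per_interval = (velocities ** 2).mean(axis=1)      # mean over particles = integral against rho_t
    return np.sum(ke_per_interval * dt)

rng = np.random.default_rng(0)
mu_samples = rng.normal(-1.0, 0.3, size=5000)
nu_samples = rng.normal(+1.5, 0.7, size=5000)

print("W2^2 (static, sorted matching):", w2_squared_1d(mu_samples, nu_samples))
print("Benamou-Brenier kinetic energy:", kinetic_energy_of_interpolation(mu_samples, nu_samples))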
Tangent space structure
Not every velocity field actually moves mass: a divergence-free piece like a vortex leaves the density untouched and only burns kinetic energy, so by Helmholtz-Hodge the optimal velocity has to be a gradient.
And so the tangent space at $\rho$ is a closure of gradients: $T_\rho \mathcal{P}_2 = \overline{\{\nabla \phi : \phi \in C_c^\infty\}}^{L^2(\rho)}$.
The inner product on tangent vectors is $\langle v_1, v_2 \rangle_\rho = \int v_1 \cdot v_2 \, d\rho$, which weights everything by the density.
A velocity field costs nothing where there is no mass to move, and pushing the same velocity through high-density regions costs more.
2. Otto Calculus
Otto (2001) pinned down how to differentiate functionals on $\mathcal{P}_2(\mathbb{R}^n)$, and the whole thing falls out of the continuity equation and an integration by parts.
Take a path $\rho_t$ with velocity $v_t$ and look at the rate of change of a functional $F[\rho_t]$ along this path:
$$\frac{d}{dt} F[\rho_t] = \int \frac{\delta F}{\delta \rho} \, \partial_t \rho_t \, dx.$$
Plug the continuity equation in and integrate by parts:
$$\frac{d}{dt} F[\rho_t] = -\int \frac{\delta F}{\delta \rho} \, \nabla \cdot (\rho_t v_t) \, dx = \int \nabla \frac{\delta F}{\delta \rho} \cdot v_t \, d\rho_t = \left\langle \nabla \frac{\delta F}{\delta \rho}, \, v_t \right\rangle_{\rho_t}.$$
By the Riesz representation theorem on $T_\rho \mathcal{P}_2$ the Wasserstein gradient drops out clean: $\nabla_{W_2} F[\rho] = \nabla \frac{\delta F}{\delta \rho}$.
Setting $v_t = -\nabla_{W_2} F[\rho_t]$ and dropping it back into the continuity equation gives the general Wasserstein gradient flow PDE:
$$\partial_t \rho_t = \nabla \cdot \left( \rho_t \, \nabla \frac{\delta F}{\delta \rho} \right).$$
The specific equation you get hangs entirely on which $F$ you pick, and different functionals give different classical PDEs.
Hessian and Fisher information
The Wasserstein Hessian holds the second-order behavior, and for the entropy functional it ties straight into Fisher information: along its own flow the entropy dissipates at exactly the Fisher information rate, $\frac{d}{dt} \int \rho_t \log \rho_t \, dx = -\int \rho_t \, |\nabla \log \rho_t|^2 \, dx$.
When this Hessian sits bounded below by $\lambda > 0$ you get $\lambda$-displacement convexity, the flow contracts, and this is the route to the log-Sobolev inequality.
Bakry-Emery
The iterated gradient shows up like this: with the carré du champ $\Gamma(f) = \frac{1}{2}\big(L(f^2) - 2 f L f\big)$, define $\Gamma_2(f) = \frac{1}{2}\big(L\Gamma(f) - 2\,\Gamma(f, Lf)\big)$. The condition $\Gamma_2(f) \geq \lambda\, \Gamma(f)$ means the effective Ricci curvature $\mathrm{Ric} + \nabla^2 V$ is at least $\lambda$, and this is the Bochner identity written in the language of optimal transport.
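For the flat-space generator the condition can be checked by hand. The following worked computation is a standard one, spelled out here for concreteness:
$$L f = \Delta f - \nabla V \cdot \nabla f, \qquad \Gamma(f) = |\nabla f|^2,$$
$$\Gamma_2(f) = \|\nabla^2 f\|_{\mathrm{HS}}^2 + \nabla f^{\top} \nabla^2 V \, \nabla f,$$
so on $\mathbb{R}^n$, where $\mathrm{Ric} = 0$, the Bakry-Emery condition $\Gamma_2 \geq \lambda \Gamma$ holds exactly when $\nabla^2 V \succeq \lambda I$, i.e. when the potential is $\lambda$-strongly convex.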
3. Classical PDEs as Gradient Flows
Specific choices of $F$ bring back classical evolution equations, and the same machinery cranks out entropy flows, McKean-Vlasov, and porous medium.
Heat equation
Take $F[\rho] = \int \rho \log \rho \, dx$, which is negative entropy. The first variation is $\frac{\delta F}{\delta \rho} = \log \rho + 1$, the gradient is $\nabla_{W_2} F = \nabla \log \rho$, and plugging it into the general equation gives the heat equation: $\partial_t \rho = \nabla \cdot (\rho \nabla \log \rho) = \Delta \rho$.
The heat equation is the gradient flow of entropy and diffusion is particles redistributing to maximize entropy as efficiently as they can in the Wasserstein metric.
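A small numerical sanity check of this statement, and of the Fisher-information dissipation mentioned above, is sketched below under the usual caveats (1D, a uniform grid, a crude explicit scheme, an arbitrary bimodal initial density): it evolves a density by the heat equation and confirms that the entropy $\int \rho \log \rho$ decreases at a rate matching $-\int \rho \, |\nabla \log \rho|^2$.

import numpy as np

# 1D grid
x = np.linspace(-6, 6, 601)
dx = x[1] - x[0]

# Bimodal initial density, normalized on the grid
rho = np.exp(-(x - 1.5) ** 2 / 0.1) + np.exp(-(x + 1.5) ** 2 / 0.1)
rho /= np.trapz(rho, x)

def entropy(rho):
    return np.trapz(rho * np.log(rho + 1e-30), x)

def fisher_information(rho):
    # I(rho) = int rho |d/dx log rho|^2 dx = int (rho')^2 / rho dx
    drho = np.gradient(rho, dx)
    return np.trapz(drho ** 2 / (rho + 1e-30), x)

dt = 0.2 * dx ** 2      # explicit heat-equation step; stability needs dt below ~dx^2 / 2
for step in range(2001):
    if step % 500 == 0:
        H_before = entropy(rho)
    lap = (np.roll(rho, -1) - 2 * rho + np.roll(rho, 1)) / dx ** 2
    lap[0] = lap[-1] = 0.0          # crude boundary treatment; density is ~0 there anyway
    rho = rho + dt * lap
    if step % 500 == 0:
        dH_dt = (entropy(rho) - H_before) / dt
        print(f"step {step:5d}: dH/dt = {dH_dt:+.4f},  -Fisher info = {-fisher_information(rho):+.4f}")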
McKean-Vlasov
Now take a functional with three pieces, internal energy, potential energy, and interaction energy:
$$F[\rho] = \int U(\rho) \, dx + \int V \, d\rho + \frac{1}{2} \iint W(x - y) \, d\rho(x) \, d\rho(y).$$
The gradient flow drops straight out of the same recipe:
$$\partial_t \rho = \nabla \cdot \Big( \rho \, \nabla \big( U'(\rho) + V + W * \rho \big) \Big).$$
Setting $U(\rho) = \rho \log \rho$ and $W = 0$ brings back Fokker-Planck, $\partial_t \rho = \nabla \cdot (\rho \nabla V) + \Delta \rho$. This one setup models everything from mean-field neural network training to collective biological motion: in the neural network setting $V$ is the loss surface and $W$ holds how particles push on each other.
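A minimal particle sketch of this equation follows, with an arbitrary illustrative choice of $V$ and $W$ (both are my assumptions, not anything fixed by the text above): each particle feels the confining potential, an averaged pairwise interaction, and Langevin noise, which is exactly the finite-$N$ system whose mean-field limit is the McKean-Vlasov equation.

import numpy as np

def grad_V(x):
    # Confining double-well potential V(x) = (x^2 - 1)^2 (illustrative choice)
    return 4 * x * (x ** 2 - 1)

def grad_W(r):
    # Smooth repulsive interaction W(r) = exp(-r^2); this is its derivative in r
    return -2 * r * np.exp(-r ** 2)

def mckean_vlasov_particles(n=500, n_steps=3000, dt=2e-3, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 0.1, size=n)
    for _ in range(n_steps):
        # Mean-field interaction force: average of -W'(x_i - x_j) over j
        diffs = X[:, None] - X[None, :]              # (n, n) matrix of pairwise differences
        interaction = -grad_W(diffs).mean(axis=1)
        drift = -grad_V(X) + interaction
        X = X + drift * dt + np.sqrt(2 * dt) * rng.normal(size=n)
    return X

X_final = mckean_vlasov_particles()
print("final mean:", X_final.mean(), "final std:", X_final.std())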
4. The JKO Scheme
Solving the PDE head-on is usually hopeless, so the JKO scheme (1998) discretizes the flow in a way that respects the Wasserstein geometry and turns the problem into a sequence of variational steps.
In $\mathbb{R}^n$ the implicit Euler step is the same thing as the proximal operator:
$$x_{k+1} = \arg\min_x \left\{ F(x) + \frac{1}{2\tau} \|x - x_k\|^2 \right\}.$$
Swap Euclidean distance out for $W_2$:
$$\rho_{k+1} = \arg\min_{\rho} \left\{ F[\rho] + \frac{1}{2\tau} W_2^2(\rho, \rho_k) \right\}.$$
Each step weighs dropping $F$ against the cost of moving mass away from $\rho_k$, and as $\tau \to 0$ the iterates slide onto the PDE solution. The JKO construction also hands you existence and uniqueness proofs by building the solution as a limit of JKO iterates.
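The proximal structure is easiest to see in the simplest possible case, sketched below under heavy simplifications that are my own (equal-weight particles in 1D, $F$ reduced to a pure potential energy $\int V \, d\rho$ with no entropy or interaction term, and an illustrative $V$): with only a potential term the JKO step decouples into one Euclidean proximal step per particle, which is exactly the implicit Euler update.

import numpy as np
from scipy.optimize import minimize_scalar

def V(x):
    return (x ** 2 - 1) ** 2          # illustrative potential

def jko_step_particles(y, tau):
    # One JKO step for F[rho] = int V d rho on an equal-weight particle measure.
    # With no entropy/interaction term the objective separates, so each particle
    # solves its own proximal problem  min_x  V(x) + (x - y_i)^2 / (2 tau).
    out = np.empty_like(y)
    for i, yi in enumerate(y):
        res = minimize_scalar(lambda x: V(x) + (x - yi) ** 2 / (2 * tau),
                              bounds=(yi - 5, yi + 5), method="bounded")
        out[i] = res.x
    return out

rng = np.random.default_rng(1)
particles = rng.normal(0.0, 0.05, size=200)
for k in range(20):
    particles = jko_step_particles(particles, tau=0.05)
print("particles settle near the wells at +/- 1:", np.sort(particles)[[0, 99, -1]])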
Particle discretization
When the gradient flow matches a Langevin SDE the JKO scheme drops down to a particle simulation.
The SDE is $dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dB_t$ with $\rho_t = \mathrm{Law}(X_t)$, and this is a Monte Carlo discretization of the Wasserstein flow and nothing more.
5. Convergence via Functional Inequalities
The Riemannian structure turns geometric properties like curvature into concrete convergence rates and you can read off mixing times directly from the Hessian.
If $F$ is $\lambda$-displacement convex with $\lambda > 0$ the flow converges exponentially fast: $W_2(\rho_t, \rho_\infty) \leq e^{-\lambda t}\, W_2(\rho_0, \rho_\infty)$ and $F[\rho_t] - F[\rho_\infty] \leq e^{-2\lambda t}\, (F[\rho_0] - F[\rho_\infty])$.
Three fundamental inequalities tie together entropy and Wasserstein distance and Fisher information and they all come out of the same curvature bound.
- Log-Sobolev (LSI): $\mathrm{KL}(\rho \,\|\, \pi) \leq \frac{1}{2\lambda}\, I(\rho \,\|\, \pi)$
- Talagrand: $W_2^2(\rho, \pi) \leq \frac{2}{\lambda}\, \mathrm{KL}(\rho \,\|\, \pi)$
- HWI: $\mathrm{KL}(\rho \,\|\, \pi) \leq W_2(\rho, \pi)\, \sqrt{I(\rho \,\|\, \pi)} - \frac{\lambda}{2}\, W_2^2(\rho, \pi)$
These form a chain and bounding Fisher information bounds entropy and that bounds how far mass has to travel and they show up all over MCMC convergence proofs and generalization bounds.
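For Gaussians everything in this chain has a closed form, so the inequalities can be checked numerically. The sketch below takes the standard Gaussian target (so $\lambda = 1$) and a Gaussian trial density $\mathcal{N}(m, s^2)$, uses the textbook closed forms for KL, $W_2$, and relative Fisher information in 1D, and verifies LSI and Talagrand over an arbitrary grid of $(m, s)$.

import numpy as np

def kl_gauss(m, s):
    # KL( N(m, s^2) || N(0, 1) )
    return 0.5 * (s ** 2 + m ** 2 - 1.0 - 2.0 * np.log(s))

def w2_gauss(m, s):
    # W2( N(m, s^2), N(0, 1) ) for 1D Gaussians
    return np.sqrt(m ** 2 + (s - 1.0) ** 2)

def fisher_gauss(m, s):
    # Relative Fisher information I( N(m, s^2) || N(0, 1) )
    #   = E_rho | d/dx log(rho/pi) |^2 = s^2 (1 - 1/s^2)^2 + m^2
    return s ** 2 * (1.0 - 1.0 / s ** 2) ** 2 + m ** 2

lam = 1.0   # the standard Gaussian satisfies LSI / Talagrand with constant 1
for m in [0.0, 0.5, 2.0]:
    for s in [0.5, 1.0, 3.0]:
        kl, w2, fi = kl_gauss(m, s), w2_gauss(m, s), fisher_gauss(m, s)
        lsi_ok = kl <= fi / (2 * lam) + 1e-12
        talagrand_ok = w2 ** 2 <= 2 * kl / lam + 1e-12
        print(f"m={m:3.1f} s={s:3.1f}  KL={kl:7.3f}  W2^2={w2**2:7.3f}  I={fi:7.3f}  "
              f"LSI {'ok' if lsi_ok else 'FAIL'}  Talagrand {'ok' if talagrand_ok else 'FAIL'}")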
6. SVGD
Stein Variational Gradient Descent (Liu and Wang 2016) runs a Wasserstein-like flow when you cannot evaluate the normalized target $\pi$ directly and only have the score $\nabla \log \pi$ to work with.
Restrict velocity fields to an RKHS and pick out the one that drops $\mathrm{KL}(\rho \,\|\, \pi)$ the fastest:
$$v(x) = \mathbb{E}_{y \sim \rho}\big[ k(x, y)\, \nabla \log \pi(y) + \nabla_y k(x, y) \big].$$
The first term pulls particles toward high-density regions of $\pi$ and the second shoves them apart through $\nabla_y k(x, y)$ and keeps them from collapsing onto each other. The kernel bandwidth slides you between two regimes: very wide kernels couple every particle into one averaged mean-field update, and very narrow kernels kill the repulsion and leave each particle running its own gradient flow on $\log \pi$.
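A compact implementation of this update is sketched below; it targets an arbitrary illustrative 1D double-well density with an RBF kernel and the median bandwidth heuristic, and none of the constants are canonical.

import numpy as np

def score(x):
    # Score of the unnormalized target pi(x) ~ exp(-(x^2 - 1)^2): grad log pi
    return -4 * x * (x ** 2 - 1)

def rbf_kernel(x, h):
    # k(x_i, x_j) = exp(-|x_i - x_j|^2 / h) and its gradient in the second argument
    diffs = x[:, None] - x[None, :]
    K = np.exp(-diffs ** 2 / h)
    grad_K_y = 2 * diffs / h * K           # d/dy_j of exp(-(x_i - y_j)^2 / h)
    return K, grad_K_y

def svgd(n=200, n_iters=500, step=0.05, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 0.1, size=n)
    for _ in range(n_iters):
        # Median heuristic for the bandwidth
        pair_dists = np.abs(x[:, None] - x[None, :])
        h = np.median(pair_dists) ** 2 / np.log(n) + 1e-8
        K, grad_K_y = rbf_kernel(x, h)
        # phi(x_i) = mean_j [ k(x_j, x_i) score(x_j) + grad_{x_j} k(x_j, x_i) ]
        phi = (K * score(x)[None, :]).mean(axis=1) + grad_K_y.mean(axis=1)
        x = x + step * phi
    return x

particles = svgd()
print("particle mean/std:", particles.mean(), particles.std())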
7. Mean Field Games
Look at a large population of agents, each running their own Wasserstein flow to minimize an individual cost: an agent at position $x$ solves a control problem in which $\rho$, the distribution of all the other agents, enters the cost, so each one responds to the crowd.
At equilibrium this system splits into two coupled PDEs: Hamilton-Jacobi-Bellman for the individual value function and Fokker-Planck for the population density. Optimal transport is the special potential case where agents want to move from $\mu$ to $\nu$ at minimum cost, and this lets you point OT solvers at traffic flow, financial markets, and other decentralized systems.
8. Neural Gradient Flows
Parameterize the velocity field with a neural network $v_\theta(x, t)$, push a base distribution like a Gaussian through the ODE $\dot{x}_t = v_\theta(x_t, t)$, and watch it land on the data distribution.
The objective is to shrink the distance between the evolved density and the data distribution, and this ends up being gradient descent in parameter space that tracks a Wasserstein flow in measure space. This is what sits under continuous normalizing flows, and it gives the theoretical skeleton of diffusion models.
9. Wasserstein Proximal Operators
The JKO variational step now gets used directly in machine learning and it shows up as a proximal operator on measures.
In practice you compute these through entropic regularization like Sinkhorn (see Optimal Transport), particle flows, or RKHS smoothing. The proximal operator contracts under displacement convexity, and it pays off in variational inference when you want to respect the geometry of the state space instead of just minimizing KL divergence.
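As one concrete piece of that pipeline, here is a minimal Sinkhorn sketch for the entropically regularized transport cost that sits inside such a proximal step; the grid, the two densities, and the regularization strength are all arbitrary illustrative choices.

import numpy as np

def sinkhorn_cost(a, b, C, eps=0.1, n_iters=500):
    # Entropic OT via Sinkhorn matrix scaling: returns <P, C> for the
    # entropically regularized optimal coupling P between marginals a and b.
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return np.sum(P * C)

# Two densities on a shared 1D grid
x = np.linspace(-3, 3, 200)
a = np.exp(-(x + 1) ** 2); a /= a.sum()
b = np.exp(-(x - 1) ** 2 / 0.5); b /= b.sum()
C = (x[:, None] - x[None, :]) ** 2      # squared-distance cost, approximates W2^2

print("entropic W2^2 estimate:", sinkhorn_cost(a, b, C))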
10. Propagation of Chaos
Why does the particle system actually converge to the PDE, and what makes the limit work out? Look at coupled SDEs:
$$dX_t^i = -\nabla V(X_t^i)\,dt - \frac{1}{N} \sum_{j=1}^{N} \nabla W(X_t^i - X_t^j)\,dt + \sqrt{2}\,dB_t^i.$$
As $N \to \infty$ each particle's influence on any one other particle fades to zero. McKean (1966) showed the joint distribution factors into identical marginals, each particle ends up tracking the mean field of the ensemble, and the limiting density satisfies the McKean-Vlasov equation. This is what justifies particle filters, ensemble Kalman filters, and particle-based variational inference.
11. Implementation: Langevin Dynamics as JKO
You can simulate the JKO flow of potential plus entropy, $F[\rho] = \int V \, d\rho + \int \rho \log \rho \, dx$, by sampling from the Gibbs density $\pi \propto e^{-V}$ with discretized Langevin dynamics. The code below runs particles through a double well and watches them settle.
import numpy as np
import matplotlib.pyplot as plt

def potential_v(x):
    # Double well potential (non-convex, multiple modes)
    return (x**2 - 1)**2

def grad_v(x):
    # grad V = 4x(x^2 - 1)
    return 4 * x * (x**2 - 1)

def run_wasserstein_flow(n_particles=2000, n_steps=2000, dt=0.005, noise_scale=1.0):
    # Start with a narrow Gaussian at 0 (unstable saddle)
    X = np.random.normal(0, 0.05, n_particles)
    history = []
    snapshot_times = [0, 100, 500, 1000, 2000]
    for t in range(n_steps + 1):
        # Record snapshots before updating so T=0 shows the initial condition
        if t in snapshot_times:
            history.append((t, X.copy()))
        # Update via Langevin dynamics (discretized gradient flow)
        diffusion = np.random.normal(0, np.sqrt(2 * noise_scale * dt), n_particles)
        X = X - grad_v(X) * dt + diffusion
    return history

def visualize_flow(history):
    plt.figure(figsize=(12, 7))
    x_range = np.linspace(-2.5, 2.5, 200)
    # Plot target Gibbs density
    V_vals = potential_v(x_range)
    target = np.exp(-V_vals)
    target /= np.trapz(target, x_range)
    plt.plot(x_range, target, 'k--', lw=2, label='Target (Gibbs) Density')
    # Plot particle histograms at different times
    colors = plt.cm.viridis(np.linspace(0, 1, len(history)))
    for (t, snap), color in zip(history, colors):
        plt.hist(snap, bins=60, density=True, alpha=0.3, color=color, label=f'T={t}')
    plt.title("Convergence of Particle System to Equilibrium (Wasserstein Flow)")
    plt.xlabel("State x")
    plt.ylabel("Density")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

visualize_flow(run_wasserstein_flow())

12. Summary
Seeing Fokker-Planck and diffusion as gradient flows pulls together PDE theory, statistical mechanics, and optimization. The Riemannian manifold of measures hands you Hessians, curvature, geodesics, and the full toolkit for dissipative systems, and whether you care about generative modeling, biological dynamics, or posterior sampling, the Wasserstein perspective gives you the right geometric framework.
Displacement convexity
Standard convexity is the wrong notion here because linear interpolation of measures drags in spurious multimodality and loosens the geometry in the wrong way. Displacement convexity instead asks for convexity along the Wasserstein geodesic $\rho_t = \big((1-t)\,\mathrm{id} + t\,T\big)_{\#}\rho_0$ with $T$ the Brenier map. McCann showed that potential and interaction energies are displacement convex when their kernels are convex, and that the internal energy $\int U(\rho)\,dx$ is displacement convex when $r \mapsto r^n U(r^{-n})$ is convex and non-increasing; for $U(\rho) = \rho^m$ this means $m \geq 1 - \tfrac{1}{n}$.
Porous medium equation
Take $F[\rho] = \frac{1}{m-1} \int \rho^m \, dx$ and Otto calculus hands you $\partial_t \rho = \Delta(\rho^m)$ straight out.
Unlike the heat equation, which has infinite propagation speed, the porous medium equation has finite propagation speed: compact initial support stays compact, and it models gas flow through rock and biological dispersal.
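The finite-speed claim is easy to see numerically. The sketch below (1D, explicit finite differences, $m = 2$, all constants arbitrary) evolves a compactly supported bump under $\partial_t \rho = \partial_x^2(\rho^m)$ and tracks how far the support has spread; it grows, but only at finite speed, in contrast to the heat equation where the density becomes instantly positive everywhere.

import numpy as np

m = 2
x = np.linspace(-5, 5, 1001)
dx = x[1] - x[0]

# Compactly supported initial bump on [-1, 1]
rho = np.clip(1 - x ** 2, 0, None)
rho /= np.trapz(rho, x)

dt = 0.1 * dx ** 2          # explicit scheme, small step for stability
t = 0.0
report_times = [0.0, 0.05, 0.2, 0.5]
next_report = 0

while t <= report_times[-1] + dt:
    if next_report < len(report_times) and t >= report_times[next_report]:
        support = x[rho > 1e-10]
        print(f"t = {t:5.3f}: support = [{support.min():+.2f}, {support.max():+.2f}]")
        next_report += 1
    u = rho ** m
    lap = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx ** 2
    lap[0] = lap[-1] = 0.0
    rho = rho + dt * lap
    t += dt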
Discrete flows on graphs
Wasserstein flows extend to graphs through the Maas-Mielke framework: you define a transport distance using a discrete continuity equation, you get a discrete Ricci curvature out of Bakry-Emery on graphs, and positive curvature gives you exponentially fast mixing of random walks.
JKO convergence sketch
Write the Euler-Lagrange equation for the JKO variational problem. The optimality condition ties the first variation of $F$ to the optimal transport potential; taking the gradient and substituting into the continuity equation gives a consistent discretization of the flux.
Mean field limit
Under propagation of chaos the $N$-particle joint distribution drifts toward a product of marginals, the empirical measure converges weakly to the McKean-Vlasov solution as $N \to \infty$, and individual noise terms collapse into collective pressure.
Talagrand’s inequality and concentration
Talagrand's inequality says that for the standard Gaussian $\gamma$ you get $W_2^2(\rho, \gamma) \leq 2\,\mathrm{KL}(\rho \,\|\, \gamma)$, so low KL means physically close in $W_2$, and this gets used for concentration of measure of Lipschitz functions.
Regularity theory
The 2008 Ambrosio, Gigli, and Savaré book sets out the rigorous framework and nails down the theory you need. Even for non-smooth $F$, as long as it is lower semicontinuous and displacement convex, JKO defines a unique curve of maximal slope, and this curve is the solution to the evolution equation.
Quantum optimal transport
The state is a density matrix, the entropy is von Neumann entropy, and gradient flows under a non-commutative transport metric from Carlen and Maas bring back the Lindblad master equation; quantum decoherence and dissipation turn out to be transport processes with probability flux moving between energy levels.
Fisher-Rao vs. Wasserstein
There are two ways to put manifold structure on the space of measures and they look at the problem from different angles.
Fisher-Rao comes from information geometry and builds distance out of local KL divergence. It is vertical: it picks up probability changes at each point but ignores distances between points, and its entropy gradient flow is a pointwise reweighting of the density, which is pure exponential decay.
Wasserstein builds distance out of physically moving mass. It is horizontal, and its entropy gradient flow is $\partial_t \rho = \Delta \rho$, which is the heat equation.
The practical split: natural gradient (Fisher) gets used for parameter estimation, the Wasserstein gradient gets used for generative modeling and interpolation, and each one fits its own job.
Key developments in Wasserstein gradient flows
The lineage of this field runs through a handful of decisive contributions, each one unlocking the next move. Schrödinger (1926) first noticed the link between diffusion and entropic interpolation, an idea that came back decades later as Schrödinger bridges. McKean (1966) proved propagation of chaos for interacting particle systems and nailed down the mean-field limit that justifies modern particle methods. McCann (1997) introduced displacement convexity, the right notion of convexity for functionals on measure space, and it kicked out the naive and incorrect linear interpolation of densities.
The watershed was the 1998 paper by Jordan, Kinderlehrer, and Otto, which showed that the Fokker-Planck equation is a gradient flow of entropy in Wasserstein space and laid down the JKO scheme as a variational time-stepping method. Otto (2001) then worked out the full Riemannian calculus on $\mathcal{P}_2$ and gave the field its computational backbone. Benamou and Brenier (2000) contributed the dynamic fluid mechanics formulation of optimal transport and tied the Lagrangian and Eulerian perspectives together, and more recently Liu and Wang (2016) introduced SVGD and pulled Wasserstein gradient flow ideas into practical machine learning inference.
Concepts at a glance
The continuity equation says mass is conserved and it is the fundamental constraint on any flow of probability and everything else builds on top of it. Displacement convexity asks for convexity along Wasserstein geodesics and not along linear interpolations and this is the condition that hands you unique minimizers and contractive flows. The JKO scheme is a variational time-discretization where each step solves a proximal problem weighing energy reduction against transport cost and the limit gives back the continuous PDE.
The McKean-Vlasov equation describes density evolution under combined potential, internal, and interaction forces, and it is the general form from which the heat, Fokker-Planck, and porous medium equations drop out as special cases. Otto calculus is the Riemannian calculus on $\mathcal{P}_2$ that makes all of this precise, defining gradients, Hessians, and curvature for functionals on measure space. Propagation of chaos is the theorem that interacting particles become independent in the limit, justifying particle-based methods. The Stein operator maps the score function to a velocity field in SVGD, and the Wasserstein manifold is the infinite-dimensional Riemannian manifold on which all of these flows live.
References
1. Jordan, R., Kinderlehrer, D., & Otto, F. (1998). "The variational formulation of the Fokker-Planck equation."
2. Otto, F. (2001). "The geometry of dissipative evolution equations."
3. McCann, R. J. (1997). "A convexity principle for interacting gases."
4. Ambrosio, L., Gigli, N., & Savaré, G. (2008). "Gradient Flows…"
5. Villani, C. (2009). "Optimal Transport: Old and New."
6. Bakry, D., et al. (2014). "Analysis and Geometry of Markov Diffusion Operators."
7. Liu, Q., & Wang, D. (2016). "Stein Variational Gradient Descent."
8. Benamou, J. D., & Brenier, Y. (2000). "Fluid mechanics solution to OT."