Optimal Transport & Geometry
Comparing two probability distributions usually starts with KL divergence or total variation, and both of them wipe out the geometry of the underlying space. Two Dirac masses sitting an arbitrarily small distance apart have infinite KL divergence and a TV distance of exactly 1, no matter how close they actually sit.
We want a metric that respects the ground distance between points, and the easiest way to picture it is to think of probability measures as piles of dirt. The distance between two piles is the smallest amount of work needed to reshape one into the other. This is optimal transport, and it gives us the Wasserstein metric.
This post walks from Monge’s 1781 dirt-moving problem through Kantorovich’s relaxation and Brenier’s theorem and entropic regularization and into the modern computational toolkit.
Monge and Kantorovich
Monge asked in 1781: given distributions $\mu$ and $\nu$ on $\mathbb{R}^d$, find the map $T$ that drops the total transport cost as low as it can go,

$$\inf_{T:\,T_\#\mu=\nu}\int c(x,T(x))\,d\mu(x).$$

The constraint $T_\#\mu=\nu$ means $\nu(B)=\mu(T^{-1}(B))$ for all Borel sets $B$, and Monge was literally moving soil around to build fortifications. He saw that optimal transport rays should not cross, an early glimpse of the fact that optimal maps are gradients of convex potentials.
When the cost is quadratic, $c(x,y)=\|x-y\|^2$, and the map is smooth, it satisfies the change of variables formula

$$\rho_\mu(x)=\rho_\nu(T(x))\,\det DT(x),$$

where $\rho_\mu$ and $\rho_\nu$ are the densities of $\mu$ and $\nu$. Brenier's theorem says $T=\nabla\varphi$ for a convex $\varphi$, and dropping that in gives the Monge-Ampere equation

$$\det D^2\varphi(x)=\frac{\rho_\mu(x)}{\rho_\nu(\nabla\varphi(x))}.$$

This is a fully nonlinear second-order elliptic PDE, and the optimal transport problem comes down to solving it under the boundary condition $\nabla\varphi(\operatorname{supp}\mu)=\operatorname{supp}\nu$.
Monge's formulation has a snag: sometimes no deterministic map exists at all. If $\mu$ is a point mass and $\nu$ is spread over a sphere, no single-valued map works. Kantorovich fixed this in 1942 by loosening the map into a coupling $\gamma$, a joint distribution on $X\times Y$,

$$W_c(\mu,\nu)=\inf_{\gamma\in\Pi(\mu,\nu)}\int_{X\times Y}c(x,y)\,d\gamma(x,y),$$

where $\Pi(\mu,\nu)=\{\gamma:\ \gamma(A\times Y)=\mu(A),\ \gamma(X\times B)=\nu(B)\}$.
This turns a nonlinear problem into a linear program and the feasible set is convex and weakly compact so minimizers exist and the full machinery of LP duality opens up.
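To make the linear program concrete, here is a minimal sketch in plain NumPy (the function name is illustrative): with uniform marginals, Birkhoff's theorem says the LP attains its minimum at a vertex of the coupling polytope, which is a permutation matrix, so for a handful of points we can simply enumerate permutations.

```python
import itertools
import numpy as np

def kantorovich_bruteforce(C):
    """Exact OT cost for uniform marginals on n points.

    By Birkhoff's theorem the LP attains its minimum at a
    permutation matrix, so we can enumerate permutations.
    """
    n = C.shape[0]
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        cost = sum(C[i, perm[i]] for i in range(n)) / n
        best = min(best, cost)
    return best

# Two point clouds on the line: {0, 1} and {0.1, 1.1}
x = np.array([0.0, 1.0])
y = np.array([0.1, 1.1])
C = (x[:, None] - y[None, :]) ** 2
print(kantorovich_bruteforce(C))  # identity matching: (0.01 + 0.01)/2 = 0.01
```

This is only viable for toy sizes; the point is that the feasible set is a polytope whose vertices are permutations, which is exactly why LP machinery applies.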
Duality
Every linear program has a dual, and here the dual sits over potential functions:

$$\sup_{\varphi,\psi}\ \int\varphi\,d\mu+\int\psi\,d\nu\quad\text{subject to}\quad\varphi(x)+\psi(y)\le c(x,y).$$

This drops out of Fenchel-Rockafellar duality, and the dual variables are Lagrange multipliers for the marginal constraints. The condition $\varphi(x)+\psi(y)\le c(x,y)$ has a natural economic reading: there is no arbitrage in the transport market.
When $c$ is a metric the dual simplifies because the potential has to be 1-Lipschitz, and we get the Kantorovich-Rubinstein form

$$W_1(\mu,\nu)=\sup_{\|f\|_{\mathrm{Lip}}\le 1}\int f\,d\mu-\int f\,d\nu.$$
This is exactly the objective that WGANs optimize and gives useful gradients even when the two distributions have disjoint supports.
Brenier’s Theorem
For quadratic cost the optimal map has a rigid structure.
Theorem (Brenier, 1991). If $\mu$ has a density, there is a unique convex function $\varphi$ (up to an additive constant) such that $\nabla\varphi_\#\mu=\nu$, and $T=\nabla\varphi$ is the unique optimal map for quadratic cost.
The optimal map is a gradient, and the Jacobian $DT=D^2\varphi$ is symmetric positive semi-definite, so the flow is irrotational and particles follow the most direct path.
As an analogy, any matrix $M$ splits as $M=QS$ (a rotation times a symmetric positive semi-definite matrix), and in the same way any vector field $s$ pushing $\mu$ to $\nu$ splits as

$$s=(\nabla\varphi)\circ\pi,$$

where $\varphi$ is convex and $\pi$ is measure-preserving. The optimal map picks out the $\nabla\varphi$ piece that does the actual rearrangement and drops the wasted motion.
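For Gaussians the Brenier map is known in closed form, which makes a nice sanity check of this structure. A minimal NumPy sketch (helper names are illustrative):

```python
import numpy as np

def gaussian_brenier_map(m1, S1, m2, S2):
    """Brenier map between N(m1, S1) and N(m2, S2).

    T(x) = m2 + A (x - m1) with
    A = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2},
    a symmetric PSD matrix, so T is the gradient of a convex
    quadratic -- exactly the structure Brenier's theorem predicts.
    """
    def sqrtm(M):  # symmetric PSD square root via eigendecomposition
        w, V = np.linalg.eigh(M)
        return (V * np.sqrt(np.clip(w, 0, None))) @ V.T
    S1_half = sqrtm(S1)
    S1_half_inv = np.linalg.inv(S1_half)
    A = S1_half_inv @ sqrtm(S1_half @ S2 @ S1_half) @ S1_half_inv
    return lambda x: m2 + A @ (x - m1), A

m1, m2 = np.zeros(2), np.array([1.0, -1.0])
S1, S2 = np.eye(2), np.diag([4.0, 9.0])
T, A = gaussian_brenier_map(m1, S1, m2, S2)
# A is symmetric PSD: the Jacobian of the gradient of a convex function
print(np.allclose(A, A.T), np.all(np.linalg.eigvalsh(A) >= 0))
# The covariance pushes forward correctly: A S1 A^T == S2
print(np.allclose(A @ S1 @ A.T, S2))
```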
Entropic Regularization and Sinkhorn
The LP formulation costs $O(n^3\log n)$ for $n$ points, and this gets way too expensive at scale.
We drop in an entropy penalty and get

$$W_\varepsilon(\mu,\nu)=\min_{\gamma\in\Pi(\mu,\nu)}\int c\,d\gamma+\varepsilon\,\mathrm{KL}(\gamma\,\|\,\mu\otimes\nu).$$
The KKT conditions push the optimal coupling into a Gibbs form

$$\gamma_{ij}=u_i\,K_{ij}\,v_j,\qquad K_{ij}=e^{-C_{ij}/\varepsilon}.$$

Dropping in $u_i=e^{f_i/\varepsilon}$ and $v_j=e^{g_j/\varepsilon}$ gives

$$\gamma_{ij}=e^{(f_i+g_j-C_{ij})/\varepsilon}.$$
The matrix $K$ is the Gibbs kernel, and as $\varepsilon\to 0$ you get exact OT back, while as $\varepsilon\to\infty$ you get the product measure $\mu\otimes\nu$.
In the discrete case, with marginal weight vectors $a$ and $b$, the marginal constraints turn into

$$\operatorname{diag}(u)\,K\,v=a,\qquad \operatorname{diag}(v)\,K^\top u=b,$$

and we flip back and forth between the updates

$$u\leftarrow\frac{a}{Kv},\qquad v\leftarrow\frac{b}{K^\top u}.$$
This is Sinkhorn's algorithm, and each iteration is a matrix-vector product that parallelizes trivially on a GPU. The main numerical headache is that for small $\varepsilon$ the entries of $K$ underflow, so we need log-domain stabilization.
Log-Stabilized Sinkhorn
```python
import jax.numpy as jnp

def log_sum_exp(x, axis=None):
    max_x = jnp.max(x, axis=axis, keepdims=True)
    return max_x + jnp.log(jnp.sum(jnp.exp(x - max_x), axis=axis, keepdims=True))

def sinkhorn_log_domain(X, Y, epsilon=0.01, max_iter=100):
    n, m = X.shape[0], Y.shape[0]
    mu_log = -jnp.log(n) * jnp.ones(n)  # assuming uniform weights
    nu_log = -jnp.log(m) * jnp.ones(m)
    diff = X[:, None, :] - Y[None, :, :]
    C = jnp.sum(diff**2, axis=-1)  # squared Euclidean cost
    f = jnp.zeros(n)  # potential 1
    g = jnp.zeros(m)  # potential 2
    for _ in range(max_iter):
        # f update: f = epsilon * (log_mu - LSE((g - C) / epsilon))
        M1 = (g[None, :] - C) / epsilon
        f = epsilon * (mu_log - log_sum_exp(M1, axis=1).flatten())
        # g update: g = epsilon * (log_nu - LSE((f - C) / epsilon))
        M2 = (f[:, None] - C) / epsilon
        g = epsilon * (nu_log - log_sum_exp(M2, axis=0).flatten())
    log_gamma = (f[:, None] + g[None, :] - C) / epsilon
    gamma = jnp.exp(log_gamma)
    return gamma, jnp.sum(gamma * C)
```

Benamou-Brenier: Transport as Fluid Flow
Instead of a static map, transport can be seen as a time-dependent flow. Benamou and Brenier showed in 2000 that the squared Wasserstein distance equals the smallest kinetic energy of a density flowing from $\mu$ to $\nu$,

$$W_2^2(\mu,\nu)=\min_{(\rho,v)}\int_0^1\!\int\rho_t(x)\,\|v_t(x)\|^2\,dx\,dt,$$

subject to mass conservation

$$\partial_t\rho_t+\nabla\cdot(\rho_t v_t)=0,\qquad\rho_0=\mu,\quad\rho_1=\nu.$$
Mass is conserved with no creation and no destruction and the Wasserstein distance is the action of the cheapest flow. Along the optimal path particles move in straight lines at constant velocity and from the Eulerian side the density moves like a pressureless gas and tools from computational fluid dynamics plug in directly. These geodesics interpolate between distributions by moving mass around rather than fading it in and out and the result is physically meaningful transitions.
WGANs and Domain Adaptation
Standard GANs minimize the Jensen-Shannon divergence, which saturates at a constant ($\log 2$) when supports are disjoint, and this is a fundamental wall in the way of stable training. WGANs use $W_1$ instead,

$$\min_G\ \max_{\|f\|_{\mathrm{Lip}}\le 1}\ \mathbb{E}_{x\sim p_{\mathrm{data}}}[f(x)]-\mathbb{E}_{z\sim p_z}[f(G(z))].$$
The 1-Lipschitz constraint gets held in place through a gradient penalty: we add $\lambda\,\mathbb{E}_{\hat x}\big[(\|\nabla_{\hat x}f(\hat x)\|-1)^2\big]$ to the loss, evaluated at random interpolates $\hat x$ between real and generated samples, and this keeps the critic close to 1-Lipschitz and gives a real signal even when the distributions sit far apart.
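As a toy illustration of the penalty term, take a critic that is linear, so its input gradient is known analytically and no autodiff is needed. A hedged NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_penalty_linear(w, x_real, x_fake, lam=10.0):
    """WGAN-GP penalty for a linear critic f(x) = w . x.

    For a linear critic the gradient w.r.t. the input is just w,
    so the penalty lam * E[(||grad f(x_hat)|| - 1)^2] can be
    written down in closed form (no autodiff needed).
    """
    t = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = t * x_real + (1 - t) * x_fake  # random interpolates
    grad_norm = np.linalg.norm(w)          # constant for linear f
    return lam * (grad_norm - 1.0) ** 2

w = np.array([3.0, 4.0])                   # ||w|| = 5, far from 1-Lipschitz
x_real = rng.normal(size=(8, 2))
x_fake = rng.normal(size=(8, 2))
print(gradient_penalty_linear(w, x_real, x_fake))  # 10 * (5 - 1)^2 = 160.0
```

In a real WGAN-GP the critic is a network and the gradient at each interpolate comes from automatic differentiation; the linear case just makes the shape of the penalty visible.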
For domain adaptation OT gives a map that lines up a source distribution with a target while holding the relative geometry, and Wasserstein barycenters push this to more than two distributions by minimizing $\sum_i\lambda_i\,W_2^2(\nu,\mu_i)$. A barycenter drops the total transport cost to a set of inputs as low as it can, and it shows up in data fusion, medical imaging, astronomy, and anywhere you need to average over distributions and not over points.
Gromov-Wasserstein
Standard OT needs $\mu$ and $\nu$ to sit in the same metric space, and when they do not, as with word embeddings in different languages or a 3D mesh against a 2D image, the cost $c(x,y)$ is not even defined.
The idea is to compare internal geometry instead and ask how far $x$ sits from $x'$ in $X$ compared to how far $y$ sits from $y'$ in $Y$:

$$GW(\mu,\nu)=\min_{\gamma\in\Pi(\mu,\nu)}\iint\big|d_X(x,x')-d_Y(y,y')\big|^2\,d\gamma(x,y)\,d\gamma(x',y').$$
The Gromov-Wasserstein problem is a quadratic assignment problem and it is NP-hard in general but with entropic regularization you can solve it iteratively through Sinkhorn. Modern geometry processing uses this for shape matching across different coordinate systems.
One-Dimensional Transport via Quantile Functions
In one dimension the problem drops down a lot and the optimal map is

$$T=F_\nu^{-1}\circ F_\mu.$$

You compose the source CDF with the target quantile function, and the distance turns into

$$W_p^p(\mu,\nu)=\int_0^1\big|F_\mu^{-1}(t)-F_\nu^{-1}(t)\big|^p\,dt.$$

You can compute this in just $O(n\log n)$ by sorting, and the sliced Wasserstein distance uses this by projecting high-dimensional distributions onto random 1D lines, computing 1D transport on each, and averaging the results.
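A minimal NumPy sketch of both ideas, for equal-size empirical samples (function names are illustrative):

```python
import numpy as np

def w2_1d(x, y):
    """W2 between two equal-size empirical 1D distributions:
    sort both samples (quantile matching) and take the RMS gap."""
    xs, ys = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((xs - ys) ** 2))

def sliced_w2(X, Y, n_proj=100, seed=0):
    """Sliced W2: project onto random unit directions, solve 1D OT
    on each slice by sorting, and average the squared distances."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        total += w2_1d(X @ theta, Y @ theta) ** 2
    return np.sqrt(total / n_proj)

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5, 2.5])          # x shifted by 0.5
print(w2_1d(x, y))                     # 0.5: every quantile moves by 0.5
```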
Displacement Convexity
Wasserstein space holds nontrivial curvature, and McCann brought in displacement convexity in 1997: a functional on measures is displacement convex if it is convex along Wasserstein geodesics.
Some key examples are the potential energy $\int V\,d\rho$ for convex $V$, the internal energy $\int U(\rho(x))\,dx$ for a suitable $U$, and the negative entropy $\int\rho\log\rho\,dx$.
This is the foundation of the Otto calculus and it recasts the heat equation as gradient flow of entropy in Wasserstein space.
Transport on Graphs
On a graph, transport turns into a network flow problem where you move mass from supply nodes to demand nodes and the cost is given by shortest-path distances $d(i,j)$,

$$W_1(\mu,\nu)=\min_{\gamma\in\Pi(\mu,\nu)}\sum_{i,j}\gamma_{ij}\,d(i,j).$$

This is minimum-cost flow, and you can solve it with dual ascent or successive shortest paths. The $W_1$ distance on graphs is tied to the cut norm and the graph Laplacian, and optimal flow lines bunch up along bottleneck edges, a property used for community detection.
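On a path graph the min-cost flow collapses to a closed form worth seeing: the optimal flow across each edge is the cumulative mass imbalance to its left. A small NumPy sketch (function name is illustrative):

```python
import numpy as np

def w1_path_graph(mu, nu):
    """W1 between two distributions on a path graph with unit-length
    edges. The optimal flow across edge (k, k+1) is the cumulative
    mass imbalance to its left, so W1 is a sum of |partial sums|."""
    flow = np.cumsum(mu - nu)[:-1]   # flow over each of the n-1 edges
    return np.sum(np.abs(flow))

mu = np.array([1.0, 0.0, 0.0])       # all mass at node 0
nu = np.array([0.0, 0.0, 1.0])       # all mass at node 2
print(w1_path_graph(mu, nu))         # mass 1 crosses 2 edges: 2.0
```

On general graphs this cumulative-sum trick is gone and you fall back to the min-cost flow solvers mentioned above.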
Wasserstein Splines
Given a sequence of distributions, linear interpolation fades densities in and out, but Wasserstein splines morph them instead by minimizing acceleration in the Wasserstein metric.
This shows up in animation where you morph character poses and in single-cell biology where you piece together gene expression trajectories and the idea is to pick out the most physically plausible path through the space of cell states.
Unbalanced Transport
Standard OT holds mass conservation, but in many applications the total mass actually changes, for example when you line up population distributions in 1950 and 2020.
Unbalanced OT lets mass get made and lets mass get wiped out at a cost, and swaps the hard marginal constraints for KL penalties,

$$\min_{\gamma\ge 0}\int c\,d\gamma+\lambda_1\,\mathrm{KL}(\pi_{1\#}\gamma\,\|\,\mu)+\lambda_2\,\mathrm{KL}(\pi_{2\#}\gamma\,\|\,\nu),$$

where $\pi_{1\#}\gamma$ and $\pi_{2\#}\gamma$ are the marginals of $\gamma$.
This gives the Hellinger-Kantorovich distance and it fits cell tracking where cells divide and die and it fits video processing where objects walk into and out of the frame.
Timeline
| Year | Event | Significance |
|---|---|---|
| 1781 | Gaspard Monge | Defines the “Earth Mover’s” problem. |
| 1942 | Leonid Kantorovich | Relaxes the problem to couplings (Linear Programming). |
| 1975 | Kantorovich | Nobel Prize in Economics. |
| 1987 | Yann Brenier | Proves the Polar Factorization (Brenier’s Theorem). |
| 1997 | Robert McCann | Introduces Displacement Convexity. |
| 2000 | Benamou & Brenier | Dynamic formulation (Fluid mechanics link). |
| 2009 | Cedric Villani | “Optimal Transport: Old and New”. |
| 2013 | Marco Cuturi | Sinkhorn Algorithm (Entropic Regularization). |
| 2017 | Arjovsky et al. | Wasserstein GANs. |
Deriving Monge-Ampere from Optimal Transport
Start with the optimal map $T$ pushing $\mu$ to $\nu$. The change of variables gives, for any test function $h$,

$$\int h(T(x))\,d\mu(x)=\int h(y)\,d\nu(y),$$

and if we hand it densities $\rho_\mu$ and $\rho_\nu$ we get

$$\rho_\mu(x)=\rho_\nu(T(x))\,\det DT(x).$$

Under quadratic cost Brenier says $T=\nabla\varphi$ with $\varphi$ convex, so $DT=D^2\varphi$, and dropping that in gives

$$\det D^2\varphi(x)=\frac{\rho_\mu(x)}{\rho_\nu(\nabla\varphi(x))}.$$
The whole transport problem sits inside a single PDE.
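As a numerical sanity check, in one dimension the optimal map between two Gaussians is an increasing affine map, and the change-of-variables identity can be verified pointwise. A small NumPy sketch:

```python
import numpy as np

def gauss_pdf(x, m, s):
    return np.exp(-((x - m) ** 2) / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

# 1D Brenier map between N(m1, s1^2) and N(m2, s2^2):
# T(x) = m2 + (s2/s1)(x - m1), an increasing affine map,
# i.e. the derivative of a convex quadratic potential.
m1, s1, m2, s2 = 0.0, 1.0, 3.0, 2.0
T = lambda x: m2 + (s2 / s1) * (x - m1)
dT = s2 / s1                               # constant Jacobian

x = np.linspace(-2, 2, 9)
lhs = gauss_pdf(x, m1, s1)                 # rho_mu(x)
rhs = gauss_pdf(T(x), m2, s2) * dT         # rho_nu(T(x)) det DT(x)
print(np.allclose(lhs, rhs))               # True: change of variables holds
```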
Kantorovich and Linear Programming
Kantorovich built his formulation in 1942 while optimizing production in a plywood factory, and he saw that transport is a special case of linear programming. The dual potentials $\varphi$ and $\psi$ are shadow prices: in an efficient economy the internal prices of raw materials at different locations can differ by at most the cost of transporting between them. This stays the backbone of supply chain optimization.
Wasserstein Barycenters
Given distributions $\mu_1,\dots,\mu_N$ with weights $\lambda_i$, their Wasserstein barycenter is

$$\bar\nu=\arg\min_\nu\sum_{i=1}^N\lambda_i\,W_2^2(\nu,\mu_i).$$
Unlike the Euclidean mean which averages coordinates this averages the locations of mass and so the Wasserstein mean of two shifted Gaussians is a Gaussian sitting between them and not a bimodal mixture. In practice the convolutional Wasserstein barycenters approach uses heat kernels on meshes and grids to parallelize the Sinkhorn updates.
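In one dimension the barycenter is computable exactly: its quantile function is the weighted average of the input quantile functions, so for equal-size samples you just average the sorted samples. A small NumPy sketch (equal sample sizes assumed, function name is illustrative):

```python
import numpy as np

def barycenter_1d(samples, weights):
    """W2 barycenter of 1D empirical measures with equal sample
    sizes: the barycenter's quantile function is the weighted
    average of the inputs' quantile functions, i.e. average the
    sorted samples position by position."""
    sorted_samples = np.stack([np.sort(s) for s in samples])
    return np.average(sorted_samples, axis=0, weights=weights)

a = np.array([0.0, 1.0, 2.0])      # uniform on {0, 1, 2}
b = np.array([10.0, 11.0, 12.0])   # same shape, shifted by 10
bary = barycenter_1d([a, b], weights=[0.5, 0.5])
print(bary)  # [5. 6. 7.]: the shape is preserved, mass sits halfway
```

Averaging the densities instead would give a bimodal mixture with lumps at 0-2 and 10-12; the Wasserstein mean moves the mass.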
Regularity (Caffarelli)
Regularity of the optimal map hangs on the geometry of the supports and the densities.
Two conditions matter. The target support has to be convex or $T$ picks up singularities, and the densities have to stay bounded away from zero and infinity; together these give Holder continuity of the map.
Fisher-Rao vs. Wasserstein
There are two natural Riemannian structures sitting on the space of probability measures.
- Fisher-Rao picks out local density variations. It has constant positive curvature and needs overlapping supports and its geodesics interpolate by reweighting.
- Wasserstein picks out physical displacement of mass. It lives in Alexandrov spaces and stays well-defined for disjoint supports and its geodesics interpolate by moving mass.
Sinkhorn Divergence
The choice of $\varepsilon>0$ in Sinkhorn drops in a bias from the entropic term, and the Sinkhorn divergence wipes this out,

$$S_\varepsilon(\mu,\nu)=OT_\varepsilon(\mu,\nu)-\tfrac{1}{2}OT_\varepsilon(\mu,\mu)-\tfrac{1}{2}OT_\varepsilon(\nu,\nu).$$

It stays non-negative, it metrizes weak convergence, and it interpolates between Wasserstein as $\varepsilon\to 0$ and the kernel mean embedding (MMD) as $\varepsilon\to\infty$.
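A minimal NumPy sketch of the debiasing, using a dense Sinkhorn loop without log stabilization, so $\varepsilon$ must not be too small (function names are illustrative, and the non-negativity claim properly applies to the full entropic objective; here we debias the transport part $\langle\gamma,C\rangle$ as a rough proxy):

```python
import numpy as np

def sinkhorn_cost(x, y, eps=0.1, iters=200):
    """Entropic OT transport cost <gamma, C> between uniform
    empirical measures on 1D points x and y (dense Sinkhorn,
    no log tricks, so eps should not be too small)."""
    C = (x[:, None] - y[None, :]) ** 2
    K = np.exp(-C / eps)
    a = np.full(len(x), 1 / len(x))
    b = np.full(len(y), 1 / len(y))
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    gamma = u[:, None] * K * v[None, :]
    return np.sum(gamma * C)

def sinkhorn_divergence(x, y, eps=0.1):
    """Debiased cost: S(x, y) = OT(x, y) - (OT(x, x) + OT(y, y)) / 2."""
    return (sinkhorn_cost(x, y, eps)
            - 0.5 * sinkhorn_cost(x, x, eps)
            - 0.5 * sinkhorn_cost(y, y, eps))

x = np.array([0.0, 1.0, 2.0])
y = x + 3.0
print(abs(sinkhorn_divergence(x, x)) < 1e-9)   # True: zero at identity
print(sinkhorn_divergence(x, y) > 0)           # True: separates x from y
```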
Semi-Discrete Transport and Power Diagrams
When you move a continuous density over to discrete Dirac masses at sites $y_i$, the optimal map cuts space into power cells,

$$\mathrm{Pow}_i=\big\{x:\ \|x-y_i\|^2-w_i\le\|x-y_j\|^2-w_j\ \text{for all }j\big\},$$

where the $w_i$ are dual weights, and you solve the problem with Newton's method on the dual.
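Evaluating which power cell a point falls into is a one-liner once the weights are known. A small NumPy sketch (function name is illustrative):

```python
import numpy as np

def power_cell_assignment(x, sites, weights):
    """Assign each sample point to its power cell: cell i wins
    where |x - y_i|^2 - w_i is smallest. With all weights equal
    this reduces to an ordinary Voronoi diagram."""
    d2 = np.sum((x[:, None, :] - sites[None, :, :]) ** 2, axis=-1)
    return np.argmin(d2 - weights[None, :], axis=1)

sites = np.array([[0.0, 0.0], [1.0, 0.0]])
x = np.array([[0.4, 0.0], [0.6, 0.0]])
# Equal weights: plain Voronoi, nearest site wins
print(power_cell_assignment(x, sites, np.zeros(2)))           # [0 1]
# Raising w_0 grows cell 0 until it swallows both points
print(power_cell_assignment(x, sites, np.array([0.5, 0.0])))  # [0 0]
```

Newton's method on the dual adjusts the $w_i$ until each cell carries exactly the mass assigned to its Dirac.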
Differentiating Through Transport
Many applications need gradients of the transport map with respect to the problem data, and linearizing the optimality conditions brings back exactly the Monge-Ampere equation.
In the discrete Sinkhorn setting, backpropagating through all iterations means holding the full history in memory, so the efficient move is to use the stationarity condition and differentiate implicitly. This gives the Jacobian of the OT plan with respect to the input locations without holding the computational graph around, and it is essential for shape matching where you tune point positions to reduce the transport cost.
Connection to Geometric Integration
Wasserstein flows hold a Hamiltonian structure and the flow in Wasserstein space can be read as a Hamiltonian trajectory where the internal energy plays the role of the Hamiltonian.
The JKO scheme of Jordan, Kinderlehrer, and Otto is the variational integrator,

$$\rho^{k+1}=\arg\min_\rho\Big\{F(\rho)+\frac{1}{2\tau}W_2^2(\rho,\rho^k)\Big\}.$$
Each step is a proximal map in Wasserstein space and this shows that many diffusion processes and the heat equation too are really gradient descent steps in the Wasserstein metric. I find this one of the most satisfying results in the whole theory.
Multimarginal Transport
We can couple $N$ distributions and not just two,

$$\min_{\gamma\in\Pi(\mu_1,\dots,\mu_N)}\int c(x_1,\dots,x_N)\,d\gamma.$$

The state space blows up exponentially in $N$, so naive computation is out of reach, but when the cost has special structure, for example when it leans only on the mean or factors along a tree, specialized Sinkhorn-like algorithms fire.
OT and Density Functional Theory
The Hohenberg-Kohn theorem says the ground state of a multi-electron system is a functional of the electron density, and picking out the minimizing configuration can be framed as multimarginal OT with the repulsive Coulomb cost $c(x_1,\dots,x_N)=\sum_{i<j}\frac{1}{|x_i-x_j|}$. The electrons settle into a pattern that minimizes repulsion while holding the overall density, and this link from Cotar and Pass ties optimal transport right into quantum chemistry.
Quantum Optimal Transport
Quantum OT pushes these ideas over to density matrices and instead of moving mass between points you shuttle quantum states through the Lindblad equation. The resulting Lindblad-Wasserstein metric gives a geometric framework for decoherence and open quantum systems and it is still early days.
Concepts used in this post
- Brenier Map. This is the unique optimal transport map for quadratic cost, and it is the gradient of a convex function, so $T=\nabla\varphi$; existence needs $\mu$ to be absolutely continuous.
- c-Transform. This is the generalized Legendre transform for a cost function, $\varphi^c(y)=\inf_x\big[c(x,y)-\varphi(x)\big]$, and the optimal dual potentials satisfy $\psi=\varphi^c$.
- Coupling. A coupling is a joint distribution with fixed marginals and the set of couplings forms a convex polytope and the optimal coupling drops the expected transport cost as low as it can go.
- McCann Interpolation. This is the geodesic in Wasserstein space, $\mu_t=\big((1-t)\,\mathrm{Id}+t\,T\big)_\#\mu$ where $T$ is the Brenier map, and it is the straight line through measure space.
- Push-forward. Given a map $T$ and a measure $\mu$, the push-forward $T_\#\mu$ is given by $(T_\#\mu)(B)=\mu(T^{-1}(B))$.
- Sinkhorn Divergence. This is the debiased entropic OT cost and it wipes out the entropic bias and metrizes weak convergence.
References
1. Villani, C. (2009). “Optimal Transport: Old and New”. The reference for everything. Links OT to Ricci curvature and geometry.
2. Cuturi, M. (2013). “Sinkhorn Distances: Lightspeed Computation of Optimal Transport”. The paper that made OT practical for machine learning.
3. Peyre, G., & Cuturi, M. (2019). “Computational Optimal Transport”. A comprehensive computational guide.
4. Benamou, J. D., & Brenier, Y. (2000). “A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem”. The dynamic formulation.
5. McCann, R. J. (1997). “A convexity principle for interacting gases”. Introduced displacement convexity.
6. Solomon, J., et al. (2015). “Convolutional Wasserstein Barycenters”. Fast algorithms for barycenters on geometric domains.
7. Arjovsky, M., et al. (2017). “Wasserstein GAN”. Applied OT stability to generative adversarial networks.
8. Caffarelli, L. A. (1992). “The regularity of mappings with a convex potential”. Regularity theory for the Monge-Ampere equation.