Optimal Transport & Geometry

Comparing two probability distributions usually starts with KL divergence or total variation, and both of them discard the geometry of the underlying space. Two Dirac masses sitting an arbitrarily small $\epsilon$ apart have infinite KL divergence and TV distance of exactly 1, no matter how close they actually sit.

We want a metric that respects the ground distance $d(x, y)$ between points, and the easiest way to picture it is to think of probability measures as piles of dirt. The distance between two piles is the smallest amount of work needed to reshape one into the other. This is optimal transport, and it gives us the Wasserstein metric.

This post walks from Monge’s 1781 dirt-moving problem through Kantorovich’s relaxation, Brenier’s theorem, and entropic regularization, and into the modern computational toolkit.


Monge and Kantorovich

Monge asked in 1781: given distributions $\mu$ and $\nu$ on $\mathbb{R}^d$, find the map $T: \mathbb{R}^d \to \mathbb{R}^d$ that minimizes the total transport cost.

$$\min_{T_\# \mu = \nu} \int_{\mathbb{R}^d} c(x, T(x))\, d\mu(x)$$

The constraint $T_\# \mu = \nu$ means $\nu(B) = \mu(T^{-1}(B))$ for all Borel sets $B$. Monge was literally moving soil around to build fortifications, and he saw that optimal transport rays should not cross, an early glimpse of the fact that optimal maps are gradients of convex potentials.

When the cost is quadratic, $c(x, y) = \frac{1}{2}|x - y|^2$, and the map $T$ is smooth, it satisfies the change of variables formula

$$f(x) = g(T(x))\, |\det \nabla T(x)|$$

where $f$ and $g$ are the densities of $\mu$ and $\nu$. Brenier’s theorem says $T = \nabla u$ for a convex $u$, and substituting this in gives the Monge-Ampère equation

$$\det(\nabla^2 u(x)) = \frac{f(x)}{g(\nabla u(x))}$$

This is a fully nonlinear second-order elliptic PDE, and the optimal transport problem comes down to solving it under the boundary condition $\nabla u(\operatorname{spt} \mu) = \operatorname{spt} \nu$.

Monge’s formulation has a snag: sometimes no deterministic map exists at all. If $\mu$ is a point mass and $\nu$ is spread over a sphere, no single-valued map works. Kantorovich fixed this in 1942 by relaxing the map to a coupling $\gamma$, a joint distribution on $X \times Y$.

$$\text{OT}(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \iint_{X \times Y} c(x, y)\, d\gamma(x, y)$$

where $\Pi(\mu, \nu) = \{ \gamma \in \mathcal{P}(X \times Y) : P_{X\#} \gamma = \mu,\ P_{Y\#} \gamma = \nu \}$.

This turns a nonlinear problem into a linear program: the feasible set is convex and weakly compact, so minimizers exist and the full machinery of LP duality opens up.
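To make the LP concrete, here is a minimal sketch that feeds the discrete problem straight to a generic solver; `kantorovich_lp` and its argument names are mine, and for realistic sizes you would reach for a dedicated solver (e.g. POT’s `ot.emd`) instead.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_lp(a, b, C):
    """Solve min <C, gamma> over couplings of a and b (gamma flattened row-major)."""
    n, m = C.shape
    # Row marginals: sum_j gamma[i, j] = a[i]
    rows = np.kron(np.eye(n), np.ones((1, m)))
    # Column marginals: sum_i gamma[i, j] = b[j]
    cols = np.kron(np.ones((1, n)), np.eye(m))
    res = linprog(C.ravel(),
                  A_eq=np.vstack([rows, cols]),
                  b_eq=np.concatenate([a, b]),
                  bounds=(0, None))
    return res.x.reshape(n, m), res.fun
```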


Duality

Every linear program has a dual, and here the dual runs over potential functions.

$$\min_{\gamma} \int c\, d\gamma = \sup_{\phi, \psi} \left\{ \int_X \phi\, d\mu + \int_Y \psi\, d\nu : \phi(x) + \psi(y) \le c(x, y) \right\}$$

This drops out of Fenchel-Rockafellar duality, and the dual variables $(\phi, \psi)$ are Lagrange multipliers for the marginal constraints. The condition $\phi(x) + \psi(y) \le c(x, y)$ has a natural economic reading: there is no arbitrage in the transport market.

When $c(x, y) = d(x, y)$ is a metric the dual simplifies: the potential $\phi$ must be 1-Lipschitz and the optimal $\psi$ is just $-\phi$, which gives the Kantorovich-Rubinstein form

$$W_1(\mu, \nu) = \sup_{\|\phi\|_L \le 1} \int \phi\, d(\mu - \nu)$$

This is exactly the objective that WGANs optimize, and $W_1$ gives useful gradients even when the two distributions have disjoint supports.


Brenier’s Theorem

For quadratic cost $c(x, y) = \frac{1}{2}|x - y|^2$ the optimal map has a rigid structure.

Theorem (Brenier, 1991). If $\mu$ has a density, there is a unique convex function $u: \mathbb{R}^d \to \mathbb{R}$ (up to an additive constant) whose gradient pushes $\mu$ to $\nu$, and the optimal map is

$$T(x) = \nabla u(x)$$

The optimal map is a gradient, and the Jacobian $\nabla T = \nabla^2 u$ is symmetric positive semi-definite, so the flow is irrotational and particles follow the most direct path.

As an analogy, any matrix factors as $M = PO$ (a positive semi-definite matrix composed with an orthogonal one, the polar decomposition), and in the same way any map pushing $\mu$ to $\nu$ factors as

$$X = \nabla u \circ S$$

where $u$ is convex and $S$ is measure-preserving. The optimal map picks out the piece that does the actual rearrangement and discards the wasted motion.
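One case where the Brenier map is available in closed form is Gaussian to Gaussian: $T(x) = m_2 + A(x - m_1)$ with $A = \Sigma_1^{-1/2}(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2})^{1/2} \Sigma_1^{-1/2}$. A minimal sketch, assuming symmetric positive definite covariances; the helper names are mine.

```python
import jax.numpy as jnp

def sqrtm_psd(M):
    # Symmetric PSD square root via eigendecomposition.
    w, V = jnp.linalg.eigh(M)
    return (V * jnp.sqrt(jnp.clip(w, 0.0, None))) @ V.T

def gaussian_brenier_map(m1, S1, m2, S2):
    """Optimal quadratic-cost map from N(m1, S1) to N(m2, S2)."""
    S1_half = sqrtm_psd(S1)
    S1_half_inv = jnp.linalg.inv(S1_half)
    A = S1_half_inv @ sqrtm_psd(S1_half @ S2 @ S1_half) @ S1_half_inv
    # A is symmetric PSD, so T really is the gradient of a convex quadratic.
    return lambda x: m2 + (x - m1) @ A
```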


Entropic Regularization and Sinkhorn

The LP formulation costs $O(n^3 \log n)$ for $n$ points, which gets far too expensive at scale.

We subtract an entropy term $H(\gamma) = -\int \gamma (\log \gamma - 1)\, dx\, dy$ and get

$$\text{OT}_\epsilon(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \iint c\, d\gamma - \epsilon H(\gamma)$$

The first-order optimality (KKT) conditions force the optimal coupling into Gibbs form

$$\gamma^*(x, y) = \exp\left( \frac{\phi(x) + \psi(y) - c(x, y)}{\epsilon} \right)$$

Substituting $u(x) = \exp(\phi(x)/\epsilon)$ and $v(y) = \exp(\psi(y)/\epsilon)$ gives

$$\gamma^*(x, y) = u(x)\, e^{-c(x, y)/\epsilon}\, v(y)$$

The kernel $K(x, y) = e^{-c(x, y)/\epsilon}$ is the Gibbs kernel. As $\epsilon \to 0$ you get exact OT back, and as $\epsilon \to \infty$ you get the product measure $\mu \otimes \nu$.

In the discrete case the marginal constraints turn into

$$u \odot (K v) = a, \quad v \odot (K^T u) = b$$

and we alternate between the updates

$$u^{(k+1)} = \frac{a}{K v^{(k)}}, \quad v^{(k+1)} = \frac{b}{K^T u^{(k+1)}}$$

This is Sinkhorn’s algorithm, and each iteration is a matrix-vector product that parallelizes trivially on a GPU. The main numerical headache is that for small $\epsilon$ the entries of $K$ underflow, so we need log-domain stabilization.
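For reference, a direct transcription of the updates above; a minimal sketch assuming precomputed marginals `a`, `b` and cost matrix `C`, which underflows exactly as described once $\epsilon$ gets small.

```python
import jax.numpy as jnp

def sinkhorn_naive(a, b, C, epsilon=0.1, max_iter=200):
    K = jnp.exp(-C / epsilon)            # Gibbs kernel; underflows for small epsilon
    v = jnp.ones_like(b)
    for _ in range(max_iter):
        u = a / (K @ v)                  # match row marginals
        v = b / (K.T @ u)                # match column marginals
    return u[:, None] * K * v[None, :]   # optimal coupling gamma
```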

Log-Stabilized Sinkhorn

```python
import jax
import jax.numpy as jnp

def log_sum_exp(x, axis=None):
    max_x = jnp.max(x, axis=axis, keepdims=True)
    return max_x + jnp.log(jnp.sum(jnp.exp(x - max_x), axis=axis, keepdims=True))

def sinkhorn_log_domain(X, Y, epsilon=0.01, max_iter=100):
    n, m = X.shape[0], Y.shape[0]
    mu_log = -jnp.log(n) * jnp.ones(n)  # assuming uniform weights
    nu_log = -jnp.log(m) * jnp.ones(m)
    diff = X[:, None, :] - Y[None, :, :]
    C = jnp.sum(diff**2, axis=-1)  # squared Euclidean cost
    f = jnp.zeros(n)  # potential 1
    g = jnp.zeros(m)  # potential 2
    for _ in range(max_iter):
        # f update: f = epsilon * (log_mu - LSE((g - C) / epsilon))
        M1 = (g[None, :] - C) / epsilon
        f = epsilon * (mu_log - log_sum_exp(M1, axis=1).flatten())
        # g update: g = epsilon * (log_nu - LSE((f - C) / epsilon))
        M2 = (f[:, None] - C) / epsilon
        g = epsilon * (nu_log - log_sum_exp(M2, axis=0).flatten())
    log_gamma = (f[:, None] + g[None, :] - C) / epsilon
    gamma = jnp.exp(log_gamma)
    return gamma, jnp.sum(gamma * C)
```

Benamou-Brenier: Transport as Fluid Flow

Instead of a static map, transport can be seen as a time-dependent flow. Benamou and Brenier showed in 2000 that the squared Wasserstein distance equals the minimal kinetic energy of a density $\rho(t, x)$ flowing from $\mu$ to $\nu$,

$$W_2^2(\mu, \nu) = \inf_{\rho, v} \int_0^1 \int \rho(t, x)\, |v(t, x)|^2\, dx\, dt$$

subject to mass conservation

$$\partial_t \rho + \nabla \cdot (\rho v) = 0$$

Mass is conserved, with no creation and no destruction, and the Wasserstein distance is the action of the cheapest flow. Along the optimal path particles move in straight lines at constant velocity; from the Eulerian side the density evolves like a pressureless gas, so tools from computational fluid dynamics plug in directly. These geodesics interpolate between distributions by moving mass around rather than fading it in and out, which gives physically meaningful transitions.
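Since particles travel in straight lines, displacement interpolation needs nothing beyond the map itself; a two-line sketch, assuming you already have the images `T_x` of the source samples under some optimal map.

```python
def displacement_interpolation(x, T_x, t):
    # x: (n, d) source samples, T_x: (n, d) their images under the optimal map.
    # At time t each particle sits at the convex combination of its endpoints.
    return (1.0 - t) * x + t * T_x
```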


WGANs and Domain Adaptation

Standard GANs minimize the Jensen-Shannon divergence, which saturates at a constant when supports are disjoint, a fundamental obstacle to stable training. WGANs use $W_1$ instead,

$$\min_G \max_{\|D\|_L \le 1} \mathbb{E}_{x \sim P_{\text{data}}}[D(x)] - \mathbb{E}_{z \sim P_{\text{noise}}}[D(G(z))]$$

The 1-Lipschitz constraint is enforced with a gradient penalty: adding $(\|\nabla D(x)\|_2 - 1)^2$ to the loss keeps the critic approximately 1-Lipschitz and gives a real signal even when the distributions sit far apart.
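A hedged sketch of that penalty in JAX; `critic` (a function from one sample to a scalar score) and the sampling of interpolation points follow the usual WGAN-GP recipe, but all names here are illustrative.

```python
import jax
import jax.numpy as jnp

def gradient_penalty(critic, x_real, x_fake, key):
    # Sample points on the segments between real and fake samples.
    shape = (x_real.shape[0],) + (1,) * (x_real.ndim - 1)
    eps = jax.random.uniform(key, shape)
    x_hat = eps * x_real + (1.0 - eps) * x_fake
    # Penalize deviation of the critic's gradient norm from 1.
    grads = jax.vmap(jax.grad(critic))(x_hat)
    axes = tuple(range(1, grads.ndim))
    norms = jnp.sqrt(jnp.sum(grads**2, axis=axes) + 1e-12)
    return jnp.mean((norms - 1.0) ** 2)
```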

For domain adaptation, OT gives a map $T$ that aligns a source distribution $\mu_S$ with a target $\mu_T$ while preserving the relative geometry, and Wasserstein barycenters extend this to more than two distributions,

$$\bar{\mu} = \arg\min_\mu \sum_i \lambda_i\, W_2^2(\mu, \mu_i)$$

A barycenter minimizes the total transport cost to a set of inputs, and it shows up in data fusion, medical imaging, astronomy, and anywhere you need to average over distributions rather than points.


Gromov-Wasserstein

Standard OT needs $X$ and $Y$ to live in the same metric space, and when they do not, as with word embeddings in different languages or a 3D mesh against a 2D image, $\|x - y\|$ is not even defined.

The idea is to compare internal geometry instead: how does the distance from $x$ to $x'$ in $X$ compare with the distance from $y$ to $y'$ in $Y$?

$$\text{GW}(\mu, \nu) = \inf_\gamma \iint |d_X(x, x') - d_Y(y, y')|^2\, d\gamma(x, y)\, d\gamma(x', y')$$

The Gromov-Wasserstein problem is a quadratic assignment problem, NP-hard in general, but with entropic regularization you can attack it iteratively through Sinkhorn, as in the sketch below. Modern geometry processing uses this for shape matching across different coordinate systems.
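A sketch of the entropic scheme, following the squared-loss decomposition of Peyré et al. (2016): linearize the quadratic objective at the current coupling, run Sinkhorn on the resulting local cost, and repeat. Function names and the fixed iteration counts are illustrative choices.

```python
import jax.numpy as jnp

def entropic_gw(Dx, Dy, a, b, epsilon=0.05, outer=20, inner=100):
    gamma = a[:, None] * b[None, :]      # start from the independent coupling
    for _ in range(outer):
        # Local linearization of the GW energy at the current coupling.
        C = (Dx**2 @ a)[:, None] + (Dy**2 @ b)[None, :] - 2.0 * Dx @ gamma @ Dy.T
        K = jnp.exp(-C / epsilon)
        v = jnp.ones_like(b)
        for _ in range(inner):           # Sinkhorn projection onto the couplings
            u = a / (K @ v)
            v = b / (K.T @ u)
        gamma = u[:, None] * K * v[None, :]
    return gamma
```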


One-Dimensional Transport via Quantile Functions

In one dimension the problem simplifies dramatically, and the optimal map is

$$T(x) = F_\nu^{-1}(F_\mu(x))$$

You compose the source CDF with the target quantile function, and the $W_p$ distance becomes

$$W_p(\mu, \nu) = \left( \int_0^1 |F_\mu^{-1}(t) - F_\nu^{-1}(t)|^p\, dt \right)^{1/p}$$

You can compute this in $O(n \log n)$ just by sorting, and the sliced Wasserstein distance exploits this by projecting high-dimensional distributions onto random 1D lines, computing 1D transport on each, and averaging the results.
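A sketch of sliced Wasserstein for equal-size point clouds; the projection count and the Monte Carlo average over directions are the usual choices, and sorting is the entire 1D solver.

```python
import jax
import jax.numpy as jnp

def sliced_wasserstein(X, Y, key, n_proj=64, p=2):
    # Random unit directions.
    theta = jax.random.normal(key, (n_proj, X.shape[1]))
    theta = theta / jnp.linalg.norm(theta, axis=1, keepdims=True)
    # Project and sort: in 1D the optimal coupling matches order statistics.
    Xp = jnp.sort(X @ theta.T, axis=0)
    Yp = jnp.sort(Y @ theta.T, axis=0)   # assumes X and Y have equal sizes
    return jnp.mean(jnp.abs(Xp - Yp) ** p) ** (1.0 / p)
```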


Displacement Convexity

Wasserstein space carries nontrivial curvature. McCann introduced displacement convexity in 1997: a functional $\mathcal{F}$ on measures is displacement convex if it is convex along Wasserstein geodesics.

Some key examples are the potential energy $\mathcal{V}(\rho) = \int V(x)\, \rho(x)\, dx$ for convex $V$, the internal energy $\mathcal{E}(\rho) = \int A(\rho(x))\, dx$ for suitable $A$, and the negative entropy $\mathcal{H}(\rho) = \int \rho \log \rho\, dx$.

This is the foundation of the Otto calculus, which recasts the heat equation as the gradient flow of entropy in Wasserstein space.


Transport on Graphs

On a graph $G = (V, E)$, transport becomes a network flow problem: move mass from supply nodes $\mu$ to demand nodes $\nu$, with cost given by shortest-path distances $d_G(i, j)$,

$$\min_{\gamma \ge 0} \sum_{i, j \in V} d_G(i, j)\, \gamma_{ij} \quad \text{s.t.} \quad \sum_j \gamma_{ij} = \mu_i, \quad \sum_i \gamma_{ij} = \nu_j$$

This is minimum-cost flow, solvable with dual ascent or successive shortest paths. The $W_1$ distance on graphs is tied to the cut norm and the graph Laplacian, and optimal flow lines concentrate along bottleneck edges, a property used for community detection.
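A sketch using networkx’s network simplex; it assumes integer masses (the solver is exact on integer data) and reads edge costs from a `weight` attribute. Routing flow along edges at edge cost reproduces $W_1$ with shortest-path ground distance, so no all-pairs distance matrix is ever materialized.

```python
import networkx as nx

def graph_w1(G, mu, nu):
    """W1 between integer-mass distributions mu, nu (dicts node -> mass)."""
    H = nx.DiGraph()
    for u, v, data in G.to_directed().edges(data=True):
        H.add_edge(u, v, weight=data.get("weight", 1))
    for node in G.nodes:
        # network_simplex convention: positive demand means the node receives mass.
        H.add_node(node, demand=nu.get(node, 0) - mu.get(node, 0))
    cost, _flow = nx.network_simplex(H)
    return cost
```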


Wasserstein Splines

Given a sequence of distributions $\mu_0, \mu_1, \dots, \mu_T$, linear interpolation fades densities in and out, but Wasserstein splines morph them instead by minimizing acceleration in the Wasserstein metric.

This shows up in animation, where you morph character poses, and in single-cell biology, where you reconstruct gene expression trajectories and want the most physically plausible path through the space of cell states.


Unbalanced Transport

Standard OT enforces mass conservation, but in many applications the total mass actually changes, for example when you align population distributions in 1950 and 2020.

Unbalanced OT allows mass to be created and destroyed at a cost, swapping the hard marginal constraints for KL penalties,

$$\text{UOT}(\mu, \nu) = \inf_{\gamma} \iint c\, d\gamma + \alpha\, \text{KL}(P_{X\#} \gamma \,\|\, \mu) + \beta\, \text{KL}(P_{Y\#} \gamma \,\|\, \nu)$$

This gives rise to the Hellinger-Kantorovich distance, and it fits cell tracking, where cells divide and die, and video processing, where objects enter and leave the frame.
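Numerically the change is tiny: in the scaling form each Sinkhorn update just gets damped by a power $\rho / (\rho + \epsilon)$ (Chizat et al.). A sketch with equal KL weights $\alpha = \beta = \rho$; names and defaults are illustrative.

```python
import jax.numpy as jnp

def unbalanced_sinkhorn(a, b, C, epsilon=0.05, rho=1.0, max_iter=200):
    K = jnp.exp(-C / epsilon)
    damp = rho / (rho + epsilon)         # exponent from the KL marginal penalty
    u, v = jnp.ones_like(a), jnp.ones_like(b)
    for _ in range(max_iter):
        u = (a / (K @ v)) ** damp        # soft row-marginal step
        v = (b / (K.T @ u)) ** damp      # soft column-marginal step
    return u[:, None] * K * v[None, :]
```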


Timeline

| Year | Event | Significance |
| --- | --- | --- |
| 1781 | Gaspard Monge | Defines the “earth mover’s” problem. |
| 1942 | Leonid Kantorovich | Relaxes the problem to couplings (linear programming). |
| 1975 | Kantorovich | Nobel Prize in Economics. |
| 1987 | Yann Brenier | Proves polar factorization (Brenier’s theorem). |
| 1997 | Robert McCann | Introduces displacement convexity. |
| 2000 | Benamou & Brenier | Dynamic formulation (fluid mechanics link). |
| 2009 | Cédric Villani | Publishes “Optimal Transport: Old and New”. |
| 2013 | Marco Cuturi | Entropic regularization with Sinkhorn iterations. |
| 2017 | Arjovsky et al. | Wasserstein GANs. |

Deriving Monge-Ampere from Optimal Transport

Start with the optimal map $T: X \to Y$ pushing $\mu$ to $\nu$. The change of variables formula gives, for any test function $\psi$,

$$\int_Y \psi(y)\, d\nu(y) = \int_X \psi(T(x))\, d\mu(x)$$

and writing both sides in terms of the densities $f$ and $g$ we get

$$g(T(x))\, |\det \nabla T(x)| = f(x)$$

Under quadratic cost Brenier says $T = \nabla u$ with $u$ convex, so $\nabla T = \nabla^2 u$, and substituting this in gives

$$\det(\nabla^2 u(x)) = \frac{f(x)}{g(\nabla u(x))}$$

The whole transport problem sits inside a single PDE.


Kantorovich and Linear Programming

Kantorovich built his formulation in 1942 while optimizing production in a plywood factory, and he saw that transport is a special case of linear programming. The dual potentials $\phi$ and $\psi$ are shadow prices: in an efficient economy the internal prices of raw materials have to be consistent with the transport objective. This remains the backbone of supply chain optimization.


Wasserstein Barycenters

Given distributions $\mu_1, \dots, \mu_k$, their Wasserstein barycenter is

$$\mu^* = \arg\min_\mu \sum_{i=1}^k w_i\, W_2^2(\mu, \mu_i)$$

Unlike the Euclidean mean, which averages coordinates, this averages the locations of mass: the Wasserstein mean of two shifted Gaussians is a Gaussian sitting between them, not a bimodal mixture. In practice the convolutional Wasserstein approach replaces the Gibbs kernel with heat diffusion on meshes and grids, which makes the Sinkhorn updates fast on geometric domains.
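In one dimension the $W_2$ barycenter is explicit: average the quantile functions with the weights $w_i$. A minimal sketch on empirical samples; the grid of quantile levels is an illustrative choice.

```python
import jax.numpy as jnp

def barycenter_1d(samples_list, weights, n_grid=200):
    # F_bar^{-1}(t) = sum_i w_i F_i^{-1}(t): returns the barycenter's quantiles.
    t = jnp.linspace(0.005, 0.995, n_grid)
    quantiles = [jnp.quantile(jnp.asarray(s), t) for s in samples_list]
    return sum(w * q for w, q in zip(weights, quantiles))
```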


Regularity (Caffarelli)

Regularity of the optimal map depends on the geometry of the supports and on the densities.

Two conditions matter: the target support has to be convex, or $u$ develops singularities, and the densities have to be bounded away from zero and infinity, which gives Hölder continuity of the map.


Fisher-Rao vs. Wasserstein

There are two natural Riemannian structures sitting on the space of probability measures.

  1. Fisher-Rao measures local density variation. It has constant positive curvature, needs overlapping supports, and its geodesics interpolate by reweighting.
  2. Wasserstein measures physical displacement of mass. It lives in Alexandrov spaces, stays well-defined for disjoint supports, and its geodesics interpolate by moving mass.

Sinkhorn Divergence

The entropic term in Sinkhorn introduces a bias (in particular $\text{OT}_\epsilon(\mu, \mu) \neq 0$), and the Sinkhorn divergence removes it,

$$S_\epsilon(\mu, \nu) = \text{OT}_\epsilon(\mu, \nu) - \frac{1}{2}\text{OT}_\epsilon(\mu, \mu) - \frac{1}{2}\text{OT}_\epsilon(\nu, \nu)$$

It is non-negative, it metrizes weak convergence, and it interpolates between the Wasserstein distance as $\epsilon \to 0$ and a kernel (MMD-style) distance as $\epsilon \to \infty$.
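Given any entropic solver, the divergence is three calls; the sketch below reuses `sinkhorn_log_domain` from earlier and debiases its transport-cost output (a common “sharp” variant; the textbook definition debiases $\text{OT}_\epsilon$ including its entropy term).

```python
def sinkhorn_divergence(X, Y, epsilon=0.01):
    _, c_xy = sinkhorn_log_domain(X, Y, epsilon)
    _, c_xx = sinkhorn_log_domain(X, X, epsilon)   # self-transport bias terms
    _, c_yy = sinkhorn_log_domain(Y, Y, epsilon)
    return c_xy - 0.5 * c_xx - 0.5 * c_yy
```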


Semi-Discrete Transport and Power Diagrams

When you transport a continuous density $\mu$ onto discrete Dirac masses $\{y_j\}$, the optimal map cuts space into power cells,

$$C_j = \{ x : |x - y_j|^2 - \psi_j \le |x - y_k|^2 - \psi_k \quad \forall k \}$$

where the $\psi_j$ are dual weights, and you solve the problem with Newton’s method on the dual.


Differentiating Through Transport

Many applications need gradients of the transport map $T(x) = \nabla u(x)$, which means computing $\nabla^2 u(x)$, and that is exactly the Monge-Ampère equation again.

In the discrete Sinkhorn setting, backpropagating through all iterations means holding the full history in memory, so the efficient move is to use the stationarity condition $u = a / (Kv)$ and differentiate implicitly. This gives the Jacobian of the OT plan with respect to the input locations $X$ and $Y$ without keeping the computational graph around, and it is essential for shape matching, where you tune point positions to reduce the transport cost.
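One lightweight cousin of this idea in JAX: by the envelope theorem the gradient of the entropic cost with respect to the cost matrix is the optimal coupling itself, so you can stop gradients through the loop and differentiate only the final objective. A sketch, not a full implicit-function treatment.

```python
import jax
import jax.numpy as jnp

def ot_cost_envelope(C, a, b, epsilon=0.05, max_iter=200):
    K = jnp.exp(-jax.lax.stop_gradient(C) / epsilon)
    v = jnp.ones_like(b)
    for _ in range(max_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    gamma = jax.lax.stop_gradient(u[:, None] * K * v[None, :])
    return jnp.sum(gamma * C)   # d(cost)/dC = gamma, with no unrolled graph
```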


Connection to Geometric Integration

Wasserstein flows carry a Hamiltonian structure: the flow $\rho(t)$ in Wasserstein space can be read as a Hamiltonian trajectory in which the internal energy plays the role of the Hamiltonian.

The JKO scheme by Jordan and Kinderlehrer and Otto is the variational integrator,

$$\rho_{n+1} = \arg\min_\rho \left( \frac{1}{2h} W_2^2(\rho, \rho_n) + \mathcal{F}(\rho) \right)$$

Each step is a proximal map in Wasserstein space, and this shows that many diffusion processes, the heat equation included, are really gradient descent steps in the Wasserstein metric. I find this one of the most satisfying results in the whole theory.
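A toy sketch of one JKO step on a 1D grid, using an entropic surrogate for $W_2^2$ and a softmax parametrization to stay on the simplex; the grid, step sizes, and inner-loop length are all illustrative choices, not a production scheme.

```python
import jax
import jax.numpy as jnp

def jko_step(rho_n, x_grid, F, h=0.1, epsilon=0.05, steps=200, lr=0.5):
    """One step of rho_{n+1} = argmin (1/2h) W2^2(rho, rho_n) + F(rho)."""
    C = (x_grid[:, None] - x_grid[None, :]) ** 2

    def w2_sq(rho):
        # Entropic surrogate, differentiated by unrolling a short Sinkhorn loop.
        K = jnp.exp(-C / epsilon)
        v = jnp.ones_like(rho_n)
        for _ in range(50):
            u = rho / (K @ v)
            v = rho_n / (K.T @ u)
        return jnp.sum((u[:, None] * K * v[None, :]) * C)

    def objective(logits):
        rho = jax.nn.softmax(logits)
        return w2_sq(rho) / (2.0 * h) + F(rho)

    logits = jnp.log(rho_n)   # assumes rho_n is strictly positive
    for _ in range(steps):
        logits = logits - lr * jax.grad(objective)(logits)
    return jax.nn.softmax(logits)
```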


Multimarginal Transport

We can couple $N$ distributions, not just two,

$$\inf_{\gamma \in \Pi(\mu_1, \dots, \mu_N)} \int \cdots \int C(x_1, \dots, x_N)\, d\gamma(x_1, \dots, x_N)$$

The state space blows up exponentially in $N$, so naive computation is out of reach, but when the cost has special structure, for example when it depends only on the mean or factors along a tree, specialized Sinkhorn-like algorithms apply.


OT and Density Functional Theory

The Hohenberg-Kohn theorem says the ground state of a multi-electron system is a functional of the electron density $\rho$, and finding the minimizing configuration can be framed as multimarginal OT with the repulsive Coulomb cost $c(x, y) = 1/|x - y|$. The electrons settle into a pattern that minimizes repulsion subject to the prescribed overall density, and this link, due to Cotar and Pass, ties optimal transport directly into quantum chemistry.


Quantum Optimal Transport

Quantum OT pushes these ideas over to density matrices: instead of moving mass between points, you shuttle quantum states through the Lindblad equation. The resulting Lindblad-Wasserstein metric gives a geometric framework for decoherence and open quantum systems, and it is still early days.


Concepts used in this post

  • Brenier Map. The unique optimal transport map for quadratic cost; it is the gradient of a convex function $u$, so $T = \nabla u$, and existence requires $\mu$ to be absolutely continuous.
  • c-Transform. The generalized Legendre transform $\phi^c(y) = \inf_x [c(x, y) - \phi(x)]$ for a cost function $c$; the optimal dual potentials satisfy $\psi = \phi^c$.
  • Coupling. A joint distribution $\gamma \in \Pi(\mu, \nu)$ with fixed marginals; the set of couplings forms a convex polytope, and the optimal coupling minimizes the expected transport cost.
  • McCann Interpolation. The geodesic in Wasserstein space, $\rho_t = ((1 - t)\,\text{Id} + t\, T)_\# \mu$ where $T$ is the Brenier map; it is the straight line through measure space.
  • Push-forward. Given a map $T$ and a measure $\mu$, the push-forward $T_\# \mu$ is given by $(T_\# \mu)(A) = \mu(T^{-1}(A))$.
  • Sinkhorn Divergence. The debiased entropic OT cost $S_\epsilon(\mu, \nu) = \text{OT}_\epsilon(\mu, \nu) - \frac{1}{2}\text{OT}_\epsilon(\mu, \mu) - \frac{1}{2}\text{OT}_\epsilon(\nu, \nu)$; it removes the entropic bias and metrizes weak convergence.

References

1. Villani, C. (2009). “Optimal Transport: Old and New”. The reference for everything. Links OT to Ricci curvature and geometry.

2. Cuturi, M. (2013). “Sinkhorn Distances: Lightspeed Computation of Optimal Transport”. The paper that made OT practical for machine learning.

3. Peyre, G., & Cuturi, M. (2019). “Computational Optimal Transport”. A comprehensive computational guide.

4. Benamou, J. D., & Brenier, Y. (2000). “A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem”. The dynamic formulation.

5. McCann, R. J. (1997). “A convexity principle for interacting gases”. Introduced displacement convexity.

6. Solomon, J., et al. (2015). “Convolutional Wasserstein Distances: Efficient Optimal Transportation on Geometric Domains”. Fast algorithms for barycenters on geometric domains.

7. Arjovsky, M., et al. (2017). “Wasserstein GAN”. Applied OT stability to generative adversarial networks.

8. Caffarelli, L. A. (1992). “The regularity of mappings with a convex potential”. Regularity theory for the Monge-Ampere equation.