Optimal Transport & Geometry
How do we compare two probability distributions? Standard metrics like KL Divergence or Total Variation fail to capture the underlying geometry of the space. KL divergence is essentially “vertical”—it compares densities at specific points but ignores how far mass must travel to reach a target. If two Dirac masses are separated by a tiny epsilon, their KL divergence is infinite, and their TV distance is 1. We want a horizontal metric that respects the ground distance $d(x, y)$. This is the motivation for Optimal Transport (OT). By viewing probability measures as “piles of dirt,” we define the distance as the minimum work required to transport one pile to another. This leads to the Wasserstein Metric, which endows the space of measures with a rich Riemannian-like structure. This post provides a rigorous journey from Monge’s 18th-century “Earth-Moving” problem to the modern “Sinkhorn” revolution and the Monge-Ampère PDE.
1. The Monge-Kantorovich Formulations
1.1 Monge’s Problem: The Quest for the Optimal Map
In 1781, Gaspard Monge asked: Given two distributions $\mu$ and $\nu$ on $\mathbb{R}^d$, what is the map $T: \mathbb{R}^d \to \mathbb{R}^d$ that minimizes the total cost of transport?

$$\inf_{T_{\#}\mu = \nu} \int_{\mathbb{R}^d} c(x, T(x)) \, d\mu(x)$$

where $T_{\#}\mu = \nu$ is the Push-forward Constraint: $\nu(B) = \mu(T^{-1}(B))$ for all Borel sets $B$. Geometric Motivation: Monge’s original application was the movement of soil (déblais) to build fortifications (remblais). He realized that the transport lines should not intersect—a precursor to the modern understanding that optimal maps are gradients of convex potentials. The Monge-Ampère Connection: Under a quadratic cost $c(x, y) = \tfrac{1}{2}\|x - y\|^2$, if the map $T$ is smooth, it satisfies the Change of Variables Formula:

$$\rho_\nu(T(x)) \, \lvert \det DT(x) \rvert = \rho_\mu(x)$$

where $\rho_\mu$ and $\rho_\nu$ are densities of $\mu$ and $\nu$. Brenier’s theorem (Section 3) tells us $T = \nabla \phi$ for a convex potential $\phi$. Substituting this into the density equation yields the Monge-Ampère Equation:

$$\det\bigl(D^2 \phi(x)\bigr) = \frac{\rho_\mu(x)}{\rho_\nu(\nabla \phi(x))}$$

This is a highly non-linear, second-order elliptic PDE. Solving for the optimal transport map is equivalent to finding the solution to this PDE under the boundary condition $\nabla \phi(\operatorname{supp} \rho_\mu) \subseteq \operatorname{supp} \rho_\nu$.
1.2 Kantorovich’s Relaxation: From Maps to Couplings
Monge’s problem is mathematically “brittle” because an admissible map might not exist. If $\mu$ is a Dirac mass at the origin and $\nu$ is a uniform distribution on a sphere, no single map can move the mass to multiple destinations. Leonid Kantorovich relaxed the map $T$ to a Coupling (or transport plan) $\gamma$, which is a joint distribution on the product space $\mathbb{R}^d \times \mathbb{R}^d$.

$$\min_{\gamma \in \Pi(\mu, \nu)} \int c(x, y) \, d\gamma(x, y)$$

where $\Pi(\mu, \nu) = \{\gamma : \pi_{1\#}\gamma = \mu, \ \pi_{2\#}\gamma = \nu\}$. This relaxation transforms a non-linear problem into a Linear Program in the space of measures. The set of couplings is convex and weakly compact, ensuring the existence of a minimizer.
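To make the linear-programming structure concrete, here is a minimal sketch of the discrete Kantorovich problem solved with SciPy's general-purpose LP solver. The function name `kantorovich_lp`, the dense constraint matrix, and the HiGHS backend are my own illustrative choices, not part of the post; real OT solvers exploit the network structure instead of building this matrix.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_lp(a, b, C):
    """Discrete Kantorovich LP: min <C, gamma> s.t. gamma @ 1 = a, gamma.T @ 1 = b, gamma >= 0."""
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0      # row-sum constraint: sum_j gamma_ij = a_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0               # column-sum constraint: sum_i gamma_ij = b_j
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.x.reshape(n, m), res.fun

# Tiny example: uniform histograms on 3 and 4 points with squared-distance cost.
x, y = np.linspace(0, 1, 3), np.linspace(0, 1, 4)
C = (x[:, None] - y[None, :]) ** 2
gamma, cost = kantorovich_lp(np.full(3, 1 / 3), np.full(4, 1 / 4), C)
```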
2. Duality and High-Level “Prices”
2.1 The Kantorovich Duality
The primal problem (minimizing over couplings) has a dual problem (maximizing over potential functions). Theorem (Kantorovich Duality):

$$\min_{\gamma \in \Pi(\mu, \nu)} \int c \, d\gamma \;=\; \sup_{\varphi(x) + \psi(y) \le c(x, y)} \int \varphi \, d\mu + \int \psi \, d\nu$$
Proof Sketch via Fenchel-Rockafellar: We can view the primal problem as:

$$\min_{\gamma \ge 0} \ \langle c, \gamma \rangle + \iota_{\Pi(\mu, \nu)}(\gamma)$$

where $\langle c, \gamma \rangle = \int c \, d\gamma$ and $\iota_{\Pi(\mu, \nu)}$ is the indicator of the marginal constraints. By Fenchel-Rockafellar Duality, this constrained minimum equals a supremum over pairs of continuous potentials:

$$\sup_{\varphi \oplus \psi \le c} \int \varphi \, d\mu + \int \psi \, d\nu$$

The dual variables $\varphi$ and $\psi$ emerge as the Lagrange multipliers for the marginal constraints. The condition $\varphi(x) + \psi(y) \le c(x, y)$ is the requirement that no “arbitrage” is possible in the transport market.
2.2 The Kantorovich-Rubinstein Theorem
If the cost $c(x, y) = d(x, y)$ is a distance metric, the dual potentials satisfy $\psi = -\varphi$, which implies that the function $\varphi$ is 1-Lipschitz.

$$W_1(\mu, \nu) = \sup_{\mathrm{Lip}(\varphi) \le 1} \int \varphi \, d\mu - \int \varphi \, d\nu$$

This version of the Wasserstein distance, the $W_1$ metric, is the basis for Wasserstein GANs. It provides a robust gradient for training generative models even when the generated and real distributions have non-overlapping supports.
3. Brenier’s Theorem: The Polar Factorization
Under quadratic cost $c(x, y) = \tfrac{1}{2}\|x - y\|^2$, the geometry of the optimal map is exceptionally rigid. Theorem (Brenier, 1991): If $\mu$ has a density (doesn’t charge small sets), there exists a unique (up to additive constant) convex function $\phi$ such that the optimal transport map is given by the gradient:

$$T(x) = \nabla \phi(x)$$
Interpretation: Moving along a Gradient: The optimal map never “swirls.” Since $T$ is the gradient of a convex function, its Jacobian $DT = D^2 \phi$ is symmetric and positive semi-definite. Physically, this means that the particles follow the path of least resistance, and the overall flow is “irrotational” in the sense of fluid dynamics.
The Polar Factorization of Maps: Brenier’s theorem is often viewed as the measure-theoretic analog of the QR or Polar decomposition of matrices. Just as any matrix admits a polar decomposition $A = PU$ (a rotation $U$ followed by a positive semi-definite stretch $P$), any vector field $X$ that pushes $\mu$ to $\nu$ can be decomposed as:

$$X = (\nabla \phi) \circ s$$

where $\phi$ is convex and $s$ is a $\mu$-measure-preserving map. The optimal transport map is the “purest” part of this decomposition—the part that does all the work of rearranging mass without any unnecessary shuffling.
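A standard closed-form instance, included here as a sanity check of the theorem rather than something derived in this post: between two Gaussians $\mu = \mathcal{N}(m_\mu, \Sigma_\mu)$ and $\nu = \mathcal{N}(m_\nu, \Sigma_\nu)$, the Brenier map is affine,

$$T(x) = m_\nu + A \, (x - m_\mu), \qquad A = \Sigma_\mu^{-1/2} \bigl( \Sigma_\mu^{1/2} \Sigma_\nu \Sigma_\mu^{1/2} \bigr)^{1/2} \Sigma_\mu^{-1/2},$$

and since $A$ is symmetric positive semi-definite, $T = \nabla \phi$ for the convex potential $\phi(x) = \langle m_\nu, x \rangle + \tfrac{1}{2}(x - m_\mu)^\top A \, (x - m_\mu)$, exactly as the theorem predicts.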
4. Entropic Regularization: The Sinkhorn Speedup
Solving the original Linear Programming problem for $n$ data points (using the Simplex or Interior Point methods) scales roughly as $O(n^3 \log n)$. This is computationally prohibitive for modern machine learning, where $n$ can be very large. To address this, we add an Entropic Regularization term $-\varepsilon H(\gamma)$:

$$W_\varepsilon(\mu, \nu) = \min_{\gamma \in \Pi(\mu, \nu)} \ \langle C, \gamma \rangle + \varepsilon \sum_{i, j} \gamma_{ij} \bigl( \log \gamma_{ij} - 1 \bigr)$$
The Gibbs Kernel Mapping: The first-order conditions for the Lagrangian of this regularized problem imply that the optimal coupling must take the form of a Gibbs Distribution:

$$\gamma^*_{ij} = \exp\!\left( \frac{f_i + g_j - C_{ij}}{\varepsilon} \right)$$

Setting $u = e^{f / \varepsilon}$ and $v = e^{g / \varepsilon}$, we get the factorized form:

$$\gamma^* = \operatorname{diag}(u) \, K \, \operatorname{diag}(v)$$

where the term $K_{ij} = e^{-C_{ij} / \varepsilon}$ is the Gibbs Kernel. As $\varepsilon \to 0$, we recover the exact OT. As $\varepsilon \to \infty$, the coupling becomes the independent product $\mu \otimes \nu$. The marginal constraints $\gamma \mathbf{1} = \mu$ and $\gamma^\top \mathbf{1} = \nu$ then become a system of non-linear integral equations. In the discrete case, these are:

$$u \odot (K v) = \mu, \qquad v \odot (K^\top u) = \nu$$

We solve this by alternating between the two updates:

$$u^{(k+1)} = \frac{\mu}{K v^{(k)}}, \qquad v^{(k+1)} = \frac{\nu}{K^\top u^{(k+1)}}$$

This is Sinkhorn’s Algorithm. Because it only involves matrix-vector multiplications, it is extremely parallelizable (GPU-friendly). However, for small $\varepsilon$, the entries of the kernel $K$ underflow to zero. We must perform the updates in the Log-Domain (using LogSumExp) to maintain numerical stability.
Implementation (Log-Stabilized Sinkhorn)
```python
import jax
import jax.numpy as jnp

def log_sum_exp(x, axis=None):
    # Numerically stable log-sum-exp: subtract the max before exponentiating.
    max_x = jnp.max(x, axis=axis, keepdims=True)
    return max_x + jnp.log(jnp.sum(jnp.exp(x - max_x), axis=axis, keepdims=True))

def sinkhorn_log_domain(X, Y, epsilon=0.01, max_iter=100):
    n, m = X.shape[0], Y.shape[0]
    mu_log = -jnp.log(n) * jnp.ones(n)  # log of uniform weights 1/n
    nu_log = -jnp.log(m) * jnp.ones(m)
    diff = X[:, None, :] - Y[None, :, :]
    C = jnp.sum(diff**2, axis=-1)       # squared Euclidean cost matrix, shape (n, m)
    f = jnp.zeros(n)  # dual potential 1
    g = jnp.zeros(m)  # dual potential 2
    for _ in range(max_iter):
        # f update: f = epsilon * (log_mu - LSE((g - C) / epsilon))
        M1 = (g[None, :] - C) / epsilon
        f = epsilon * (mu_log - log_sum_exp(M1, axis=1).flatten())
        # g update: g = epsilon * (log_nu - LSE((f - C) / epsilon))
        M2 = (f[:, None] - C) / epsilon
        g = epsilon * (nu_log - log_sum_exp(M2, axis=0).flatten())
    # Recover the coupling from the potentials: gamma_ij = exp((f_i + g_j - C_ij) / epsilon)
    log_gamma = (f[:, None] + g[None, :] - C) / epsilon
    gamma = jnp.exp(log_gamma)
    return gamma, jnp.sum(gamma * C)
```

5. The Dynamic Formulation: Benamou-Brenier
Can we view the transport not just as a static map, but as a continuous Fluid Flow? In 2000, Benamou and Brenier proved that the Wasserstein distance is equal to the minimum kinetic energy of a density moving from $\mu$ to $\nu$.

$$W_2^2(\mu, \nu) = \min_{(\rho_t, v_t)} \int_0^1 \!\! \int_{\mathbb{R}^d} \|v_t(x)\|^2 \, \rho_t(x) \, dx \, dt$$

subject to the Continuity Equation:

$$\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0, \qquad \rho_0 = \mu, \quad \rho_1 = \nu$$

Physical Interpretation: Mass Conservation and Least Action: Imagine the probability mass as a fluid. The continuity equation ensures mass conservation—no “dirt” disappears or is created during transport. This formulation reveals that the Wasserstein distance is the Action of the minimal-energy fluid flow. Along the optimal path (the Wasserstein Geodesic), particles move in straight lines with constant velocity. This is the Lagrangian Perspective. From the Eulerian Perspective, the density evolves according to a pressure-less gas equation. This link allows us to solve OT using sophisticated tools from computational fluid dynamics (CFD) and Mean Field Games. These geodesics provide a natural way to interpolate between distributions, ensuring that mass is moved efficiently without the “fading” artifacts seen in linear interpolation.
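A small illustration of the Lagrangian picture, as a hedged sketch of my own rather than anything from the original post: for equal-size 1D samples with uniform weights, the optimal map pairs sorted points (see Section 8), and each particle travels in a straight line $x_t = (1 - t)\,x + t\,T(x)$ along the geodesic.

```python
import jax.numpy as jnp

def displacement_interpolation(x, y, t):
    # Equal-size 1D samples, uniform weights: the Brenier map pairs sorted points.
    xs, ys = jnp.sort(x), jnp.sort(y)
    # McCann interpolation: particles move at constant velocity along straight lines.
    return (1.0 - t) * xs + t * ys

# Halfway point of the Wasserstein geodesic between two hypothetical point clouds.
x = jnp.array([0.0, 0.1, 0.2, 0.9])
y = jnp.array([1.0, 1.1, 1.3, 2.0])
mid = displacement_interpolation(x, y, 0.5)
```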
6. Applications: WGANs and Domain Adaptation
6.1 Wasserstein GANs (WGAN)
Standard GANs minimize the Jensen-Shannon Divergence, which is non-smooth when the distributions have disjoint supports (causing “mode collapse” or vanishing gradients). WGANs use the $W_1$ distance (Section 2.2). The “Critic” network $D$ is trained to maximize the Kantorovich-Rubinstein objective:

$$\max_{\mathrm{Lip}(D) \le 1} \ \mathbb{E}_{x \sim \mu_{\text{real}}}[D(x)] - \mathbb{E}_{x \sim \mu_{\text{gen}}}[D(x)]$$

To enforce the 1-Lipschitz constraint, we use a Gradient Penalty: adding a loss term $\lambda \, \mathbb{E}_{\hat{x}} \bigl[ (\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2 \bigr]$, where $\hat{x}$ is sampled along lines between real and generated points. This ensures the critic provides useful gradients everywhere, even when the distributions don’t overlap.
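A minimal sketch of the critic objective with gradient penalty, assuming a generic `critic_apply(params, x)` function that maps a batch of samples to scalar scores; the function name and the penalty weight are placeholders of mine, not the WGAN-GP authors' code.

```python
import jax
import jax.numpy as jnp

def critic_loss(params, critic_apply, real, fake, key, lam=10.0):
    # Negated Kantorovich-Rubinstein objective (we minimize): E[D(fake)] - E[D(real)].
    w_term = jnp.mean(critic_apply(params, fake)) - jnp.mean(critic_apply(params, real))
    # Gradient penalty evaluated on random interpolates between real and fake samples.
    eps = jax.random.uniform(key, (real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake
    score = lambda x: critic_apply(params, x[None, :])[0]   # scalar score for one sample
    grads = jax.vmap(jax.grad(score))(x_hat)                # per-sample input gradients
    gp = jnp.mean((jnp.linalg.norm(grads, axis=-1) - 1.0) ** 2)
    return w_term + lam * gp
```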
6.2 Barycenters & Domain Adaptation
In Domain Adaptation, we want to align the source distribution $\mu_s$ with the target $\mu_t$. OT provides a map $T$ that “morphs” the features while preserving their relative geometry. Wasserstein Barycenters generalize the concept of a mean to distributions:

$$\bar{\mu} = \arg\min_{\mu} \sum_{k=1}^{K} \lambda_k \, W_2^2(\mu, \mu_k)$$
The barycenter of several point clouds is a representative distribution that minimizes the total transport cost to all of them, which is used for data fusion and noise reduction in medical imaging and astronomy.
7. Gromov-Wasserstein: Comparing Incomparable Spaces
What if $\mu$ and $\nu$ live in different metric spaces? For example, mapping English word embeddings to French word embeddings whose axes don’t align, or comparing a 3D mesh to a 2D image. We cannot compute a cross-domain cost $c(x, y)$. Instead, we compare intra-domain distances: $d_X(x, x')$ and $d_Y(y, y')$. Gromov-Wasserstein (GW) minimizes the discrepancy between these distances:

$$GW(\mu, \nu) = \min_{\gamma \in \Pi(\mu, \nu)} \sum_{i, j, k, l} \bigl| d_X(x_i, x_k) - d_Y(y_j, y_l) \bigr|^2 \, \gamma_{ij} \, \gamma_{kl}$$
This is a Quadratic Assignment Problem, which is generally NP-Hard. However, with entropic regularization, we can solve it using a series of Sinkhorn iterations. GW is the “Shape Matching” engine of modern geometry processing, allowing computers to “see” the similarity between two objects even if they are in different coordinate systems.
8. One-Dimensional Optimal Transport: The Sorting Solution
In one dimension ($d = 1$), Optimal Transport is remarkably simple. Sorting the data is optimal. If $\mu$ and $\nu$ are measures on $\mathbb{R}$, and $F$ and $G$ are their Cumulative Distribution Functions (CDFs), the optimal map is:

$$T(x) = G^{-1}(F(x))$$
Algorithm:
- Compute the CDF of the source.
- Compute the Inverse CDF (Quantile function) of the target.
- The composition $T = G^{-1} \circ F$ is the optimal transport map.

The $W_p$ distance in 1D is simply the $L^p$ distance between the quantile functions:

$$W_p^p(\mu, \nu) = \int_0^1 \bigl| F^{-1}(t) - G^{-1}(t) \bigr|^p \, dt$$
This solution, whose cost is dominated by an $O(n \log n)$ sort, is used in Sliced Wasserstein distances, where we project high-dimensional distributions onto random 1D lines, solve the transport in closed form, and average the results.
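A hedged sketch of both ideas in JAX, assuming uniform weights and equal sample sizes; the helper names and the simple averaging of the per-projection distances are my own choices.

```python
import jax
import jax.numpy as jnp

def wasserstein_1d(x, y, p=2):
    # Sorting both samples realizes the monotone (optimal) coupling in 1D.
    xs, ys = jnp.sort(x), jnp.sort(y)
    return jnp.mean(jnp.abs(xs - ys) ** p) ** (1.0 / p)

def sliced_wasserstein(X, Y, key, n_projections=64, p=2):
    # Project onto random unit directions, solve the 1D problem by sorting, average.
    d = X.shape[1]
    theta = jax.random.normal(key, (n_projections, d))
    theta = theta / jnp.linalg.norm(theta, axis=1, keepdims=True)
    proj_x, proj_y = X @ theta.T, Y @ theta.T        # shape (n, n_projections)
    dists = jax.vmap(wasserstein_1d, in_axes=(1, 1))(proj_x, proj_y)
    return jnp.mean(dists)
```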
9. McCann’s Displacement Convexity
Is the Wasserstein space “flat” or “curved”? Robert McCann (1997) introduced the concept of Displacement Convexity. A functional on the space of measures is displacement convex if it is convex along the Wasserstein geodesics (the Benamou-Brenier paths). Examples of displacement convex functionals:
- Potential Energy: $\mathcal{V}(\rho) = \int V(x) \, \rho(x) \, dx$ if the potential $V$ is convex.
- Internal Energy: $\mathcal{U}(\rho) = \int U(\rho(x)) \, dx$ for suitable convex functions $U$.
- Negative Entropy: $\mathcal{H}(\rho) = \int \rho(x) \log \rho(x) \, dx$.

This property is the foundation of the Otto Calculus, which views the Heat Equation as the gradient flow of the entropy functional in the Wasserstein metric, as sketched below. It explains why the Heat Equation is “optimal”—it is the fastest way to move probability mass to maximize entropy.
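A one-line verification of the Otto picture (a standard computation, included here for completeness): the Wasserstein gradient flow of a functional $\mathcal{F}$ is $\partial_t \rho = \nabla \cdot \bigl( \rho \, \nabla \tfrac{\delta \mathcal{F}}{\delta \rho} \bigr)$, and for the entropy functional $\mathcal{H}(\rho) = \int \rho \log \rho \, dx$,

$$\frac{\delta \mathcal{H}}{\delta \rho} = \log \rho + 1, \qquad \nabla \cdot \bigl( \rho \, \nabla (\log \rho + 1) \bigr) = \nabla \cdot (\nabla \rho) = \Delta \rho,$$

so the gradient flow of entropy in the Wasserstein metric is exactly the Heat Equation $\partial_t \rho = \Delta \rho$.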
10. Discrete Optimal Transport on Graphs
When we move from continuous space to a Graph $G = (V, E)$, the transport problem becomes a Network Flow problem. If we want to move mass from nodes with supply $\mu_i$ to nodes with demand $\nu_j$, the cost is determined by the shortest-path distances $d_G(i, j)$ on the graph:

$$\min_{\gamma \in \Pi(\mu, \nu)} \sum_{i, j} d_G(i, j) \, \gamma_{ij}$$
This is equivalent to the Minimum Cost Flow problem, which can be solved efficiently using the Dual Ascent or Successive Shortest Path algorithms. On graphs, the $W_1$ distance is particularly interesting because it relates to the Cut-Norm and the Graph Laplacian. The optimal flow lines often reveal the bottleneck structure of the network, making OT a powerful tool for community detection and robustness analysis in infrastructure networks.
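A small sketch using NetworkX's min-cost-flow solver; the toy graph, supplies, and costs below are hypothetical, and I am assuming the standard `networkx.min_cost_flow` API, which encodes supplies as negative `demand` node attributes and per-unit costs as edge `weight`s.

```python
import networkx as nx

G = nx.DiGraph()
# Supplies are negative demands; demands are positive. Total supply equals total demand.
G.add_node("a", demand=-4)   # source holding 4 units of mass
G.add_node("b", demand=-2)
G.add_node("c", demand=3)    # sink requiring 3 units
G.add_node("d", demand=3)
# Edges carry a per-unit transport cost ("weight") and an optional capacity.
G.add_edge("a", "c", weight=1, capacity=4)
G.add_edge("a", "d", weight=3, capacity=4)
G.add_edge("b", "c", weight=2, capacity=2)
G.add_edge("b", "d", weight=1, capacity=2)

flow = nx.min_cost_flow(G)        # dict of dicts: flow[u][v] = units moved along (u, v)
cost = nx.cost_of_flow(G, flow)   # total transport cost of the optimal flow
```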
11. Wasserstein Splines and Temporal Interpolation
In many applications, we don’t just want the distance between two measures; we want to interpolate a series of measures $\mu_{t_1}, \dots, \mu_{t_k}$. Wasserstein Splines generalize the idea of cubic splines to the manifold of probability measures. Instead of linear interpolation (which just fades densities), Wasserstein splines use the displacement convexity properties to ensure that the intermediate distributions “morph” naturally. This is used in Computer Animation (morphing between character poses) and Single-Cell Biology (reconstructing the trajectory of a cell’s gene expression across different time points). By minimizing the “acceleration” in the Wasserstein metric, we find the most physically plausible path through the space of possible cell states.
12. Unbalanced Optimal Transport
What if the total mass of $\mu$ and $\nu$ is not equal? For example, comparing a population density from 1950 to 2020 where the total population has changed. Standard OT requires $\mu(\mathbb{R}^d) = \nu(\mathbb{R}^d)$, which forces mass conservation. Unbalanced Optimal Transport (UOT) relaxes this by allowing mass to be created or destroyed, at a cost. We replace the marginal constraints with a penalty term (often the KL divergence):

$$\min_{\gamma \ge 0} \int c \, d\gamma + \lambda_1 \, \mathrm{KL}(\pi_{1\#}\gamma \,\|\, \mu) + \lambda_2 \, \mathrm{KL}(\pi_{2\#}\gamma \,\|\, \nu)$$
This leads to the Gaussian Hellinger-Kantorovich distance. UOT is used in Cell Tracking, where cells might divide or die, and in Image Processing, where objects might enter or leave the frame. By parameterizing the mass creation cost, we can tune the model to be more or less sensitive to global mass changes.
13. Conclusion: The Geometry of Information
Optimal Transport has moved from a niche corner of resource allocation to the bedrock of modern data science. It provides the horizontal link that traditional information theory lacked. By respecting the underlying metric of the state space, it allows us to build models that are physically consistent, geometrically intuitive, and computationally robust. Whether we are training GANs, aligning biological datasets, or proving the stability of galaxies, the language of OT provides the most natural coordinate system for the evolution of probability measures.
Historical Timeline
| Year | Event | Significance |
|---|---|---|
| 1781 | Gaspard Monge | Defines the “Earth Mover’s” problem. |
| 1942 | Leonid Kantorovich | Relaxes the problem to couplings (Linear Programming). |
| 1975 | Kantorovich | Nobel Prize in Economics. |
| 1987 | Yann Brenier | Proves the Polar Factorization (Brenier’s Theorem). |
| 1997 | Robert McCann | Introduces Displacement Convexity. |
| 2000 | Benamou & Brenier | Dynamic formulation (Fluid mechanics link). |
| 2009 | Cédric Villani | ”Optimal Transport: Old and New”. |
| 2013 | Marco Cuturi | Sinkhorn Algorithm (Entropic Regularization). |
| 2017 | Arjovsky et al. | Wasserstein GANs. |
Appendix A: Deriving Monge-Ampère from Optimal Transport
Consider the optimal map $T$ pushing $\mu$ to $\nu$. By the change of variables formula, for any test function $\varphi$:

$$\int \varphi(y) \, d\nu(y) = \int \varphi(T(x)) \, d\mu(x)$$

Assuming densities $\rho_\mu$ and $\rho_\nu$:

$$\rho_\nu(T(x)) \, \lvert \det DT(x) \rvert = \rho_\mu(x)$$

Under quadratic cost, Brenier’s theorem gives $T = \nabla \phi$, where $\phi$ is convex. Then $DT = D^2 \phi$ (the Hessian). The equation becomes the Monge-Ampère Equation:

$$\det\bigl(D^2 \phi(x)\bigr) = \frac{\rho_\mu(x)}{\rho_\nu(\nabla \phi(x))}$$

This PDE encapsulates the entirety of the optimal transport problem in a single differential constraint.
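As a quick sanity check, consistent with the one-dimensional picture of Section 8 (a standard computation, not from the original post): in 1D the push-forward condition $G(T(x)) = F(x)$ can be differentiated directly,

$$\rho_\nu(T(x)) \, T'(x) = \rho_\mu(x), \qquad T = \phi', \quad \text{so} \quad \phi''(x) = \frac{\rho_\mu(x)}{\rho_\nu(\phi'(x))},$$

which is precisely the one-dimensional Monge-Ampère equation.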
Appendix B: Kantorovich and the Origins of Linear Programming
Leonid Kantorovich developed the formulation of Optimal Transport in 1942 while optimizing production in a plywood factory. He realized that the transport problem was a special case of Linear Programming. His insight was that the “Dual Potentials” $\varphi$ and $\psi$ represent the internal prices (or shadow prices) of the system. In a perfectly efficient economy, the internal prices of raw materials must be consistent with the final transport objective—a principle that remains the foundation of modern supply chain optimization.
Appendix C: Wasserstein Barycenters and Discrete Geometry
Given a set of distributions $\{\mu_k\}_{k=1}^{K}$ with weights $\lambda_k$, their Wasserstein Barycenter is:

$$\bar{\mu} = \arg\min_{\mu} \sum_{k=1}^{K} \lambda_k \, W_2^2(\mu, \mu_k)$$
Unlike the Euclidean mean (which averages coordinates), the Wasserstein mean averages the “locations” of the mass. For example, the Wasserstein mean of two shifted Gaussians is a Gaussian in the middle. Computationally, we use Convolutional Wasserstein Barycenters, which use heat kernels on meshes or grids to speed up the Sinkhorn updates across multiple domains simultaneously.
Appendix D: Regularity Theory (Caffarelli)
When is the optimal map smooth?
- Convexity of Support: The target support must be convex. If not, the potential $\phi$ develops singularities and the map $T = \nabla \phi$ can be discontinuous.
- Hölder Continuity: If the densities are bounded away from zero and infinity and the target support is convex, the map is Hölder continuous.
Appendix E: Fisher-Rao vs. Wasserstein (Information Geometry)
There are two primary ways to make the space of measures a manifold:
- Fisher-Rao Metric: local change in density (Information Theory). “Vertical”.
- Wasserstein Metric: physical movement of mass (Optimal Transport). “Horizontal”.
Comparison:
- Geometry: Fisher-Rao leads to constant positive curvature. Wasserstein leads to Alexandrov spaces.
- Support: Fisher-Rao requires overlapping supports. Wasserstein does not.
- Geodesics: Fisher geodesics “fade”. Wasserstein geodesics “translate”.
Appendix F: Stability of Sinkhorn Potentials
A major challenge in using Sinkhorn is the choice of $\varepsilon$. The Sinkhorn Divergence corrects the entropic bias:

$$S_\varepsilon(\mu, \nu) = W_\varepsilon(\mu, \nu) - \tfrac{1}{2} W_\varepsilon(\mu, \mu) - \tfrac{1}{2} W_\varepsilon(\nu, \nu)$$

This divergence is non-negative, vanishes only when $\mu = \nu$, and interpolates between the Wasserstein distance ($\varepsilon \to 0$) and the Kernel Mean Embedding / MMD distance ($\varepsilon \to \infty$).
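A hedged sketch reusing the `sinkhorn_log_domain` routine from the Sinkhorn section; note that it debiases the plain transport cost $\langle \gamma, C \rangle$ returned by that routine, which is a simplification of the full entropic objective used in the formal definition above.

```python
def sinkhorn_divergence(X, Y, epsilon=0.01, max_iter=100):
    # Debias the entropic cost by subtracting the two self-transport terms.
    _, c_xy = sinkhorn_log_domain(X, Y, epsilon, max_iter)
    _, c_xx = sinkhorn_log_domain(X, X, epsilon, max_iter)
    _, c_yy = sinkhorn_log_domain(Y, Y, epsilon, max_iter)
    return c_xy - 0.5 * (c_xx + c_yy)
```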
Appendix G: Semi-Discrete Optimal Transport and Power Diagrams
In semi-discrete OT, we move a continuous density $\rho$ to a discrete measure $\nu = \sum_i b_i \, \delta_{y_i}$. The optimal map partitions the space into Power Cells:

$$\mathrm{Pow}_i = \bigl\{ x : \|x - y_i\|^2 - w_i \le \|x - y_j\|^2 - w_j \ \ \forall j \bigr\}$$

where $w_i$ are dual weights. This is solved using Newton’s method on the dual objective.
Appendix H: Numerical Jacobians and Second-Order Optimization
In many OT applications, we need to differentiate through the transport map $T$. This requires computing the Hessian $D^2 \phi$. In the continuous case, this is exactly what the Monge-Ampère equation describes. In the discrete/Sinkhorn case, the Jacobian of the map can be computed via Implicit Differentiation or Automatic Differentiation (AD). However, standard AD can be memory-intensive for many Sinkhorn iterations. Instead, we use the fact that the Sinkhorn fixed point satisfies the stationarity conditions $u \odot (K v) = \mu$ and $v \odot (K^\top u) = \nu$. By differentiating this stationarity condition, we can compute the Jacobian of the OT plan with respect to the input locations without storing the entire iteration history. This is critical for applications like Optimal Transport for Shape Matching, where we optimize the positions of points to minimize their transport cost to a target mesh.
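For contrast, here is the memory-hungry baseline this appendix warns about: plain reverse-mode AD through the unrolled Sinkhorn loop, reusing the `sinkhorn_log_domain` function from earlier with hypothetical point clouds `X` and `Y`. Implicit differentiation of the fixed-point conditions above avoids the stored iteration history that this approach requires.

```python
import jax

# Gradient of the entropic transport cost with respect to the source point locations.
# jax.grad differentiates through every Sinkhorn iteration, so memory grows with max_iter.
cost_fn = lambda X: sinkhorn_log_domain(X, Y, epsilon=0.05, max_iter=50)[1]
grad_X = jax.grad(cost_fn)(X)
```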
Appendix I: Connection to Geometric Integration
As discussed in the previous post on Geometric Integration, Hamiltonian systems preserve the symplectic form. There is a deep connection between the Wasserstein Metric and Hamiltonian Flows. The flow in Wasserstein space can be seen as a Hamiltonian trajectory in the space of measures, where the “Internal Energy” acts as the Hamiltonian. The Jordan-Kinderlehrer-Otto (JKO) Scheme is a variational integrator for these flows:

$$\rho^{k+1} = \arg\min_{\rho} \ \mathcal{F}(\rho) + \frac{1}{2\tau} W_2^2(\rho, \rho^{k})$$

This scheme is a discrete-time version of the gradient flow, preserving the displacement convexity of the energy functional $\mathcal{F}$. It shows that many diffusion processes (like the Heat Equation) are actually descent steps in the Wasserstein metric.
Appendix J: Multimarginal Optimal Transport
Instead of two distributions $\mu$ and $\nu$, what if we want to find a coupling between $N$ distributions $\mu_1, \dots, \mu_N$? This is Multimarginal Optimal Transport (MOT):

$$\min_{\gamma \in \Pi(\mu_1, \dots, \mu_N)} \int c(x_1, \dots, x_N) \, d\gamma(x_1, \dots, x_N)$$

The state space grows exponentially with $N$, making naive computations impossible. However, if the cost has a specific structure (e.g., depends only on the mean or follows a tree), we can use specialized Sinkhorn-like algorithms. MOT is used in Multi-objective Optimization and Multi-agent Systems.
Appendix K: OT and Density Functional Theory (DFT)
In quantum chemistry, the Hohenberg-Kohn theorem states that the ground state of a multi-electron system is a functional of the electron density $\rho(x)$. Finding the electron configuration that minimizes energy can be framed as a Multimarginal OT problem with the repulsive Coulomb cost $c(x_1, \dots, x_N) = \sum_{i < j} \frac{1}{\lvert x_i - x_j \rvert}$. The electrons “transport” themselves to positions that minimize Coulomb repulsion while matching the overall density. This connection, developed by Cotar and Pass, provides a rigorous bridge between Optimal Transport and the fundamental equations of the physical world.
Appendix L: Connections to Quantum Optimal Transport
Quantum Optimal Transport (QOT) extends these ideas to density matrices $\rho$. Instead of transport between points, we consider the cost of transforming one quantum state into another. This involves the Lindblad Equation and the Lindblad-Wasserstein metric, providing a geometric framework for understanding decoherence and the evolution of open quantum systems.
Appendix M: Glossary of Terms
- Brenier Map: The gradient of a convex function that is optimal for quadratic cost.
- c-Transform: The generalized Legendre transform for a cost function $c(x, y)$.
- Coupling: A joint distribution with fixed marginals.
- McCann Interpolation: The straight line in Wasserstein space.
- Push-forward: The measure induced by a map.
- Sinkhorn Divergence: A debiased version of the entropic OT cost.
References
1. Villani, C. (2009). “Optimal Transport: Old and New”. The Bible of the field. Links OT to Ricci curvature and geometry.
2. Cuturi, M. (2013). “Sinkhorn Distances: Lightspeed Computation of Optimal Transport”. The paper that brought OT into modern machine learning via entropic regularization.
3. Peyré, G., & Cuturi, M. (2019). “Computational Optimal Transport”. A comprehensive computational guide.
4. Benamou, J. D., & Brenier, Y. (2000). “A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem”. Established the dynamic, fluid-mechanics view of OT.
5. McCann, R. J. (1997). “Convexity principle for interacting gases”. Introduced displacement convexity.
6. Solomon, J., et al. (2015). “Convolutional Wasserstein Distances: Efficient Optimal Transportation on Geometric Domains”. Fast algorithms for barycenters on geometric domains.
7. Arjovsky, M., et al. (2017). “Wasserstein GAN”. Applied OT stability to generative adversarial networks.
8. Caffarelli, L. A. (1992). “The regularity of mappings with a convex potential”. Fundamental regularity theory for the Monge-Ampère equation.