Optimal Transport & Geometry

How do we compare two probability distributions? Standard metrics like KL Divergence or Total Variation fail to capture the underlying geometry of the space. KL divergence is essentially “vertical”: it compares densities at specific points but ignores how far mass must travel to reach a target. If two Dirac masses are separated by a tiny epsilon, their KL divergence is infinite, and their TV distance is 1. We want a horizontal metric that respects the ground distance $d(x, y)$. This is the motivation for Optimal Transport (OT). By viewing probability measures as “piles of dirt,” we define the distance as the minimum work required to transport one pile to another. This leads to the Wasserstein Metric, which endows the space of measures with a rich Riemannian-like structure. This post provides a rigorous journey from Monge’s 18th-century “Earth-Moving” problem to the modern “Sinkhorn” revolution and the Monge-Ampère PDE.


1. The Monge-Kantorovich Formulations

1.1 Monge’s Problem: The Quest for the Optimal Map

In 1781, Gaspard Monge asked: given two distributions $\mu$ and $\nu$ on $\mathbb{R}^d$, what is the map $T: \mathbb{R}^d \to \mathbb{R}^d$ that minimizes the total cost of transport?

$$\min_{T_\# \mu = \nu} \int_{\mathbb{R}^d} c(x, T(x)) \, d\mu(x)$$

where $T_\# \mu = \nu$ is the Push-forward Constraint: $\nu(B) = \mu(T^{-1}(B))$ for all Borel sets $B$. Geometric Motivation: Monge’s original application was the movement of soil (déblais) to build fortifications (remblais). He realized that the transport lines should not intersect, a precursor to the modern understanding that optimal maps are gradients of convex potentials. The Monge-Ampère Connection: under a quadratic cost $c(x, y) = \frac{1}{2}|x-y|^2$, if the map $T$ is smooth, it satisfies the Change of Variables Formula:

$$f(x) = g(T(x)) \, |\det \nabla T(x)|$$

where $f$ and $g$ are the densities of $\mu$ and $\nu$. Brenier’s theorem (Section 3) tells us $T = \nabla u$ for a convex potential $u$. Substituting this into the density equation yields the Monge-Ampère Equation:

$$\det(\nabla^2 u(x)) = \frac{f(x)}{g(\nabla u(x))}$$

This is a highly non-linear, second-order elliptic PDE. Solving for the optimal transport map is equivalent to finding the solution to this PDE under the boundary condition $\nabla u(\operatorname{spt} \mu) = \operatorname{spt} \nu$.

1.2 Kantorovich’s Relaxation: From Maps to Couplings

Monge’s problem is mathematically “brittle” because a map $T$ might not exist. If $\mu$ is a Dirac mass at the origin and $\nu$ is a uniform distribution on a sphere, no single map can move the mass to multiple destinations. Leonid Kantorovich relaxed the map $T$ to a Coupling (or transport plan) $\gamma$, which is a joint distribution on the product space $X \times Y$.

$$\text{OT}(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \iint_{X \times Y} c(x, y) \, d\gamma(x, y)$$

where $\Pi(\mu, \nu) = \{ \gamma \in \mathcal{P}(X \times Y) : P_{X\#} \gamma = \mu, \; P_{Y\#} \gamma = \nu \}$. This relaxation transforms a non-linear problem into a Linear Program in the space of measures. The set of couplings is convex and weakly compact, ensuring the existence of a minimizer.
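
To make the linear-programming view concrete, here is a minimal sketch that solves a small discrete Kantorovich problem with scipy.optimize.linprog. The point-cloud sizes, the squared-Euclidean cost, and the uniform weights are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np
from scipy.optimize import linprog

def discrete_ot(a, b, C):
    """Solve min_gamma <C, gamma> s.t. row sums = a, column sums = b, gamma >= 0."""
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0      # sum_j gamma_ij = a_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0               # sum_i gamma_ij = b_j
    b_eq = np.concatenate([a, b])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.x.reshape(n, m), res.fun

# Toy example: uniform weights on two small point clouds in R^2.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 2)), rng.normal(size=(6, 2)) + 1.0
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
gamma, cost = discrete_ot(np.full(5, 1 / 5), np.full(6, 1 / 6), C)
print(cost)
```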


2. Duality and High-Level “Prices”

2.1 The Kantorovich Duality

The primal problem (minimizing over couplings) has a dual problem (maximizing over potential functions). Theorem (Kantorovich Duality):

$$\min_{\gamma} \int c \, d\gamma = \sup_{\phi, \psi} \left\{ \int_X \phi \, d\mu + \int_Y \psi \, d\nu : \phi(x) + \psi(y) \le c(x, y) \right\}$$

Proof Sketch via Fenchel-Rockafellar: We can view the primal problem as:

$$\inf_{\gamma} F(\gamma) + G(A \gamma)$$

where $F(\gamma) = \langle c, \gamma \rangle + \iota_{\gamma \ge 0}$ and $G$ is the indicator of the marginal constraints. By Fenchel-Rockafellar Duality:

$$\inf_{\gamma} F(\gamma) + G(A \gamma) = \sup_{p} \; -F^*(A^* p) - G^*(-p)$$

The dual variables $p = (\phi, \psi)$ emerge as the Lagrange multipliers for the marginal constraints. The condition $\phi(x) + \psi(y) \le c(x, y)$ is the requirement that no “arbitrage” is possible in the transport market.

2.2 The Kantorovich-Rubinstein Theorem

If $c(x, y) = d(x, y)$ is a distance metric, one can take $\psi = -\phi$ at optimality, so the constraint $\phi(x) + \psi(y) \le d(x, y)$ becomes $\phi(x) - \phi(y) \le d(x, y)$, which means the potential $\phi$ is 1-Lipschitz.

$$W_1(\mu, \nu) = \sup_{\| \phi \|_L \le 1} \int \phi \, d(\mu - \nu)$$

This version of the Wasserstein distance, the $W_1$ Metric, is the basis for Wasserstein GANs. It provides a robust gradient for training generative models even when the generated and real distributions have non-overlapping supports.


3. Brenier’s Theorem: The Polar Factorization

Under quadratic cost $c(x, y) = \frac{1}{2} |x - y|^2$, the geometry of the optimal map is exceptionally rigid. Theorem (Brenier, 1991): if $\mu$ has a density (it does not charge small sets), there exists a unique (up to an additive constant) convex function $u: \mathbb{R}^d \to \mathbb{R}$ such that the optimal transport map $T$ is given by the gradient:

$$T(x) = \nabla u(x)$$

Interpretation: Moving along a Gradient: The optimal map never “swirls.” Since $T$ is the gradient of a convex function, its Jacobian $\nabla T = \nabla^2 u$ is symmetric and positive semi-definite. Physically, this means that the particles follow the path of least resistance, and the overall flow is “irrotational” in the sense of fluid dynamics.

The Polar Factorization of Maps: Brenier’s theorem is often viewed as the measure-theoretic analog of the polar decomposition of matrices. Just as any matrix factors as an orthogonal part composed with a symmetric positive semi-definite part, any vector field $X$ that pushes $\mu$ to $\nu$ can be decomposed as:

$$X = \nabla u \circ S$$

where $u$ is convex and $S$ is a $\mu$-measure-preserving map. The optimal transport map is the “purest” part of this decomposition: the part that does all the work of rearranging mass without any unnecessary shuffling.


4. Entropic Regularization: The Sinkhorn Speedup

Solving the original Linear Programming problem for $n$ data points (using the Simplex or Interior Point methods) scales as $O(n^3 \log n)$. This is computationally prohibitive for modern machine learning, where $n$ can be $10^5$ or more. To address this, we add an Entropic Regularization term, the negative entropy $H(\gamma) = \iint \gamma(x, y) (\log \gamma(x, y) - 1) \, dx \, dy$:

$$\text{OT}_\epsilon(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \iint c \, d\gamma + \epsilon H(\gamma)$$

The Gibbs Kernel Mapping: The first-order conditions for the Lagrangian of this regularized problem imply that the optimal coupling $\gamma$ must take the form of a Gibbs Distribution:

$$\gamma^*(x, y) = \exp \left( \frac{\phi(x) + \psi(y) - c(x, y)}{\epsilon} \right)$$

Setting $u(x) = \exp(\phi(x)/\epsilon)$ and $v(y) = \exp(\psi(y)/\epsilon)$, we get the factorized form:

$$\gamma^*(x, y) = u(x) \, e^{-c(x, y)/\epsilon} \, v(y)$$

where the term $K(x, y) = e^{-c(x, y)/\epsilon}$ is the Gibbs Kernel. As $\epsilon \to 0$, we recover the exact OT. As $\epsilon \to \infty$, the coupling becomes the independent product $\mu \otimes \nu$. The marginal constraints $P_{X\#} \gamma = \mu$ and $P_{Y\#} \gamma = \nu$ then become a system of non-linear integral equations. In the discrete case, with marginal weight vectors $a$ and $b$, these are:

$$u \odot (K v) = a, \quad v \odot (K^T u) = b$$

We solve this by alternating between the two updates:

$$u^{(k+1)} = \frac{a}{K v^{(k)}}, \quad v^{(k+1)} = \frac{b}{K^T u^{(k+1)}}$$

This is Sinkhorn’s Algorithm. Because it only involves matrix-vector multiplications, it is extremely parallelizable (GPU-friendly). However, for small $\epsilon$, the values in $K$ underflow to zero. We must perform the updates in the Log-Domain (using LogSumExp) to maintain numerical stability.

Implementation (Log-Stabilized Sinkhorn)

```python
import jax
import jax.numpy as jnp

def log_sum_exp(x, axis=None):
    max_x = jnp.max(x, axis=axis, keepdims=True)
    return max_x + jnp.log(jnp.sum(jnp.exp(x - max_x), axis=axis, keepdims=True))

def sinkhorn_log_domain(X, Y, epsilon=0.01, max_iter=100):
    n, m = X.shape[0], Y.shape[0]
    mu_log = -jnp.log(n) * jnp.ones(n)  # assuming uniform weights
    nu_log = -jnp.log(m) * jnp.ones(m)
    diff = X[:, None, :] - Y[None, :, :]
    C = jnp.sum(diff**2, axis=-1)       # squared Euclidean cost
    f = jnp.zeros(n)                    # potential 1
    g = jnp.zeros(m)                    # potential 2
    for _ in range(max_iter):
        # f update: f = epsilon * (log_mu - LSE((g - C)/epsilon))
        M1 = (g[None, :] - C) / epsilon
        f = epsilon * (mu_log - log_sum_exp(M1, axis=1).flatten())
        # g update: g = epsilon * (log_nu - LSE((f - C)/epsilon))
        M2 = (f[:, None] - C) / epsilon
        g = epsilon * (nu_log - log_sum_exp(M2, axis=0).flatten())
    log_gamma = (f[:, None] + g[None, :] - C) / epsilon
    gamma = jnp.exp(log_gamma)
    return gamma, jnp.sum(gamma * C)
```
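
A quick usage sketch of the function above (the point clouds and $\epsilon$ are arbitrary illustrative choices):

```python
import jax.numpy as jnp

X = jnp.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Y = jnp.array([[2.0, 2.0], [3.0, 2.0]])
gamma, cost = sinkhorn_log_domain(X, Y, epsilon=0.05, max_iter=200)
print(gamma.sum(axis=1))   # should be close to the uniform weights 1/3
print(cost)                # transport term <gamma, C>
```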

5. The Dynamic Formulation: Benamou-Brenier

Can we view the transport not just as a static map, but as a continuous Fluid Flow? In 2000, Benamou and Brenier proved that the Wasserstein distance is equal to the minimum kinetic energy of a density $\rho(t, x)$ moving from $\mu$ to $\nu$.

$$W_2^2(\mu, \nu) = \inf_{\rho, v} \int_0^1 \!\! \int \rho(t, x) \, |v(t, x)|^2 \, dx \, dt$$

subject to the Continuity Equation:

$$\partial_t \rho + \nabla \cdot (\rho v) = 0$$

Physical Interpretation: Mass Conservation and Least Action: Imagine the probability mass as a fluid. The continuity equation ensures mass conservation: no “dirt” disappears or is created during transport. This formulation reveals that the Wasserstein distance is the Action of the minimal-energy fluid flow. Along the optimal path (the Wasserstein Geodesic), particles move in straight lines with constant velocity. This is the Lagrangian Perspective. From the Eulerian Perspective, the density $\rho(t, x)$ evolves according to a pressure-less gas equation. This link allows us to solve OT using sophisticated tools from computational fluid dynamics (CFD) and Mean Field Games. These geodesics provide a natural way to interpolate between distributions, ensuring that mass is moved efficiently without the “fading” artifacts seen in linear interpolation.


6. Applications: WGANs and Domain Adaptation

6.1 Wasserstein GANs (WGAN)

Standard GANs minimize the Jensen-Shannon Divergence, which is non-smooth when the distributions have disjoint supports (causing “mode collapse” or vanishing gradients). WGANs use the $W_1$ distance (Section 2.1). The “Critic” network $D$ is trained to maximize the Kantorovich-Rubinstein objective:

$$\min_G \max_{\| D \|_L \le 1} \; \mathbb{E}_{x \sim P_{\text{data}}}[D(x)] - \mathbb{E}_{z \sim P_{\text{noise}}}[D(G(z))]$$

To enforce the 1-Lipschitz constraint, we use Gradient Penalty: adding a loss term $(\| \nabla_{\hat{x}} D(\hat{x}) \|_2 - 1)^2$, evaluated at points $\hat{x}$ interpolated between real and generated samples. This ensures the critic provides useful gradients everywhere, even when the distributions don’t overlap.
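
As a rough illustration, here is what such a penalty might look like in JAX; the critic(params, x) signature, the batch shapes, and the interpolation scheme are assumptions made for this sketch, not a fixed API.

```python
import jax
import jax.numpy as jnp

def gradient_penalty(critic, params, x_real, x_fake, key):
    """E[(||grad_x D(x_hat)|| - 1)^2] at random interpolates between real and fake."""
    eps = jax.random.uniform(key, (x_real.shape[0], 1))
    x_hat = eps * x_real + (1.0 - eps) * x_fake
    # Per-sample gradient of the scalar critic output with respect to its input.
    grad_fn = jax.vmap(jax.grad(lambda x: critic(params, x[None, :]).squeeze()))
    grads = grad_fn(x_hat)
    norms = jnp.sqrt(jnp.sum(grads**2, axis=-1) + 1e-12)
    return jnp.mean((norms - 1.0) ** 2)
```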

6.2 Barycenters & Domain Adaptation

In Domain Adaptation, we want to align the source distribution $\mu_S$ with the target $\mu_T$. OT provides a map $T$ that “morphs” the features while preserving their relative geometry. Wasserstein Barycenters generalize the concept of a mean to distributions:

$$\bar{\mu} = \arg \min_\mu \sum_i \lambda_i W_2^2(\mu, \mu_i)$$

The barycenter of several point clouds is a representative distribution that minimizes the total transport cost to all of them, which is used for data fusion and noise reduction in medical imaging and astronomy.


7. Gromov-Wasserstein: Comparing Incomparable Spaces

What if $X$ and $Y$ live in different metric spaces? For example, mapping English word embeddings in $\mathbb{R}^{300}$ to French word embeddings in $\mathbb{R}^{300}$ where the axes don’t align, or comparing a 3D mesh to a 2D image. We cannot compute $c(x, y) = \|x-y\|$. Instead, we compare intra-domain distances: $d_X(x, x')$ and $d_Y(y, y')$. Gromov-Wasserstein (GW) minimizes the discrepancy between these distances:

$$\text{GW}(\mu, \nu) = \inf_\gamma \iint |d_X(x, x') - d_Y(y, y')|^2 \, d\gamma(x, y) \, d\gamma(x', y')$$

This is a Quadratic Assignment Problem, which is generally NP-Hard. However, with entropic regularization, we can approximate it using a series of Sinkhorn iterations. GW is the “Shape Matching” engine of modern geometry processing, allowing computers to “see” the similarity between two objects even if they are in different coordinate systems.
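
A rough numpy sketch of this strategy (linearize the quadratic GW energy, then re-solve an entropic OT problem with the linearized cost) is below; the helper sinkhorn_matrix, the $\epsilon$ value, and the iteration counts are illustrative choices, and the loop only finds a local minimum.

```python
import numpy as np

def sinkhorn_matrix(C, a, b, eps=0.05, n_iter=200):
    """Basic Sinkhorn scaling on a given cost matrix (no log stabilization)."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        u = a / (K @ (b / (K.T @ u)))
    v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def entropic_gromov_wasserstein(D_X, D_Y, a, b, eps=0.05, outer_iters=20):
    """Alternate between linearizing the GW objective and an entropic OT solve."""
    gamma = np.outer(a, b)                      # start from the independent coupling
    const = (D_X**2 @ a)[:, None] + (D_Y**2 @ b)[None, :]
    for _ in range(outer_iters):
        C = const - 2.0 * D_X @ gamma @ D_Y     # linearized GW cost around gamma
        gamma = sinkhorn_matrix(C, a, b, eps)
    return gamma
```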


8. One-Dimensional Optimal Transport: The Sorting Solution

In one dimension ($d = 1$), Optimal Transport is remarkably simple: sorting the data is optimal. If $\mu$ and $\nu$ are measures on $\mathbb{R}$, and $F_\mu, F_\nu$ are their Cumulative Distribution Functions (CDFs), the optimal map is:

$$T(x) = F_\nu^{-1}(F_\mu(x))$$

Algorithm:

  1. Compute the CDF of the source.
  2. Compute the Inverse CDF (Quantile function) of the target.
  3. The composition is the optimal transport map.

The $W_p$ distance in 1D is simply the $L^p$ distance between the quantile functions:

$$W_p(\mu, \nu) = \left( \int_0^1 |F_\mu^{-1}(t) - F_\nu^{-1}(t)|^p \, dt \right)^{1/p}$$

This $O(n \log n)$ solution (dominated by sorting) is used in Sliced Wasserstein distances, where we project high-dimensional distributions onto random 1D lines, solve the transport on each line, and average the results.
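
A minimal sketch of both ideas, assuming equal sample counts and uniform weights so that sorting alone aligns the quantile functions:

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p between two empirical measures on R with equal sample counts:
    sort both samples and compare order statistics (the quantile functions)."""
    xs, ys = np.sort(x), np.sort(y)
    return (np.mean(np.abs(xs - ys) ** p)) ** (1.0 / p)

def sliced_wasserstein(X, Y, n_proj=100, p=2, seed=0):
    """Average the 1D distances over random projection directions."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        total += wasserstein_1d(X @ theta, Y @ theta, p)
    return total / n_proj
```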


9. McCann’s Displacement Convexity

Is the Wasserstein space “flat” or “curved”? Robert McCann (1997) introduced the concept of Displacement Convexity. A functional $\mathcal{F}$ on the space of measures is displacement convex if it is convex along the Wasserstein geodesics (the Benamou-Brenier paths). Examples of displacement convex functionals:

  1. Potential Energy: $\mathcal{V}(\rho) = \int V(x) \rho(x) \, dx$ if $V$ is convex.
  2. Internal Energy: $\mathcal{E}(\rho) = \int A(\rho(x)) \, dx$ for certain functions $A$ (e.g. $A(\rho) = \rho^m$ with $m \ge 1 - 1/d$).
  3. Negative Entropy: $\mathcal{H}(\rho) = \int \rho \log \rho \, dx$.

This property is the foundation of the Otto Calculus, which views the Heat Equation as the gradient flow of the entropy functional in the Wasserstein metric. It explains why the Heat Equation is “optimal”: it is the fastest way to spread probability mass so as to increase entropy.

10. Discrete Optimal Transport on Graphs

When we move from continuous space to a Graph $G = (V, E)$, the transport problem becomes a Network Flow problem. If we want to move mass from nodes with supply $\mu$ to nodes with demand $\nu$, the cost is determined by the shortest path distances $d_G(i, j)$ on the graph.

$$\min_{\gamma \ge 0} \sum_{i, j \in V} d_G(i, j) \, \gamma_{ij} \quad \text{s.t.} \quad \sum_j \gamma_{ij} = \mu_i, \quad \sum_i \gamma_{ij} = \nu_j$$

This is equivalent to the Minimum Cost Flow problem, which can be solved efficiently using the Dual Ascent or Successive Shortest Path algorithms. On graphs, the $W_1$ distance is particularly interesting because it relates to the Cut-Norm and the Graph Laplacian. The optimal flow lines often reveal the bottleneck structure of the network, making OT a powerful tool for community detection and robustness analysis in infrastructure networks.
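
As a small illustration, the sketch below builds the graph metric with scipy’s shortest-path routine and reuses the same LP structure as Section 1.2; the 4-node path graph and the supply/demand vectors are made up for the example.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.optimize import linprog

# Illustrative 4-node path graph 0 - 1 - 2 - 3 with unit edge lengths.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
d_G = shortest_path(adj, directed=False)      # all-pairs shortest-path distances

mu = np.array([0.7, 0.3, 0.0, 0.0])           # supply
nu = np.array([0.0, 0.0, 0.4, 0.6])           # demand
n = len(mu)

# Same LP as Section 1.2, with the graph metric as the ground cost.
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0          # row sums = mu
    A_eq[n + i, i::n] = 1.0                   # column sums = nu
res = linprog(d_G.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
              bounds=(0, None), method="highs")
print("W1 on graph:", res.fun)
```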


11. Wasserstein Splines and Temporal Interpolation

In many applications, we don’t just want the distance between two measures; we want to interpolate a series of measures $\mu_0, \mu_1, \dots, \mu_T$. Wasserstein Splines generalize the idea of cubic splines to the manifold of probability measures. Instead of linear interpolation (which just fades densities), Wasserstein splines build on displacement interpolation to ensure that the intermediate distributions “morph” naturally. This is used in Computer Animation (morphing between character poses) and Single-Cell Biology (reconstructing the trajectory of a cell’s gene expression across different time points). By minimizing the “acceleration” in the Wasserstein metric, we find the most physically plausible path through the space of possible cell states.


12. Unbalanced Optimal Transport

What if the total mass of $\mu$ and $\nu$ is not equal? For example, comparing a population density from 1950 to 2020 where the total population has changed. Standard OT requires $T_\# \mu = \nu$, which forces mass conservation. Unbalanced Optimal Transport (UOT) relaxes this by allowing mass to be created or destroyed, at a cost. We replace the marginal constraints with a penalty term (often the KL divergence):

$$\text{UOT}(\mu, \nu) = \inf_{\gamma} \iint c \, d\gamma + \alpha \, \text{KL}(P_{X\#} \gamma \,\|\, \mu) + \beta \, \text{KL}(P_{Y\#} \gamma \,\|\, \nu)$$

This leads to the Gaussian Hellinger-Kantorovich distance. UOT is used in Cell Tracking, where cells might divide or die, and in Image Processing, where objects might enter or leave the frame. By parameterizing the mass creation cost, we can tune the model to be more or less sensitive to global mass changes.
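
A minimal numpy sketch of the resulting scaling iterations, assuming equal penalties $\alpha = \beta = \rho$ and no log-stabilization; the only change from balanced Sinkhorn is the exponent $\rho / (\rho + \epsilon)$ on each update.

```python
import numpy as np

def unbalanced_sinkhorn(C, a, b, eps=0.05, rho=1.0, n_iter=500):
    """Sinkhorn scaling for KL-relaxed marginals (alpha = beta = rho)."""
    K = np.exp(-C / eps)
    fi = rho / (rho + eps)          # damping exponent from the KL marginal penalties
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** fi
        v = (b / (K.T @ u)) ** fi
    gamma = u[:, None] * K * v[None, :]
    return gamma, (gamma * C).sum()
```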


13. Conclusion: The Geometry of Information

Optimal Transport has moved from a niche corner of resource allocation to the bedrock of modern data science. It provides the horizontal link that traditional information theory lacked. By respecting the underlying metric of the state space, it allows us to build models that are physically consistent, geometrically intuitive, and computationally robust. Whether we are training GANs, aligning biological datasets, or proving the stability of galaxies, the language of OT provides the most natural coordinate system for the evolution of probability measures.


Historical Timeline

| Year | Event | Significance |
|---|---|---|
| 1781 | Gaspard Monge | Defines the “Earth Mover’s” problem. |
| 1942 | Leonid Kantorovich | Relaxes the problem to couplings (Linear Programming). |
| 1975 | Kantorovich | Nobel Prize in Economics. |
| 1987 | Yann Brenier | Proves the Polar Factorization (Brenier’s Theorem); full paper published in 1991. |
| 1997 | Robert McCann | Introduces Displacement Convexity. |
| 2000 | Benamou & Brenier | Dynamic formulation (fluid mechanics link). |
| 2009 | Cédric Villani | Publishes “Optimal Transport: Old and New”. |
| 2013 | Marco Cuturi | Sinkhorn Algorithm (Entropic Regularization). |
| 2017 | Arjovsky et al. | Wasserstein GANs. |

Appendix A: Deriving Monge-Ampère from Optimal Transport

Consider the optimal map $T: X \to Y$ pushing $\mu$ to $\nu$. By the change of variables formula, for any test function $\psi$:

$$\int_Y \psi(y) \, d\nu(y) = \int_X \psi(T(x)) \, d\mu(x)$$

Assuming densities $f$ and $g$:

$$g(T(x)) \, |\det \nabla T(x)| = f(x)$$

Under quadratic cost, Brenier’s theorem gives $T = \nabla u$, where $u$ is convex. Then $\nabla T = \nabla^2 u$ (the Hessian). The equation becomes the Monge-Ampère Equation:

$$\det(\nabla^2 u(x)) = \frac{f(x)}{g(\nabla u(x))}$$

This PDE encapsulates the entirety of the optimal transport problem in a single differential constraint.
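
As a quick one-dimensional sanity check, take $\mu = \mathcal{N}(m_1, \sigma_1^2)$ and $\nu = \mathcal{N}(m_2, \sigma_2^2)$. The monotone map $T(x) = m_2 + \frac{\sigma_2}{\sigma_1}(x - m_1)$ is the gradient of the convex potential $u(x) = m_2 x + \frac{\sigma_2}{2\sigma_1}(x - m_1)^2$, so the left-hand side is $u''(x) = \sigma_2/\sigma_1$. On the right-hand side, $(T(x) - m_2)^2/\sigma_2^2 = (x - m_1)^2/\sigma_1^2$, so the Gaussian exponentials cancel and

$$\frac{f(x)}{g(T(x))} = \frac{\sigma_1^{-1} \, e^{-(x-m_1)^2/2\sigma_1^2}}{\sigma_2^{-1} \, e^{-(T(x)-m_2)^2/2\sigma_2^2}} = \frac{\sigma_2}{\sigma_1},$$

and both sides agree, exactly as the Monge-Ampère equation demands.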


Appendix B: Kantorovich and the Origins of Linear Programming

Leonid Kantorovich developed his formulation of Optimal Transport in 1942, building on the linear programming methods he had introduced in 1939 while optimizing production for a plywood trust. He realized that the transport problem was a special case of Linear Programming. His insight was that the “Dual Potentials” $\phi$ and $\psi$ represent the internal prices (or shadow prices) of the system. In a perfectly efficient economy, the internal prices of raw materials must be consistent with the final transport objective, a principle that remains the foundation of modern supply chain optimization.


Appendix C: Wasserstein Barycenters and Discrete Geometry

Given a set of distributions $\mu_1, \dots, \mu_k$, their Wasserstein Barycenter is:

$$\mu^* = \arg \min_\mu \sum_{i=1}^k w_i \, W_2^2(\mu, \mu_i)$$

Unlike the Euclidean mean (which averages coordinates), the Wasserstein mean averages the “locations” of the mass. For example, the Wasserstein mean of two shifted Gaussians is a Gaussian in the middle. Computationally, Convolutional Wasserstein Barycenters exploit heat kernels on meshes or grids to speed up the Sinkhorn updates across multiple domains simultaneously.
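
A rough sketch of a fixed-support entropic barycenter via iterative Bregman projections, assuming all input measures and the barycenter live on one shared grid with cost matrix C; the grid, $\epsilon$, and iteration counts are illustrative choices.

```python
import numpy as np

def wasserstein_barycenter(mus, weights, C, eps=0.05, n_iter=200):
    """Fixed-support entropic barycenter (iterative Bregman projections)."""
    K = np.exp(-C / eps)
    n, k = C.shape[0], len(mus)
    v = np.ones((k, n))
    for _ in range(n_iter):
        u = np.stack([mus[i] / (K @ v[i]) for i in range(k)])
        # Weighted geometric mean of the current second marginals = barycenter.
        bary = np.exp(sum(w * np.log(K.T @ u[i]) for i, w in enumerate(weights)))
        v = np.stack([bary / (K.T @ u[i]) for i in range(k)])
    return bary
```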


Appendix D: Regularity Theory (Caffarelli)

When is the optimal map $T$ smooth?

  1. Convexity of Support: The target support must be convex. If not, $u$ develops singularities and the map can be discontinuous.
  2. Hölder Continuity: If the densities are bounded away from zero and infinity and the target support is convex, then $u \in C^{1,\alpha}$ and the map $T$ is Hölder continuous.

Appendix E: Fisher-Rao vs. Wasserstein (Information Geometry)

There are two primary ways to make the space of measures a manifold:

  1. Fisher-Rao Metric: local change in density (Information Theory). “Vertical”.
  2. Wasserstein Metric: physical movement of mass (Optimal Transport). “Horizontal”.

Comparison:

  • Geometry: Fisher-Rao leads to constant positive curvature. Wasserstein leads to Alexandrov spaces.
  • Support: Fisher-Rao requires overlapping supports. Wasserstein does not.
  • Geodesics: Fisher geodesics “fade”. Wasserstein geodesics “translate”.

Appendix F: Stability of Sinkhorn Potentials

A major challenge in using Sinkhorn is the choice of $\epsilon$: entropic OT is biased, since $\text{OT}_\epsilon(\mu, \mu) \ne 0$ in general. The Sinkhorn Divergence corrects this entropic bias:

$$S_\epsilon(\mu, \nu) = \text{OT}_\epsilon(\mu, \nu) - \frac{1}{2}\text{OT}_\epsilon(\mu, \mu) - \frac{1}{2}\text{OT}_\epsilon(\nu, \nu)$$

This divergence is non-negative and defines a legitimate discrepancy that interpolates between the Wasserstein distance ($\epsilon \to 0$) and a kernel mean embedding (MMD) distance ($\epsilon \to \infty$).
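
A minimal sketch of the debiasing, reusing sinkhorn_log_domain from Section 4; note that function returns only the transport term $\langle \gamma, C \rangle$, so this computes a debiased transport cost rather than the full entropic objective.

```python
def sinkhorn_divergence(X, Y, epsilon=0.01, max_iter=100):
    # Debias by subtracting the two self-transport terms.
    _, oxy = sinkhorn_log_domain(X, Y, epsilon, max_iter)
    _, oxx = sinkhorn_log_domain(X, X, epsilon, max_iter)
    _, oyy = sinkhorn_log_domain(Y, Y, epsilon, max_iter)
    return oxy - 0.5 * oxx - 0.5 * oyy
```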


Appendix G: Semi-Discrete Optimal Transport and Power Diagrams

In semi-discrete OT, we move a continuous density $\mu$ to discrete Dirac masses $\{y_j\}$. The optimal map $T$ partitions the space into Power Cells:

$$C_j = \{ x : |x - y_j|^2 - \psi_j \le |x - y_k|^2 - \psi_k \quad \forall k \}$$

where $\psi_j$ are dual weights. This is solved using Newton’s method on the dual objective.
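
Below is a crude sketch that replaces the Newton step with stochastic gradient ascent on the dual (the gradient in coordinate $j$ is $\nu_j$ minus the $\mu$-mass of the power cell $C_j$, estimated by Monte Carlo); the sampler, step size, and batch size are illustrative assumptions.

```python
import numpy as np

def semidiscrete_weights(sample_mu, Y, nu, lr=0.5, n_iter=500):
    """Gradient ascent on the semi-discrete dual (a simple stand-in for Newton)."""
    m = len(nu)
    psi = np.zeros(m)
    for _ in range(n_iter):
        x = sample_mu(2048)                                   # Monte Carlo batch
        # Assign each sample to its power cell: argmin_j |x - y_j|^2 - psi_j.
        scores = ((x[:, None, :] - Y[None, :, :]) ** 2).sum(-1) - psi[None, :]
        cells = np.argmin(scores, axis=1)
        cell_mass = np.bincount(cells, minlength=m) / len(x)  # estimate of mu(C_j)
        psi += lr * (nu - cell_mass)                          # dual gradient step
    return psi

# Usage sketch: uniform source on [0,1]^2 transported to three weighted Diracs.
rng = np.random.default_rng(0)
Y = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.9]])
psi = semidiscrete_weights(lambda k: rng.random((k, 2)), Y, np.array([0.5, 0.3, 0.2]))
```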


Appendix H: Numerical Jacobians and Second-Order Optimization

In many OT applications, we need to differentiate through the transport map $T(x) = \nabla u(x)$. This requires computing the Hessian $\nabla^2 u(x)$. In the continuous case, this is exactly what the Monge-Ampère equation describes. In the discrete/Sinkhorn case, the Jacobian of the map can be computed via Implicit Differentiation or Automatic Differentiation (AD). However, standard AD can be memory-intensive for many Sinkhorn iterations. Instead, we use the fact that the Sinkhorn fixed point satisfies $u = a / (Kv)$. By differentiating this stationarity condition, we can compute the Jacobian of the OT plan with respect to the input locations $X, Y$ without storing the entire iteration history. This is critical for applications like Optimal Transport for Shape Matching, where we optimize the positions of points to minimize their transport cost to a target mesh.
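
One common shortcut, sketched below in JAX, uses an envelope (Danskin-type) argument: because the plan returned by Sinkhorn is (approximately) optimal, the gradient of the transport cost with respect to the point locations can be taken with the plan held fixed, so no iteration history needs to be stored. It reuses sinkhorn_log_domain from Section 4 and is a stand-in for full implicit differentiation, not the only way to do it.

```python
import jax
import jax.numpy as jnp

def ot_cost_and_grad(X, Y, epsilon=0.01):
    """Envelope shortcut: treat the optimal plan as a constant when differentiating."""
    gamma, _ = sinkhorn_log_domain(X, Y, epsilon)    # from Section 4
    gamma = jax.lax.stop_gradient(gamma)             # held fixed w.r.t. X
    def transport_cost(X_):
        C = jnp.sum((X_[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        return jnp.sum(gamma * C)
    return transport_cost(X), jax.grad(transport_cost)(X)
```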


Appendix I: Connection to Geometric Integration

As discussed in the previous post on Geometric Integration, Hamiltonian systems preserve the symplectic form. There is a deep connection between the Wasserstein Metric and Hamiltonian Flows. The flow $\rho(t)$ in Wasserstein space can be seen as a Hamiltonian trajectory in the space of measures, where the “Internal Energy” acts as the Hamiltonian. The Jordan-Kinderlehrer-Otto (JKO) Scheme is a variational integrator for these flows:

$$\rho_{n+1} = \arg \min_\rho \left( \frac{1}{2h} W_2^2(\rho, \rho_n) + \mathcal{F}(\rho) \right)$$

This scheme is a discrete-time version of the gradient flow, preserving the displacement convexity of the energy functional $\mathcal{F}$. It shows that many diffusion processes (like the Heat Equation) are in fact gradient flows in the Wasserstein metric, with each JKO step a descent step of this objective.


Appendix J: Multimarginal Optimal Transport

Instead of two distributions $\mu$ and $\nu$, what if we want to find a coupling between $N$ distributions $\mu_1, \dots, \mu_N$? This is Multimarginal Optimal Transport (MOT):

$$\inf_{\gamma \in \Pi(\mu_1, \dots, \mu_N)} \int \cdots \int C(x_1, \dots, x_N) \, d\gamma(x_1, \dots, x_N)$$

The size of the state space grows exponentially with $N$, making naive computations impossible. However, if the cost $C$ has a specific structure (e.g., it decomposes into pairwise terms along a tree, or is a barycentric cost), we can use specialized Sinkhorn-like algorithms. MOT is used in Multi-objective Optimization and Multi-agent Systems.


Appendix K: OT and Density Functional Theory (DFT)

In quantum chemistry, the Hohenberg-Kohn theorem states that the ground state of a multi-electron system is a functional of the electron density $\rho$. Finding the electron configuration that minimizes energy can be framed as a Multimarginal OT problem with a repulsive cost $c(x, y) = 1/|x-y|$. The electrons “transport” themselves to positions that minimize Coulomb repulsion while matching the overall density. This connection, developed by Cotar and Pass, provides a rigorous bridge between Optimal Transport and the fundamental equations of the physical world.


Appendix L: Connections to Quantum Optimal Transport

Quantum Optimal Transport (QOT) extends these ideas to density matrices $\rho$. Instead of transport between points, we consider the cost of transforming one quantum state into another. This involves the Lindblad Equation and the Lindblad-Wasserstein metric, providing a geometric framework for understanding decoherence and the evolution of open quantum systems.


Appendix M: Glossary of Terms

  • Brenier Map: The gradient of a convex function that is optimal for quadratic cost.
  • c-Transform: The generalized Legendre transform for a cost function $c$.
  • Coupling: A joint distribution with fixed marginals.
  • McCann Interpolation: The straight line in Wasserstein space.
  • Push-forward: The measure induced by a map.
  • Sinkhorn Divergence: A debiased version of the entropic OT cost.

References

1. Villani, C. (2009). “Optimal Transport: Old and New”. The Bible of the field. Links OT to Ricci curvature and geometry.

2. Cuturi, M. (2013). “Sinkhorn Distances: Lightspeed Computation of Optimal Transport”. The paper that brought OT into modern machine learning via entropic regularization.

3. Peyré, G., & Cuturi, M. (2019). “Computational Optimal Transport”. A comprehensive computational guide.

4. Benamou, J. D., & Brenier, Y. (2000). “A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem”. Established the dynamic, fluid-mechanics view of OT.

5. McCann, R. J. (1997). “A convexity principle for interacting gases”. Introduced displacement convexity.

6. Solomon, J., et al. (2015). “Convolutional Wasserstein Barycenters”. Fast algorithms for barycenters on geometric domains.

7. Arjovsky, M., et al. (2017). “Wasserstein GAN”. Applied OT stability to generative adversarial networks.

8. Caffarelli, L. A. (1992). “The regularity of mappings with a convex potential”. Fundamental regularity theory for the Monge-Ampère equation.