Information Geometry & Natural Gradients
The coordinate dependence of gradient descent
Let $(\mathcal{X}, \mathcal{A}, \mu)$ be a measure space. We have a parametric family $\{P_\theta : \theta \in \Theta\}$ dominated by $\mu$, with $\Theta \subseteq \mathbb{R}^d$ open, with densities:

$$p(x;\theta) = \frac{dP_\theta}{d\mu}(x).$$
We want a Riemannian metric $g$ and a connection $\nabla$ on $\Theta$ that sit inside the family itself and do not care about how we happen to write the parameters down.
The trouble is that gradient descent takes the distance between $p_\theta$ and $p_{\theta+\delta}$ to be the Euclidean $\|\delta\|^2$, and under a reparameterization this Euclidean distance shifts around even though the distributions have not moved at all (the sketch after the list below makes the shift concrete). So we need a divergence whose induced metric is
- Reparameterization-invariant, so that $ds^2 = d\theta^\top G(\theta)\, d\theta$ is a scalar invariant.
- Sufficient-statistic-invariant, so that if $T$ from $\mathcal{X}$ to $\mathcal{Y}$ is sufficient for $\theta$, the geometry on the family over $\mathcal{X}$ matches the geometry on the induced family over $\mathcal{Y}$.
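To see the chart dependence numerically, here is a minimal sketch (hypothetical parameter values) that writes the same pair of univariate Gaussians in two charts, $(\mu, \sigma)$ and $(\mu, \log\sigma)$: the KL divergence between the two distributions is the same number in either chart, while the Euclidean step length is not.

```python
import jax.numpy as jnp

# Two nearby univariate Gaussians, written in two charts:
#   chart A: (mu, sigma)        chart B: (mu, log sigma)
p0 = {"mu": 0.0, "sigma": 1.0}   # hypothetical values
p1 = {"mu": 0.1, "sigma": 1.2}

def kl_gauss(m0, s0, m1, s1):
    """Closed-form KL( N(m0, s0^2) || N(m1, s1^2) )."""
    return jnp.log(s1 / s0) + (s0**2 + (m0 - m1) ** 2) / (2 * s1**2) - 0.5

# The distributions themselves do not care about the chart.
kl = kl_gauss(p0["mu"], p0["sigma"], p1["mu"], p1["sigma"])

# Euclidean step length in chart A versus chart B.
d_a = jnp.sqrt((p1["mu"] - p0["mu"]) ** 2 + (p1["sigma"] - p0["sigma"]) ** 2)
d_b = jnp.sqrt((p1["mu"] - p0["mu"]) ** 2
               + (jnp.log(p1["sigma"]) - jnp.log(p0["sigma"])) ** 2)

print("KL (chart-free):        ", float(kl))
print("Euclidean, (mu, sigma): ", float(d_a))   # ~0.224
print("Euclidean, (mu, log s): ", float(d_b))   # ~0.208, same distributions, different number
```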
Regularity conditions and their failures
The standard theory sits on top of the following assumptions and each one quietly kicks out a bunch of cases we actually care about.
A1 (Identifiability). The map $\theta \mapsto P_\theta$ is injective. Neural networks break this all over the place through permutation symmetry and overparameterization and entire manifolds of solutions that all give the same distribution.
A2 (Common support). The support $\{x : p(x;\theta) > 0\}$ does not depend on $\theta$. The uniform distribution $U(0,\theta)$ blows this up right away and the likelihood stops being differentiable at the boundary.
A3 (Smoothness). $p(x;\theta)$ is at least $C^2$ in $\theta$.
A4 (Interchange of derivative and integral).

$$\nabla_\theta \int p(x;\theta)\, d\mu = \int \nabla_\theta\, p(x;\theta)\, d\mu.$$
This needs the score to be uniformly integrable and it gets assumed way more often than anyone bothers to check it.
Deriving the metric from KL divergence
KL divergence is asymmetric and it breaks the triangle inequality so it is not a metric, but if you zoom in and do a second-order expansion you get back a real Riemannian metric.
Set $f(\delta) = D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta+\delta}\right)$ and Taylor-expand around $\delta = 0$ to see what comes out.
Plugging in $\log p(x;\theta+\delta) = \log p(x;\theta) + \delta^\top \nabla_\theta \log p + \tfrac{1}{2}\,\delta^\top \nabla_\theta^2 \log p\;\delta + O(\|\delta\|^3)$ gives

$$D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta+\delta}\right) = -\,\mathbb{E}_{p_\theta}\!\left[\delta^\top \nabla_\theta \log p\right] - \tfrac{1}{2}\,\delta^\top\, \mathbb{E}_{p_\theta}\!\left[\nabla_\theta^2 \log p\right]\delta + O(\|\delta\|^3).$$
The linear term drops out and the calculation below shows why.
By A4,

$$\mathbb{E}_{p_\theta}\!\left[\nabla_\theta \log p(x;\theta)\right] = \int \nabla_\theta\, p(x;\theta)\, d\mu = \nabla_\theta \int p(x;\theta)\, d\mu = \nabla_\theta 1 = 0.$$
So $D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta+\delta}\right)$ sits at a local minimum at $\delta = 0$ like you would expect, and what is left over is the quadratic form

$$D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta+\delta}\right) = \tfrac{1}{2}\,\delta^\top G(\theta)\,\delta + O(\|\delta\|^3), \qquad G(\theta) = -\,\mathbb{E}_{p_\theta}\!\left[\nabla_\theta^2 \log p(x;\theta)\right].$$
Now for the Hessian-to-outer-product identity. Differentiating $\mathbb{E}_{p_\theta}\!\left[\nabla_\theta \log p\right] = 0$ one more time gives

$$0 = \nabla_\theta \int p\,\nabla_\theta \log p\; d\mu = \int p\,\nabla_\theta \log p\,\left(\nabla_\theta \log p\right)^\top d\mu + \int p\,\nabla_\theta^2 \log p\; d\mu,$$

and so

$$-\,\mathbb{E}_{p_\theta}\!\left[\nabla_\theta^2 \log p\right] = \mathbb{E}_{p_\theta}\!\left[\nabla_\theta \log p\,\left(\nabla_\theta \log p\right)^\top\right].$$
The local distance $ds^2 = \delta^\top G(\theta)\,\delta$ gives a Riemannian metric on $\Theta$, and the matrix

$$G(\theta) = \mathbb{E}_{p_\theta}\!\left[\nabla_\theta \log p\,\left(\nabla_\theta \log p\right)^\top\right]$$

is the Fisher Information Matrix.
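As a quick numerical check of the identity, here is a minimal sketch using an Exponential(rate) model, chosen only because both sides are easy to sample and the true Fisher information is $1/\text{rate}^2$; the evaluation point is a hypothetical choice.

```python
import jax
import jax.numpy as jnp
from jax import random

# Exponential(rate) model: log p(x; rate) = log(rate) - rate * x on x >= 0.
def log_pdf(rate, x):
    return jnp.log(rate) - rate * x

rate = 1.5                                                      # hypothetical evaluation point
x = random.exponential(random.PRNGKey(0), (500_000,)) / rate    # samples from the model itself

score = jax.vmap(jax.grad(log_pdf), in_axes=(None, 0))(rate, x)
hess = jax.vmap(jax.hessian(log_pdf), in_axes=(None, 0))(rate, x)

print(float(jnp.mean(score**2)))   # E[s^2]          ~ 1/rate^2 = 0.444 (Monte Carlo)
print(float(-jnp.mean(hess)))      # -E[d^2 log p]   = 1/rate^2 exactly for this model
```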
Chentsov’s uniqueness theorem
Chentsov (1972) showed that the Fisher metric is the only Riemannian metric (up to scale) that stays invariant under congruent embeddings by Markov morphisms.
The idea is that a measurable map $T$ from $\mathcal{X}$ to $\mathcal{Y}$ pushes measures forward, and if $T$ is a sufficient statistic then no information is lost, so distances have to be preserved. The Fisher metric does this on its own and no other Riemannian metric does.
Dual connections
A metric gives lengths and angles but if you want geodesics and flatness you also need a connection.
In ordinary Riemannian geometry the obvious pick is the Levi-Civita connection, which is metric-compatible, torsion-free, and unique. Statistical manifolds carry a whole one-parameter family of connections $\nabla^{(\alpha)}$ with $\alpha \in \mathbb{R}$ (the case $\alpha = 0$ recovers the Levi-Civita connection of the Fisher metric), and this extra structure is where a lot of the action is.
The $\alpha$-connection has Christoffel symbols

$$\Gamma^{(\alpha)}_{ij,k} = \mathbb{E}_{p_\theta}\!\left[\left(\partial_i \partial_j \ell + \frac{1-\alpha}{2}\,\partial_i \ell\,\partial_j \ell\right)\partial_k \ell\right], \qquad \ell = \log p(x;\theta).$$

Using the skewness tensor $T_{ijk} = \mathbb{E}_{p_\theta}\!\left[\partial_i \ell\,\partial_j \ell\,\partial_k \ell\right]$ this cleans up to

$$\Gamma^{(\alpha)}_{ij,k} = \Gamma^{(0)}_{ij,k} - \frac{\alpha}{2}\, T_{ijk}.$$
$\nabla^{(\alpha)}$ and $\nabla^{(-\alpha)}$ are dual with respect to $g$, and so for any vector fields $X$, $Y$ and $Z$ we have

$$X\, g(Y, Z) = g\!\left(\nabla^{(\alpha)}_X Y,\, Z\right) + g\!\left(Y,\, \nabla^{(-\alpha)}_X Z\right).$$
The exponential connection $\nabla^{(e)} = \nabla^{(1)}$ at $\alpha = 1$ and the mixture connection $\nabla^{(m)} = \nabla^{(-1)}$ at $\alpha = -1$ are the pair that does most of the work.
The hyperbolic geometry of Gaussians
The univariate Gaussian makes the theory concrete. Take

$$p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad \theta = (\mu, \sigma).$$
The scores are

$$\partial_\mu \ell = \frac{x-\mu}{\sigma^2}, \qquad \partial_\sigma \ell = \frac{(x-\mu)^2}{\sigma^3} - \frac{1}{\sigma}.$$
We work out the Fisher matrix one entry at a time.
$G_{\mu\mu} = \mathbb{E}\!\left[\left(\frac{x-\mu}{\sigma^2}\right)^2\right] = \frac{\sigma^2}{\sigma^4} = \frac{1}{\sigma^2}$.

$G_{\mu\sigma} = \mathbb{E}\!\left[\frac{x-\mu}{\sigma^2}\left(\frac{(x-\mu)^2}{\sigma^3} - \frac{1}{\sigma}\right)\right] = 0$ because the odd central moments of a Gaussian wipe out, so $\partial_\mu$ and $\partial_\sigma$ sit orthogonal to each other in the Riemannian sense.

For $G_{\sigma\sigma}$ we expand $\left(\frac{(x-\mu)^2}{\sigma^3} - \frac{1}{\sigma}\right)^2$ and use $\mathbb{E}\!\left[(x-\mu)^4\right] = 3\sigma^4$, which gives

$$G_{\sigma\sigma} = \frac{3\sigma^4}{\sigma^6} - \frac{2\sigma^2}{\sigma^4} + \frac{1}{\sigma^2} = \frac{2}{\sigma^2}.$$
The Fisher matrix is

$$G(\mu,\sigma) = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\[4pt] 0 & \dfrac{2}{\sigma^2} \end{pmatrix}$$

and the line element is

$$ds^2 = \frac{d\mu^2 + 2\,d\sigma^2}{\sigma^2}.$$
Put this next to the Poincaré upper half-plane, where $ds^2 = \frac{dx^2 + dy^2}{y^2}$. The factor of 2 just rescales the curvature but the shape is the same, and the manifold of univariate Gaussians has constant negative curvature.
Geodesics between $(\mu_0, \sigma_0)$ and $(\mu_1, \sigma_1)$ come from solving the Euler-Lagrange equations for the energy $\int \frac{\dot\mu^2 + 2\dot\sigma^2}{\sigma^2}\, dt$. When $\mu_0 = \mu_1$ the geodesic is a vertical line that just rescales the variance and otherwise the geodesics are semi-ellipses sitting on the $\mu$-axis.
Because of this negative curvature, averaging parameters coordinate by coordinate does not land you on the geometric center and the Frechet mean on this manifold can sit pretty far from the naive average.
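The diagonal matrix above is easy to confirm with autodiff. The sketch below (hypothetical evaluation point) takes the closed-form Gaussian KL and reads off the Fisher matrix as its Hessian in the second argument, recovering $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$.

```python
import jax
import jax.numpy as jnp

def kl_gauss(theta_p, theta_q):
    """Closed-form KL( N(mu_p, sp^2) || N(mu_q, sq^2) ) with theta = (mu, sigma)."""
    mu_p, sp = theta_p
    mu_q, sq = theta_q
    return jnp.log(sq / sp) + (sp**2 + (mu_p - mu_q) ** 2) / (2 * sq**2) - 0.5

theta = jnp.array([1.0, 0.7])    # hypothetical point (mu, sigma)

# Fisher matrix = Hessian of KL(p_theta || p_{theta + delta}) in the second slot at delta = 0.
fisher = jax.hessian(lambda q: kl_gauss(theta, q))(theta)

print(fisher)                                             # [[1/sigma^2, 0], [0, 2/sigma^2]]
print(jnp.diag(jnp.array([1.0, 2.0])) / theta[1] ** 2)    # analytic value for comparison
```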
Exponential families and e-flatness
Take the exponential family in natural parameters $\eta$ and write it out as

$$p(x;\eta) = h(x)\,\exp\!\left(\eta^\top T(x) - \psi(\eta)\right).$$
The second derivative of the log-likelihood is $\partial_i \partial_j \ell = -\,\partial_i \partial_j \psi(\eta)$, and this has no $x$ in it at all, so the Hessian is just a deterministic function of $\eta$.
So the e-connection Christoffel symbols collapse to

$$\Gamma^{(1)}_{ij,k} = \mathbb{E}\!\left[\left(\partial_i \partial_j \ell\right)\partial_k \ell\right] = -\,\partial_i \partial_j \psi(\eta)\;\mathbb{E}\!\left[\partial_k \ell\right] = 0$$

because the score has zero mean. So the manifold is flat under the e-connection, the natural parameters are affine coordinates, and geodesics are straight lines given by $\eta(t) = (1-t)\,\eta_0 + t\,\eta_1$.
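For a concrete instance (a standard worked example), take the Bernoulli family with success probability $\pi$ written in its natural parameter:

$$p(x;\eta) = \exp\!\left(\eta x - \psi(\eta)\right), \qquad x \in \{0,1\}, \qquad \eta = \log\frac{\pi}{1-\pi}, \qquad \psi(\eta) = \log\!\left(1 + e^\eta\right).$$

Here $\partial_\eta^2 \ell = -\psi''(\eta) = -\frac{e^\eta}{(1+e^\eta)^2}$ contains no $x$, so $\Gamma^{(1)} = 0$ and the e-geodesic between two Bernoullis is a straight line in the logit $\eta$, which is emphatically not a straight line in the mean parameter $\pi$.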
Mixture families and the Pythagorean theorem
And on the other side, the mixture family

$$p(x;\xi) = \sum_{i=1}^{k} \xi_i\, p_i(x), \qquad \xi_i \ge 0, \quad \sum_{i=1}^{k} \xi_i = 1,$$

is flat under the m-connection at $\alpha = -1$, the expectation parameters are affine coordinates, and geodesics are linear mixtures given by $p_t(x) = (1-t)\,p_0(x) + t\,p_1(x)$.
Now that we have dual flat structures sitting on the same space, a Pythagorean theorem falls out. When the e-geodesic from $q$ to $r$ hits the m-geodesic from $p$ to $q$ at right angles at $q$ we get

$$D_{\mathrm{KL}}(p\,\|\,r) = D_{\mathrm{KL}}(p\,\|\,q) + D_{\mathrm{KL}}(q\,\|\,r).$$
The KL divergence between members of an exponential family is a Bregman divergence on the cumulant potential $\psi$, and it reads

$$D_{\mathrm{KL}}\!\left(p_{\eta_1} \,\|\, p_{\eta_2}\right) = \psi(\eta_2) - \psi(\eta_1) - (\eta_2 - \eta_1)^\top \mu_1,$$

where $\mu_1 = \nabla\psi(\eta_1) = \mathbb{E}_{\eta_1}\!\left[T(x)\right]$.
If you write out $D_{\mathrm{KL}}(p\,\|\,q) + D_{\mathrm{KL}}(q\,\|\,r)$ and subtract $D_{\mathrm{KL}}(p\,\|\,r)$, a bunch of things cancel and what is left is

$$D_{\mathrm{KL}}(p\,\|\,q) + D_{\mathrm{KL}}(q\,\|\,r) - D_{\mathrm{KL}}(p\,\|\,r) = (\mu_p - \mu_q)^\top(\eta_r - \eta_q).$$

The vector $\mu_p - \mu_q$ is the tangent in dual coordinates along the m-geodesic from $q$ to $p$, and $\eta_r - \eta_q$ is the tangent in primal coordinates along the e-geodesic from $q$ to $r$. Orthogonality says their inner product wipes out and so $D_{\mathrm{KL}}(p\,\|\,r) = D_{\mathrm{KL}}(p\,\|\,q) + D_{\mathrm{KL}}(q\,\|\,r)$.
This gives you the projection theorem, which says that the m-projection of $p$ onto an e-flat submanifold is unique and satisfies the Pythagorean relation, and the MLE is just the m-projection of the empirical distribution onto the model manifold.
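Here is a minimal numerical check of the cross-term identity for the univariate Gaussian family, with hypothetical members $p$, $q$, $r$ and our own helper names `natural` and `expectation` for the two coordinate maps; the two sides agree to floating point, and both vanish exactly when the geodesics meet orthogonally at $q$.

```python
import jax.numpy as jnp

def kl_gauss(m0, s0, m1, s1):
    """Closed-form KL( N(m0, s0^2) || N(m1, s1^2) )."""
    return jnp.log(s1 / s0) + (s0**2 + (m0 - m1) ** 2) / (2 * s1**2) - 0.5

def natural(m, s):
    """Natural parameters eta for sufficient statistics T(x) = (x, x^2)."""
    return jnp.array([m / s**2, -1.0 / (2 * s**2)])

def expectation(m, s):
    """Expectation parameters mu = E[T(x)] = (E[x], E[x^2])."""
    return jnp.array([m, m**2 + s**2])

# Three hypothetical members p, q, r of the univariate Gaussian family, as (mean, std).
p, q, r = (0.0, 1.0), (1.0, 1.5), (2.0, 0.8)

lhs = kl_gauss(*p, *q) + kl_gauss(*q, *r) - kl_gauss(*p, *r)
rhs = jnp.dot(expectation(*p) - expectation(*q), natural(*r) - natural(*q))

# Equal up to floating point; both are zero exactly when the m-geodesic (q to p)
# and the e-geodesic (q to r) are orthogonal at q.
print(float(lhs), float(rhs))
```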
The uniform distribution and how regularity breaks
Breaking A2 changes the theory in a qualitative way. Take

$$p(x;\theta) = \frac{1}{\theta}\,\mathbf{1}\{0 \le x \le \theta\},$$

the uniform distribution on $[0,\theta]$.
The score does not have zero mean anymore and the whole derivation from the KL section falls apart, because $\partial_\theta \log p(x;\theta) = -\tfrac{1}{\theta}$ on the whole support, so $\mathbb{E}_\theta\!\left[\partial_\theta \log p\right] = -\tfrac{1}{\theta} \ne 0$, and Leibniz's rule picks up a boundary term $p(\theta;\theta) = \tfrac{1}{\theta}$ when you differentiate $\int_0^\theta p(x;\theta)\, dx$.
And the fallout is real. The Fisher information $\mathbb{E}_\theta\!\left[(\partial_\theta \log p)^2\right] = \tfrac{1}{\theta^2}$ looks finite on paper but the Cramer-Rao bound does not apply because A2 has failed. The MLE $\hat\theta = \max_i x_i$ has variance of order $n^{-2}$ and this beats the Fisher rate $n^{-1}$, because the support boundary holds information that a score-based story just cannot see.
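A quick simulation makes the rate visible. This is a sketch with hypothetical sample sizes and endpoint: the MLE is the sample maximum, and $n^2\,\mathrm{Var}(\hat\theta)$ stays roughly constant near $\theta^2$ as $n$ grows, which no estimator stuck at the $n^{-1}$ Fisher rate could manage.

```python
import jax.numpy as jnp
from jax import random

theta_true = 2.0   # hypothetical true endpoint of U(0, theta)
keys = random.split(random.PRNGKey(0), 3)

for key, n in zip(keys, (100, 1_000, 10_000)):
    # 1000 replications of an n-sample dataset from U(0, theta_true)
    samples = theta_true * random.uniform(key, (1_000, n))
    mle = jnp.max(samples, axis=1)          # the MLE is the sample maximum
    # n^2 * Var stays near theta^2 = 4: the estimator converges at rate n^-2.
    print(n, float(n**2 * jnp.var(mle)))
```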
Natural gradient descent
Say we want to minimize a loss $L(\theta)$ on $\Theta$. The naive update $\theta \leftarrow \theta - \eta\,\nabla_\theta L$ mashes covectors and vectors together because $\nabla_\theta L$ is a covector and the displacement $\delta\theta$ is a vector, and they do not live in the same space.
So instead we solve

$$\delta^* = \arg\min_{\delta}\; L(\theta + \delta) \quad \text{subject to} \quad D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta+\delta}\right) \le \epsilon,$$

and plugging in the quadratic approximation $D_{\mathrm{KL}} \approx \tfrac{1}{2}\,\delta^\top G(\theta)\,\delta$ (and linearizing $L$) gives the natural gradient update

$$\theta \leftarrow \theta - \eta\, G(\theta)^{-1}\,\nabla_\theta L(\theta).$$
Cramer-Rao as Cauchy-Schwarz on the tangent space
The Cramer-Rao bound has a clean geometric proof that is really just Cauchy-Schwarz on the tangent space.
The score $s(x;\theta) = \nabla_\theta \log p(x;\theta)$ sits in $L^2(p_\theta)$ and the Fisher information is its squared norm, $G(\theta) = \mathbb{E}_{p_\theta}\!\left[s\, s^\top\right]$.
For an unbiased estimator $\hat\theta(x)$ with $\mathbb{E}_\theta[\hat\theta] = \theta$ we work out the cross term $\mathbb{E}_\theta\!\left[\hat\theta\, s^\top\right]$. Using $\nabla_\theta\, p = p\, s$ and A4 we get

$$\mathbb{E}_\theta\!\left[\hat\theta\, s^\top\right] = \int \hat\theta(x)\,\nabla_\theta\, p(x;\theta)^\top\, d\mu = \nabla_\theta\, \mathbb{E}_\theta\!\left[\hat\theta\right] = I,$$

which equals $\mathrm{Cov}_\theta(\hat\theta, s)$ since the score has zero mean. And matrix Cauchy-Schwarz on this gives

$$\mathrm{Cov}_\theta\!\left(\hat\theta\right) \succeq \mathrm{Cov}_\theta\!\left(\hat\theta, s\right) G(\theta)^{-1}\, \mathrm{Cov}_\theta\!\left(s, \hat\theta\right) = G(\theta)^{-1}.$$
High curvature means large $G(\theta)$, and that says the distributions are easy to tell apart so estimators can be precise; low curvature means small $G(\theta)$ and the distributions look locally the same, so no estimator can do well.
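As a sanity check that the bound is attainable, here is a small Monte Carlo sketch with a hypothetical mean and a known $\sigma$: the sample mean is unbiased for the Gaussian mean and its variance sits right at the Cramer-Rao bound $\sigma^2/n$.

```python
import jax.numpy as jnp
from jax import random

mu_true, sigma, n = 1.0, 2.0, 50        # hypothetical setup; sigma is treated as known
key = random.PRNGKey(0)

# 20,000 independent datasets of size n from N(mu_true, sigma^2)
x = mu_true + sigma * random.normal(key, (20_000, n))
mu_hat = jnp.mean(x, axis=1)            # the sample mean is unbiased for mu

print(float(jnp.var(mu_hat)))           # observed estimator variance, ~0.08
print(sigma**2 / n)                     # Cramer-Rao bound 1/(n G) = sigma^2/n = 0.08
```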
Singular models and Watanabe’s RLCT
When identifiability fails and A1 goes out the window, the Fisher matrix goes singular and we get

$$\det G(\theta^*) = 0$$

at the non-identifiable parameters $\theta^*$.
The manifold dimension collapses at these points and the tangent space drops to a tangent cone, and in deep learning singular Fisher matrices are just the default rather than the exception.
Watanabe (2009) showed that near singularities the Bayesian posterior does not settle down to a Gaussian, and the standard BIC complexity term $\frac{d}{2}\log n$ gets replaced by $\lambda \log n$, where $\lambda$ is the Real Log Canonical Threshold (RLCT) and satisfies

$$0 < \lambda \le \frac{d}{2},$$

with equality in the regular case.
Singular models are simpler than their parameter count makes them look. The posterior volume is set by resolution of singularities and that is a question for algebraic geometry and not Riemannian geometry. Standard information geometry stops working here and you have to pick up Watanabe’s singular learning theory instead.
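A toy singular model makes the rank collapse visible. The sketch below is our own example (not from Watanabe): $x \sim \mathcal{N}(ab, 1)$ with parameters $(a, b)$, so only the product $ab$ is identifiable, every score vector is proportional to $(b, a)$, and the Monte Carlo Fisher matrix has one eigenvalue pinned at zero.

```python
import jax
import jax.numpy as jnp
from jax import random

def log_pdf(theta, x):
    """Singular model: x ~ N(a * b, 1) with theta = (a, b). Only a*b is identifiable."""
    a, b = theta
    return -0.5 * jnp.log(2 * jnp.pi) - 0.5 * (x - a * b) ** 2

def fisher_mc(theta, key, n=100_000):
    mean = theta[0] * theta[1]
    x = mean + random.normal(key, (n,))
    scores = jax.vmap(jax.grad(log_pdf), in_axes=(None, 0))(theta, x)
    return scores.T @ scores / n

G = fisher_mc(jnp.array([1.0, 2.0]), random.PRNGKey(0))
print(G)                          # approx [[b^2, ab], [ab, a^2]] = [[4, 2], [2, 1]]
print(jnp.linalg.eigvalsh(G))     # one eigenvalue near 5, the other at ~0: rank deficient
```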
Implementation (JAX)
We check e- and m-geodesic orthogonality and put Natural Gradient up against SGD on a warped Gaussian to see how they converge.
```python
import jax
import jax.numpy as jnp
from jax import random, grad, jit, vmap, lax
from jax.scipy.stats import multivariate_normal
from typing import NamedTuple, Tuple
# ------------------------------------------------------------------
# SYSTEM CONFIGURATION
# ------------------------------------------------------------------
SEED = 42
LEARNING_RATE = 0.1
NUM_STEPS = 100
DAMPING = 1e-4
# ------------------------------------------------------------------
# 1. Manifold Definition: Warped Gaussian
# ------------------------------------------------------------------
def get_sigma(theta: jax.Array) -> jax.Array:
"""
Constructs the covariance matrix Sigma(theta) = diag(theta^2 + 1).
Ensures positive definiteness everywhere.
"""
return jnp.diag(theta**2 + 1.0)
def log_likelihood(theta: jax.Array, x: jax.Array) -> jax.Array:
""" Computes sum of log-likelihoods for data x given theta. """
mu = theta
cov = get_sigma(theta)
return jnp.sum(multivariate_normal.logpdf(x, mu, cov))
# ------------------------------------------------------------------
# 2. Fisher Information Computation
# ------------------------------------------------------------------
@jit
def compute_fisher_mc(theta: jax.Array, key: jax.Array, num_samples: int = 1000) -> jax.Array:
"""
Approximates the Fisher Information Matrix using Monte Carlo integration.
G(theta) = E[score * score^T]
"""
cov = get_sigma(theta)
# Sampling from the model distribution at theta
samples = random.multivariate_normal(key, theta, cov, shape=(num_samples,))
def score_fn(t, x_single):
return grad(lambda p: multivariate_normal.logpdf(x_single, p, get_sigma(p)))(t)
# Vectorized score computation
scores = vmap(lambda x: score_fn(theta, x))(samples)
# Outer product expectation
outer_products = vmap(lambda s: jnp.outer(s, s))(scores)
return jnp.mean(outer_products, axis=0)
# ------------------------------------------------------------------
# 3. Optimization Loop (JIT-Compiled Scan)
# ------------------------------------------------------------------
def loss_fn(theta: jax.Array, batch: jax.Array) -> jax.Array:
""" Negative Log Likelihood Loss. """
return -log_likelihood(theta, batch) / batch.shape[0]
class OptState(NamedTuple):
theta_sgd: jax.Array
theta_ngd: jax.Array
key: jax.Array
@jit
def update_step_sgd(theta: jax.Array, batch: jax.Array) -> jax.Array:
grads = grad(loss_fn)(theta, batch)
return theta - LEARNING_RATE * grads
@jit
def update_step_ngd(theta: jax.Array, batch: jax.Array, key: jax.Array) -> jax.Array:
grads = grad(loss_fn)(theta, batch)
fisher = compute_fisher_mc(theta, key)
# Natural Gradient: G^-1 * grad
# Numerically stable solve: (G + damping * I) * update = grad
regularized_fisher = fisher + DAMPING * jnp.eye(fisher.shape[0])
nat_grad = jnp.linalg.solve(regularized_fisher, grads)
return theta - LEARNING_RATE * nat_grad
@jit
def run_experiment() -> Tuple[jax.Array, jax.Array]:
""" Fully compiled training loop using lax.scan. """
key = random.PRNGKey(SEED)
key, subkey_data, subkey_init = random.split(key, 3)
# Ground Truth
true_theta = jnp.array([2.0, 3.0])
data = random.multivariate_normal(
subkey_data, true_theta, get_sigma(true_theta), shape=(500,)
)
# Initialization
theta_0 = jnp.array([0.5, 0.5])
init_state = OptState(theta_sgd=theta_0, theta_ngd=theta_0, key=subkey_init)
def step_fn(state: OptState, _):
key, subkey_ngd = random.split(state.key)
# Parallel updates
next_sgd = update_step_sgd(state.theta_sgd, data)
next_ngd = update_step_ngd(state.theta_ngd, data, subkey_ngd)
new_state = OptState(theta_sgd=next_sgd, theta_ngd=next_ngd, key=key)
# Record trajectories
return new_state, (next_sgd, next_ngd)
# Execute simulation
final_state, (path_sgd, path_ngd) = lax.scan(step_fn, init_state, None, length=NUM_STEPS)
    return path_sgd, path_ngd
```

Implications for optimization
Pulling all the theory together we land on a handful of concrete takeaways.
- The Fisher metric is the unique valid choice (Chentsov).
- Exponential families are both e-flat and m-flat (Amari).
- The Cramer-Rao bound is Cauchy-Schwarz on the space of scores in $L^2(p_\theta)$.
- Deep learning lives in the singular regime where $G(\theta)$ is rank-deficient (Watanabe).
- The natural gradient is the unique first-order update that respects the geometry.
A quick word on Adam. It approximates $G(\theta)$ with a diagonal second-moment estimate $\hat v \approx \mathbb{E}[g \odot g]$ and then divides by its square root, and that has no clean geometric story behind it. Natural gradient scales by $G^{-1}$ with units $[\text{gradient}]^{-2}$, so the step $G^{-1}\nabla L$ carries the units of $\theta$; Adam scales by $\hat v^{-1/2}$ with units $[\text{gradient}]^{-1}$, so its step is dimensionless. In practice Adam is much closer to sign descent, where you just normalize gradient magnitudes per coordinate, and not really Riemannian steepest descent at all. Adam works, but for reasons that are mostly beside the curvature correction story.
Without information geometry, optimization is running around in a coordinate system that has no real meaning attached to it.
Timeline
| Year | Event | Significance |
|---|---|---|
| 1945 | C.R. Rao | Introduces Fisher Information Metric (Riemannian). |
| 1972 | N. Chentsov | Proves Uniqueness Theorem for the metric. |
| 1979 | B. Efron | Defines statistical curvature. |
| 1985 | Shun-ichi Amari | Develops Dualistic Geometry ($\alpha$-connections). |
| 1998 | Amari | Proposes Natural Gradient Descent. |
| 2009 | Sumio Watanabe | Singular Learning Theory (Algebraic Geometry of learning). |
| 2014 | Pascanu & Bengio | Revisits Natural Gradient for Neural Networks. |
Legendre Duality
The exponential and mixture duality is really just a Legendre transform.
The potential $\psi(\eta)$ is the cumulant generating function and it reads

$$\psi(\eta) = \log \int h(x)\, e^{\eta^\top T(x)}\, d\mu(x),$$

and its Legendre conjugate is

$$\varphi(\mu) = \sup_{\eta}\left\{\eta^\top \mu - \psi(\eta)\right\}.$$
The sup gets hit where $\nabla_\eta\!\left(\eta^\top \mu - \psi(\eta)\right) = 0$, which says $\mu = \nabla\psi(\eta) = \mathbb{E}_\eta\!\left[T(x)\right]$, and this is the map that takes you from natural parameters to expectation parameters.
$\varphi(\mu)$ is negative entropy up to constants and the inverse map goes the other way as $\eta = \nabla\varphi(\mu)$.
The Hessians give you the metric in each coordinate system,

$$G(\eta) = \nabla^2\psi(\eta), \qquad G^*(\mu) = \nabla^2\varphi(\mu),$$
and these two are inverses of each other up to coordinate-change Jacobians, and that confirms the Riemannian structure is consistent across the dual representations.
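Worked out for the Bernoulli family (success probability equal to the expectation parameter $\mu$, natural parameter $\eta$), the whole dictionary fits on three lines:

$$\psi(\eta) = \log\!\left(1 + e^\eta\right), \qquad \mu = \psi'(\eta) = \frac{e^\eta}{1+e^\eta},$$

$$\varphi(\mu) = \mu\log\mu + (1-\mu)\log(1-\mu), \qquad \eta = \varphi'(\mu) = \log\frac{\mu}{1-\mu},$$

$$\psi''(\eta) = \mu(1-\mu), \qquad \varphi''(\mu) = \frac{1}{\mu(1-\mu)},$$

so the two Hessians are literal reciprocals of each other, exactly as the duality promises.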
Fisher vs. Wasserstein
People mix these two up all the time but they actually live on different spaces.
Fisher geometry sits on parameter space and is about densities, and Wasserstein geometry sits on the sample space lifted up to measures and it needs a ground metric that Fisher geometry does not need at all.
Their geodesics are fundamentally different. The Fisher geodesic from the e-connection interpolates multiplicatively as

$$p_t(x) \propto p_0(x)^{1-t}\, p_1(x)^{t},$$

and the Wasserstein geodesic just slides mass around horizontally as

$$p_t = \left((1-t)\,\mathrm{id} + t\,T\right)_{\#}\, p_0,$$

where $T$ is the optimal transport map from $p_0$ to $p_1$.
Try interpolating two unit-variance Gaussians with well-separated means, $\mathcal{N}(\mu_0, 1)$ and $\mathcal{N}(\mu_1, 1)$, and see what happens. The Fisher m-geodesic runs through a bimodal mixture in the middle and the Wasserstein geodesic just slides the bump over smoothly and hits $\mathcal{N}\!\left(\tfrac{\mu_0+\mu_1}{2}, 1\right)$ at the halfway point.
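Here is a minimal sketch of that comparison, with our own choice of endpoints: evaluate both midpoint densities on a grid and compare the value at the centre to the peak.

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

# Two unit-variance Gaussians with well-separated means (hypothetical choice of endpoints).
mu0, mu1, sigma = -4.0, 4.0, 1.0
xs = jnp.linspace(-8.0, 8.0, 1001)

# Fisher m-geodesic at t = 1/2: the equal-weight mixture of the endpoints.
mix_mid = 0.5 * norm.pdf(xs, mu0, sigma) + 0.5 * norm.pdf(xs, mu1, sigma)

# Wasserstein geodesic at t = 1/2: for equal-variance Gaussians the displacement
# interpolation is just the translated Gaussian N((mu0 + mu1)/2, sigma^2).
w2_mid = norm.pdf(xs, 0.5 * (mu0 + mu1), sigma)

centre = xs.shape[0] // 2
print(float(mix_mid[centre]), float(mix_mid.max()))  # centre value << peak: two bumps
print(float(w2_mid[centre]), float(w2_mid.max()))    # centre value == peak: one bump
```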
Which one you reach for depends on the question you are asking. Fisher geometry fits inference and asks how much a sample tells you about , and Wasserstein fits transport and asks what it costs to push one distribution into another. Fisher comes out of the entropy Hessian and Wasserstein comes out of kinetic energy minimization from Benamou-Brenier.
Vocabulary
An affine connection sets up parallel transport and covariant derivatives on a manifold and tells you how to compare tangent vectors at different points. The Fisher Information Metric is the only Riemannian metric on families of probability distributions (up to scale) that stays invariant under sufficient statistics, and it falls out of the Hessian of KL divergence. The natural gradient is the steepest descent direction that actually accounts for the curvature of the statistical manifold instead of pretending parameter space is flat. A statistical manifold is a family of probability distributions equipped with the Fisher metric and the $\alpha$-connections. Dual connections are a pair $(\nabla, \nabla^*)$ that satisfy the compatibility condition $X\,g(Y,Z) = g(\nabla_X Y, Z) + g(Y, \nabla^*_X Z)$ with respect to the metric. The Kullback-Leibler divergence is the asymmetric divergence whose second-order Taylor expansion spits out the Fisher metric, and even though it is not a distance function its local geometry is Riemannian.
References
1. Amari, S., & Nagaoka, H. (2000). “Methods of Information Geometry”. The standard reference. It sets up $\alpha$-connections and dually flat spaces and applications to estimation.
2. Chentsov, N. N. (1972). “Statistical Decision Rules and Optimal Inference”. Proves the uniqueness of the Fisher metric from Markov invariance.
3. Rao, C. R. (1945). “Information and the accuracy attainable in the estimation of statistical parameters”. The original paper that proposed the Riemannian metric.
4. Watanabe, S. (2009). “Algebraic Geometry and Statistical Learning Theory”. The foundation of Singular Learning Theory and it handles the place where regular information geometry falls over in neural networks.
5. Martens, J. (2014). “New insights and perspectives on the natural gradient method”. A modern take on why Natural Gradient works for deep learning (K-FAC).