Information Geometry & Natural Gradients

1. Problem Formulation

Let $(\Omega, \mathcal{F}, \mu)$ be a measure space. Consider a parametric family of probability measures $\mathcal{S} = \{ P_\xi \mid \xi \in \Xi \}$ dominated by $\mu$, where $\Xi \subseteq \mathbb{R}^d$ is an open set of parameters. We denote the Radon-Nikodym derivatives (probability densities) by:

$$p(x; \xi) = \frac{dP_\xi}{d\mu}(x)$$

Our objective is to define a geometric structure (a Riemannian metric $g$ and an affine connection $\nabla$) on the manifold $\mathcal{S}$ that satisfies intrinsic invariance.

The tension arises from the arbitrary nature of the parameterization $\xi$. In standard Euclidean optimization (e.g., Gradient Descent), we implicitly assume the distance between $P_\xi$ and $P_{\xi + \delta \xi}$ is $\|\delta \xi\|_2$. This is physically unjustified: a change of coordinates $\xi \to \phi(\xi)$ distorts the distance metric. Furthermore, the Euclidean distance is not invariant to the geometry of the sample space $\Omega$.

We require a divergence functional $D[P : Q]$ such that the induced metric structure is:

  1. Invariant to Reparameterization: $ds^2(\xi, \xi + d\xi)$ is a scalar invariant.
  2. Invariant to Sufficient Statistics: If $T: \Omega \to \Omega'$ is a sufficient statistic for $\xi$, then the geometry of $\mathcal{S}$ must be identical to the geometry of the induced family on $\Omega'$.
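This failure of invariance is easy to see numerically. The following sketch (an illustrative example; the KL objective, starting point, and step size are arbitrary choices) takes one plain gradient step on the same loss over the same Gaussian family, once in $(\mu, \sigma)$ coordinates and once in $(\mu, \log\sigma)$ coordinates, and lands on two different distributions:

```python
import jax
import jax.numpy as jnp

def kl_to_target(mu, sigma, mu_t=0.0, sigma_t=1.0):
    # Closed-form KL( N(mu, sigma^2) || N(mu_t, sigma_t^2) )
    return (jnp.log(sigma_t / sigma)
            + (sigma**2 + (mu - mu_t)**2) / (2 * sigma_t**2) - 0.5)

lr = 0.1
mu0, sigma0 = 1.0, 2.0

# Chart 1: xi = (mu, sigma)
g1 = jax.grad(lambda xi: kl_to_target(xi[0], xi[1]))(jnp.array([mu0, sigma0]))
new1 = jnp.array([mu0, sigma0]) - lr * g1                 # updated (mu, sigma)

# Chart 2: xi = (mu, log sigma) -- same family, different coordinates
g2 = jax.grad(lambda xi: kl_to_target(xi[0], jnp.exp(xi[1])))(
    jnp.array([mu0, jnp.log(sigma0)]))
new2 = jnp.array([mu0, jnp.log(sigma0)]) - lr * g2
new2 = new2.at[1].set(jnp.exp(new2[1]))                   # map back to (mu, sigma)

print(new1, new2)  # different distributions after one "identical" gradient step
```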

2. The Tools (Definitions and Regularity)

To ensure the existence of the Fisher Information and the validity of the Taylor expansions below, we explicitly state the required regularity conditions. Note that “standard” assumptions in machine learning often violate these (e.g., ReLU networks leading to singular Hessians, or uniform distributions violating support independence).

Assumption A1 (Identifiability): The map $\xi \mapsto P_\xi$ is injective. That is, $\xi \neq \xi' \implies P_\xi \neq P_{\xi'}$ on a set of non-zero measure. Failure Mode: In neural networks, permutation symmetry of neurons violates this locally. Overparameterization violates this globally (manifolds of equivalent solutions).

Assumption A2 (Common Support): The support of the density, $\text{supp}(P_\xi) = \{ x \in \Omega \mid p(x; \xi) > 0 \}$, is independent of $\xi$. Failure Mode: The uniform distribution $U[0, \xi]$. The support boundary depends on $\xi$, making the likelihood non-differentiable.

Assumption A3 (Smoothness): The log-likelihood function $\ell(\xi; x) = \log p(x; \xi)$ is $k$-times differentiable with respect to $\xi$, where $k \ge 3$.

Assumption A4 (Regularity of Integration): Differentiation with respect to $\xi$ and integration with respect to $\mu$ commute. Specifically:

$$\nabla_\xi \int_{\Omega} p(x; \xi) \, d\mu(x) = \int_{\Omega} \nabla_\xi p(x; \xi) \, d\mu(x)$$

This assumes the score function exists and is uniformly integrable.

3. Derivation of the Metric

We define the geometry locally via the Kullback-Leibler divergence $D_{KL}(\xi \| \xi + \delta \xi)$ as $\delta \xi \to 0$. We do not assume this is a metric distance a priori; KL is not symmetric and fails the triangle inequality. However, its second-order Taylor expansion induces a quadratic form.

$$D_{KL}(\xi \| \xi') = \int p(x; \xi) \log \frac{p(x; \xi)}{p(x; \xi')} \, d\mu(x)$$

Let $\xi' = \xi + \delta \xi$. Expand $\ell(x; \xi') = \log p(x; \xi')$ around $\xi$:

$$\ell(\xi + \delta \xi) = \ell(\xi) + (\nabla \ell)^T \delta \xi + \frac{1}{2} \delta \xi^T (\nabla^2 \ell) \delta \xi + O(\|\delta \xi\|^3)$$

Substituting this into the KL definition:

$$D_{KL} \approx \mathbb{E}_{\xi} \left[ \ell(\xi) - \left( \ell(\xi) + (\nabla \ell)^T \delta \xi + \frac{1}{2} \delta \xi^T (\nabla^2 \ell) \delta \xi \right) \right]$$

$$D_{KL} \approx -\mathbb{E}_{\xi} [ (\nabla \ell)^T ] \delta \xi - \frac{1}{2} \delta \xi^T \mathbb{E}_{\xi} [ \nabla^2 \ell ] \delta \xi$$

Step 3.1: The Vanishing Linear Term. We must verify that $\mathbb{E}_\xi [\nabla \ell(x; \xi)] = 0$.

$$\mathbb{E}_\xi [\nabla \log p(x; \xi)] = \int p(x; \xi) \frac{\nabla p(x; \xi)}{p(x; \xi)} \, d\mu(x) = \int \nabla p(x; \xi) \, d\mu(x)$$

By Assumption A4, we exchange derivative and integral:

$$\int \nabla p(x; \xi) \, d\mu(x) = \nabla \int p(x; \xi) \, d\mu(x) = \nabla(1) = 0$$

Thus, the first-order term vanishes. This is necessary for $D_{KL}$ to have a local minimum at $\delta \xi = 0$.

Step 3.2: The Quadratic Form. We are left with the Hessian of the log-likelihood:

$$D_{KL} \approx \frac{1}{2} \delta \xi^T \left( -\mathbb{E}_{\xi} [ \nabla^2 \ell ] \right) \delta \xi$$

We invoke the identity linking the Hessian to the outer product of scores. Differentiating the score identity $\int (\nabla \log p) \, p \, d\mu = 0$ again:

$$\nabla \int (\nabla \ell) \, p \, d\mu = \int \left( (\nabla^2 \ell) p + (\nabla \ell)(\nabla p)^T \right) d\mu = 0$$

Using $\nabla p = p \, \nabla \ell$:

$$\int \left( \nabla^2 \ell + (\nabla \ell)(\nabla \ell)^T \right) p \, d\mu = 0 \quad\Longrightarrow\quad \mathbb{E}[\nabla^2 \ell] + \mathbb{E}[(\nabla \ell)(\nabla \ell)^T] = 0$$

Thus, the Fisher Information Matrix $G(\xi)$ is defined equivalently as:

$$G_{ij}(\xi) = \mathbb{E} \left[ \frac{\partial \ell}{\partial \xi_i} \frac{\partial \ell}{\partial \xi_j} \right] = - \mathbb{E} \left[ \frac{\partial^2 \ell}{\partial \xi_i \partial \xi_j} \right]$$

The local squared distance is given by the quadratic form $ds^2 = \delta \xi^T G(\xi) \, \delta \xi$ (so $D_{KL} \approx \tfrac{1}{2} ds^2$). This defines a Riemannian metric on $\mathcal{S}$.
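Both characterizations of $G$ can be checked with automatic differentiation. A minimal sketch, assuming a Bernoulli model with logit parameter (an illustrative choice): the score outer product and the negative expected Hessian both evaluate to $p(1-p)$.

```python
import jax
import jax.numpy as jnp

def log_p(theta, x):
    p = jax.nn.sigmoid(theta)
    return x * jnp.log(p) + (1 - x) * jnp.log(1 - p)

theta = 0.7
p = jax.nn.sigmoid(theta)
xs = jnp.array([0.0, 1.0])
weights = jnp.array([1 - p, p])          # exact expectation over the two outcomes

score = jax.vmap(jax.grad(log_p), in_axes=(None, 0))(theta, xs)
hess  = jax.vmap(jax.grad(jax.grad(log_p)), in_axes=(None, 0))(theta, xs)

G_outer = jnp.sum(weights * score**2)    # E[(d log p)^2]
G_hess  = -jnp.sum(weights * hess)       # -E[d^2 log p]
print(G_outer, G_hess, p * (1 - p))      # all three agree
```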

4. Uniqueness: Chentsov’s Theorem

Is this the only valid metric? Chentsov (1972) proved that the Fisher Information Metric is the unique Riemannian metric (up to a scaling factor) that is invariant under congruent embeddings by Markov morphisms.

Construct: Let $F: \Omega \to \Omega'$ be a measurable map (statistic). This induces a mapping from measures on $\Omega$ to measures on $\Omega'$. If $F$ is a sufficient statistic, no information is lost: the distance between $P_\xi$ and $P_{\xi+\delta\xi}$ must be identical to the distance between their images under $F$. Standard Euclidean distance fails this. The Fisher metric, being defined by the covariance of the score, inherently respects sufficiency:

$$g_{ij}^{(T(X))}(\theta) = g_{ij}^{(X)}(\theta) \quad \text{iff } T \text{ is sufficient.}$$

5. Dualistic Geometry and Affine Connections

A metric $g$ allows us to measure lengths and angles. To define “straight lines” (geodesics) and discuss flatness, we need an affine connection $\nabla$. Standard Riemannian geometry uses the Levi-Civita connection $\nabla^{(0)}$, which is determined uniquely by the conditions:

  1. Metric compatibility: $\nabla g = 0$
  2. Torsion-freeness.

In Statistical Manifolds, however, we naturally encounter a family of connections $\nabla^{(\alpha)}$ parameterized by $\alpha \in \mathbb{R}$.

The α-Connection. The Christoffel symbols of the first kind for the $\alpha$-connection are defined as:

$$\Gamma_{ijk}^{(\alpha)} = \mathbb{E} \left[ \left( \partial_i \partial_j \ell + \frac{1-\alpha}{2} (\partial_i \ell)(\partial_j \ell) \right) (\partial_k \ell) \right]$$

This definition simplifies using the Skewness Tensor $T_{ijk} = \mathbb{E}[(\partial_i \ell)(\partial_j \ell)(\partial_k \ell)]$. Differentiating the metric, $\partial_k g_{ij}$, and remembering that the density inside the expectation also depends on the parameter:

$$\partial_k g_{ij} = \mathbb{E}[ \partial_k ( (\partial_i \ell)(\partial_j \ell) ) ] + \mathbb{E}[(\partial_i \ell)(\partial_j \ell)(\partial_k \ell)] = \mathbb{E}[(\partial_k \partial_i \ell)(\partial_j \ell)] + \mathbb{E}[(\partial_i \ell)(\partial_k \partial_j \ell)] + T_{ijk}$$

From the definition, $\Gamma_{ijk}^{(\alpha)} = \mathbb{E}[(\partial_i \partial_j \ell)(\partial_k \ell)] + \frac{1-\alpha}{2} T_{ijk}$; comparing the general case with $\alpha = 0$ gives:

$$\Gamma_{ijk}^{(\alpha)} = \Gamma_{ijk}^{(0)} - \frac{\alpha}{2} T_{ijk}$$

where $\Gamma^{(0)}$ is the Levi-Civita connection.

Duality: Two connections $\nabla$ and $\nabla^*$ are said to be dual with respect to the metric $g$ if for all vector fields $X, Y, Z$:

$$X \langle Y, Z \rangle_g = \langle \nabla_X Y, Z \rangle_g + \langle Y, \nabla_X^* Z \rangle_g$$

Theorem: The $\alpha$-connection and the $(-\alpha)$-connection are dual. Specifically, the Exponential Connection ($\alpha=1$) and the Mixture Connection ($\alpha=-1$) are duals.

6. Case Study: The Hyperbolic Geometry of the Normal Family

We now apply our tools to the most fundamental object in statistics: the Univariate Gaussian. We derive the Riemannian structure directly.

Consider the manifold $\mathcal{S} = \{ N(\mu, \sigma^2) \mid \mu \in \mathbb{R}, \sigma > 0 \}$. The density is:

$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} \exp \left( - \frac{(x-\mu)^2}{2\sigma^2} \right)$$

The log-likelihood $\ell(x; \mu, \sigma)$ is:

$$\ell = -\frac{1}{2} \log(2\pi) - \log \sigma - \frac{(x-\mu)^2}{2\sigma^2}$$

Step 6.1: The Score Function. We compute the partial derivatives (scores) with respect to the coordinates $\xi^1 = \mu$, $\xi^2 = \sigma$.

$$\partial_\mu \ell = \frac{x-\mu}{\sigma^2}, \qquad \partial_\sigma \ell = -\frac{1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3}$$

Step 6.2: The Fisher Information Matrix. We compute the elements of $G_{ij} = \mathbb{E}[(\partial_i \ell)(\partial_j \ell)]$.

Element $g_{\mu\mu}$:

$$g_{\mu\mu} = \mathbb{E} \left[ \left( \frac{x-\mu}{\sigma^2} \right)^2 \right] = \frac{1}{\sigma^4} \mathbb{E}[(x-\mu)^2]$$

Since $\mathbb{E}[(x-\mu)^2] = \text{Var}(x) = \sigma^2$:

$$g_{\mu\mu} = \frac{\sigma^2}{\sigma^4} = \frac{1}{\sigma^2}$$

Element $g_{\mu\sigma}$:

$$g_{\mu\sigma} = \mathbb{E} \left[ \left( \frac{x-\mu}{\sigma^2} \right) \left( \frac{(x-\mu)^2}{\sigma^3} - \frac{1}{\sigma} \right) \right]$$

This involves $\mathbb{E}[(x-\mu)^3]$ (skewness) and $\mathbb{E}[(x-\mu)]$ (mean). For a Gaussian, odd central moments vanish, so:

$$g_{\mu\sigma} = 0$$

This implies the parameters $\mu$ and $\sigma$ are orthogonal in the Riemannian sense.

Element $g_{\sigma\sigma}$:

$$g_{\sigma\sigma} = \mathbb{E} \left[ \left( \frac{(x-\mu)^2}{\sigma^3} - \frac{1}{\sigma} \right)^2 \right] = \frac{1}{\sigma^6} \mathbb{E}[(x-\mu)^4] - \frac{2}{\sigma^4} \mathbb{E}[(x-\mu)^2] + \frac{1}{\sigma^2}$$

Recall the Gaussian moments $\mathbb{E}[(x-\mu)^2] = \sigma^2$ and $\mathbb{E}[(x-\mu)^4] = 3\sigma^4$:

$$g_{\sigma\sigma} = \frac{3\sigma^4}{\sigma^6} - \frac{2\sigma^2}{\sigma^4} + \frac{1}{\sigma^2} = \frac{3}{\sigma^2} - \frac{2}{\sigma^2} + \frac{1}{\sigma^2} = \frac{2}{\sigma^2}$$

Thus, the Fisher Information Matrix is:

$$G(\mu, \sigma) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix}$$
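As a numerical cross-check (a sketch; the parameter values and sample size are arbitrary), the same matrix can be estimated from the Monte Carlo score outer product:

```python
import jax
import jax.numpy as jnp
from jax import random
from jax.scipy.stats import norm

def log_p(params, x):
    mu, sigma = params
    return norm.logpdf(x, loc=mu, scale=sigma)

mu, sigma = 1.0, 2.0
key = random.PRNGKey(0)
xs = mu + sigma * random.normal(key, (200_000,))    # samples from N(mu, sigma^2)

scores = jax.vmap(jax.grad(log_p), in_axes=(None, 0))(jnp.array([mu, sigma]), xs)
G_mc = scores.T @ scores / xs.shape[0]              # E[score score^T]

print(G_mc)                                         # ~ [[0.25, 0], [0, 0.5]] = diag(1/s^2, 2/s^2)
```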

Step 6.3: The Riemannian Metric and Distance. The line element $ds^2$ is:

$$ds^2 = \frac{d\mu^2 + 2\,d\sigma^2}{\sigma^2}$$

This closely resembles the metric of the Poincaré Upper Half-Plane model of Hyperbolic Geometry, $ds^2 = \frac{dx^2 + dy^2}{y^2}$. The factor of 2 indicates a difference in curvature scaling.

Step 6.4: Geodesics. To find the shortest paths (geodesics) between distributions $N(\mu_1, \sigma_1)$ and $N(\mu_2, \sigma_2)$, we solve the Euler-Lagrange equations for the functional $L = \int \frac{\sqrt{\dot{\mu}^2 + 2\dot{\sigma}^2}}{\sigma} \, dt$.

The geodesics correspond to:

  1. Vertical lines: If $\mu_1 = \mu_2$, the path is simply a scaling of the variance.
  2. Semi-ellipses: If $\mu_1 \neq \mu_2$, the geodesics are semi-ellipses centered on the $\mu$-axis.

This confirms that the manifold of Gaussian distributions has constant negative curvature. We are not operating in a flat space; we are operating in a hyperbolic space. Traditional Euclidean averaging of the parameters $(\bar{\mu}, \bar{\sigma})$ is not the geometric center (Fréchet mean) of the distributions.

7. The Exponential Family (1-flatness)

Consider the exponential family in canonical parameters $\theta$:

$$p(x; \theta) = \exp\left( \theta^i F_i(x) - \psi(\theta) \right), \qquad \ell(x; \theta) = \theta^i F_i(x) - \psi(\theta)$$

We analyze the curvature.

$$\partial_i \ell = F_i(x) - \partial_i \psi(\theta), \qquad \partial_i \partial_j \ell = -\partial_i \partial_j \psi(\theta)$$

The second derivative is constant with respect to $x$. This is the crucial property. Substitute into the definition of $\Gamma_{ijk}^{(1)}$ (the e-connection):

$$\Gamma_{ijk}^{(1)} = \mathbb{E} [ \partial_i \partial_j \ell \cdot \partial_k \ell ]$$

Since $\partial_i \partial_j \ell$ is deterministic (independent of $x$), it comes out of the expectation:

$$\Gamma_{ijk}^{(1)} = (\partial_i \partial_j \ell) \, \mathbb{E}[\partial_k \ell]$$

By Step 3.1, $\mathbb{E}[\partial_k \ell] = 0$. Therefore:

$$\Gamma_{ijk}^{(1)} = 0$$

Conclusion: The exponential family manifold is flat under the e-connection. The parameters $\theta$ form an affine coordinate system, and geodesics are straight lines in $\theta$: $\theta(t) = (1-t)\theta_1 + t\theta_2$.
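To make the affine statement concrete, the sketch below (an assumed example using univariate Gaussians in their natural parameters $\theta = (\mu/\sigma^2, -1/(2\sigma^2))$) interpolates linearly in $\theta$ and obtains a valid Gaussian at every $t$:

```python
import jax.numpy as jnp

def to_natural(mu, sigma2):
    return jnp.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

def from_natural(theta):
    sigma2 = -1.0 / (2.0 * theta[1])
    return theta[0] * sigma2, sigma2           # (mu, sigma^2)

theta_a = to_natural(0.0, 1.0)
theta_b = to_natural(10.0, 4.0)

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    theta_t = (1 - t) * theta_a + t * theta_b  # straight line in theta
    print(t, from_natural(theta_t))            # a valid (mu, sigma^2) at every t
```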

8. The Mixture Family (-1-flatness)

Consider the mixture family:

$$p(x; \eta) = \Big(1 - \sum_i \eta_i\Big) p_0(x) + \sum_i \eta_i p_i(x)$$

This manifold is flat under the m-connection ($\alpha = -1$). The expectation parameters $\eta$ form the affine coordinate system. Geodesics are linear mixtures: $P(t) = (1-t)P_1 + tP_2$.

The Generalized Pythagorean Theorem: Since $\nabla^{(1)}$ and $\nabla^{(-1)}$ are dually flat, if we have a triangle $P, Q, R$ in which the m-geodesic $PQ$ is orthogonal to the e-geodesic $QR$ at $Q$, then:

DKL(PR)=DKL(PQ)+DKL(QR)D_{KL}(P \| R) = D_{KL}(P \| Q) + D_{KL}(Q \| R)

Proof: We provide a derivation below. Let $\theta$ be the $e$-affine coordinates (natural parameters) and $\eta$ be the $m$-affine coordinates (expectation parameters). The points $P, Q, R$ have natural coordinates $\theta_P, \theta_Q, \theta_R$ and expectation coordinates $\eta_P, \eta_Q, \eta_R$.

Let the curve connecting $Q$ and $R$ be an e-geodesic. In the $\theta$-coordinate system, this is a straight line:

$$\theta(t) = (1-t)\theta_Q + t\theta_R$$

The tangent vector at $Q$ is $\dot{\theta} = \theta_R - \theta_Q$.

Let the curve connecting $P$ and $Q$ be an m-geodesic. In the $\eta$-coordinate system, this is a straight line (note that $\eta$ and $\theta$ are dual coordinate systems), so its tangent vector at $Q$ is best expressed in the dual coordinates as $\eta_P - \eta_Q$.

Consider the KL divergence definition between members of an exponential family:

$$D_{KL}(P \| R) = \psi(\theta_R) - \psi(\theta_P) - \eta_P \cdot (\theta_R - \theta_P)$$

Note that $D_{KL}(P\|R)$ corresponds to the Bregman divergence of the convex potential $\psi$:

$$D_{KL}(P_{\theta_P} \| P_{\theta_R}) = \psi(\theta_R) - \psi(\theta_P) - \nabla \psi(\theta_P) \cdot (\theta_R - \theta_P) = \psi(\theta_R) - \psi(\theta_P) - \eta_P \cdot (\theta_R - \theta_P)$$

This uses the dual relation $\eta_P = \nabla \psi(\theta_P)$.

Now expand the RHS terms:

$$D_{KL}(P\|Q) + D_{KL}(Q\|R) = [\psi(\theta_Q) - \psi(\theta_P) - \eta_P(\theta_Q - \theta_P)] + [\psi(\theta_R) - \psi(\theta_Q) - \eta_Q(\theta_R - \theta_Q)]$$

Summing the brackets:

$$= \psi(\theta_R) - \psi(\theta_P) - \eta_P(\theta_Q - \theta_P) - \eta_Q(\theta_R - \theta_Q)$$

We want this to equal $D_{KL}(P\|R) = \psi(\theta_R) - \psi(\theta_P) - \eta_P(\theta_R - \theta_P)$.

The difference $\Delta$ is:

$$\Delta = (D_{KL}(P\|Q) + D_{KL}(Q\|R)) - D_{KL}(P\|R) = -\eta_P(\theta_Q - \theta_P) - \eta_Q(\theta_R - \theta_Q) + \eta_P(\theta_R - \theta_P)$$

Grouping by $\eta$:

$$\Delta = \eta_P(\theta_R - \theta_P - \theta_Q + \theta_P) - \eta_Q(\theta_R - \theta_Q) = \eta_P(\theta_R - \theta_Q) - \eta_Q(\theta_R - \theta_Q) = (\eta_P - \eta_Q) \cdot (\theta_R - \theta_Q)$$

For the Pythagorean theorem to hold ($\Delta = 0$), we require:

$$(\eta_P - \eta_Q) \cdot (\theta_R - \theta_Q) = 0$$

Geometric Interpretation:

  • $\eta_P - \eta_Q$: Change in the dual parameter along the path $P \to Q$.
  • $\theta_R - \theta_Q$: Change in the primal parameter along the path $Q \to R$.

If $P \to Q$ is an m-geodesic, then $\eta$ changes linearly, so $\eta_P - \eta_Q$ is the tangent vector (in $\eta$-space). If $Q \to R$ is an e-geodesic, then $\theta$ changes linearly, so $\theta_R - \theta_Q$ is the tangent vector (in $\theta$-space).

Thus, if the m-geodesic $PQ$ is orthogonal to the e-geodesic $QR$, the divergence splits. Orthogonality here means that the dot product of the tangent vectors, expressed in the dual coordinate systems, is 0; this is exactly the Fisher inner product at $Q$. This justifies the Projection Theorem: the m-projection of $P$ onto an e-flat submanifold is unique and satisfies the Pythagorean relation.

Application: The Maximum Likelihood Estimator (MLE) is the m-projection of the empirical distribution $\hat{P}_{\text{data}}$ onto the model manifold $\mathcal{S}$.
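The three-point identity is easy to verify numerically. A sketch, assuming a 2-D product-Bernoulli family and points chosen so that $(\eta_P - \eta_Q) \cdot (\theta_R - \theta_Q) = 0$:

```python
import jax.numpy as jnp
from jax import nn

def psi(theta):
    return jnp.sum(nn.softplus(theta))          # log-partition of a product Bernoulli

def eta(theta):
    return nn.sigmoid(theta)                    # expectation parameters

def kl(theta_a, theta_b):
    # D_KL(P_a || P_b) = psi(theta_b) - psi(theta_a) - eta_a . (theta_b - theta_a)
    return psi(theta_b) - psi(theta_a) - eta(theta_a) @ (theta_b - theta_a)

theta_Q = jnp.array([0.3, -0.5])
theta_P = jnp.array([0.3,  1.2])   # differs from Q only in coord 2 -> eta_P - eta_Q along e2
theta_R = jnp.array([1.7, -0.5])   # differs from Q only in coord 1 -> theta_R - theta_Q along e1

lhs = kl(theta_P, theta_R)
rhs = kl(theta_P, theta_Q) + kl(theta_Q, theta_R)
print(lhs, rhs)   # equal, since (eta_P - eta_Q) . (theta_R - theta_Q) = 0
```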

9. Pathologies: The Uniform Boundary

Violating Assumption A2 leads to pathologies. Consider the uniform distribution $p(x; \theta) = \frac{1}{\theta} \, \mathbb{I}(0 \le x \le \theta)$, i.e., $U[0, \theta]$.

$$\ell(x; \theta) = -\log \theta, \qquad \nabla_\theta \ell = -\frac{1}{\theta}$$

Observe that $\mathbb{E}[\nabla \ell] = \int_0^\theta \left(-\frac{1}{\theta}\right) \frac{1}{\theta} \, dx = -\frac{1}{\theta} \neq 0$. This violates the zero-score condition, and the derivation in Section 3 collapses. Why? Because $\frac{d}{d\theta} \int_0^\theta p \, dx \neq \int_0^\theta \nabla p \, dx$: the Leibniz integral rule picks up a boundary term $p(\theta; \theta) \cdot 1 = 1/\theta$, so that $\nabla(1) = \int_0^\theta \nabla p \, dx + p(\theta; \theta) = -1/\theta + 1/\theta = 0$. Our tools must be adjusted to include boundary terms.

Fisher Information Singularity: $G = \mathbb{E}[(\nabla \ell)^2] = \mathbb{E}[1/\theta^2] = 1/\theta^2$. While this appears finite, the regularity conditions for the Cramer-Rao bound ($\text{Var}(\hat{\theta}) \ge 1/(nG)$) require A2/A4. Since A2 fails, Cramer-Rao does not apply. The MLE is $\hat{\theta} = \max_i(X_i)$, and its variance scales as $O(n^{-2})$, faster than the $O(n^{-1})$ rate predicted by Fisher. This phenomenon, known as “super-efficiency,” violates the geometric intuition: the manifold has a boundary that carries information.
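A quick Monte Carlo illustration of the super-efficient rate (a sketch; the sample sizes and trial counts are arbitrary choices):

```python
import jax.numpy as jnp
from jax import random

theta, trials = 1.0, 20_000
key = random.PRNGKey(0)
for n in (10, 100, 1000):
    key, sub = random.split(key)
    x = theta * random.uniform(sub, (trials, n))     # n draws from U[0, theta], repeated
    mle = x.max(axis=1)                              # MLE: max(X_i)
    print(n, float(jnp.var(mle)), 1.0 / (n * (1.0 / theta**2)))  # empirical variance vs naive 1/(nG)
```

The empirical variance shrinks roughly like $1/n^2$, far below the naive $1/(nG) = \theta^2/n$ figure.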

10. Natural Gradient Descent

We perform optimization on $\mathcal{S}$. We wish to minimize a loss $\mathcal{L}(\theta)$. The straightforward update $\theta_{\text{new}} = \theta - \eta \nabla \mathcal{L}$ is geometrically invalid because $\nabla \mathcal{L}$ is a covariant vector (1-form), while $\Delta \theta$ is a contravariant vector. They cannot be added.

We formulate the update as:

$$\min_{\delta \theta} \mathcal{L}(\theta + \delta \theta) \quad \text{subject to} \quad D_{KL}(\theta \,\|\, \theta + \delta \theta) = \epsilon$$

Approximating $D_{KL} \approx \frac{1}{2} \delta \theta^T G \, \delta \theta$:

$$\min_v \; v^T \nabla \mathcal{L} \quad \text{s.t.} \quad v^T G v = 2\epsilon$$

Solving with a Lagrange multiplier ($\nabla \mathcal{L} + 2\lambda G v = 0$, hence $v \propto -G^{-1} \nabla \mathcal{L}$) yields the Natural Gradient update:

$$\delta \theta = -\eta \, G^{-1}(\theta) \, \nabla \mathcal{L}(\theta)$$

11. Implementation (JAX)

We compare the convergence of Natural Gradient descent against SGD on a warped Gaussian landscape, $N(\theta, \mathrm{diag}(\theta^2 + 1))$.

```python
# ------------------------------------------------------------------
# FISHER INFORMATION MANIFOLD
# MODEL: N(theta, diag(theta^2 + 1))
# VISUALIZATION: TISSOT INDICATRIX (EXACT)
# ------------------------------------------------------------------
import jax
import jax.numpy as jnp
from jax import random, grad, jit, vmap, lax
from jax.scipy.stats import multivariate_normal
from typing import NamedTuple, Tuple

# ------------------------------------------------------------------
# SYSTEM CONFIGURATION
# ------------------------------------------------------------------
SEED = 42
LEARNING_RATE = 0.1
NUM_STEPS = 100
DAMPING = 1e-4

# ------------------------------------------------------------------
# 1. Manifold Definition: Warped Gaussian
# ------------------------------------------------------------------
def get_sigma(theta: jax.Array) -> jax.Array:
    """Constructs the covariance matrix Sigma(theta) = diag(theta^2 + 1).

    Ensures positive definiteness everywhere.
    """
    return jnp.diag(theta**2 + 1.0)

def log_likelihood(theta: jax.Array, x: jax.Array) -> jax.Array:
    """Computes the sum of log-likelihoods for data x given theta."""
    mu = theta
    cov = get_sigma(theta)
    return jnp.sum(multivariate_normal.logpdf(x, mu, cov))

# ------------------------------------------------------------------
# 2. Fisher Information Computation
# ------------------------------------------------------------------
@jit
def compute_fisher_mc(theta: jax.Array, key: jax.Array,
                      num_samples: int = 1000) -> jax.Array:
    """Approximates the Fisher Information Matrix using Monte Carlo integration.

    G(theta) = E[score * score^T]
    """
    cov = get_sigma(theta)
    # Sampling from the model distribution at theta
    samples = random.multivariate_normal(key, theta, cov, shape=(num_samples,))

    def score_fn(t, x_single):
        return grad(lambda p: multivariate_normal.logpdf(x_single, p, get_sigma(p)))(t)

    # Vectorized score computation
    scores = vmap(lambda x: score_fn(theta, x))(samples)
    # Outer product expectation
    outer_products = vmap(lambda s: jnp.outer(s, s))(scores)
    return jnp.mean(outer_products, axis=0)

# ------------------------------------------------------------------
# 3. Optimization Loop (JIT-Compiled Scan)
# ------------------------------------------------------------------
def loss_fn(theta: jax.Array, batch: jax.Array) -> jax.Array:
    """Negative Log Likelihood Loss."""
    return -log_likelihood(theta, batch) / batch.shape[0]

class OptState(NamedTuple):
    theta_sgd: jax.Array
    theta_ngd: jax.Array
    key: jax.Array

@jit
def update_step_sgd(theta: jax.Array, batch: jax.Array) -> jax.Array:
    grads = grad(loss_fn)(theta, batch)
    return theta - LEARNING_RATE * grads

@jit
def update_step_ngd(theta: jax.Array, batch: jax.Array, key: jax.Array) -> jax.Array:
    grads = grad(loss_fn)(theta, batch)
    fisher = compute_fisher_mc(theta, key)
    # Natural Gradient: G^-1 * grad
    # Numerically stable solve: (G + damping * I) * update = grad
    regularized_fisher = fisher + DAMPING * jnp.eye(fisher.shape[0])
    nat_grad = jnp.linalg.solve(regularized_fisher, grads)
    return theta - LEARNING_RATE * nat_grad

@jit
def run_experiment() -> Tuple[jax.Array, jax.Array]:
    """Fully compiled training loop using lax.scan."""
    key = random.PRNGKey(SEED)
    key, subkey_data, subkey_init = random.split(key, 3)

    # Ground Truth
    true_theta = jnp.array([2.0, 3.0])
    data = random.multivariate_normal(
        subkey_data, true_theta, get_sigma(true_theta), shape=(500,)
    )

    # Initialization
    theta_0 = jnp.array([0.5, 0.5])
    init_state = OptState(theta_sgd=theta_0, theta_ngd=theta_0, key=subkey_init)

    def step_fn(state: OptState, _):
        key, subkey_ngd = random.split(state.key)
        # Parallel updates
        next_sgd = update_step_sgd(state.theta_sgd, data)
        next_ngd = update_step_ngd(state.theta_ngd, data, subkey_ngd)
        new_state = OptState(theta_sgd=next_sgd, theta_ngd=next_ngd, key=key)
        # Record trajectories
        return new_state, (next_sgd, next_ngd)

    # Execute simulation
    final_state, (path_sgd, path_ngd) = lax.scan(step_fn, init_state, None, length=NUM_STEPS)
    return path_sgd, path_ngd
```

12. The Cramer-Rao Bound (Geometric Interpretation)

The Cramer-Rao Lower Bound (CRLB) is the fundamental limit of frequentist inference. Usually derived via algebraic manipulation of covariance, it is geometrically the Cauchy-Schwarz inequality on the Statistical Manifold.

Consider an unbiased estimator $\hat{\theta}(X)$ for $\theta$. Let $v \in T_\theta \mathcal{S}$ be a tangent vector. The score function $S_\theta(x) = \nabla_\theta \ell(x; \theta)$ lives in the Hilbert space $L^2(P_\theta)$. The Fisher Information is the Gram matrix of the score components in this space: $G(\theta) = \mathbb{E}[S_\theta S_\theta^T]$.

Consider the covariance between the estimator error $\hat{\theta} - \theta$ and the score $S_\theta$:

$$\text{Cov}(\hat{\theta}, S_\theta) = \mathbb{E}[(\hat{\theta} - \theta) S_\theta^T]$$

Using the identity $\nabla p = p \, S_\theta$:

$$= \int (\hat{\theta}(x) - \theta) \, \nabla p(x; \theta) \, d\mu(x)$$

Using unbiasedness, $\nabla \int \hat{\theta} \, p \, d\mu = \nabla \theta = I$, together with the zero-score identity $\int \nabla p \, d\mu = 0$:

$$= I - \theta \cdot 0 = I$$

(Assuming A4 holds).

Now apply the standard matrix inequality for covariance matrices:

$$\text{Cov}(A, B) \, \text{Var}(B)^{-1} \, \text{Cov}(A, B)^T \le \text{Var}(A)$$

Here $A = \hat{\theta}$, $B = S_\theta$, and $\text{Var}(B) = G(\theta)$.

$$I \, G(\theta)^{-1} I^T \le \text{Var}(\hat{\theta}) \quad\Longrightarrow\quad \text{Var}(\hat{\theta}) \ge G(\theta)^{-1}$$

Interpretation: The variance of any unbiased estimator is bounded below by the inverse of the Fisher metric.

  • Large $G$ (the metric expands distances) $\to$ nearby distributions are far apart $\to$ easy to distinguish $\to$ low variance.
  • Small $G$ (the metric contracts distances) $\to$ nearby distributions are similar $\to$ hard to distinguish $\to$ high variance.
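A small sanity check of the bound (an assumed example: estimating the mean of a unit-variance Gaussian, where $G = 1$ per sample and the sample mean attains the bound):

```python
import jax.numpy as jnp
from jax import random

theta, n, trials = 0.5, 50, 100_000
key = random.PRNGKey(1)
x = theta + random.normal(key, (trials, n))      # unit-variance Gaussian samples
estimates = x.mean(axis=1)                       # unbiased estimator of theta
print(float(jnp.var(estimates)), 1.0 / n)        # empirical variance vs G^{-1}/n = 1/n
```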

13. Singular Learning Theory (The Geometry of Degeneracy)

What happens when Assumption A1 (Identifiability) fails? This is the situation in Deep Learning: a neural network with permutable nodes is non-identifiable, and the Fisher matrix develops singularities on the set

$$\Theta_{\text{sing}} = \{ \theta \in \Theta \mid \det(G(\theta)) = 0 \}$$

At these points, the manifold dimension collapses. The “Tangent Space” is no longer a vector space; it is a tangent cone.

Watanabe’s Discovery: Sumio Watanabe (2009) proved that in singular regions, the Bayesian posterior does not converge to the usual $N(0, G^{-1})$ asymptotics. Instead of the standard asymptotic expansion of the free energy:

$$F_n = -\log Z_n \approx n L_{\min} + \frac{d}{2} \log n$$

the complexity term $\frac{d}{2} \log n$ is replaced by $\lambda \log n$, where $\lambda$ is the Real Log Canonical Threshold (RLCT), and

$$\lambda < \frac{d}{2}$$

This means singular models are less complex than their parameter count suggests. Geometrically, the volume of the posterior contraction is determined by the resolution of singularities in algebraic geometry. Standard Information Geometry (Riemannian) fails here. We require Singular Information Geometry.

14. Conclusion: From Geometry to Optimization

We have established:

  1. Strict Construction: The Fisher metric arises uniquely from invariance requirements (Chentsov).
  2. Dual Structure: The manifold is simultaneously $e$-flat and $m$-flat (Amari).
  3. Fundamental Bounds: The Cramer-Rao bound is the Cauchy-Schwarz inequality on $T_\theta \mathcal{S}$.
  4. Singularity: Modern Deep Learning lives in the breakdown of this theory (Singular Learning Theory), where $G$ is rank-deficient.
  5. Operationalization: The Natural Gradient $G^{-1} \nabla \mathcal{L}$ is the only type-safe first-order optimization step.

The Critique of Adam: Adaptive methods like Adam approximate $G$ by a diagonal matrix $\text{diag}(\sqrt{\mathbb{E}[g^2]})$. Strictly, this is dimensionally inconsistent. $G$ is a $(0, 2)$-tensor, and its square root is not well-defined geometrically as a pre-conditioner in this way. Standard Natural Gradient scales by $G^{-1}$ (units $1/g^2$); Adam scales by $G^{-1/2}$ (units $1/g$). This implies Adam is not approximating curvature; it is normalizing magnitude. It operates on a different heuristic (Sign Descent) rather than Riemannian steepest descent.

Final Thought: Information Geometry provides the rigorous language to discuss optimization in probability space. Without it, we are merely adjusting knobs in a coordinate system that doesn’t exist.


Historical Timeline

Year | Event | Significance
1945 | C. R. Rao | Introduces the Fisher Information Metric (Riemannian).
1972 | N. Chentsov | Proves the uniqueness theorem for the metric.
1979 | B. Efron | Defines statistical curvature.
1985 | Shun-ichi Amari | Develops Dualistic Geometry (α-connections).
1998 | Amari | Proposes Natural Gradient Descent.
2009 | Sumio Watanabe | Singular Learning Theory (algebraic geometry of learning).
2014 | Pascanu & Bengio | Revisited Natural Gradient for Neural Networks.

Appendix A: The Legendre Duality

The duality between Exponential and Mixture families is an instance of Legendre Transformation in convex analysis.

Let $\psi(\theta)$ be the convex function (cumulant generating function) defining the exponential family:

$$\psi(\theta) = \log \int \exp(\theta \cdot F(x)) \, d\mu(x)$$

The dual potential $\phi(\eta)$ is the Legendre conjugate of $\psi$:

$$\phi(\eta) = \sup_{\theta} \{ \theta \cdot \eta - \psi(\theta) \}$$

The supremum is attained at the point where the gradient matches: $\eta = \nabla \psi(\theta)$. This mapping $\theta \mapsto \eta$ is the coordinate transformation from natural parameters to expectation parameters, $\eta_i = \mathbb{E}[F_i(x)]$.

The function $\phi(\eta)$ turns out to be the negative entropy (plus constants). Convex duality implies $\theta = \nabla \phi(\eta)$; thus the transformation between the coordinate systems is given by the gradient of a convex potential. The Hessian of each potential is the metric in the corresponding coordinates:

$$G_{ij}(\theta) = \frac{\partial^2 \psi}{\partial \theta_i \partial \theta_j}, \qquad G^{ij}(\eta) = \frac{\partial^2 \phi}{\partial \eta_i \partial \eta_j}$$

The matrices $G(\theta)$ and $G(\eta)$ are inverses of each other (up to the coordinate-change Jacobian). This confirms that the Riemannian structure is consistent across the dual representations.
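These relations can be verified with autodiff. A minimal sketch, assuming the 1-D Bernoulli family, where $\psi(\theta) = \log(1 + e^\theta)$ and $\phi(\eta)$ is the negative binary entropy:

```python
import jax
import jax.numpy as jnp

psi = lambda theta: jnp.log1p(jnp.exp(theta))                        # log-partition
phi = lambda eta: eta * jnp.log(eta) + (1 - eta) * jnp.log(1 - eta)  # negative entropy

theta = 0.8
eta = jax.grad(psi)(theta)                              # eta = psi'(theta) = sigmoid(theta)
print(eta, jax.nn.sigmoid(theta))
print(jax.grad(phi)(eta), theta)                        # dual map recovers theta
print(jax.grad(jax.grad(psi))(theta) * jax.grad(jax.grad(phi))(eta))  # Hessians multiply to 1
```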


Appendix B: Fisher vs. Wasserstein

Optimal Transport (Wasserstein Metric) is increasingly used for loss functions. How does it compare to Fisher geometry?

1. The Objects:

  • Fisher Information describes the geometry of the Parameter Space $\mathcal{S}$. It is defined on the manifold of densities.
  • Wasserstein Distance describes the geometry of the Sample Space $\Omega$ lifted to measures. It depends on the ground metric $d_\Omega(x, y)$.

2. The Geodesics:

  • Fisher Geodesic (e-connection): $\log p_t(x) = (1-t) \log p_0(x) + t \log p_1(x) - \psi(t)$. This is a multiplicative interpolation. Example: interpolating $N(0, 1)$ and $N(10, 1)$, the intermediate passes through $N(5, 1)$ if we stay in the Gaussian family, but the mixture distribution (m-geodesic) is bimodal.

  • Wasserstein Geodesic (Displacement): $T(x) = x + t \nabla \phi(x)$. This is a horizontal displacement of probability mass. Example: $N(0, 1) \to N(10, 1)$. The density physically slides across the axis; $N(5, 1)$ is the midpoint.

3. When to use which?

  • Fisher: When you care about inference. How much information does a sample $x$ give about $\theta$?
  • Wasserstein: When you care about mass transport. How much work does it take to morph image A into image B?

The geometric distinction is categorical: Fisher comes from the entropy Hessian (Dualistic). Wasserstein comes from the Kinetic Energy minimization (Benamou-Brenier).
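A small sketch of the contrast in item 2 above (the endpoint Gaussians and the grid are illustrative choices): the mixture midpoint is bimodal, while the e-geodesic and displacement midpoints both concentrate at $x = 5$.

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

x = jnp.linspace(-5.0, 15.0, 2001)
p0, p1 = norm.pdf(x, 0.0, 1.0), norm.pdf(x, 10.0, 1.0)
dx = x[1] - x[0]

m_mid = 0.5 * p0 + 0.5 * p1                   # m-geodesic midpoint: bimodal mixture
e_mid = jnp.sqrt(p0 * p1)
e_mid = e_mid / (jnp.sum(e_mid) * dx)         # e-geodesic midpoint: normalized geometric mean
w_mid = norm.pdf(x, 5.0, 1.0)                 # Wasserstein displacement midpoint

mid_idx = jnp.argmin(jnp.abs(x - 5.0))
print(m_mid[mid_idx], e_mid[mid_idx], w_mid[mid_idx])
# the mixture is nearly zero at x = 5 (mass stays near the endpoints),
# while the e-midpoint and displacement midpoint both peak there.
```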


Appendix C: Glossary of Definitions

  • Affine Connection: Geometric object defining parallel transport and derivatives.
  • Fisher Information Metric: The unique Riemannian metric on probability manifolds.
  • Natural Gradient: Steepest descent direction accounting for curvature ($G^{-1} \nabla \mathcal{L}$).
  • Statistical Manifold: A family of probability distributions equipped with geometric structure.
  • Dual Connections: Pair of connections ($\nabla, \nabla^*$) satisfying the duality condition w.r.t. the metric.
  • Kullback-Leibler Divergence: The canonical divergence generating the Fisher metric.

References

1. Amari, S., & Nagaoka, H. (2000). “Methods of Information Geometry”. The Bible of the field. Defines α\alpha-connections, dually flat spaces, and applications to estimation.

2. Chentsov, N. N. (1972). “Statistical Decision Rules and Optimal Inference”. Proved the uniqueness of the Fisher metric based on Markov invariance.

3. Rao, C. R. (1945). “Information and the accuracy attainable in the estimation of statistical parameters”. The original paper proposing the Riemannian metric.

4. Watanabe, S. (2009). “Algebraic Geometry and Statistical Learning Theory”. The foundation of Singular Learning Theory, handling the breakdown of regular information geometry in neural networks.

5. Martens, J. (2014). “New insights and perspectives on the natural gradient method”. A modern analysis of why Natural Gradient works for deep learning (K-FAC).