Random Fields & Non-Convex Optimization

Gradient descent in non-convex landscapes

Minimizing a non-convex function $L(\theta)$ in $N$ dimensions is NP-hard in general, and yet gradient descent trains huge neural networks to near-zero training error. Why does practice beat the worst case by so much?

In two dimensions non-convex landscapes really are hard because local minima pile up and gradient methods get stuck. High dimensions are a different animal. As $N$ grows, the local minima crowd in near the global minimum, while critical points at higher energy have negative Hessian eigenvalues and turn into saddle points that gradient descent slips out of.

We can count critical points at a given energy level.

$$\text{Crit}_N(E) = \# \{ \theta \in \mathbb{S}^{N-1} : \nabla L(\theta) = 0,\ L(\theta) \in [E, E+dE] \}$$

When $\log \text{Crit}_N(E) \sim N \Sigma(E)$, the number of critical points is exponential in $N$. The complexity function $\Sigma(E)$ and the Hessian index distribution at each energy level together run the show for landscape structure. The key pattern is that as $E$ drops, the Morse index (the number of negative Hessian eigenvalues) drops right alongside it. High-energy critical points are saddles with many negative eigenvalues; true local minima with index zero only show up near the global minimum.

We nail this down using the Kac-Rice formula for isotropic Gaussian random fields on the sphere.

Setup

Let $f: \mathbb{S}^{N-1}(\sqrt{N}) \to \mathbb{R}$ be a centered Gaussian random field. We look at its critical points and excursion sets $A_u = \{ \theta \mid f(\theta) \ge u \}$. What we care about is the geometry of the optimization surface itself: the random landscape viewed as a function on a Riemannian manifold.

The covariance kernel

$$\mathbb{E}[f(\theta) f(\theta')] = C(\theta, \theta')$$

is taken to be isotropic, so the correlation depends only on the overlap $q = \frac{1}{N} \langle \theta, \theta' \rangle$:

$$C(\theta, \theta') = N \xi(q)$$

For the pure $p$-spin model, which is the standard setting in spherical spin glass theory,

$$f_p(\theta) = \frac{1}{N^{(p-1)/2}} \sum_{i_1, \dots, i_p = 1}^N J_{i_1 \dots i_p} \theta_{i_1} \cdots \theta_{i_p}$$

the covariance is $\xi(q) = q^p$. The polynomial covariance structure makes the Hessian spectrum workable in the large-$N$ limit.
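As a sanity check, the covariance identity $\mathbb{E}[f_p(\theta) f_p(\theta')] = N q^p$ can be verified by Monte Carlo for small $N$. A minimal sketch (the unsymmetrized i.i.d. Gaussian tensor used here yields the same covariance as a symmetrized one; the overlap $q = 0.6$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, trials = 10, 3, 20_000

# Two points on the sphere of radius sqrt(N) with a controlled overlap q = 0.6
theta1 = rng.normal(size=N)
theta1 *= np.sqrt(N) / np.linalg.norm(theta1)
v = rng.normal(size=N)
v -= (v @ theta1 / N) * theta1               # orthogonalize against theta1
v *= np.sqrt(N) / np.linalg.norm(v)
q = 0.6
theta2 = q * theta1 + np.sqrt(1 - q ** 2) * v

def f_p(theta, J):
    """Pure p-spin (p = 3) with an unsymmetrized iid Gaussian tensor J."""
    return np.einsum('ijk,i,j,k->', J, theta, theta, theta) / N ** ((p - 1) / 2)

samples = np.array([
    (f_p(theta1, J), f_p(theta2, J))
    for J in (rng.normal(size=(N, N, N)) for _ in range(trials))
])

emp_cov = np.mean(samples[:, 0] * samples[:, 1])
print(emp_cov, N * q ** p)                   # should agree up to MC error
```

The same run also checks $\text{Var}(f) = N\xi(1) = N$ via the second moment of $f_p(\theta_1)$.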

Because the state space $\mathbb{S}^{N-1}$ is compact, we work directly on the manifold. The gradient gets projected onto $T_\theta \mathbb{S}^{N-1}$:

$$\nabla_{\mathbb{S}} f(\theta) = \nabla f(\theta) - \frac{\langle \nabla f(\theta), \theta \rangle}{\|\theta\|^2} \theta$$

The Riemannian Hessian picks up a curvature correction from the second fundamental form.

$$\nabla^2_{\mathbb{S}} f(\theta) = \text{Proj} \left[ \nabla^2 f(\theta) - \frac{\langle \nabla f, \theta \rangle}{N} I_N \right]$$

At critical points this curvature term contributes $-pu \cdot I$ with $u = E/N$ (Euler's identity gives $\langle \nabla f, \theta \rangle = pf$ for the degree-$p$ model): when the energy is very negative it pushes a big positive constant onto every Hessian eigenvalue and makes low-energy critical points more convex. This falls straight out of the sphere's extrinsic curvature.
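These projections are mechanical to implement with autodiff. A small sketch (the cubic test function below is an arbitrary smooth stand-in, not the $p$-spin Hamiltonian) checks that the projected gradient really is tangent to the sphere:

```python
import jax
import jax.numpy as jnp

N = 50
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
A = jax.random.normal(k1, (N, N))

def f(x):
    # Arbitrary smooth test landscape (a random cubic form), NOT the p-spin
    return jnp.dot(x, A @ x) * jnp.sum(x) / N

theta = jax.random.normal(k2, (N,))
theta = theta * jnp.sqrt(N) / jnp.linalg.norm(theta)   # point on sphere radius sqrt(N)

g = jax.grad(f)(theta)
# Project onto the tangent space T_theta S^{N-1}
g_sphere = g - (jnp.dot(g, theta) / jnp.dot(theta, theta)) * theta

residual = float(jnp.dot(g_sphere, theta))
print(residual)   # ~0 up to float32 roundoff: tangent to the sphere
```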


Kac-Rice formula

We want $\mathcal{N}_k(E)$, the expected number of index-$k$ critical points at energy $E$. The Kac-Rice formula gives

$$\mathbb{E}[\text{Crit}(E)] = \int_{\mathbb{S}^{N-1}} \mathbb{E} \left[ | \det(\nabla^2 f(\theta)) | \, \big| \, \nabla f(\theta) = 0, f(\theta) = E \right] p_{f, \nabla f}(E, 0) \, d\theta$$

and isotropy kills the $\theta$-dependence and cuts this down to

$$\text{Vol}(\mathbb{S}^{N-1}) \cdot p_{f, \nabla f}(E, 0) \cdot \mathbb{E} [ | \det H | \mid \nabla f=0, f=E ]$$

where $\text{Vol}(\mathbb{S}^{N-1}) = \frac{2 \pi^{N/2}}{\Gamma(N/2)}$.
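The one-dimensional ancestor of this formula is easy to check numerically: for a stationary Gaussian process with covariance $\rho(\tau)$, Rice's formula predicts $\frac{1}{\pi}\sqrt{-\rho''(0)/\rho(0)}$ zero-crossings per unit length. A sketch with a random Fourier wave (the frequency content $1, \dots, 5$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
K, T, n_grid, trials = 5, 200.0, 200_000, 100
omegas = np.arange(1, K + 1)                 # arbitrary frequencies

t = np.linspace(0.0, T, n_grid)
cos_mat, sin_mat = np.cos(np.outer(t, omegas)), np.sin(np.outer(t, omegas))

counts = []
for _ in range(trials):
    a, b = rng.normal(size=K), rng.normal(size=K)
    x = (cos_mat @ a + sin_mat @ b) / np.sqrt(K)   # covariance rho(0) = 1
    counts.append(np.sum(np.sign(x[:-1]) != np.sign(x[1:])))

empirical_rate = np.mean(counts) / T
rice_rate = np.sqrt(np.mean(omegas ** 2)) / np.pi  # (1/pi) sqrt(-rho''(0))
print(empirical_rate, rice_rate)             # agree up to MC error
```

Here $\rho(\tau) = \frac{1}{K}\sum_k \cos(\omega_k \tau)$, so $-\rho''(0)$ is just the mean squared frequency.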

Joint density. The pair $(f(\theta), \nabla f(\theta))$ is a centered Gaussian vector, and at $q=1$ you get

$$\text{Var}(f) = N \xi(1) = N, \qquad \text{Var}(\nabla f) = \text{diag}(\xi'(1), \dots, \xi'(1))$$

assuming $\xi(1)=1$. For isotropic models $f$ and $\nabla f$ come out uncorrelated, and so

$$p(E, 0) = \frac{1}{\sqrt{2\pi N}} e^{-E^2 / 2N} \cdot \left( \frac{1}{\sqrt{2\pi \xi'(1)}} \right)^{N-1}$$

Conditional Hessian. This is where random matrix theory walks in. Conditioned on $f(\theta) = E$, the Hessian follows a shifted GOE:

$$\nabla^2_{\mathbb{S}} f(\theta) \stackrel{dist}{=} \sqrt{N \xi''(1)} \, M_{GOE} - \frac{\xi'(1)}{\xi(1)} \frac{E}{\sqrt{N}} \, I_N$$

where $M_{GOE}$ is a standard Gaussian orthogonal ensemble matrix. The eigenvalue distribution of $M_{GOE}$ heads toward the Wigner semicircle as $N \to \infty$:

$$\rho_{sc}(\lambda) = \frac{1}{2\pi} \sqrt{4 - \lambda^2} \quad \text{on } [-2, 2]$$

The Hessian eigenvalues follow this semicircle shifted by $-\epsilon$, with $\epsilon = \frac{\xi'(1)}{\xi(1)\sqrt{\xi''(1)}} \frac{E}{N}$ in the units of the semicircle above.

For the $p$-spin Hamiltonian the energy is extensive, of order $N$. Write $u = E/N$ for the intensive energy; the scaled Hessian $h = \frac{1}{\sqrt{N}} \nabla^2 f$ then has eigenvalues following a semicircle of radius $2\sqrt{\xi''(1)} = 2\sqrt{p(p-1)}$ centered at $-\frac{\xi'(1)}{\xi(1)} u = -pu$.

Computing the determinant. We need $\mathbb{E}[|\det H|]$, so we take logs:

$$\frac{1}{N} \log |\det H| \approx \int \log |x| \, \rho(x) \, dx$$

This is the annealed approximation. The complexity is

$$\Sigma(u) = \lim_{N \to \infty} \frac{1}{N} \log \mathbb{E}[\text{Crit}(Nu)]$$

and pulling it all together you get

$$\Sigma(u) = \frac{1}{2} \log(p-1) - \frac{u^2}{2} + \int \log|\lambda - \alpha u| \, \rho_{sc}(\lambda) \, d\lambda$$

where $\alpha$ is a model-dependent curvature constant.

The saddle-to-minimum transition

The integral $\int \log|\lambda - \text{shift}| \, \rho_{sc} \, d\lambda$ behaves nicely, but the topology of the critical points changes when the shift pushes the semicircle all the way onto one half-line.
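"Behaves nicely" is the standard log-potential of the semicircle: inside the bulk, $\int \log|\lambda - z| \, \rho_{sc}(\lambda) \, d\lambda = \frac{z^2}{4} - \frac{1}{2}$, which follows from the principal-value Stieltjes transform $\mathrm{P.V.}\int \frac{\rho_{sc}(\lambda)}{z - \lambda} d\lambda = \frac{z}{2}$ for $|z| \le 2$. A quadrature sketch:

```python
import numpy as np

edges = np.linspace(-2.0, 2.0, 400_001)
lam = (edges[:-1] + edges[1:]) / 2           # midpoints dodge the log singularity
w = np.diff(edges)
rho = np.sqrt(4.0 - lam ** 2) / (2.0 * np.pi)

results = {}
for z in [0.0, 0.7, -1.3]:
    U_num = np.sum(np.log(np.abs(lam - z)) * rho * w)
    results[z] = (U_num, z ** 2 / 4.0 - 0.5)  # numeric vs closed form

for z, (num, closed) in results.items():
    print(z, num, closed)
```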

At high energy, where $u \approx 0$, there is no shift: the spectrum sits on $[-2,2]$, roughly half the eigenvalues are negative, and the critical points are saddles of high index.

At low energy, where $u \ll 0$, the shift is positive, and once the shift passes $2$ the whole spectrum sits in $\mathbb{R}^+$ and every critical point is a local minimum.

The threshold energy $E_{th}$ is set by the condition that the left edge of the shifted semicircle hits zero.

$$-2 + \text{shift}(E_{th}) = 0$$

Auffinger, Ben Arous, and Cerny worked this out in 2013 for the pure $p$-spin:

$$E_{th} = - E_{\infty} = -2\sqrt{\frac{p-1}{p}}$$

where $E_{\infty}$ is the limit of the hierarchy of critical values $E_k$ in Auffinger-Ben Arous-Cerny; $-E_{\infty}$ sits just above the ground state energy (for $p=3$: $E_{th} \approx -1.633$ versus $E_{gs} \approx -1.657$).

Above $E_{th}$ the critical points are saddles and local minima are exponentially suppressed; below $E_{th}$ the local minima pile up; and at the ground state energy $E_{gs}$ the complexity goes back to zero. Gradient descent does not get stuck at high energies because the curvature structure stops traps from forming up there.

Moments of the GOE and Catalan numbers

The Wigner semicircle law can be pulled out from the moment sequence of the GOE alone and the combinatorics are worth spelling out.

Work out the moments $m_k = \mathbb{E} \left[ \frac{1}{N} \text{Tr}(M^k) \right]$. Odd moments die by symmetry, and for even moments

$$m_{2k} = \frac{1}{N} \sum_{i_1, \dots, i_{2k}} \mathbb{E} [ M_{i_1 i_2} M_{i_2 i_3} \cdots M_{i_{2k} i_1} ]$$

and by Wick’s theorem only pairings of the matrix entries contribute. At leading order in $N$ the surviving pairings are the non-crossing ones: draw the indices around a circle and no two chords may cross. The count of non-crossing pairings of $2k$ points is the $k$-th Catalan number.

$$C_k = \frac{1}{k+1} \binom{2k}{k}$$

Which distribution has moments $0, C_1, 0, C_2, \ldots$? The semicircle. To check it, set $x = 2\cos\theta$ in

$$m_{2k} = \frac{1}{2\pi} \int_{-2}^2 x^{2k} \sqrt{4-x^2} \, dx$$

and work it out with the beta function identity; the answer comes out to $C_k$.
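Both routes are quick to check numerically: the even semicircle moments match the Catalan numbers $1, 2, 5, 14, \ldots$, and a single large GOE sample already reproduces them (a small sketch; the sample size $N = 1000$ is an arbitrary choice):

```python
import numpy as np
from math import comb

# Even semicircle moments by midpoint quadrature
edges = np.linspace(-2.0, 2.0, 200_001)
x = (edges[:-1] + edges[1:]) / 2
rho = np.sqrt(4.0 - x ** 2) / (2.0 * np.pi)
w = np.diff(edges)

catalan = {k: comb(2 * k, k) // (k + 1) for k in range(1, 5)}
sc_m = {k: float(np.sum(x ** (2 * k) * rho * w)) for k in range(1, 5)}

# The same moments from one GOE sample: m_2k = (1/N) Tr (M/sqrt(N))^{2k}
rng = np.random.default_rng(3)
N = 1000
A = rng.normal(size=(N, N))
lam = np.linalg.eigvalsh((A + A.T) / np.sqrt(2 * N))
goe_m = {k: float(np.mean(lam ** (2 * k))) for k in range(1, 5)}

print(catalan)   # {1: 1, 2: 2, 3: 5, 4: 14}
print(sc_m)
print(goe_m)
```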

The bulk Hessian spectrum is set by non-crossing partition combinatorics no matter what the entry distribution looks like. This is universality in random matrix theory.

Annealed vs. quenched complexity

Here is a distinction that matters. Kac-Rice computes $\mathbb{E}[\text{Crit}]$, which is the annealed quantity, while the physically relevant one is $\mathbb{E}[\log \text{Crit}]$, the quenched one. These are different in general because $\log \mathbb{E} \neq \mathbb{E} \log$. But for spherical $p$-spin models above the ground state they line up: the system self-averages and Kac-Rice spits out the right answer.
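The gap between the two is just Jensen's inequality, and a lognormal toy makes it concrete: for $X = e^Z$ with $Z \sim \mathcal{N}(\mu, \sigma^2)$, $\log \mathbb{E}[X] = \mu + \sigma^2/2$ while $\mathbb{E}[\log X] = \mu$ (a sketch, not a spin glass computation; the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n = 0.0, 1.5, 2_000_000
Z = rng.normal(mu, sigma, size=n)

annealed = np.log(np.mean(np.exp(Z)))   # -> mu + sigma^2/2 = 1.125
quenched = np.mean(Z)                   # log(e^Z) = Z, so -> mu = 0
print(annealed, quenched)
```

The annealed average is dominated by rare large samples; the quenched one tracks the typical sample, which is exactly why it is the physically relevant complexity.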

Euler characteristic and topology

Counting critical points by index is hard, and the Euler characteristic of the excursion set $A_u$ gives you another way in.

$$\chi(A_u) = \sum_{k=0}^{N-1} (-1)^k \text{Crit}_k(u)$$

At high energy the saddles of maximal index take over, and so $\mathbb{E}[\chi(A_u)] \approx (-1)^{N-1} \mathbb{E}[\text{Crit}_{N-1}(u)]$.

Adler and Taylor showed that for isotropic Gaussian fields

$$\mathbb{E}[\chi(A_u)] = \text{Vol}(D) \cdot (2\pi)^{-(N+1)/2} \det(\Lambda)^{1/2} H_{N-1}(u) e^{-u^2/2}$$

with $H_n(u)$ the Hermite polynomial. This sets the rate at which the topology of the landscape shifts with energy.
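The Hermite factor is what makes the mean Euler characteristic oscillate in $u$: taking $H_{N-1}$ as the probabilists' Hermite polynomial, its $N-1$ real roots flip the sign of $\mathbb{E}[\chi(A_u)]$ as the level $u$ sweeps down. A quick sketch at $n = 9$ (an arbitrary small degree):

```python
import numpy as np
from numpy.polynomial import hermite_e as He

n = 9
u = np.linspace(-6.0, 6.0, 100_000)
coeffs = np.zeros(n + 1)
coeffs[n] = 1.0
vals = He.hermeval(u, coeffs) * np.exp(-u ** 2 / 2)   # He_9(u) e^{-u^2/2}

sign_changes = int(np.sum(np.sign(vals[:-1]) != np.sign(vals[1:])))
print(sign_changes)   # 9: He_9 has 9 real roots, so the curve flips sign 9 times
```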

TAP free energy

The physics literature comes at the same problem through the Thouless-Anderson-Palmer free energy, which swaps the microscopic Hamiltonian $H(\sigma)$ for a free energy $F(m)$ over magnetizations $m_i = \langle \sigma_i \rangle$.

$$F_{TAP}(m) = H(m) - \frac{1}{2T} \sum_{i<j} J_{ij}^2 (1-m_i^2)(1-m_j^2) + T \sum_i S(m_i)$$

The second term is the Onsager reaction term, which corrects naive mean field theory for the feedback between a spin and the field it generates. In the $p$-spin spherical model the TAP equations collapse to $\nabla L(\theta) = 0$, and the Kac-Rice complexity matches the entropy of TAP solutions: two very different frameworks hand you the same critical point statistics.

Dynamics

Critical point geometry sets where gradient descent gets stuck. How fast it moves is a separate question.

Spectral gap. The convergence rate leans on the condition number $\kappa = \lambda_{max}/\lambda_{min}$. In the saddle regime, where $E > E_{th}$, the most negative eigenvalue sets the escape rate, and from the shifted semicircle

$$\lambda_{min}(u) \approx -2 - \alpha u$$

At high energies, as $u \to 0$, you get $\lambda_{min} \approx -2$: the escape rate is universal and does not depend on the starting point as long as the energy is high. And there are no flat saddles until $E_{th}$.

At $E_{th}$ the spectral gap closes and $\lambda_{min} \to 0$: the optimization dynamics slow down a lot right when the landscape topology gets simple. This blow-up of the relaxation time is critical slowing down.
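The gap closing can be read off directly from samples of the conditional Hessian. A sketch using the shifted-GOE representation on the $O(1)$ scale (a semicircle of radius $2\sqrt{p(p-1)}$ centered at $-pu$, whose left edge touches zero at $u_{th} = -2\sqrt{(p-1)/p} \approx -1.633$ for $p=3$):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 1000, 3
u_th = -2.0 * np.sqrt((p - 1) / p)           # ~ -1.633 for p = 3

def lambda_min(u):
    """Smallest eigenvalue (O(1) scale) of the conditional Hessian at energy u."""
    A = rng.normal(size=(N, N))
    M = (A + A.T) / np.sqrt(2 * N)           # GOE with semicircle on [-2, 2]
    H = np.sqrt(p * (p - 1)) * M - p * u * np.eye(N)
    return np.linalg.eigvalsh(H)[0]

lmins = {u: lambda_min(u) for u in [0.0, -1.0, u_th, -1.8]}
print(lmins)   # very negative, negative, ~0 at the threshold, positive below it
```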

Langevin dynamics and SGD. Adding noise

$$d\theta_t = -\nabla L(\theta_t) \, dt + \sqrt{2 T} \, dW_t$$

satisfies detailed balance for $e^{-\beta H}$, and $\Sigma(E)$ turns into an entropic barrier. The free energy is $F(u) = u - T\Sigma(u)$, and if you plug in the Kac-Rice complexity

$$F(u) = u + \frac{T}{2} u^2 + \cdots$$

At high temperature the system settles at an equilibrium energy $u^*(T)$ where the entropic force from the critical point density balances the energetic gradient. SGD finds good minima and not the ground state because it settles at a temperature set by the learning rate: this is a thermodynamic thing, not an algorithmic one.
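The temperature-sets-the-energy effect is visible already in the simplest possible landscape: overdamped Langevin on $L(\theta) = \frac{1}{2}\|\theta\|^2$ in $d$ dimensions equilibrates at $\mathbb{E}[L] = dT/2$ by equipartition, so the noise level alone picks the resting energy (a toy sketch, not the spin glass dynamics; step size and run length are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
d, dt, steps = 100, 0.01, 20_000

def mean_energy(T):
    """Euler-Maruyama Langevin on L(theta) = |theta|^2/2; equipartition: E[L] = d*T/2."""
    theta = np.zeros(d)
    energies = []
    for t in range(steps):
        theta = theta - theta * dt + np.sqrt(2 * T * dt) * rng.normal(size=d)
        if t > steps // 2:                   # discard burn-in
            energies.append(0.5 * theta @ theta)
    return np.mean(energies)

measured = {T: mean_energy(T) for T in [0.1, 0.5, 1.0]}
for T, E in measured.items():
    print(T, E, d * T / 2)                   # measured vs equipartition
```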

Numerical verification (Python/JAX)

We can check the shifted GOE prediction directly.

```python
import jax
import jax.numpy as jnp
from jax import random, jit
import matplotlib.pyplot as plt
from functools import partial

# -------------------------------------------------------------
# 1. Physics parameters
# -------------------------------------------------------------
N = 500    # Dimension (keep < 1000 for speed)
P = 3      # Spin glass parameter (p=3 => cubic interactions)

# -------------------------------------------------------------
# 2. Hamiltonian definition
# -------------------------------------------------------------
# The p-spin spherical spin glass for p=3:
#   H(x) = N^(-(p-1)/2) * sum_{ijk} J_ijk x_i x_j x_k
# Storing the tensor J is O(N^p), so this is only usable for small N.
# The Hessian statistics below are instead sampled directly from the
# theoretical shifted-GOE law, which avoids the tensor entirely:
# because the field is isotropic, conditioning on f(x) = E is enough,
# and we never need to search for critical points.

def p_spin_energy(x, J, p=3):
    """H(x) for p=3 given an interaction tensor J of shape (N, N, N)."""
    n = x.shape[0]
    Tx = jnp.tensordot(J, x, axes=1)   # (N, N, N) . (N,) -> (N, N)
    Txx = jnp.dot(Tx, x)               # (N, N) . (N,)   -> (N,)
    return jnp.dot(Txx, x) * n ** (-(p - 1) / 2.0)

# -------------------------------------------------------------
# 3. Conditional Hessian: direct shifted-GOE simulation
# -------------------------------------------------------------
@partial(jit, static_argnums=(1, 2))
def sample_shifted_goe(key, N, p, u):
    """
    Samples the conditional Hessian spectrum at intensive energy u = E/N:
        H | E  ~  sqrt(p(p-1)) * GOE(N) - p * u * sqrt(N) * I
    The noise variance is xi''(1) = p(p-1); the deterministic shift -p*u
    (on the O(1) scale) comes from the spherical constraint term
    <grad f, theta>/N = p*u at critical points.
    """
    # Standard GOE: off-diagonal N(0, 1), diagonal N(0, 2), symmetric.
    H_raw = random.normal(key, (N, N))
    H_sym = (H_raw + H_raw.T) / jnp.sqrt(2)
    H_noise = jnp.sqrt(p * (p - 1)) * H_sym
    H_total = H_noise - p * u * jnp.sqrt(N) * jnp.eye(N)
    return jnp.linalg.eigvalsh(H_total)

def run_simulation():
    N_sim, p_sim = 500, 3
    # -1.7 is near the ground state for the p=3 spherical model (E_gs ~ -1.657)
    energies = [-1.7, -1.0, 0.0, 1.0]
    key = random.PRNGKey(1337)

    plt.figure(figsize=(15, 6))
    for i, u in enumerate(energies):
        evals = []
        for _ in range(20):                  # average over 20 realizations
            key, sk = random.split(key)
            evals.append(sample_shifted_goe(sk, N_sim, p_sim, u))
        # Normalize by sqrt(N) to see the semicircle on the O(1) scale
        evals = jnp.concatenate(evals) / jnp.sqrt(N_sim)

        plt.subplot(1, 4, i + 1)
        plt.hist(evals, bins=60, density=True, alpha=0.7, color='navy')
        plt.axvline(0, color='red', linestyle='--', linewidth=2)
        plt.title(f"u = {u}")
        plt.xlabel(r"$\lambda / \sqrt{N}$")
        if i == 0:
            plt.ylabel("Spectral Density")
    plt.suptitle(f"Hessian Spectrum of p={p_sim} Spin Glass (Shifted GOE)", fontsize=16)
    plt.tight_layout()
    # plt.savefig("hessian_spectrum.png")

# run_simulation()
```

The plots show exactly what you would expect. At $u=0$ you get a semicircle centered at zero with half the eigenvalues negative: saddles of index around $N/2$. At $u=-1.0$ the semicircle slides right but still straddles zero. At $u=-1.7$, near the ground state for $p=3$ where $E_{gs} \approx -1.657$, the whole spectrum sits in $\mathbb{R}^+$ and you have a genuine local minimum. The threshold $E_{th}$ is exactly where the left edge of the semicircle crosses zero.

Implications for high-dimensional optimization

High-dimensional random landscapes are locked down tight. Concentration of measure forces the Hessian index to drop monotonically with energy and packs local minima into a narrow band near the global minimum.

The jump from saddle-dominated at high energy to minimum-dominated at low energy is sharp. Where this jump happens and how the spectral gap runs the convergence near it is the central question for high-dimensional non-convex optimization.


Neural networks and mode connectivity

Does the $p$-spin theory carry over to neural networks? Partly. A single hidden layer network $f(x) = \sum_i a_i \sigma(w_i^T x)$ with random weights has a similar shape, but with a few key differences.

  1. Non-isotropy. Weights are grouped by layer and not spread out on a sphere.
  2. Permutation symmetry. Swapping neurons leaves the function alone, so every critical point has $K!$ copies linked by permutation; things that look like distinct local minima are really the same under this symmetry.
  3. ReLU non-smoothness. Dead neurons open up flat regions and the Hessian builds up a spectral mass at zero.

Mode connectivity, from Garipov and coauthors in 2018, shows that low-loss paths run between any two SGD solutions. The sublevel sets $\{ \theta : L(\theta) < \epsilon \}$ look connected, and this is not at all like a pure spin glass landscape. The likely story is overparameterization: adding dimensions turns local minima into saddle points, and a local minimum in $\mathbb{R}^{100}$ can pick up a negative eigenvalue and become a saddle in $\mathbb{R}^{101}$.
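The permutation symmetry in point 2 above is trivial to exhibit: relabeling the hidden neurons of $f(x) = \sum_i a_i \sigma(w_i^T x)$ changes nothing about the function (a minimal sketch with ReLU and arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(7)
d_in, K, n = 5, 8, 100                       # K hidden neurons

W = rng.normal(size=(K, d_in))               # hidden weights w_i
a = rng.normal(size=K)                       # output weights a_i
X = rng.normal(size=(n, d_in))               # a batch of inputs

def net(X, W, a):
    """One hidden layer with ReLU: f(x) = sum_i a_i * max(w_i . x, 0)."""
    return np.maximum(X @ W.T, 0.0) @ a

perm = rng.permutation(K)                    # relabel the hidden neurons
out1 = net(X, W, a)
out2 = net(X, W[perm], a[perm])
print(np.max(np.abs(out1 - out2)))           # ~1e-15: the same function
```

So a gradient-based optimizer that lands at $(W, a)$ has really landed at $K!$ weight vectors at once.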


Persistent homology of loss landscapes

“Landscape shape” can be pinned down with persistence. The $k$-th Betti number $\beta_k$ counts $k$-dimensional holes, and the Morse inequalities give $N_k \ge \beta_k$, where $N_k$ is the number of index-$k$ critical points.

For $\mathbb{S}^{N-1}$ the Betti numbers are small integers. But the random field cranks out $N_k \sim e^{cN}$ critical points, so $N_k / \beta_k \to \infty$: most critical points are topologically extra and come in birth-death pairs that cancel out without changing the topology. SGD mini-batching basically performs these cancellations and smooths the landscape on the fly.

The algebraic constraint is the Euler characteristic.

$$\sum_k (-1)^k N_k = \chi(\mathcal{M}) = \sum_k (-1)^k \beta_k$$

Define the Poincare polynomial $P_t(\mathcal{M}) = \sum_k \beta_k t^k$ and the Morse polynomial $M_t(f) = \sum_k N_k t^k$; the strong Morse inequalities give

$$M_t(f) - P_t(\mathcal{M}) = (1+t) Q(t)$$

with $Q(t)$ having non-negative coefficients. For $\mathbb{S}^{N-1}$ you have $P_t(\mathbb{S}^{N-1}) = 1 + t^{N-1}$, so the random field has to produce its extra critical points in $(k, k+1)$ pairs to make this polynomial identity work.
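The alternating-sum constraint is easy to watch on the simplest closed manifold: for a random trigonometric polynomial on $S^1$, minima and maxima must come in equal numbers, since $\chi(S^1) = 0$ while $P_t(S^1) = 1 + t$ forces $N_0 = N_1 \ge 1$ (a small sketch; degree 6 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(8)
K, n_grid = 6, 200_000
ks = np.arange(1, K + 1)
a, b = rng.normal(size=K), rng.normal(size=K)

theta = np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False)
# f'(theta) for f = sum_k a_k cos(k theta) + b_k sin(k theta)
fp = -np.sin(np.outer(theta, ks)) @ (a * ks) + np.cos(np.outer(theta, ks)) @ (b * ks)

# Critical points = sign changes of f' around the circle; classify by direction
fp_next = np.roll(fp, -1)
crossings = np.where(fp * fp_next < 0)[0]
minima = int(np.sum(fp[crossings] < 0))      # f' goes - to +  => local minimum
maxima = int(np.sum(fp[crossings] > 0))      # f' goes + to -  => local maximum
print(minima, maxima, minima - maxima)       # N_0 = N_1, so chi(S^1) = 0
```

Any extra critical points beyond the two that topology requires show up in min-max pairs, the one-dimensional version of the $(k, k+1)$ pairing above.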


Milestones

| When | Who | Contribution |
|---|---|---|
| 1943 | S.O. Rice | Publishes the “Rice formula” for zero-crossings of random processes. |
| 1977 | Thouless, Anderson, Palmer | TAP equations for spin glasses; bring in the Onsager reaction term. |
| 2007 | Adler & Taylor | “Random Fields and Geometry”, the mathematical reference for Kac-Rice on manifolds. |
| 2013 | Auffinger et al. | Rigorous complexity calculation for spherical p-spin models. |
| 2014 | Dauphin et al. | Point to saddle points, not local minima, as the main obstacle in deep learning optimization. |
| 2018 | Garipov et al. | Mode connectivity: low-loss paths connect SGD solutions. |

Definitions

The annealed complexity $\log \mathbb{E}[\text{Crit}]$ is the easier one to work out through Kac-Rice, and the quenched complexity $\mathbb{E}[\log \text{Crit}]$ is the one that matters physically. For spherical p-spin models these line up above the ground state energy, so the Kac-Rice calculation hands you the right answer. A critical point’s index is the number of its negative Hessian eigenvalues: index 0 means a local minimum and index $k$ means a saddle with $k$ downhill directions. The Kac-Rice formula is the integral formula for counting critical points of a random field and plays the same role as the Rice formula for zero-crossings in one dimension. Critical slowing down is the blow-up of relaxation time near the threshold energy $E_{th}$, where the spectral gap of the Hessian closes. The Euler characteristic $\chi = \sum_k (-1)^k \beta_k$ is an alternating sum of Betti numbers and constrains how critical points of different indices can be distributed. A saddle point is a critical point where the gradient vanishes but at least one Hessian eigenvalue is negative.


References

1. Auffinger, A., Ben Arous, G., & Cerny, J. (2013). “Random matrices and complexity of spin glasses”. The foundational paper; rigorously works out the complexity $\Sigma(E)$ for p-spin models using Kac-Rice.

2. Choromanska, A. et al. (2015). “The loss surfaces of multilayer networks”. Pushed the p-spin results onto neural networks and argued that “bad local minima” are a myth in high dimensions.

3. Dauphin, Y. et al. (2014). “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”. Showed that Newton’s method gets pulled toward saddles in high dimensions and put forward “Saddle-Free Newton”.

4. Adler, R. J., & Taylor, J. E. (2007). “Random Fields and Geometry”. The big textbook on the geometry of excursion sets.