Random Fields & Non-Convex Optimization
Gradient descent in non-convex landscapes
Minimizing a non-convex function in $N$ dimensions is NP-hard in general, and yet gradient descent trains huge neural networks to near-zero training error. Why does practice beat the worst case by so much?
In two dimensions non-convex landscapes really are hard: local minima pile up and gradient methods get stuck. High dimensions are a different animal. As $N$ grows the local minima crowd in near the global minimum, and critical points at higher energy have negative Hessian eigenvalues, turning into saddle points that gradient descent slips out of.
We can count critical points at a given energy level.
When $N$ is large the number of critical points is exponential in $N$. The complexity function $\Sigma(E)$, the exponential growth rate of that count, and the Hessian index distribution at each energy level together run the show for landscape structure. The key pattern is that as the energy $E$ drops, the Morse index (the number of negative Hessian eigenvalues) drops right alongside it. High-energy critical points are saddles with many negative eigenvalues, and true local minima with index zero only show up near the global minimum.
We nail this down using the Kac-Rice formula for isotropic Gaussian random fields on the sphere $S^{N-1}(\sqrt{N})$.
Setup
Let $H : S^{N-1}(\sqrt{N}) \to \mathbb{R}$ be a centered Gaussian random field. We look at its critical points $\{\sigma : \nabla H(\sigma) = 0\}$ and excursion sets $A_E = \{\sigma : H(\sigma) \le E\}$. What we care about is the geometry of the optimization surface itself: the random landscape viewed as a function on a Riemannian manifold.
The covariance kernel
$$\mathbb{E}\big[H(\sigma)H(\sigma')\big] = N\,\xi\!\left(\frac{\langle\sigma,\sigma'\rangle}{N}\right)$$
is taken to be isotropic: the correlation only leans on the overlap $q = \langle\sigma,\sigma'\rangle/N$.
For the pure $p$-spin model, which is the standard setting in spherical spin glass theory,
$$H(\sigma) = \frac{1}{N^{(p-1)/2}}\sum_{i_1,\dots,i_p=1}^{N} J_{i_1\cdots i_p}\,\sigma_{i_1}\cdots\sigma_{i_p},\qquad J_{i_1\cdots i_p}\overset{\text{iid}}{\sim}\mathcal{N}(0,1),$$
the covariance is $\xi(q) = q^p$, i.e. $\mathbb{E}[H(\sigma)H(\sigma')] = N\,q^p$. The polynomial covariance structure makes the Hessian spectrum workable in the large-$N$ limit.
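As a quick sanity check on this covariance, here is a small Monte Carlo sketch of my own (the tensor size, overlap, and trial count are arbitrary choices): draw many tensors $J$ and estimate $\mathbb{E}[H(\sigma)H(\sigma')]/N$ for one fixed pair of points on the sphere.

```python
import numpy as np

# Small-N check that E[H(s) H(t)] = N * q^p for the pure p-spin (p = 3 here).
rng = np.random.default_rng(0)
N, p, trials = 16, 3, 4000

s = rng.normal(size=N); s *= np.sqrt(N) / np.linalg.norm(s)
t = s + 0.4 * rng.normal(size=N); t *= np.sqrt(N) / np.linalg.norm(t)
q = (s @ t) / N                         # overlap of the two points

def H(J, x):
    # H(x) = N^(-(p-1)/2) * sum_ijk J_ijk x_i x_j x_k
    return np.einsum('ijk,i,j,k->', J, x, x, x) / N ** ((p - 1) / 2)

prods = [H(J, s) * H(J, t) for J in (rng.normal(size=(N, N, N)) for _ in range(trials))]
print("empirical:", np.mean(prods) / N, "  predicted q^p:", q ** p)
```

With the two points chosen to overlap strongly the estimator concentrates quickly; for nearly orthogonal points the signal $N q^p$ is tiny and far more samples are needed.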
Because the state space $S^{N-1}(\sqrt{N}) = \{\sigma\in\mathbb{R}^N : \|\sigma\|^2 = N\}$ is compact we work right on the manifold. The gradient gets projected onto the tangent space $T_\sigma S^{N-1}$.
The Riemannian Hessian picks up a curvature correction from the second fundamental form:
$$\nabla^2_{\mathrm{sph}} H(\sigma) \;=\; P_\sigma\,\nabla^2 H(\sigma)\,P_\sigma \;-\; \frac{\langle\sigma,\nabla H(\sigma)\rangle}{N}\,P_\sigma,\qquad P_\sigma = I - \frac{\sigma\sigma^{\!\top}}{N}.$$
At critical points of the homogeneous $p$-spin Hamiltonian, $\langle\sigma,\nabla H(\sigma)\rangle = p\,H(\sigma)$, so this curvature term adds roughly $-p\,\frac{H(\sigma)}{N}\,I = -p\,u\,I$ on the tangent space; when $u$ is very negative it pushes a big positive constant onto every Hessian eigenvalue and makes low-energy critical points more convex. And this falls straight out of the sphere's extrinsic curvature.
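A minimal JAX sketch of this projection (the helper name `riemannian_grad_hess` is mine) makes the correction explicit; for the degree-$p$ Hamiltonian, $\langle\sigma,\nabla H\rangle = pH(\sigma)$ at a critical point, so the last term is exactly the $-pu$ shift that shows up below.

```python
import jax
import jax.numpy as jnp

def riemannian_grad_hess(energy_fn, sigma):
    """Gradient and Hessian of energy_fn restricted to the sphere |sigma|^2 = N."""
    N = sigma.shape[0]
    g = jax.grad(energy_fn)(sigma)                  # Euclidean gradient
    He = jax.hessian(energy_fn)(sigma)              # Euclidean Hessian
    P = jnp.eye(N) - jnp.outer(sigma, sigma) / N    # projector onto the tangent space
    rgrad = P @ g
    # second-fundamental-form correction: -(<sigma, grad H> / N) on the tangent space
    rhess = P @ He @ P - (sigma @ g / N) * P
    return rgrad, rhess
```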
Kac-Rice formula
We want $\mathbb{E}[\mathrm{Crt}_k(E)]$, the expected number of index-$k$ critical points at energy $E$. The Kac-Rice formula gives
$$\mathbb{E}[\mathrm{Crt}_k(E)] = \int_{S^{N-1}(\sqrt{N})}\mathbb{E}\Big[\big|\det\nabla^2 H(\sigma)\big|\,\mathbf{1}\{\mathrm{ind}(\nabla^2 H(\sigma)) = k\}\ \Big|\ \nabla H(\sigma) = 0,\ H(\sigma) = E\Big]\,\varphi_{\nabla H(\sigma),\,H(\sigma)}(0,E)\,d\sigma,$$
and isotropy kills the $\sigma$-dependence and cuts this down to
$$\mathbb{E}[\mathrm{Crt}_k(E)] = \mathrm{Vol}\big(S^{N-1}(\sqrt{N})\big)\;\mathbb{E}\Big[\big|\det\nabla^2 H\big|\,\mathbf{1}\{\mathrm{ind} = k\}\ \Big|\ \nabla H = 0,\ H = E\Big]\,\varphi(0,E),$$
where $\mathrm{Vol}\big(S^{N-1}(\sqrt{N})\big) = \frac{2\pi^{N/2}}{\Gamma(N/2)}\,N^{(N-1)/2}$ is the surface area of the sphere.
Joint density. The pair $\big(\nabla H(\sigma), H(\sigma)\big)$ is a centered Gaussian vector, and evaluating its density at $(0,E)$ you get
$$\varphi_{\nabla H,\,H}(0,E) = \frac{\exp\!\big(-\tfrac12\,v^{\!\top}\Sigma^{-1}v\big)}{(2\pi)^{N/2}\sqrt{\det\Sigma}},\qquad v = (0,\dots,0,E),$$
assuming the joint covariance $\Sigma$ is non-degenerate. For isotropic models $\nabla H$ and $H$ come out uncorrelated, and so
$$\varphi_{\nabla H,\,H}(0,E) = \varphi_{\nabla H}(0)\,\varphi_{H}(E) = \big(2\pi\,\xi'(1)\big)^{-\frac{N-1}{2}}\;\frac{e^{-E^2/(2N\xi(1))}}{\sqrt{2\pi N\,\xi(1)}}.$$
Conditional Hessian. This is where random matrix theory walks in. Conditioned on $\{\nabla H(\sigma) = 0,\ H(\sigma) = Nu\}$ the Hessian follows a shifted GOE,
$$\nabla^2 H \;\overset{d}{=}\; \sqrt{p(p-1)}\;G_{N-1} \;-\; p\,u\,I,$$
where $G_{N-1}$ is a standard Gaussian orthogonal ensemble matrix scaled so that its eigenvalue distribution heads toward the Wigner semicircle on $[-2,2]$ as $N\to\infty$:
$$\rho_{\mathrm{sc}}(\lambda) = \frac{1}{2\pi}\sqrt{4-\lambda^2}.$$
The Hessian eigenvalues follow this semicircle stretched by $\sqrt{p(p-1)}$ and shifted by $-p\,u$.
For the $p$-spin Hamiltonian the energy is extensive and of order $N$. Write $u = E/N$ for the intensive energy; the scaled Hessian then has eigenvalues following
$$\rho_{p,u}(\lambda) = \frac{1}{\sqrt{p(p-1)}}\,\rho_{\mathrm{sc}}\!\left(\frac{\lambda + p\,u}{\sqrt{p(p-1)}}\right),$$
a semicircle of radius $2\sqrt{p(p-1)}$ centered at $-p\,u$.
Computing the determinant. We need $\mathbb{E}\big[|\det\nabla^2 H|\ \big|\ \nabla H = 0,\ H = Nu\big]$, and we take logs:
$$\frac{1}{N}\log\mathbb{E}\big[|\det\nabla^2 H|\ \big|\ \cdot\,\big] \;\longrightarrow\; \int \log|\lambda|\;\rho_{p,u}(\lambda)\,d\lambda .$$
This is the annealed approximation: we average the count itself rather than its logarithm. The complexity is
$$\Sigma(u) \;=\; \lim_{N\to\infty}\frac{1}{N}\log\mathbb{E}\big[\mathrm{Crt}(Nu)\big],$$
and pulling it all together you get
$$\Sigma(u) \;=\; \frac{1}{2}\big(1-\log p\big) \;-\; \frac{u^2}{2} \;+\; \int \log|\lambda|\;\rho_{p,u}(\lambda)\,d\lambda,$$
where $\tfrac12(1-\log p)$ is the model-dependent curvature constant, collecting the sphere volume and the gradient variance.
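The whole expression can be checked numerically; the sketch below is my own, and its only inputs are the formula above and the $p = 3$ reference values $\tfrac12\log 2 \approx 0.347$ and $E_0 \approx -1.657$.

```python
import numpy as np

# Evaluate Sigma(u) = 1/2 (1 - log p) - u^2/2 + int log|l| rho_{p,u}(l) dl for p = 3.
p = 3
R = 2 * np.sqrt(p * (p - 1))                        # semicircle radius

def log_det_term(u, n=400001):
    lam = np.linspace(-R, R, n)                     # centered semicircle; shift is -p*u
    dl = lam[1] - lam[0]
    rho = 2 * np.sqrt(np.maximum(R**2 - lam**2, 0.0)) / (np.pi * R**2)
    return float(np.sum(np.log(np.abs(lam - p * u) + 1e-12) * rho) * dl)

def Sigma(u):
    return 0.5 * (1 - np.log(p)) - u**2 / 2 + log_det_term(u)

print(Sigma(0.0), 0.5 * np.log(p - 1))              # both ~ 0.3466
us = np.linspace(-1.75, -1.55, 41)
print(us[np.argmin([abs(Sigma(u)) for u in us])])   # ~ -1.655, close to E_0(3) = -1.657
```

The value at $u = 0$ reproduces $\tfrac12\log(p-1)$, and the zero of $\Sigma$ lands at the known $p = 3$ ground state energy.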
The saddle-to-minimum transition
The integral behaves nicely, but the nature of the critical points changes when the shift pushes the semicircle all the way onto one half-line.
At high energy, $u \approx 0$, there is no shift: the spectrum sits on $\big[-2\sqrt{p(p-1)},\ 2\sqrt{p(p-1)}\big]$, roughly half the eigenvalues are negative, and the critical points are saddles of high index.
At low energy, $u < 0$, the shift $-p\,u$ is positive, and once it passes $2\sqrt{p(p-1)}$ the whole spectrum sits in $(0,\infty)$ and every critical point is a local minimum.
The threshold energy $E_\infty$ is set by the condition that the left edge of the shifted semicircle hits zero: $-p\,E_\infty - 2\sqrt{p(p-1)} = 0$.
Auffinger, Ben Arous, and Cerny worked this out in 2013 for the pure $p$-spin:
$$E_\infty = -2\sqrt{\frac{p-1}{p}}, \qquad E_0 < E_\infty < 0,$$
where $E_0$ is the ground state energy; for $p = 3$, $E_\infty \approx -1.633$ and $E_0 \approx -1.657$.
Above $E_\infty$ the critical points are saddles and local minima are exponentially suppressed; below $E_\infty$ the local minima pile up; at the ground state energy $E_0$ the complexity goes back to zero. Gradient descent does not get stuck at high energies because the curvature structure stops traps from forming up there.
Moments of the GOE and Catalan numbers
The Wigner semicircle law can be pulled out from the moment sequence of the GOE alone and the combinatorics are worth spelling out.
Work out the moments $m_k = \lim_{N\to\infty}\frac1N\,\mathbb{E}\,\mathrm{Tr}\,G^k$ of a GOE matrix $G$ scaled so its spectrum fills $[-2,2]$. Odd moments die by symmetry, and for even moments
$$m_{2k} = \lim_{N\to\infty}\frac{1}{N}\sum_{i_1,\dots,i_{2k}}\mathbb{E}\big[G_{i_1 i_2}G_{i_2 i_3}\cdots G_{i_{2k} i_1}\big],$$
and by Wick's theorem only paired indices pitch in. The constraint is that the surviving pairings have to be non-crossing: draw the $2k$ indices around a circle and no chords may cross. The count of non-crossing pairings of $2k$ points is the $k$-th Catalan number $C_k = \frac{1}{k+1}\binom{2k}{k}$.
Which distribution has moments $m_{2k} = C_k$? The semicircle. To check it, set $x = 2\sin\theta$ in
$$m_{2k} = \int_{-2}^{2} x^{2k}\,\frac{\sqrt{4 - x^2}}{2\pi}\,dx$$
and work it out with the beta function identity; the answer comes out to $\frac{1}{k+1}\binom{2k}{k} = C_k$.
The bulk Hessian spectrum is set by non-crossing partition combinatorics no matter what the entry distribution looks like. This is universality in random matrix theory.
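A quick empirical check of the moment identity (my own snippet; the matrix size is arbitrary):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
N = 2000
A = rng.normal(size=(N, N))
G = (A + A.T) / np.sqrt(2 * N)             # GOE scaled so the spectrum fills [-2, 2]
evals = np.linalg.eigvalsh(G)

for k in range(1, 5):
    empirical = np.mean(evals ** (2 * k))  # (1/N) Tr G^(2k)
    catalan = comb(2 * k, k) // (k + 1)
    print(f"m_{2*k}: empirical {empirical:.3f}  vs  Catalan C_{k} = {catalan}")
```

Already at $N = 2000$ the first few even moments sit within a percent or so of $1, 2, 5, 14$.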
Annealed vs. quenched complexity
Here is a distinction that matters. Kac-Rice computes $\frac1N\log\mathbb{E}[\mathrm{Crt}_N]$, the annealed complexity; the physically relevant one is $\frac1N\mathbb{E}[\log\mathrm{Crt}_N]$, the quenched complexity. These are different in general because Jensen's inequality gives $\mathbb{E}[\log X]\le\log\mathbb{E}[X]$. But for spherical $p$-spin models above the ground state they line up, because the number of critical points self-averages, and Kac-Rice spits out the right answer.
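A toy computation shows how far apart they can be. Suppose a rare fraction of landscapes carries exponentially many extra critical points, say $\mathrm{Crt}_N = e^{aN}$ with probability $1 - e^{-cN}$ and $\mathrm{Crt}_N = e^{bN}$ with probability $e^{-cN}$, $b > a$. Then
$$\frac1N\,\mathbb{E}[\log\mathrm{Crt}_N] \;\to\; a, \qquad \frac1N\log\mathbb{E}[\mathrm{Crt}_N] \;\to\; \max(a,\ b - c) \;\ge\; a,$$
so atypical samples can inflate the annealed count without touching the quenched one. Self-averaging is exactly the statement that this does not happen.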
Euler characteristic and topology
Counting critical points by index is hard, and the Euler characteristic $\chi(A_E)$ of the excursion set gives you another way in.
At high energy the saddles of maximal index take over, and so the alternating sum $\chi(A_E) = \sum_k (-1)^k\,\mathrm{Crt}_k(A_E)$ is dominated by the top-index term.
Adler and Taylor showed that for isotropic Gaussian fields with unit variance,
$$\mathbb{E}\big[\chi(A_u)\big] = \sum_{j=0}^{\dim M}\mathcal{L}_j(M)\,\rho_j(u),\qquad \rho_j(u) = (2\pi)^{-(j+1)/2}\,H_{j-1}(u)\,e^{-u^2/2}\ \ (j\ge 1),$$
with $H_{j-1}$ the Hermite polynomial and $\mathcal{L}_j(M)$ the Lipschitz-Killing curvatures of the manifold in the metric induced by the field. This sets the rate at which the topology of the landscape shifts with energy.
TAP free energy
The physics literature comes at the same problem through the Thouless-Anderson-Palmer free energy, which swaps the microscopic Hamiltonian for a free energy over the magnetizations $m_i = \langle\sigma_i\rangle$.
On top of the naive mean-field energy and entropy, $F_{\mathrm{TAP}}$ carries a second term, the Onsager reaction term, which fixes mean field theory for the feedback between a spin and the field it makes. In the $p$-spin spherical model the TAP equations collapse to a closed relation between the energy and the self-overlap, and the Kac-Rice complexity matches up with the entropy of TAP solutions: two very different frameworks hand you the same critical point statistics.
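For orientation, the best-known concrete instance is the Ising SK model rather than the spherical model used here (with the couplings normalized so that $J_{ij}^2 \sim 1/N$); there the TAP stationarity condition reads
$$m_i = \tanh\!\Big(\beta\sum_{j} J_{ij}\,m_j \;-\; \beta^2\,(1 - q)\,m_i\Big), \qquad q = \frac1N\sum_j m_j^2,$$
and the second term inside the $\tanh$ is the Onsager reaction term: it subtracts the part of the local field that spin $i$ induces on itself through its neighbours.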
Dynamics
Critical point geometry sets where gradient descent gets stuck. How fast it moves is a separate question.
Spectral gap. The convergence rate leans on the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$. In the saddle regime, $u > E_\infty$, the most negative eigenvalue sets the escape rate, and from the shifted semicircle
$$\lambda_{\min}(u) = -p\,u - 2\sqrt{p(p-1)} \;<\; 0 .$$
At high energies, as $u \to 0$, you get $\lambda_{\min} \to -2\sqrt{p(p-1)}$: the escape rate is universal and does not lean on the starting point as long as the energy is high. And there are no flat saddles until $u$ drops toward $E_\infty$.
At $u = E_\infty$ the spectral gap closes and $\lambda_{\min}\to 0^-$, and the optimization dynamics slow down a lot right when the landscape topology gets simple. This blow-up of the relaxation time is critical slowing down.
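A tiny numerical illustration of the slow-down (my own sketch; the only input is the shifted-semicircle edge formula above):

```python
import numpy as np

p = 3
E_inf = -2 * np.sqrt((p - 1) / p)               # ~ -1.633

def lambda_min(u):
    # left edge of the shifted semicircle: center -p*u, radius 2*sqrt(p(p-1))
    return -p * u - 2 * np.sqrt(p * (p - 1))

for u in [0.0, -0.5, -1.0, -1.5, -1.6, -1.63]:
    lm = lambda_min(u)
    print(f"u = {u:+.2f}   lambda_min = {lm:+.3f}   escape time ~ {1 / abs(lm):8.1f}")
print("E_inf =", E_inf)
```

The escape time grows slowly through most of the saddle regime and then blows up in the last sliver before $E_\infty$.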
Langevin dynamics and SGD. Adding noise turns gradient descent into Langevin dynamics,
$$d\sigma_t = -\nabla H(\sigma_t)\,dt + \sqrt{2T}\,dW_t,$$
which satisfies detailed balance for the Gibbs measure $\mu(\sigma)\propto e^{-H(\sigma)/T}$, and the critical-point complexity turns into an entropic barrier. The free energy is $F(u) = u - T\,\Sigma(u)$, and if you plug in the Kac-Rice complexity the equilibrium energy solves $\Sigma'(u^\star) = 1/T$.
At high temperature the system settles at this equilibrium energy, where the entropic force from the critical point density pulling the dynamics up balances the energetic gradient pulling them down. SGD finds good minima and not the ground state because it settles at a temperature set by the learning rate; this is a thermodynamic thing and not an algorithmic one.
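Here is a rough Euler-Maruyama sketch of that dynamics on a small $p = 3$ landscape (my own illustration: the system size, temperature, step size, and step count are chosen purely for speed, and each update projects the gradient onto the tangent space and then retracts back to the sphere).

```python
import jax
import jax.numpy as jnp
from jax import random

N, p, T, dt, steps = 60, 3, 0.5, 1e-3, 5000
key = random.PRNGKey(0)
key, kJ, kx = random.split(key, 3)
J = random.normal(kJ, (N, N, N))

def H(x):
    # p = 3 spherical Hamiltonian with the same N^(-(p-1)/2) normalization as above
    return jnp.einsum('ijk,i,j,k->', J, x, x, x) / N ** ((p - 1) / 2)

grad_H = jax.grad(H)

@jax.jit
def step(x, key):
    g = grad_H(x)
    g = g - (x @ g / N) * x                         # project onto the tangent space
    x = x - dt * g + jnp.sqrt(2 * T * dt) * random.normal(key, (N,))
    return jnp.sqrt(N) * x / jnp.linalg.norm(x)     # retract back to the sphere

x = random.normal(kx, (N,))
x = jnp.sqrt(N) * x / jnp.linalg.norm(x)
for _ in range(steps):
    key, sk = random.split(key)
    x = step(x, sk)
print("intensive energy u =", float(H(x)) / N)
```

Sweeping the temperature (or the step size, the SGD analogue) shifts where the energy tends to settle, which is the thermodynamic point made above.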
Numerical verification (Python/JAX)
We can check the shifted GOE prediction directly.
```python
import jax
import jax.numpy as jnp
from jax import random, grad, hessian, vmap, jit
import matplotlib.pyplot as plt
from functools import partial
# -------------------------------------------------------------
# 1. Physics Parameters
# -------------------------------------------------------------
N = 500 # Dimension (Keep < 1000 for speed)
P = 3 # Spin glass parameter (p=3 => cubic interactions)
BATCH_SIZE = 50 # Number of landscapes to average over
# -------------------------------------------------------------
# 2. Hamiltonian Definition
# -------------------------------------------------------------
# We define the p-spin spherical spin glass H(x).
# H(x) = sum J_ijk x_i x_j x_k / N^((p-1)/2)
# Calculating the tensor contraction explicitly is O(N^p). Too slow.
# Trick: H(x) is a Gaussian field. We can simulate the gradient and Hessian directly
# using the covariance structure, OR we can use the Tensor sketch approximation.
# For exactness, we will sample the random tensor J for small N.
def p_spin_energy(x, J, N, p):
    """
    Computes H(x) given the interaction tensor J (p = 3 case).
    x: shape (N,)
    J: shape (N, N, N) for p=3
    """
    # Scale by N^(-(p-1)/2) so that E[H(x)^2] = N on the sphere |x|^2 = N
    normalization = N ** (-(p - 1) / 2.0)
    # Efficient contraction for p=3: sum_ijk J_ijk x_i x_j x_k
    Tx = jnp.tensordot(J, x, axes=1)   # (N, N, N) . (N,)  -> (N, N)
    Txx = jnp.dot(Tx, x)               # (N, N)    . (N,)  -> (N,)
    val = jnp.dot(Txx, x)              # (N,)      . (N,)  -> scalar
    return val * normalization
# -------------------------------------------------------------
# 3. Hessian Simulation
# -------------------------------------------------------------
# We do not need to search for critical points to check the conditional
# spectrum: the field is isotropic, so conditioning on the energy level is
# enough, and we also do not need to store the O(N^p) tensor J. Instead we
# sample the Hessian directly from its conditional law (the shifted GOE),
# which is exactly the theoretical prediction we want to look at.
@partial(jit, static_argnums=(1, 2))
def sample_shifted_goe(key, N, p, energy_intensive):
    """
    Samples from the conditional Hessian distribution at a critical point:
        Hess | {H = N*u, grad H = 0}  ~  sqrt(p(p-1)) * GOE(N)  -  p*u*sqrt(N) * I
    The deterministic shift -p*u (on the lambda/sqrt(N) scale) comes from the
    spherical constraint: sigma . grad H = p * H for the homogeneous p-spin.
    """
    # 1. Generate a standard GOE matrix
    #    Off-diagonal entries: N(0, 1). Diagonal entries: N(0, 2). Symmetric.
    H_raw = random.normal(key, (N, N))
    H_sym = (H_raw + H_raw.T) / jnp.sqrt(2)
    # 2. Scale the GOE to match the p-spin curvature fluctuations (variance p(p-1))
    sigma_rand = jnp.sqrt(p * (p - 1))
    H_noise = sigma_rand * H_sym
    # 3. Apply the deterministic shift
    u = energy_intensive
    shift_val = p * u * jnp.sqrt(N)
    H_total = H_noise - shift_val * jnp.eye(N)
    return jnp.linalg.eigvalsh(H_total)
def run_simulation():
    N_sim = 500
    p_sim = 3
    energies = [-1.7, -1.0, 0.0, 1.0]
    # -1.7 is near the ground state for the p=3 spherical model (E_gs ~ -1.657)
    key = random.PRNGKey(1337)
    plt.figure(figsize=(15, 6))
    for i, u in enumerate(energies):
        key, subkey = random.split(key)
        # Monte Carlo over random matrices
        eigenvalues = []
        for _ in range(20):  # Average over 20 realizations
            subkey, sk = random.split(subkey)
            evals = sample_shifted_goe(sk, N_sim, p_sim, u)
            eigenvalues.append(evals)
        eigenvalues = jnp.concatenate(eigenvalues)
        # Normalize by sqrt(N) to see the semicircle on the O(1) scale
        eigenvalues = eigenvalues / jnp.sqrt(N_sim)
        plt.subplot(1, 4, i + 1)
        plt.hist(eigenvalues, bins=60, density=True, alpha=0.7, color='navy')
        plt.axvline(0, color='red', linestyle='--', linewidth=2)
        plt.title(f"u = {u}")
        plt.xlabel(r"$\lambda / \sqrt{N}$")
        if i == 0:
            plt.ylabel("Spectral Density")
    plt.suptitle(f"Hessian Spectrum of p={p_sim} Spin Glass (Shifted GOE)", fontsize=16)
    plt.tight_layout()
    # plt.savefig("hessian_spectrum.png")
# run_simulation()
```

The plots show exactly what you would expect. At $u = 0$ you get a semicircle centered at zero with half the eigenvalues negative: these are saddles of index around $N/2$. At $u = -1.0$ the semicircle slides right but still straddles zero. At $u = -1.7$, near the ground state for $p = 3$, the whole spectrum sits in $(0,\infty)$ and you have a genuine local minimum. The threshold $E_\infty \approx -1.633$ is exactly where the left edge of the semicircle crosses zero.
Implications for high-dimensional optimization
High-dimensional random landscapes are locked down tight. Concentration of measure forces the Hessian index to drop monotonically with energy and packs local minima into a narrow band near the global minimum.
The jump from saddle-dominated at high energy to minimum-dominated at low energy is sharp. Where this jump happens and how the spectral gap runs the convergence near it is the central question for high-dimensional non-convex optimization.
Neural networks and mode connectivity
Does the $p$-spin theory carry over to neural networks? Partly. A single hidden layer network with random weights has a loss surface of a similar shape, but with a few key differences.
- Non-isotropy. Weights are grouped by layer and not spread out on a sphere.
- Permutation symmetry. Swapping hidden neurons leaves the function alone, so every critical point has combinatorially many copies linked by permutation; things that look like distinct local minima are really the same point under this symmetry.
- ReLU non-smoothness. Dead neurons open up flat regions and the Hessian builds up a spectral mass at zero.
Mode connectivity from Garipov and coauthors in 2018 shows that low-loss paths run between any two SGD solutions. The sublevel sets look connected, which is not at all like a pure spin glass landscape. The likely story is overparameterization: adding dimensions turns local minima into saddle points, and a local minimum in $N$ dimensions can pick up a negative eigenvalue and become a saddle in $N+1$.
Persistent homology of loss landscapes
“Landscape shape” can be pinned down with persistence. The $k$-th Betti number $b_k$ counts $k$-dimensional holes, and the Morse inequalities give $b_k \le \mathrm{Crt}_k$, where $\mathrm{Crt}_k$ is the number of index-$k$ critical points.
For the sphere $S^{N-1}$ the Betti numbers are small integers. But the random field cranks out $e^{N\Sigma}$ critical points, so $\mathrm{Crt}_k \gg b_k$: most critical points are topologically extra and come in birth-death pairs that cancel out without changing the topology. SGD mini-batching basically does these cancellations and smooths the landscape on the fly.
The algebraic constraint is the Euler characteristic, $\chi = \sum_k (-1)^k b_k = \sum_k (-1)^k\,\mathrm{Crt}_k$.
Define the Poincare polynomial $P(t) = \sum_k b_k\,t^k$ and the Morse polynomial $M(t) = \sum_k \mathrm{Crt}_k\,t^k$; the strong Morse inequalities give
$$M(t) - P(t) = (1 + t)\,Q(t)$$
with $Q(t)$ having non-negative coefficients. For $S^{N-1}$ you have $P(t) = 1 + t^{N-1}$, and so the random field has to crank out its extra critical points in pairs of adjacent index to make this polynomial identity work.
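A concrete low-dimensional instance of the bookkeeping: take a Morse function on $S^2$ (so $P(t) = 1 + t^2$ and $\chi = 2$) with $5$ minima and $5$ maxima. Then
$$M(t) - P(t) = 4 + \mathrm{Crt}_1\,t + 4t^2 = (1 + t)(4 + 4t) \quad\Longrightarrow\quad \mathrm{Crt}_1 = 8,$$
and indeed $5 - 8 + 5 = \chi = 2$: every extra minimum or maximum drags a saddle of adjacent index along with it.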
Milestones
| When | Who | Contribution |
|---|---|---|
| 1943 | S.O. Rice | Publishes the “Rice Formula” for zero-crossings of random processes. |
| 1977 | Thouless, Anderson, Palmer | TAP equations for spin glasses, introducing the Onsager reaction term. |
| 2007 | Adler & Taylor | “Random Fields and Geometry”, the mathematical reference for Kac-Rice on manifolds. |
| 2013 | Auffinger et al. | Rigorous complexity calculation for spherical p-spin models. |
| 2014 | Dauphin et al. | Identify saddle points, not local minima, as the main obstacle in deep learning optimization. |
| 2018 | Garipov et al. | Mode connectivity: showing that low-loss paths connect SGD solutions. |
Definitions
The annealed complexity is the easier one to work out through Kac-Rice, and the quenched complexity is the one that matters physically. For spherical p-spin models these line up above the ground state energy, so the Kac-Rice calculation hands you the right answer. A critical point’s index is the number of its negative Hessian eigenvalues: index 0 means a local minimum, and index $k$ means a saddle with $k$ downhill directions. The Kac-Rice formula is the integral formula for counting critical points of a random field and plays the same role as the Rice formula for zero-crossings in one dimension. Critical slowing down is the blow-up of relaxation time near the threshold energy where the spectral gap of the Hessian closes. The Euler characteristic is an alternating sum of Betti numbers and it constrains how critical points of different indices can be spread around. A saddle point is a critical point where the gradient dies but at least one Hessian eigenvalue is negative.
References
1. Auffinger, A., Ben Arous, G., & Cerny, J. (2013). “Random matrices and complexity of spin glasses”. The foundational paper; it rigorously works out the complexity for p-spin models using Kac-Rice.
2. Choromanska, A. et al. (2015). “The loss surfaces of multilayer networks”. Pushed the p-spin results onto neural networks and argued that “bad local minima” are a myth in high dimensions.
3. Dauphin, Y. et al. (2014). “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”. Showed that Newton’s method gets pulled toward saddles in high dimensions and put forward “Saddle-Free Newton”.
4. Adler, R. J., & Taylor, J. E. (2007). “Random Fields and Geometry”. The big textbook on the geometry of excursion sets.