Random Matrix Theory & The Marchenko-Pastur Law
If you build a big matrix out of random Gaussian entries, the eigenvalue distribution has a precise and universal structure.
That distribution is not Gaussian. Eigenvalues repel each other like charged particles in a Coulomb gas, spreading out to balance confinement (which pulls them toward 0) against logarithmic repulsion (which pushes them apart). This is very different from the CLT regime, where independent draws pile up into a bell curve.
For the Wishart matrix $S = \frac{1}{n} X X^\top$, the sample covariance of pure noise, the eigenvalues follow the Marchenko-Pastur law with compact support $[\lambda_-, \lambda_+]$. The sharp upper edge $\lambda_+$ of this support is the key structural feature. Eigenvalues inside the bulk look the same as noise, but an eigenvalue poking out beyond the edge shows signal.
This note works out these results using the Stieltjes transform and free probability, then covers the BBP phase transition, the exact threshold at which signal becomes detectable above noise.
1. Eigenvalue Repulsion as a Coulomb Gas
The joint density of the eigenvalues of an $N \times N$ Gaussian ensemble is
$$p(\lambda_1, \dots, \lambda_N) \;\propto\; \prod_{i<j} |\lambda_i - \lambda_j|^{\beta} \, \exp\!\Big(-\tfrac{\beta N}{4} \sum_i \lambda_i^2\Big),$$
where $\beta = 1$ for real symmetric matrices (GOE) and $\beta = 2$ for complex Hermitian matrices (GUE).
You can rewrite this as $p \propto e^{-\beta H(\lambda)}$ and get
$$H(\lambda) = \frac{N}{4} \sum_i \lambda_i^2 \;-\; \sum_{i<j} \log|\lambda_i - \lambda_j|.$$
This is a 2D Coulomb gas in a harmonic trap: the $\sum_i \lambda_i^2$ term confines and the $\log|\lambda_i - \lambda_j|$ term repels. As $N \to \infty$ the gas freezes into a deterministic equilibrium measure, and that limit is Wigner's semicircle for symmetric matrices and Marchenko-Pastur for covariance matrices.
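A quick numerical illustration of the repulsion (a sketch using NumPy; the matrix size and the 0.1 cutoff are arbitrary choices): nearest-neighbour spacings of a GOE matrix rarely come close to zero, while spacings of i.i.d. points show no such depletion.

```python
import numpy as np

# Sketch: eigenvalue repulsion. Small nearest-neighbour gaps are strongly
# suppressed for GOE eigenvalues compared to i.i.d. points.
rng = np.random.default_rng(0)
N = 500
A = rng.standard_normal((N, N))
W = (A + A.T) / np.sqrt(2 * N)            # GOE normalization, bulk on [-2, 2]

eigs = np.linalg.eigvalsh(W)
bulk = eigs[N // 4 : 3 * N // 4]           # stay away from the edges
spacings = np.diff(bulk)
spacings /= spacings.mean()                # normalize to mean spacing 1

iid = np.sort(rng.uniform(size=bulk.size))
iid_spacings = np.diff(iid)
iid_spacings /= iid_spacings.mean()

# Repulsion shows up as a depleted fraction of very small gaps.
print("P(spacing < 0.1), GOE   :", np.mean(spacings < 0.1))
print("P(spacing < 0.1), i.i.d.:", np.mean(iid_spacings < 0.1))
```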
2. The Stieltjes Transform
Given a spectral density $\rho$ on $\mathbb{R}$, you define, on the upper half-plane $\mathbb{C}^+$,
$$m(z) = \int \frac{\rho(x)}{x - z} \, dx.$$
Three reasons this is the right tool.
- For a matrix $W$ you get $m_N(z) = \frac{1}{N} \operatorname{tr}\,(W - z)^{-1}$, and algebra on resolvents turns into analysis.
- You pick out the density with Plemelj-Sokhotski, $\rho(x) = \lim_{\eta \downarrow 0} \frac{1}{\pi} \operatorname{Im}\, m(x + i\eta)$ (see the sketch after this list).
- It tames finite-$N$ fluctuations far better than the density itself, and this makes limit proofs tractable.
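Here is a small numerical sketch of that inversion formula (a Wigner matrix normalized so the bulk sits on $[-2, 2]$; the smoothing parameter $\eta$ and the grid are my choices):

```python
import numpy as np

# Sketch: recover the spectral density from the empirical Stieltjes transform
# m_N(z) = (1/N) sum_i 1/(lambda_i - z), evaluated just above the real axis.
rng = np.random.default_rng(1)
N = 1000
A = rng.standard_normal((N, N))
W = (A + A.T) / np.sqrt(2 * N)
eigs = np.linalg.eigvalsh(W)

eta = 0.05                                   # small imaginary part (free choice)
for x in np.linspace(-1.8, 1.8, 7):
    m = np.mean(1.0 / (eigs - (x + 1j * eta)))
    rho_hat = m.imag / np.pi                 # Plemelj-Sokhotski inversion
    rho_true = np.sqrt(4 - x**2) / (2 * np.pi)
    print(f"x = {x:+.2f}   recovered {rho_hat:.3f}   semicircle {rho_true:.3f}")
```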
3. Wigner Semicircle
Let $W = A / \sqrt{N}$ be $N \times N$ symmetric, with the entries $A_{ij}$ i.i.d. with mean 0 and variance 1.
Take the resolvent $G(z) = (W - z)^{-1}$. By symmetry the average of the normalized trace lines up with any diagonal element, and so we pick $G_{11}(z)$.
Block-decompose $W$ into its first row/column and the remaining $(N-1) \times (N-1)$ block $W^{(1)}$,
$$W = \begin{pmatrix} W_{11} & w^\top \\ w & W^{(1)} \end{pmatrix}.$$
The Schur complement gives us
$$G_{11}(z) = \frac{1}{W_{11} - z - w^\top \big(W^{(1)} - z\big)^{-1} w}.$$
Now send $N \to \infty$: $W_{11} = O(N^{-1/2})$ drops to 0 and the quadratic form concentrates by the Trace Lemma,
$$w^\top \big(W^{(1)} - z\big)^{-1} w \;\approx\; \frac{1}{N} \operatorname{tr}\big(W^{(1)} - z\big)^{-1} \;\approx\; m(z).$$
And so we get a self-consistent equation,
$$m(z) = \frac{1}{-z - m(z)}.$$
Solving the quadratic gives $m(z) = \frac{-z + \sqrt{z^2 - 4}}{2}$, taking the branch with $\operatorname{Im}\, m > 0$ on $\mathbb{C}^+$.
You pick off the density from the imaginary part on $[-2, 2]$,
$$\rho_{\mathrm{sc}}(x) = \frac{1}{2\pi} \sqrt{4 - x^2}.$$
The argument bootstraps because the resolvent of $W$ is expressed through the resolvent of a smaller copy of itself, and in the limit the recursion closes into a fixed-point equation.
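As a sanity check, here is a sketch that solves the fixed-point equation by naive iteration and compares the recovered density with the closed form (the grid and $\eta$ are arbitrary choices; the match is only up to $O(\eta)$ smoothing):

```python
import numpy as np

# Sketch: iterate m <- 1 / (-z - m) at z = x + i*eta until it converges, then
# read off the density from the imaginary part and compare to the semicircle.
eta = 0.05
for x in [0.0, 0.5, 1.0, 1.5, 1.9]:
    z = x + 1j * eta
    m = 1j                      # any starting point in the upper half-plane
    for _ in range(2000):
        m = 1.0 / (-z - m)
    rho_fixed_point = m.imag / np.pi
    rho_closed_form = np.sqrt(4 - x**2) / (2 * np.pi)
    print(f"x = {x:.1f}   fixed point {rho_fixed_point:.4f}   semicircle {rho_closed_form:.4f}")
```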
4. Marchenko-Pastur
Now let $S = \frac{1}{n} X X^\top$, where $X$ is $p \times n$ with i.i.d. entries of mean 0 and variance 1, and $\gamma = p/n$ is held fixed.
A similar block decomposition on the resolvent of $S$ gives a different fixed point,
$$m(z) = \frac{1}{1 - \gamma - z - \gamma z\, m(z)},$$
which rearranges to
$$\gamma z\, m(z)^2 + (z + \gamma - 1)\, m(z) + 1 = 0.$$
The density sits where the discriminant drops below zero,
$$(z + \gamma - 1)^2 - 4\gamma z < 0.$$
The roots of the discriminant sit at $\lambda_\pm = (1 \pm \sqrt{\gamma})^2$ and the density is
$$\rho_{\mathrm{MP}}(x) = \frac{\sqrt{(\lambda_+ - x)(x - \lambda_-)}}{2\pi \gamma x}, \qquad x \in [\lambda_-, \lambda_+].$$
The behavior near $x = 0$ is worth a careful look.
- When $\gamma = 1$ (square) the support is $[0, 4]$ and the density blows up at 0 like $x^{-1/2}$.
- When $\gamma < 1$ (tall, $p < n$) the support sits bounded away from 0.
- When $\gamma > 1$ (wide, $p > n$) there is a point mass at 0 with weight $1 - 1/\gamma$, and this sits on the null-space directions (see the sketch below).
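A quick numerical check of that last point (a sketch; the dimensions and the $10^{-10}$ cutoff are arbitrary choices):

```python
import numpy as np

# Sketch: for p > n the sample covariance is rank-deficient, so the fraction
# of zero eigenvalues should match the Marchenko-Pastur point mass 1 - 1/gamma.
rng = np.random.default_rng(2)
p, n = 1500, 500                        # gamma = 3
X = rng.standard_normal((p, n))
S = X @ X.T / n
evals = np.linalg.eigvalsh(S)

zero_fraction = np.mean(evals < 1e-10)  # numerically zero eigenvalues
print("empirical zero fraction :", zero_fraction)
print("predicted 1 - 1/gamma   :", 1 - n / p)
```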
5. The BBP Phase Transition
Suppose the data isn't pure noise. The signal model is a rank-one spike: the population covariance is $\Sigma = I + \theta\, u u^\top$ with $\|u\| = 1$ and $\theta > 0$, and the question is whether the top eigenvalue of the sample covariance $S$ sees $\theta$.
Theorem (Baik, Ben Arous, Péché). There is a critical threshold $\theta_c = \sqrt{\gamma}$.
Below threshold ($\theta < \sqrt{\gamma}$). The top eigenvalue sticks to the bulk edge, $\lambda_{\max} \to (1 + \sqrt{\gamma})^2$, and the empirical top eigenvector $\hat{u}$ sits asymptotically orthogonal to the true $u$, with $|\langle \hat{u}, u \rangle|^2 \to 0$. PCA fails completely, and this is a fundamental wall because no linear method can pick out the spike.
Above threshold ($\theta > \sqrt{\gamma}$). The top eigenvalue breaks free,
$$\lambda_{\max} \;\to\; (1 + \theta)\Big(1 + \frac{\gamma}{\theta}\Big) \;>\; (1 + \sqrt{\gamma})^2,$$
and the overlap goes nonzero,
$$|\langle \hat{u}, u \rangle|^2 \;\to\; \frac{1 - \gamma/\theta^2}{1 + \gamma/\theta} \;>\; 0.$$
This is a hard detection limit. Below $\theta_c$ the signal is invisible to spectral methods and no linear method can pick it out. But in practice people frequently try to pull signal from eigenvalues well inside the bulk.
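Here is a sketch that checks both limits empirically (the spiked-covariance sampling scheme and the parameters $\theta = 1.5$, $\gamma = 0.5$ are my choices):

```python
import numpy as np

# Sketch: rank-one spiked covariance. Above the threshold sqrt(gamma), the top
# eigenvalue and the eigenvector overlap should match the BBP predictions.
rng = np.random.default_rng(3)
p, n = 1000, 2000
gamma, theta = p / n, 1.5                    # theta > sqrt(gamma) ~ 0.71

u = rng.standard_normal(p)
u /= np.linalg.norm(u)
# Sample X with population covariance I + theta * u u^T.
Z = rng.standard_normal((p, n))
X = Z + (np.sqrt(1 + theta) - 1) * np.outer(u, u @ Z)
S = X @ X.T / n

evals, evecs = np.linalg.eigh(S)
lam_max, u_hat = evals[-1], evecs[:, -1]

print("top eigenvalue      :", lam_max)
print("BBP prediction      :", (1 + theta) * (1 + gamma / theta))
print("overlap <u,u_hat>^2 :", (u @ u_hat) ** 2)
print("BBP prediction      :", (1 - gamma / theta**2) / (1 + gamma / theta))
```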
6. Tracy-Widom Fluctuations
We said the bulk edge sits at $\lambda_+ = (1 + \sqrt{\gamma})^2$, but for finite $n$ the largest eigenvalue jitters around that limit.
For sums of i.i.d. variables the CLT gives Gaussian fluctuations at scale $n^{-1/2}$. Eigenvalue repulsion stiffens the system, so edge fluctuations come in at scale $n^{-2/3}$, which is much smaller, and the limiting shape is non-Gaussian.
Theorem (Tracy-Widom, 1994). The largest eigenvalue fluctuates around the edge on the $n^{-2/3}$ scale,
$$\lambda_{\max} \;\approx\; \lambda_+ + \sigma_\gamma\, n^{-2/3}\, \chi, \qquad \chi \sim \mathrm{TW}_\beta,$$
where $\sigma_\gamma$ is an explicit constant depending only on $\gamma$ and $\mathrm{TW}_\beta$ is the Tracy-Widom distribution ($\beta = 1$ real, $\beta = 2$ complex).
The CDF falls out of the Painlevé II equation: for $\beta = 2$, writing $q$ for the solution of $q'' = x q + 2 q^3$ with $q(x) \sim \operatorname{Ai}(x)$ as $x \to \infty$,
$$F_2(s) = \exp\Big(-\int_s^\infty (x - s)\, q(x)^2 \, dx\Big).$$
This distribution shows up in settings with no obvious tie to random matrices.
- Longest increasing subsequence of random permutations.
- Height fluctuations of KPZ interfaces.
- Bus bunching in transit systems (yes, really).
The distribution is asymmetric: a soft tail on the right (where outliers can happen) and a much stiffer tail on the left (where the bulk pushes back).
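A Monte Carlo sketch of the edge fluctuations (GOE case for simplicity; the size and trial count are arbitrary, and at these sizes the convergence is only rough):

```python
import numpy as np

# Sketch: rescaled largest eigenvalues of GOE matrices, N^(2/3) * (lambda_max - 2),
# should cluster around the Tracy-Widom (beta = 1) law.
rng = np.random.default_rng(4)
N, trials = 200, 300
samples = []
for _ in range(trials):
    A = rng.standard_normal((N, N))
    W = (A + A.T) / np.sqrt(2 * N)
    samples.append(N ** (2 / 3) * (np.linalg.eigvalsh(W)[-1] - 2))
samples = np.array(samples)

print("mean :", samples.mean())   # TW1 mean is roughly -1.21
print("std  :", samples.std())    # TW1 std is roughly 1.27
```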
7. Free Probability and the R-Transform
Suppose you want the spectrum of $A + B$ where both are large random matrices. If they share eigenvectors then eigenvalues add pointwise, but if they sit in “generic position” they are free in Voiculescu's sense, and this is the noncommutative analogue of independence.
The R-transform linearizes free convolution the same way the log-characteristic function linearizes classical convolution.
Here is how it works.
- Get the Stieltjes transform, here in the form $G_A(z) = \int \frac{\rho_A(x)}{z - x}\, dx = -m_A(z)$.
- Functionally invert it to get $G_A^{-1}(w)$, i.e. solve $G_A(z) = w$ for $z$.
- Then $R_A(w) = G_A^{-1}(w) - \frac{1}{w}$.
- Free convolution turns into addition, and $R_{A+B}(w) = R_A(w) + R_B(w)$.
- Invert back by solving $G_{A+B}^{-1}(w) = R_{A+B}(w) + \frac{1}{w}$ for $G_{A+B}(z)$, then read off the density.
Semicircle plus semicircle is semicircle, with the variances adding. And more practically, for a ResNet layer $I + W$, computing the spectrum of the layer is a free convolution of a Dirac mass at 1 with the Marchenko-Pastur law, and you can predict the spectral shape exactly.
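A sketch of the simplest case (independent GOE matrices are asymptotically free; the size is arbitrary): the eigenvalues of $A + B$ should follow a semicircle whose variance is the sum of the two variances.

```python
import numpy as np

# Sketch: two independent GOE matrices are (asymptotically) free, so the
# spectrum of A + B is semicircle (+) semicircle = semicircle with variance 2.
rng = np.random.default_rng(5)
N = 1500

def goe(n):
    a = rng.standard_normal((n, n))
    return (a + a.T) / np.sqrt(2 * n)      # semicircle on [-2, 2], variance 1

evals = np.linalg.eigvalsh(goe(N) + goe(N))

# Semicircle with variance s^2 has second moment s^2, fourth moment 2*s^4,
# and support edge at 2*s.
print("2nd moment  :", np.mean(evals**2), " (predicted 2)")
print("4th moment  :", np.mean(evals**4), " (predicted 8)")
print("support edge:", evals.max(), " (predicted ~ 2*sqrt(2) = 2.83)")
```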
8. Simulation
```python
import numpy as np
import matplotlib.pyplot as plt

def marchenko_pastur_pdf(x, gamma, sigma=1):
    lambda_plus = sigma**2 * (1 + np.sqrt(gamma))**2
    lambda_minus = sigma**2 * (1 - np.sqrt(gamma))**2
    # PDF only defined within support
    with np.errstate(divide='ignore', invalid='ignore'):
        y = np.sqrt((lambda_plus - x) * (x - lambda_minus)) / (2 * np.pi * gamma * x * sigma**2)
    y[x < lambda_minus] = 0
    y[x > lambda_plus] = 0
    return y

def simulate_eigenvalues(p=1000, n=2000, signal_strength=0.0):
    gamma = p / n
    # Noise
    Z = np.random.randn(p, n)
    # Signal (Rank 1 spike)
    if signal_strength > 0:
        u = np.random.randn(p, 1); u /= np.linalg.norm(u)
        v = np.random.randn(n, 1); v /= np.linalg.norm(v)
        X = np.sqrt(signal_strength) * u @ v.T * np.sqrt(n) + Z
    else:
        X = Z
    S = (1/n) * X @ X.T
    evals = np.linalg.eigvalsh(S)
    # Plot Bulk
    plt.figure(figsize=(10, 6))
    plt.hist(evals, bins=80, density=True, alpha=0.5, label='Empirical')
    x = np.linspace(0, (1 + np.sqrt(gamma))**2 + 1, 1000)
    pdf = marchenko_pastur_pdf(x, gamma)
    plt.plot(x, pdf, 'r', linewidth=2, label='Marchenko-Pastur')
    # Edge Fluctuations (Tracy-Widom scaling)
    lambda_plus = (1 + np.sqrt(gamma))**2
    max_eval = np.max(evals)
    scaling_factor = lambda_plus * (n**(-2/3))
    tw_fluctuation = (max_eval - lambda_plus) / scaling_factor
    print(f"Largest Eigenvalue: {max_eval:.4f}")
    print(f"Edge Prediction: {lambda_plus:.4f}")
    print(f"Normalized Fluctuation (should be O(1)): {tw_fluctuation:.4f}")
    # BBP Prediction
    threshold = np.sqrt(gamma)
    if signal_strength > threshold:
        predicted_spike = (1 + signal_strength) * (1 + gamma/signal_strength)
        plt.axvline(predicted_spike, color='g', linestyle='--', label=f'BBP Prediction ({predicted_spike:.2f})')
    plt.axvline(lambda_plus, color='k', linestyle='--', label='Bulk Edge')
    plt.legend()
    return evals

# To run:
# simulate_eigenvalues(p=1000, n=2000, signal_strength=1.5)
```
9. Universality and Neural Networks
None of the work above actually needs Gaussian entries. Universality says that as long as the entries are i.i.d. with mean 0 and variance 1 and finite higher moments, the spectral statistics settle to the same limits, and this is a kind of spectral CLT.
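A sketch of this (the entry distribution and the sizes are my choices): swap the Gaussian entries for random signs and the Marchenko-Pastur bulk should look identical.

```python
import numpy as np

# Sketch: sample covariance with Rademacher (+/-1) entries. Universality says
# the bulk should still follow Marchenko-Pastur with gamma = p/n.
rng = np.random.default_rng(6)
p, n = 1000, 2000
gamma = p / n
X = rng.choice([-1.0, 1.0], size=(p, n))   # mean 0, variance 1, not Gaussian
evals = np.linalg.eigvalsh(X @ X.T / n)

lam_minus, lam_plus = (1 - np.sqrt(gamma))**2, (1 + np.sqrt(gamma))**2
print("empirical support:", evals.min(), evals.max())
print("MP support       :", lam_minus, lam_plus)
print("mean eigenvalue  :", evals.mean(), " (MP mean is 1)")
```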
This matters for deep learning. We pick weights as $W_{ij} \sim \mathcal{N}(0, \sigma^2/n)$, and the Jacobian of a deep network is a product of such matrices, so for stable gradients the singular values have to bunch near 1. Marchenko-Pastur shows where they actually land, and this is the theoretical basis for Xavier and He initialization. Orthogonal initialization goes further and collapses the spectrum to a Dirac at 1.
The loss Hessian is also a large random matrix, and empirically its spectrum looks like a Marchenko-Pastur bulk plus a few outlier spikes, which is exactly the BBP picture. A 2005 theorem about sample covariance matrices describes the training landscape of modern networks with surprising accuracy.
10. Where This Stands
Wigner started this field in the 1950s to model nuclear resonances; seventy years later the same eigenvalue distributions show up in neural network initialization, wireless channel capacity, and the KPZ universality class.
Marchenko-Pastur gives the shape of noise, BBP says when signal is detectable, and Tracy-Widom governs the edge fluctuations. Together these three results give a full picture of the eigenvalues of large random matrices, and of any high-dimensional system whose correlation structure you want to understand.
Moments and Catalan combinatorics
There is another route to the semicircle that dodges the Stieltjes transform entirely and it works by computing moments,
$$m_k = \frac{1}{N}\, \mathbb{E}\,\operatorname{tr} W^k = \frac{1}{N} \sum_{i_1, \dots, i_k} \mathbb{E}\big[W_{i_1 i_2} W_{i_2 i_3} \cdots W_{i_k i_1}\big].$$
Think of the index sequence $i_1 \to i_2 \to \dots \to i_k \to i_1$ as a closed walk on a graph. Since entries are independent with mean 0, only walks that hit every edge at least twice contribute.
Odd moments wipe out because an odd-length walk cannot hit every edge exactly twice. Even moments count “non-crossing” walks on trees and the answer is the Catalan numbers,
$$m_{2k} \;\to\; C_k = \frac{1}{k+1}\binom{2k}{k}.$$
And these are exactly the moments of the semicircle,
$$\int_{-2}^{2} x^{2k}\, \frac{\sqrt{4 - x^2}}{2\pi}\, dx = C_k.$$
Compact support together with moment matching gives uniqueness of the distribution by Carleman’s condition.
For Marchenko-Pastur the combinatorics run on non-crossing partitions counted by Narayana numbers.
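A small numerical check of the Catalan statement above (matrix size is arbitrary): the normalized traces $\frac{1}{N}\operatorname{tr} W^{2k}$ of a Wigner matrix should approach $C_k$.

```python
import numpy as np
from math import comb

# Sketch: even trace moments of a Wigner matrix vs. Catalan numbers C_k.
rng = np.random.default_rng(7)
N = 2000
A = rng.standard_normal((N, N))
W = (A + A.T) / np.sqrt(2 * N)
evals = np.linalg.eigvalsh(W)

for k in range(1, 6):
    empirical = np.mean(evals ** (2 * k))        # (1/N) tr W^(2k)
    catalan = comb(2 * k, k) // (k + 1)
    print(f"k={k}   (1/N) tr W^{2 * k} = {empirical:.3f}   C_{k} = {catalan}")
```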
Spectral large deviations
The semicircle is the typical spectral density, so how unlikely is it to see something else?
There is a large deviation principle at speed $N^2$,
$$\mathbb{P}\big(\hat{\mu}_N \approx \mu\big) \;\asymp\; e^{-N^2 I(\mu)}.$$
The rate function from Ben Arous and Guionnet is
$$I(\mu) = \beta\left(\frac{1}{4}\int x^2 \, d\mu(x) \;-\; \frac{1}{2}\iint \log|x - y| \, d\mu(x)\, d\mu(y)\right) - \mathrm{const}.$$
This is the electrostatic energy of a charge distribution: the $\int x^2\, d\mu$ term is the confinement cost, the double integral is the repulsion cost, and the semicircle minimizes the total. Universality from this angle means the system sits at an energy minimum, and so small pushes on the entry distribution do not move the ground state.
The Harish-Chandra-Itzykson-Zuber integral
Sometimes you need to integrate over the unitary group $U(N)$. The Harish-Chandra-Itzykson-Zuber integral ties together two Hermitian matrices $A$ and $B$,
$$I(A, B) = \int_{U(N)} e^{\operatorname{tr}(A U B U^\dagger)}\, dU.$$
If $A$ has eigenvalues $a_1, \dots, a_N$ and $B$ has $b_1, \dots, b_N$, then
$$I(A, B) = c_N\, \frac{\det\big(e^{a_i b_j}\big)_{i,j=1}^{N}}{\Delta(a)\, \Delta(b)},$$
where $\Delta(a) = \prod_{i<j}(a_j - a_i)$ is the Vandermonde determinant and $c_N$ is an explicit $N$-dependent constant.
This is a clean closed form; it ties RMT to integrable systems and representation theory, and it lets you solve coupled matrix models exactly.
The Circular Law for non-symmetric matrices
Drop the symmetry assumption and let $M$ be $N \times N$ with i.i.d. entries of mean 0 and variance $1/N$. Now the eigenvalues are complex.
Girko's Circular Law. As $N \to \infty$ the eigenvalues spread uniformly on the unit disk,
$$\rho(z) \;\to\; \frac{1}{\pi}\, \mathbf{1}\{|z| \le 1\}.$$
For RNNs this lands directly because eigenvalues outside the unit circle fire off exploding gradients and those well inside kill gradients. Orthogonal initialization pins them to $|z| = 1$ and that is the sweet spot.
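A quick sketch (the size is arbitrary): the eigenvalues of an i.i.d. matrix scaled by $1/\sqrt{N}$ should fill the unit disk roughly uniformly, so the fraction inside radius $r$ should be about $r^2$.

```python
import numpy as np

# Sketch: circular law. Eigenvalues of a (1/sqrt(N))-scaled i.i.d. Gaussian
# matrix should be roughly uniform on the unit disk: P(|z| <= r) ~ r^2.
rng = np.random.default_rng(8)
N = 1000
M = rng.standard_normal((N, N)) / np.sqrt(N)
eigs = np.linalg.eigvals(M)                # complex eigenvalues
radii = np.abs(eigs)

print("max |eigenvalue|:", radii.max())    # close to 1
for r in [0.25, 0.5, 0.75, 1.0]:
    print(f"fraction with |z| <= {r}: {np.mean(radii <= r):.3f}   (uniform disk: {r**2:.3f})")
```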
Key developments
- 1928. John Wishart derives the distribution of sample covariance matrices, launching the field.
- 1955. Eugene Wigner models heavy nuclei resonance with random matrices and discovers the semicircle law.
- 1967. Marchenko and Pastur derive the limiting spectral density for large covariance matrices.
- 1984. Girko proves the circular law for non-Hermitian matrices.
- 1991. Dan Voiculescu founds free probability, providing algebraic tools for computing spectra of sums and products.
- 1994. Tracy and Widom discover the universal distribution governing edge fluctuations.
- 2005. Baik, Ben Arous, and Peche characterize the signal detection threshold (BBP phase transition).
- 2010s. RMT applied to deep learning, including loss surfaces (Pennington) and initialization schemes (Saxe, Ganguli).
Notation and terminology
| Term | Meaning |
|---|---|
| Bulk | The continuous part of the limiting spectral density, where eigenvalues are densely packed. |
| Edge | The boundary of the bulk support. Tracy-Widom governs fluctuations here. |
| Free probability | Noncommutative probability theory where “freeness” replaces independence. Linearizes spectral computations via the R-transform. |
| Stieltjes transform | $m(z) = \int \frac{\rho(x)}{x - z}\, dx$, the moment-generating function of the spectral density, analytic on the upper half-plane. |
| Universality | The phenomenon that spectral statistics depend only on the first few moments of the entry distribution, not its specific form. |
| Wishart matrix | $S = \frac{1}{n} X X^\top$ with Gaussian $X$. The matrix analogue of the chi-squared distribution. |
References
1. Anderson, G. W., Guionnet, A., & Zeitouni, O. (2010). “An Introduction to Random Matrices”. The standard reference. It is dense and thorough and covers large deviations and topological expansions.
2. Marchenko, V. A., & Pastur, L. A. (1967). “Distribution of eigenvalues for some sets of random matrices”. The original paper.
3. Baik, J., Ben Arous, G., & Peche, S. (2005). “Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices”. The BBP transition and it changed how people think about signal detection in high dimensions.
4. Tracy, C. A., & Widom, H. (1994). “Level-spacing distributions and the Airy kernel”. Where the Tracy-Widom distribution comes from and it is now everywhere in KPZ physics.
5. Pennington, J., & Bahri, Y. (2017). “Geometry of Neural Network Loss Surfaces via Random Matrix Theory”. Used RMT on Hessians of deep networks and found that critical points are saddles and not local minima.
6. Voiculescu, D. V. (1991). “Limit laws for random matrices and free products”. Free probability starts here.