Random Matrix Theory & The Marchenko-Pastur Law
If we construct a large $N \times N$ matrix with random entries, what do its eigenvalues look like? Naive intuition suggests they might be Gaussian. They are not. They repel each other. Unlike independent random variables, which clump together (Central Limit Theorem), eigenvalues behave like charged particles in a “Coulomb Gas”. They distribute themselves to minimize a combined energy potential, balancing the confinement from the Gaussian entries (which pulls them toward 0) and the logarithmic repulsion (which pushes them apart). The eigenvalues of a Wishart Matrix (the covariance of pure noise) follow the Marchenko-Pastur Law. This distribution has compact support $[\lambda_-, \lambda_+]$. This Hard Edge is the most critical feature of random spectra. It allows us to distinguish signal from noise: any eigenvalue popping out of this bulk is a “spike” (signal), while anything inside is indistinguishable from noise. This note derives the limit laws of Random Matrix Theory (RMT) using the Stieltjes Transform and Free Probability, and explores the BBP Phase Transition: the precise moment when a signal becomes strong enough to escape the sea of noise.
1. The Physics of Eigenvalues: The Coulomb Gas
Why do eigenvalues repel? The joint probability density of the eigenvalues of a Gaussian Ensemble is given by:
$$p(\lambda_1, \dots, \lambda_N) \;\propto\; \prod_{i<j} |\lambda_i - \lambda_j|^{\beta}\, \exp\!\Big(-\frac{\beta N}{4} \sum_{i=1}^{N} \lambda_i^2\Big)$$
where $\beta$ is the “Dyson Index” ($\beta = 1$ for Real Symmetric, $\beta = 2$ for Complex Hermitian). This can be written as a Boltzmann distribution $p \propto e^{-\beta N \mathcal{H}}$ with Hamiltonian:
$$\mathcal{H}(\lambda_1, \dots, \lambda_N) \;=\; \frac{1}{4}\sum_{i} \lambda_i^2 \;-\; \frac{1}{N}\sum_{i<j} \log|\lambda_i - \lambda_j|$$
This is exactly the energy of a 2D Coulomb Gas confined by a harmonic potential.
- Confinement: the quadratic potential $\frac{1}{4}\sum_i \lambda_i^2$ keeps particles near the origin.
- Repulsion: the logarithmic term $-\log|\lambda_i - \lambda_j|$ prevents particles from overlapping. As $N \to \infty$, the gas settles into a deterministic equilibrium configuration. This equilibrium density is the Wigner Semicircle (for symmetric matrices) or Marchenko-Pastur (for covariance matrices); a quick numerical illustration of the repulsion follows.
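A minimal numerical sketch of this repulsion (assuming NumPy/Matplotlib; the helper name `nn_spacings` and the GOE sizes are illustrative choices): nearest-neighbour spacings of GOE eigenvalues are suppressed near zero (level repulsion), while spacings of i.i.d. points pile up at zero.

```python
import numpy as np
import matplotlib.pyplot as plt

def nn_spacings(points):
    """Nearest-neighbour spacings of a 1-D sample, normalized to mean spacing 1."""
    s = np.diff(np.sort(points))
    return s / s.mean()

rng = np.random.default_rng(0)
goe, iid = [], []
for _ in range(50):
    A = rng.standard_normal((400, 400))
    H = (A + A.T) / np.sqrt(2 * 400)                # GOE normalization, spectrum ~ [-2, 2]
    ev = np.linalg.eigvalsh(H)
    goe.append(nn_spacings(ev[100:300]))            # bulk only, where the density is roughly flat
    iid.append(nn_spacings(rng.uniform(size=200)))  # i.i.d. points: Poisson statistics, no repulsion

goe, iid = np.concatenate(goe), np.concatenate(iid)
plt.hist(goe, bins=60, density=True, alpha=0.5, label='GOE spacings (repulsion)')
plt.hist(iid, bins=60, density=True, alpha=0.5, label='i.i.d. spacings (no repulsion)')
plt.xlabel('normalized spacing s'); plt.ylabel('density'); plt.legend(); plt.show()
```

The GOE histogram vanishes at $s = 0$ (the Wigner-surmise shape), whereas the i.i.d. histogram is maximal there; the unfolding by the mean spacing is only approximate, which is fine for a qualitative comparison.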
2. The Stieltjes Transform
The primary tool for analyzing spectral densities (solving for the equilibrium of the gas) is the Stieltjes Transform. Let $\mu$ be a probability measure on $\mathbb{R}$ (the spectral density). The transform is defined on the complex upper half-plane $\mathbb{C}^+ = \{z : \operatorname{Im} z > 0\}$:
$$m_\mu(z) = \int_{\mathbb{R}} \frac{d\mu(\lambda)}{\lambda - z}$$
Why this transform?
- Resolvent Connection: For an $N \times N$ matrix $A$ with eigenvalues $\lambda_1, \dots, \lambda_N$ and empirical spectral measure $\mu_A$, $m_{\mu_A}(z) = \frac{1}{N}\operatorname{Tr}(A - zI)^{-1} = \frac{1}{N}\sum_i \frac{1}{\lambda_i - z}$. This links linear algebra (trace of an inverse) to analysis.
- Inversion Formula: We can recover the density from $m_\mu$ via the Plemelj-Sokhotski formula:
$$\rho(x) = \frac{1}{\pi} \lim_{\eta \to 0^+} \operatorname{Im}\, m_\mu(x + i\eta)$$
- Stability: The transform handles the “fluctuations” of finite $N$ much better than the density itself, allowing for rigorous limit proofs (see the numerical check below).
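A quick numerical check of the resolvent connection and the inversion formula, sketched for a Wigner matrix (the matrix size and the smoothing parameter $\eta$ are arbitrary choices): we evaluate $m_N(x + i\eta) = \frac{1}{N}\operatorname{Tr}(H - (x+i\eta)I)^{-1}$ just above the real axis and compare $\operatorname{Im} m_N / \pi$ with the eigenvalue histogram.

```python
import numpy as np
import matplotlib.pyplot as plt

N, eta = 2000, 0.05                               # eta: small distance above the real axis
A = np.random.randn(N, N)
H = (A + A.T) / np.sqrt(2 * N)                    # Wigner matrix, spectrum ~ [-2, 2]
ev = np.linalg.eigvalsh(H)

xs = np.linspace(-2.5, 2.5, 400)
# m_N(x + i*eta) = (1/N) Tr (H - (x + i*eta) I)^{-1}, computed from the eigenvalues for speed
m = np.array([np.mean(1.0 / (ev - x - 1j * eta)) for x in xs])
rho = m.imag / np.pi                              # Plemelj-Sokhotski inversion

plt.hist(ev, bins=60, density=True, alpha=0.4, label='eigenvalue histogram')
plt.plot(xs, rho, 'r', label=r'Im $m_N(x+i\eta)\,/\,\pi$')
plt.legend(); plt.show()
```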
3. Derivation I: The Wigner Semicircle Law
Let $H$ be an $N \times N$ symmetric Wigner matrix with entries $H_{ij} = X_{ij}/\sqrt{N}$, where the $X_{ij}$ (for $i \le j$) are i.i.d. with zero mean and unit variance. We assume the existence of a limiting spectral distribution and find its Stieltjes transform $m(z)$.
The Schur Complement Trick (Leave-One-Out)
Let $G(z) = (H - zI)^{-1}$ be the resolvent matrix. By symmetry, the average trace $\frac{1}{N}\operatorname{Tr} G(z)$ converges to the same limit as any diagonal element, say $G_{11}(z)$. Decompose $H$ to isolate the first row/column:
$$H = \begin{pmatrix} H_{11} & h^\top \\ h & H^{(1)} \end{pmatrix}$$
where $h$ is the first column with its first entry removed and $H^{(1)}$ is the minor obtained by deleting the first row and column.
Using the Schur Complement formula for block matrix inversion:
$$G_{11}(z) = \frac{1}{H_{11} - z - h^\top \big(H^{(1)} - zI\big)^{-1} h}$$
As $N \to \infty$:
- $H_{11} = O(N^{-1/2}) \to 0$.
- The quadratic form concentrates around its mean via the Trace Lemma:
$$h^\top \big(H^{(1)} - zI\big)^{-1} h \;\approx\; \frac{1}{N}\operatorname{Tr}\big(H^{(1)} - zI\big)^{-1} \;\to\; m(z)$$
This yields the self-consistent equation:
$$m(z) = \frac{1}{-z - m(z)} \quad\Longleftrightarrow\quad m(z)^2 + z\, m(z) + 1 = 0$$
Solving the quadratic gives $m(z) = \frac{-z + \sqrt{z^2 - 4}}{2}$ (taking the branch with $\operatorname{Im}\, m(z) > 0$ on $\mathbb{C}^+$). The density is recovered from the imaginary part on the cut $[-2, 2]$:
$$\rho_{\mathrm{sc}}(x) = \frac{1}{2\pi}\sqrt{4 - x^2}, \qquad |x| \le 2.$$
4. Derivation II: The Marchenko-Pastur Law
Now consider the Sample Covariance Matrix $S = \frac{1}{n} X X^\top$, where $X$ is $p \times n$ with i.i.d. entries of zero mean and unit variance. Let $\gamma = p/n$. We want the spectrum of $S$. Using a similar block decomposition (or the Sherman-Morrison formula) on the resolvent of $S$, we arrive at a more complex fixed-point equation: the diagonal element of the resolvent relates to the normalized trace of the resolvent of the remaining minor. The self-consistent equation for the Stieltjes transform is:
$$m(z) = \frac{1}{1 - \gamma - z - \gamma z\, m(z)}$$
This transforms into the quadratic equation:
$$\gamma z\, m(z)^2 + (z + \gamma - 1)\, m(z) + 1 = 0$$
The Solution Structure
The roots
$$m(z) = \frac{-(z + \gamma - 1) \pm \sqrt{(z + \gamma - 1)^2 - 4\gamma z}}{2\gamma z}$$
yield the Stieltjes transform. The density is non-zero when the discriminant is negative (complex roots).
The boundaries of the support are the roots of the discriminant: $\lambda_{\pm} = (1 \pm \sqrt{\gamma})^2$. The resulting density is:
$$\rho_{\mathrm{MP}}(x) = \frac{\sqrt{(\lambda_+ - x)(x - \lambda_-)}}{2\pi \gamma x}, \qquad x \in [\lambda_-, \lambda_+].$$
Key Insight:
- If $\gamma = 1$ (square matrix), the support is $[0, 4]$. The density diverges at 0 as $1/\sqrt{x}$.
- If $\gamma < 1$ (tall matrix, $p < n$), the support is bounded away from 0.
- If $\gamma > 1$ (wide matrix, $p > n$), there is a point mass at 0 with weight $1 - 1/\gamma$, representing the rank deficiency (see the quick check below).
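A minimal check of the $\gamma > 1$ point mass (a sketch with Gaussian entries; the sizes are arbitrary): with $p > n$ the sample covariance has exactly $p - n$ zero eigenvalues, i.e. a mass $1 - 1/\gamma$ at the origin.

```python
import numpy as np

p, n = 1200, 400                                  # gamma = p/n = 3 > 1
X = np.random.randn(p, n)
S = X @ X.T / n
ev = np.linalg.eigvalsh(S)
n_zero = int(np.sum(ev < 1e-8))                   # numerically zero eigenvalues
print(f"zero eigenvalues: {n_zero} of {p} (expected {p - n})")
print(f"empirical mass at 0: {n_zero / p:.3f}   prediction 1 - 1/gamma = {1 - n / p:.3f}")
```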
5. The BBP Phase Transition
What if $X$ is not pure noise? Signal model: the data contain a rank-one component along a unit direction $u$ with strength $\theta > 0$. Covariance: $\Sigma = I_p + \theta\, u u^\top$. Does the top eigenvalue of the sample covariance $S$ reflect $\theta$? Theorem (Baik, Ben Arous, Péché): There is a critical signal-to-noise ratio $\theta_c = \sqrt{\gamma}$.
- Weak Signal ($\theta \le \sqrt{\gamma}$): The top eigenvalue “sticks” to the bulk edge, $\lambda_1 \to (1 + \sqrt{\gamma})^2$. The empirical eigenvector $\hat{u}$ is asymptotically orthogonal to the true $u$: $|\langle \hat{u}, u \rangle|^2 \to 0$. Implication: PCA returns pure noise. You learn nothing.
- Strong Signal ($\theta > \sqrt{\gamma}$): The top eigenvalue “pops out” as an outlier, $\lambda_1 \to (1 + \theta)\big(1 + \tfrac{\gamma}{\theta}\big) > (1 + \sqrt{\gamma})^2$. The overlap becomes non-zero:
$$|\langle \hat{u}, u \rangle|^2 \;\to\; \frac{1 - \gamma/\theta^2}{1 + \gamma/\theta}.$$
This implies a fundamental limit of detection. No algorithm can linearly recover the spike below this transition.
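A sketch of the strong-signal regime under the spiked-covariance model (the function name and parameters are illustrative; the construction uses $\Sigma^{1/2} = I + (\sqrt{1+\theta}-1)\,u u^\top$): it compares the empirical top eigenvalue and squared overlap against the asymptotic predictions above.

```python
import numpy as np

def bbp_overlap_experiment(p=1000, n=2000, theta=1.5, seed=0):
    rng = np.random.default_rng(seed)
    gamma = p / n
    u = rng.standard_normal(p); u /= np.linalg.norm(u)        # true spike direction
    # Spiked covariance Sigma = I + theta * u u^T, so X = Sigma^{1/2} Z
    Z = rng.standard_normal((p, n))
    X = Z + (np.sqrt(1 + theta) - 1) * np.outer(u, u @ Z)
    S = X @ X.T / n
    evals, evecs = np.linalg.eigh(S)
    u_hat = evecs[:, -1]                                      # top sample eigenvector
    overlap = float(u_hat @ u) ** 2
    above = theta > np.sqrt(gamma)
    pred_lambda = (1 + theta) * (1 + gamma / theta) if above else (1 + np.sqrt(gamma)) ** 2
    pred_overlap = (1 - gamma / theta**2) / (1 + gamma / theta) if above else 0.0
    print(f"top eigenvalue : {evals[-1]:.3f}   predicted: {pred_lambda:.3f}")
    print(f"overlap^2      : {overlap:.3f}   predicted: {pred_overlap:.3f}")

bbp_overlap_experiment()
```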
6. The Tracy-Widom Distribution
In the BBP discussion, we saw that the “Bulk Edge” is at $\lambda_+ = (1 + \sqrt{\gamma})^2$. But for finite $N$, the largest eigenvalue fluctuates around this limit. What is the distribution of these fluctuations? For sums of i.i.d. random variables, the CLT gives us a Gaussian with $O(N^{-1/2})$ fluctuations. For eigenvalues, the strong repulsion creates a stiffer system. The fluctuations are much smaller, $O(N^{-2/3})$, and non-Gaussian. Theorem (Tracy-Widom, 1994):
$$\mathbb{P}\!\left(\frac{\lambda_{\max} - \lambda_+}{\sigma\, N^{-2/3}} \le s\right) \;\longrightarrow\; F_\beta(s),$$
where $\sigma$ is a model-dependent scale constant and $F_\beta$ is the Tracy-Widom distribution for the corresponding Dyson index $\beta$.
The distribution function is defined through the Hastings-McLeod solution of the Painlevé II differential equation. For $\beta = 2$:
$$F_2(s) = \exp\!\Big(-\int_s^\infty (x - s)\, q(x)^2\, dx\Big), \qquad q''(x) = x\, q(x) + 2\, q(x)^3,$$
with the boundary condition $q(x) \sim \operatorname{Ai}(x)$ as $x \to +\infty$.
This distribution is universal. It appears in:
- The longest increasing subsequence of random permutations (Ulam’s Problem).
- The height fluctuations of the KPZ equation (stochastic interface growth).
- The bus-bunching phenomenon in transport systems.

The distribution is asymmetric, with a “soft tail” extending to the right (outliers) and a hard wall on the left (pushing against the bulk).
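A rough empirical sketch of these edge fluctuations, with no Painlevé solver (moderate GOE sizes, so finite-size corrections remain visible): collect the largest eigenvalue over many draws, rescale by $N^{2/3}$, and check that the sample mean is negative and the skew is positive.

```python
import numpy as np

def edge_fluctuations(N=200, trials=1000, seed=0):
    """N^{2/3}-rescaled fluctuations of the largest GOE eigenvalue around the edge at 2."""
    rng = np.random.default_rng(seed)
    samples = np.empty(trials)
    for t in range(trials):
        A = rng.standard_normal((N, N))
        H = (A + A.T) / np.sqrt(2 * N)            # GOE, bulk edge at +2
        samples[t] = N ** (2 / 3) * (np.linalg.eigvalsh(H)[-1] - 2.0)
    return samples

s = edge_fluctuations()
skewness = np.mean((s - s.mean()) ** 3) / s.std() ** 3
print(f"mean rescaled fluctuation: {s.mean():.3f} (negative: the edge is pushed into the bulk)")
print(f"skewness: {skewness:.3f} (positive: soft right tail, hard left wall)")
```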
7. Free Probability and the R-Transform
How do we compute the spectrum of a sum of two random matrices $C = A + B$? If $A$ and $B$ commute (share eigenvectors), their eigenvalues simply sum. If they are random rotations of each other (e.g. $B = U D U^\dagger$ with $U$ Haar-distributed), they are Free. Free Independence is the non-commutative analog of statistical independence. Voiculescu introduced the R-Transform (analogous to the log-characteristic function of classical probability) to linearize free convolution.
The Algorithm
- Compute the Stieltjes transform $m_A(z)$ of the density $\rho_A$; in free probability it is convenient to work with the Cauchy transform $G_A(z) = -m_A(z) = \int \frac{\rho_A(\lambda)}{z - \lambda}\, d\lambda$.
- Compute the functional inverse $G_A^{-1}(w)$, i.e. solve $G_A(z) = w$ for $z$.
- The R-transform is $R_A(w) = G_A^{-1}(w) - \frac{1}{w}$.
- Additivity: $R_{A+B}(w) = R_A(w) + R_B(w)$ when $A$ and $B$ are free.
- Invert back: Find $G_{A+B}$ by solving $G_{A+B}^{-1}(w) = R_{A+B}(w) + \frac{1}{w}$, then recover the density via the inversion formula.
Example: the sum of two free Wigner matrices is again a Wigner matrix (semicircle $\boxplus$ semicircle = semicircle, with variances adding; see the worked example below). Example: sum of noise + signal. We can predict the exact shape of the deformed spectrum. This is crucial for analyzing ResNets (Identity + Random Weight).
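As a worked example of the algorithm (using the Cauchy-transform convention $G = -m$ from the steps above), take the semicircle law of variance $\sigma^2$:
$$G(z) = \frac{z - \sqrt{z^2 - 4\sigma^2}}{2\sigma^2} \;\Longrightarrow\; G^{-1}(w) = \sigma^2 w + \frac{1}{w} \;\Longrightarrow\; R(w) = \sigma^2 w.$$
By additivity, two free semicircular matrices with variances $\sigma_A^2$ and $\sigma_B^2$ give $R_{A+B}(w) = (\sigma_A^2 + \sigma_B^2)\, w$, which is again the R-transform of a semicircle, now with variance $\sigma_A^2 + \sigma_B^2$.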
8. Numerical Simulation (Python)
Let’s visualize the MP law and the BBP transition.
```python
import numpy as np
import matplotlib.pyplot as plt

def marchenko_pastur_pdf(x, gamma, sigma=1):
    lambda_plus = sigma**2 * (1 + np.sqrt(gamma))**2
    lambda_minus = sigma**2 * (1 - np.sqrt(gamma))**2
    # PDF only defined within support
    with np.errstate(divide='ignore', invalid='ignore'):
        y = np.sqrt((lambda_plus - x) * (x - lambda_minus)) / (2 * np.pi * gamma * x * sigma**2)
    y[x < lambda_minus] = 0
    y[x > lambda_plus] = 0
    return y

def simulate_eigenvalues(p=1000, n=2000, signal_strength=0.0):
    gamma = p / n
    # Noise
    Z = np.random.randn(p, n)
    # Signal (Rank 1 spike)
    if signal_strength > 0:
        u = np.random.randn(p, 1); u /= np.linalg.norm(u)
        v = np.random.randn(n, 1); v /= np.linalg.norm(v)
        X = np.sqrt(signal_strength) * u @ v.T * np.sqrt(n) + Z
    else:
        X = Z
    S = (1/n) * X @ X.T
    evals = np.linalg.eigvalsh(S)

    # Plot Bulk
    plt.figure(figsize=(10, 6))
    plt.hist(evals, bins=80, density=True, alpha=0.5, label='Empirical')
    x = np.linspace(0, (1 + np.sqrt(gamma))**2 + 1, 1000)
    pdf = marchenko_pastur_pdf(x, gamma)
    plt.plot(x, pdf, 'r', linewidth=2, label='Marchenko-Pastur')

    # Edge Fluctuations (Tracy-Widom scaling)
    lambda_plus = (1 + np.sqrt(gamma))**2
    max_eval = np.max(evals)
    scaling_factor = lambda_plus * (n**(-2/3))  # order-of-magnitude scaling; exact TW constant omitted
    tw_fluctuation = (max_eval - lambda_plus) / scaling_factor
    print(f"Largest Eigenvalue: {max_eval:.4f}")
    print(f"Edge Prediction: {lambda_plus:.4f}")
    print(f"Normalized Fluctuation (should be O(1)): {tw_fluctuation:.4f}")

    # BBP Prediction
    threshold = np.sqrt(gamma)
    if signal_strength > threshold:
        predicted_spike = (1 + signal_strength) * (1 + gamma/signal_strength)
        plt.axvline(predicted_spike, color='g', linestyle='--', label=f'BBP Prediction ({predicted_spike:.2f})')
    plt.axvline(lambda_plus, color='k', linestyle='--', label='Bulk Edge')
    plt.legend()
    plt.show()
    return evals

# To run:
# simulate_eigenvalues(p=1000, n=2000, signal_strength=1.5)
```
9. Universality and Deep Learning
Why do these laws matter if our data isn’t Gaussian? Universality Theorem: As long as the entries are i.i.d. with mean 0 and variance 1 (and finite higher moments), the spectral statistics (Wigner Semicircle, Marchenko-Pastur, Tracy-Widom) are identical to the Gaussian case as $N \to \infty$. This is the spectral analogue of the Central Limit Theorem.

Application: Neural Network Initialization. In Deep Learning, we initialize weights $W_{ij} \sim \mathcal{N}(0, \sigma_w^2/n)$. The Jacobian $J$ of a deep network is a product of many random matrices. For the gradient to flow without vanishing or exploding, the singular values of $J$ must be concentrated near 1. RMT tells us that for a random matrix $W$ with this scaling, the squared singular values are distributed according to Marchenko-Pastur. To ensure the edge of the spectrum stays near 1 (dynamical isometry), we need careful scaling of variances (Xavier/He Initialization) and potentially orthogonal initialization, which trivialises the spectrum to a Dirac mass at 1 at initialization.

The Hessian of the loss surface is also a large random matrix. Its spectrum often consists of a “bulk” (noise) plus “spikes” (signal). This validates the BBP model as a relevant proxy for training dynamics.
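A small sketch of the isometry point under a linear-network assumption (no nonlinearities; the width, depth, and function name are illustrative): compare the singular values of a product of i.i.d. Gaussian layer Jacobians against a product of orthogonal matrices.

```python
import numpy as np

def product_jacobian_svals(width=500, depth=20, init="gaussian", seed=0):
    """Singular values of a depth-fold product of layer Jacobians (linear network)."""
    rng = np.random.default_rng(seed)
    J = np.eye(width)
    for _ in range(depth):
        if init == "gaussian":
            W = rng.standard_normal((width, width)) / np.sqrt(width)  # Xavier-style variance 1/width
        else:
            W, _ = np.linalg.qr(rng.standard_normal((width, width)))  # orthogonal initialization
        J = W @ J
    return np.linalg.svd(J, compute_uv=False)

for init in ("gaussian", "orthogonal"):
    s = product_jacobian_svals(init=init)
    print(f"{init:>10}: s_max = {s.max():.2e}, s_min = {s.min():.2e}, condition = {s.max()/s.min():.2e}")
```

With Gaussian layers the smallest singular values collapse toward zero and the condition number explodes with depth, while the orthogonal product keeps every singular value at exactly 1.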
10. Conclusion
Random Matrix Theory provides a powerful lens for understanding the behavior of complex, high-dimensional systems. From the initial motivation of modeling heavy nuclei physics to modern applications in deep learning and wireless communications, the universality of eigenvalue statistics allows us to ignore the microscopic details of the noise and focus on the macroscopic structure.
The Marchenko-Pastur law gives us the “shape of noise,” defining the rigid boundary between signal and randomness. The BBP transition tells us exactly when a signal is strong enough to be detected. And the Tracy-Widom distribution quantifies the subtle fluctuations at the edge of this boundary.
As we move towards understanding larger neural networks and more complex datasets, RMT will likely play an even more central role, providing the rigorous limit laws that justify our empirical observations.
Historical Timeline
| Year | Event | Significance |
|---|---|---|
| 1928 | Wishart Distribution | John Wishart derives the distribution of sample covariance matrices. |
| 1955 | Wigner Semicircle Law | Eugene Wigner models heavy nuclei resonance using random matrices. |
| 1967 | Marchenko-Pastur Law | Derivation of the limit spectral density for large covariance matrices. |
| 1984 | Girko’s Circular Law | Proof of the universal law for non-Hermitian matrices. |
| 1991 | Free Probability | Dan Voiculescu founds the theory of free random variables. |
| 1994 | Tracy-Widom Distribution | Discovery of the universal law governing edge fluctuations. |
| 2005 | BBP Phase Transition | Baik, Ben Arous, and Péché characterize the signal detection threshold. |
| 2010s | Deep Learning Era | RMT applied to loss surfaces (Pennington) and initialization (Saxe, Ganguli). |
Appendix A: The Combinatorial Method of Moments
An alternative to the Stieltjes transform relies on trace powers (the moments of the spectral density):
$$M_k = \int x^k \rho(x)\, dx = \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}\big[\operatorname{Tr} H^k\big] = \lim_{N \to \infty} \frac{1}{N} \sum_{i_1, \dots, i_k} \mathbb{E}\big[H_{i_1 i_2} H_{i_2 i_3} \cdots H_{i_k i_1}\big]$$
This sum looks like a path on a graph with $N$ vertices. The indices $i_1 \to i_2 \to \cdots \to i_k \to i_1$ form a closed loop of length $k$. Since the entries are independent with mean 0, only terms where every edge is traversed at least twice have non-zero expectation.
- $k$ is odd: Impossible to traverse every edge twice. $M_k = 0$.
- $k$ is even ($k = 2j$): Dominant terms correspond to “Trees” where every edge is traversed exactly twice (down and up). The number of such “Non-Crossing” paths is counted by the Catalan Numbers:
$$M_{2j} = C_j = \frac{1}{j+1}\binom{2j}{j} = 1,\, 2,\, 5,\, 14,\, \dots$$
The moments of the Wigner Semicircle are exactly the Catalan numbers!
Since the distribution has compact support, the moments uniquely determine the measure (Carleman’s Condition). For Marchenko-Pastur, the combinatorics involve “non-crossing partitions” of the cycle, counted by Narayana Numbers.
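A quick numerical check of the moment computation (a single large Wigner draw; the size is arbitrary): the empirical moments $\frac{1}{N}\operatorname{Tr} H^{2j}$ should sit close to the Catalan numbers $1, 2, 5, 14, \dots$

```python
import numpy as np
from math import comb

N = 3000
A = np.random.randn(N, N)
H = (A + A.T) / np.sqrt(2 * N)                    # Wigner matrix with unit-variance entries
ev = np.linalg.eigvalsh(H)

for j in range(1, 5):
    empirical = np.mean(ev ** (2 * j))            # (1/N) Tr H^{2j}
    catalan = comb(2 * j, j) // (j + 1)
    print(f"M_{2*j}: empirical {empirical:.3f}   Catalan C_{j} = {catalan}")
```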
Appendix B: Large Deviation Principles
The Wigner Semicircle is just the average behavior. What is the probability of observing a different empirical spectral density $\mu$? RMT satisfies a Large Deviation Principle (LDP) with speed $N^2$:
$$\mathbb{P}\big(\mu_N \approx \mu\big) \approx e^{-N^2 I(\mu)}$$
The rate function is the “Ben Arous-Guionnet” entropy (up to $\beta$-dependent constants):
$$I(\mu) = \int \frac{x^2}{4}\, d\mu(x) \;-\; \frac{1}{2}\iint \log|x - y|\, d\mu(x)\, d\mu(y) \;+\; \mathrm{const.}$$
This is purely Electrostatic Energy:
- Potential Energy $\int \frac{x^2}{4}\, d\mu$: Cost of moving particles far from 0.
- Interaction Energy $-\frac{1}{2}\iint \log|x - y|\, d\mu\, d\mu$: Cost of particles being close together. The Semicircle Law is simply the Ground State (Minimizer) of this energy functional. This explains why the law is so robust (Universality): it’s an energy minimum.
Appendix C: The Harish-Chandra-Itzykson-Zuber (HCIZ) Integral
In Physics and RMT, we often need to compute integrals over the Unitary Group $U(N)$. The most famous is the HCIZ integral, which couples two Hermitian matrices $A$ and $B$:
$$I(A, B; t) = \int_{U(N)} e^{\,t\, \operatorname{Tr}\left(A U B U^\dagger\right)}\, dU$$
This integral is fundamental to the study of matrix models with external fields. Theorem: If $A$ has eigenvalues $a_1 < \cdots < a_N$ and $B$ has $b_1 < \cdots < b_N$, then:
$$I(A, B; t) = \left(\prod_{p=1}^{N-1} p!\right) \frac{\det\big[e^{\,t\, a_i b_j}\big]_{i,j=1}^{N}}{t^{N(N-1)/2}\, \Delta(a)\, \Delta(b)}$$
where $\Delta(a) = \prod_{i<j}(a_j - a_i)$ is the Vandermonde determinant. This formula allows us to solve exactly for the correlations of eigenvalues in coupled matrix models, linking RMT to Integrable Systems and representation theory.
Appendix D: The Circular Law (Non-Hermitian)
What if $A$ is not symmetric? (e.g., the recurrent weight matrix in a Recurrent Neural Network). Let $A$ be $N \times N$ with i.i.d. entries $A_{ij} \sim \mathcal{N}(0, 1/N)$. The eigenvalues are complex and spread out in the complex plane. Theorem (Girko’s Circular Law): As $N \to \infty$, the eigenvalues are uniformly distributed on the unit disk $\{z \in \mathbb{C} : |z| \le 1\}$.
Implication for RNNs:
- If eigenvalues lie outside the unit circle ($|\lambda| > 1$), signals grow exponentially ($\|h_t\| \sim |\lambda|^t$), gradients explode.
- If inside ($|\lambda| < 1$), signals decay exponentially, gradients vanish.
- Use Orthogonal Initialization to place eigenvalues exactly on the unit circle, preserving the gradient norm (isometry); see the sketch below.
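A sketch of the circular law and the RNN comparison (Gaussian entries with variance $1/N$; the size is arbitrary): the eigenvalues fill the unit disk and the spectral radius hovers near 1, whereas an orthogonal matrix places every eigenvalue exactly on the unit circle.

```python
import numpy as np
import matplotlib.pyplot as plt

N = 1000
A = np.random.randn(N, N) / np.sqrt(N)            # i.i.d. entries with variance 1/N
ev = np.linalg.eigvals(A)                         # complex eigenvalues

Q, _ = np.linalg.qr(np.random.randn(N, N))        # orthogonal initialization for comparison
ev_q = np.linalg.eigvals(Q)

print(f"spectral radius (Gaussian)  : {np.abs(ev).max():.3f}   (circular law: -> 1)")
print(f"spectral radius (orthogonal): {np.abs(ev_q).max():.3f}   (all eigenvalues on the unit circle)")

plt.figure(figsize=(5, 5))
plt.scatter(ev.real, ev.imag, s=2, alpha=0.5, label='Gaussian, var 1/N')
t = np.linspace(0, 2 * np.pi, 256)
plt.plot(np.cos(t), np.sin(t), 'r', label='unit circle')
plt.gca().set_aspect('equal'); plt.legend(); plt.show()
```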
Appendix E: Glossary of Terms
- Bulk: The continuous part of the limit spectrum (e.g., Semicircle, Marchenko-Pastur support).
- Edge: The boundary of the support. Fluctuations here are governed by Tracy-Widom.
- Free Probability: A theory of non-commutative random variables where “freeness” replaces independence.
- Stieltjes Transform: The generating function of the moments of the spectral density.
- Universality: The phenomenon where spectral statistics depend only on the moments of the entries, not their specific distribution.
- Wishart Matrix: A product $W = X X^\top$ where $X$ is Gaussian. The random matrix analog of the Chi-Square distribution.
References
1. Anderson, G. W., Guionnet, A., & Zeitouni, O. (2010). “An Introduction to Random Matrices”. The standard mathematical text. Extremely dense, covers Large Deviations and Topological expansions.
2. Marchenko, V. A., & Pastur, L. A. (1967). “Distribution of eigenvalues for some sets of random matrices”. The classic paper. Derived the eponymous law.
3. Baik, J., Ben Arous, G., & Péché, S. (2005). “Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices”. Proved the BBP transition. Critical for signal detection in high noise.
4. Tracy, C. A., & Widom, H. (1994). “Level-spacing distributions and the Airy kernel”. Derived the Tracy-Widom distribution for the fluctuations of the largest eigenvalue. This distribution is now seen everywhere in KPZ physics.
5. Pennington, J., & Bahri, Y. (2017). “Geometry of Neural Network Loss Surfaces via Random Matrix Theory”. Applied RMT to understand the Hessian of deep networks. Showed that critical points are saddles, not minima.
6. Voiculescu, D. V. (1991). “Limit laws for random matrices and free products”. Introduced Free Probability.