Random Matrix Theory & The Marchenko-Pastur Law

If you build a big matrix out of random Gaussian entries $X_{ij} \sim \mathcal{N}(0, 1)$, the eigenvalue distribution has a precise and universal structure.

That structure is not Gaussian. Eigenvalues repel each other like charged particles in a Coulomb gas, spreading out to balance confinement (which pulls them toward 0) against logarithmic repulsion (which pushes them apart). This is very different from the CLT regime where independent draws pile up into a bell curve.

For the Wishart matrix $W = \frac{1}{n} XX^T$, which is the sample covariance of pure noise, the eigenvalues follow the Marchenko-Pastur Law with compact support $[(1-\sqrt{\gamma})^2, (1+\sqrt{\gamma})^2]$. The sharp edge of this support is the key structural feature: eigenvalues inside the bulk look the same as noise, but an eigenvalue poking out beyond the edge shows signal.

This note works out these results using the Stieltjes Transform and Free Probability and then covers the BBP Phase Transition which is the exact threshold where signal becomes detectable above noise.


1. Eigenvalue Repulsion as a Coulomb Gas

The joint density of eigenvalues $\lambda_1, \dots, \lambda_n$ of a Gaussian Ensemble is

P(\lambda_1, \dots, \lambda_n) \propto \prod_{i < j} |\lambda_i - \lambda_j|^\beta \prod_{i=1}^n e^{-\frac{\beta n}{4} \lambda_i^2}

where $\beta = 1$ for real symmetric and $\beta = 2$ for complex Hermitian matrices.

You can rewrite this as $P(\lambda) \propto e^{-\beta H(\lambda)}$ and get

H(\lambda) = \sum_{i=1}^n \frac{n}{4} \lambda_i^2 - \sum_{i < j} \log |\lambda_i - \lambda_j|

This is a 2D Coulomb gas in a harmonic trap where the $\sum_i \lambda_i^2$ term confines and the $-\sum_{i<j} \log|\lambda_i - \lambda_j|$ term repels. As $n \to \infty$ the gas freezes into a deterministic equilibrium and that limit is Wigner's Semicircle for symmetric matrices and Marchenko-Pastur for covariance matrices.
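The repulsion is easy to see numerically. A quick sketch (my illustration, not part of the original derivation): nearest-neighbor spacings of GOE eigenvalues almost never come close together, while i.i.d. (Poisson) points routinely do.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# GOE-like matrix: symmetric Gaussian, scaled so the spectrum is O(1)
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)
evals = np.sort(np.linalg.eigvalsh(H))

# Nearest-neighbor spacings in the middle of the spectrum, normalized to unit mean
bulk = evals[n // 4 : 3 * n // 4]
s = np.diff(bulk)
s = s / s.mean()

# For i.i.d. (Poisson) points, P(spacing < 0.05) would be about 5%.
# Repulsion suppresses tiny spacings (the Wigner surmise vanishes linearly at 0).
frac_tiny = np.mean(s < 0.05)
print(f"fraction of spacings below 0.05: {frac_tiny:.3f}")
```

With i.i.d. points you would expect roughly 5% of spacings below 0.05; the eigenvalues give almost none.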


2. The Stieltjes Transform

Given a spectral density $\mu$ on $\mathbb{R}$ you define on the upper half-plane

m(z) = \int_{\mathbb{R}} \frac{1}{\lambda - z}\, d\mu(\lambda), \quad z \in \mathbb{C},\ \operatorname{Im}(z) > 0

Three reasons this is the right tool.

  1. For a matrix $H$ you get $m_H(z) = \frac{1}{n} \operatorname{Tr}(H - zI)^{-1}$ and algebra turns into analysis.
  2. You pick out the density with Plemelj-Sokhotski, $\rho(\lambda) = \frac{1}{\pi} \lim_{\epsilon \to 0^+} \operatorname{Im}[m(\lambda + i\epsilon)]$.
  3. It tames finite-$n$ fluctuations far better than the density itself and this makes limit proofs tractable.
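Point 2 can be seen numerically (my sketch, with a small finite $\epsilon$ standing in for the limit): evaluate the empirical Stieltjes transform of a Wigner matrix just above the real axis and read off the semicircle density.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Wigner matrix whose spectral density tends to the semicircle on [-2, 2]
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)
evals = np.linalg.eigvalsh(H)

# Empirical Stieltjes transform m_n(z) = (1/n) Tr (H - zI)^{-1}
lam, eps = 0.5, 0.05
z = lam + 1j * eps
m_n = np.mean(1.0 / (evals - z))

rho_est = m_n.imag / np.pi                     # Plemelj-Sokhotski extraction
rho_true = np.sqrt(4 - lam**2) / (2 * np.pi)   # semicircle density at lam
print(rho_est, rho_true)
```

The two numbers agree to a couple of percent; the error shrinks as $n$ grows and $\epsilon$ is sent to 0 more slowly than $1/n$.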

3. Wigner Semicircle

Let $W_n$ be $n \times n$ symmetric with $W_{ij} = X_{ij}/\sqrt{n}$ and entries i.i.d. with mean 0 and variance 1.

Take the resolvent $G(z) = (W_n - zI)^{-1}$. By symmetry the average trace lines up with any diagonal element and so we pick $G_{11}$.

Block-decompose and get

W_n = \begin{pmatrix} W_{11} & \mathbf{w}^T \\ \mathbf{w} & W^{(n-1)} \end{pmatrix}

The Schur complement gives us

G_{11} = \frac{1}{W_{11} - z - \mathbf{w}^T (W^{(n-1)} - zI)^{-1} \mathbf{w}}

Now send $n \to \infty$: $W_{11} = X_{11}/\sqrt{n}$ drops to 0 and the quadratic form concentrates by the Trace Lemma,

\mathbf{w}^T G^{(n-1)} \mathbf{w} \approx \frac{1}{n} \operatorname{Tr}(G^{(n-1)}) = m_{n-1}(z) \approx m(z)

And so we get a self-consistent equation

m(z) = \frac{1}{-z - m(z)}

Solving $m^2 + zm + 1 = 0$ gives $m(z) = \frac{-z + \sqrt{z^2 - 4}}{2}$, taking the branch with $\operatorname{Im}\, m(z) > 0$ on the upper half-plane.

You pick off the density from the imaginary part on $[-2, 2]$,

\rho_{sc}(\lambda) = \frac{1}{2\pi} \sqrt{4 - \lambda^2}

The argument bootstraps because the resolvent sits on a smaller copy of itself and in the limit the recursion closes into a fixed-point equation.
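The fixed-point equation can be checked directly (a small sketch of mine): iterate $m \mapsto 1/(-z - m)$ from any starting point in the upper half-plane and compare with the closed form.

```python
import numpy as np

# Solve m = 1/(-z - m) by fixed-point iteration in the upper half-plane
z = 0.7 + 0.1j
m = 1j                       # any starting point with Im(m) > 0
for _ in range(500):
    m = 1.0 / (-z - m)

# Closed form m(z) = (-z + sqrt(z^2 - 4)) / 2, picking the root with Im > 0
closed = (-z + np.sqrt(z**2 - 4 + 0j)) / 2
if closed.imag < 0:
    closed = (-z - np.sqrt(z**2 - 4 + 0j)) / 2
print(m, closed)
```

The map preserves the upper half-plane (if $\operatorname{Im}\, m > 0$ then $\operatorname{Im}(-z - m) < 0$, so the reciprocal is back in the upper half-plane), and the iteration contracts onto the Stieltjes branch.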


4. Marchenko-Pastur

Now let $S_n = \frac{1}{n} XX^T$ where $X$ is $p \times n$ with i.i.d. entries and $p/n \to \gamma$.

A similar block decomposition on the resolvent of SnS_n gives a different fixed point,

m(z) = \frac{1}{1 - \gamma - z - \gamma z\, m(z)}

which rearranges to

\gamma z m^2 + (z + \gamma - 1) m + 1 = 0

The density lives where the discriminant drops below zero,

\Delta(z) = (z + \gamma - 1)^2 - 4\gamma z < 0

The roots of the discriminant sit at $z_{\pm} = (1 \pm \sqrt{\gamma})^2$ and the density is

\rho_{MP}(\lambda) = \frac{1}{2\pi \gamma \lambda} \sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)} \quad \text{for } \lambda \in [\lambda_-, \lambda_+]

The behavior near γ=1\gamma = 1 is worth a careful look.

  • When $\gamma = 1$ (square) the support is $[0, 4]$ and the density blows up at 0 like $\lambda^{-1/2}$.
  • When $\gamma < 1$ (tall) the support sits bounded away from 0.
  • When $\gamma > 1$ (wide) there is a point mass at 0 with weight $1 - 1/\gamma$ and this sits on the null space directions.
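The wide-case point mass is just rank counting, and it is easy to confirm (my quick check): with $p > n$ the matrix $S_n = \frac{1}{n}XX^T$ has rank at most $n$, so a fraction $1 - n/p = 1 - 1/\gamma$ of its eigenvalues are exactly zero.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 600, 300            # gamma = 2 > 1: wide case
X = rng.standard_normal((p, n))
S = X @ X.T / n
evals = np.linalg.eigvalsh(S)

# S has rank at most n, so p - n eigenvalues are (numerically) zero:
# a point mass at 0 with weight 1 - 1/gamma = 0.5 here.
zero_frac = np.mean(np.abs(evals) < 1e-8)
print(zero_frac)  # 0.5
```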

5. The BBP Phase Transition

Suppose the data isn’t pure noise. The signal model is $X = \sqrt{\theta}\, u v^T + Z$ with $\Sigma = I + \theta u u^T$, and the question is whether the top eigenvalue of $S_n$ sees $\theta$.

Theorem (Baik, Ben Arous, Peche). There is a critical threshold $\theta_c = \sqrt{\gamma}$.

Below threshold ($\theta < \theta_c$). The top eigenvalue sticks to the bulk edge $(1+\sqrt{\gamma})^2$ and the empirical eigenvector $\hat{u}$ sits asymptotically orthogonal to the true $u$ with $\langle \hat{u}, u \rangle \to 0$. PCA fails completely and this is a fundamental wall because no linear method can pick out the spike.

Above threshold ($\theta > \theta_c$). The top eigenvalue breaks free,

\lambda_{top} \to (1+\theta)\left(1 + \frac{\gamma}{\theta}\right)

and the overlap goes nonzero,

|\langle \hat{u}, u \rangle|^2 \to \frac{1 - \gamma/\theta^2}{1 + \gamma/\theta}

This is a hard detection limit. Below $\theta_c$ the signal is invisible to spectral methods and no linear method can pick it out. But in practice people frequently try to pull signal from eigenvalues well inside the bulk.


6. Tracy-Widom Fluctuations

We said the bulk edge sits at $\lambda_+ = (1+\sqrt{\gamma})^2$ but for finite $n$ the largest eigenvalue jitters around that limit.

For sums of i.i.d. variables the CLT gives Gaussian fluctuations at scale $n^{-1/2}$. Eigenvalue repulsion stiffens the system and so fluctuations come in at $O(n^{-2/3})$ which is much smaller and the limiting shape is non-Gaussian.

Theorem (Tracy-Widom, 1994).

\lambda_{max} \approx \lambda_+ + \frac{\sigma_\gamma}{n^{2/3}}\, \mathbb{TW}_\beta

where $\sigma_\gamma$ is an explicit scale constant; for the white Wishart case $\sigma_\gamma = (1+\sqrt{\gamma})(1 + 1/\sqrt{\gamma})^{1/3}$.

The CDF $F_\beta(s) = P(\mathbb{TW}_\beta \le s)$ falls out of the Painlevé II equation,

q''(s) = s\, q(s) + 2 q(s)^3

This distribution shows up in settings with no obvious tie to random matrices.

  • Longest increasing subsequence of random permutations.
  • Height fluctuations of KPZ interfaces.
  • Bus bunching in transit systems (yes, really).

The distribution is asymmetric and it has a soft tail on the right (where outliers can happen) and a hard wall on the left (where the bulk pushes back).
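A rough numerical check of the $n^{-2/3}$ scale (my sketch; at $n = 200$ finite-size corrections are still visible, but the rescaled fluctuation is clearly $O(1)$ and centered below 0, as $\mathbb{TW}_1$ is):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 200, 200

# The largest eigenvalue of a Wigner matrix fluctuates around the edge 2
# at scale n^(-2/3), much tighter than the CLT's n^(-1/2).
tops = np.empty(trials)
for t in range(trials):
    A = rng.standard_normal((n, n))
    H = (A + A.T) / np.sqrt(2 * n)
    tops[t] = np.linalg.eigvalsh(H)[-1]

fluct = (tops - 2.0) * n ** (2 / 3)   # should be O(1), TW_1-distributed
print(fluct.mean(), fluct.std())
```

For reference, $\mathbb{TW}_1$ has mean about $-1.21$ and standard deviation about $1.27$.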


7. Free Probability and the R-Transform

Suppose you want the spectrum of $A + B$ where both are large random matrices. If they share eigenvectors then eigenvalues add pointwise but if they sit in “generic position” they are free in Voiculescu’s sense and this is the noncommutative analogue of independence.

The R-Transform linearizes free convolution into addition the same way the log-characteristic function linearizes classical convolution.

Here is how it works.

  1. Get the Stieltjes transform $G_A(z)$.
  2. Functionally invert it to get $G_A^{-1}(w)$.
  3. Then $R_A(w) = G_A^{-1}(w) - \frac{1}{w}$.
  4. Free convolution turns into addition and $R_{A+B}(w) = R_A(w) + R_B(w)$.
  5. Invert back by solving $R_{A+B}(G(z)) + 1/G(z) = z$ for $G_{A+B}$.

Semicircle plus semicircle is semicircle. And more practically, for a ResNet layer computing $x + Wx$ the spectrum of $I + W$ is the free convolution of a Dirac at 1 with Marchenko-Pastur and you can predict the spectral shape exactly.
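The semicircle-plus-semicircle claim is easy to verify numerically (my sketch): the unit semicircle has $R(w) = w$, so the free sum of two has $R(w) = 2w$, a semicircle with variance 2 and support $[-2\sqrt{2}, 2\sqrt{2}]$.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1500

def wigner(n, rng):
    A = rng.standard_normal((n, n))
    return (A + A.T) / np.sqrt(2 * n)   # semicircle on [-2, 2], variance 1

# Two independent Wigner matrices are asymptotically free, so the sum
# is a semicircle with variance 2 and support [-2*sqrt(2), 2*sqrt(2)].
evals = np.linalg.eigvalsh(wigner(n, rng) + wigner(n, rng))
print(evals.min(), evals.max())   # ~ -2.83 and ~ 2.83
print(np.var(evals))              # ~ 2
```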


8. Simulation

```python
import numpy as np
import matplotlib.pyplot as plt

def marchenko_pastur_pdf(x, gamma, sigma=1):
    lambda_plus = sigma**2 * (1 + np.sqrt(gamma))**2
    lambda_minus = sigma**2 * (1 - np.sqrt(gamma))**2
    # PDF only defined within support
    with np.errstate(divide='ignore', invalid='ignore'):
        y = np.sqrt((lambda_plus - x) * (x - lambda_minus)) / (2 * np.pi * gamma * x * sigma**2)
    y[x < lambda_minus] = 0
    y[x > lambda_plus] = 0
    return y

def simulate_eigenvalues(p=1000, n=2000, signal_strength=0.0):
    gamma = p / n
    # Noise
    Z = np.random.randn(p, n)
    # Signal (Rank 1 spike)
    if signal_strength > 0:
        u = np.random.randn(p, 1); u /= np.linalg.norm(u)
        v = np.random.randn(n, 1); v /= np.linalg.norm(v)
        X = np.sqrt(signal_strength) * u @ v.T * np.sqrt(n) + Z
    else:
        X = Z
    S = (1/n) * X @ X.T
    evals = np.linalg.eigvalsh(S)

    # Plot Bulk
    plt.figure(figsize=(10, 6))
    plt.hist(evals, bins=80, density=True, alpha=0.5, label='Empirical')
    x = np.linspace(0, (1 + np.sqrt(gamma))**2 + 1, 1000)
    pdf = marchenko_pastur_pdf(x, gamma)
    plt.plot(x, pdf, 'r', linewidth=2, label='Marchenko-Pastur')

    # Edge Fluctuations (Tracy-Widom scaling)
    lambda_plus = (1 + np.sqrt(gamma))**2
    max_eval = np.max(evals)
    scaling_factor = lambda_plus * (n**(-2/3))
    tw_fluctuation = (max_eval - lambda_plus) / scaling_factor
    print(f"Largest Eigenvalue: {max_eval:.4f}")
    print(f"Edge Prediction: {lambda_plus:.4f}")
    print(f"Normalized Fluctuation (should be O(1)): {tw_fluctuation:.4f}")

    # BBP Prediction
    threshold = np.sqrt(gamma)
    if signal_strength > threshold:
        predicted_spike = (1 + signal_strength) * (1 + gamma/signal_strength)
        plt.axvline(predicted_spike, color='g', linestyle='--',
                    label=f'BBP Prediction ({predicted_spike:.2f})')
    plt.axvline(lambda_plus, color='k', linestyle='--', label='Bulk Edge')
    plt.legend()
    return evals

# To run:
# simulate_eigenvalues(p=1000, n=2000, signal_strength=1.5)
```

9. Universality and Neural Networks

None of the work above actually needs Gaussian entries. The Universality Theorem says that as long as $X_{ij}$ are i.i.d. with mean 0 and variance 1 and finite higher moments the spectral statistics settle to the same limits and this is a kind of spectral CLT.

This matters for deep learning. We pick weights as $W \sim \mathcal{N}(0, 2/n)$ and the Jacobian of a deep network is a product of such matrices and so for stable gradients the singular values have to bunch near 1. Marchenko-Pastur shows where they actually land and this is the theoretical basis for Xavier and He initialization. Orthogonal initialization goes further and collapses the spectrum to a Dirac at 1.
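Both claims are easy to see (my sketch, using variance $1/n$ so the squared singular values follow Marchenko-Pastur with $\gamma = 1$): a Gaussian-initialized square layer has singular values spread over $[0, 2]$, while an orthogonal matrix has every singular value pinned at 1.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 800

# Gaussian init with variance 1/n: squared singular values follow
# Marchenko-Pastur with gamma = 1 (support [0, 4]), so singular values
# spread over [0, 2]: far from the ideal "all near 1".
W = rng.standard_normal((n, n)) / np.sqrt(n)
sv_gauss = np.linalg.svd(W, compute_uv=False)
print(sv_gauss.min(), sv_gauss.max())   # ~0 and ~2

# Orthogonal init: every singular value is exactly 1
Q, R = np.linalg.qr(rng.standard_normal((n, n)))
Q = Q * np.sign(np.diag(R))             # sign fix so Q is Haar-distributed
sv_orth = np.linalg.svd(Q, compute_uv=False)
print(sv_orth.min(), sv_orth.max())     # both ~1
```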

The loss Hessian is also a large random matrix and empirically its spectrum looks like a Marchenko-Pastur bulk plus a few outlier spikes, which is exactly the BBP picture. A 2005 theorem about sample covariance matrices thus describes the training landscape of modern networks with surprising accuracy.


10. Where This Stands

Wigner started this field to model nuclear resonances in the 1950s and seventy years later the same eigenvalue distributions show up in neural network initialization and wireless channel capacity and the KPZ universality class.

Marchenko-Pastur shows the shape of noise, BBP shows when signal is detectable, and Tracy-Widom governs the edge fluctuations. Together these three results give a full picture of the eigenvalues of large random matrices and of any high-dimensional system whose correlation structure you want to understand.


Moments and Catalan combinatorics

There is another route to the Semicircle that dodges the Stieltjes transform entirely and it works by computing moments,

M_k = \frac{1}{n} \mathbb{E}[\operatorname{Tr}(W^k)] = \frac{1}{n} \mathbb{E} \left[ \sum_{i_1, \dots, i_k=1}^n W_{i_1 i_2} W_{i_2 i_3} \cdots W_{i_k i_1} \right]

Think of the indices as a closed walk $i_1 \to i_2 \to \dots \to i_1$ on a graph. Since entries are independent with mean 0 only walks that hit every edge at least twice contribute.

Odd moments wipe out because an odd-length walk cannot hit every edge twice. Even moments $M_{2m}$ count “non-crossing” walks on trees and the answer is the Catalan numbers,

C_m = \frac{1}{m+1} \binom{2m}{m}

And these are exactly the moments of the semicircle,

\int_{-2}^2 x^{2m}\, \frac{1}{2\pi} \sqrt{4-x^2}\, dx = C_m

Compact support together with moment matching gives uniqueness of the distribution by Carleman’s condition.

For Marchenko-Pastur the combinatorics run on non-crossing partitions counted by Narayana numbers.


Spectral large deviations

The semicircle is the typical spectral density and so how unlikely is it to see something else?

There is an LDP at speed $n^2$,

P(\rho_n \approx \mu) \approx e^{-n^2 I(\mu)}

The rate function from Ben Arous and Guionnet is

I(\mu) = \frac{1}{2} \int x^2\, d\mu(x) - \iint \log |x-y|\, d\mu(x)\, d\mu(y) - C

This is the electrostatic energy of a charge distribution where $\int x^2 d\mu$ is the confinement cost and $-\iint \log|x-y|\, d\mu\, d\mu$ is the repulsion cost and the semicircle minimizes it. Universality from this angle means the system sits at an energy minimum and so small pushes on the entry distribution do not move the ground state.


The Harish-Chandra-Itzykson-Zuber integral

Sometimes you need to integrate over the unitary group $U(n)$. The Harish-Chandra-Itzykson-Zuber integral ties together two Hermitian matrices $A$ and $B$,

I(A, B, \tau) = \int_{U(n)} \exp\left( \tau \operatorname{Tr}(A U B U^*) \right) dU

If $A$ has eigenvalues $a_i$ and $B$ has $b_i$ then

I(A, B, \tau) \propto \frac{\det(e^{\tau a_i b_j})_{i,j}}{\Delta(A)\, \Delta(B)}

where $\Delta(A) = \prod_{i<j} (a_i - a_j)$ is the Vandermonde determinant.

This is a clean closed form and it ties RMT to integrable systems and representation theory and lets you solve coupled matrix models exactly.
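For $n = 2$ the formula can be checked by Monte Carlo over Haar-random unitaries (my sketch; the proportionality constant $1/\tau$ used below is the standard normalization for $n = 2$, chosen so that $I \to 1$ as $\tau \to 0$):

```python
import numpy as np

rng = np.random.default_rng(8)

def haar_unitary(n, rng):
    # QR of a complex Ginibre matrix with phase correction (Mezzadri's recipe)
    Z = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
    Q, R = np.linalg.qr(Z)
    d = np.diag(R)
    return Q * (d / np.abs(d))

a = np.array([1.0, -0.5])   # eigenvalues of A
b = np.array([0.7, 0.2])    # eigenvalues of B
A, B = np.diag(a), np.diag(b)
tau = 1.3

# Monte Carlo estimate of the HCIZ integral over U(2)
vals = [np.exp(tau * np.trace(A @ U @ B @ U.conj().T)).real
        for U in (haar_unitary(2, rng) for _ in range(50_000))]
mc = np.mean(vals)

# Closed form for n = 2: det(e^{tau a_i b_j}) / (tau * Delta(a) * Delta(b))
det = np.exp(tau * (a[0]*b[0] + a[1]*b[1])) - np.exp(tau * (a[0]*b[1] + a[1]*b[0]))
exact = det / (tau * (a[0] - a[1]) * (b[0] - b[1]))
print(mc, exact)
```

The Monte Carlo mean lands within a fraction of a percent of the closed form.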


The Circular Law for non-symmetric matrices

Drop the symmetry assumption and let $X$ be $n \times n$ with i.i.d. entries $\mathcal{N}(0, 1/n)$. Now eigenvalues are complex.

Girko’s Circular Law. As $n \to \infty$ eigenvalues spread uniformly on the unit disk,

\rho(\lambda) = \frac{1}{\pi} \mathbb{1}_{|\lambda| \le 1}

For RNNs this matters directly: eigenvalues outside the unit circle fire off exploding gradients and those inside kill gradients. Orthogonal initialization pins them to $|\lambda| = 1$ and that is the sweet spot.
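A quick empirical look at the disk (my sketch): uniformity means $P(|\lambda| \le r) = r^2$.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 1000
X = rng.standard_normal((n, n)) / np.sqrt(n)
evals = np.linalg.eigvals(X)          # complex eigenvalues in general

radii = np.abs(evals)
print(radii.max())                    # spectral radius ~ 1
# Uniform on the unit disk means P(|lambda| <= r) = r^2
print(np.mean(radii <= 0.5))          # ~ 0.25
```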


Key developments

  • 1928. John Wishart derives the distribution of sample covariance matrices, launching the field.
  • 1955. Eugene Wigner models heavy nuclei resonance with random matrices and discovers the semicircle law.
  • 1967. Marchenko and Pastur derive the limiting spectral density for large covariance matrices.
  • 1984. Girko proves the circular law for non-Hermitian matrices.
  • 1991. Dan Voiculescu founds free probability, providing algebraic tools for computing spectra of sums and products.
  • 1994. Tracy and Widom discover the universal distribution governing edge fluctuations.
  • 2005. Baik, Ben Arous, and Peche characterize the signal detection threshold (BBP phase transition).
  • 2010s. RMT applied to deep learning, including loss surfaces (Pennington) and initialization schemes (Saxe, Ganguli).

Notation and terminology

  • Bulk: the continuous part of the limiting spectral density, where eigenvalues are densely packed.
  • Edge: the boundary of the bulk support. Tracy-Widom governs fluctuations here.
  • Free probability: noncommutative probability theory where “freeness” replaces independence. Linearizes spectral computations via the R-transform.
  • Stieltjes transform: $m(z) = \int \frac{\rho(\lambda)}{\lambda - z} d\lambda$, the moment-generating function of the spectral density, analytic on the upper half-plane.
  • Universality: the phenomenon that spectral statistics depend only on the first few moments of the entry distribution, not its specific form.
  • Wishart matrix: $W = XX^T$ with Gaussian $X$. The matrix analogue of the chi-squared distribution.

References

1. Anderson, G. W., Guionnet, A., & Zeitouni, O. (2010). “An Introduction to Random Matrices”. The standard reference. It is dense and thorough and covers large deviations and topological expansions.

2. Marchenko, V. A., & Pastur, L. A. (1967). “Distribution of eigenvalues for some sets of random matrices”. The original paper.

3. Baik, J., Ben Arous, G., & Peche, S. (2005). “Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices”. The BBP transition and it changed how people think about signal detection in high dimensions.

4. Tracy, C. A., & Widom, H. (1994). “Level-spacing distributions and the Airy kernel”. Where the Tracy-Widom distribution comes from and it is now everywhere in KPZ physics.

5. Pennington, J., & Bahri, Y. (2017). “Geometry of Neural Network Loss Surfaces via Random Matrix Theory”. Used RMT on Hessians of deep networks and found that critical points are saddles and not local minima.

6. Voiculescu, D. V. (1991). “Limit laws for random matrices and free products”. Free probability starts here.