Random Matrix Theory & The Marchenko-Pastur Law

If we construct a large matrix with random entries $X_{ij} \sim \mathcal{N}(0, 1)$, what do its eigenvalues look like? Naive intuition suggests they might be Gaussian. They are not. They repel each other. Unlike independent random variables, which clump together (Central Limit Theorem), eigenvalues behave like charged particles in a “Coulomb Gas”. They distribute themselves to minimize a combined energy potential, balancing the confinement from the Gaussian entries (which pulls them toward 0) against a logarithmic repulsion (which pushes them apart). The eigenvalues of a Wishart Matrix $W = \frac{1}{n} XX^T$ (the covariance of pure noise) follow the Marchenko-Pastur Law. This distribution has compact support $[(1-\sqrt{\gamma})^2, (1+\sqrt{\gamma})^2]$. This Hard Edge is the most critical feature of random spectra: it allows us to distinguish signal from noise. Any eigenvalue popping out of this bulk is a “spike” (signal), while anything inside is indistinguishable from noise. This note derives the limit laws of Random Matrix Theory (RMT) using the Stieltjes Transform and Free Probability, and explores the BBP Phase Transition: the precise moment when a signal becomes strong enough to escape the sea of noise.


1. The Physics of Eigenvalues: The Coulomb Gas

Why do eigenvalues repel? The joint probability density of the eigenvalues $\lambda_1, \dots, \lambda_n$ of a Gaussian Ensemble is given by:

$$P(\lambda_1, \dots, \lambda_n) \propto \prod_{i < j} |\lambda_i - \lambda_j|^\beta \prod_{i=1}^n e^{-\frac{\beta n}{4} \lambda_i^2}$$

where $\beta$ is the “Dyson Index” ($\beta=1$ for Real Symmetric, $\beta=2$ for Complex Hermitian). This can be written as a Boltzmann distribution $P(\lambda) \propto e^{-\beta H(\lambda)}$ with Hamiltonian:

$$H(\lambda) = \sum_{i=1}^n \frac{n}{4} \lambda_i^2 - \sum_{i < j} \log |\lambda_i - \lambda_j|$$

This is exactly the energy of a 2D Coulomb Gas confined by a harmonic potential.

  • Confinement: $\sum \lambda_i^2$ keeps particles near the origin.
  • Repulsion: $-\sum \log |\lambda_i - \lambda_j|$ prevents particles from overlapping.

As $n \to \infty$, the gas settles into a deterministic equilibrium configuration. This equilibrium density is the Wigner Semicircle (for symmetric matrices) or Marchenko-Pastur (for covariance matrices). The repulsion is already visible at moderate $n$, as the sketch below shows.
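A minimal sketch of the repulsion (matrix size, the unfolding window, and the seed are illustrative choices; `numpy` and `matplotlib` assumed): compare the nearest-neighbour spacings of a GOE spectrum against the Wigner surmise $p(s) = \frac{\pi s}{2} e^{-\pi s^2/4}$ and against the exponential law that independent (Poisson) points would follow.

```python
import numpy as np
import matplotlib.pyplot as plt

n = 2000
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)          # GOE normalization; spectrum fills [-2, 2]
evals = np.linalg.eigvalsh(H)

# Crude unfolding: keep a central window where the density is roughly flat,
# then rescale the spacings to unit mean.
bulk = evals[np.abs(evals) < 0.5]
s = np.diff(bulk)
s /= s.mean()

x = np.linspace(0, 4, 200)
plt.hist(s, bins=50, density=True, alpha=0.5, label='GOE spacings')
plt.plot(x, (np.pi * x / 2) * np.exp(-np.pi * x**2 / 4), label='Wigner surmise')
plt.plot(x, np.exp(-x), '--', label='Poisson (independent points)')
plt.legend(); plt.show()
```

The histogram vanishes at $s = 0$: near-degenerate eigenvalues are strongly suppressed, whereas independent points would pile up there.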

2. The Stieltjes Transform

The primary tool for analyzing spectral densities (solving for the equilibrium of the gas) is the Stieltjes Transform. Let $\mu$ be a probability measure on $\mathbb{R}$ (the spectral density). The transform is defined on the complex upper half-plane $\mathbb{C}^+$:

$$m(z) = \int_{\mathbb{R}} \frac{1}{\lambda - z} \, d\mu(\lambda), \quad z \in \mathbb{C},\ \text{Im}(z) > 0$$

Why this transform?

  1. Resolvent Connection: For a matrix $H$, $m_H(z) = \frac{1}{n} \text{Tr}(H - zI)^{-1}$. This links linear algebra (trace of the inverse) to analysis.
  2. Inversion Formula: We can recover the density $\rho(\lambda)$ from $m(z)$ via the Plemelj-Sokhotski formula: $\rho(\lambda) = \frac{1}{\pi} \lim_{\epsilon \to 0^+} \text{Im}[m(\lambda + i\epsilon)]$
  3. Stability: The transform handles the “fluctuations” of finite $n$ much better than the density itself, allowing for rigorous limit proofs. The inversion formula is easy to check numerically, as in the sketch below.
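A minimal check of the inversion formula (the grid and $\epsilon$ are illustrative), using the closed-form semicircle transform $m(z) = \frac{-z + \sqrt{z^2 - 4}}{2}$ derived in the next section; the only subtlety is choosing the square-root branch with $\text{Im}\, m > 0$ on $\mathbb{C}^+$.

```python
import numpy as np

def m_semicircle(z):
    # Pick the root of m^2 + zm + 1 = 0 with Im(m) > 0 on the upper half-plane
    s = np.sqrt(z**2 - 4 + 0j)
    s = np.where(s.imag < 0, -s, s)
    return (-z + s) / 2

lam = np.linspace(-2.5, 2.5, 11)
approx = m_semicircle(lam + 1e-4j).imag / np.pi     # Plemelj-Sokhotski at eps = 1e-4
exact = np.sqrt(np.maximum(4 - lam**2, 0)) / (2 * np.pi)
print(np.column_stack([lam, approx, exact]))        # approx ~ exact away from the edges
```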

3. Derivation I: The Wigner Semicircle Law

Let $W_n$ be an $n \times n$ symmetric Wigner matrix with entries $W_{ij} = X_{ij}/\sqrt{n}$, where the $X_{ij}$ are i.i.d. with zero mean and unit variance. We assume the existence of a limiting spectral distribution and find its Stieltjes transform $m(z)$.

The Schur Complement Trick (Leave-One-Out)

Let $G(z) = (W_n - zI)^{-1}$ be the resolvent matrix. By symmetry, the average trace converges to any diagonal element, say $G_{11}$. Decompose $W_n$ to isolate the first row/column:

$$W_n = \begin{pmatrix} W_{11} & \mathbf{w}^T \\ \mathbf{w} & W^{(n-1)} \end{pmatrix}$$

Using the Schur Complement formula for block matrix inversion:

$$G_{11} = \frac{1}{W_{11} - z - \mathbf{w}^T (W^{(n-1)} - zI)^{-1} \mathbf{w}}$$

As $n \to \infty$:

  1. $W_{11} = X_{11}/\sqrt{n} \to 0$.
  2. The quadratic form $\mathbf{w}^T G^{(n-1)} \mathbf{w}$, with $G^{(n-1)} = (W^{(n-1)} - zI)^{-1}$, concentrates around its mean via the Trace Lemma: $\mathbf{w}^T G^{(n-1)} \mathbf{w} \approx \frac{1}{n} \text{Tr}(G^{(n-1)}) = m_{n-1}(z) \approx m(z)$

This yields the self-consistent equation:

$$m(z) = \frac{1}{-z - m(z)}$$

Solving the quadratic $m^2 + zm + 1 = 0$ gives $m(z) = \frac{-z + \sqrt{z^2 - 4}}{2}$, taking the branch with $\text{Im}\, m > 0$ on $\mathbb{C}^+$. The density is recovered from the imaginary part on the cut $[-2, 2]$:

$$\rho_{sc}(\lambda) = \frac{1}{2\pi} \sqrt{4 - \lambda^2}$$
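The self-consistent equation can be checked directly against a sampled matrix. A quick sketch (size, seed, and the test point $z$ are arbitrary choices; at this $z$, with $\text{Re}\, z > 0$, the principal square-root branch is the correct one):

```python
import numpy as np

n = 3000
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
W = (A + A.T) / np.sqrt(2 * n)          # Wigner normalization

z = 0.5 + 0.1j
G = np.linalg.inv(W - z * np.eye(n))    # resolvent
m_n = np.trace(G) / n

print("m_n(z)          :", m_n)
print("1/(-z - m_n(z)) :", 1 / (-z - m_n))                   # self-consistency
print("closed form     :", (-z + np.sqrt(z**2 - 4 + 0j)) / 2)
```

All three numbers should agree closely for large $n$.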

4. Derivation II: The Marchenko-Pastur Law

Now consider the Sample Covariance Matrix $S_n = \frac{1}{n} XX^T$, where $X$ is $p \times n$ with i.i.d. entries, and let $p/n \to \gamma$. We want the spectrum of $S_n$. Using a similar block decomposition (together with the Sherman-Morrison formula) on the resolvent of $S_n$, we arrive at a more involved fixed-point equation: the diagonal element of the resolvent relates to the trace of the resolvent of the remaining matrix. The self-consistent equation for the Stieltjes transform is:

$$m(z) = \frac{1}{-z + \frac{1}{1 + \gamma m(z)}}$$

This transforms into the quadratic equation:

$$\gamma z\, m^2 + (z - 1 + \gamma)\, m + 1 = 0$$

The Solution Structure

The roots yield the Stieltjes transform. The density is non-zero when the discriminant is negative (complex roots).

$$\Delta(z) = (z - 1 + \gamma)^2 - 4\gamma z < 0$$

Expanding, $\Delta(z) = (z - (1+\gamma))^2 - 4\gamma$, so the boundaries of the support are the roots of the discriminant, $z_{\pm} = (1 \pm \sqrt{\gamma})^2$. The resulting density is:

$$\rho_{MP}(\lambda) = \frac{1}{2\pi \gamma \lambda} \sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)} \quad \text{for } \lambda \in [\lambda_-, \lambda_+]$$

Key Insight:

  • If $\gamma = 1$ (square matrix, $p = n$), the support is $[0, 4]$. The density diverges at 0 as $\lambda^{-1/2}$.
  • If $\gamma < 1$ (more samples than dimensions, $p < n$), the support is bounded away from 0.
  • If $\gamma > 1$ (fewer samples than dimensions, $p > n$), there is a point mass at 0 with weight $1 - 1/\gamma$, reflecting the rank deficiency of $S_n$; see the sketch below.
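A quick sketch of the last bullet (dimensions and the zero threshold are illustrative): for $\gamma = 2$, half of the eigenvalues must vanish.

```python
import numpy as np

p, n = 1200, 600                                   # gamma = p/n = 2
X = np.random.default_rng(2).standard_normal((p, n))
evals = np.linalg.eigvalsh(X @ X.T / n)
zero_frac = np.mean(evals < 1e-10)                 # numerically zero eigenvalues
print(f"zero fraction: {zero_frac:.3f}   prediction 1 - 1/gamma = {1 - n/p:.3f}")
```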

5. The BBP Phase Transition

What if $X$ is not pure noise? Take the rank-one signal model $X = \sqrt{\theta n}\, u v^T + Z$ with unit vectors $u, v$, so that the population covariance is $\Sigma = I + \theta u u^T$. Does the top eigenvalue of $S_n$ reflect $\theta$? Theorem (Baik, Ben Arous, Péché): there is a critical signal-to-noise ratio $\theta_c = \sqrt{\gamma}$.

  1. Weak Signal ($\theta < \theta_c$): The top eigenvalue “sticks” to the bulk edge $\lambda_+ = (1+\sqrt{\gamma})^2$, and the empirical eigenvector $\hat{u}$ is asymptotically orthogonal to the true $u$: $\langle \hat{u}, u \rangle \to 0$. Implication: PCA returns pure noise. You learn nothing.
  2. Strong Signal ($\theta > \theta_c$): The top eigenvalue “pops out” of the bulk as an outlier, $\lambda_{top} \to (1+\theta)\left(1 + \frac{\gamma}{\theta}\right)$, and the overlap becomes non-zero: $|\langle \hat{u}, u \rangle|^2 \to \frac{1 - \gamma/\theta^2}{1 + \gamma/\theta}$

This implies a fundamental limit of detection: no algorithm can linearly recover the spike below this transition. Both predictions are easy to verify by simulation, as in the sketch below.
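A sketch checking both predictions for a supercritical spike (the choices of $p$, $n$, $\theta$, and the seed are illustrative; the model matches the rank-one form above):

```python
import numpy as np

p, n = 1000, 2000
gamma, theta = p / n, 1.5                          # theta > sqrt(gamma) ~ 0.707
rng = np.random.default_rng(3)

u = rng.standard_normal(p); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
X = np.sqrt(theta * n) * np.outer(u, v) + rng.standard_normal((p, n))
S = X @ X.T / n

evals, evecs = np.linalg.eigh(S)
lam_top, u_hat = evals[-1], evecs[:, -1]

print("top eigenvalue:", lam_top,
      " predicted:", (1 + theta) * (1 + gamma / theta))             # ~3.33
print("overlap^2     :", np.dot(u_hat, u)**2,
      " predicted:", (1 - gamma / theta**2) / (1 + gamma / theta))  # ~0.58
```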



6. The Tracy-Widom Distribution

In the BBP discussion, we saw that the “Bulk Edge” is at $\lambda_+ = (1+\sqrt{\gamma})^2$. But for finite $n$, the largest eigenvalue $\lambda_{max}$ fluctuates around this limit. What is the distribution of these fluctuations? For sums of i.i.d. random variables, the CLT gives us a Gaussian. For eigenvalues, the strong repulsion creates a stiffer system: the fluctuations are much smaller ($O(n^{-2/3})$ rather than the CLT’s $O(n^{-1/2})$) and non-Gaussian. Theorem (Tracy-Widom, 1994):

$$\lambda_{max} \approx \lambda_+ + \frac{\lambda_+}{n^{2/3}} \cdot \mathbb{TW}_\beta$$

(The $n^{-2/3}$ scale is exact; the prefactor $\lambda_+$ is a convenient simplification, as the exact constant depends on $\gamma$.)

The cumulative distribution function $F_\beta(s) = P(\mathbb{TW}_\beta \le s)$ is expressed in terms of the solution $q(s)$ of the Painlevé II differential equation:

$$q''(s) = s\, q(s) + 2 q(s)^3$$

This distribution is universal. It appears in:

  • The longest increasing subsequence of random permutations (Ulam’s Problem).
  • The height fluctuations of the KPZ equation (stochastic interface growth).
  • The bus bunching phenomenon in transport systems.

The distribution is asymmetric, with a “soft tail” extending to the right (outliers) and a hard wall on the left (pushing against the bulk). The sketch below reproduces this skew empirically.
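A sketch of the edge fluctuations (sizes and trial count are illustrative; the normalization uses the simplified scaling above, so the histogram matches a Tracy-Widom density only up to a $\gamma$-dependent rescaling):

```python
import numpy as np
import matplotlib.pyplot as plt

p, n, trials = 200, 400, 500
gamma = p / n
lam_plus = (1 + np.sqrt(gamma))**2
rng = np.random.default_rng(4)

fluct = []
for _ in range(trials):
    X = rng.standard_normal((p, n))
    lam_max = np.linalg.eigvalsh(X @ X.T / n)[-1]
    fluct.append((lam_max - lam_plus) / (lam_plus * n**(-2/3)))

plt.hist(fluct, bins=40, density=True, alpha=0.6)
plt.title('Normalized edge fluctuations: O(1), skewed, soft right tail')
plt.show()
```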

7. Free Probability and the R-Transform

How do we compute the spectrum of a sum of two random matrices $A + B$? If $A$ and $B$ commute (share eigenvectors), their eigenvalues simply add. If they are random rotations of each other, they are Free: Free Independence is the non-commutative analog of statistical independence. Voiculescu introduced the R-Transform, analogous to the log-characteristic function $\log \phi(t)$ that linearizes classical convolution, to linearize free convolution.

The Algorithm

  1. Compute the Stieltjes transform $G_A(z)$ of the density $\rho_A$.
  2. Compute the functional inverse $G_A^{-1}(w)$.
  3. The R-transform is $R_A(w) = G_A^{-1}(w) - \frac{1}{w}$.
  4. Additivity: $R_{A+B}(w) = R_A(w) + R_B(w)$.
  5. Invert back: find $G_{A+B}(z)$ by solving $R_{A+B}(G(z)) + 1/G(z) = z$.

Example: the sum of two free Wigner matrices is again a Wigner matrix (semicircle $\boxplus$ semicircle = semicircle, with variances adding); this is verified numerically below. Example: noise + signal, where we can predict the exact shape of the deformed spectrum. This is crucial for analyzing ResNets (Identity + Random Weight).
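A sketch of the first example (size and seed are illustrative): two independent GOE matrices are asymptotically free, each with $R(w) = w$, so $R_{A+B}(w) = 2w$, which is the R-transform of a semicircle with variance 2 (radius $2\sqrt{2}$).

```python
import numpy as np
import matplotlib.pyplot as plt

n = 2000
rng = np.random.default_rng(5)

def goe(n):
    A = rng.standard_normal((n, n))
    return (A + A.T) / np.sqrt(2 * n)   # semicircle on [-2, 2], variance 1

evals = np.linalg.eigvalsh(goe(n) + goe(n))

x = np.linspace(-3, 3, 400)
plt.hist(evals, bins=60, density=True, alpha=0.5, label='eig(A + B)')
plt.plot(x, np.sqrt(np.maximum(8 - x**2, 0)) / (4 * np.pi), 'r',
         label='semicircle, variance 2')
plt.legend(); plt.show()
```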


8. Numerical Simulation (Python)

Let’s visualize the MP law and the BBP transition.

```python
import numpy as np
import matplotlib.pyplot as plt

def marchenko_pastur_pdf(x, gamma, sigma=1):
    lambda_plus = sigma**2 * (1 + np.sqrt(gamma))**2
    lambda_minus = sigma**2 * (1 - np.sqrt(gamma))**2
    # PDF only defined within support
    with np.errstate(divide='ignore', invalid='ignore'):
        y = np.sqrt((lambda_plus - x) * (x - lambda_minus)) / (2 * np.pi * gamma * x * sigma**2)
    y[x < lambda_minus] = 0
    y[x > lambda_plus] = 0
    return y

def simulate_eigenvalues(p=1000, n=2000, signal_strength=0.0):
    gamma = p / n

    # Noise
    Z = np.random.randn(p, n)

    # Signal (Rank 1 spike)
    if signal_strength > 0:
        u = np.random.randn(p, 1); u /= np.linalg.norm(u)
        v = np.random.randn(n, 1); v /= np.linalg.norm(v)
        X = np.sqrt(signal_strength) * u @ v.T * np.sqrt(n) + Z
    else:
        X = Z

    S = (1/n) * X @ X.T
    evals = np.linalg.eigvalsh(S)

    # Plot Bulk
    plt.figure(figsize=(10, 6))
    plt.hist(evals, bins=80, density=True, alpha=0.5, label='Empirical')
    x = np.linspace(0, (1 + np.sqrt(gamma))**2 + 1, 1000)
    pdf = marchenko_pastur_pdf(x, gamma)
    plt.plot(x, pdf, 'r', linewidth=2, label='Marchenko-Pastur')

    # Edge Fluctuations (Tracy-Widom scaling)
    lambda_plus = (1 + np.sqrt(gamma))**2
    max_eval = np.max(evals)
    scaling_factor = lambda_plus * (n**(-2/3))
    tw_fluctuation = (max_eval - lambda_plus) / scaling_factor
    print(f"Largest Eigenvalue: {max_eval:.4f}")
    print(f"Edge Prediction: {lambda_plus:.4f}")
    print(f"Normalized Fluctuation (should be O(1)): {tw_fluctuation:.4f}")

    # BBP Prediction
    threshold = np.sqrt(gamma)
    if signal_strength > threshold:
        predicted_spike = (1 + signal_strength) * (1 + gamma/signal_strength)
        plt.axvline(predicted_spike, color='g', linestyle='--',
                    label=f'BBP Prediction ({predicted_spike:.2f})')
    plt.axvline(lambda_plus, color='k', linestyle='--', label='Bulk Edge')
    plt.legend()
    return evals

# To run:
# simulate_eigenvalues(p=1000, n=2000, signal_strength=1.5)
```


9. Universality and Deep Learning

Why do these laws matter if our data isn’t Gaussian? Universality Theorem: as long as the entries $X_{ij}$ are i.i.d. with mean 0 and variance 1 (and finite higher moments), the spectral statistics (Wigner Semicircle, Marchenko-Pastur, Tracy-Widom) are identical to the Gaussian case as $n \to \infty$. This is the Spectral Central Limit Theorem.

Application: Neural Network Initialization. In Deep Learning, we initialize weights $W \sim \mathcal{N}(0, \frac{2}{n})$. The Jacobian $J$ of a deep network is a product of many random matrices. For the gradient to flow without vanishing or exploding, the singular values of $J$ must be concentrated near 1. RMT tells us that for a random matrix $W$, the squared singular values follow a Marchenko-Pastur law. To push the spectrum toward 1 (“dynamical isometry”), we need careful scaling of variances (Xavier/He Initialization) and potentially orthogonal initialization, which trivializes the spectrum to a Dirac mass at 1 at initialization; the sketch below contrasts the two.

The Hessian of the loss surface is also a large random matrix. Its spectrum often consists of a “bulk” (noise) plus “spikes” (signal). This validates the BBP model as a relevant proxy for training dynamics.
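A sketch of the contrast (depth, width, and the variance-$1/n$ scaling are illustrative simplifications of a deep linear network; a real He-initialized ReLU network would differ):

```python
import numpy as np

n, depth = 500, 10
rng = np.random.default_rng(6)

J_gauss, J_orth = np.eye(n), np.eye(n)
for _ in range(depth):
    W = rng.standard_normal((n, n)) / np.sqrt(n)        # i.i.d. layer, variance 1/n
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))    # orthogonal layer
    J_gauss = W @ J_gauss
    J_orth = Q @ J_orth

for name, J in [('gaussian', J_gauss), ('orthogonal', J_orth)]:
    s = np.linalg.svd(J, compute_uv=False)
    print(f'{name:10s} s_min = {s.min():.3e}   s_max = {s.max():.3e}')
```

The orthogonal product has every singular value exactly 1; the Gaussian product’s singular values spread over many orders of magnitude as depth grows.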


10. Conclusion

Random Matrix Theory provides a powerful lens for understanding the behavior of complex, high-dimensional systems. From the initial motivation of modeling heavy nuclei physics to modern applications in deep learning and wireless communications, the universality of eigenvalue statistics allows us to ignore the microscopic details of the noise and focus on the macroscopic structure.

The Marchenko-Pastur law gives us the “shape of noise,” defining the rigid boundary between signal and randomness. The BBP transition tells us exactly when a signal is strong enough to be detected. And the Tracy-Widom distribution quantifies the subtle fluctuations at the edge of this boundary.

As we move towards understanding larger neural networks and more complex datasets, RMT will likely play an even more central role, providing the rigorous limit laws that justify our empirical observations.


Historical Timeline

| Year | Event | Significance |
|------|-------|--------------|
| 1928 | Wishart Distribution | John Wishart derives the distribution of sample covariance matrices. |
| 1955 | Wigner Semicircle Law | Eugene Wigner models heavy-nuclei resonances using random matrices. |
| 1967 | Marchenko-Pastur Law | Derivation of the limiting spectral density for large covariance matrices. |
| 1984 | Girko’s Circular Law | Proof of the universal law for non-Hermitian matrices. |
| 1991 | Free Probability | Dan Voiculescu founds the theory of free random variables. |
| 1994 | Tracy-Widom Distribution | Discovery of the universal law governing edge fluctuations. |
| 2005 | BBP Phase Transition | Baik, Ben Arous, and Péché characterize the signal detection threshold. |
| 2010s | Deep Learning Era | RMT applied to loss surfaces (Pennington) and initialization (Saxe, Ganguli). |

Appendix A: The Combinatorial Method of Moments

An alternative to the Stieltjes approach relies on trace powers (the moments of the spectral density):

$$M_k = \frac{1}{n} \mathbb{E}[\text{Tr}(W^k)] = \frac{1}{n} \mathbb{E} \left[ \sum_{i_1, \dots, i_k=1}^n W_{i_1 i_2} W_{i_2 i_3} \cdots W_{i_k i_1} \right]$$

This sum looks like a path on a graph with $n$ vertices: the indices $i_1 \to i_2 \to \dots \to i_1$ form a closed loop of length $k$. Since the $W_{ij}$ are independent with mean 0, only terms where every edge is traversed at least twice have non-zero expectation.

  • $k$ odd: impossible to traverse every edge exactly twice, so $M_{odd} \to 0$.
  • $k$ even ($k = 2m$): the dominant terms correspond to “trees” where every edge is traversed exactly twice (down and up). The number of such “non-crossing” paths is counted by the Catalan Numbers:

$$C_m = \frac{1}{m+1} \binom{2m}{m}$$

The moments of the Wigner Semicircle are exactly the Catalan numbers!

$$\int_{-2}^2 x^{2m} \, \frac{1}{2\pi} \sqrt{4-x^2} \, dx = C_m$$

Since the distribution has compact support, the moments uniquely determine the measure (Carleman’s Condition). For Marchenko-Pastur, the combinatorics involve “non-crossing partitions” of the cycle, counted by the Narayana Numbers. The moment identity is easy to check numerically, as in the sketch below.
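A quick numerical check of the moment identity (grid resolution and matrix size are illustrative): compare the Catalan numbers against quadrature of the semicircle and against the empirical moments of a sampled Wigner matrix.

```python
import numpy as np
from math import comb

n = 2000
rng = np.random.default_rng(7)
A = rng.standard_normal((n, n))
evals = np.linalg.eigvalsh((A + A.T) / np.sqrt(2 * n))

x = np.linspace(-2, 2, 20001)
rho = np.sqrt(4 - x**2) / (2 * np.pi)
dx = x[1] - x[0]
for m in range(1, 5):
    catalan = comb(2 * m, m) // (m + 1)
    quad = np.sum(x**(2 * m) * rho) * dx          # integral of x^{2m} rho(x)
    emp = np.mean(evals**(2 * m))                 # (1/n) Tr W^{2m}
    print(f'm={m}: Catalan={catalan}  quadrature={quad:.4f}  empirical={emp:.4f}')
```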


Appendix B: Large Deviation Principles

The Wigner Semicircle is just the average behavior. What is the probability of observing a different empirical spectral density $\mu$? The empirical measure satisfies a Large Deviation Principle (LDP) with speed $n^2$:

$$P(\rho_n \approx \mu) \approx e^{-n^2 I(\mu)}$$

The rate function $I(\mu)$ is the “Ben Arous-Guionnet” entropy:

$$I(\mu) = \frac{1}{2} \int x^2 \, d\mu(x) - \iint \log |x-y| \, d\mu(x)\, d\mu(y) - C$$

This is purely Electrostatic Energy:

  1. Potential Energy $\int x^2 \, d\mu$: the cost of moving particles far from 0.
  2. Interaction Energy $-\iint \log |x-y|$: the cost of particles being close together.

The Semicircle Law is simply the Ground State (minimizer) of this energy functional. This explains why the law is so robust (Universality): it is an energy minimum. The sketch below recovers the ground state by gradient descent on the finite-$n$ energy.
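A sketch recovering the ground state (particle number, step size, and step count are illustrative; the gradient is $\partial H/\partial x_i = \frac{n x_i}{2} - \sum_{j \ne i} \frac{1}{x_i - x_j}$ from the Hamiltonian of Section 1):

```python
import numpy as np
import matplotlib.pyplot as plt

n, steps, lr = 400, 4000, 2e-3
x = np.linspace(-1, 1, n)                  # ordered start; repulsion keeps the order

for _ in range(steps):
    diff = x[:, None] - x[None, :]
    np.fill_diagonal(diff, np.inf)         # no self-interaction (1/inf = 0)
    grad = n * x / 2 - np.sum(1.0 / diff, axis=1)
    x -= lr * grad / n                     # gradient descent on H, step scaled by n

t = np.linspace(-2, 2, 200)
plt.hist(x, bins=40, density=True, alpha=0.5, label='equilibrium particles')
plt.plot(t, np.sqrt(4 - t**2) / (2 * np.pi), 'r', label='semicircle')
plt.legend(); plt.show()
```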

Appendix C: The Harish-Chandra-Itzykson-Zuber (HCIZ) Integral

In Physics and RMT, we often need to compute integrals over the Unitary Group $U(n)$. The most famous is the HCIZ integral, which couples two Hermitian matrices $A$ and $B$:

$$I(A, B, \tau) = \int_{U(n)} \exp\left( \tau \, \text{Tr}(A U B U^*) \right) dU$$

This integral is fundamental to the study of matrix models with external fields. Theorem: if $A$ has eigenvalues $a_i$ and $B$ has eigenvalues $b_i$, then:

$$I(A, B, \tau) \propto \frac{\det\left(e^{\tau a_i b_j}\right)_{i,j}}{\Delta(A)\, \Delta(B)}$$

where $\Delta(A) = \prod_{i<j} (a_i - a_j)$ is the Vandermonde determinant. This formula allows us to solve exactly for the correlations of eigenvalues in coupled matrix models, linking RMT to Integrable Systems and representation theory.
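For $n = 2$ the constant can be made explicit (taking $dU$ as the Haar probability measure): $I(A, B, \tau) = \frac{e^{\tau(a_1 b_1 + a_2 b_2)} - e^{\tau(a_1 b_2 + a_2 b_1)}}{\tau (a_1 - a_2)(b_1 - b_2)}$, which correctly tends to 1 as $\tau \to 0$. A Monte Carlo sketch of this case (sample count and test values are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)

def haar_unitary(n):
    # QR of a complex Ginibre matrix, with phases fixed so Q is exactly Haar
    z = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    return q * (d / np.abs(d))

a, b, tau = np.array([1.0, -0.5]), np.array([0.3, 2.0]), 0.7
A, B = np.diag(a), np.diag(b)

vals = []
for _ in range(50_000):
    U = haar_unitary(2)
    vals.append(np.exp(tau * np.trace(A @ U @ B @ U.conj().T)).real)

det_term = np.exp(tau * (a[0]*b[0] + a[1]*b[1])) - np.exp(tau * (a[0]*b[1] + a[1]*b[0]))
exact = det_term / (tau * (a[0] - a[1]) * (b[0] - b[1]))
print('Monte Carlo:', np.mean(vals), '  exact:', exact)
```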


Appendix D: The Circular Law (Non-Hermitian)

What if $X$ is not symmetric? (E.g., the weights of a Recurrent Neural Network.) Let $X$ be $n \times n$ with i.i.d. entries $\mathcal{N}(0, 1/n)$. The eigenvalues are complex and spread out in the complex plane. Theorem (Girko’s Circular Law): as $n \to \infty$, the eigenvalues are uniformly distributed on the unit disk in the complex plane $\mathbb{C}$:

$$\rho(\lambda) = \frac{1}{\pi} \mathbb{1}_{|\lambda| \le 1}$$
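A minimal sketch of the law (size is illustrative; note `eigvals` rather than `eigvalsh`, since $X$ is not symmetric and its eigenvalues are complex):

```python
import numpy as np
import matplotlib.pyplot as plt

n = 1500
X = np.random.default_rng(10).standard_normal((n, n)) / np.sqrt(n)
evals = np.linalg.eigvals(X)                     # complex eigenvalues

theta = np.linspace(0, 2 * np.pi, 200)
plt.scatter(evals.real, evals.imag, s=2, alpha=0.5)
plt.plot(np.cos(theta), np.sin(theta), 'r')      # unit circle
plt.gca().set_aspect('equal')
plt.show()
print('fraction inside unit disk:', np.mean(np.abs(evals) <= 1.01))
```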

Implication for RNNs:

  • If eigenvalues lie outside the unit circle ($|\lambda| > 1$), signals grow exponentially and gradients explode.
  • If they lie inside ($|\lambda| < 1$), signals decay to 0 and gradients vanish.
  • Orthogonal Initialization places the eigenvalues exactly on the boundary $|\lambda| = 1$, preserving gradient norm (isometry).

Appendix E: Glossary of Terms

  • Bulk: The continuous part of the limit spectrum (e.g., Semicircle, Marchenko-Pastur support).
  • Edge: The boundary of the support. Fluctuations here are governed by Tracy-Widom.
  • Free Probability: A theory of non-commutative random variables where “freeness” replaces independence.
  • Stieltjes Transform: The generating function of the moments of the spectral density.
  • Universality: The phenomenon where spectral statistics depend only on the moments of the entries, not their specific distribution.
  • Wishart Matrix: A product $XX^T$ where $X$ is Gaussian. The random matrix analog of the Chi-Square distribution.

References

1. Anderson, G. W., Guionnet, A., & Zeitouni, O. (2010). “An Introduction to Random Matrices”. The standard mathematical text. Extremely dense, covers Large Deviations and Topological expansions.

2. Marchenko, V. A., & Pastur, L. A. (1967). “Distribution of eigenvalues for some sets of random matrices”. The classic paper. Derived the eponymous law.

3. Baik, J., Ben Arous, G., & Péché, S. (2005). “Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices”. Proved the BBP transition. Critical for signal detection in high noise.

4. Tracy, C. A., & Widom, H. (1994). “Level-spacing distributions and the Airy kernel”. Derived the Tracy-Widom distribution for the fluctuations of the largest eigenvalue. This distribution is now seen everywhere in KPZ physics.

5. Pennington, J., & Bahri, Y. (2017). “Geometry of Neural Network Loss Surfaces via Random Matrix Theory”. Applied RMT to understand the Hessian of deep networks. Showed that critical points are saddles, not minima.

6. Voiculescu, D. V. (1991). “Limit laws for random matrices and free products”. Introduced Free Probability.