Random Matrix Theory & The Marchenko-Pastur Law

If you build a big matrix out of random Gaussian entries $X_{ij} \sim \mathcal{N}(0, 1)$, the eigenvalue distribution has a precise and universal structure.

That structure is not Gaussian. Eigenvalues repel each other like charged particles in a Coulomb gas, spreading out to balance confinement (which pulls them toward 0) against logarithmic repulsion (which pushes them apart). This is very different from the CLT regime where independent draws pile up into a bell curve.

For the Wishart matrix $W = \frac{1}{n} XX^T$, which is the sample covariance of pure noise, the eigenvalues follow the Marchenko-Pastur Law with compact support $[(1-\sqrt{\gamma})^2, (1+\sqrt{\gamma})^2]$. The sharp edge of this support is the key structural feature: eigenvalues inside the bulk look the same as noise, but an eigenvalue poking out beyond the edge shows signal.

This note works out these results using the Stieltjes Transform and Free Probability and then covers the BBP Phase Transition which is the exact threshold where signal becomes detectable above noise.


1. Eigenvalue Repulsion as a Coulomb Gas

The joint density of eigenvalues $\lambda_1, \dots, \lambda_n$ of a Gaussian Ensemble is

P(\lambda_1, \dots, \lambda_n) \propto \prod_{i < j} |\lambda_i - \lambda_j|^\beta \prod_{i=1}^n e^{-\frac{\beta n}{4} \lambda_i^2}

where $\beta = 1$ for real symmetric and $\beta = 2$ for complex Hermitian matrices.

You can rewrite this as $P(\lambda) \propto e^{-\beta H(\lambda)}$ and get

H(\lambda) = \sum_{i=1}^n \frac{n}{4} \lambda_i^2 - \sum_{i < j} \log |\lambda_i - \lambda_j|

This is a 2D Coulomb gas in a harmonic trap where the $\sum_i \lambda_i^2$ term confines and the $-\sum_{i<j} \log|\lambda_i - \lambda_j|$ term repels. As $n \to \infty$ the gas freezes into a deterministic equilibrium and that limit is Wigner's Semicircle for symmetric matrices and Marchenko-Pastur for covariance matrices.
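The repulsion is easy to see numerically. A quick sketch (my illustration, not part of the original derivation): nearest-neighbor spacings of GOE eigenvalues almost never come close together, while i.i.d. (Poisson) points routinely do.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# GOE-like matrix: symmetric Gaussian, scaled so the spectrum is O(1)
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)
evals = np.sort(np.linalg.eigvalsh(H))

# Nearest-neighbor spacings in the middle of the spectrum, normalized to unit mean
bulk = evals[n // 4 : 3 * n // 4]
s = np.diff(bulk)
s = s / s.mean()

# For i.i.d. (Poisson) points, P(spacing < 0.05) would be about 5%.
# Repulsion suppresses tiny spacings (the Wigner surmise vanishes linearly at 0).
frac_tiny = np.mean(s < 0.05)
print(f"fraction of spacings below 0.05: {frac_tiny:.3f}")
```

With i.i.d. points you would expect roughly 5% of spacings below 0.05; the eigenvalues give almost none.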


2. The Stieltjes Transform

Given a spectral density $\mu$ on $\mathbb{R}$ you define on the upper half-plane

m(z) = \int_{\mathbb{R}} \frac{1}{\lambda - z}\, d\mu(\lambda), \quad z \in \mathbb{C},\ \operatorname{Im}(z) > 0

Three reasons this is the right tool.

  1. For a matrix $H$ you get $m_H(z) = \frac{1}{n} \operatorname{Tr}(H - zI)^{-1}$ and algebra turns into analysis.
  2. You pick out the density with Plemelj-Sokhotski, $\rho(\lambda) = \frac{1}{\pi} \lim_{\epsilon \to 0^+} \operatorname{Im}[m(\lambda + i\epsilon)]$.
  3. It tames finite-$n$ fluctuations far better than the density itself and this makes limit proofs tractable.
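Point 2 can be seen numerically (my sketch, with a small finite $\epsilon$ standing in for the limit): evaluate the empirical Stieltjes transform of a Wigner matrix just above the real axis and read off the semicircle density.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Wigner matrix whose spectral density tends to the semicircle on [-2, 2]
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)
evals = np.linalg.eigvalsh(H)

# Empirical Stieltjes transform m_n(z) = (1/n) Tr (H - zI)^{-1}
lam, eps = 0.5, 0.05
z = lam + 1j * eps
m_n = np.mean(1.0 / (evals - z))

rho_est = m_n.imag / np.pi                     # Plemelj-Sokhotski extraction
rho_true = np.sqrt(4 - lam**2) / (2 * np.pi)   # semicircle density at lam
print(rho_est, rho_true)
```

The two numbers agree to a couple of percent; the error shrinks as $n$ grows and $\epsilon$ is sent to 0 more slowly than $1/n$.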

3. Wigner Semicircle

Let $W_n$ be $n \times n$ symmetric with $W_{ij} = X_{ij}/\sqrt{n}$ and entries i.i.d. with mean 0 and variance 1.

Take the resolvent $G(z) = (W_n - zI)^{-1}$. By symmetry the average trace lines up with any diagonal element and so we pick $G_{11}$.

Block-decompose and get

W_n = \begin{pmatrix} W_{11} & \mathbf{w}^T \\ \mathbf{w} & W^{(n-1)} \end{pmatrix}

The Schur complement gives us

G_{11} = \frac{1}{W_{11} - z - \mathbf{w}^T (W^{(n-1)} - zI)^{-1} \mathbf{w}}

Now send $n \to \infty$: $W_{11} = X_{11}/\sqrt{n}$ drops to 0 and the quadratic form concentrates by the Trace Lemma,

\mathbf{w}^T G^{(n-1)} \mathbf{w} \approx \frac{1}{n} \operatorname{Tr}(G^{(n-1)}) = m_{n-1}(z) \approx m(z)

And so we get a self-consistent equation

m(z) = \frac{1}{-z - m(z)}

Solving $m^2 + zm + 1 = 0$ gives $m(z) = \frac{-z + \sqrt{z^2 - 4}}{2}$, taking the branch with $\operatorname{Im}\, m(z) > 0$ on the upper half-plane.

You pick off the density from the imaginary part on $[-2, 2]$,

\rho_{sc}(\lambda) = \frac{1}{2\pi} \sqrt{4 - \lambda^2}

The argument bootstraps because the resolvent sits on a smaller copy of itself and in the limit the recursion closes into a fixed-point equation.
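The fixed-point equation can be checked directly (a small sketch of mine): iterate $m \mapsto 1/(-z - m)$ from any starting point in the upper half-plane and compare with the closed form.

```python
import numpy as np

# Solve m = 1/(-z - m) by fixed-point iteration in the upper half-plane
z = 0.7 + 0.1j
m = 1j                       # any starting point with Im(m) > 0
for _ in range(500):
    m = 1.0 / (-z - m)

# Closed form m(z) = (-z + sqrt(z^2 - 4)) / 2, picking the root with Im > 0
closed = (-z + np.sqrt(z**2 - 4 + 0j)) / 2
if closed.imag < 0:
    closed = (-z - np.sqrt(z**2 - 4 + 0j)) / 2
print(m, closed)
```

The map preserves the upper half-plane (if $\operatorname{Im}\, m > 0$ then $\operatorname{Im}(-z - m) < 0$, so the reciprocal is back in the upper half-plane), and the iteration contracts onto the Stieltjes branch.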


4. Marchenko-Pastur

Now let $S_n = \frac{1}{n} XX^T$ where $X$ is $p \times n$ with i.i.d. entries and $p/n \to \gamma$.

A similar block decomposition on the resolvent of SnS_n gives a different fixed point,

m(z) = \frac{1}{1 - \gamma - z - \gamma z\, m(z)}

which rearranges to

\gamma z m^2 + (z + \gamma - 1) m + 1 = 0

The density lives where the discriminant drops below zero,

\Delta(z) = (z + \gamma - 1)^2 - 4\gamma z < 0

The roots of the discriminant sit at $z_{\pm} = (1 \pm \sqrt{\gamma})^2$ and the density is

\rho_{MP}(\lambda) = \frac{1}{2\pi \gamma \lambda} \sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)} \quad \text{for } \lambda \in [\lambda_-, \lambda_+]

The behavior near γ=1\gamma = 1 is worth a careful look.

  • When $\gamma = 1$ (square) the support is $[0, 4]$ and the density blows up at 0 like $\lambda^{-1/2}$.
  • When $\gamma < 1$ (tall) the support sits bounded away from 0.
  • When $\gamma > 1$ (wide) there is a point mass at 0 with weight $1 - 1/\gamma$ and this sits on the null space directions.
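The wide-case point mass is just rank counting, and it is easy to confirm (my quick check): with $p > n$ the matrix $S_n = \frac{1}{n}XX^T$ has rank at most $n$, so a fraction $1 - n/p = 1 - 1/\gamma$ of its eigenvalues are exactly zero.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 600, 300            # gamma = 2 > 1: wide case
X = rng.standard_normal((p, n))
S = X @ X.T / n
evals = np.linalg.eigvalsh(S)

# S has rank at most n, so p - n eigenvalues are (numerically) zero:
# a point mass at 0 with weight 1 - 1/gamma = 0.5 here.
zero_frac = np.mean(np.abs(evals) < 1e-8)
print(zero_frac)  # 0.5
```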

5. The BBP Phase Transition

Suppose the data isn’t pure noise. The signal model is $X = \sqrt{\theta}\, u v^T + Z$ with $\Sigma = I + \theta u u^T$, and the question is whether the top eigenvalue of $S_n$ sees $\theta$.

Theorem (Baik, Ben Arous, Peche). There is a critical threshold $\theta_c = \sqrt{\gamma}$.

Below threshold ($\theta < \theta_c$). The top eigenvalue sticks to the bulk edge $(1+\sqrt{\gamma})^2$ and the empirical eigenvector $\hat{u}$ sits asymptotically orthogonal to the true $u$ with $\langle \hat{u}, u \rangle \to 0$. PCA fails completely and this is a fundamental wall because no linear method can pick out the spike.

Above threshold ($\theta > \theta_c$). The top eigenvalue breaks free,

\lambda_{top} \to (1+\theta)\left(1 + \frac{\gamma}{\theta}\right)

and the overlap goes nonzero,

|\langle \hat{u}, u \rangle|^2 \to \frac{1 - \gamma/\theta^2}{1 + \gamma/\theta}

This is a hard detection limit. Below $\theta_c$ the signal is invisible to spectral methods and no linear method can pick it out. But in practice people frequently try to pull signal from eigenvalues well inside the bulk.


6. Tracy-Widom Fluctuations

We said the bulk edge sits at $\lambda_+ = (1+\sqrt{\gamma})^2$ but for finite $n$ the largest eigenvalue jitters around that limit.

For sums of i.i.d. variables the CLT gives Gaussian fluctuations at scale $n^{-1/2}$. Eigenvalue repulsion stiffens the system and so fluctuations come in at $O(n^{-2/3})$ which is much smaller and the limiting shape is non-Gaussian.

Theorem (Tracy-Widom, 1994).

\lambda_{max} \approx \lambda_+ + \frac{\sigma_\gamma}{n^{2/3}}\, \mathbb{TW}_\beta

where $\sigma_\gamma$ is an explicit scale constant; for the white Wishart case $\sigma_\gamma = (1+\sqrt{\gamma})(1 + 1/\sqrt{\gamma})^{1/3}$.

The CDF $F_\beta(s) = P(\mathbb{TW}_\beta \le s)$ falls out of the Painlevé II equation,

q''(s) = s\, q(s) + 2 q(s)^3

This distribution shows up in settings with no obvious tie to random matrices.

  • Longest increasing subsequence of random permutations.
  • Height fluctuations of KPZ interfaces.
  • Bus bunching in transit systems (yes, really).

The distribution is asymmetric and it has a soft tail on the right (where outliers can happen) and a hard wall on the left (where the bulk pushes back).
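A rough numerical check of the $n^{-2/3}$ scale (my sketch; at $n = 200$ finite-size corrections are still visible, but the rescaled fluctuation is clearly $O(1)$ and centered below 0, as $\mathbb{TW}_1$ is):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 200, 200

# The largest eigenvalue of a Wigner matrix fluctuates around the edge 2
# at scale n^(-2/3), much tighter than the CLT's n^(-1/2).
tops = np.empty(trials)
for t in range(trials):
    A = rng.standard_normal((n, n))
    H = (A + A.T) / np.sqrt(2 * n)
    tops[t] = np.linalg.eigvalsh(H)[-1]

fluct = (tops - 2.0) * n ** (2 / 3)   # should be O(1), TW_1-distributed
print(fluct.mean(), fluct.std())
```

For reference, $\mathbb{TW}_1$ has mean about $-1.21$ and standard deviation about $1.27$.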


7. Free Probability and the R-Transform

Suppose you want the spectrum of $A + B$ where both are large random matrices. If they share eigenvectors then eigenvalues add pointwise but if they sit in “generic position” they are free in Voiculescu’s sense and this is the noncommutative analogue of independence.

The R-Transform linearizes free convolution into addition the same way the log-characteristic function linearizes classical convolution.

Here is how it works.

  1. Get the Stieltjes transform $G_A(z)$.
  2. Functionally invert it to get $G_A^{-1}(w)$.
  3. Then $R_A(w) = G_A^{-1}(w) - \frac{1}{w}$.
  4. Free convolution turns into addition and $R_{A+B}(w) = R_A(w) + R_B(w)$.
  5. Invert back by solving $R_{A+B}(G(z)) + 1/G(z) = z$ for $G_{A+B}$.

Semicircle plus semicircle is semicircle. And more practically, for a ResNet layer computing $x + Wx$ the spectrum of $I + W$ is the free convolution of a Dirac at 1 with Marchenko-Pastur and you can predict the spectral shape exactly.
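The semicircle-plus-semicircle claim is easy to verify numerically (my sketch): the unit semicircle has $R(w) = w$, so the free sum of two has $R(w) = 2w$, a semicircle with variance 2 and support $[-2\sqrt{2}, 2\sqrt{2}]$.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1500

def wigner(n, rng):
    A = rng.standard_normal((n, n))
    return (A + A.T) / np.sqrt(2 * n)   # semicircle on [-2, 2], variance 1

# Two independent Wigner matrices are asymptotically free, so the sum
# is a semicircle with variance 2 and support [-2*sqrt(2), 2*sqrt(2)].
evals = np.linalg.eigvalsh(wigner(n, rng) + wigner(n, rng))
print(evals.min(), evals.max())   # ~ -2.83 and ~ 2.83
print(np.var(evals))              # ~ 2
```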


8. Simulation

```python
import numpy as np
import matplotlib.pyplot as plt

def marchenko_pastur_pdf(x, gamma, sigma=1):
    lambda_plus = sigma**2 * (1 + np.sqrt(gamma))**2
    lambda_minus = sigma**2 * (1 - np.sqrt(gamma))**2
    # PDF only defined within support
    with np.errstate(divide='ignore', invalid='ignore'):
        y = np.sqrt((lambda_plus - x) * (x - lambda_minus)) / (2 * np.pi * gamma * x * sigma**2)
    y[x < lambda_minus] = 0
    y[x > lambda_plus] = 0
    return y

def simulate_eigenvalues(p=1000, n=2000, signal_strength=0.0):
    gamma = p / n
    # Noise
    Z = np.random.randn(p, n)
    # Signal (Rank 1 spike)
    if signal_strength > 0:
        u = np.random.randn(p, 1); u /= np.linalg.norm(u)
        v = np.random.randn(n, 1); v /= np.linalg.norm(v)
        X = np.sqrt(signal_strength) * u @ v.T * np.sqrt(n) + Z
    else:
        X = Z
    S = (1/n) * X @ X.T
    evals = np.linalg.eigvalsh(S)

    # Plot Bulk
    plt.figure(figsize=(10, 6))
    plt.hist(evals, bins=80, density=True, alpha=0.5, label='Empirical')
    x = np.linspace(0, (1 + np.sqrt(gamma))**2 + 1, 1000)
    pdf = marchenko_pastur_pdf(x, gamma)
    plt.plot(x, pdf, 'r', linewidth=2, label='Marchenko-Pastur')

    # Edge Fluctuations (Tracy-Widom scaling)
    lambda_plus = (1 + np.sqrt(gamma))**2
    max_eval = np.max(evals)
    scaling_factor = lambda_plus * (n**(-2/3))
    tw_fluctuation = (max_eval - lambda_plus) / scaling_factor
    print(f"Largest Eigenvalue: {max_eval:.4f}")
    print(f"Edge Prediction: {lambda_plus:.4f}")
    print(f"Normalized Fluctuation (should be O(1)): {tw_fluctuation:.4f}")

    # BBP Prediction
    threshold = np.sqrt(gamma)
    if signal_strength > threshold:
        predicted_spike = (1 + signal_strength) * (1 + gamma/signal_strength)
        plt.axvline(predicted_spike, color='g', linestyle='--',
                    label=f'BBP Prediction ({predicted_spike:.2f})')
    plt.axvline(lambda_plus, color='k', linestyle='--', label='Bulk Edge')
    plt.legend()
    return evals

# To run:
# simulate_eigenvalues(p=1000, n=2000, signal_strength=1.5)
```

9. Universality and Neural Networks

None of the work above actually needs Gaussian entries. The Universality Theorem says that as long as $X_{ij}$ are i.i.d. with mean 0 and variance 1 and finite higher moments the spectral statistics settle to the same limits and this is a kind of spectral CLT.

This matters for deep learning. We pick weights as $W \sim \mathcal{N}(0, 2/n)$ and the Jacobian of a deep network is a product of such matrices and so for stable gradients the singular values have to bunch near 1. Marchenko-Pastur shows where they actually land and this is the theoretical basis for Xavier and He initialization. Orthogonal initialization goes further and collapses the spectrum to a Dirac at 1.
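Both claims are easy to see (my sketch, using variance $1/n$ so the squared singular values follow Marchenko-Pastur with $\gamma = 1$): a Gaussian-initialized square layer has singular values spread over $[0, 2]$, while an orthogonal matrix has every singular value pinned at 1.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 800

# Gaussian init with variance 1/n: squared singular values follow
# Marchenko-Pastur with gamma = 1 (support [0, 4]), so singular values
# spread over [0, 2]: far from the ideal "all near 1".
W = rng.standard_normal((n, n)) / np.sqrt(n)
sv_gauss = np.linalg.svd(W, compute_uv=False)
print(sv_gauss.min(), sv_gauss.max())   # ~0 and ~2

# Orthogonal init: every singular value is exactly 1
Q, R = np.linalg.qr(rng.standard_normal((n, n)))
Q = Q * np.sign(np.diag(R))             # sign fix so Q is Haar-distributed
sv_orth = np.linalg.svd(Q, compute_uv=False)
print(sv_orth.min(), sv_orth.max())     # both ~1
```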

The loss Hessian is also a large random matrix and empirically its spectrum looks like a Marchenko-Pastur bulk plus a few outlier spikes, which is exactly the BBP picture. A 2005 theorem about sample covariance matrices thus describes the training landscape of modern networks with surprising accuracy.


10. Where This Stands

Wigner started this field to model nuclear resonances in the 1950s and seventy years later the same eigenvalue distributions show up in neural network initialization and wireless channel capacity and the KPZ universality class.

Marchenko-Pastur shows the shape of noise, BBP shows when signal is detectable, and Tracy-Widom governs the edge fluctuations. Together these three results give a full picture of the eigenvalues of large random matrices and of any high-dimensional system whose correlation structure you want to understand.


Moments and Catalan combinatorics

There is another route to the Semicircle that dodges the Stieltjes transform entirely and it works by computing moments,

M_k = \frac{1}{n} \mathbb{E}[\operatorname{Tr}(W^k)] = \frac{1}{n} \mathbb{E} \left[ \sum_{i_1, \dots, i_k=1}^n W_{i_1 i_2} W_{i_2 i_3} \cdots W_{i_k i_1} \right]

Think of the indices as a closed walk $i_1 \to i_2 \to \dots \to i_1$ on a graph. Since entries are independent with mean 0 only walks that hit every edge at least twice contribute.

Odd moments wipe out because an odd-length walk cannot hit every edge twice. Even moments $M_{2m}$ count “non-crossing” walks on trees and the answer is the Catalan numbers,

C_m = \frac{1}{m+1} \binom{2m}{m}

And these are exactly the moments of the semicircle,

\int_{-2}^2 x^{2m}\, \frac{1}{2\pi} \sqrt{4-x^2}\, dx = C_m

Compact support together with moment matching gives uniqueness of the distribution by Carleman’s condition.

For Marchenko-Pastur the combinatorics run on non-crossing partitions counted by Narayana numbers.


Spectral large deviations

The semicircle is the typical spectral density and so how unlikely is it to see something else?

There is an LDP at speed $n^2$,

P(\rho_n \approx \mu) \approx e^{-n^2 I(\mu)}

The rate function from Ben Arous and Guionnet is

I(\mu) = \frac{1}{2} \int x^2\, d\mu(x) - \iint \log |x-y|\, d\mu(x)\, d\mu(y) - C

This is the electrostatic energy of a charge distribution where $\int x^2 d\mu$ is the confinement cost and $-\iint \log|x-y|\, d\mu\, d\mu$ is the repulsion cost and the semicircle minimizes it. Universality from this angle means the system sits at an energy minimum and so small pushes on the entry distribution do not move the ground state.


The Harish-Chandra-Itzykson-Zuber integral

Sometimes you need to integrate over the unitary group $U(n)$. The Harish-Chandra-Itzykson-Zuber integral ties together two Hermitian matrices $A$ and $B$,

I(A, B, \tau) = \int_{U(n)} \exp\left( \tau \operatorname{Tr}(A U B U^*) \right) dU

If $A$ has eigenvalues $a_i$ and $B$ has $b_i$ then

I(A, B, \tau) \propto \frac{\det(e^{\tau a_i b_j})_{i,j}}{\Delta(A)\, \Delta(B)}

where $\Delta(A) = \prod_{i<j} (a_i - a_j)$ is the Vandermonde determinant.

This is a clean closed form and it ties RMT to integrable systems and representation theory and lets you solve coupled matrix models exactly.
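For $n = 2$ the formula can be checked by Monte Carlo over Haar-random unitaries (my sketch; the proportionality constant $1/\tau$ used below is the standard normalization for $n = 2$, chosen so that $I \to 1$ as $\tau \to 0$):

```python
import numpy as np

rng = np.random.default_rng(8)

def haar_unitary(n, rng):
    # QR of a complex Ginibre matrix with phase correction (Mezzadri's recipe)
    Z = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
    Q, R = np.linalg.qr(Z)
    d = np.diag(R)
    return Q * (d / np.abs(d))

a = np.array([1.0, -0.5])   # eigenvalues of A
b = np.array([0.7, 0.2])    # eigenvalues of B
A, B = np.diag(a), np.diag(b)
tau = 1.3

# Monte Carlo estimate of the HCIZ integral over U(2)
vals = [np.exp(tau * np.trace(A @ U @ B @ U.conj().T)).real
        for U in (haar_unitary(2, rng) for _ in range(50_000))]
mc = np.mean(vals)

# Closed form for n = 2: det(e^{tau a_i b_j}) / (tau * Delta(a) * Delta(b))
det = np.exp(tau * (a[0]*b[0] + a[1]*b[1])) - np.exp(tau * (a[0]*b[1] + a[1]*b[0]))
exact = det / (tau * (a[0] - a[1]) * (b[0] - b[1]))
print(mc, exact)
```

The Monte Carlo mean lands within a fraction of a percent of the closed form.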


The Circular Law for non-symmetric matrices

Drop the symmetry assumption and let $X$ be $n \times n$ with i.i.d. entries $\mathcal{N}(0, 1/n)$. Now eigenvalues are complex.

Girko’s Circular Law. As $n \to \infty$ eigenvalues spread uniformly on the unit disk,

\rho(\lambda) = \frac{1}{\pi} \mathbb{1}_{|\lambda| \le 1}

For RNNs this matters directly: eigenvalues outside the unit circle fire off exploding gradients and those inside kill gradients. Orthogonal initialization pins them to $|\lambda| = 1$ and that is the sweet spot.
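A quick empirical look at the disk (my sketch): uniformity means $P(|\lambda| \le r) = r^2$.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 1000
X = rng.standard_normal((n, n)) / np.sqrt(n)
evals = np.linalg.eigvals(X)          # complex eigenvalues in general

radii = np.abs(evals)
print(radii.max())                    # spectral radius ~ 1
# Uniform on the unit disk means P(|lambda| <= r) = r^2
print(np.mean(radii <= 0.5))          # ~ 0.25
```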


Key developments

  • 1928. John Wishart derives the distribution of sample covariance matrices, launching the field.
  • 1955. Eugene Wigner models heavy nuclei resonance with random matrices and discovers the semicircle law.
  • 1967. Marchenko and Pastur derive the limiting spectral density for large covariance matrices.
  • 1984. Girko proves the circular law for non-Hermitian matrices.
  • 1991. Dan Voiculescu founds free probability, providing algebraic tools for computing spectra of sums and products.
  • 1994. Tracy and Widom discover the universal distribution governing edge fluctuations.
  • 2005. Baik, Ben Arous, and Peche characterize the signal detection threshold (BBP phase transition).
  • 2010s. RMT applied to deep learning, including loss surfaces (Pennington) and initialization schemes (Saxe, Ganguli).

Notation and terminology

  • Bulk: the continuous part of the limiting spectral density, where eigenvalues are densely packed.
  • Edge: the boundary of the bulk support. Tracy-Widom governs fluctuations here.
  • Free probability: noncommutative probability theory where “freeness” replaces independence. Linearizes spectral computations via the R-transform.
  • Stieltjes transform: $m(z) = \int \frac{\rho(\lambda)}{\lambda - z} d\lambda$, the moment-generating function of the spectral density, analytic on the upper half-plane.
  • Universality: the phenomenon that spectral statistics depend only on the first few moments of the entry distribution, not its specific form.
  • Wishart matrix: $W = XX^T$ with Gaussian $X$. The matrix analogue of the chi-squared distribution.

References

1. Anderson, G. W., Guionnet, A., & Zeitouni, O. (2010). “An Introduction to Random Matrices”. The standard reference. It is dense and thorough and covers large deviations and topological expansions.

2. Marchenko, V. A., & Pastur, L. A. (1967). “Distribution of eigenvalues for some sets of random matrices”. The original paper.

3. Baik, J., Ben Arous, G., & Peche, S. (2005). “Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices”. The BBP transition and it changed how people think about signal detection in high dimensions.

4. Tracy, C. A., & Widom, H. (1994). “Level-spacing distributions and the Airy kernel”. Where the Tracy-Widom distribution comes from and it is now everywhere in KPZ physics.

5. Pennington, J., & Bahri, Y. (2017). “Geometry of Neural Network Loss Surfaces via Random Matrix Theory”. Used RMT on Hessians of deep networks and found that critical points are saddles and not local minima.

6. Voiculescu, D. V. (1991). “Limit laws for random matrices and free products”. Free probability starts here.