The Berry-Esseen Bound

1. The Non-Asymptotic Central Limit Theorem

The Central Limit Theorem (CLT) is a qualitative statement: for i.i.d. variables with finite variance, the normalized sum converges in distribution to $\mathcal{N}(0, 1)$. In practice, $n$ is never infinite. We require a quantitative measure of proximity: if we use a Gaussian approximation at $n = 30$, what is our maximum error in probability? The Berry-Esseen Theorem provides the definitive upper bound on the Kolmogorov distance:

$$D_n = \sup_{x \in \mathbb{R}} |P(S_n \le x) - \Phi(x)| \le C \, \frac{\mathbb{E}|X|^3}{\sigma^3 \sqrt{n}}$$

where $C$ is a universal constant. This result is remarkable for its uniformity. It doesn’t matter if the underlying distribution is skewed, multimodal, or discrete; the $1/\sqrt{n}$ rate is the “speed of light” for the CLT.
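The bound is directly computable once the first three absolute moments are known. Below is a minimal sketch, assuming the Shevtsova constant $C = 0.4748$ (see Appendix C) and using a centered Bernoulli summand as the worked example; the moment formulas follow from the definition $X = B - p$.

import numpy as np

def berry_esseen_bound(rho, sigma, n, C=0.4748):
    # Worst-case Kolmogorov distance: C * E|X|^3 / (sigma^3 * sqrt(n))
    return C * rho / (sigma**3 * np.sqrt(n))

# Centered Bernoulli(p): X = B - p, so E|X|^3 = p(1-p) * ((1-p)^2 + p^2)
p = 0.5
sigma = np.sqrt(p * (1 - p))
rho = p * (1 - p) * ((1 - p)**2 + p**2)
for n in (30, 100, 1000):
    print(n, berry_esseen_bound(rho, sigma, n))

Even at $n = 1000$ the guaranteed error is about $0.015$, a sobering floor before trusting Gaussian tail estimates.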

2. The Lyapunov Condition and Fractional Moments

Is the third moment $\mathbb{E}|X|^3$ strictly necessary? The CLT itself only requires the second moment $\mathbb{E}X^2 < \infty$. However, the rate of convergence depends on the tail behavior. Consider the Lyapunov Condition: let $r = 2 + \delta$ for $\delta \in (0, 1]$. If $\mathbb{E}|X|^r < \infty$, define:

$$L_n = \frac{\sum_{i=1}^n \mathbb{E}|X_i|^r}{\left(\sum_{i=1}^n \mathbb{E}X_i^2\right)^{r/2}}$$

If $L_n \to 0$, the CLT holds. The Berry-Esseen bound generalizes to these fractional moments:

$$\sup_x |F_n(x) - \Phi(x)| \le C_\delta \cdot L_n$$

For i.i.d. variables, $L_n = \frac{n \, \mathbb{E}|X|^r}{(n \sigma^2)^{r/2}} = O(n^{-\delta/2})$; a numerical sketch of this rate follows the list below.

  • If we only have $2.1$ moments ($\delta = 0.1$), the error decays as $n^{-0.05}$ (extremely slowly).
  • If we have $3$ moments ($\delta = 1$), we recover the $n^{-1/2}$ rate.
  • Increasing moments beyond $3$ does NOT improve the worst-case Kolmogorov rate beyond $1/\sqrt{n}$, though it allows for higher-order Edgeworth expansions for smooth distributions.
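A minimal sketch of the $n^{-\delta/2}$ scaling, assuming a centered Exp(1) summand (all of its moments are finite, so we can dial in any $r$): analytically, the ratio of $L_n$ values across a 100-fold increase in $n$ equals $100^{-(r-2)/2}$, and the Monte Carlo moment estimate cancels out of that ratio.

import numpy as np

def iid_lyapunov_ratio(abs_r_moment, sigma, r, n):
    # L_n = n * E|X|^r / (n * sigma^2)^(r/2) = (E|X|^r / sigma^r) * n^(-(r-2)/2)
    return n * abs_r_moment / (n * sigma**2)**(r / 2)

rng = np.random.default_rng(0)
X = rng.exponential(1.0, size=10**6) - 1.0      # centered Exp(1), sigma = 1
for r in (2.1, 2.5, 3.0):
    m_r = np.mean(np.abs(X)**r)                 # Monte Carlo estimate of E|X|^r
    L_small = iid_lyapunov_ratio(m_r, 1.0, r, 10**2)
    L_large = iid_lyapunov_ratio(m_r, 1.0, r, 10**4)
    print(r, L_large / L_small, 100.0**(-(r - 2) / 2))   # the two columns agree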


3. Esseen’s Smoothing Lemma: Fourier Interpolation

We cannot easily bound $\sup_x |F_n(x) - \Phi(x)|$ directly. We can, however, bound the difference of characteristic functions (CFs), $\varphi_n(t) - e^{-t^2/2}$. The inverse Fourier transform relates $\Delta\varphi$ to $\Delta F$. Problem: $F_n$ may not have a density (if $X$ is discrete), so the inversion formula $\frac{1}{2\pi} \int e^{-itx} \, \Delta\varphi(t) \, dt$ fails to converge. Solution: smoothing. We convolve the distributions with a smooth kernel $H_T$ of limited bandwidth $T$.

The Smoothing Lemma (Esseen, 1945): Let $F, G$ be CDFs, with $G$ differentiable and $|G'| \le M$. For any $T > 0$, the maximum deviation is bounded by the $1/|t|$-weighted integral of the CF difference:

$$\sup_x |F(x) - G(x)| \le \frac{1}{\pi} \int_{-T}^T \left| \frac{\varphi_F(t) - \varphi_G(t)}{t} \right| dt + \frac{24M}{\pi T}$$
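The lemma is easy to sanity-check numerically. For the Rademacher sum $S_n = \frac{1}{\sqrt{n}}\sum \pm 1$, the CF is exactly $\cos(t/\sqrt{n})^n$ and the true Kolmogorov distance is computable from the binomial CDF. The sketch below uses an arbitrary bandwidth $T = 10$, so the right-hand side is valid but far from tight.

import numpy as np
import scipy.stats as stats
from scipy.integrate import quad

n, T = 100, 10.0
M = 1.0 / np.sqrt(2 * np.pi)          # Lipschitz constant of Phi (sup of the normal density)

# Exact Kolmogorov distance for the Rademacher sum
j = np.arange(n + 1)
s = (2 * j - n) / np.sqrt(n)          # support points of S_n
cdf = stats.binom.cdf(j, n, 0.5)
Phi = stats.norm.cdf(s)
lhs = max(np.max(np.abs(cdf - Phi)),                                  # just after each jump
          np.max(np.abs(np.concatenate(([0.0], cdf[:-1])) - Phi)))    # just before each jump

# Esseen's upper bound (the integrand is even, so integrate over [0, T] and double)
integrand = lambda t: abs(np.cos(t / np.sqrt(n))**n - np.exp(-t**2 / 2)) / t
integral, _ = quad(integrand, 1e-9, T, limit=200)
rhs = (2 / np.pi) * integral + 24 * M / (np.pi * T)
print(lhs, rhs)                       # lhs stays safely below rhs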

4. Convergence of the Characteristic Function

To use the Lemma, we analyze $\Delta\varphi$ for the sum $S_n = \frac{1}{\sqrt{n}} \sum X_i$. Let $\mathbb{E}X = 0$, $\mathbb{E}X^2 = 1$, and $\mathbb{E}|X|^3 = \rho$. A third-order Taylor expansion of the CF gives, for some $|\theta| \le 1$:

$$\varphi_X(t) = \mathbb{E}[e^{itX}] = 1 - \frac{t^2}{2} + \theta \, \frac{\rho |t|^3}{6}$$

For the sum $S_n$:

$$\varphi_n(t) = \left( \varphi_X\!\left(\frac{t}{\sqrt{n}}\right) \right)^n = \left( 1 - \frac{t^2}{2n} + \theta \, \frac{\rho |t|^3}{6 n^{3/2}} \right)^n$$

To control the error, we analyze $\log \varphi_n(t)$. Write $\varphi_X(u) = 1 - \frac{u^2}{2} + R(u)$ where $|R(u)| \le \frac{\rho |u|^3}{6}$. Restricting $t$ to $|t| \le \frac{\sqrt{n}}{4\rho}$ ensures $\left| \varphi_X\!\left(\frac{t}{\sqrt{n}}\right) - 1 \right| \le \frac{1}{2}$, allowing the use of the principal branch of the logarithm, $\log(1+z) = z - \frac{z^2}{2} + O(|z|^3)$.

Expanding the log-characteristic function:

$$\log \varphi_n(t) = n \log\left( 1 - \frac{t^2}{2n} + O\!\left( \frac{\rho |t|^3}{n^{3/2}} \right) \right) = -\frac{t^2}{2} + O\!\left( \frac{\rho |t|^3}{\sqrt{n}} \right)$$

Exponentiating back to the domain of characteristic functions:

$$\varphi_n(t) = e^{-t^2/2} \exp\!\left( O\!\left( \frac{\rho |t|^3}{\sqrt{n}} \right) \right)$$

Using $|e^z - 1| \le |z| e^{|z|}$, we derive a bound suitable for Esseen’s integral:

$$\left| \varphi_n(t) - e^{-t^2/2} \right| \le e^{-t^2/2} \, \frac{C \rho |t|^3}{\sqrt{n}}$$

This makes $\int_{-T}^T \frac{|\Delta\varphi|}{|t|} \, dt$ converge at the $O(1/\sqrt{n})$ rate. We choose $T \approx \sqrt{n}/\rho$ in Esseen’s Lemma. The integral becomes:

$$\int_{-T}^T \frac{1}{|t|} \, e^{-t^2/2} \, \frac{\rho |t|^3}{\sqrt{n}} \, dt = \frac{\rho}{\sqrt{n}} \int_{-T}^T t^2 e^{-t^2/2} \, dt = O(\rho/\sqrt{n})$$

The “bias” term $\frac{24M}{\pi T}$ is also $O(\rho/\sqrt{n})$ since $T \propto \sqrt{n}/\rho$. This proves the $1/\sqrt{n}$ rate. The beauty of this proof is that it localizes the problem to the behavior of the CF near the origin.
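The CF estimate can be checked in closed form for the centered Exp(1) summand, whose CF is $\varphi_X(u) = e^{-iu}/(1 - iu)$. A sketch verifying that the normalized error $|\varphi_n(t) - e^{-t^2/2}| \cdot \sqrt{n} \,/\, (\rho |t|^3 e^{-t^2/2})$ stays bounded on the allowed range $|t| \le \sqrt{n}/(4\rho)$:

import numpy as np
from scipy.integrate import quad

n = 400
rho, _ = quad(lambda y: abs(y - 1)**3 * np.exp(-y), 0, np.inf)   # E|X|^3 for centered Exp(1)

def phi_n(t):
    # CF of S_n = sum(Exp(1) - 1)/sqrt(n): [exp(-iu)/(1 - iu)]^n with u = t/sqrt(n)
    u = t / np.sqrt(n)
    return (np.exp(-1j * u) / (1 - 1j * u))**n

ts = np.linspace(0.1, np.sqrt(n) / (4 * rho), 200)
ratio = [abs(phi_n(t) - np.exp(-t**2 / 2)) * np.sqrt(n) / (rho * t**3 * np.exp(-t**2 / 2))
         for t in ts]
print(max(ratio))    # an O(1) number, consistent with the envelope above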


5. Stein’s Method: The Algebraic Perspective

Fourier methods rely on the product property of CFs, which makes them hard to adapt to dependent variables. Stein’s Method (1972) operates directly on expectations using differential equations. The starting point is the Stein Operator $\mathcal{A} f(w) = f'(w) - w f(w)$: for $Z \sim \mathcal{N}(0, 1)$, $\mathbb{E}[\mathcal{A} f(Z)] = 0$ for all smooth $f$, and this identity characterizes the standard Gaussian.

The Stein Equation: To bound $|P(S_n \le x) - \Phi(x)|$, we choose the test function $h(w) = \mathbb{I}(w \le x)$ and solve:

$$f'(w) - w f(w) = h(w) - \mathbb{E}h(Z)$$

The Berry-Esseen error is then $|\mathbb{E}[\mathcal{A} f(S_n)]|$. Crucially, for $h = \mathbb{I}_{(-\infty, x]}$, the solution $f$ satisfies:

$$\|f\|_\infty \le \sqrt{\pi/2}, \quad \|f'\|_\infty \le 1, \quad \|f''\|_\infty \le 2$$

These bounds are independent of the distribution of $X_i$! By expanding $\mathbb{E}[S_n f(S_n)]$ using a neighborhood construction (similar to the leave-one-out idea but more general), we derive the same $\rho/\sqrt{n}$ rate. Algebraically, we replace $S_n$ with a sum of “locals” and apply Taylor’s theorem.
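The Stein identity and the resulting convergence are easy to see by simulation. A sketch, with $f = \tanh$ as an arbitrary smooth bounded test function: the residual $\mathbb{E}[f'(W) - W f(W)]$ is essentially zero for a true Gaussian sample and shrinks as a skewed standardized sum converges.

import numpy as np

rng = np.random.default_rng(1)

def stein_residual(w, f, fprime):
    # Monte Carlo estimate of E[f'(W) - W f(W)]; zero iff W is standard normal
    return np.mean(fprime(w) - w * f(w))

f = np.tanh
fprime = lambda w: 1.0 / np.cosh(w)**2

Z = rng.standard_normal(10**6)
print('gaussian:', stein_residual(Z, f, fprime))     # ~ 0

for n in (5, 50, 500):
    X = rng.exponential(1.0, size=(10**5, n)) - 1.0  # skewed, mean 0, variance 1
    S = X.sum(axis=1) / np.sqrt(n)
    print(n, stein_residual(S, f, fprime))           # shrinks roughly like 1/sqrt(n)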

6. Dependent Variables and Dependency Graphs

Stein’s method excels at handling local dependence. Suppose the sequence $\{X_i\}_{i=1}^n$ is not independent, but each $X_i$ depends on only a few neighbors. Define a dependency graph $L = (V, E)$ with $V = \{1, \dots, n\}$: whenever no edge connects the index sets $A$ and $B$, the collections $\{X_k : k \in A\}$ and $\{X_k : k \in B\}$ are independent. Let $D$ be the maximum degree of the graph. General Berry-Esseen for dependency graphs:

$$\sup_x |F_n(x) - \Phi(x)| \le C \, \frac{D^2 \sum_i \mathbb{E}|X_i|^3}{\sigma^3}$$

where $\sigma^2 = \operatorname{Var}\!\left(\sum_i X_i\right)$. For a $k$-dependent sequence (e.g., $X_i = f(Y_i, \dots, Y_{i+k})$), $D = 2k$. The rate remains $O(1/\sqrt{n})$ as long as the dependence range $D$ is much smaller than $n$. This allows the CLT to apply to physical systems with local interactions (like the 1D Ising model or Markov chains).
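A minimal simulation sketch, assuming a 2-dependent moving average of centered chi-square innovations: each $X_i$ shares inputs only with its nearest neighbors, yet the standardized sum is statistically indistinguishable from Gaussian (the sample is self-standardized, so the KS p-value is indicative rather than exact).

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)
n, k, reps = 2000, 2, 3000            # k-dependent sequence: max degree D = 2k

S = np.empty(reps)
for r in range(reps):
    Y = rng.standard_normal(n + k)**2 - 1                                 # centered chi^2 innovations
    X = np.lib.stride_tricks.sliding_window_view(Y, k + 1).sum(axis=1)    # X_i = f(Y_i, ..., Y_{i+k})
    S[r] = X.sum()

S = (S - S.mean()) / S.std()          # standardize with the empirical sigma
print(stats.kstest(S, 'norm'))        # small KS statistic despite the dependence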


7. Higher Order: The Edgeworth Expansion

Berry-Esseen is an $O(n^{-1/2})$ result. For continuous variables, we can find the exact correction terms for skewness and kurtosis. Let $\kappa_j$ be the $j$-th cumulant of the standardized summand. Taylor expand the cumulant generating function $\log \varphi_n(t)$:

$$\log \varphi_n(t) = \sum_{j=2}^\infty \frac{\kappa_j (it)^j}{n^{j/2-1} \, j!} = -\frac{t^2}{2} + \frac{\kappa_3 (it)^3}{6\sqrt{n}} + \frac{\kappa_4 (it)^4}{24 n} + \dots$$

Exponentiating gives the Edgeworth Expansion in the Fourier domain:

$$\varphi_n(t) = e^{-t^2/2} \left[ 1 + \frac{\kappa_3 (it)^3}{6\sqrt{n}} + \left( \frac{\kappa_4 (it)^4}{24 n} + \frac{\kappa_3^2 (it)^6}{72 n} \right) + \dots \right]$$

Inverting the transform term-by-term uses the Hermite polynomials, defined by $He_k(x) \phi(x) = (-1)^k \frac{d^k}{dx^k} \phi(x)$. Specifically:

$$F_n(x) = \Phi(x) - \frac{\kappa_3}{6\sqrt{n}} (x^2 - 1) \phi(x) - \dots$$

Cramér’s Condition: Does the Edgeworth expansion always work? No. It requires $\limsup_{|t| \to \infty} |\varphi_X(t)| < 1$.

  • This holds for all distributions with a non-zero absolutely continuous component.
  • It fails for discrete (lattice) variables. For Bernoulli variables, the error remains $O(1/\sqrt{n})$, and the Edgeworth series does not converge to the discrete distribution. (A worked continuous example follows the list.)
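For a concrete continuous check, the sum of $n$ Exp(1) variables is exactly Gamma($n$, 1), so the true CDF of the standardized sum is available in closed form. A sketch comparing the raw Gaussian approximation with the one-term Edgeworth correction ($\kappa_3 = 2$ for Exp(1)):

import numpy as np
import scipy.stats as stats

n = 20
kappa3 = 2.0                          # third cumulant of Exp(1)
x = np.linspace(-3, 3, 601)

exact = stats.gamma.cdf(n + x * np.sqrt(n), a=n)   # sum of n Exp(1) is Gamma(n, 1)
gauss = stats.norm.cdf(x)
edge = gauss - (kappa3 / (6 * np.sqrt(n))) * (x**2 - 1) * stats.norm.pdf(x)

print(np.max(np.abs(exact - gauss)))  # plain CLT error, O(1/sqrt(n))
print(np.max(np.abs(exact - edge)))   # residual after the skewness term, O(1/n)

At $n = 20$ the single correction term typically shrinks the maximum CDF error by roughly an order of magnitude.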

8. Numerical Check: Lattice vs Continuous Convergence

The following simulation demonstrates the “staircase” error of lattice distributions compared to the smooth Edgeworth correction available for continuous variables.

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def compare_convergence(n=20):
    # 1. Discrete case: Rademacher (+/-1) steps, a lattice distribution
    # 2. Continuous case: centered Exponential(1)
    x = np.linspace(-3, 3, 500)

    # Gaussian density and CDF
    phi_x = stats.norm.pdf(x)
    Phi_x = stats.norm.cdf(x)

    # Edgeworth correction (skewness 2 for Exp(1))
    rho_exp = 2.0
    edgeworth_exp = Phi_x - (rho_exp / (6 * np.sqrt(n))) * (x**2 - 1) * phi_x

    # Empirical results
    samples = 100000

    # Lattice sum
    X_bern = np.random.choice([-1, 1], size=(samples, n))
    S_bern = np.sum(X_bern, axis=1) / np.sqrt(n)

    # Centered exponential sum
    X_exp = np.random.exponential(1, size=(samples, n)) - 1
    S_exp = np.sum(X_exp, axis=1) / np.sqrt(n)

    plt.figure(figsize=(12, 5))

    # Plot empirical-CDF errors against the Gaussian
    plt.subplot(1, 2, 1)
    plt.step(np.sort(S_bern), np.linspace(0, 1, samples) - stats.norm.cdf(np.sort(S_bern)),
             label='Lattice Error')
    plt.title('Lattice Error (Staircase)')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(np.sort(S_exp), np.linspace(0, 1, samples) - stats.norm.cdf(np.sort(S_exp)),
             label='Empirical Exp Error', alpha=0.5)
    plt.plot(x, edgeworth_exp - Phi_x, 'r--', label='Edgeworth Prediction')
    plt.title('Continuous Error (Edgeworth Correction)')
    plt.legend()
    plt.show()

compare_convergence()


9. The Local Central Limit Theorem (LCLT)

Berry-Esseen bounds the CDF error. What about the density $f_n(x)$? If $X$ has a continuous component, we expect $f_n(x) \to \phi(x)$. Theorem (Local Limit Theorem): Let $\varphi(t)$ be the characteristic function of $X$. If $\int_{-\infty}^\infty |\varphi(t)|^r \, dt < \infty$ for some $r \ge 1$, then the density $f_n(x)$ of the normalized sum exists for large $n$ and satisfies:

$$\sup_{x \in \mathbb{R}} |f_n(x) - \phi(x)| = O\!\left( \frac{1}{\sqrt{n}} \right)$$

This requires the distribution of $X$ to have a sufficiently smooth component, suppressing the “lattice spikes” seen in discrete variables. This is the Local CLT. For lattice distributions, $f_n(x)$ is defined only on the grid, and the lattice Local CLT states that:

$$\frac{\sqrt{n}}{h} \, P(S_n = x) = \phi(x) + o(1)$$

where $h$ is the lattice span. The rate is still $O(1/\sqrt{n})$, but the interpretation changes from “area error” to “height error”.
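A lattice sketch: for the Rademacher sum, $\sum X_i$ lives on a lattice of span $h = 2$, and the rescaled point masses line up with the Gaussian density.

import numpy as np
import scipy.stats as stats

n = 400
k = np.arange(-n, n + 1, 2)           # attainable values of sum(X_i); lattice span h = 2
pmf = stats.binom.pmf((k + n) // 2, n, 0.5)
x = k / np.sqrt(n)                    # normalized lattice points

lclt = (np.sqrt(n) / 2) * pmf         # (sqrt(n)/h) * P(S_n = x)
print(np.max(np.abs(lclt - stats.norm.pdf(x))))   # height error, shrinking with n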

10. Applications in Real World Risk

10.1 Finance: The VaR Corruption

In quantitative finance, Value-at-Risk (VaR) is the $\alpha$-quantile of the loss distribution. Under the “Normal Assumption,” $\mathrm{VaR}_\alpha = \mu + \Phi^{-1}(\alpha)\,\sigma$. Berry-Esseen tells us precisely how much we can trust this quantile for a portfolio of $n$ assets. If asset returns are skewed (fat tails or leverage effects), the true VaR deviates. The Skewness Leak: adjusting Value-at-Risk for skewness requires the Cornish-Fisher expansion, obtained by inverting the Edgeworth series. Let $z_\alpha = \Phi^{-1}(\alpha)$ be the Gaussian quantile. The true quantile $q_\alpha$ of the distribution of $S_n$ admits the asymptotic expansion:

$$q_\alpha = z_\alpha + \frac{1}{\sqrt{n}} \left[ \frac{\kappa_3}{6}(z_\alpha^2 - 1) \right] + O(n^{-1})$$

If $\kappa_3 < 0$ (negative skewness), the term $\frac{\kappa_3}{6}(z_\alpha^2 - 1)$ is negative (since typically $z_\alpha^2 > 1$ for tail risks like $\alpha = 0.01$). This shifts the quantile further into the tail, showing that the Gaussian assumption underestimates potential loss. For $\alpha = 0.01$ (1% risk), $z_{0.01} \approx -2.33$. If skewness $\kappa_3 = -1$ (typical for equities), the correction is negative, meaning the true loss is underestimated. Without Berry-Esseen, a risk manager might believe they are capped at 2.33 sigmas, when $n = 50$ implies they are actually facing roughly 2.4-2.5 sigmas of risk. This “Model Risk” is a primary cause of capital inadequacy during market stress.
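A sketch of the first-order Cornish-Fisher adjustment with the numbers quoted above ($n = 50$, $\kappa_3 = -1$, $\alpha = 0.01$); the skewness value is illustrative, not calibrated to any data set.

import numpy as np
import scipy.stats as stats

n, alpha, kappa3 = 50, 0.01, -1.0
z = stats.norm.ppf(alpha)                           # Gaussian quantile, ~ -2.326
q = z + (kappa3 / 6) * (z**2 - 1) / np.sqrt(n)      # first-order Cornish-Fisher quantile
print(z, q)                                         # q ~ -2.43: deeper in the loss tail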

10.2 A/B Testing and Power Analysis

In A/B testing, we calculate the required sample size $N$ to detect an effect $\Delta$. The standard formula $N \approx 16 \sigma^2 / \Delta^2$ assumes Gaussian errors. The Berry-Esseen Penalty: if your underlying metric (e.g., “Time Spent”) is highly skewed, your actual p-value may be higher than intended. If the third moment $\rho$ is large, the true alpha-risk of your test is up to $\alpha + D_n$. If $D_n \approx 0.05$ and your target is $\alpha = 0.05$, your Type-I error rate may have doubled. Practitioners must over-sample by a factor of $\left(1 + \frac{C \rho}{\alpha \sigma^3 \sqrt{n}}\right)$ to maintain the same rigorous confidence level.
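A sketch of the alpha-inflation effect, assuming a lognormal “Time Spent” metric under the null (the true mean of LogNormal(0, 1) is $e^{1/2}$, so we can test against it exactly): skewness pushes the two one-sided rejection rates of a nominal 5% test away from 0.05 in opposite directions.

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)
n, alpha, reps = 30, 0.05, 20000
z = stats.norm.ppf(1 - alpha)

x = rng.lognormal(0.0, 1.0, size=(reps, n))        # heavily skewed null metric
mu_true = np.exp(0.5)                              # exact mean of LogNormal(0, 1)
t = (x.mean(axis=1) - mu_true) / (x.std(axis=1, ddof=1) / np.sqrt(n))

print((t < -z).mean(), (t > z).mean())             # lower tail inflated, upper tail deflated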


11. Non-Uniform Berry-Esseen Bounds

The standard Berry-Esseen theorem is uniform: it bounds the error for all $x$. However, the error is often much smaller for large $x$ (in the tails). The Non-Uniform Bound (Bikelis, 1966):

$$|F_n(x) - \Phi(x)| \le \frac{C \rho}{\sqrt{n} \, (1 + |x|^3)}$$

This is a critical refinement. At 3 sigmas ($x = 3$), the error bound is $1/28$th of the bound at the mean. This allows us to use CLT logic for tail probabilities even when $n$ is modest, provided we account for the $1/(1 + |x|^3)$ decay. Note, however, that this polynomial decay is slower than the Gaussian tail $e^{-x^2/2}$, so for extreme $x$ the bound eventually exceeds the tail probability it approximates and becomes uninformative.
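A quick illustration against the exact Rademacher CDF errors: multiplying the local error by $(1 + |x|^3)$ keeps the product roughly flat from the center out to 3 sigmas, as the non-uniform bound predicts (this illustrates the shape of the bound, not its constant).

import numpy as np
import scipy.stats as stats

n = 200
j = np.arange(n + 1)
s = (2 * j - n) / np.sqrt(n)                       # Rademacher sum support
err = np.abs(stats.binom.cdf(j, n, 0.5) - stats.norm.cdf(s))

for xc in (0.0, 1.0, 2.0, 3.0):
    m = np.abs(s - xc) < 0.25                      # errors near x = xc
    print(xc, err[m].max(), err[m].max() * (1 + xc**3))   # last column stays comparable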


12. Conclusion

The Berry-Esseen bound is a warning against naive Gaussian modeling. It tells us that for small $n$, the “tails” of our distribution are almost certainly non-Gaussian. If your risk management system relies on the assumption that a 5-sigma event has probability $10^{-7}$, Berry-Esseen reminds you that for $n = 100$, the error in that tail probability might be as large as $10^{-2}$. The “Distributional Limit” is a mathematical ideal; for the practitioner, the $1/\sqrt{n}$ error is the physical floor of our certainty. The Gaussian is a useful fiction, but Berry-Esseen is the hard reality of finite samples.


Historical Timeline

Year | Event | Significance
1835 | Adolphe Quetelet | Identifies the “Average Man,” popularizing the Gaussian as a model of natural variation.
1942 | Carl-Gustav Esseen | Proves the original Berry-Esseen bound via the smoothing lemma.
1966 | A. Bikelis | Derives non-uniform bounds for tail probabilities.
1972 | Charles Stein | Introduces Stein’s Method, enabling bounds for dependent variables.
1986 | Andrew Barron | Proves the entropic CLT for KL divergence.
1991 | Friedrich Götze | Establishes multivariate bounds for convex sets (the $d^{1/4}$ rate).
2011 | Irina Shevtsova | Tightens the universal constant to $0.4748$.
2017 | Koltchinskii & Lounici | Proves matrix Berry-Esseen bounds.

Appendix A: Proof Sketch of Esseen’s Smoothing Lemma

The lemma is the “magic” behind Berry-Esseen: it translates errors in frequency space into uniform errors in real space. Let $\Delta(x) = F(x) - G(x)$. We aim to bound $\eta = \sup_x |\Delta(x)|$. Let $k_T$ be a smoothing kernel whose CF is supported on $[-T, T]$, and let $K_T$ be its CDF. Convolve: $\Delta_T = \Delta * K_T$. By the Fourier inversion formula for functions with compactly supported transforms:

$$\Delta_T(x) = \frac{1}{2\pi} \int_{-T}^T e^{-itx} \, \frac{\varphi_F(t) - \varphi_G(t)}{-it} \, \varphi_K(t) \, dt$$

The first term of the lemma follows from bounding this integral. The second term ($24M/\pi T$) accounts for the smoothing error $|\Delta_T(x) - \Delta(x)|$. The Kernel Trick: since $G$ is $M$-Lipschitz, one can show that for any shift $a$:

$$|\Delta(a)| \le \max_b |\Delta_T(a + b)| + M \cdot \text{width}$$

Choosing the Fejér kernel or the de la Vallée Poussin kernel yields the specific constant $24/\pi$.


Appendix B: Pathologies and Discrete Corrections

B.1 Infinite Moments

If $\mathbb{E}X^2 = \infty$ (e.g., Pareto with $\alpha < 2$), the CLT fails. The sum converges to a stable distribution (a Lévy flight): $n^{-1/\alpha} S_n \to L_\alpha$. Berry-Esseen is meaningless in this regime.

B.2 High Dimensions (The Convex Set Requirement)

In $\mathbb{R}^d$, the uniform error over convex sets scales as $d^{1/4} n^{-1/2}$. However, if we test against arbitrary Borel sets, the error is 1 whenever the two distributions are mutually singular (as a lattice law and the Gaussian are). Specifically, on the boundary of a hypercube the Gaussian probability is small while the discrete probability can be large.

B.3 Discrete Corrections: Yates & Continuity

Since lattice distributions produce the worst-case Berry-Esseen error, how do we fix them? The Continuity Correction: when approximating $P(S_n \le k)$ for integer $k$, we use $P(Z \le k + 0.5)$. This half-integer shift compensates for the staircase error. In the Edgeworth context, it corresponds to the Euler-Maclaurin correction. For a Bernoulli sum, the error drops from $O(1/\sqrt{n})$ to $O(1/n)$ when the continuity correction is used.
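The improvement is easy to verify against the exact binomial CDF; a minimal sketch for Binomial(100, 1/2):

import numpy as np
import scipy.stats as stats

n, p = 100, 0.5
k = np.arange(n + 1)
exact = stats.binom.cdf(k, n, p)
mu, sd = n * p, np.sqrt(n * p * (1 - p))

plain = stats.norm.cdf((k - mu) / sd)
corrected = stats.norm.cdf((k + 0.5 - mu) / sd)    # half-integer shift
print(np.max(np.abs(exact - plain)))               # O(1/sqrt(n)) staircase error
print(np.max(np.abs(exact - corrected)))           # more than an order of magnitude smaller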


Appendix C: The Quest for the Universal Constant

The Berry-Esseen bound $C \rho / \sqrt{n}$ is only useful if we know $C$. The history of the subject is a race to shrink this constant.

  • Esseen (1942): Original proof gave $C \approx 7.59$.
  • van Beek (1972): Improved to $0.7975$.
  • Shiganov (1986): Reduced to $0.7655$.
  • Shevtsova (2011): Current tightest bound, $C < 0.4748$.
Assumption | Best Known Constant $C$ | Source
General (independent) | 0.4748 | Shevtsova (2011)
Symmetric variables | 0.4097 | Shevtsova (2010)
i.i.d. symmetric | 0.4097 | Tyurin (2010)
Lower bound (Bernoulli) | 0.3989 | Esseen (1956)
Non-uniform, general | 25.5 | Paditz (1989)

Why does a gap of $0.07$ matter? In safety-critical systems (e.g., nuclear engineering), we use Gaussian approximations to guarantee that failure probability is $< 10^{-6}$. If $C$ is too large, the required “buffer” $n$ becomes economically non-viable.


Appendix D: High-Dimensional and Matrix Extensions

D.1 Multivariate Generalizations

In $\mathbb{R}^d$, the relevant moment is $\mathbb{E}\|X\|_2^3$, and we care about the convergence of probabilities of sets rather than a single CDF. The Götze Bound (1991): for i.i.d. vectors $X_i \in \mathbb{R}^d$ with identity covariance:

$$\sup_{A \in \mathcal{C}} |P(S_n \in A) - P(Z \in A)| \le \frac{c \, d^{1/4} \, \mathbb{E}\|X\|^3}{\sqrt{n}}$$

where $\mathcal{C}$ is the class of all convex sets. The $d^{1/4}$ dependency is the “curse of dimensionality” for the CLT.

D.2 Matrix Berry-Esseen Bounds

Can we extend the $1/\sqrt{n}$ rate to sums of independent random matrices? The Koltchinskii-Lounici Bound (2017): for $n$ independent symmetric matrices $X_i$ with operator norm $\|X_i\|_{\text{op}} \le M$ and covariance $\Sigma = \mathbb{E}[X_i^2]$, the rate of convergence is governed by the effective rank (or stable rank), defined as:

$$r(\Sigma) = \frac{\operatorname{Tr}(\Sigma)}{\|\Sigma\|_{\text{op}}}$$

where $\|\Sigma\|_{\text{op}}$ is the spectral norm (largest eigenvalue). The Koltchinskii-Lounici bound states:

$$\sup_{t \in \mathbb{R}} \left| P(\|S_n\|_{\text{op}} \le t) - P(\|Z\|_{\text{op}} \le t) \right| \le C \sqrt{\frac{r(\Sigma)}{n}}$$

showing that convergence depends on $r(\Sigma)$ rather than on the ambient dimension $d$. This is the matrix analogue of Berry-Esseen: unlike the scalar case, where $\rho$ is the key parameter, here the stable rank dictates the speed of convergence.
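Effective rank is cheap to compute and builds intuition for when the matrix bound is nearly dimension-free. A minimal sketch:

import numpy as np

def effective_rank(Sigma):
    # r(Sigma) = trace / operator norm (largest eigenvalue of a symmetric matrix)
    eig = np.linalg.eigvalsh(Sigma)
    return eig.sum() / eig.max()

d = 100
print(effective_rank(np.eye(d)))                             # = d: no low-rank structure
print(effective_rank(np.diag(1.0 / np.arange(1, d + 1)**2))) # ~ pi^2/6: nearly dimension-free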


Appendix E: The Entropic CLT

While Berry-Esseen deals with the Kolmogorov distance $D_{KS}$, modern information theory looks at the relative entropy (Kullback-Leibler divergence), $D(f_n \| \phi) = \int f_n \log(f_n / \phi) \, dx$. Theorem (Barron, 1986): if $D(f_1 \| \phi) < \infty$, then:

$$D(f_n \| \phi) \to 0$$

The Rate: under Stein-type conditions (log-Sobolev inequalities), the rate of entropy convergence is $O(1/n)$. Note that this is faster than the $1/\sqrt{n}$ Kolmogorov rate. The entropic CLT provides the final justification for Maximum Entropy principles: the Gaussian is not just a limit, it is the state of highest ignorance.
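For the standardized sum of exponentials, the density $f_n$ is an explicit (shifted, rescaled) Gamma density, so $D(f_n \| \phi)$ can be computed by quadrature. A sketch checking the $O(1/n)$ entropy rate; the product $n \cdot D$ should level off to a constant.

import numpy as np
import scipy.stats as stats
from scipy.integrate import quad

def kl_to_gaussian(n):
    # D(f_n || phi) for the standardized sum of n Exp(1) variables (a Gamma(n, 1) shift)
    f = lambda x: np.sqrt(n) * stats.gamma.pdf(n + x * np.sqrt(n), a=n)
    integrand = lambda x: f(x) * (np.log(f(x)) - stats.norm.logpdf(x)) if f(x) > 0 else 0.0
    val, _ = quad(integrand, -np.sqrt(n) + 1e-9, 12, limit=200)
    return val

for n in (5, 10, 20, 40):
    d = kl_to_gaussian(n)
    print(n, d, n * d)                # n * D levels off, consistent with O(1/n)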


Appendix F: Glossary of Terms

  • Characteristic Function (CF): The Fourier transform of the measure: $\varphi(t) = \int e^{itx} \, dF(x)$.
  • Cumulants ($\kappa_j$): The coefficients of the Taylor expansion of the log-CF; for a standardized variable, $\kappa_3$ is the skewness.
  • Dependency Graph: A combinatorial object representing the lack of independence in a random field.
  • Hermite Polynomials: Orthogonal polynomials used to “correct” the Gaussian density in Edgeworth series.
  • Kolmogorov Distance: The metric $D(F, G) = \sup_x |F(x) - G(x)|$; it dominates the Lévy metric that metrizes convergence in distribution.

References

1. Esseen, C. G. (1942). “On the Liapounoff limit of error in the theory of probability”. The original paper deriving the bound via the smoothing lemma.

2. Stein, C. (1972). “A bound for the error in the normal approximation to the distribution of a sum of dependent random variables”. Introduced Stein’s Method, a radically different approach that replaced Fourier analysis with functional equations.

3. Chen, L. H. Y., Goldstein, L., & Shao, Q. M. (2011). “Normal Approximation by Stein’s Method”. The modern textbook on Stein’s method.

4. Shevtsova, I. G. (2011). “On the absolute constant in the Berry-Esseen inequality”. The paper that proved the current world record for the constant ($0.4748$).

5. Bentkus, V. (2003). “On the dependence of the Berry-Esseen bound on dimension”. The definitive result on the $d^{1/4}$ rate.

6. Barron, A. R. (1986). “Entropy and the Central Limit Theorem”. The foundational paper for the entropic version of the CLT.