The Berry-Esseen Bound
1. The Non-Asymptotic Central Limit Theorem
The Central Limit Theorem (CLT) is a qualitative statement: for i.i.d. variables with finite variance, the normalized sum converges in distribution to $N(0,1)$. In practice, $n$ is never infinite. We require a quantitative measure of proximity. If we use a Gaussian approximation for $S_n = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^n (X_i - \mu)$, what is our maximum error in probability? The Berry-Esseen Theorem provides the definitive upper bound on the Kolmogorov distance:

$$\sup_{x \in \mathbb{R}} |F_n(x) - \Phi(x)| \le \frac{C\,\rho}{\sigma^3 \sqrt{n}}, \qquad \rho = \mathbb{E}|X_1 - \mu|^3,$$

where $F_n$ is the CDF of $S_n$, $\Phi$ is the standard normal CDF, and $C$ is a universal constant. This result is remarkable for its uniformity. It doesn't matter if the underlying distribution is skewed, multimodal, or discrete; the $O(n^{-1/2})$ rate is the "speed of light" for the CLT.
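The theorem is easy to stress-test numerically. The sketch below is an illustration, not part of the theorem: the Exponential(1) summands, sample counts, and the constant $C = 0.4748$ (from Appendix C) are all arbitrary inputs.

```python
import numpy as np
import scipy.stats as stats

def kolmogorov_distance(S):
    """sup_x |ECDF(x) - Phi(x)| over a sample S."""
    S = np.sort(S)
    m = len(S)
    hi = np.arange(1, m + 1) / m   # ECDF just after each jump
    lo = np.arange(0, m) / m       # ECDF just before each jump
    Phi = stats.norm.cdf(S)
    return max(np.abs(hi - Phi).max(), np.abs(lo - Phi).max())

rng = np.random.default_rng(0)
samples = 200_000
rho = np.mean(np.abs(rng.exponential(1.0, 10**6) - 1.0) ** 3)  # E|X - mu|^3 ~ 2.41

for n in [4, 16, 64, 256]:
    X = rng.exponential(1.0, size=(samples, n))
    S = (X.sum(axis=1) - n) / np.sqrt(n)       # mu = sigma = 1 for Exp(1)
    bound = 0.4748 * rho / np.sqrt(n)          # Shevtsova's constant (Appendix C)
    print(f"n={n:4d}  empirical K-dist = {kolmogorov_distance(S):.4f}   BE bound = {bound:.4f}")
```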
2. The Lyapunov Condition and Fractional Moments
Is the third moment strictly necessary? The CLT only requires the second moment $\mathbb{E}X^2 < \infty$. However, the rate of convergence depends on the tail behavior. Consider the Lyapunov Condition: Let $\delta \in (0, 1]$. If $\mathbb{E}|X_i|^{2+\delta} < \infty$, define:

$$L_n = \frac{\sum_{i=1}^n \mathbb{E}|X_i - \mu_i|^{2+\delta}}{\left( \sum_{i=1}^n \operatorname{Var}(X_i) \right)^{(2+\delta)/2}}.$$

If $L_n \to 0$, the CLT holds. The Berry-Esseen bound generalizes to these fractional moments:

$$\sup_x |F_n(x) - \Phi(x)| \le C_\delta\, L_n.$$

For i.i.d. variables, $L_n = O(n^{-\delta/2})$.
- If we only have $2 + \delta$ moments for small $\delta$ (say $\delta = 0.1$), the error decays as $n^{-0.05}$ (extremely slow).
- If we have third moments ($\delta = 1$), we recover the $n^{-1/2}$ rate.
- Increasing moments beyond the third does NOT improve the Kolmogorov rate beyond $n^{-1/2}$ for the worst-case distribution, though it allows for higher-order Edgeworth expansions for smooth distributions. (A numerical illustration of these fractional rates follows this list.)
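A small numeric illustration of the fractional-moment rates. For i.i.d. variables the ratio collapses to $L_n = \mathbb{E}|X|^{2+\delta} / (\mathbb{E}X^2)^{(2+\delta)/2} \cdot n^{-\delta/2}$; the Student-$t$ example and Monte Carlo moment estimates below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Student-t with 2.5 degrees of freedom: E|X|^{2+delta} is finite only for delta < 0.5.
X = rng.standard_t(df=2.5, size=10**6)
m2 = np.mean(X**2)

for delta in [0.1, 0.25, 0.4]:
    m2d = np.mean(np.abs(X) ** (2 + delta))
    ratio = m2d / m2 ** ((2 + delta) / 2)
    for n in [10**2, 10**4, 10**6]:
        L_n = ratio / n ** (delta / 2)          # i.i.d. Lyapunov ratio
        print(f"delta={delta:.2f}  n={n:7d}  L_n = {L_n:8.4f}")
```

Even with a million summands, the $\delta = 0.1$ ratio barely moves: fractional moments buy consistency, not speed.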
3. Esseen’s Smoothing Lemma: Fourier Interpolation
We cannot easily bound $\sup_x |F_n(x) - \Phi(x)|$ directly. We can, however, bound the difference of Characteristic Functions (CF): $|\varphi_n(t) - e^{-t^2/2}|$. The inverse Fourier transform relates $\varphi_n$ back to $F_n$. Problem: $F_n$ may not have a density (if $X$ is discrete). The inversion formula fails to converge. Solution: Smoothing. We convolve the distributions with a smooth kernel with limited bandwidth $T$.
The Smoothing Lemma (Esseen, 1945): Let $F, G$ be CDFs, where $G$ is differentiable with $\sup_x |G'(x)| \le m$. For any $T > 0$, the maximum deviation is bounded by the weighted integral of the CF difference:

$$\sup_x |F(x) - G(x)| \le \frac{1}{\pi} \int_{-T}^{T} \left| \frac{\varphi_F(t) - \varphi_G(t)}{t} \right| dt + \frac{24\, m}{\pi T}.$$
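To make the lemma concrete, here is a direct numerical evaluation of its right-hand side for the standardized sum of $n$ Rademacher ($\pm 1$) variables, whose CF is $\cos(t/\sqrt{n})^n$. The bandwidth choice $T = \sqrt{n}$ is an illustrative stand-in for the proof's $T = \sigma^3\sqrt{n}/\rho$.

```python
import numpy as np
from scipy.integrate import quad

def esseen_rhs(n, T):
    """RHS of the smoothing lemma: (1/pi) int |(phi_F - phi_G)/t| dt + 24m/(pi T)."""
    phi_n = lambda t: np.cos(t / np.sqrt(n)) ** n          # CF of the Rademacher sum
    integrand = lambda t: abs(phi_n(t) - np.exp(-t**2 / 2)) / t
    integral, _ = quad(integrand, 1e-9, T, limit=500)      # integrand is even in t
    m = 1 / np.sqrt(2 * np.pi)                             # sup of the Gaussian density
    return (2 / np.pi) * integral + 24 * m / (np.pi * T)

for n in [10, 100, 1000]:
    print(f"n={n:5d}  smoothing-lemma bound = {esseen_rhs(n, np.sqrt(n)):.4f}")
```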
4. Convergence of the Characteristic Function
To use the Lemma, we analyze the CF of the normalized sum $S_n = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^n (X_i - \mu)$. Let $\varphi(s) = \mathbb{E}\, e^{i s (X_1 - \mu)}$ be the CF of a single centered summand.
For the sum:

$$\varphi_n(t) = \varphi\!\left( \frac{t}{\sigma\sqrt{n}} \right)^{\! n}.$$

Restricting to $|t| \le T = \frac{\sigma^3 \sqrt{n}}{\rho}$ keeps the argument $s = t/(\sigma\sqrt{n})$ small enough that $|\varphi(s) - 1| < \tfrac{1}{2}$, allowing the use of the principal branch of the logarithm $\log \varphi(s)$.
Expanding the log-characteristic function:

$$\log \varphi(s) = -\frac{\sigma^2 s^2}{2} + O(\rho\, |s|^3).$$

Exponentiating back to the domain of characteristic functions:

$$\varphi_n(t) = e^{-t^2/2} \exp\!\left( O\!\left( \frac{\rho\, |t|^3}{\sigma^3 \sqrt{n}} \right) \right).$$

Using $|e^z - 1| \le |z|\, e^{|z|}$, we derive a bound for Esseen's integrand:

$$\left| \varphi_n(t) - e^{-t^2/2} \right| \le \frac{C\, \rho\, |t|^3}{\sigma^3 \sqrt{n}}\, e^{-t^2/4}, \qquad |t| \le T.$$

This allows Esseen's integral to converge at the $n^{-1/2}$ rate. We choose $T = \sigma^3 \sqrt{n} / \rho$ in Esseen's Lemma. The integral becomes:

$$\frac{1}{\pi} \int_{-T}^{T} \frac{\left| \varphi_n(t) - e^{-t^2/2} \right|}{|t|}\, dt \le \frac{C\, \rho}{\pi\, \sigma^3 \sqrt{n}} \int_{-\infty}^{\infty} t^2 e^{-t^2/4}\, dt = O(n^{-1/2}).$$

The "bias" term $\frac{24 m}{\pi T}$ is also $O(n^{-1/2})$ since $T \propto \sqrt{n}$. This proves the rate. The beauty of this proof is that it localizes the problem to the behavior of the CF near the origin.
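The localization is visible numerically. For centered Exp(1) summands the CF is available in closed form, $\varphi(s) = e^{-is}/(1 - is)$, so we can watch $\sup_{|t| \le 2} |\varphi_n(t) - e^{-t^2/2}|$ shrink at exactly the $n^{-1/2}$ rate (the window $|t| \le 2$ near the origin is an arbitrary illustrative choice):

```python
import numpy as np

t = np.linspace(-2, 2, 4001)
for n in [10, 40, 160, 640]:
    s = t / np.sqrt(n)                                  # sigma = 1 for centered Exp(1)
    phi_n = (np.exp(-1j * s) / (1 - 1j * s)) ** n       # CF of the normalized sum
    diff = np.abs(phi_n - np.exp(-t**2 / 2))
    print(f"n={n:4d}  sup|phi_n - gauss| = {diff.max():.2e}"
          f"   x sqrt(n) = {diff.max() * np.sqrt(n):.3f}")
```

The rescaled column stabilizes near a constant, confirming the $O(n^{-1/2})$ CF error that drives the whole proof.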
5. Stein’s Method: The Algebraic Perspective
Fourier methods rely on the product property of CFs, making them hard to adapt to dependent variables. Stein's Method (1972) operates directly on expectations using differential equations. The starting point is the Stein Operator: $\mathcal{A}f(w) = f'(w) - w f(w)$. For $Z \sim N(0,1)$, $\mathbb{E}[f'(Z) - Z f(Z)] = 0$ for all smooth $f$, and this identity characterizes the Gaussian.
The Stein Equation: To bound $P(W \le x) - \Phi(x)$, we choose the test function $h(w) = \mathbf{1}_{\{w \le x\}}$ and solve:

$$f_x'(w) - w f_x(w) = \mathbf{1}_{\{w \le x\}} - \Phi(x).$$

The Berry-Esseen bound is then $\sup_x \left| \mathbb{E}[f_x'(W) - W f_x(W)] \right|$. Crucially, for every $x$, the solution satisfies:

$$\|f_x\|_\infty \le \frac{\sqrt{2\pi}}{4}, \qquad \|f_x'\|_\infty \le 1.$$

These bounds are independent of the distribution of $W$! By expanding $\mathbb{E}[W f_x(W)]$ using the "$W$-neighborhood" construction (similar to the leave-one-out idea but more general), we derive the same $O(n^{-1/2})$ rate. Algebraically, we replace $W$ with a sum of "locals" $W^{(i)} = W - X_i/(\sigma\sqrt{n})$ and use Taylor's theorem.
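The characterizing identity is easy to check by Monte Carlo, and the residual $\mathbb{E}[f'(W) - W f(W)]$ for a near-Gaussian sum is exactly the kind of quantity Stein's method bounds. The test function $\tanh$ and the choice $n = 30$ below are arbitrary illustrative inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal(10**6)                                        # exact N(0,1)
W = (rng.exponential(1, (10**6, 30)).sum(axis=1) - 30) / np.sqrt(30)  # CLT sum, n = 30

f = np.tanh                                   # a smooth, bounded test function
fp = lambda w: 1.0 / np.cosh(w) ** 2          # its derivative

print("E[f'(Z) - Z f(Z)] =", np.mean(fp(Z) - Z * f(Z)))   # ~ 0: Stein identity
print("E[f'(W) - W f(W)] =", np.mean(fp(W) - W * f(W)))   # O(n^{-1/2}) residual
```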
6. Dependent Variables and Dependency Graphs
Stein's method excels at handling local dependence. Suppose the sequence is not independent, but each $X_i$ only depends on a few neighbors. Define a Dependency Graph $G = (V, E)$ with $V = \{1, \dots, n\}$: if $I, J \subset V$, then $\{X_i\}_{i \in I}$ and $\{X_j\}_{j \in J}$ are independent whenever no edge connects $I$ and $J$. Let $D$ be the maximum degree of the graph. General Berry-Esseen for Dependency Graphs (a representative bound, up to constants, for $W = \sum_i X_i$ with $\sigma_W^2 = \operatorname{Var}(W)$):

$$\sup_x \left| P\!\left( W/\sigma_W \le x \right) - \Phi(x) \right| \le C\, D^2\, \frac{\sum_{i=1}^n \mathbb{E}|X_i|^3}{\sigma_W^3}.$$

For an $m$-dependent sequence (e.g., the moving average $Y_i = (X_i + X_{i+1})/2$), $D \le 2m + 1$. The rate remains $O(n^{-1/2})$ as long as the dependence range stays bounded while $n$ grows. This allows the CLT to apply to physical systems with local interactions (like the 1D Ising Model or Markov Chains).
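A minimal simulation of a 1-dependent sequence (the moving average just mentioned; the Rademacher inputs and sample counts are illustrative), confirming that the Kolmogorov distance still tracks $n^{-1/2}$:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

def kdist_1dependent(n, samples=100_000):
    """Kolmogorov distance for the standardized sum of Y_i = (X_i + X_{i+1})/2."""
    X = rng.choice([-1.0, 1.0], size=(samples, n + 1))
    S = ((X[:, :-1] + X[:, 1:]) / 2).sum(axis=1)   # 1-dependent summands
    S = (S - S.mean()) / S.std()                   # standardize empirically
    S.sort()
    ecdf = np.arange(1, samples + 1) / samples
    return np.abs(ecdf - stats.norm.cdf(S)).max()

for n in [16, 64, 256]:
    print(f"n={n:4d}  K-dist = {kdist_1dependent(n):.4f}   n^(-1/2) = {n ** -0.5:.4f}")
```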
7. Higher Order: The Edgeworth Expansion
Berry-Esseen is an $O(n^{-1/2})$ result. For continuous variables, we can find the exact correction terms for skewness and kurtosis. Let $\kappa_j$ be the $j$-th cumulant of the standardized summand. Taylor expand the CGF $\log \varphi_n(t)$:

$$\log \varphi_n(t) = -\frac{t^2}{2} + \frac{\kappa_3 (it)^3}{6 \sqrt{n}} + \frac{\kappa_4 (it)^4}{24\, n} + \cdots$$

Exponentiating gives the Edgeworth Expansion in the Fourier domain:

$$\varphi_n(t) = e^{-t^2/2} \left( 1 + \frac{\kappa_3 (it)^3}{6 \sqrt{n}} + O(n^{-1}) \right).$$

Inverting the transform term-by-term requires Hermite Polynomials $He_k(x)$. Specifically:

$$F_n(x) = \Phi(x) - \frac{\kappa_3}{6 \sqrt{n}}\, (x^2 - 1)\, \phi(x) + O(n^{-1}),$$

where $x^2 - 1 = He_2(x)$.
Cramér's Condition: Does the Edgeworth expansion always work? No. It requires $\limsup_{|t| \to \infty} |\varphi(t)| < 1$.
- This holds for all distributions with a non-zero absolutely continuous component.
- It fails for discrete (lattice) variables. For Bernoulli variables, the error is $\Theta(n^{-1/2})$, but the Edgeworth series does not even converge to the discrete distribution.
8. Numerical Check: Lattice vs Continuous Convergence
The following simulation demonstrates the “Staircase” error of lattice distributions compared to the smooth correction of Edgeworth for continuous variables.
```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def compare_convergence(n=20):
    # 1. Discrete Case: Rademacher +/-1 flips (a centered Bernoulli lattice)
    # 2. Continuous Case: centered Exponential(1)
    x = np.linspace(-3, 3, 500)

    # Gaussian PDF and CDF
    phi_x = stats.norm.pdf(x)
    Phi_x = stats.norm.cdf(x)

    # Edgeworth Correction (skewness rho = 2 for Exp(1))
    rho_exp = 2.0
    edgeworth_exp = Phi_x - (rho_exp / (6 * np.sqrt(n))) * (x**2 - 1) * phi_x

    # Empirical results
    samples = 100_000
    ecdf = np.arange(1, samples + 1) / samples

    # Rademacher sum
    X_bern = np.random.choice([-1, 1], size=(samples, n))
    S_bern = np.sum(X_bern, axis=1) / np.sqrt(n)

    # Centered exponential sum
    X_exp = np.random.exponential(1, size=(samples, n)) - 1
    S_exp = np.sum(X_exp, axis=1) / np.sqrt(n)

    plt.figure(figsize=(12, 5))

    # Plotting errors: empirical CDF minus the Gaussian CDF
    plt.subplot(1, 2, 1)
    plt.step(np.sort(S_bern), ecdf - stats.norm.cdf(np.sort(S_bern)), label='Bernoulli Error')
    plt.title('Lattice Error (Staircase)')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(np.sort(S_exp), ecdf - stats.norm.cdf(np.sort(S_exp)), label='Empirical Exp Error', alpha=0.5)
    plt.plot(x, edgeworth_exp - Phi_x, 'r--', label='Edgeworth Prediction')
    plt.title('Continuous Error (Edgeworth Linearity)')
    plt.legend()
    plt.show()

compare_convergence()
```
9. The Local Central Limit Theorem (LCLT)
Berry-Esseen bounds the CDF error. What about the density $f_n(x)$ of the normalized sum? If $X$ has a continuous component, we expect $f_n(x) \to \phi(x)$. Theorem (Local Limit Theorem): Let the characteristic function of $X_1$ be $\varphi$. If $|\varphi|^{\nu}$ is integrable (i.e., $\int |\varphi(t)|^{\nu}\, dt < \infty$ for some $\nu \ge 1$), then the density $f_n$ of the normalized sum exists for large $n$ and satisfies:

$$\sup_x \left| f_n(x) - \phi(x) \right| \longrightarrow 0.$$

This requires the distribution of $X$ to have a sufficiently smooth component, ruling out the "lattice spikes" seen in discrete variables. This is the Local CLT. For lattice distributions, $P(S_n = x)$ is only defined on the grid. The Local CLT for lattice variables states that:

$$\sup_{x \in \mathcal{L}_n} \left| \frac{\sigma \sqrt{n}}{h}\, P(S_n = x) - \phi(x) \right| \longrightarrow 0,$$

where $\mathcal{L}_n$ is the set of lattice points of the normalized sum and $h$ is the lattice span. The rate is still $O(n^{-1/2})$, but the interpretation changes from "area error" to "height error".
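A quick check of the lattice Local CLT for the Rademacher sum (span $h = 2$, $\sigma = 1$): the raw sum takes values $k \in \{-n, -n+2, \dots, n\}$ with binomial probabilities, and the rescaled pmf heights should sit on the Gaussian density.

```python
import numpy as np
import scipy.stats as stats

n = 100
k = np.arange(-n, n + 1, 2)                    # attainable values of the raw sum
pmf = stats.binom.pmf((n + k) // 2, n, 0.5)    # P(S_n = k)
x = k / np.sqrt(n)                             # lattice points of the normalized sum

height = (np.sqrt(n) / 2) * pmf                # (sigma sqrt(n) / h) * P(S_n = k), h = 2
err = np.abs(height - stats.norm.pdf(x))
print(f"n={n}: max 'height error' = {err.max():.5f}")
```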
10. Applications in Real World Risk
10.1 Finance: The VaR Corruption
In Quantitative Finance, Value-at-Risk (VaR) is the $\alpha$-quantile of the loss distribution. Under the "Normal Assumption," $\mathrm{VaR}_\alpha \approx \mu + \sigma\, z_\alpha$. Berry-Esseen tells us precisely how much we can trust this quantile for a portfolio of $n$ assets. If asset returns are skewed (fat tails or leverage effects), the true VaR deviates. The Skewness Leak: Adjusting Value-at-Risk for skewness necessitates the Cornish-Fisher Expansion, obtained by inverting the Edgeworth series. Let $z_\alpha$ be the Gaussian quantile. The true quantile $w_\alpha$ of the distribution yields the asymptotic expansion:

$$w_\alpha = z_\alpha + \frac{\kappa_3}{6 \sqrt{n}} \left( z_\alpha^2 - 1 \right) + O(n^{-1}).$$

If $\kappa_3 < 0$ (negative skewness), the correction term is negative (since $z_\alpha^2 - 1 > 0$ for tail risks like $z_{0.01} \approx -2.33$). This shifts the quantile further into the tail, showing that the Gaussian assumption underestimates potential loss. For $\alpha = 0.01$ (1% risk), $z_\alpha \approx -2.33$. If skewness is strongly negative (typical for equities), say $\kappa_3 = -1$ per asset with $n = 20$, the correction pushes the true loss quantile out toward roughly 2.5 sigmas. Without Berry-Esseen, a risk manager might believe they are capped at 2.33 sigmas, when the skewness correction implies they are actually facing closer to 2.5 sigmas of risk. This "Model Risk" is the primary cause of capital inadequacy during market stress.
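A sketch of the adjustment, using the illustrative inputs from above ($\kappa_3 = -1$, $n = 20$):

```python
import numpy as np
import scipy.stats as stats

def cornish_fisher_quantile(alpha, kappa3, n):
    """First-order Cornish-Fisher quantile for the normalized sum of n i.i.d. terms."""
    z = stats.norm.ppf(alpha)
    return z + (kappa3 / (6 * np.sqrt(n))) * (z**2 - 1)

z = stats.norm.ppf(0.01)                                # ~ -2.326
w = cornish_fisher_quantile(0.01, kappa3=-1.0, n=20)    # ~ -2.49
print(f"Gaussian 1% quantile: {z:.3f}   skew-adjusted: {w:.3f}")
```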
10.2 A/B Testing and Power Analysis
In A/B testing, we calculate the required sample size $n$ to detect an effect $\delta$. The standard formula assumes Gaussian errors. The Berry-Esseen Penalty: If your underlying metric (e.g., "Time Spent") is highly skewed, your actual p-value might be higher than intended. If the third moment is large, the true alpha-risk of your one-sided test is at most $\alpha + C\rho / (\sigma^3 \sqrt{n})$. If the Berry-Esseen term equals your target $\alpha = 0.025$, your Type-I error rate has doubled. Practitioners must over-sample, by a factor that grows quadratically in the skewness ratio $\rho / \sigma^3$, to maintain the same level of rigorous confidence.
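A back-of-the-envelope sketch of the penalty (the skewness ratio $\rho/\sigma^3 = 5$ is an illustrative value for a heavy-tailed engagement metric, and $C = 0.4748$ is the constant from Appendix C):

```python
import numpy as np

def worst_case_alpha(alpha, skew_ratio, n, C=0.4748):
    """Nominal one-sided alpha plus the worst-case Berry-Esseen penalty."""
    return alpha + C * skew_ratio / np.sqrt(n)

for n in [100, 1_000, 10_000]:
    a = worst_case_alpha(0.025, skew_ratio=5.0, n=n)
    print(f"n={n:6d}  nominal alpha = 0.025   worst-case alpha = {a:.4f}")
# Even at n = 10,000 the penalty (~0.024) still roughly doubles the nominal risk.
```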
11. Non-Uniform Berry-Esseen Bounds
The standard BE theorem is Uniform: it bounds the error $\sup_x |F_n(x) - \Phi(x)|$. However, the error is often much smaller for large $|x|$ (the tails). The Non-Uniform Bound (Bikelis, 1966):

$$\left| F_n(x) - \Phi(x) \right| \le \frac{C\, \rho}{\sigma^3 \sqrt{n}\, (1 + |x|)^3}.$$

This is a critical refinement. It says that at 3 sigmas ($x = 3$), the error is $1/64$th of the error at the mean, since $(1 + 3)^3 = 64$. This allows us to use CLT logic for Tail Probabilities even when $n$ is modest, provided we account for the decay. Note: the polynomial decay $(1 + |x|)^{-3}$ is slower than the Gaussian tail decay $e^{-x^2/2}$, so in the far tails the bound can exceed the Gaussian probability itself; it is informative mainly in the moderate-deviation range.
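The crossover between the two bounds is easy to tabulate, using the document's constants ($C = 0.4748$ uniform, $25.5$ non-uniform, from Appendix C) with the illustrative inputs $\rho/\sigma^3 = 1$ and $n = 100$:

```python
import numpy as np

n, skew_ratio = 100, 1.0
uniform = 0.4748 * skew_ratio / np.sqrt(n)        # same bound at every x
for x in [0.0, 1.0, 2.0, 3.0, 5.0]:
    nonuniform = 25.5 * skew_ratio / (np.sqrt(n) * (1 + x) ** 3)
    print(f"x={x:.0f}   uniform = {uniform:.4f}   non-uniform = {nonuniform:.5f}")
# The non-uniform bound wins once (1+x)^3 > 25.5/0.4748 ~ 54, i.e. beyond x ~ 2.8.
```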
12. Conclusion
The Berry-Esseen bound is a warning against naive Gaussian modeling. It tells us that for small $n$, the "tails" of our distribution are almost certainly non-Gaussian. If your risk management system relies on the assumption that a 5-sigma event has probability $1 - \Phi(5) \approx 2.9 \times 10^{-7}$, Berry-Esseen reminds you that for $n = 100$ (with $\rho/\sigma^3 \approx 1$), the uniform error in that tail probability might be as large as $0.4748/\sqrt{100} \approx 0.05$, dwarfing the quantity being estimated. The "Distributional Limit" is a mathematical ideal; for the practitioner, the $O(n^{-1/2})$ error is the physical floor of our certainty. The Gaussian is a useful fiction, but Berry-Esseen is the hard reality of finite samples.
Historical Timeline
| Year | Event | Significance |
|---|---|---|
| 1835 | Adolphe Quetelet | Identifies the "Average Man", observing Gaussian distributions in nature. |
| 1942 | Carl-Gustav Esseen | Proves the original Berry-Esseen bound via the smoothing lemma. |
| 1966 | A. Bikelis | Derives non-uniform bounds for tail probabilities. |
| 1972 | Charles Stein | Introduces Stein’s Method, enabling bounds for dependent variables. |
| 1986 | Andrew Barron | Proves the Entropic CLT for KL-divergence. |
| 1991 | Friedrich Götze | Establishes multivariate Berry-Esseen bounds over convex sets. |
| 2011 | Irina Shevtsova | Tightens the universal constant to $C \le 0.4748$. |
| 2017 | Koltchinskii & Lounici | Proves Matrix Berry-Esseen bounds. |
Appendix A: Proof Sketch of Esseen’s Smoothing Lemma
The lemma is the "Magic" behind Berry-Esseen. It translates tail errors in Frequency space to uniform errors in Real space. Let $\Delta(x) = F(x) - G(x)$. We aim to bound $\|\Delta\|_\infty$. Let $K_T$ be a smoothing kernel with CF $\hat{K}_T$ supported on $[-T, T]$, and let $\Delta_T = \Delta * K_T$ be the smoothed difference. By the Fourier Inversion Formula for functions with compactly supported transforms:

$$\Delta_T(x) = \frac{1}{2\pi} \int_{-T}^{T} \frac{\varphi_F(t) - \varphi_G(t)}{-it}\, \hat{K}_T(t)\, e^{-itx}\, dt.$$

The first term of the Lemma follows from bounding this integral. The second term ($\tfrac{24 m}{\pi T}$) accounts for the "Smoothing Error": convolution can hide a sharp spike of $\Delta$. The Kernel Trick: Since $G$ is $m$-Lipschitz, a shift argument shows:

$$\|\Delta\|_\infty \le 2\, \|\Delta_T\|_\infty + C' \frac{m}{T}.$$

Choosing the Fejér kernel or de la Vallée Poussin kernel allows for the specific constant $24/\pi$.
Appendix B: Pathologies and Discrete Corrections
B.1 Infinite Moments
If $\mathbb{E}X^2 = \infty$ (e.g., Pareto with tail index $\alpha < 2$), the CLT fails. The sum, normalized by $n^{1/\alpha}$ rather than $\sqrt{n}$, converges to a Stable Distribution (Lévy Flight): $S_n / n^{1/\alpha} \Rightarrow \mathrm{Stable}(\alpha)$. Berry-Esseen is meaningless.
B.2 High Dimensions (The Convex Set Requirement)
In $\mathbb{R}^d$, the uniform error over convex sets scales as $O(d^{1/4}/\sqrt{n})$. However, if we test on "any Borel set", the error is 1 for lattice summands: the law of the discrete sum and the Gaussian are mutually singular. Specifically, the Gaussian measure of the countable support of a lattice sum is zero, while the discrete law assigns it full mass.
B.3 Discrete Corrections: Yates & Continuity
Since lattice distributions produce the worst-case Berry-Esseen error, how do we fix them? The Continuity Correction: When approximating $P(S_n \le k)$ for integer $k$, we use $\Phi\!\left( \frac{k + 1/2 - n\mu}{\sigma \sqrt{n}} \right)$. This "half-integer" shift compensates for the staircase error. In the Edgeworth context, this corresponds to the Euler-Maclaurin formula correction. For a Bernoulli sum, the error drops from $O(n^{-1/2})$ to $O(n^{-1})$ if the continuity correction is used. (A numerical comparison follows.)
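A direct check on a Binomial($n$, 1/2) sum ($\mu = 1/2$, $\sigma^2 = 1/4$ per summand; the choice $n = 50$ is an arbitrary illustration):

```python
import numpy as np
import scipy.stats as stats

n = 50
k = np.arange(0, n + 1)
exact = stats.binom.cdf(k, n, 0.5)
plain = stats.norm.cdf((k - n / 2) / np.sqrt(n / 4))            # raw Gaussian CDF
corrected = stats.norm.cdf((k + 0.5 - n / 2) / np.sqrt(n / 4))  # half-integer shift
print(f"max error, no correction:   {np.abs(exact - plain).max():.5f}")      # ~ O(n^-1/2)
print(f"max error, with +0.5 shift: {np.abs(exact - corrected).max():.5f}")  # ~ O(n^-1)
```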
Appendix C: The Quest for the Universal Constant
The Berry-Esseen bound is only useful if we know $C$. The history of probability is a race to shrink this constant.
- Esseen (1942): Original proof gave $C \le 7.59$.
- van Beek (1972): Improved to $C \le 0.7975$.
- Shiganov (1986): Reduced to $C \le 0.7655$.
- Shevtsova (2011): Current tightest bound $C \le 0.4748$.
| Assumption | Best Known Constant | Source |
|---|---|---|
| General (Independent) | 0.4748 | Shevtsova (2011) |
| Symmetric Variables | 0.4097 | Shevtsova (2010) |
| i.i.d. Symmetric | 0.4097 | Tyurin (2010) |
| Lower Bound (Bernoulli) | 0.3989 | Esseen (1956) |
| Non-Uniform General | 25.5 | Paditz (1989) |
Why does $C$ matter? In safety-critical systems (e.g., nuclear engineering), we use Gaussian approximations to guarantee that the failure probability stays below a tolerance (say $10^{-6}$). If $C$ is too large, our required "buffer" becomes economically non-viable.
Appendix D: High-Dimensional and Matrix Extensions
D.1 Multivariate Generalizations
In $\mathbb{R}^d$, there is no single CDF to compare; the natural analogue of the Kolmogorov distance is the uniform error over a class of sets, since we care about the convergence of probabilities of sets. The Götze Bound (1991): For i.i.d. vectors with identity covariance:

$$\sup_{A \in \mathcal{C}} \left| P(S_n \in A) - \gamma_d(A) \right| \le \frac{C_d\, \mathbb{E}\|X_1\|^3}{\sqrt{n}},$$

where $\mathcal{C}$ is the class of all convex sets and $\gamma_d$ is the standard Gaussian measure. The dimensional factor $C_d$, sharpened by Bentkus (2003) to $O(d^{1/4})$, is the "curse of dimensionality" for the CLT.
D.2 Matrix Berry-Esseen Bounds
Can we extend the $n^{-1/2}$ rate to sums of independent random matrices? The Koltchinskii-Lounici Bound (2017): For the sample covariance $\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n X_i X_i^\top$ built from independent (Gaussian) vectors with covariance $\Sigma$, the rate of convergence in operator norm is governed by the Effective Rank (or Stable Rank), defined as:

$$r(\Sigma) = \frac{\operatorname{tr}(\Sigma)}{\|\Sigma\|},$$

where $\|\Sigma\|$ is the spectral norm (largest eigenvalue). The Koltchinskii-Lounici bound states (up to constants):

$$\mathbb{E}\, \|\hat{\Sigma} - \Sigma\| \asymp \|\Sigma\| \left( \sqrt{\frac{r(\Sigma)}{n}} + \frac{r(\Sigma)}{n} \right),$$

showing that convergence depends on $r(\Sigma)$ rather than the ambient dimension $d$. This is the matrix analog of Berry-Esseen. Unlike the scalar case where $n$ is the key parameter, here the stable rank, not the ambient dimension, dictates the speed of convergence.
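A sketch with a spiked covariance (the dimensions, eigenvalue spikes, and Gaussian design below are illustrative choices): the effective rank stays small even though the ambient dimension is large, and the operator-norm error tracks the Koltchinskii-Lounici scale.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 1000
# Spiked covariance: a few large directions, many small ones -> small effective rank.
eigs = np.concatenate([np.full(5, 10.0), np.full(d - 5, 0.1)])
Sigma = np.diag(eigs)

r_eff = np.trace(Sigma) / np.linalg.norm(Sigma, 2)       # tr(Sigma) / ||Sigma||
X = rng.standard_normal((n, d)) * np.sqrt(eigs)          # rows have covariance Sigma
Sigma_hat = X.T @ X / n

err = np.linalg.norm(Sigma_hat - Sigma, 2) / np.linalg.norm(Sigma, 2)
kl_scale = np.sqrt(r_eff / n) + r_eff / n                # Koltchinskii-Lounici rate
print(f"effective rank = {r_eff:.1f} (ambient d = {d})")
print(f"relative operator-norm error = {err:.3f},  K-L scale = {kl_scale:.3f}")
```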
Appendix E: The Entropic CLT
While Berry-Esseen deals with the Kolmogorov distance $\sup_x |F_n(x) - \Phi(x)|$, modern information theory looks at the Relative Entropy (Kullback-Leibler divergence). Let

$$D_n = D(F_n \,\|\, \Phi) = \int f_n(x) \log \frac{f_n(x)}{\phi(x)}\, dx.$$

Theorem (Barron, 1986): If $D_{n_0} < \infty$ for some $n_0$, then:

$$D_n \longrightarrow 0.$$

The Rate: Under Stein-type conditions (Log-Sobolev inequalities), the rate of entropy convergence is $O(1/n)$. Note: This is faster than the $O(n^{-1/2})$ Kolmogorov rate, and consistent with it via Pinsker's inequality. The Entropic CLT provides the final justification for Maximum Entropy principles: the Gaussian is not just a limit, it is the state of highest ignorance. (A crude numerical estimate follows.)
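A crude histogram estimate of $D_n$ for centered Exp(1) sums. The binning choices are arbitrary and histogram KL estimators are biased, so only the decreasing trend of $D_n$ (roughly stabilizing $n \cdot D_n$) should be read from the output.

```python
import numpy as np
import scipy.stats as stats

def kl_to_gaussian(S, bins=200):
    """Histogram estimate of D(law(S) || N(0,1))."""
    hist, edges = np.histogram(S, bins=bins, range=(-5, 5), density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    width = edges[1] - edges[0]
    mask = hist > 0
    return np.sum(hist[mask] * np.log(hist[mask] / stats.norm.pdf(centers[mask])) * width)

rng = np.random.default_rng(0)
for n in [4, 16, 64]:
    S = (rng.exponential(1, (400_000, n)).sum(axis=1) - n) / np.sqrt(n)
    d = kl_to_gaussian(S)
    print(f"n={n:3d}  KL ~ {d:.4f}   n * KL ~ {n * d:.3f}")
```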
Appendix F: Glossary of Terms
- Characteristic Function (CF): The Fourier transform of the measure: $\varphi(t) = \mathbb{E}[e^{itX}]$.
- Cumulants ($\kappa_j$): The coefficients of the Taylor expansion of the log-CF. For a standardized variable, $\kappa_3$ is the skewness.
- Dependency Graph: A combinatorial object representing the lack of independence in a random field.
- Hermite Polynomials: Orthogonal polynomials used to “correct” the Gaussian density in Edgeworth series.
- Kolmogorov Distance: The metric $d_K(F, G) = \sup_x |F(x) - G(x)|$. A uniform metric for convergence in distribution, stronger than pointwise CDF convergence.
References
1. Esseen, C. G. (1942). “On the Liapounoff limit of error in the theory of probability”. The original paper deriving the bound via the smoothing lemma.
2. Stein, C. (1972). "A bound for the error in the normal approximation to the distribution of a sum of dependent random variables". Introduced Stein's Method, a radically different approach that replaced Fourier analysis with functional equations.
3. Chen, L. H. Y., Goldstein, L., & Shao, Q. M. (2011). “Normal Approximation by Stein’s Method”. The modern textbook on Stein’s method.
4. Shevtsova, I. G. (2011). "On the absolute constant in the Berry-Esseen inequality". The paper that proved the current world record for the constant ($C \le 0.4748$).
5. Bentkus, V. (2003). "On the dependence of the Berry-Esseen bound on dimension". The definitive result on the $O(d^{1/4}/\sqrt{n})$ rate.
6. Barron, A. R. (1986). “Entropy and the Central Limit Theorem”. The foundational paper for the entropic version of the CLT.