The Berry-Esseen Bound
1. Quantitative Convergence in the Central Limit Theorem
The CLT says that normalized sums of i.i.d. random variables with finite variance converge in distribution to $\mathcal{N}(0,1)$: with $\mu = EX_1$ and $\sigma^2 = \mathrm{Var}(X_1)$,

$$S_n = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} (X_i - \mu) \xrightarrow{d} \mathcal{N}(0,1) \quad \text{as } n \to \infty.$$

But the convergence is asymptotic, in practice $n$ sits at some finite value, and the obvious question is how well the Gaussian approximation actually holds up at a given $n$.
The Berry-Esseen theorem gives you a uniform bound on the error:

$$\sup_{x \in \mathbb{R}} \left| P(S_n \le x) - \Phi(x) \right| \le \frac{C\,\rho}{\sigma^3 \sqrt{n}},$$

with $\rho = E|X_1 - \mu|^3$ and $C$ a universal constant.
This bound holds uniformly across all $x$, and it does not care whether the underlying distribution is skewed or multimodal or discrete. The $O(n^{-1/2})$ rate only needs a finite third absolute moment to show up.
2. The Role of Moment Assumptions
The CLT itself only asks for $EX_1^2 < \infty$; the rate of convergence leans on the behavior of higher moments.
Take $E|X_i|^{2+\delta} < \infty$ for some $\delta \in (0, 1]$; then you can write down the Lyapunov fraction

$$L_n = \frac{\sum_{i=1}^{n} E|X_i - \mu_i|^{2+\delta}}{\left( \sum_{i=1}^{n} \sigma_i^2 \right)^{(2+\delta)/2}}.$$

If $L_n \to 0$ the CLT holds, and the Berry-Esseen bound stretches out to match: $\sup_x |P(S_n \le x) - \Phi(x)| \le C_\delta L_n$.
For i.i.d. variables $L_n = E|X_1 - \mu|^{2+\delta} / (\sigma^{2+\delta}\, n^{\delta/2})$, and the behavior drops straight out of that:
- With only $2+\delta$ moments for $\delta < 1$, the error drops as $O(n^{-\delta/2})$, which is painfully slow.
- Three moments ($\delta = 1$) bring back the $O(n^{-1/2})$ rate.
- Moments past the third do not help the worst-case Kolmogorov distance; the rate sits at $O(n^{-1/2})$ unless you restrict to smooth distributions and pull in Edgeworth expansions. A quick numerical check of the rate is sketched below.
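To see the $O(n^{-1/2})$ rate concretely, the exact Kolmogorov distance between a standardized Binomial sum and the normal CDF can be computed from the lattice structure. A minimal sketch, with Bernoulli(0.3) as an arbitrary skewed example, checking that $\sqrt{n}$ times the distance stabilizes:

```python
import numpy as np
from scipy import stats

def kolmogorov_distance_bernoulli(n, p=0.3):
    """Exact sup |F_n(x) - Phi(x)| for a standardized Bernoulli(p) sum.

    The sup over a step CDF is attained at the lattice points, approached
    from the right (the jump value) or from the left (the previous value).
    """
    k = np.arange(n + 1)
    x = (k - n * p) / np.sqrt(n * p * (1 - p))       # standardized lattice points
    cdf_right = stats.binom.cdf(k, n, p)              # CDF value at each jump
    cdf_left = np.concatenate(([0.0], cdf_right[:-1]))  # limit from below
    phi = stats.norm.cdf(x)
    return max(np.abs(cdf_right - phi).max(), np.abs(cdf_left - phi).max())

for n in [10, 100, 1000, 10_000]:
    d = kolmogorov_distance_bernoulli(n)
    print(f"n={n:6d}  distance={d:.5f}  sqrt(n)*distance={d*np.sqrt(n):.4f}")
```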
3. Esseen’s Smoothing Lemma
Bounding $\sup_x |P(S_n \le x) - \Phi(x)|$ head-on is hard; characteristic functions are easier to work with because the CF of a sum of independent variables factors as a product.
The wrinkle is that if $X_1$ is discrete then $F_n$ has no density and naive Fourier inversion blows up. The fix is to smooth: convolve both distributions with a kernel of bandwidth $1/T$ and then bound the approximation error that the smoothing drags in.
Smoothing Lemma (Esseen, 1945): Take $F$ and $G$ as CDFs with characteristic functions $f$ and $g$, and let $G$ be differentiable with $\sup_x |G'(x)| \le m$. Then for any $T > 0$ you get the following bound:

$$\sup_{x} |F(x) - G(x)| \le \frac{1}{\pi} \int_{-T}^{T} \left| \frac{f(t) - g(t)}{t} \right| dt + \frac{24\,m}{\pi T}.$$
The first term is the Fourier-domain approximation error and the second is the bias that smoothing drags in; the proof of Berry-Esseen comes down to holding each of these in check.
4. Controlling the Characteristic Function
Now you actually put the smoothing lemma to work. Standardize: set $Y_i = (X_i - \mu)/\sigma$, so that $EY_i = 0$, $EY_i^2 = 1$, and $\rho_3 = E|Y_i|^3 = \rho/\sigma^3$, and write $\varphi(t) = E e^{itY_1}$ for the CF of a single term.
For the sum $S_n = n^{-1/2} \sum_{i=1}^{n} Y_i$ you raise this whole thing to the $n$-th power: $\varphi_{S_n}(t) = \varphi(t/\sqrt{n})^n$.
Write $\varphi(t/\sqrt{n}) = 1 + w$ with $w = -\frac{t^2}{2n} + O\!\left(\frac{\rho_3 |t|^3}{n^{3/2}}\right)$, and restrict to $|t| \le \sqrt{n}/(2\rho_3)$ so that $|w| \le 1/2$; in this range you can safely take logarithms using $\log(1+w) = w + O(|w|^2)$.
The log-characteristic function drops into a simple expression:

$$n \log \varphi(t/\sqrt{n}) = -\frac{t^2}{2} + O\!\left( \frac{\rho_3 |t|^3}{\sqrt{n}} \right).$$

Exponentiating gives you the characteristic function as a Gaussian times a small correction: $\varphi_{S_n}(t) = e^{-t^2/2} \exp\!\big( O(\rho_3 |t|^3/\sqrt{n}) \big)$.
Using the inequality $|e^z - 1| \le |z|\, e^{|z|}$ you pick out a clean Fourier-domain bound on the restricted range:

$$\left| \varphi_{S_n}(t) - e^{-t^2/2} \right| \le \frac{C_1\, \rho_3 |t|^3}{\sqrt{n}}\, e^{-t^2/4}.$$

Plugging into Esseen's integral with $T = \sqrt{n}/(2\rho_3)$ gives you the size of the first term:

$$\frac{1}{\pi} \int_{-T}^{T} \left| \frac{\varphi_{S_n}(t) - e^{-t^2/2}}{t} \right| dt \le \frac{C_1\, \rho_3}{\pi \sqrt{n}} \int_{-\infty}^{\infty} t^2 e^{-t^2/4}\, dt = O\!\left( \frac{\rho_3}{\sqrt{n}} \right).$$

The bias term drops out at $O(\rho_3/\sqrt{n})$ too, since $m = \sup_x \Phi'(x) = 1/\sqrt{2\pi}$ and $\frac{24m}{\pi T} = \frac{48\,\rho_3}{\pi \sqrt{2\pi}\, \sqrt{n}}$.
The characteristic function near the origin sets the convergence rate, and the smoothing lemma turns that Fourier-domain control into a uniform bound on CDFs. A numerical illustration follows.
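The CF-level estimate can be checked directly whenever the CF is available in closed form. A minimal sketch, assuming centered Exp(1) summands (chosen because their CF is $e^{-it}/(1-it)$), showing that the maximal CF error decays like $n^{-1/2}$:

```python
import numpy as np

def cf_standardized_exp_sum(t, n):
    """CF of S_n = (X_1 + ... + X_n - n)/sqrt(n) for X_i ~ Exp(1)."""
    s = t / np.sqrt(n)
    return (np.exp(-1j * s) / (1 - 1j * s)) ** n

t = np.linspace(0.1, 5.0, 200)
for n in [10, 100, 1000]:
    err = np.abs(cf_standardized_exp_sum(t, n) - np.exp(-t**2 / 2))
    print(f"n={n:5d}  max CF error={err.max():.5f}  "
          f"sqrt(n)*max={err.max()*np.sqrt(n):.4f}")
```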
5. Stein’s Method
Fourier methods lean on independence: the CF of a sum factors as a product. For dependent variables this breaks, and you need a different angle.
Stein’s method (1972) swaps Fourier analysis for a functional equation and defines the Stein operator as

$$(\mathcal{A}f)(x) = f'(x) - x f(x).$$

If $Z \sim \mathcal{N}(0,1)$ then $E[f'(Z) - Z f(Z)] = 0$ for every smooth $f$ with sensible growth conditions, and this drops out of integration by parts against the Gaussian density.
To bound $P(S_n \le x) - \Phi(x)$ you set $h(w) = \mathbf{1}\{w \le x\}$ and solve the Stein equation

$$f_x'(w) - w f_x(w) = \mathbf{1}\{w \le x\} - \Phi(x).$$

The Berry-Esseen bound then boils down to holding $E[f_x'(S_n) - S_n f_x(S_n)]$ in check.
The solution $f_x$ sits inside clean sup norm bounds: $\|f_x\|_\infty \le \sqrt{2\pi}/4$ and $\|f_x'\|_\infty \le 1$.
These bounds are universal and they do not care about the distribution of the $X_i$.
To work out $E[f_x'(S_n) - S_n f_x(S_n)]$ you use a leave-one-out decomposition: write $S^{(i)} = S_n - Y_i/\sqrt{n}$ for the sum with the $i$-th term pulled out, so $S^{(i)}$ is independent of $Y_i$. Taylor expanding $f_x$ around $S^{(i)}$ and leaning on the moment assumptions, each term in the sum kicks in an error of order $\rho_3/n^{3/2}$, and summing over $i$ brings back the $O(\rho_3/\sqrt{n})$ rate.
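The identity $E[f'(W) - Wf(W)] = 0$ characterizes the standard normal, so the size of this expectation is itself a measure of non-normality. A Monte Carlo sketch, with $f = \tanh$ as an arbitrary smooth bounded test function and centered Exp(1) summands as an assumed example:

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.tanh                                    # smooth, bounded test function
f_prime = lambda w: 1.0 / np.cosh(w) ** 2

def stein_discrepancy(w):
    """Monte Carlo estimate of E[f'(W) - W f(W)]; zero when W ~ N(0,1)."""
    return np.mean(f_prime(w) - w * f(w))

m = 100_000                                    # MC noise is roughly 1/sqrt(m)
print("N(0,1) check:", stein_discrepancy(rng.standard_normal(m)))

for n in [4, 16, 64, 256]:
    x = rng.exponential(1.0, size=(m, n)) - 1.0   # centered Exp(1), skewed
    s = x.sum(axis=1) / np.sqrt(n)
    d = stein_discrepancy(s)
    print(f"n={n:4d}  discrepancy={d:+.5f}  "
          f"sqrt(n)*discrepancy={d*np.sqrt(n):+.4f}")
```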
6. Berry-Esseen under Dependency Graphs
Stein’s method stretches cleanly into settings with local dependence.
Suppose $(X_1, \dots, X_n)$ is not independent, but each $X_i$ only depends on a bounded neighborhood. You define a dependency graph $G = (V, E)$ with $V = \{1, \dots, n\}$, where the absence of edges between two sets of vertices forces the corresponding collections of variables to be mutually independent.
Let $D$ be the max degree of $G$.
Berry-Esseen for dependency graphs (one representative form; constants and the exact power of $D$ vary across the literature): with $W = \sigma_W^{-1} \sum_{i=1}^{n} (X_i - EX_i)$ and $\sigma_W^2 = \mathrm{Var}\big( \sum_i X_i \big)$,

$$\sup_x \left| P(W \le x) - \Phi(x) \right| \le C\, D^2\, \frac{\sum_{i=1}^{n} E|X_i - EX_i|^3}{\sigma_W^3}.$$

For an $m$-dependent sequence, like a moving average of i.i.d. innovations, you get $D \le 2m + 1$, and the rate sits at $O(n^{-1/2})$ as long as $m$ is bounded, or more generally as long as $D^2 = o(\sqrt{n})$; this covers important examples like 1D Ising models and functionals of Markov chains. A small simulation of the $1$-dependent case follows.
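A sketch of the $1$-dependent case, assuming the moving-average construction $X_i = \varepsilon_i + \varepsilon_{i+1}$ with centered Exp(1) innovations (so $D \le 3$ and the summands are skewed). The empirical Kolmogorov distance should shrink roughly like $n^{-1/2}$, though Monte Carlo noise of order $1/\sqrt{\text{trials}}$ limits resolution at large $n$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def kolmogorov_error_1dep(n, trials=50_000):
    """K-S distance to N(0,1) for sums of the 1-dependent sequence
    X_i = eps_i + eps_{i+1}, with eps_j ~ Exp(1) - 1 (skewed innovations)."""
    eps = rng.exponential(1.0, size=(trials, n + 1)) - 1.0
    x = eps[:, :n] + eps[:, 1:]          # 1-dependent moving average
    w = x.sum(axis=1)
    w = (w - w.mean()) / w.std()         # standardize empirically
    return stats.kstest(w, 'norm').statistic

for n in [25, 100, 400]:
    d = kolmogorov_error_1dep(n)
    print(f"n={n:4d}  K-distance={d:.4f}  sqrt(n)*distance={d*np.sqrt(n):.3f}")
```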
7. Edgeworth Expansions
Berry-Esseen gives you $O(n^{-1/2})$; for continuous distributions you can push further by working out explicit correction terms.
Let $\kappa_j$ be the $j$-th cumulant of the standardized summand and Taylor expand the cumulant generating function:

$$\log \varphi_{S_n}(t) = -\frac{t^2}{2} + \frac{\kappa_3 (it)^3}{6 \sqrt{n}} + \frac{\kappa_4 (it)^4}{24\, n} + \cdots$$

Then you exponentiate the whole thing:

$$\varphi_{S_n}(t) = e^{-t^2/2} \left( 1 + \frac{\kappa_3 (it)^3}{6\sqrt{n}} + O(n^{-1}) \right).$$

Invert term-by-term using Hermite polynomials and the CDF drops out:

$$F_n(x) = \Phi(x) - \frac{\kappa_3}{6\sqrt{n}} \left( x^2 - 1 \right) \varphi(x) + O(n^{-1}),$$

where $\varphi$ is the standard normal density.
This expansion needs Cramér's condition, which says $\limsup_{|t| \to \infty} |\varphi_{X_1}(t)| < 1$.
Any distribution with a non-zero absolutely continuous piece satisfies Cramér's condition; lattice distributions do not. For Bernoulli variables the Kolmogorov error sits at order $n^{-1/2}$ and the Edgeworth series does not actually converge to the discrete CDF.
8. Numerical Comparison: Lattice vs. Continuous Distributions
The simulation below shows the qualitative difference between the lattice and continuous cases: the staircase error pattern that rides along with discrete distributions, set against the smooth Edgeworth correction for continuous ones.
```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def compare_convergence(n=20, samples=100_000):
    x = np.linspace(-3, 3, 500)

    # Gaussian reference
    phi_x = stats.norm.pdf(x)
    Phi_x = stats.norm.cdf(x)

    # First Edgeworth correction for Exp(1): skewness rho = 2
    rho_exp = 2.0
    edgeworth_exp = Phi_x - (rho_exp / (6 * np.sqrt(n))) * (x**2 - 1) * phi_x

    # 1. Lattice case: symmetric +/-1 variables (mean 0, variance 1)
    X_bern = np.random.choice([-1, 1], size=(samples, n))
    S_bern = np.sum(X_bern, axis=1) / np.sqrt(n)

    # 2. Continuous case: centered Exp(1) (mean 0, variance 1, skewness 2)
    X_exp = np.random.exponential(1, size=(samples, n)) - 1
    S_exp = np.sum(X_exp, axis=1) / np.sqrt(n)

    # Empirical CDF heights at the sorted sample points
    ecdf = np.arange(1, samples + 1) / samples

    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    plt.step(np.sort(S_bern), ecdf - stats.norm.cdf(np.sort(S_bern)),
             label='Bernoulli Error')
    plt.title('Lattice Error (Staircase)')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(np.sort(S_exp), ecdf - stats.norm.cdf(np.sort(S_exp)),
             label='Empirical Exp Error', alpha=0.5)
    plt.plot(x, edgeworth_exp - Phi_x, 'r--', label='Edgeworth Prediction')
    plt.title('Continuous Error (Edgeworth)')
    plt.legend()

    plt.show()

compare_convergence()
```
9. The Local Central Limit Theorem
Berry-Esseen bounds the CDF error; you can also ask about convergence of the density $p_n$ of $S_n$ to the Gaussian density $\varphi(x) = e^{-x^2/2}/\sqrt{2\pi}$.
Theorem (Local Limit): If the CF is integrable, with $\int_{-\infty}^{\infty} |\varphi_{X_1}(t)|^{\nu}\, dt < \infty$ for some $\nu \ge 1$, then for $n \ge \nu$ the density $p_n$ of $S_n$ shows up and the sup error drops as

$$\sup_x |p_n(x) - \varphi(x)| = O(n^{-1/2}).$$

The integrability condition forces the distribution to have a density with controlled oscillation, and it wipes out lattice distributions, whose CFs return periodically to modulus 1 and throw up the characteristic spikes in $|\varphi_{X_1}|$.
For lattice distributions supported on $a + h\mathbb{Z}$, the mass of $S_n$ only sits on the lattice points $x_{n,k} = (na + kh - n\mu)/(\sigma\sqrt{n})$, and the local CLT takes a different form:

$$\sup_k \left| \frac{\sigma \sqrt{n}}{h}\, P(S_n = x_{n,k}) - \varphi(x_{n,k}) \right| = O(n^{-1/2}),$$

with $h$ the lattice span. The rate is the same, but the interpretation shifts: this holds pointwise probability mass in check instead of CDF error.
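The lattice form can be checked exactly for Bernoulli sums, where the pmf is binomial and the span is $h = 1$. A minimal sketch with $p = 0.3$ (an arbitrary skewed choice, so the error really is of order $n^{-1/2}$):

```python
import numpy as np
from scipy import stats

def local_clt_error(n, p=0.3):
    """Max |(sigma*sqrt(n)/h) P(S_n = x_k) - phi(x_k)| for Bernoulli(p) sums.

    The sum is Binomial(n, p) on a lattice of span h = 1, and sigma*sqrt(n)
    for the standardized sum is sqrt(n p (1-p)).
    """
    k = np.arange(n + 1)
    scale = np.sqrt(n * p * (1 - p))
    x = (k - n * p) / scale                       # standardized lattice points
    scaled_pmf = scale * stats.binom.pmf(k, n, p)
    return np.abs(scaled_pmf - stats.norm.pdf(x)).max()

for n in [10, 100, 1000, 10_000]:
    e = local_clt_error(n)
    print(f"n={n:6d}  max error={e:.6f}  sqrt(n)*error={e*np.sqrt(n):.4f}")
```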
10. Applications
Value-at-Risk and Gaussian Tail Approximations
In quantitative finance, Value-at-Risk at level $\alpha$ is the $\alpha$-quantile of the loss distribution; under a Gaussian assumption you get $\mathrm{VaR}_\alpha = \mu + \sigma\, z_\alpha$ with $z_\alpha = \Phi^{-1}(\alpha)$, and Berry-Esseen pins down the error in this approximation for a portfolio of $n$ assets with skewed returns.
The correction uses the Cornish-Fisher expansion, which you pick out by inverting the Edgeworth series: let $\gamma$ be the skewness, and the corrected quantile drops out like this:

$$z_\alpha^{CF} = z_\alpha + \frac{\gamma}{6\sqrt{n}} \left( z_\alpha^2 - 1 \right).$$

If $\gamma < 0$, which is negative skew and typical for equities, then the correction term sits negative for tail risks, since $z_\alpha^2 - 1 > 0$ when $|z_\alpha| > 1$, and the true loss quantile runs past the Gaussian prediction.
For $\alpha = 0.01$ you have $z_\alpha \approx -2.33$; with skewness $\gamma = -1$ and $n = 20$, to pick concrete numbers, the actual risk sits around 2.5 standard deviations while the Gaussian model reports 2.33, and this gap is exactly where capital inadequacy gets born.
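A minimal sketch of the one-term Cornish-Fisher correction; the portfolio size $n = 20$ and skewness $\gamma = -1$ are the illustrative numbers from above, not calibrated values:

```python
import numpy as np
from scipy import stats

def cornish_fisher_quantile(alpha, n, gamma):
    """One-term Cornish-Fisher quantile for a normalized sum of n i.i.d.
    variables with skewness gamma (first Edgeworth correction only)."""
    z = stats.norm.ppf(alpha)
    return z + (gamma / (6 * np.sqrt(n))) * (z**2 - 1)

z_gauss = stats.norm.ppf(0.01)
z_cf = cornish_fisher_quantile(0.01, n=20, gamma=-1.0)
print(f"Gaussian quantile:  {z_gauss:.3f}")   # about -2.33
print(f"Corrected quantile: {z_cf:.3f}")      # about -2.49
```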
A/B Testing
The standard sample size formula assumes Gaussian errors, and if the metric is highly skewed, like time-on-site, the actual p-value drifts off the nominal one by up to the Berry-Esseen budget $C\rho/(\sigma^3\sqrt{n})$. When the skewness is large and the target significance level is small, the effective Type-I error rate can double, and compensating forces you to oversample accordingly.
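A simulation sketch of the effect, assuming an Exp(1)-distributed metric (skewness 2) and a one-sided $z$-test at nominal $\alpha = 0.05$; the left-tailed test is where the skew bites hardest:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def effective_type1(n, alpha=0.05, trials=50_000):
    """Effective Type-I error of a one-sided z-test under H0 when the
    metric is Exp(1) (true mean 1) instead of Gaussian."""
    x = rng.exponential(1.0, size=(trials, n))
    z = (x.mean(axis=1) - 1.0) / (x.std(axis=1, ddof=1) / np.sqrt(n))
    return np.mean(z <= stats.norm.ppf(alpha))   # reject for small sample means

for n in [10, 50, 200]:
    print(f"n={n:4d}  nominal=0.050  effective={effective_type1(n):.4f}")
```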
11. Non-Uniform Bounds
The standard Berry-Esseen bound is uniform in $x$: the same error budget sits across everywhere, while in reality the approximation is much better out in the tails.
Bikelis (1966):

$$\left| P(S_n \le x) - \Phi(x) \right| \le \frac{C\,\rho}{\sigma^3 \sqrt{n}\, (1 + |x|)^3} \quad \text{for all } x.$$

At $|x| = 3$ this is roughly $1/64$ of the error at $x = 0$, and you can trust the CLT approximation a lot more out in the tails than at the center, even at moderate $n$. The $(1+|x|)^{-3}$ decay is polynomial, slower than the Gaussian tail itself, but it is precisely this decay, absent from the uniform bound, that makes the non-uniform bound informative in the tails.
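The tail improvement is easy to see exactly for binomial sums. A quick check, with Bernoulli(0.3) again as the assumed example, comparing the CDF error near the center against the error beyond three standard deviations:

```python
import numpy as np
from scipy import stats

n, p = 200, 0.3
k = np.arange(n + 1)
x = (k - n * p) / np.sqrt(n * p * (1 - p))      # standardized lattice points
err = np.abs(stats.binom.cdf(k, n, p) - stats.norm.cdf(x))

center = err[np.abs(x) < 0.5].max()
tail = err[np.abs(x) > 3.0].max()
print(f"center error={center:.5f}  tail error={tail:.6f}  ratio={tail/center:.4f}")
```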
12. Summary
Berry-Esseen is a quantitative check on finite-sample Gaussian approximations, and for moderate $n$ the tails of the true distribution can sit pretty far from the Gaussian.
If a risk system treats a 5-sigma event as having probability $\Phi(-5) \approx 2.9 \times 10^{-7}$, Berry-Esseen says that at moderate $n$ the error in that tail probability, of order $\rho/(\sigma^3\sqrt{n})$ uniformly and $\rho/(\sigma^3\sqrt{n}\,(1+5)^3)$ non-uniformly, can run far bigger than the probability being reported, and the $n^{-1/2}$ rate drops a hard floor on how accurate the normal approximation can get.
Historical notes
The story starts with Adolphe Quetelet in 1835, who picked out the "Average Man" and first spotted Gaussian distributions in nature. More than a century later, Andrew C. Berry (1941) and Carl-Gustav Esseen (1942) independently proved the original Berry-Esseen bound, with Esseen setting down the smoothing lemma that is still the central technical piece. A. Bikelis worked out non-uniform bounds for tail probabilities in 1966 and showed that the CLT approximation actually gets better out in the tails. Charles Stein rolled out his method in 1972 and opened the door to bounds for dependent random variables. Andrew Barron proved the entropic CLT for KL-divergence in 1986 and measured convergence in an information-theoretic metric. Friedrich Götze pinned down multivariate bounds for convex sets in 1991, with error scaling polynomially in the dimension $d$. The universal constant got tightened to $C \le 0.4748$ by Irina Shevtsova in 2011, and Koltchinskii and Lounici proved matrix Berry-Esseen-type bounds in 2017.
Glossary
Characteristic Function (CF) is the Fourier transform of a distribution, $\varphi_X(t) = E e^{itX}$; the CF of a sum of independent variables factors as a product, which is why Fourier methods feel natural for the CLT.
Cumulants ($\kappa_j$) are the coefficients of the Taylor expansion of the log-CF; the third cumulant is the skewness, and it drives the leading correction in the Edgeworth expansion.
Dependency Graph is a graph that records which random variables may be dependent; Stein’s method uses this structure to bound the normal approximation error for sums of dependent variables.
Hermite Polynomials are orthogonal polynomials against the Gaussian weight, and they show up as correction terms in the Edgeworth series.
Kolmogorov Distance is the sup-norm distance between CDFs; this is the metric Berry-Esseen holds in check.
Proof Sketch of the Smoothing Lemma
This is the central technical piece in Berry-Esseen: it turns Fourier-domain control of characteristic functions into uniform bounds on CDFs.
Let $\Delta(x) = F(x) - G(x)$ and $\delta = \sup_x |\Delta(x)|$, and take a smoothing kernel $K_T$ whose CF vanishes outside $[-T, T]$; convolve to get the smoothed difference $\Delta_T = \Delta * K_T$. Fourier inversion expresses $\Delta_T$ through $(f(t) - g(t))/t$ on $[-T, T]$ alone:

$$\sup_x |\Delta_T(x)| \le \frac{1}{2\pi} \int_{-T}^{T} \left| \frac{f(t) - g(t)}{t} \right| dt.$$

Bounding this integral gives you the first term of the lemma.
The second term ($24m/\pi T$) comes from the smoothing error: since $G$ is $m$-Lipschitz, for any shift $|s| \lesssim 1/T$ you get $|\Delta(x)| \le |\Delta_T(x+s)| + O(m/T)$, so the unsmoothed sup is recovered from the smoothed one at an $O(m/T)$ price.
Using the Fejér or de la Vallée Poussin kernel gives the specific constant $24/\pi$, and the remaining calculation is mostly bookkeeping with the kernel's moments and support. See Petrov (1975), Chapter V, for the full argument.
Pathologies and Discrete Corrections
Infinite Moments
If $EX^2 = \infty$, like a Pareto with tail index $\alpha < 2$, then the CLT wipes out entirely: the normalized sum drifts into an $\alpha$-stable distribution instead, with normalization $n^{1/\alpha}$ in place of $\sqrt{n}$, and Berry-Esseen does not apply in this regime.
High Dimensions
In $\mathbb{R}^d$ the uniform error over convex sets scales as $O(d^{1/4}/\sqrt{n})$. Over arbitrary Borel sets the situation runs much worse: the error can sit as large as 1, since the discrete and Gaussian distributions can be mutually singular. The Gaussian probability of a hypercube boundary is negligible, while the discrete distribution may drop substantial mass there.
Discrete Corrections
Lattice distributions hit the worst-case Berry-Esseen error, and the standard remedy is the continuity correction: when approximating $P(S \le k)$ for integer $k$ you use

$$P(S \le k) \approx \Phi\!\left( \frac{k + \tfrac{1}{2} - \mu_S}{\sigma_S} \right),$$

and this half-integer shift compensates for the staircase structure of the discrete CDF; in the language of Edgeworth expansions it lines up with the Euler-Maclaurin correction.
For symmetric Bernoulli sums the continuity correction drops the error from $O(n^{-1/2})$ to $O(n^{-1})$.
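A quick check of the correction on Binomial($n$, $1/2$) sums; the errors below are evaluated at the integer lattice points, which is where the approximation is actually used:

```python
import numpy as np
from scipy import stats

def cdf_errors(n, p=0.5):
    """Max error of the normal approximation to the Binomial(n, p) CDF at
    integer points, with and without the half-integer continuity correction."""
    k = np.arange(n + 1)
    mu, sigma = n * p, np.sqrt(n * p * (1 - p))
    exact = stats.binom.cdf(k, n, p)
    plain = stats.norm.cdf((k - mu) / sigma)
    corrected = stats.norm.cdf((k + 0.5 - mu) / sigma)
    return np.abs(exact - plain).max(), np.abs(exact - corrected).max()

for n in [10, 100, 1000]:
    e_plain, e_corr = cdf_errors(n)
    print(f"n={n:5d}  plain={e_plain:.5f}  corrected={e_corr:.6f}")
```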
The Universal Constant
The Berry-Esseen bound needs a concrete value of $C$ to be useful in practice, and sharpening this constant has been a sustained effort over eight decades.
- Esseen (1942): $C \le 7.59$
- van Beek (1972): $C \le 0.7975$
- Shiganov (1986): $C \le 0.7655$
- Shevtsova (2011): $C \le 0.4748$ (current record)
| Assumption | Best known $C$ | Source |
|---|---|---|
| General (Independent) | 0.4748 | Shevtsova (2011) |
| Symmetric Variables | 0.4097 | Shevtsova (2010) |
| i.i.d. Symmetric | 0.4097 | Tyurin (2010) |
| Lower Bound (Bernoulli) | 0.3989 | Esseen (1956) |
| Non-Uniform General | 25.5 | Paditz (1989) |
The gap between 0.47 and 0.40 matters in safety-critical engineering, like nuclear and aerospace, where Gaussian approximations underwrite extremely small failure-probability guarantees; if $C$ runs too large, the required sample sizes become impractical.
High-Dimensional and Matrix Extensions
Multivariate
In $\mathbb{R}^d$ the natural question is convergence of probabilities over classes of sets.
Götze (1991): For i.i.d. vectors $X_i \in \mathbb{R}^d$ with zero mean and identity covariance, you get

$$\sup_{A \in \mathcal{C}} \left| P(S_n \in A) - \gamma_d(A) \right| \le \frac{C\, d\, E\|X_1\|^3}{\sqrt{n}},$$

where $\mathcal{C}$ is the class of convex sets, $\gamma_d$ is the standard Gaussian measure, and the dimensional factor, $d$ here and later improved to $d^{1/4}$ by Bentkus (2003), is the cost of cranking up dimension.
Matrix Berry-Esseen
For independent symmetric random matrices $X_i \in \mathbb{R}^{d \times d}$ with $EX_i = 0$ and covariance $\Sigma$, the relevant quantity is the effective rank $r(\Sigma) = \mathrm{tr}(\Sigma)/\|\Sigma\|$.
Koltchinskii-Lounici (2017), in its sample-covariance formulation:

$$E\, \|\hat{\Sigma}_n - \Sigma\| \asymp \|\Sigma\| \left( \sqrt{\frac{r(\Sigma)}{n}} + \frac{r(\Sigma)}{n} \right).$$

The convergence rate leans on $r(\Sigma)$ rather than $d$, and the bound is the natural matrix analogue of Berry-Esseen, where the effective rank plays the role that the third moment plays in the scalar case.
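The effective rank can be dramatically smaller than the ambient dimension when the spectrum decays. A tiny illustration, assuming a diagonal covariance with eigenvalues $\lambda_j = j^{-2}$:

```python
import numpy as np

d = 100
eigs = 1.0 / np.arange(1, d + 1) ** 2        # eigenvalues lambda_j = j^{-2}
r_eff = eigs.sum() / eigs.max()              # r(Sigma) = tr(Sigma) / ||Sigma||
print(f"ambient dimension d={d},  effective rank r={r_eff:.3f}")  # ~ pi^2/6
```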
The Entropic CLT
Berry-Esseen measures convergence in Kolmogorov distance; from an information-theoretic angle, relative entropy is a more natural metric.
Let $D(S_n) = D\big(\mathrm{Law}(S_n)\,\big\|\,\mathcal{N}(0,1)\big) = \int p_n \log(p_n/\varphi)$.
Barron (1986): If $D(S_m) < \infty$ for some $m$, then $D(S_n) \to 0$.
Under log-Sobolev-type conditions the rate is $D(S_n) = O(1/n)$, formally faster than the $O(n^{-1/2})$ Kolmogorov rate, though measured in a different and stronger metric.
References
1. Esseen, C. G. (1942). “On the Liapounoff limit of error in the theory of probability”. The original paper deriving the bound via the smoothing lemma.
2. Stein, C. (1972). “A bound for the error in the normal approximation to the distribution of a sum of dependent random variables”. Introduced Stein’s method, a completely different angle that swapped Fourier analysis for functional equations.
3. Chen, L. H. Y., Goldstein, L., & Shao, Q. M. (2011). “Normal Approximation by Stein’s Method”. The modern textbook on Stein’s method.
4. Shevtsova, I. G. (2011). “On the absolute constant in the Berry-Esseen inequality”. The paper that proved the current record for the constant ($C \le 0.4748$).
5. Bentkus, V. (2003). “On the dependence of the Berry-Esseen bound on dimension”. The definitive result on the $d^{1/4}$ dimensional rate.
6. Barron, A. R. (1986). “Entropy and the Central Limit Theorem”. The foundational paper for the entropic version of the CLT.