The Berry-Esseen Bound

1. Quantitative Convergence in the Central Limit Theorem

The CLT says that normalized sums of i.i.d. random variables with finite variance converge in distribution to $\mathcal{N}(0, 1)$ as $n$ grows. But the convergence is asymptotic, and in practice $n$ sits at some finite value, so the obvious question is how well the Gaussian approximation actually holds up at a given $n$.

The Berry-Esseen theorem gives you a uniform bound on the error.

$$D_n = \sup_{x \in \mathbb{R}} |P(S_n \le x) - \Phi(x)| \le C\, \frac{\mathbb{E}|X|^3}{\sigma^3 \sqrt{n}}$$

where $S_n = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^n (X_i - \mu)$ is the standardized sum and $C$ is a universal constant.

This bound holds uniformly across all $x$, and it does not care whether the underlying distribution is skewed, multimodal, or discrete. The $1/\sqrt{n}$ rate only needs a finite third absolute moment to show up.
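To see how tight the bound is in a concrete case, here is a minimal sketch (assuming NumPy/SciPy; the constant $C = 0.4748$ is Shevtsova's i.i.d. value from Section "The Universal Constant") comparing the bound against the exact Kolmogorov distance for sums of centered $\mathrm{Exp}(1)$ variables, where $F_n$ is available in closed form through the Gamma CDF.

```python
import numpy as np
from scipy import stats

C = 0.4748              # best known i.i.d. constant (Shevtsova, 2011)
rho = 12 / np.e - 2     # E|X|^3 for X = Exp(1) - 1 (mean 0, variance 1)

x = np.linspace(-6, 6, 4001)
for n in [10, 100, 1000]:
    # Sum of n Exp(1) variables is Gamma(n, 1), so S_n = (Gamma - n)/sqrt(n)
    # and F_n(x) = GammaCDF(n + x*sqrt(n)).
    F_n = stats.gamma.cdf(n + x * np.sqrt(n), a=n)
    D_n = np.max(np.abs(F_n - stats.norm.cdf(x)))
    print(f"n={n:5d}  D_n ~ {D_n:.4f}   bound = {C * rho / np.sqrt(n):.4f}")
```

The bound holds with plenty of room here; the worst case over all distributions is what pins down the constant.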

2. The Role of Moment Assumptions

The CLT itself only asks for $\mathbb{E}X^2 < \infty$; the rate of convergence depends on the behavior of higher moments.

Take $r = 2 + \delta$ for $\delta \in (0, 1]$. If $\mathbb{E}|X|^r < \infty$ you can write down the Lyapunov fraction

$$L_n = \frac{\sum_{i=1}^n \mathbb{E}|X_i|^r}{\left(\sum_{i=1}^n \mathbb{E}X_i^2\right)^{r/2}}$$

If $L_n \to 0$ the CLT holds, and the Berry-Esseen bound stretches out to match:

$$\sup_x |F_n(x) - \Phi(x)| \le C_\delta \cdot L_n$$

For i.i.d. variables $L_n = \frac{n\, \mathbb{E}|X|^r}{(n \sigma^2)^{r/2}} = O(n^{-\delta/2})$, and the behavior drops straight out of that.

  • With only $2.1$ moments ($\delta = 0.1$) the error drops as $n^{-0.05}$, which is painfully slow.
  • Three moments ($\delta = 1$) bring back the $n^{-1/2}$ rate.
  • Moments past the third do not help the worst-case Kolmogorov distance: the rate sits at $1/\sqrt{n}$ unless you restrict to smooth distributions and pull in Edgeworth expansions. The sketch below makes the gap concrete.
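A back-of-the-envelope sketch of what these exponents mean for sample size: ignoring constants, pushing $n^{-\delta/2}$ below a target $\varepsilon$ takes $n \gtrsim \varepsilon^{-2/\delta}$, which explodes as $\delta$ shrinks.

```python
# n needed so that n^(-delta/2) <= eps, i.e. n >= eps^(-2/delta).
# Constants ignored; this isolates the moment-driven exponent only.
for delta in [1.0, 0.5, 0.1]:
    for eps in [0.1, 0.01]:
        print(f"delta={delta:4.1f}  eps={eps:5.2f}  n >= {eps ** (-2 / delta):.2e}")
```

With $\delta = 0.1$ and $\varepsilon = 0.01$ the required $n$ is of order $10^{40}$, which is why "the CLT holds" and "the CLT is usable" are different claims.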

3. Esseen’s Smoothing Lemma

Bounding $\sup_x |F_n(x) - \Phi(x)|$ head-on is hard. Characteristic functions are easier to work with because the CF of a sum of independent variables factors as a product.

The wrinkle is that if $X$ is discrete then $F_n$ has no density and naive Fourier inversion blows up. The fix is to smooth both distributions by convolving them with a kernel whose CF is supported on $[-T, T]$, and then bound the approximation error that the smoothing drags in.

Smoothing Lemma (Esseen, 1945): Let $F$ and $G$ be CDFs with $G$ differentiable and $|G'| \le M$. Then for any $T > 0$,

$$\sup_x |F(x) - G(x)| \le \frac{1}{\pi} \int_{-T}^T \left| \frac{\varphi_F(t) - \varphi_G(t)}{t} \right| dt + \frac{24M}{\pi T}$$

The first term measures the Fourier-domain discrepancy; the second is the bias that smoothing drags in. The proof of Berry-Esseen comes down to holding each of these in check.
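As a numerical sanity check (a sketch, assuming NumPy/SciPy), both sides of the lemma can be evaluated for the standardized sum of $n$ Rademacher signs, whose CF is $\cos(t/\sqrt{n})^n$, against the standard normal with $M = \sup \Phi' = 1/\sqrt{2\pi}$; the exact Kolmogorov distance comes from the binomial CDF.

```python
import numpy as np
from scipy import stats

n, T = 100, 10.0
M = 1 / np.sqrt(2 * np.pi)              # sup |Phi'| for the standard normal

# Exact Kolmogorov distance for S_n = (sum of n Rademacher signs)/sqrt(n):
# check both one-sided limits of the step CDF at every atom.
k = np.arange(n + 1)
atoms = (2 * k - n) / np.sqrt(n)
F_right = stats.binom.cdf(k, n, 0.5)
F_left = np.concatenate([[0.0], F_right[:-1]])
Phi = stats.norm.cdf(atoms)
lhs = max(np.max(np.abs(F_right - Phi)), np.max(np.abs(F_left - Phi)))

# Right-hand side of the lemma: CF integral plus smoothing bias.
t = np.linspace(1e-6, T, 200001)
dt = t[1] - t[0]
cf_diff = np.abs(np.cos(t / np.sqrt(n)) ** n - np.exp(-t ** 2 / 2))
integral = 2 * np.sum(cf_diff / t) * dt          # integrand is even in t
rhs = integral / np.pi + 24 * M / (np.pi * T)

print(f"sup|F_n - Phi| = {lhs:.4f}  <=  RHS = {rhs:.4f}")
```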

4. Controlling the Characteristic Function

Now you actually put the smoothing lemma to work. Set $S_n = \frac{1}{\sqrt{n}} \sum X_i$ with $\mathbb{E}X = 0$, $\mathbb{E}X^2 = 1$, and $\mathbb{E}|X|^3 = \rho$. A third-order Taylor expansion of the CF gives

$$\varphi_X(t) = \mathbb{E}[e^{itX}] = 1 - \frac{t^2}{2} + \theta \frac{\rho |t|^3}{6}$$

with $|\theta| \le 1$. For the sum you raise this whole thing to the $n$-th power:

$$\varphi_n(t) = \left( \varphi_X\left(\frac{t}{\sqrt{n}}\right) \right)^n = \left( 1 - \frac{t^2}{2n} + \theta\, \frac{\rho |t|^3}{6 n^{3/2}} \right)^n$$

Write $\varphi_X(u) = 1 - \frac{u^2}{2} + R(u)$ with $|R(u)| \le \frac{\rho |u|^3}{6}$, and restrict to $|t| \le \frac{\sqrt{n}}{4\rho}$ so that $\left| \varphi_X\left(\frac{t}{\sqrt{n}}\right) - 1 \right| \le \frac{1}{2}$. In this range you can safely take logarithms using $\log(1+z) = z - \frac{z^2}{2} + O(|z|^3)$.

The log-characteristic function drops into a simple expression.

$$\log \varphi_n(t) = n \log \left( 1 - \frac{t^2}{2n} + O\left( \frac{\rho |t|^3}{n^{3/2}} \right) \right) = -\frac{t^2}{2} + O\left( \frac{\rho |t|^3}{\sqrt{n}} \right)$$

Exponentiating gives you the characteristic function as a Gaussian times a small correction.

$$\varphi_n(t) = e^{-t^2/2} \exp\left( O\left( \frac{\rho |t|^3}{\sqrt{n}} \right) \right)$$

Using the inequality $|e^z - 1| \le |z| e^{|z|}$ you pick out a clean Fourier-domain bound:

$$\left| \varphi_n(t) - e^{-t^2/2} \right| \le e^{-t^2/2}\, \frac{C \rho |t|^3}{\sqrt{n}}$$

Plugging into Esseen's integral with $T \approx \sqrt{n}/\rho$ gives the size of the first term:

$$\int_{-T}^T \frac{1}{|t|}\, e^{-t^2/2}\, \frac{\rho |t|^3}{\sqrt{n}}\, dt = \frac{\rho}{\sqrt{n}} \int_{-T}^{T} t^2 e^{-t^2/2}\, dt = O(\rho/\sqrt{n})$$

The bias term $\frac{24M}{\pi T}$ drops out at $O(\rho/\sqrt{n})$ too, since $T \propto \sqrt{n}$.

The characteristic function near the origin sets the convergence rate and the smoothing lemma turns Fourier-domain control into a uniform bound on CDFs.
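The $\sqrt{n}$ scaling of the CF error is easy to verify directly. Here is a sketch for centered $\mathrm{Exp}(1)$ summands, whose standardized CF has the closed form $\varphi_X(u) = e^{-iu}/(1 - iu)$: at fixed $t$, the error times $\sqrt{n}$ should stabilize near $|\kappa_3|\, |t|^3 e^{-t^2/2} / 6$.

```python
import numpy as np

def cf_n(t, n):
    # CF of S_n = (sum of n centered Exp(1) variables)/sqrt(n);
    # phi_X(u) = exp(-iu) / (1 - iu) for X = Exp(1) - 1.
    u = t / np.sqrt(n)
    return (np.exp(-1j * u) / (1 - 1j * u)) ** n

t = 1.0
target = (2.0 / 6) * t ** 3 * np.exp(-t ** 2 / 2)   # kappa_3 = 2 for Exp(1)
for n in [10, 100, 1000, 10000]:
    err = abs(cf_n(t, n) - np.exp(-t ** 2 / 2))
    print(f"n={n:6d}  sqrt(n)*|phi_n - gauss| = {np.sqrt(n) * err:.4f}"
          f"   (limit ~ {target:.4f})")
```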


5. Stein’s Method

Fourier methods lean on independence because the CF of a sum factors as a product; for dependent variables this breaks, and you need a different angle.

Stein's method (1972) swaps Fourier analysis for a functional equation. It defines the Stein operator $\mathcal{A} f(w) = f'(w) - w f(w)$.

If $Z \sim \mathcal{N}(0, 1)$ then $\mathbb{E}[\mathcal{A} f(Z)] = 0$ for every smooth $f$ with sensible growth conditions; this drops out of integration by parts against the Gaussian density.

To bound $|P(S_n \le x) - \Phi(x)|$ you set $h(w) = \mathbb{I}(w \le x)$ and solve the Stein equation

$$f'(w) - w f(w) = h(w) - \mathbb{E}h(Z)$$

The Berry-Esseen bound then boils down to holding $|\mathbb{E}[\mathcal{A} f(S_n)]|$ in check.

The solution $f$ for $h = \mathbb{I}_{(-\infty, x]}$ satisfies clean sup-norm bounds:

$$\|f\|_\infty \le \sqrt{2\pi}/4, \quad \|f'\|_\infty \le 1$$

(For absolutely continuous test functions $h$ one also gets $\|f''\|_\infty \le 2 \|h'\|_\infty$; for the indicator, $f'$ has a jump at $x$.)

These bounds are universal: they do not depend on the distribution of the $X_i$.

To work out $\mathbb{E}[S_n f(S_n)]$ you use a leave-one-out decomposition: write $S_n^{(i)}$ for the sum with the $i$-th term pulled out, so $S_n = S_n^{(i)} + X_i/\sqrt{n}$. Taylor expanding $f(S_n)$ around $S_n^{(i)}$ and leaning on the moment assumptions, each term in the sum contributes an error of order $\rho / n^{3/2}$, and summing over $i = 1, \ldots, n$ brings back the $\rho/\sqrt{n}$ rate.
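The pieces above can be checked numerically. A minimal sketch (assuming NumPy/SciPy) using the closed-form solution $f_x(w) = \sqrt{2\pi}\, e^{w^2/2}\, \Phi(\min(w,x))\, (1 - \Phi(\max(w,x)))$: it verifies the Stein equation away from the kink at $w = x$ and the $\sqrt{2\pi}/4$ sup-norm bound.

```python
import numpy as np
from scipy import stats

def stein_solution(w, x):
    # f(w) = sqrt(2*pi) * e^{w^2/2} * Phi(min(w, x)) * (1 - Phi(max(w, x)))
    lo, hi = np.minimum(w, x), np.maximum(w, x)
    return np.sqrt(2 * np.pi) * np.exp(w ** 2 / 2) \
        * stats.norm.cdf(lo) * stats.norm.sf(hi)

x = 0.7
w = np.linspace(-8, 8, 200001)
f = stein_solution(w, x)

# Residual of f'(w) - w f(w) = 1{w<=x} - Phi(x), away from the kink at w = x.
fprime = np.gradient(f, w)
residual = fprime - w * f - ((w <= x) - stats.norm.cdf(x))
mask = np.abs(w - x) > 0.01
print("max Stein-equation residual:", np.max(np.abs(residual[mask])))
print("||f||_inf =", f.max(), " vs bound", np.sqrt(2 * np.pi) / 4)
```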

6. Berry-Esseen under Dependency Graphs

Stein’s method stretches cleanly into settings with local dependence.

Suppose $\{X_i\}_{i=1}^n$ is not independent, but each $X_i$ depends only on a bounded neighborhood. Define a dependency graph $L = (V, E)$ with $V = \{1, \dots, n\}$ such that any two sets of vertices with no connecting edge index mutually independent collections of variables; in particular, $\{i, j\} \notin E$ forces $X_i$ and $X_j$ to be independent.

Let $D$ be the maximum degree of the graph.

Berry-Esseen for dependency graphs:

$$\sup_x |F_n(x) - \Phi(x)| \le C\, \frac{D^2 \sum_{i=1}^n \mathbb{E}|X_i|^3}{\sigma^3}$$

where $\sigma^2 = \mathrm{Var}\left(\sum_i X_i\right)$ and $F_n$ is the CDF of $\sum_i X_i / \sigma$.

For a $k$-dependent sequence like $X_i = f(Y_i, \dots, Y_{i+k})$ you get $D = 2k$, and the rate sits at $O(1/\sqrt{n})$ as long as $D$ is bounded; the bound stays useful for slowly growing $D$ as well. This covers important examples like 1D Ising models and functionals of Markov chains; a simulation sketch follows.
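A simulation sketch for the simplest case, a 1-dependent moving average $X_i = (Y_i + Y_{i+1})/\sqrt{2}$ built from centered $\mathrm{Exp}(1)$ innovations (so $D$ is constant and $\mathrm{Var}(\sum X_i) = 2n - 1$): the empirical Kolmogorov error should roughly halve each time $n$ quadruples, up to Monte Carlo noise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def kolmogorov_error(n, reps=50000):
    # 1-dependent sequence: X_i = (Y_i + Y_{i+1})/sqrt(2), Y_j = Exp(1) - 1.
    Y = rng.exponential(1, size=(reps, n + 1)) - 1
    X = (Y[:, :-1] + Y[:, 1:]) / np.sqrt(2)
    S = X.sum(axis=1) / np.sqrt(2 * n - 1)     # Var(sum X_i) = 2n - 1
    ecdf = np.arange(1, reps + 1) / reps
    return np.max(np.abs(ecdf - stats.norm.cdf(np.sort(S))))

for n in [9, 36, 144]:
    print(f"n={n:4d}  D_n ~ {kolmogorov_error(n):.4f}")
```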


7. Edgeworth Expansions

Berry-Esseen gives you $O(n^{-1/2})$; for continuous distributions you can push further by working out explicit correction terms.

Let $\kappa_j$ be the $j$-th cumulant of $X$, and Taylor expand the cumulant generating function of the standardized sum:

$$\log \varphi_n(t) = \sum_{j=2}^\infty \frac{\kappa_j}{n^{j/2-1}} \frac{(it)^j}{j!} = -\frac{t^2}{2} + \frac{\kappa_3 (it)^3}{6\sqrt{n}} + \frac{\kappa_4 (it)^4}{24n} + \dots$$

Then you exponentiate the whole thing.

$$\varphi_n(t) = e^{-t^2/2} \left[ 1 + \frac{\kappa_3 (it)^3}{6\sqrt{n}} + \left( \frac{\kappa_4 (it)^4}{24n} + \frac{\kappa_3^2 (it)^6}{72n} \right) + \dots \right]$$

Invert term-by-term using Hermite polynomials, $He_k(x)\, \phi(x) = (-1)^k \frac{d^k}{dx^k} \phi(x)$, and the CDF drops out:

$$F_n(x) = \Phi(x) - \frac{\kappa_3}{6\sqrt{n}} (x^2 - 1) \phi(x) - \dots$$

This expansion needs Cramér's condition: $\limsup_{|t| \to \infty} |\varphi_X(t)| < 1$.

Any distribution with a non-zero absolutely continuous component satisfies Cramér's condition; lattice distributions do not. For Bernoulli variables the Kolmogorov error sits at $O(1/\sqrt{n})$ and the Edgeworth series does not actually converge to the discrete CDF.
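Since the sum of $n$ $\mathrm{Exp}(1)$ variables is exactly $\mathrm{Gamma}(n, 1)$, the one-term Edgeworth claim can be checked against the exact CDF; in the sketch below, the plain normal error should shrink like $1/\sqrt{n}$ while the corrected error shrinks like $1/n$.

```python
import numpy as np
from scipy import stats

kappa3 = 2.0                      # third cumulant of Exp(1)
x = np.linspace(-4, 4, 2001)

for n in [10, 40, 160]:
    exact = stats.gamma.cdf(n + x * np.sqrt(n), a=n)   # exact CDF of S_n
    plain = stats.norm.cdf(x)
    edge = plain - kappa3 / (6 * np.sqrt(n)) * (x ** 2 - 1) * stats.norm.pdf(x)
    print(f"n={n:4d}  CLT error={np.max(np.abs(exact - plain)):.5f}  "
          f"Edgeworth error={np.max(np.abs(exact - edge)):.6f}")
```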

8. Numerical Comparison: Lattice vs. Continuous Distributions

The simulation below shows the qualitative difference between the lattice and continuous cases: the staircase error pattern that rides along with discrete distributions, set against the smooth Edgeworth correction for continuous ones.

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def compare_convergence(n=20):
    # 1. Lattice case: symmetric +/-1 signs (Rademacher)
    # 2. Continuous case: centered Exponential(1)
    x = np.linspace(-3, 3, 500)

    # Gaussian density and CDF
    phi_x = stats.norm.pdf(x)
    Phi_x = stats.norm.cdf(x)

    # One-term Edgeworth correction (third cumulant is 2 for Exp(1))
    rho_exp = 2.0
    edgeworth_exp = Phi_x - (rho_exp / (6 * np.sqrt(n))) * (x**2 - 1) * phi_x

    # Empirical results
    samples = 100000

    # Lattice sum
    X_bern = np.random.choice([-1, 1], size=(samples, n))
    S_bern = np.sum(X_bern, axis=1) / np.sqrt(n)

    # Centered exponential sum
    X_exp = np.random.exponential(1, size=(samples, n)) - 1
    S_exp = np.sum(X_exp, axis=1) / np.sqrt(n)

    plt.figure(figsize=(12, 5))

    # Empirical CDF minus Gaussian CDF: staircase error in the lattice case
    plt.subplot(1, 2, 1)
    plt.step(np.sort(S_bern),
             np.linspace(0, 1, samples) - stats.norm.cdf(np.sort(S_bern)),
             label='Bernoulli Error')
    plt.title('Lattice Error (Staircase)')
    plt.legend()

    # Smooth error tracking the Edgeworth prediction in the continuous case
    plt.subplot(1, 2, 2)
    plt.plot(np.sort(S_exp),
             np.linspace(0, 1, samples) - stats.norm.cdf(np.sort(S_exp)),
             label='Empirical Exp Error', alpha=0.5)
    plt.plot(x, edgeworth_exp - Phi_x, 'r--', label='Edgeworth Prediction')
    plt.title('Continuous Error (Edgeworth Prediction)')
    plt.legend()
    # plt.show()
```

9. The Local Central Limit Theorem

Berry-Esseen bounds the CDF error; you can also ask about convergence of the density $f_n(x)$ to the Gaussian density $\phi(x)$.

Theorem (Local Limit): If the CF $\varphi(t)$ is integrable, with $\int_{-\infty}^\infty |\varphi(t)|^r dt < \infty$ for some $r \ge 1$, then for $n \ge r$ the density $f_n(x)$ exists, and (given the finite third moment assumed throughout) the sup error drops as $1/\sqrt{n}$:

$$\sup_{x \in \mathbb{R}} |f_n(x) - \phi(x)| = O\left( \frac{1}{\sqrt{n}} \right)$$

The integrability condition forces the distribution to have a density with controlled oscillation, and it rules out lattice distributions, whose mass concentrates in spikes rather than a density.

For lattice distributions the mass of $S_n$ sits only on the lattice points, and the local CLT takes a different form:

$$\frac{\sqrt{n}}{h} P(S_n = x) = \phi(x) + o(1)$$

with $h$ the lattice span. The rate is the same, but the interpretation shifts: this controls pointwise probability mass instead of CDF error.
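For Rademacher signs the unnormalized sum lives on a lattice of span $h = 2$, so the lattice local CLT can be checked exactly with binomial probabilities (a sketch):

```python
import numpy as np
from scipy import stats

h = 2  # lattice span of the unnormalized sum of Rademacher signs

for n in [16, 64, 256, 1024]:
    k = np.arange(n + 1)
    x = (2 * k - n) / np.sqrt(n)            # atoms of the normalized sum
    pmf = stats.binom.pmf(k, n, 0.5)        # P(sum = 2k - n)
    err = np.max(np.abs(np.sqrt(n) / h * pmf - stats.norm.pdf(x)))
    print(f"n={n:5d}  sup |sqrt(n)/h * P - phi| = {err:.5f}")
```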

10. Applications

Value-at-Risk and Gaussian Tail Approximations

In quantitative finance, Value-at-Risk at level $\alpha$ is the $\alpha$-quantile of the loss distribution; under a Gaussian assumption, $VaR_\alpha = \mu + \Phi^{-1}(\alpha) \sigma$. Berry-Esseen pins down the error in this approximation for a portfolio of $n$ assets with skewed returns.

The correction uses the Cornish-Fisher expansion, obtained by inverting the Edgeworth series. Let $z_\alpha = \Phi^{-1}(\alpha)$; the corrected quantile drops out like this:

$$q_\alpha = z_\alpha + \frac{1}{\sqrt{n}} \left[ \frac{\kappa_3}{6}(z_\alpha^2 - 1) \right] + O(n^{-1})$$

If $\kappa_3 < 0$ (negative skew, typical for equities), the correction term is negative for tail risks, since $z_\alpha^2 > 1$ at $\alpha = 0.01$, and the true loss quantile runs past the Gaussian prediction.

For $\alpha = 0.01$ you have $z_{0.01} \approx -2.33$, and with skewness $\kappa_3 = -1$ and $n = 50$ the corrected quantile sits around 2.43 standard deviations while the Gaussian model reports 2.33. This gap is exactly where capital inadequacy gets born.
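The arithmetic is a one-liner; here is a sketch wrapping the one-term Cornish-Fisher correction as a function (the parameters $\kappa_3 = -1$, $n = 50$, $\alpha = 0.01$ are the ones from the example above).

```python
from scipy import stats

def cornish_fisher_quantile(alpha, kappa3, n):
    # One-term Cornish-Fisher correction to the Gaussian quantile.
    z = stats.norm.ppf(alpha)
    return z + (kappa3 / 6) * (z ** 2 - 1) / n ** 0.5

z = stats.norm.ppf(0.01)
q = cornish_fisher_quantile(0.01, kappa3=-1.0, n=50)
print(f"Gaussian: {z:.3f}   skew-corrected: {q:.3f}")
# Gaussian: -2.326   skew-corrected: -2.430
```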

A/B Testing

The standard sample size formula $N \approx 16 \sigma^2 / \Delta^2$ assumes Gaussian errors. If the metric is highly skewed, like time-on-site, the actual p-value can drift off the nominal one by up to $D_n$. When $D_n \approx 0.05$ and the target significance level is $\alpha = 0.05$, the effective Type-I error rate can double, and compensating forces you to oversample by a factor of roughly $\left(1 + \frac{C \rho}{\alpha \sigma^3 \sqrt{n}}\right)$.


11. Non-Uniform Bounds

The standard Berry-Esseen bound is uniform in $x$: the same error budget applies everywhere. In reality the approximation is much better out in the tails.

Bikelis (1966):

$$|F_n(x) - \Phi(x)| \le \frac{C \rho}{\sqrt{n} (1 + |x|^3)}$$

At $x = 3$ this is roughly $1/28$ of the error at $x = 0$, so in absolute terms you can trust the CLT approximation a lot more out in the tails than at the center, even at moderate $n$. Note that the decay rate $(1 + |x|^3)^{-1}$ is polynomial, slower than the Gaussian tail $e^{-x^2/2}$, so the relative error of a tail probability can still be large even while the absolute error shrinks.


12. Summary

Berry-Esseen is a quantitative check on finite-sample Gaussian approximations; for moderate $n$ the tails of the true distribution can sit pretty far from the Gaussian.

If a risk system treats a 5-sigma event as having probability $10^{-7}$, Berry-Esseen says that at $n = 100$ the error in that tail probability can run as big as $10^{-2}$, and the $1/\sqrt{n}$ rate puts a hard floor on how accurate the normal approximation can get.


Historical notes

The story starts with Adolphe Quetelet in 1835, who introduced the Average Man and first spotted Gaussian distributions in nature. More than a century later, Andrew C. Berry (1941) and Carl-Gustav Esseen (1942) independently proved the original Berry-Esseen bound, with Esseen setting down the smoothing lemma that is still the central technical piece. A. Bikelis worked out non-uniform bounds for tail probabilities in 1966 and showed that the CLT approximation actually gets better, in absolute terms, out in the tails. Charles Stein rolled out his method in 1972 and opened the door to bounds for dependent random variables. Andrew Barron proved the entropic CLT for KL-divergence in 1986, measuring convergence in an information-theoretic metric. Friedrich Götze pinned down multivariate bounds for convex sets in 1991, with error scaling as $d^{1/4}$. The universal constant was tightened to $0.4748$ by Irina Shevtsova in 2011, and Koltchinskii and Lounici proved matrix Berry-Esseen bounds in 2017.

Characteristic Function (CF) is the Fourier transform of a distribution, $\varphi(t) = \int e^{itx} dF(x)$. The CF of a sum of independent variables factors as a product, which is why Fourier methods feel natural for the CLT.

Cumulants ($\kappa_j$) are the coefficients of the Taylor expansion of the log-CF. The third cumulant $\kappa_3$ equals the skewness for unit-variance variables, and it drives the leading correction in the Edgeworth expansion.

Dependency Graph is a graph that encodes which random variables may be dependent; Stein's method uses this structure to bound the normal approximation error for sums of dependent variables.

Hermite Polynomials are orthogonal polynomials against the Gaussian weight and they show up as correction terms in the Edgeworth series.

Kolmogorov Distance is the sup-norm distance between CDFs, $D(F, G) = \sup_x |F(x) - G(x)|$; this is the metric Berry-Esseen holds in check.


Proof Sketch of the Smoothing Lemma

This is the central technical piece in Berry-Esseen and it turns Fourier-domain control of characteristic functions into uniform bounds on CDFs.

Let $\Delta(x) = F(x) - G(x)$ and $\eta = \sup_x |\Delta(x)|$. Take a smoothing kernel $k_T$ whose CF is supported on $[-T, T]$, with CDF $K_T$, and convolve:

$$\Delta_T(x) = \frac{1}{2\pi} \int_{-T}^T e^{-itx}\, \frac{\varphi_F(t) - \varphi_G(t)}{-it}\, \varphi_K(t)\, dt$$

Bounding this integral gives you the first term of the lemma.

The second term ($24M/\pi T$) comes from the smoothing error $|\Delta_T(x) - \Delta(x)|$: since $G$ is $M$-Lipschitz, for any shift $a$ you get

$$|\Delta(a)| \le \max_b |\Delta_T(a+b)| + M \cdot \text{width}$$

where the width is that of the smoothing kernel. Using the Fejér or de la Vallée Poussin kernel gives the specific constant $24/\pi$, and the remaining calculation is mostly bookkeeping with the kernel's moments and support. See Petrov (1975), Chapter V for the full argument.


Pathologies and Discrete Corrections

Infinite Moments

If $\mathbb{E}X^2 = \infty$, as for Pareto with $\alpha < 2$, the CLT fails entirely: the normalized sum converges to a stable distribution instead, $n^{-1/\alpha} S_n \to L_\alpha$, and Berry-Esseen does not apply in this regime.

High Dimensions

In $\mathbb{R}^d$ the uniform error over convex sets scales as $d^{1/4} n^{-1/2}$. Over arbitrary Borel sets the situation is much worse: the error can be as large as 1, since the discrete and Gaussian distributions can be mutually singular. The Gaussian probability of a hypercube boundary is negligible, while the discrete distribution may place substantial mass there.

Discrete Corrections

Lattice distributions hit the worst-case Berry-Esseen error. The standard remedy is the continuity correction: when approximating $P(S_n \le k)$ for integer $k$, use $P(Z \le k + 0.5)$. This half-integer shift compensates for the staircase structure of the discrete CDF, and in the language of Edgeworth expansions it lines up with the Euler-Maclaurin correction.

For Bernoulli sums the continuity correction drops the error from $O(1/\sqrt{n})$ to $O(1/n)$.
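A quick check with $\mathrm{Binomial}(n, 1/2)$ sums (a sketch): evaluating the normal CDF at the half-integer point cuts the worst-case error by roughly a factor of $\sqrt{n}$.

```python
import numpy as np
from scipy import stats

for n in [25, 100, 400]:
    k = np.arange(n + 1)
    exact = stats.binom.cdf(k, n, 0.5)
    mu, sigma = n / 2, np.sqrt(n / 4)
    plain = stats.norm.cdf((k - mu) / sigma)            # no correction
    corrected = stats.norm.cdf((k + 0.5 - mu) / sigma)  # half-integer shift
    print(f"n={n:4d}  plain={np.max(np.abs(exact - plain)):.4f}  "
          f"corrected={np.max(np.abs(exact - corrected)):.5f}")
```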


The Universal Constant

The Berry-Esseen bound $C \rho / \sqrt{n}$ needs a concrete value of $C$ to be useful in practice, and sharpening this constant has been a sustained effort over eight decades.

  • Esseen (1942): $C \approx 7.59$
  • van Beek (1972): $0.7975$
  • Shiganov (1986): $0.7655$
  • Shevtsova (2011): $C < 0.4748$ (current record)
| Assumption | Best known $C$ | Source |
|---|---|---|
| General (independent) | 0.4748 | Shevtsova (2011) |
| Symmetric variables | 0.4097 | Shevtsova (2010) |
| i.i.d. symmetric | 0.4097 | Tyurin (2010) |
| Lower bound (Bernoulli) | 0.3989 | Esseen (1956) |
| Non-uniform, general | 25.5 | Paditz (1989) |

The gap between 0.47 and 0.40 matters in safety-critical engineering, like nuclear and aerospace, where Gaussian approximations underwrite failure probability guarantees below $10^{-6}$; if $C$ runs too large, the required sample sizes become impractical.


High-Dimensional and Matrix Extensions

Multivariate

In Rd\mathbb{R}^d the natural question is convergence of probabilities over sets.

Götze (1991): For i.i.d. vectors $X_i \in \mathbb{R}^d$ with identity covariance,

$$\sup_{A \in \mathcal{C}} |P(S_n \in A) - P(Z \in A)| \le \frac{c \cdot d^{1/4}\, \mathbb{E}\|X\|^3}{\sqrt{n}}$$

where $\mathcal{C}$ is the class of convex sets; the $d^{1/4}$ factor is the cost of cranking up dimension.

Matrix Berry-Esseen

For $n$ independent symmetric random matrices $X_i$ with $\|X_i\|_{\text{op}} \le M$ and covariance $\Sigma = \mathbb{E}[X_i^2]$, the relevant quantity is the effective rank

$$r(\Sigma) = \frac{\text{Tr}(\Sigma)}{\|\Sigma\|_{\text{op}}}$$

Koltchinskii-Lounici (2017):

$$\sup_{t \in \mathbb{R}} |P(\|S_n\|_{\text{op}} \le t) - P(\|Z\|_{\text{op}} \le t)| \le C \sqrt{\frac{r(\Sigma)}{n}}$$

The convergence rate depends on $r(\Sigma)$ rather than $d$, and the bound is the natural matrix analogue of Berry-Esseen, with the effective rank playing the role that the third moment plays in the scalar case.
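Effective rank is cheap to compute; here is a sketch with a made-up spiked covariance, where $r(\Sigma)$ stays small even though the ambient dimension is large.

```python
import numpy as np

def effective_rank(Sigma):
    # r(Sigma) = Tr(Sigma) / ||Sigma||_op
    return np.trace(Sigma) / np.linalg.norm(Sigma, ord=2)

d = 500
# Spiked covariance: a few strong directions over a weak isotropic floor.
Sigma = np.diag(np.concatenate([[25.0, 9.0, 4.0], np.full(d - 3, 0.01)]))
print(f"d = {d},  effective rank r(Sigma) = {effective_rank(Sigma):.2f}")
```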


The Entropic CLT

Berry-Esseen measures convergence in Kolmogorov distance and from an information-theoretic angle relative entropy is a more natural metric.

Let $D(f_n \| \phi) = \int f_n \log(f_n / \phi)\, dx$.

Barron (1986): If $D(f_1 \| \phi) < \infty$ then $D(f_n \| \phi) \to 0$.

Under log-Sobolev conditions the rate is $O(1/n)$, faster than the $1/\sqrt{n}$ Kolmogorov rate.


References

1. Esseen, C. G. (1942). “On the Liapounoff limit of error in the theory of probability”. The original paper deriving the bound via the smoothing lemma.

2. Stein, C. (1972). "A bound for the error in the normal approximation to the distribution of a sum of dependent random variables". Introduced Stein's method: a completely different approach that swaps Fourier analysis for functional equations.

3. Chen, L. H. Y., Goldstein, L., & Shao, Q. M. (2011). “Normal Approximation by Stein’s Method”. The modern textbook on Stein’s method.

4. Shevtsova, I. G. (2011). "On the absolute constant in the Berry-Esseen inequality". The paper that proved the current record for the constant ($0.4748$).

5. Bentkus, V. (2003). "On the dependence of the Berry-Esseen bound on dimension". The definitive result on the $d^{1/4}$ rate.

6. Barron, A. R. (1986). “Entropy and the Central Limit Theorem”. The foundational paper for the entropic version of the CLT.