The Berry-Esseen Bound

1. Quantitative Convergence in the Central Limit Theorem

The CLT says that normalized sums of i.i.d. random variables with finite variance converge in distribution to $\mathcal{N}(0, 1)$ as $n$ grows. But the convergence is asymptotic, and in practice $n$ sits at some finite value, so the obvious question is how well the Gaussian approximation actually holds up at a given $n$.

The Berry-Esseen theorem gives you a uniform bound on the error.

$$D_n = \sup_{x \in \mathbb{R}} |P(S_n \le x) - \Phi(x)| \le C\, \frac{\mathbb{E}|X|^3}{\sigma^3 \sqrt{n}}$$

where $S_n = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^n (X_i - \mu)$ is the standardized sum and $C$ is a universal constant.

This bound holds uniformly across all $x$, and it does not care whether the underlying distribution is skewed, multimodal, or discrete. The $1/\sqrt{n}$ rate only needs a finite third absolute moment to show up.
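To see how tight the bound is in a concrete case, here is a minimal sketch (assuming NumPy/SciPy; the constant $C = 0.4748$ is Shevtsova's i.i.d. value from Section "The Universal Constant") comparing the bound against the exact Kolmogorov distance for sums of centered $\mathrm{Exp}(1)$ variables, where $F_n$ is available in closed form through the Gamma CDF.

```python
import numpy as np
from scipy import stats

C = 0.4748              # best known i.i.d. constant (Shevtsova, 2011)
rho = 12 / np.e - 2     # E|X|^3 for X = Exp(1) - 1 (mean 0, variance 1)

x = np.linspace(-6, 6, 4001)
for n in [10, 100, 1000]:
    # Sum of n Exp(1) variables is Gamma(n, 1), so S_n = (Gamma - n)/sqrt(n)
    # and F_n(x) = GammaCDF(n + x*sqrt(n)).
    F_n = stats.gamma.cdf(n + x * np.sqrt(n), a=n)
    D_n = np.max(np.abs(F_n - stats.norm.cdf(x)))
    print(f"n={n:5d}  D_n ~ {D_n:.4f}   bound = {C * rho / np.sqrt(n):.4f}")
```

The bound holds with plenty of room here; the worst case over all distributions is what pins down the constant.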

2. The Role of Moment Assumptions

The CLT itself only asks for $\mathbb{E}X^2 < \infty$; the rate of convergence depends on the behavior of higher moments.

Take $r = 2 + \delta$ for $\delta \in (0, 1]$. If $\mathbb{E}|X|^r < \infty$ you can write down the Lyapunov fraction

$$L_n = \frac{\sum_{i=1}^n \mathbb{E}|X_i|^r}{\left(\sum_{i=1}^n \mathbb{E}X_i^2\right)^{r/2}}$$

If $L_n \to 0$ the CLT holds, and the Berry-Esseen bound stretches out to match:

$$\sup_x |F_n(x) - \Phi(x)| \le C_\delta \cdot L_n$$

For i.i.d. variables $L_n = \frac{n\, \mathbb{E}|X|^r}{(n \sigma^2)^{r/2}} = O(n^{-\delta/2})$, and the behavior drops straight out of that.

  • With only $2.1$ moments ($\delta = 0.1$) the error drops as $n^{-0.05}$, which is painfully slow.
  • Three moments ($\delta = 1$) bring back the $n^{-1/2}$ rate.
  • Moments past the third do not help the worst-case Kolmogorov distance: the rate sits at $1/\sqrt{n}$ unless you restrict to smooth distributions and pull in Edgeworth expansions. The sketch below makes the gap concrete.
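A back-of-the-envelope sketch of what these exponents mean for sample size: ignoring constants, pushing $n^{-\delta/2}$ below a target $\varepsilon$ takes $n \gtrsim \varepsilon^{-2/\delta}$, which explodes as $\delta$ shrinks.

```python
# n needed so that n^(-delta/2) <= eps, i.e. n >= eps^(-2/delta).
# Constants ignored; this isolates the moment-driven exponent only.
for delta in [1.0, 0.5, 0.1]:
    for eps in [0.1, 0.01]:
        print(f"delta={delta:4.1f}  eps={eps:5.2f}  n >= {eps ** (-2 / delta):.2e}")
```

With $\delta = 0.1$ and $\varepsilon = 0.01$ the required $n$ is of order $10^{40}$, which is why "the CLT holds" and "the CLT is usable" are different claims.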

3. Esseen’s Smoothing Lemma

Bounding $\sup_x |F_n(x) - \Phi(x)|$ head-on is hard. Characteristic functions are easier to work with because the CF of a sum of independent variables factors as a product.

The wrinkle is that if $X$ is discrete then $F_n$ has no density and naive Fourier inversion blows up. The fix is to smooth both distributions by convolving them with a kernel whose CF is supported on $[-T, T]$, and then bound the approximation error that the smoothing drags in.

Smoothing Lemma (Esseen, 1945): Let $F$ and $G$ be CDFs with $G$ differentiable and $|G'| \le M$. Then for any $T > 0$,

$$\sup_x |F(x) - G(x)| \le \frac{1}{\pi} \int_{-T}^T \left| \frac{\varphi_F(t) - \varphi_G(t)}{t} \right| dt + \frac{24M}{\pi T}$$

The first term measures the Fourier-domain discrepancy; the second is the bias that smoothing drags in. The proof of Berry-Esseen comes down to holding each of these in check.
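As a numerical sanity check (a sketch, assuming NumPy/SciPy), both sides of the lemma can be evaluated for the standardized sum of $n$ Rademacher signs, whose CF is $\cos(t/\sqrt{n})^n$, against the standard normal with $M = \sup \Phi' = 1/\sqrt{2\pi}$; the exact Kolmogorov distance comes from the binomial CDF.

```python
import numpy as np
from scipy import stats

n, T = 100, 10.0
M = 1 / np.sqrt(2 * np.pi)              # sup |Phi'| for the standard normal

# Exact Kolmogorov distance for S_n = (sum of n Rademacher signs)/sqrt(n):
# check both one-sided limits of the step CDF at every atom.
k = np.arange(n + 1)
atoms = (2 * k - n) / np.sqrt(n)
F_right = stats.binom.cdf(k, n, 0.5)
F_left = np.concatenate([[0.0], F_right[:-1]])
Phi = stats.norm.cdf(atoms)
lhs = max(np.max(np.abs(F_right - Phi)), np.max(np.abs(F_left - Phi)))

# Right-hand side of the lemma: CF integral plus smoothing bias.
t = np.linspace(1e-6, T, 200001)
dt = t[1] - t[0]
cf_diff = np.abs(np.cos(t / np.sqrt(n)) ** n - np.exp(-t ** 2 / 2))
integral = 2 * np.sum(cf_diff / t) * dt          # integrand is even in t
rhs = integral / np.pi + 24 * M / (np.pi * T)

print(f"sup|F_n - Phi| = {lhs:.4f}  <=  RHS = {rhs:.4f}")
```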

4. Controlling the Characteristic Function

Now you actually put the smoothing lemma to work. Set $S_n = \frac{1}{\sqrt{n}} \sum X_i$ with $\mathbb{E}X = 0$, $\mathbb{E}X^2 = 1$, and $\mathbb{E}|X|^3 = \rho$. A third-order Taylor expansion of the CF gives

$$\varphi_X(t) = \mathbb{E}[e^{itX}] = 1 - \frac{t^2}{2} + \theta \frac{\rho |t|^3}{6}$$

with $|\theta| \le 1$. For the sum you raise this whole thing to the $n$-th power:

$$\varphi_n(t) = \left( \varphi_X\left(\frac{t}{\sqrt{n}}\right) \right)^n = \left( 1 - \frac{t^2}{2n} + \theta\, \frac{\rho |t|^3}{6 n^{3/2}} \right)^n$$

Write $\varphi_X(u) = 1 - \frac{u^2}{2} + R(u)$ with $|R(u)| \le \frac{\rho |u|^3}{6}$, and restrict to $|t| \le \frac{\sqrt{n}}{4\rho}$ so that $\left| \varphi_X\left(\frac{t}{\sqrt{n}}\right) - 1 \right| \le \frac{1}{2}$. In this range you can safely take logarithms using $\log(1+z) = z - \frac{z^2}{2} + O(|z|^3)$.

The log-characteristic function drops into a simple expression.

$$\log \varphi_n(t) = n \log \left( 1 - \frac{t^2}{2n} + O\left( \frac{\rho |t|^3}{n^{3/2}} \right) \right) = -\frac{t^2}{2} + O\left( \frac{\rho |t|^3}{\sqrt{n}} \right)$$

Exponentiating gives you the characteristic function as a Gaussian times a small correction.

$$\varphi_n(t) = e^{-t^2/2} \exp\left( O\left( \frac{\rho |t|^3}{\sqrt{n}} \right) \right)$$

Using the inequality $|e^z - 1| \le |z| e^{|z|}$ you pick out a clean Fourier-domain bound:

$$\left| \varphi_n(t) - e^{-t^2/2} \right| \le e^{-t^2/2}\, \frac{C \rho |t|^3}{\sqrt{n}}$$

Plugging into Esseen's integral with $T \approx \sqrt{n}/\rho$ gives the size of the first term:

$$\int_{-T}^T \frac{1}{|t|}\, e^{-t^2/2}\, \frac{\rho |t|^3}{\sqrt{n}}\, dt = \frac{\rho}{\sqrt{n}} \int_{-T}^{T} t^2 e^{-t^2/2}\, dt = O(\rho/\sqrt{n})$$

The bias term $\frac{24M}{\pi T}$ drops out at $O(\rho/\sqrt{n})$ too, since $T \propto \sqrt{n}$.

The characteristic function near the origin sets the convergence rate and the smoothing lemma turns Fourier-domain control into a uniform bound on CDFs.
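The $\sqrt{n}$ scaling of the CF error is easy to verify directly. Here is a sketch for centered $\mathrm{Exp}(1)$ summands, whose standardized CF has the closed form $\varphi_X(u) = e^{-iu}/(1 - iu)$: at fixed $t$, the error times $\sqrt{n}$ should stabilize near $|\kappa_3|\, |t|^3 e^{-t^2/2} / 6$.

```python
import numpy as np

def cf_n(t, n):
    # CF of S_n = (sum of n centered Exp(1) variables)/sqrt(n);
    # phi_X(u) = exp(-iu) / (1 - iu) for X = Exp(1) - 1.
    u = t / np.sqrt(n)
    return (np.exp(-1j * u) / (1 - 1j * u)) ** n

t = 1.0
target = (2.0 / 6) * t ** 3 * np.exp(-t ** 2 / 2)   # kappa_3 = 2 for Exp(1)
for n in [10, 100, 1000, 10000]:
    err = abs(cf_n(t, n) - np.exp(-t ** 2 / 2))
    print(f"n={n:6d}  sqrt(n)*|phi_n - gauss| = {np.sqrt(n) * err:.4f}"
          f"   (limit ~ {target:.4f})")
```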


5. Stein’s Method

Fourier methods lean on independence because the CF of a sum factors as a product; for dependent variables this breaks, and you need a different angle.

Stein's method (1972) swaps Fourier analysis for a functional equation. It defines the Stein operator $\mathcal{A} f(w) = f'(w) - w f(w)$.

If $Z \sim \mathcal{N}(0, 1)$ then $\mathbb{E}[\mathcal{A} f(Z)] = 0$ for every smooth $f$ with sensible growth conditions; this drops out of integration by parts against the Gaussian density.

To bound $|P(S_n \le x) - \Phi(x)|$ you set $h(w) = \mathbb{I}(w \le x)$ and solve the Stein equation

$$f'(w) - w f(w) = h(w) - \mathbb{E}h(Z)$$

The Berry-Esseen bound then boils down to holding $|\mathbb{E}[\mathcal{A} f(S_n)]|$ in check.

The solution $f$ for $h = \mathbb{I}_{(-\infty, x]}$ satisfies clean sup-norm bounds:

$$\|f\|_\infty \le \sqrt{2\pi}/4, \quad \|f'\|_\infty \le 1$$

(For absolutely continuous test functions $h$ one also gets $\|f''\|_\infty \le 2 \|h'\|_\infty$; for the indicator, $f'$ has a jump at $x$.)

These bounds are universal: they do not depend on the distribution of the $X_i$.

To work out $\mathbb{E}[S_n f(S_n)]$ you use a leave-one-out decomposition: write $S_n^{(i)}$ for the sum with the $i$-th term pulled out, so $S_n = S_n^{(i)} + X_i/\sqrt{n}$. Taylor expanding $f(S_n)$ around $S_n^{(i)}$ and leaning on the moment assumptions, each term in the sum contributes an error of order $\rho / n^{3/2}$, and summing over $i = 1, \ldots, n$ brings back the $\rho/\sqrt{n}$ rate.
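The pieces above can be checked numerically. A minimal sketch (assuming NumPy/SciPy) using the closed-form solution $f_x(w) = \sqrt{2\pi}\, e^{w^2/2}\, \Phi(\min(w,x))\, (1 - \Phi(\max(w,x)))$: it verifies the Stein equation away from the kink at $w = x$ and the $\sqrt{2\pi}/4$ sup-norm bound.

```python
import numpy as np
from scipy import stats

def stein_solution(w, x):
    # f(w) = sqrt(2*pi) * e^{w^2/2} * Phi(min(w, x)) * (1 - Phi(max(w, x)))
    lo, hi = np.minimum(w, x), np.maximum(w, x)
    return np.sqrt(2 * np.pi) * np.exp(w ** 2 / 2) \
        * stats.norm.cdf(lo) * stats.norm.sf(hi)

x = 0.7
w = np.linspace(-8, 8, 200001)
f = stein_solution(w, x)

# Residual of f'(w) - w f(w) = 1{w<=x} - Phi(x), away from the kink at w = x.
fprime = np.gradient(f, w)
residual = fprime - w * f - ((w <= x) - stats.norm.cdf(x))
mask = np.abs(w - x) > 0.01
print("max Stein-equation residual:", np.max(np.abs(residual[mask])))
print("||f||_inf =", f.max(), " vs bound", np.sqrt(2 * np.pi) / 4)
```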

6. Berry-Esseen under Dependency Graphs

Stein’s method stretches cleanly into settings with local dependence.

Suppose $\{X_i\}_{i=1}^n$ is not independent, but each $X_i$ depends only on a bounded neighborhood. Define a dependency graph $L = (V, E)$ with $V = \{1, \dots, n\}$ such that any two sets of vertices with no connecting edge index mutually independent collections of variables; in particular, $\{i, j\} \notin E$ forces $X_i$ and $X_j$ to be independent.

Let $D$ be the maximum degree of the graph.

Berry-Esseen for dependency graphs:

$$\sup_x |F_n(x) - \Phi(x)| \le C\, \frac{D^2 \sum_{i=1}^n \mathbb{E}|X_i|^3}{\sigma^3}$$

where $\sigma^2 = \mathrm{Var}\left(\sum_i X_i\right)$ and $F_n$ is the CDF of $\sum_i X_i / \sigma$.

For a $k$-dependent sequence like $X_i = f(Y_i, \dots, Y_{i+k})$ you get $D = 2k$, and the rate sits at $O(1/\sqrt{n})$ as long as $D$ is bounded; the bound stays useful for slowly growing $D$ as well. This covers important examples like 1D Ising models and functionals of Markov chains; a simulation sketch follows.
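A simulation sketch for the simplest case, a 1-dependent moving average $X_i = (Y_i + Y_{i+1})/\sqrt{2}$ built from centered $\mathrm{Exp}(1)$ innovations (so $D$ is constant and $\mathrm{Var}(\sum X_i) = 2n - 1$): the empirical Kolmogorov error should roughly halve each time $n$ quadruples, up to Monte Carlo noise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def kolmogorov_error(n, reps=50000):
    # 1-dependent sequence: X_i = (Y_i + Y_{i+1})/sqrt(2), Y_j = Exp(1) - 1.
    Y = rng.exponential(1, size=(reps, n + 1)) - 1
    X = (Y[:, :-1] + Y[:, 1:]) / np.sqrt(2)
    S = X.sum(axis=1) / np.sqrt(2 * n - 1)     # Var(sum X_i) = 2n - 1
    ecdf = np.arange(1, reps + 1) / reps
    return np.max(np.abs(ecdf - stats.norm.cdf(np.sort(S))))

for n in [9, 36, 144]:
    print(f"n={n:4d}  D_n ~ {kolmogorov_error(n):.4f}")
```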


7. Edgeworth Expansions

Berry-Esseen gives you $O(n^{-1/2})$; for continuous distributions you can push further by working out explicit correction terms.

Let $\kappa_j$ be the $j$-th cumulant of $X$, and Taylor expand the cumulant generating function of the standardized sum:

$$\log \varphi_n(t) = \sum_{j=2}^\infty \frac{\kappa_j}{n^{j/2-1}} \frac{(it)^j}{j!} = -\frac{t^2}{2} + \frac{\kappa_3 (it)^3}{6\sqrt{n}} + \frac{\kappa_4 (it)^4}{24n} + \dots$$

Then you exponentiate the whole thing.

$$\varphi_n(t) = e^{-t^2/2} \left[ 1 + \frac{\kappa_3 (it)^3}{6\sqrt{n}} + \left( \frac{\kappa_4 (it)^4}{24n} + \frac{\kappa_3^2 (it)^6}{72n} \right) + \dots \right]$$

Invert term-by-term using Hermite polynomials, $He_k(x)\, \phi(x) = (-1)^k \frac{d^k}{dx^k} \phi(x)$, and the CDF drops out:

$$F_n(x) = \Phi(x) - \frac{\kappa_3}{6\sqrt{n}} (x^2 - 1) \phi(x) - \dots$$

This expansion needs Cramér's condition: $\limsup_{|t| \to \infty} |\varphi_X(t)| < 1$.

Any distribution with a non-zero absolutely continuous component satisfies Cramér's condition; lattice distributions do not. For Bernoulli variables the Kolmogorov error sits at $O(1/\sqrt{n})$ and the Edgeworth series does not actually converge to the discrete CDF.
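Since the sum of $n$ $\mathrm{Exp}(1)$ variables is exactly $\mathrm{Gamma}(n, 1)$, the one-term Edgeworth claim can be checked against the exact CDF; in the sketch below, the plain normal error should shrink like $1/\sqrt{n}$ while the corrected error shrinks like $1/n$.

```python
import numpy as np
from scipy import stats

kappa3 = 2.0                      # third cumulant of Exp(1)
x = np.linspace(-4, 4, 2001)

for n in [10, 40, 160]:
    exact = stats.gamma.cdf(n + x * np.sqrt(n), a=n)   # exact CDF of S_n
    plain = stats.norm.cdf(x)
    edge = plain - kappa3 / (6 * np.sqrt(n)) * (x ** 2 - 1) * stats.norm.pdf(x)
    print(f"n={n:4d}  CLT error={np.max(np.abs(exact - plain)):.5f}  "
          f"Edgeworth error={np.max(np.abs(exact - edge)):.6f}")
```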

8. Numerical Comparison: Lattice vs. Continuous Distributions

The simulation below shows the qualitative difference between the lattice and continuous cases: the staircase error pattern that rides along with discrete distributions, set against the smooth Edgeworth correction for continuous ones.

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def compare_convergence(n=20):
    # 1. Lattice case: symmetric +/-1 signs (Rademacher)
    # 2. Continuous case: centered Exponential(1)
    x = np.linspace(-3, 3, 500)

    # Gaussian density and CDF
    phi_x = stats.norm.pdf(x)
    Phi_x = stats.norm.cdf(x)

    # One-term Edgeworth correction (third cumulant is 2 for Exp(1))
    rho_exp = 2.0
    edgeworth_exp = Phi_x - (rho_exp / (6 * np.sqrt(n))) * (x**2 - 1) * phi_x

    # Empirical results
    samples = 100000

    # Lattice sum
    X_bern = np.random.choice([-1, 1], size=(samples, n))
    S_bern = np.sum(X_bern, axis=1) / np.sqrt(n)

    # Centered exponential sum
    X_exp = np.random.exponential(1, size=(samples, n)) - 1
    S_exp = np.sum(X_exp, axis=1) / np.sqrt(n)

    plt.figure(figsize=(12, 5))

    # Empirical CDF minus Gaussian CDF: staircase error in the lattice case
    plt.subplot(1, 2, 1)
    plt.step(np.sort(S_bern),
             np.linspace(0, 1, samples) - stats.norm.cdf(np.sort(S_bern)),
             label='Bernoulli Error')
    plt.title('Lattice Error (Staircase)')
    plt.legend()

    # Smooth error tracking the Edgeworth prediction in the continuous case
    plt.subplot(1, 2, 2)
    plt.plot(np.sort(S_exp),
             np.linspace(0, 1, samples) - stats.norm.cdf(np.sort(S_exp)),
             label='Empirical Exp Error', alpha=0.5)
    plt.plot(x, edgeworth_exp - Phi_x, 'r--', label='Edgeworth Prediction')
    plt.title('Continuous Error (Edgeworth Prediction)')
    plt.legend()
    # plt.show()
```

9. The Local Central Limit Theorem

Berry-Esseen bounds the CDF error; you can also ask about convergence of the density $f_n(x)$ to the Gaussian density $\phi(x)$.

Theorem (Local Limit): If the CF $\varphi(t)$ is integrable, with $\int_{-\infty}^\infty |\varphi(t)|^r dt < \infty$ for some $r \ge 1$, then for $n \ge r$ the density $f_n(x)$ exists, and (given the finite third moment assumed throughout) the sup error drops as $1/\sqrt{n}$:

$$\sup_{x \in \mathbb{R}} |f_n(x) - \phi(x)| = O\left( \frac{1}{\sqrt{n}} \right)$$

The integrability condition forces the distribution to have a density with controlled oscillation, and it rules out lattice distributions, whose mass concentrates in spikes rather than a density.

For lattice distributions the mass of $S_n$ sits only on the lattice points, and the local CLT takes a different form:

$$\frac{\sqrt{n}}{h} P(S_n = x) = \phi(x) + o(1)$$

with $h$ the lattice span. The rate is the same, but the interpretation shifts: this controls pointwise probability mass instead of CDF error.
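For Rademacher signs the unnormalized sum lives on a lattice of span $h = 2$, so the lattice local CLT can be checked exactly with binomial probabilities (a sketch):

```python
import numpy as np
from scipy import stats

h = 2  # lattice span of the unnormalized sum of Rademacher signs

for n in [16, 64, 256, 1024]:
    k = np.arange(n + 1)
    x = (2 * k - n) / np.sqrt(n)            # atoms of the normalized sum
    pmf = stats.binom.pmf(k, n, 0.5)        # P(sum = 2k - n)
    err = np.max(np.abs(np.sqrt(n) / h * pmf - stats.norm.pdf(x)))
    print(f"n={n:5d}  sup |sqrt(n)/h * P - phi| = {err:.5f}")
```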

10. Applications

Value-at-Risk and Gaussian Tail Approximations

In quantitative finance, Value-at-Risk at level $\alpha$ is the $\alpha$-quantile of the loss distribution; under a Gaussian assumption, $VaR_\alpha = \mu + \Phi^{-1}(\alpha) \sigma$. Berry-Esseen pins down the error in this approximation for a portfolio of $n$ assets with skewed returns.

The correction uses the Cornish-Fisher expansion, obtained by inverting the Edgeworth series. Let $z_\alpha = \Phi^{-1}(\alpha)$; the corrected quantile drops out like this:

$$q_\alpha = z_\alpha + \frac{1}{\sqrt{n}} \left[ \frac{\kappa_3}{6}(z_\alpha^2 - 1) \right] + O(n^{-1})$$

If $\kappa_3 < 0$ (negative skew, typical for equities), the correction term is negative for tail risks, since $z_\alpha^2 > 1$ at $\alpha = 0.01$, and the true loss quantile runs past the Gaussian prediction.

For $\alpha = 0.01$ you have $z_{0.01} \approx -2.33$, and with skewness $\kappa_3 = -1$ and $n = 50$ the corrected quantile sits around 2.43 standard deviations while the Gaussian model reports 2.33. This gap is exactly where capital inadequacy gets born.
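The arithmetic is a one-liner; here is a sketch wrapping the one-term Cornish-Fisher correction as a function (the parameters $\kappa_3 = -1$, $n = 50$, $\alpha = 0.01$ are the ones from the example above).

```python
from scipy import stats

def cornish_fisher_quantile(alpha, kappa3, n):
    # One-term Cornish-Fisher correction to the Gaussian quantile.
    z = stats.norm.ppf(alpha)
    return z + (kappa3 / 6) * (z ** 2 - 1) / n ** 0.5

z = stats.norm.ppf(0.01)
q = cornish_fisher_quantile(0.01, kappa3=-1.0, n=50)
print(f"Gaussian: {z:.3f}   skew-corrected: {q:.3f}")
# Gaussian: -2.326   skew-corrected: -2.430
```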

A/B Testing

The standard sample size formula $N \approx 16 \sigma^2 / \Delta^2$ assumes Gaussian errors. If the metric is highly skewed, like time-on-site, the actual p-value can drift off the nominal one by up to $D_n$. When $D_n \approx 0.05$ and the target significance level is $\alpha = 0.05$, the effective Type-I error rate can double, and compensating forces you to oversample by a factor of roughly $\left(1 + \frac{C \rho}{\alpha \sigma^3 \sqrt{n}}\right)$.


11. Non-Uniform Bounds

The standard Berry-Esseen bound is uniform in $x$: the same error budget applies everywhere. In reality the approximation is much better out in the tails.

Bikelis (1966):

$$|F_n(x) - \Phi(x)| \le \frac{C \rho}{\sqrt{n} (1 + |x|^3)}$$

At $x = 3$ this is roughly $1/28$ of the error at $x = 0$, so in absolute terms you can trust the CLT approximation a lot more out in the tails than at the center, even at moderate $n$. Note that the decay rate $(1 + |x|^3)^{-1}$ is polynomial, slower than the Gaussian tail $e^{-x^2/2}$, so the relative error of a tail probability can still be large even while the absolute error shrinks.


12. Summary

Berry-Esseen is a quantitative check on finite-sample Gaussian approximations; for moderate $n$ the tails of the true distribution can sit pretty far from the Gaussian.

If a risk system treats a 5-sigma event as having probability $10^{-7}$, Berry-Esseen says that at $n = 100$ the error in that tail probability can run as big as $10^{-2}$, and the $1/\sqrt{n}$ rate puts a hard floor on how accurate the normal approximation can get.


Historical notes

The story starts with Adolphe Quetelet in 1835, who introduced the Average Man and first spotted Gaussian distributions in nature. More than a century later, Andrew C. Berry (1941) and Carl-Gustav Esseen (1942) independently proved the original Berry-Esseen bound, with Esseen setting down the smoothing lemma that is still the central technical piece. A. Bikelis worked out non-uniform bounds for tail probabilities in 1966 and showed that the CLT approximation actually gets better, in absolute terms, out in the tails. Charles Stein rolled out his method in 1972 and opened the door to bounds for dependent random variables. Andrew Barron proved the entropic CLT for KL-divergence in 1986, measuring convergence in an information-theoretic metric. Friedrich Götze pinned down multivariate bounds for convex sets in 1991, with error scaling as $d^{1/4}$. The universal constant was tightened to $0.4748$ by Irina Shevtsova in 2011, and Koltchinskii and Lounici proved matrix Berry-Esseen bounds in 2017.

Characteristic Function (CF) is the Fourier transform of a distribution, $\varphi(t) = \int e^{itx} dF(x)$. The CF of a sum of independent variables factors as a product, which is why Fourier methods feel natural for the CLT.

Cumulants ($\kappa_j$) are the coefficients of the Taylor expansion of the log-CF. The third cumulant $\kappa_3$ equals the skewness for unit-variance variables, and it drives the leading correction in the Edgeworth expansion.

Dependency Graph is a graph that encodes which random variables may be dependent; Stein's method uses this structure to bound the normal approximation error for sums of dependent variables.

Hermite Polynomials are orthogonal polynomials against the Gaussian weight and they show up as correction terms in the Edgeworth series.

Kolmogorov Distance is the sup-norm distance between CDFs, $D(F, G) = \sup_x |F(x) - G(x)|$; this is the metric Berry-Esseen holds in check.


Proof Sketch of the Smoothing Lemma

This is the central technical piece in Berry-Esseen and it turns Fourier-domain control of characteristic functions into uniform bounds on CDFs.

Let $\Delta(x) = F(x) - G(x)$ and $\eta = \sup_x |\Delta(x)|$. Take a smoothing kernel $k_T$ whose CF is supported on $[-T, T]$, with CDF $K_T$, and convolve:

$$\Delta_T(x) = \frac{1}{2\pi} \int_{-T}^T e^{-itx}\, \frac{\varphi_F(t) - \varphi_G(t)}{-it}\, \varphi_K(t)\, dt$$

Bounding this integral gives you the first term of the lemma.

The second term ($24M/\pi T$) comes from the smoothing error $|\Delta_T(x) - \Delta(x)|$: since $G$ is $M$-Lipschitz, for any shift $a$ you get

$$|\Delta(a)| \le \max_b |\Delta_T(a+b)| + M \cdot \text{width}$$

where the width is that of the smoothing kernel. Using the Fejér or de la Vallée Poussin kernel gives the specific constant $24/\pi$, and the remaining calculation is mostly bookkeeping with the kernel's moments and support. See Petrov (1975), Chapter V for the full argument.


Pathologies and Discrete Corrections

Infinite Moments

If $\mathbb{E}X^2 = \infty$, as for Pareto with $\alpha < 2$, the CLT fails entirely: the normalized sum converges to a stable distribution instead, $n^{-1/\alpha} S_n \to L_\alpha$, and Berry-Esseen does not apply in this regime.

High Dimensions

In $\mathbb{R}^d$ the uniform error over convex sets scales as $d^{1/4} n^{-1/2}$. Over arbitrary Borel sets the situation is much worse: the error can be as large as 1, since the discrete and Gaussian distributions can be mutually singular. The Gaussian probability of a hypercube boundary is negligible, while the discrete distribution may place substantial mass there.

Discrete Corrections

Lattice distributions hit the worst-case Berry-Esseen error. The standard remedy is the continuity correction: when approximating $P(S_n \le k)$ for integer $k$, use $P(Z \le k + 0.5)$. This half-integer shift compensates for the staircase structure of the discrete CDF, and in the language of Edgeworth expansions it lines up with the Euler-Maclaurin correction.

For Bernoulli sums the continuity correction drops the error from $O(1/\sqrt{n})$ to $O(1/n)$.
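A quick check with $\mathrm{Binomial}(n, 1/2)$ sums (a sketch): evaluating the normal CDF at the half-integer point cuts the worst-case error by roughly a factor of $\sqrt{n}$.

```python
import numpy as np
from scipy import stats

for n in [25, 100, 400]:
    k = np.arange(n + 1)
    exact = stats.binom.cdf(k, n, 0.5)
    mu, sigma = n / 2, np.sqrt(n / 4)
    plain = stats.norm.cdf((k - mu) / sigma)            # no correction
    corrected = stats.norm.cdf((k + 0.5 - mu) / sigma)  # half-integer shift
    print(f"n={n:4d}  plain={np.max(np.abs(exact - plain)):.4f}  "
          f"corrected={np.max(np.abs(exact - corrected)):.5f}")
```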


The Universal Constant

The Berry-Esseen bound $C \rho / \sqrt{n}$ needs a concrete value of $C$ to be useful in practice, and sharpening this constant has been a sustained effort over eight decades.

  • Esseen (1942): $C \approx 7.59$
  • van Beek (1972): $0.7975$
  • Shiganov (1986): $0.7655$
  • Shevtsova (2011): $C < 0.4748$ (current record)
| Assumption | Best known $C$ | Source |
|---|---|---|
| General (independent) | 0.4748 | Shevtsova (2011) |
| Symmetric variables | 0.4097 | Shevtsova (2010) |
| i.i.d. symmetric | 0.4097 | Tyurin (2010) |
| Lower bound (Bernoulli) | 0.3989 | Esseen (1956) |
| Non-uniform, general | 25.5 | Paditz (1989) |

The gap between 0.47 and 0.40 matters in safety-critical engineering, like nuclear and aerospace, where Gaussian approximations underwrite failure probability guarantees below $10^{-6}$; if $C$ runs too large, the required sample sizes become impractical.


High-Dimensional and Matrix Extensions

Multivariate

In Rd\mathbb{R}^d the natural question is convergence of probabilities over sets.

Götze (1991): For i.i.d. vectors $X_i \in \mathbb{R}^d$ with identity covariance,

$$\sup_{A \in \mathcal{C}} |P(S_n \in A) - P(Z \in A)| \le \frac{c \cdot d^{1/4}\, \mathbb{E}\|X\|^3}{\sqrt{n}}$$

where $\mathcal{C}$ is the class of convex sets; the $d^{1/4}$ factor is the cost of cranking up dimension.

Matrix Berry-Esseen

For $n$ independent symmetric random matrices $X_i$ with $\|X_i\|_{\text{op}} \le M$ and covariance $\Sigma = \mathbb{E}[X_i^2]$, the relevant quantity is the effective rank

$$r(\Sigma) = \frac{\text{Tr}(\Sigma)}{\|\Sigma\|_{\text{op}}}$$

Koltchinskii-Lounici (2017):

$$\sup_{t \in \mathbb{R}} |P(\|S_n\|_{\text{op}} \le t) - P(\|Z\|_{\text{op}} \le t)| \le C \sqrt{\frac{r(\Sigma)}{n}}$$

The convergence rate depends on $r(\Sigma)$ rather than $d$, and the bound is the natural matrix analogue of Berry-Esseen, with the effective rank playing the role that the third moment plays in the scalar case.
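Effective rank is cheap to compute; here is a sketch with a made-up spiked covariance, where $r(\Sigma)$ stays small even though the ambient dimension is large.

```python
import numpy as np

def effective_rank(Sigma):
    # r(Sigma) = Tr(Sigma) / ||Sigma||_op
    return np.trace(Sigma) / np.linalg.norm(Sigma, ord=2)

d = 500
# Spiked covariance: a few strong directions over a weak isotropic floor.
Sigma = np.diag(np.concatenate([[25.0, 9.0, 4.0], np.full(d - 3, 0.01)]))
print(f"d = {d},  effective rank r(Sigma) = {effective_rank(Sigma):.2f}")
```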


The Entropic CLT

Berry-Esseen measures convergence in Kolmogorov distance and from an information-theoretic angle relative entropy is a more natural metric.

Let $D(f_n \| \phi) = \int f_n \log(f_n / \phi)\, dx$.

Barron (1986): If $D(f_1 \| \phi) < \infty$ then $D(f_n \| \phi) \to 0$.

Under log-Sobolev conditions the rate is $O(1/n)$, faster than the $1/\sqrt{n}$ Kolmogorov rate.


References

1. Esseen, C. G. (1942). “On the Liapounoff limit of error in the theory of probability”. The original paper deriving the bound via the smoothing lemma.

2. Stein, C. (1972). "A bound for the error in the normal approximation to the distribution of a sum of dependent random variables". Introduced Stein's method: a completely different approach that swaps Fourier analysis for functional equations.

3. Chen, L. H. Y., Goldstein, L., & Shao, Q. M. (2011). “Normal Approximation by Stein’s Method”. The modern textbook on Stein’s method.

4. Shevtsova, I. G. (2011). "On the absolute constant in the Berry-Esseen inequality". The paper that proved the current record for the constant ($0.4748$).

5. Bentkus, V. (2003). "On the dependence of the Berry-Esseen bound on dimension". The definitive result on the $d^{1/4}$ rate.

6. Barron, A. R. (1986). “Entropy and the Central Limit Theorem”. The foundational paper for the entropic version of the CLT.