Stein's Paradox & Empirical Bayes
1. The sample mean in high dimensions
Suppose you observe $X \sim \mathcal{N}(\theta, 1)$ in one dimension and you want a guess for $\theta$. The natural guess is just $X$ itself, and no other estimator beats it across the board.
In higher dimensions the picture changes sharply. Suppose you observe $X \sim \mathcal{N}(\theta, I_d)$ in dimension $d$ and you try the same guess $\hat\theta = X$. It has a built-in flaw: it systematically overshoots.
By the law of large numbers, as $d \to \infty$:
$$ \frac{1}{d}\,\|X - \theta\|^2 \;\longrightarrow\; 1. $$
Meanwhile, the squared norm of $X$ satisfies:
$$ E\,\|X\|^2 \;=\; \|\theta\|^2 + d. $$
So $X$ sits on a shell of radius about $\sqrt{\|\theta\|^2 + d}$, which is much farther from the origin than $\theta$ is. The noise $X - \theta$ contributes about $d$ to the squared norm, the cross term $2\,\theta^\top(X - \theta)$ is only $O(\|\theta\|\sqrt{d})$, and the noise lands in directions nearly orthogonal to $\theta$; this is concentration of measure in high dimensions.
The obvious fix is to pull $X$ back toward the origin, and Stein proved in 1956 that this fix is not just nice to have but actually required.
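A quick Monte Carlo check of the shell geometry (a minimal sketch; the dimension, the choice of $\theta$, and the seed are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
theta = np.full(d, 0.5)                        # ||theta||^2 = 250, an arbitrary choice
X = rng.standard_normal((10_000, d)) + theta   # X ~ N(theta, I_d)

# ||X||^2 concentrates around ||theta||^2 + d, not ||theta||^2
print(np.mean(np.sum(X**2, axis=1)))           # ~ 1250
print(np.sum(theta**2) + d)                    # exactly 1250

# The noise is nearly orthogonal to theta: the cross term is O(sqrt(d)),
# tiny compared to the noise's squared norm, which is O(d).
Z = X - theta
print(np.mean(np.abs(Z @ theta)))              # ~ 13, small relative to d
print(np.mean(np.sum(Z**2, axis=1)))           # ~ d = 1000
```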
2. The James-Stein estimator
Set up the risk under squared error loss:
$$ R(\theta, \hat\theta) \;=\; E_\theta\,\|\hat\theta(X) - \theta\|^2. $$
For $\hat\theta_{\mathrm{MLE}} = X$ the risk is $d$ no matter what $\theta$ is.
The James-Stein estimator uses a shrinkage factor that depends on the data:
$$ \hat\theta_{\mathrm{JS}} \;=\; \left(1 - \frac{d-2}{\|X\|^2}\right) X. $$
Theorem (Stein's Paradox). For all $d \ge 3$ and all $\theta \in \mathbb{R}^d$,
$$ R(\theta, \hat\theta_{\mathrm{JS}}) \;<\; R(\theta, X) \;=\; d. $$
The improvement is largest near the origin, and at $\theta = 0$ we have
$$ R(0, \hat\theta_{\mathrm{JS}}) \;=\; 2. $$
In dimension $d = 100$ the MLE has risk 100, James-Stein has risk 2 near the origin, and that is a 98 percent reduction.
To see why shrinkage is the natural move geometrically, think about the ideal or oracle shrinkage factor. Given $\theta$, the scalar $c$ that makes $E\,\|cX - \theta\|^2$ smallest is
$$ c^{*} \;=\; \frac{\|\theta\|^2}{\|\theta\|^2 + d} \;=\; 1 - \frac{d}{\|\theta\|^2 + d}. $$
The oracle shrinkage factor matches the James-Stein form: the unknown $\|\theta\|^2 + d$ gets replaced by its unbiased estimate $\|X\|^2$, and $d$ gets swapped for $d - 2$ to handle the estimation error introduced by putting a random quantity in the denominator.
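A direct implementation of the estimator (a minimal sketch; the function name `james_stein` is mine, and unit noise variance is assumed as in the setup above):

```python
import numpy as np

def james_stein(x):
    """James-Stein estimate of the mean of N(theta, I_d) from a single draw x."""
    d = x.shape[0]
    shrinkage = 1.0 - (d - 2) / np.sum(x**2)
    return shrinkage * x

# Example: true mean at the origin in d = 50
rng = np.random.default_rng(1)
x = rng.standard_normal(50)
print(np.sum(x**2))                      # MLE squared error, roughly 50
print(np.sum(james_stein(x)**2))         # JS squared error, much smaller
```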
3. Stein’s Lemma
To work out the risk of a nonlinear estimator without knowing $\theta$ we need the following identity.
Lemma (Stein, 1981). Let $X \sim \mathcal{N}(\theta, I_d)$ and let $g : \mathbb{R}^d \to \mathbb{R}^d$ be weakly differentiable with $E\big|\partial g_i(X)/\partial x_i\big| < \infty$ for each $i$. Then
$$ E\big[(X - \theta)^\top g(X)\big] \;=\; E\big[\nabla \cdot g(X)\big] \;=\; E\!\left[\sum_{i=1}^{d} \frac{\partial g_i}{\partial x_i}(X)\right]. $$
Proof. We go component by component. Because the coordinates are independent, the joint density factors as $\prod_{j=1}^{d} \phi(x_j - \theta_j)$, where $\phi$ is the standard normal density, and each component can be handled on its own.
For the $i$-th component,
$$ E\big[(X_i - \theta_i)\, g_i(X)\big] \;=\; \int_{\mathbb{R}^d} (x_i - \theta_i)\, g_i(x) \prod_{j=1}^{d} \phi(x_j - \theta_j)\, dx. $$
Pull out the integral over $x_i$,
$$ \int_{\mathbb{R}^{d-1}} \left[\, \int_{\mathbb{R}} (x_i - \theta_i)\,\phi(x_i - \theta_i)\, g_i(x)\, dx_i \right] \prod_{j \neq i} \phi(x_j - \theta_j)\, dx_{j \neq i}. $$
The Gaussian kernel obeys $\phi'(z) = -z\,\phi(z)$, and so $(x_i - \theta_i)\,\phi(x_i - \theta_i) = -\frac{\partial}{\partial x_i}\phi(x_i - \theta_i)$. Plugging this in gives
$$ \int_{\mathbb{R}} g_i(x)\left( -\frac{\partial}{\partial x_i}\phi(x_i - \theta_i) \right) dx_i. $$
Integrating by parts with $u = g_i(x)$ and $dv = -\phi'(x_i - \theta_i)\, dx_i$,
$$ \Big[ -g_i(x)\,\phi(x_i - \theta_i) \Big]_{x_i = -\infty}^{\infty} \;+\; \int_{\mathbb{R}} \frac{\partial g_i}{\partial x_i}(x)\,\phi(x_i - \theta_i)\, dx_i. $$
The boundary term vanishes whenever $g_i$ grows slower than $e^{x_i^2/2}$, which covers every polynomial-growth estimator, and we are left with
$$ E\big[(X_i - \theta_i)\, g_i(X)\big] \;=\; E\!\left[ \frac{\partial g_i}{\partial x_i}(X) \right]. $$
Summing over $i$ gives
$$ E\big[(X - \theta)^\top g(X)\big] \;=\; E\!\left[\sum_{i=1}^{d} \frac{\partial g_i}{\partial x_i}(X)\right]. \qquad \square $$
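A Monte Carlo check of the identity (a sketch; the test function $g(x) = x/(1 + \|x\|^2)$, the dimension, $\theta$, and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 1_000_000
theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = rng.standard_normal((n, d)) + theta

# g(x) = x / (1 + ||x||^2), a smooth, bounded vector field
norm_sq = np.sum(X**2, axis=1, keepdims=True)
g = X / (1 + norm_sq)

# Left side: E[(X - theta)^T g(X)]
lhs = np.mean(np.sum((X - theta) * g, axis=1))

# Right side: E[div g(X)], with div g = d/(1+||x||^2) - 2||x||^2/(1+||x||^2)^2
s = norm_sq[:, 0]
rhs = np.mean(d / (1 + s) - 2 * s / (1 + s)**2)

print(lhs, rhs)   # the two Monte Carlo averages closely agree
```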
4. Deriving the James-Stein estimator
Write $\hat\theta(X) = X + g(X)$ and open up the risk,
$$ E\,\|\hat\theta - \theta\|^2 \;=\; E\,\|X - \theta\|^2 \;+\; 2\,E\big[(X - \theta)^\top g(X)\big] \;+\; E\,\|g(X)\|^2. $$
The first term equals $d$, and by Stein's Lemma the cross term equals $2\,E[\nabla \cdot g(X)]$, and so
$$ E\,\|\hat\theta - \theta\|^2 \;=\; d \;+\; E\big[\, 2\,\nabla \cdot g(X) + \|g(X)\|^2 \,\big]. $$
The quantity $d + 2\,\nabla \cdot g(X) + \|g(X)\|^2$ is Stein's Unbiased Risk Estimate, known as SURE. To beat the MLE we need a nonzero $g$ that makes $2\,\nabla \cdot g + \|g\|^2$ negative.
Try $g(x) = -\dfrac{c}{\|x\|^2}\, x$ for a constant $c > 0$. It blows up at $x = 0$, but that event has probability zero under a continuous distribution.
To work out the divergence, note that since $g_i(x) = -c\, x_i / \|x\|^2$, the partial derivative is
$$ \frac{\partial g_i}{\partial x_i} \;=\; -c \left( \frac{1}{\|x\|^2} - \frac{2 x_i^2}{\|x\|^4} \right). $$
Adding over all the components gives
$$ \nabla \cdot g(x) \;=\; -c\,\frac{d - 2}{\|x\|^2}. $$
The squared norm of $g$ is
$$ \|g(x)\|^2 \;=\; \frac{c^2}{\|x\|^2}. $$
Plugging into SURE gives
$$ \mathrm{SURE}(x) \;=\; d \;-\; \frac{2c(d-2) - c^2}{\|x\|^2}. $$
The risk improvement is quadratic in $c$ and its minimum sits at $c = d - 2$, which gives
$$ E\,\|\hat\theta_{\mathrm{JS}} - \theta\|^2 \;=\; d \;-\; (d-2)^2\, E\!\left[ \frac{1}{\|X\|^2} \right]. $$
For $d \ge 3$ the correction term is strictly negative, and so
$$ E\,\|\hat\theta_{\mathrm{JS}} - \theta\|^2 \;<\; d \quad \text{for every } \theta. $$
When $\theta = 0$ the distribution $\|X\|^2 \sim \chi^2_d$ gives $E\big[1/\|X\|^2\big] = 1/(d-2)$, and so
$$ E_0\,\|\hat\theta_{\mathrm{JS}}\|^2 \;=\; d - (d - 2) \;=\; 2. $$
The paradox is that a single estimator beats the MLE simultaneously, at every single value of the parameter.
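A quick check that the SURE expression $d - (d-2)^2/\|X\|^2$, which never looks at $\theta$, tracks the actual James-Stein risk (a sketch; the dimension, $\theta$, and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 20, 200_000
theta = np.zeros(d)
theta[0] = 3.0
X = rng.standard_normal((n, d)) + theta

norm_sq = np.sum(X**2, axis=1)
js = (1 - (d - 2) / norm_sq)[:, None] * X

actual_risk = np.mean(np.sum((js - theta)**2, axis=1))   # uses the true theta
sure = np.mean(d - (d - 2)**2 / norm_sq)                  # uses only the data

print(actual_risk, sure)   # close to each other, and both well below the MLE risk of d = 20
```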
5. The Bayesian derivation
SURE shows the risk reduction is real; the Bayesian view explains why the estimator has this exact shape.
Put a Gaussian prior on $\theta$.
Level 1: $X \mid \theta \sim \mathcal{N}(\theta, I_d)$
Level 2: $\theta \sim \mathcal{N}(0, \tau^2 I_d)$
By conjugacy the posterior is Gaussian and the precisions add,
$$ \theta \mid X \;\sim\; \mathcal{N}\!\left( \frac{\tau^2}{1 + \tau^2}\, X,\; \frac{\tau^2}{1 + \tau^2}\, I_d \right). $$
The posterior mean is
$$ E[\theta \mid X] \;=\; \frac{\tau^2}{1 + \tau^2}\, X \;=\; \left( 1 - \frac{1}{1 + \tau^2} \right) X. $$
Writing $B = \dfrac{1}{1 + \tau^2}$, the Bayes estimate is $\hat\theta_{\mathrm{Bayes}} = (1 - B)\, X$, and that is linear shrinkage toward the origin.
The snag is that $\tau^2$ is unknown, but the marginal distribution gives us enough to pin down $B$ from the data,
$$ X \;\sim\; \mathcal{N}\big(0,\; (1 + \tau^2)\, I_d\big), \qquad \|X\|^2 \;\sim\; (1 + \tau^2)\,\chi^2_d. $$
Since $E\big[1/\chi^2_d\big] = 1/(d-2)$, the quantity $\dfrac{d-2}{\|X\|^2}$ is an unbiased estimator of $B$, and
$$ E\!\left[ \frac{d-2}{\|X\|^2} \right] \;=\; \frac{1}{1 + \tau^2} \;=\; B. $$
Plugging this empirical estimate of the shrinkage factor into the Bayes rule gives you the James-Stein estimator exactly.
The empirical Bayes story makes clear what the estimator is doing: it learns the signal-to-noise ratio straight from the data. When $\|X\|^2$ is big compared to $d - 2$, the data says there is strong signal and the estimator barely shrinks; when $\|X\|^2$ is small, the data is mostly noise and the estimator shrinks hard.
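A small simulation of the two-level model, checking that $(d-2)/\|X\|^2$ recovers the true shrinkage factor $B = 1/(1+\tau^2)$ on average (a sketch; the values of $d$, $\tau^2$, and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
d, tau2, n = 30, 4.0, 100_000

# Level 2: theta ~ N(0, tau^2 I);  Level 1: X | theta ~ N(theta, I)
theta = np.sqrt(tau2) * rng.standard_normal((n, d))
X = theta + rng.standard_normal((n, d))

B_hat = (d - 2) / np.sum(X**2, axis=1)   # empirical shrinkage factor, one per draw

print(np.mean(B_hat))    # ~ 1 / (1 + tau^2) = 0.2
print(1 / (1 + tau2))    # 0.2
```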
6. Admissibility and Brownian motion
There is a deep link, worked out by Brown in 1971, between whether the MLE is admissible and whether Brownian motion is recurrent.
Think about estimators of the form $\hat\theta(x) = x + \nabla \log m(x)$. Through Tweedie's formula, any such estimator can be read as a formal Bayes estimator against a prior $\pi$ whose marginal density is $m = \pi * \phi$.
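As a consistency check (not part of the original derivation), the Gaussian prior from Section 5 fits this template: its marginal is $m(x) \propto \exp\!\big(-\|x\|^2 / (2(1+\tau^2))\big)$, so
$$ \nabla \log m(x) \;=\; -\frac{x}{1 + \tau^2}, \qquad \hat\theta(x) \;=\; x + \nabla \log m(x) \;=\; \left( 1 - \frac{1}{1 + \tau^2} \right) x, $$
which is exactly the linear shrinkage rule obtained by conjugacy.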
The risk improvement of $\hat\theta$ over $X$ is controlled by the Laplacian of $\sqrt{m}$, where $m$ is the marginal density, and
$$ E_\theta\,\|\hat\theta - \theta\|^2 \;=\; d \;+\; 4\, E_\theta\!\left[ \frac{\Delta \sqrt{m}\,(X)}{\sqrt{m}\,(X)} \right]. $$
If $\sqrt{m}$ is superharmonic (meaning $\Delta \sqrt{m} \le 0$ everywhere) then the risk improves. The flat prior gives a constant marginal, so $\Delta \sqrt{m} = 0$ and $\hat\theta = X$, and that is the boundary case that matches the MLE.
The harmonic prior $\pi(\theta) = \|\theta\|^{-(d-2)}$ gives a superharmonic marginal for $d \ge 3$.
The algebra reflects something topological. In $d = 1$ and $d = 2$, Brownian motion is recurrent: it keeps coming back to every neighborhood over and over. In $d \ge 3$, Brownian motion is transient and runs off to infinity.
The inadmissibility of the MLE in $d \ge 3$ lines up with probability mass leaking off to infinity under the flat prior, and the James-Stein estimator makes up for that leak. Shrinkage is needed exactly because the surrounding space is high-dimensional enough for random walks to escape.
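A rough illustration of the recurrence/transience dichotomy, using a discrete Gaussian walk as a stand-in for Brownian motion (a heuristic sketch only, not Brown's argument; the step size, horizon, and radii are arbitrary choices):

```python
import numpy as np

def fraction_returning(d, n_walks=1000, n_steps=20_000, step=0.1,
                       leave_radius=3.0, return_radius=1.0, seed=5):
    """Fraction of Gaussian random walks that re-enter a small ball around the
    origin after first leaving a larger one. Crude proxy for recurrence."""
    rng = np.random.default_rng(seed)
    pos = np.zeros((n_walks, d))
    left = np.zeros(n_walks, dtype=bool)
    returned = np.zeros(n_walks, dtype=bool)
    for _ in range(n_steps):
        pos += step * rng.standard_normal((n_walks, d))
        r = np.linalg.norm(pos, axis=1)
        left |= r > leave_radius
        returned |= left & (r < return_radius)
    return returned.mean()

print(fraction_returning(2))   # d = 2: most walks come back, and the fraction grows with the horizon
print(fraction_returning(3))   # d = 3: a clearly smaller fraction returns, however long you wait
```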
7. The positive-part correction
The standard James-Stein estimator has a failure mode. When $\|X\|^2 < d - 2$ the shrinkage factor goes negative, the estimate flips direction, and that is obviously bad because the estimator shoves $\hat\theta$ over to the opposite side of the origin from $X$.
Baranchik's positive-part tweak clips the shrinkage factor,
$$ \hat\theta_{\mathrm{JS}^{+}} \;=\; \left( 1 - \frac{d-2}{\|X\|^2} \right)_{\!+} X \;=\; \max\!\left( 0,\; 1 - \frac{d-2}{\|X\|^2} \right) X. $$
This estimator dominates $\hat\theta_{\mathrm{JS}}$ but is not itself admissible, because it has a non-analytic kink at $\|X\|^2 = d - 2$ and Brown showed that admissible estimators for exponential families have to be analytic.
Nobody knows a simple analytic estimator that beats JS+, so the positive-part estimator is what people actually use in practice.
8. Exact risk of the positive-part estimator
Working out the exact risk of $\hat\theta_{\mathrm{JS}^{+}}$ takes some care because the estimator acts differently in two regions.
The difference between $\hat\theta_{\mathrm{JS}^{+}}$ and $\hat\theta_{\mathrm{JS}}$ is only nonzero when the shrinkage overshoots, and that happens when $\|X\|^2 < d - 2$. In that region $\hat\theta_{\mathrm{JS}^{+}} = 0$ and $\hat\theta_{\mathrm{JS}} = \big(1 - \tfrac{d-2}{\|X\|^2}\big)X$ with a negative coefficient.
Let $A = \{\, \|X\|^2 < d - 2 \,\}$. The risk gap between the two estimators is
$$ E\,\|\hat\theta_{\mathrm{JS}} - \theta\|^2 \;-\; E\,\|\hat\theta_{\mathrm{JS}^{+}} - \theta\|^2 \;=\; E\Big[ \big( \|\hat\theta_{\mathrm{JS}} - \theta\|^2 - \|\theta\|^2 \big)\, \mathbf{1}_A \Big]. $$
Opening up $\|\hat\theta_{\mathrm{JS}} - \theta\|^2 = \|\hat\theta_{\mathrm{JS}}\|^2 - 2\,\theta^\top \hat\theta_{\mathrm{JS}} + \|\theta\|^2$ on $A$, the risk gap becomes
$$ E\Big[ \big( \|\hat\theta_{\mathrm{JS}}\|^2 - 2\,\theta^\top \hat\theta_{\mathrm{JS}} \big)\, \mathbf{1}_A \Big]. $$
The cross term does not die on $A$ because the truncation breaks the symmetry that Stein's lemma leans on over the full space. Even so, you can check that the gap is nonnegative for every $\theta$ by writing
$$ E\Big[ \theta^\top \hat\theta_{\mathrm{JS}}\, \mathbf{1}_A \Big] \;=\; E\Big[ \Big( 1 - \tfrac{d-2}{\|X\|^2} \Big)\, \theta^\top X \, \mathbf{1}_A \Big] \;\le\; 0, $$
because on $A$ the factor $1 - (d-2)/\|X\|^2$ is negative and $E\big[\theta^\top X \mid \|X\|\big] \ge 0$, so the JS estimate shoots past the origin and on average makes a bigger error than just guessing that $\theta$ is zero.
The exact risk of $\hat\theta_{\mathrm{JS}^{+}}$ needs you to integrate against the non-central chi-square distribution of $\|X\|^2$, and that has a Poisson mixture form
$$ \|X\|^2 \;\sim\; \chi^2_{d + 2K}, \qquad K \sim \mathrm{Poisson}(\lambda), $$
where $\lambda = \|\theta\|^2 / 2$.
The risk of JS is
$$ E\,\|\hat\theta_{\mathrm{JS}} - \theta\|^2 \;=\; d \;-\; (d-2)^2\, E\!\left[ \frac{1}{d - 2 + 2K} \right]. $$
The extra improvement from JS+ means you integrate the full risk gap, including the $\theta$-dependent cross term, over the region $A$, and for general $\theta$ this has to be done numerically. The qualitative takeaway, that $R(\theta, \hat\theta_{\mathrm{JS}^{+}}) \le R(\theta, \hat\theta_{\mathrm{JS}})$ for every $\theta$, falls out of the geometric argument above.
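A sketch of evaluating the James-Stein risk formula through the Poisson mixture, with the tail of the sum over $K$ truncated (the helper name `js_risk` and the truncation point `k_max` are my choices):

```python
import numpy as np
from scipy.stats import poisson

def js_risk(theta_norm_sq, d, k_max=500):
    """Risk of the James-Stein estimator: d - (d-2)^2 * E[1/(d - 2 + 2K)],
    where K ~ Poisson(||theta||^2 / 2); the sum over K is truncated at k_max."""
    lam = theta_norm_sq / 2.0
    k = np.arange(k_max)
    weights = poisson.pmf(k, lam)
    return d - (d - 2)**2 * np.sum(weights / (d - 2 + 2 * k))

print(js_risk(0.0, d=10))    # 2.0 at the origin
print(js_risk(25.0, d=10))   # approaches the MLE risk of 10 as ||theta|| grows
```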
9. Simulation
The simulation below compares the risk of the MLE, the James-Stein estimator, and the positive-part estimator as a function of $\|\theta\|$. The MLE risk is a flat line at $d$, and the shrinkage estimators get their biggest win near the origin.
```python
import numpy as np
import matplotlib.pyplot as plt

def simulate_risk(d=10, n_trials=5000, theta_norms=None):
    if theta_norms is None:
        theta_norms = np.linspace(0, 10, 20)
    risk_mle = []
    risk_js = []
    risk_plus = []
    for r in theta_norms:
        # Create theta vector with norm r
        theta = np.zeros(d)
        theta[0] = r
        # Generate data: X ~ N(theta, I), shape (n_trials, d)
        X = np.random.randn(n_trials, d) + theta
        # Squared norms
        X_norm_sq = np.sum(X**2, axis=1)
        # MLE error: ||X - theta||^2
        loss_mle = np.sum((X - theta)**2, axis=1)
        risk_mle.append(np.mean(loss_mle))
        # JS estimator: (1 - (d-2)/||X||^2) * X
        shrinkage = 1 - (d - 2) / X_norm_sq
        Theta_JS = X * shrinkage[:, np.newaxis]
        loss_js = np.sum((Theta_JS - theta)**2, axis=1)
        risk_js.append(np.mean(loss_js))
        # JS+ estimator: clip the shrinkage factor at zero
        shrinkage_plus = np.maximum(0, shrinkage)
        Theta_Plus = X * shrinkage_plus[:, np.newaxis]
        loss_plus = np.sum((Theta_Plus - theta)**2, axis=1)
        risk_plus.append(np.mean(loss_plus))
    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(theta_norms, risk_mle, 'k--', label='MLE Risk (d)')
    plt.plot(theta_norms, risk_js, 'b-o', label='James-Stein Risk')
    plt.plot(theta_norms, risk_plus, 'r-', label='Positive Part JS')
    plt.axhline(d, color='gray', linestyle=':')
    plt.axhline(2, color='green', linestyle=':', label='Min Risk (at 0)')
    plt.xlabel('||Theta||')
    plt.ylabel('Risk (MSE)')
    plt.title(f'Stein\'s Paradox in {d} Dimensions')
    plt.legend()
    # plt.show()
    return theta_norms, risk_js

# Theoretical minimum risk:
# At theta = 0, Risk = d - (d-2)^2 * E[1/ChiSq_d].
# E[1/ChiSq_d] = 1/(d-2).
# Risk(0) = d - (d-2) = 2.
# Massive improvement: from d down to 2.
```

10. The Efron-Morris baseball example
Efron and Morris (1975) ran shrinkage estimation on batting averages. They took 18 players’ first 45 at-bats and used them to predict how those players would do over the rest of the season.
The MLE uses each player’s own 45-at-bat average and the James-Stein estimator pulls every player toward the grand mean.
James-Stein cut the total squared prediction error by a factor of 3. Players who started unusually hot, batting .450, got pulled down toward the league average; players who started cold got pulled up; and in both cases the shrunken estimates landed closer to the true end-of-season averages.
Shrinkage toward a common value is the same trick that underlies ridge regression, which is Bayesian estimation with a Gaussian prior centered at zero. James-Stein can be read as ridge regression where the amount of regularization tunes itself from the data through the estimated shrinkage factor $(d-2)/\|X\|^2$.
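A sketch of the same grand-mean shrinkage on synthetic batting averages (the data below are made up for illustration, not the actual Efron-Morris table; the $(k-3)$ constant is the standard adjustment when the shrinkage target, the grand mean, is itself estimated, and the binomial variance approximation is mine):

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-in for 18 players' first-45-at-bat averages (not the real 1970 data)
true_avgs = rng.normal(0.265, 0.02, size=18)    # hypothetical "true" abilities
obs = rng.binomial(45, true_avgs) / 45          # observed 45-at-bat averages

# Shrink each observed average toward the grand mean.
# sigma2 approximates the binomial sampling variance of a 45-at-bat average.
grand_mean = obs.mean()
sigma2 = grand_mean * (1 - grand_mean) / 45
k = len(obs)
shrink = max(0.0, 1 - (k - 3) * sigma2 / np.sum((obs - grand_mean)**2))
js_avgs = grand_mean + shrink * (obs - grand_mean)

mle_err = np.sum((obs - true_avgs)**2)
js_err = np.sum((js_avgs - true_avgs)**2)
print(mle_err, js_err)   # in most runs the shrunken estimates are closer to the true abilities in total
```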
11. Implications
Stein's paradox shows something basic about estimation in high dimensions. Treating parameters one at a time is worse than it needs to be even when they really are unrelated; borrowing strength across the ensemble by shrinking toward a common value cuts total risk, because it trades a small amount of bias for a big cut in variance.
In modern machine learning this idea shows up everywhere. Regularization, dropout, and weight decay are all forms of shrinkage that bias models toward simpler solutions, because high-dimensional parameter spaces blow up estimation error.
Background
| Year | Event | Significance |
|---|---|---|
| 1956 | Charles Stein | Proves inadmissibility of the MLE for $d \ge 3$. |
| 1961 | James & Stein | Construct the explicit James-Stein estimator. |
| 1971 | Lawrence Brown | Links admissibility to recurrence of diffusions. |
| 1973 | Efron & Morris | Empirical Bayes interpretation (baseball paper). |
| 1995 | Donoho & Johnstone | Wavelet shrinkage (soft thresholding). |
| 2006 | Candes & Tao | Compressed sensing (L1 shrinkage). |
Terminology
An estimator is admissible if no other estimator has risk at least as small for every $\theta$ and strictly smaller for some $\theta$. When estimator A gets $R(\theta, A) \le R(\theta, B)$ for every $\theta$ and beats it strictly somewhere, then A dominates B.
The James-Stein estimator is a shrinkage estimator that beats the MLE for the multivariate normal mean in dimensions $d \ge 3$. It works by pulling estimates toward a central value and trading bias for lower variance. The positive-part tweak clips the shrinkage factor so it stays non-negative, and that stops overshrinkage.
Stein's Lemma is the integration-by-parts identity for the Gaussian distribution, and it lets us work out risk without knowing $\theta$. The empirical Bayes framework estimates prior hyperparameters straight from the data, and the James-Stein estimator drops out as an empirical Bayes procedure. A minimax estimator minimizes worst-case risk across all $\theta$.
References
1. Stein, C. (1956). “Inadmissibility of the usual estimator for the mean of a multivariate normal distribution”. The original paper establishing the paradox.
2. James, W., & Stein, C. (1961). “Estimation with quadratic loss”. Constructs the explicit estimator and computes its risk.
3. Efron, B., & Morris, C. (1973). “Stein’s estimation rule and its competitors, an empirical Bayes approach”. Develops the empirical Bayes interpretation and demonstrates practical applications.
4. Lehmann, E. L., & Casella, G. (2006). “Theory of Point Estimation”. Standard graduate reference for decision theory and point estimation.