Sufficient Statistics & Information
1. When can you throw data away?
You watch $X$ sitting in some space $\mathcal{X}$ and it comes from $P_\theta$ with $\theta \in \Theta$, and maybe $X$ is a million video frames and maybe $\theta$ is a single number like the gravitational constant that made the motion happen.
A statistic $T$ takes $X$ and sends it to $T(X)$ and squeezes it down, and the real question is whether this squeezing loses anything.
$T$ is sufficient for $\theta$ when the conditional distribution of $X$ given $T(X)$ does not lean on $\theta$ at all.
Once you know $T(X)$ you can cook up fake data that looks exactly like the real thing using just $T(X)$ and a random number generator and no knowledge of $\theta$ at all, and so $X$ tells you nothing about $\theta$ that $T(X)$ has not already told you.
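Here is a toy version of that fake-data trick (a sketch with Bernoulli flips; the setup and variable names are mine): given $T = \sum_i X_i$, every arrangement of $T$ ones among $n$ slots is equally likely, so a random shuffle regenerates data with the right conditional law and never touches $p$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 0.3                      # p plays the role of theta

X = rng.binomial(1, p, size=n)      # "real" data
T = X.sum()                         # sufficient statistic

# Fake data from T alone: place T ones uniformly at random among n slots.
# This samples the conditional law of X given T -- no knowledge of p needed.
fake = np.zeros(n, dtype=int)
fake[rng.choice(n, size=T, replace=False)] = 1

print("real:", X, "T =", T)
print("fake:", fake, "T =", fake.sum())
```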
2. Halmos-Savage factorization
In practice nobody ever checks sufficiency by working out conditional distributions by hand; the tool everyone reaches for is factorization.
Theorem (Halmos-Savage, 1949). Take $\{P_\theta : \theta \in \Theta\}$ dominated by a $\sigma$-finite measure $\mu$, and then $T$ is sufficient for $\theta$ exactly when the density splits up as

$$p_\theta(x) = g_\theta(T(x))\, h(x),$$

where $g_\theta$ touches $x$ only through $T(x)$ and $h$ has no $\theta$ in it.
Why it works ($\Rightarrow$ direction). When $T$ is sufficient you can write $p_\theta(x) = p_\theta(t)\, p(x \mid t)$ at $t = T(x)$, and the conditional piece has no $\theta$ in it and that piece becomes $h(x)$, and the marginal of $T$ carries all the $\theta$-dependence and that becomes $g_\theta(T(x))$.
For the full measure-theoretic argument take a countable subfamily and build a dominating mixture $\lambda = \sum_i c_i P_{\theta_i}$, and then $dP_\theta/d\lambda$ is $\sigma(T)$-measurable and so a function of $T$, and $d\lambda/d\mu$ leans on the mixture and not on any specific $\theta$. The chain rule for Radon-Nikodym derivatives hands you the factorization.
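As a worked instance of the split (the Poisson case, which the simulation at the end of the post also uses): for $X_1,\dots,X_n$ i.i.d. $\mathrm{Poisson}(\lambda)$,

$$p_\lambda(x) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \underbrace{\lambda^{\sum_i x_i}\, e^{-n\lambda}}_{g_\lambda(T(x))} \cdot \underbrace{\prod_{i=1}^n \frac{1}{x_i!}}_{h(x)}, \qquad T(x) = \sum_{i=1}^n x_i.$$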
3. Fisher Information
Sufficiency has a clean story in terms of Fisher Information.
Data processing cannot make new information, and so $I_{T(X)}(\theta) \le I_X(\theta)$ for any statistic $T$.
And $T$ is sufficient exactly when equality holds everywhere, and the proof is short. With factorization you get $\log p_\theta(x) = \log g_\theta(T(x)) + \log h(x)$ and so the score $\partial_\theta \log p_\theta(X)$ leans only on $T(X)$, and the variance of the score (which is the Fisher Information) comes out the same whether you work it out from $X$ or from $T(X)$.
Sufficiency holds exactly when no information is lost at all.
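A numerical sanity check (a sketch; both closed forms are standard): one $\mathrm{Poisson}(\lambda)$ observation carries $I_X(\lambda) = 1/\lambda$, while the lossy statistic $T = \mathbf{1}\{X > 0\}$ is Bernoulli with $p = 1 - e^{-\lambda}$ and carries only $I_T(\lambda) = e^{-\lambda}/(1 - e^{-\lambda})$. Monte Carlo variances of the two scores show the gap.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n_trials = 2.0, 200_000

X = rng.poisson(lam, n_trials)

# Score of the full observation: d/dlam log p(X) = X/lam - 1.
score_X = X / lam - 1.0

# Score of the binary statistic T = 1{X > 0}, T ~ Bernoulli(1 - e^{-lam}):
# d/dlam log P(T=1) = e^{-lam}/p and d/dlam log P(T=0) = -1.
p = 1.0 - np.exp(-lam)
T = (X > 0).astype(float)
score_T = T * (np.exp(-lam) / p) - (1.0 - T)

print("I_X empirical:", score_X.var(), " theory:", 1.0 / lam)
print("I_T empirical:", score_T.var(), " theory:", np.exp(-lam) / p)
```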
4. Minimal sufficiency and the likelihood ratio
There are always lots of sufficient statistics and $X$ itself is sufficient in a trivial way, and what you want is the most squeezed one: the coarsest partition of the sample space that still loses nothing.
$T$ is minimal sufficient when, for any other sufficient statistic $S$, $T$ is a function of $S$.
Lehmann-Scheffe criterion: $T$ is minimal sufficient iff

$$T(x) = T(y) \iff \frac{p_\theta(x)}{p_\theta(y)} \text{ does not depend on } \theta.$$
Two data points land on the same summary exactly when their likelihood ratio goes flat.
Uniform $[0, \theta]$. $p_\theta(x) = \theta^{-n}\,\mathbf{1}\{\max_i x_i \le \theta\}$ for $x_i \ge 0$. The ratio stays constant in $\theta$ exactly when $\max_i x_i = \max_i y_i$ and so $T(x) = \max_i x_i$ is minimal sufficient, and one number holds everything about $\theta$ that $n$ observations have in them.
Cauchy. $p_\theta(x) = \prod_{i=1}^n \frac{1}{\pi\,(1 + (x_i - \theta)^2)}$. The likelihood ratio is a degree-$2n$ rational function of $\theta$ and for it to stay constant the polynomials have to share all their roots, and that forces $\{x_1,\dots,x_n\} = \{y_1,\dots,y_n\}$ as sets. The minimal sufficient statistic is the full order statistic $(X_{(1)},\dots,X_{(n)})$ and no squeezing is possible.
Cauchy tails are so heavy that every single observation tells you something about $\theta$ on its own and no lower-dimensional summary exists.
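A quick numerical check of the flat-ratio criterion (a sketch; the data and grid here are arbitrary choices of mine): permuting a Cauchy sample leaves the log-likelihood-ratio identically zero across $\theta$, while nudging a single value makes it move.

```python
import numpy as np

def cauchy_loglik(x, thetas):
    # sum_i -log(1 + (x_i - theta)^2); the -n*log(pi) constant cancels in ratios
    return -np.log1p((x[:, None] - thetas[None, :]) ** 2).sum(axis=0)

x = np.array([-1.3, 0.2, 2.7, 4.1])
thetas = np.linspace(-5, 5, 9)

lr_perm = cauchy_loglik(x, thetas) - cauchy_loglik(x[::-1], thetas)
lr_diff = cauchy_loglik(x, thetas) - cauchy_loglik(x + np.array([0.5, 0, 0, 0]), thetas)

print(lr_perm)   # identically zero: same multiset, flat ratio
print(lr_diff)   # varies with theta: different multiset
```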
5. Pitman-Koopman-Darmois
When can you squeeze $n$ samples into a fixed number of summaries no matter how big $n$ gets? Almost never.
Theorem. Under regularity conditions where the support does not lean on $\theta$: when a family has a sufficient statistic whose dimension does not grow with $n$, the family has to be an exponential family.
Sketch. For i.i.d. samples factorization hands you $\prod_{i=1}^n p_\theta(x_i) = g_\theta(T(x_1,\dots,x_n))\,h(x)$, and you take derivatives in both $x$ and $\theta$ to get cross-derivative constraints. For the Jacobian that maps data to parameter gradients to have rank at most the dimension of $T$ as $n$ grows, you need $\partial_\theta \log p_\theta(x)$ to live in a fixed finite-dimensional span of functions of $x$, and you integrate and out pops the exponential family form.
Fixed-dimensional sufficiency is so tight that it basically picks out exponential families and nothing else.
6. Exponential families and convex duality
Write the canonical form:

$$p_\eta(x) = h(x)\,\exp\big(\eta^\top T(x) - A(\eta)\big).$$
Three facts make exponential families so easy to work with.
$A$ is convex. Because $A(\eta) = \log \int h(x)\, e^{\eta^\top T(x)}\,d\mu(x)$ and this is a log-moment-generating function, and Hölder's inequality does the rest.
Moments from derivatives. $\nabla A(\eta) = \mathbb{E}_\eta[T(X)]$ and $\nabla^2 A(\eta) = \operatorname{Cov}_\eta(T(X))$.
The covariance is PSD and so $A$ is convex, and strictly convex as long as you have a minimal representation.
MLE is moment matching. The log-likelihood for data $x_1,\dots,x_n$ is $\ell(\eta) = \eta^\top \sum_i T(x_i) - n A(\eta) + \text{const}$, and you set the gradient to zero: $\nabla A(\hat\eta) = \frac{1}{n}\sum_i T(x_i)$.
The MLE is the parameter that makes the model’s expected sufficient statistics line up with the empirical ones, and information geometry reads this as projecting the empirical distribution onto the model manifold.
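In code the moment matching is one line (a sketch for Bernoulli; the names are mine): the MLE in the canonical parametrization is the logit of the empirical mean, and $\nabla A(\hat\eta)$ reproduces the empirical sufficient statistic exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=1000)

t_bar = x.mean()                          # empirical sufficient statistic
eta_hat = np.log(t_bar / (1 - t_bar))     # MLE: logit of the mean

grad_A = 1 / (1 + np.exp(-eta_hat))       # grad A(eta) = sigma(eta) = E_eta[T]
print(grad_A, t_bar)                      # moment matching: these agree
```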
7. Basu’s Theorem
An ancillary statistic has a distribution that does not lean on $\theta$ at all. Take the sample variance $s^2$ when $X_i \sim N(\mu, 1)$ and the parameter is $\mu$, and you get one.
Ancillaries hold no likelihood information about $\theta$ but they do tell you how precise your experiment was.
Basu’s Theorem. When $T$ is boundedly complete and sufficient and $V$ is ancillary, then $T \perp V$ under every $P_\theta$.
Proof. Take any bounded $g$, and because $V$ is ancillary $c = \mathbb{E}_\theta[g(V)]$ stays constant in $\theta$. Let $\psi(T) = \mathbb{E}[g(V) \mid T]$, and sufficiency strips the $\theta$ out so $\psi$ is a genuine statistic. Then $\mathbb{E}_\theta[\psi(T)] = c$ is constant and so $\mathbb{E}_\theta[\psi(T) - c] = 0$ for all $\theta$. Completeness of $T$ forces $\psi(T) = c$ almost surely. And so $\mathbb{E}[g(V) \mid T] = \mathbb{E}[g(V)]$ for every bounded $g$, and that is independence.
Why this matters for the $t$-test. For $X_1,\dots,X_n \sim N(\mu, \sigma^2)$ with $\sigma^2$ fixed, $\bar X$ is complete sufficient for $\mu$ given the joint structure and $s^2$ is ancillary for $\mu$, and Basu hands you $\bar X \perp s^2$. The independence of $\bar X$ and $s^2$ is why the $t$-test works, because the top and the bottom of $t = \frac{\bar X - \mu}{s/\sqrt{n}}$ are independent. And it only works for Gaussians by Geary’s Theorem, and you try it with any other distribution and the independence falls apart.
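An empirical look (a sketch; zero correlation is of course weaker than independence, but it makes the contrast visible): the correlation between $\bar X$ and $s^2$ sits near zero for Gaussian samples and far from zero for exponential ones, where the mean and spread of a sample move together.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 10, 50_000

def mean_var_corr(draw):
    X = draw((trials, n))
    m = X.mean(axis=1)           # sample means
    v = X.var(axis=1, ddof=1)    # sample variances
    return np.corrcoef(m, v)[0, 1]

print("Gaussian:   ", mean_var_corr(lambda s: rng.normal(0.0, 1.0, s)))   # ~ 0
print("Exponential:", mean_var_corr(lambda s: rng.exponential(1.0, s)))   # far from 0
```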
8. Rao-Blackwellization
You have a bad unbiased estimator $\hat\theta_1$ and a sufficient statistic $T$ and you set $\hat\theta_2 = \mathbb{E}[\hat\theta_1 \mid T]$.
It is still unbiased because $\mathbb{E}[\hat\theta_2] = \mathbb{E}\big[\mathbb{E}[\hat\theta_1 \mid T]\big] = \mathbb{E}[\hat\theta_1] = \theta$.
And it is computable because sufficiency means the conditional distribution of $\hat\theta_1$ given $T$ has no $\theta$ in it.
And by the law of total variance you get

$$\operatorname{Var}(\hat\theta_1) = \operatorname{Var}(\hat\theta_2) + \mathbb{E}\big[\operatorname{Var}(\hat\theta_1 \mid T)\big] \ge \operatorname{Var}(\hat\theta_2).$$
Conditioning on the sufficient statistic always cuts variance down and it cuts it strictly unless the estimator was already a function of $T$.
9. Information Bottleneck and neural networks
Tishby’s Information Bottleneck asks you to find a representation $Z$ that pushes $I(Z; Y)$ up and pulls $I(Z; X)$ down, and the ideal $Z$ is a sufficient statistic for $Y$ and minimal sufficient means perfect compression.
People have argued that SGD quietly finds these representations. The trouble is that for deterministic networks with continuous inputs $I(Z; X)$ is infinite or undefined and you need noise in the weights or activations to make any of this well-defined.
For Bayesian neural networks the posterior predictive leans on sufficient statistics of the training data. But finite-width neural networks are not exponential families, and so the sufficient statistic is the whole dataset and no compression actually happens.
In the NTK regime when the width goes to infinity the sufficient statistics collapse down to the kernel matrix and that gives you a real finite-dimensional summary.
10. Sufficient statistics in time series
For i.i.d. data sufficiency squeezes static points down. In time series you are chewing through a stream and you want a running summary $S_t = f(S_{t-1}, x_t)$ with

$$p(x_{t+1}, x_{t+2}, \dots \mid x_1, \dots, x_t) = p(x_{t+1}, x_{t+2}, \dots \mid S_t).$$
This is the state, and when a finite-dimensional $S_t$ exists you have a hidden Markov model or a state space model.
Kalman filter. For linear Gaussian systems the sufficient statistics for the future are the filtered mean and covariance $(\hat{x}_t, P_t)$ and the Kalman filter is just their recursive update.
Fixed-memory tracking of a dynamic system. This is Pitman-Koopman-Darmois pushed onto conditional transition densities, and if the noise were Cauchy and not Gaussian you would have to hold onto the whole history.
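A minimal sketch of the recursion for a scalar linear Gaussian system $x_{t+1} = a\,x_t + w_t$, $y_t = x_t + v_t$ (the constants and names here are my choices): the pair $(\hat{x}_t, P_t)$ is the entire memory, and each step updates only it.

```python
import numpy as np

def kalman_step(x_hat, P, y, a=0.9, q=0.1, r=0.5):
    # Predict: push the sufficient statistics through the dynamics.
    x_pred = a * x_hat
    P_pred = a * P * a + q
    # Update: fold the new observation y_t into (x_hat, P).
    K = P_pred / (P_pred + r)            # Kalman gain
    x_hat = x_pred + K * (y - x_pred)
    P = (1 - K) * P_pred
    return x_hat, P

rng = np.random.default_rng(4)
x, x_hat, P = 0.0, 0.0, 1.0
for t in range(5):
    x = 0.9 * x + rng.normal(0, np.sqrt(0.1))   # latent state
    y = x + rng.normal(0, np.sqrt(0.5))         # observation
    x_hat, P = kalman_step(x_hat, P, y)         # only (x_hat, P) persist
    print(f"t={t}  x={x:+.3f}  x_hat={x_hat:+.3f}  P={P:.3f}")
```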
11. Approximate sufficiency and Le Cam’s deficiency
What happens when a statistic is only almost sufficient?
Le Cam pinned this down with decision theory. Two experiments $\mathcal{E} = (P_\theta)_{\theta \in \Theta}$ where you watch $X$ and $\mathcal{F} = (Q_\theta)_{\theta \in \Theta}$ where you watch $T(X)$ are equivalent when every decision rule in one can be matched in the other with the same risk.
The deficiency distance is

$$\delta(\mathcal{F}, \mathcal{E}) = \inf_{K}\, \sup_{\theta}\, \big\| K Q_\theta - P_\theta \big\|_{TV},$$

where $K$ is a Markov kernel, and $\delta = 0$ means $T$ is sufficient and $\delta \le \epsilon$ means $T$ is $\epsilon$-sufficient.
And this hooks up to differential privacy. You want statistics that are sufficient for the signal and insufficient for picking out the user.
12. Fisher’s program
Fisher’s 1922 insight was that inference is data reduction. You start with a high-dimensional $X$ and you boil it down to a tiny $T(X)$, and sufficient statistics tell you exactly what you can throw away, and factorization gives you the tool to check, and exponential families are the spot where finite-dimensional summaries actually exist, and Basu and Lehmann-Scheffe fill in the links between sufficiency and ancillarity and invariance.
In modern ML the word “sufficiency” goes by “representation learning” and the goal is the same. Find the smallest coordinates of whatever manifold the data actually lives on.
Simulation of Rao-Blackwellization and Fisher Information
We work out estimator variances for a Poisson mean $\lambda$ with three estimators.
- $X_1$ alone, unbiased and wasteful.
- $\bar X = \mathbb{E}[X_1 \mid \sum_i X_i]$, which is the Rao-Blackwell improvement and the MVUE.
- $\hat\lambda = -\log(1 - k/n)$ from $k = \#\{i : X_i > 0\}$, which is binary compression and throws the counts away.
The third estimator throws the count information away and should have much higher variance.
```python
import numpy as np

def fisher_info_loss_demo(true_lambda=5.0, n_samples=10, n_trials=5000, seed=0):
    rng = np.random.default_rng(seed)
    estimates_raw, estimates_rb, estimates_bad = [], [], []

    for _ in range(n_trials):
        X = rng.poisson(true_lambda, n_samples)

        # 1. Raw estimator: only the first sample. Unbiased but wasteful.
        estimates_raw.append(X[0])

        # 2. Rao-Blackwellized: E[X_1 | sum(X)] = mean(X).
        estimates_rb.append(np.mean(X))

        # 3. Bad statistic (lossy compression): we only know how many
        #    samples are non-zero. k = #{X_i > 0} ~ Binomial(n, p) with
        #    p = 1 - e^{-lambda}, so lambda = -log(1 - p); plug in p_hat = k/n.
        k = np.sum(X > 0)
        if k == n_samples:
            # Smooth the boundary case to avoid log(0).
            est_bad = -np.log(1 - (n_samples - 0.5) / n_samples)
        else:
            est_bad = -np.log(1 - k / n_samples)
        estimates_bad.append(est_bad)

    print(f"Theory Variance (Raw):    {true_lambda:.4f}")
    print(f"Empirical Var (Raw):      {np.var(estimates_raw):.4f}")
    # Cramer-Rao lower bound for n i.i.d. Poisson samples: lambda / n.
    print(f"CRLB (Optimal):           {true_lambda / n_samples:.4f}")
    print(f"Empirical Var (RB):       {np.var(estimates_rb):.4f}")
    print(f"Empirical Var (Bad Stat): {np.var(estimates_bad):.4f}")

# The bad statistic throws away the exact counts, keeping only binary
# information, and its variance explodes.
fisher_info_loss_demo()
```

Completeness and Basu’s Theorem
Completeness of exponential families
The exponential family $p_\eta(x) = h(x)\exp(\eta^\top T(x) - A(\eta))$ is complete when the natural parameter space contains an open rectangle.
Suppose $\mathbb{E}_\eta[f(T)] = 0$ for all $\eta$, and then $\int f(t)\, e^{\eta^\top t}\, d\nu(t) = 0$ and that hands you a vanishing Laplace transform of $f\,d\nu$. When a Laplace transform vanishes on an open set the function is zero almost everywhere, and the result follows.
When completeness fails
Take $X \sim \mathrm{Uniform}(\theta, \theta + 1)$ and pick $g(X) = \sin(2\pi X)$. Then $\mathbb{E}_\theta[g(X)] = 0$ for every $\theta$ because it is a full-period integral. But $g \not\equiv 0$ and so the uniform family is incomplete and Basu’s theorem does not apply, and ancillaries can depend on sufficient statistics here.
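Numerically (a sketch): the full-period integral really does vanish for every shift.

```python
import numpy as np

rng = np.random.default_rng(5)
for theta in [0.0, 0.37, 2.5]:
    X = rng.uniform(theta, theta + 1, size=1_000_000)
    print(theta, np.sin(2 * np.pi * X).mean())   # ~ 0 for every theta
```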
Moments from the partition function
Every moment calculation you would ever want falls out by taking derivatives of $A(\eta)$.
Bernoulli. $T(x) = x$. The canonical parameter is $\eta = \log\frac{p}{1-p}$, which is the logit. The inverse is $p = \sigma(\eta) = \frac{1}{1 + e^{-\eta}}$. The log-partition is $A(\eta) = \log(1 + e^\eta)$. Mean: $A'(\eta) = \sigma(\eta) = p$. Variance: $A''(\eta) = p(1-p)$.
Poisson. $T(x) = x$. The canonical parameter is $\eta = \log\lambda$ and the log-partition is $A(\eta) = e^\eta = \lambda$. Mean: $A'(\eta) = \lambda$. Variance: $A''(\eta) = \lambda$.
Gaussian (known $\sigma^2$). $T(x) = x$. The canonical parameter is $\eta = \mu/\sigma^2$ and the log-partition is $A(\eta) = \sigma^2\eta^2/2$. Mean: $A'(\eta) = \sigma^2\eta = \mu$. Variance: $A''(\eta) = \sigma^2$.
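Autodiff makes the derivative-taking literal (a sketch in jax): differentiate each log-partition once for the mean and twice for the variance.

```python
import jax
import jax.numpy as jnp

A_bern = lambda eta: jnp.log1p(jnp.exp(eta))   # Bernoulli: A(eta) = log(1 + e^eta)
A_pois = lambda eta: jnp.exp(eta)              # Poisson:   A(eta) = e^eta

eta = 0.7
print(jax.grad(A_bern)(eta))                   # sigma(eta) = p
print(jax.grad(jax.grad(A_bern))(eta))         # p(1 - p)

lam = jnp.exp(eta)
print(jax.grad(A_pois)(eta) - lam)             # ~0: A'(eta) = lambda
print(jax.grad(jax.grad(A_pois))(eta) - lam)   # ~0: A''(eta) = lambda
```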
Quantum sufficiency
Sufficiency carries over to quantum mechanics and the objects just change shape. A density matrix $\rho_\theta$ on Hilbert space takes the place of the distribution and a quantum channel (CPTP map) $\Lambda$ takes the place of the statistic.
And $\Lambda$ is sufficient when there is a recovery channel $\mathcal{R}$ with

$$\mathcal{R}\big(\Lambda(\rho_\theta)\big) = \rho_\theta \quad \text{for all } \theta.$$
This hooks into the Petz recovery map. A measurement (POVM) is sufficient for a family of states when the quantum Fisher information stays put.
The monotonicity of relative entropy gives the criterion

$$D\big(\Lambda(\rho)\,\big\|\,\Lambda(\sigma)\big) \le D(\rho\,\|\,\sigma),$$

and equality holds exactly when $\Lambda$ is sufficient for $\{\rho, \sigma\}$. The data processing inequality stops any channel from making states easier to tell apart, and this is the same principle as in the classical case.
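A small numerical illustration of monotonicity (a sketch with random two-qubit states; the partial trace over the second qubit plays the channel $\Lambda$, and the construction is my choice):

```python
import numpy as np

def rand_density(d, rng):
    # Random density matrix via a Wishart-style construction.
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T
    return rho / np.trace(rho).real

def rel_entropy(rho, sigma):
    # D(rho || sigma) = Tr[rho (log rho - log sigma)], natural log.
    def logm(M):
        w, V = np.linalg.eigh(M)
        return (V * np.log(w)) @ V.conj().T
    return np.trace(rho @ (logm(rho) - logm(sigma))).real

def partial_trace(rho, d=2):
    # Trace out the second qubit of a (d*d)-dimensional bipartite state.
    return rho.reshape(d, d, d, d).trace(axis1=1, axis2=3)

rng = np.random.default_rng(6)
rho, sigma = rand_density(4, rng), rand_density(4, rng)
print(rel_entropy(rho, sigma))                                # full states
print(rel_entropy(partial_trace(rho), partial_trace(sigma)))  # after the channel
# Monotonicity: the second number never exceeds the first.
```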
Sufficiency in latent variable models (EM)
Take a model with observed $x$ and latent $z$. The complete likelihood $p_\theta(x, z)$ often sits in an exponential family with sufficient statistics $T(x, z)$. But we only get to see $x$.
EM works by computing expected sufficient statistics.
E-step: compute the posterior $p(z \mid x, \theta^{(t)})$ and then the expected sufficient statistics $\bar T = \mathbb{E}_{z \sim p(z \mid x, \theta^{(t)})}[T(x, z)]$.
M-step: find $\theta^{(t+1)}$ by moment matching: $\mathbb{E}_{\theta^{(t+1)}}[T] = \bar T$.
For Gaussian mixtures the sufficient statistics are $\sum_i \mathbb{1}\{z_i = k\}$ for the cluster masses and $\sum_i \mathbb{1}\{z_i = k\}\, x_i$ for the first moments and $\sum_i \mathbb{1}\{z_i = k\}\, x_i x_i^\top$ for the second moments. Without factorization EM would be unworkable for most models.
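A sketch of one EM sweep for a one-dimensional two-component Gaussian mixture, written so the expected sufficient statistics are explicit (all names are mine):

```python
import numpy as np

def em_step(x, pi, mu, var):
    # E-step: responsibilities r[i, k] = p(z_i = k | x_i, theta).
    log_r = (np.log(pi)
             - 0.5 * np.log(2 * np.pi * var)
             - 0.5 * (x[:, None] - mu) ** 2 / var)
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)

    # Expected sufficient statistics.
    N_k  = r.sum(axis=0)      # sum_i E[1{z_i = k}]        (cluster masses)
    S1_k = r.T @ x            # sum_i E[1{z_i = k}] x_i    (first moments)
    S2_k = r.T @ x**2         # sum_i E[1{z_i = k}] x_i^2  (second moments)

    # M-step: moment matching against the expected statistics.
    pi  = N_k / len(x)
    mu  = S1_k / N_k
    var = S2_k / N_k - mu**2
    return pi, mu, var

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    pi, mu, var = em_step(x, pi, mu, var)
print(pi, mu, var)   # recovers weights ~(.5, .5), means ~(-2, 3), vars ~1
```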
Milestones
- 1922. R.A. Fisher lays down “sufficiency” in his foundational paper on mathematical statistics and sets up the idea that a statistic can hold all the information about a parameter.
- 1935-36. Koopman, Pitman, and Darmois each prove on their own that fixed-dimensional sufficiency forces exponential family structure.
- 1945. C.R. Rao proves what becomes the Rao-Blackwell theorem (Blackwell independently in 1947) and shows that conditioning on a sufficient statistic always cuts variance down.
- 1949. Halmos and Savage prove the factorization theorem with measure theory and put sufficiency on solid footing.
- 1955. D. Basu proves that complete sufficient statistics come out independent of ancillary statistics.
- 1972. Le Cam brings in deficiency distance and lets you compare approximate sufficiency between experiments.
- 1999. Tishby puts forward the Information Bottleneck principle and hooks sufficiency up to representation learning and deep networks.
Terminology used in this post
A sufficient statistic holds everything $X$ knows about $\theta$ and formally the conditional distribution of $X$ given $T(X)$ does not lean on $\theta$ at all. A minimal sufficient statistic is the coarsest such statistic and it squeezes the data as far as it can go with no information loss. Completeness means no non-trivial function of $T$ has expectation zero for all parameter values and it is the condition that makes Basu’s theorem work. The exponential family is the class of distributions whose log-likelihood is linear in the sufficient statistics and the Pitman-Koopman-Darmois theorem says these are basically the only families that allow fixed-dimensional sufficiency. Fisher Information measures how much an observable tells you about $\theta$ and Rao-Blackwellization is the move of conditioning an estimator on a sufficient statistic to cut variance down. An ancillary statistic has a distribution that does not lean on $\theta$ at all.
References
1. Fisher, R. A. (1922). “On the mathematical foundations of theoretical statistics”.
2. Halmos, P. R., & Savage, L. J. (1949). “Application of the Radon-Nikodym theorem to the theory of sufficient statistics”.
3. Basu, D. (1955). “On statistics independent of a complete sufficient statistic”.
4. Lehmann, E. L., & Scheffe, H. (1950). “Completeness, similar regions, and unbiased estimation”.
5. Pitman, E. J. G. (1936). “Sufficient statistics and intrinsic accuracy”.
6. Csiszar, I. (1975). “I-divergence geometry of probability distributions and minimization problems”.