Sufficient Statistics & Information
1. The Concept of Lossless Compression
The Problem: We observe a random variable $X$ taking values in a space $\mathcal{X}$, distributed according to $P_\theta$ where $\theta \in \Theta$. The raw data $X$ is often high-dimensional (e.g., video frames). The parameter $\theta$ is low-dimensional (e.g., the physics constant causing the motion). All inference about $\theta$ must be based on $X$. A statistic is a map $T: \mathcal{X} \to \mathcal{T}$. Intuitively, $T$ compresses the data. When is this compression lossless?
Definition (Sufficiency): A statistic $T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $T(X) = t$ does not depend on $\theta$, for all $t$.
If this holds, we can simulate the original data $X$ knowing only $T(X)$ and a random number generator, without knowing $\theta$. Therefore, keeping $X$ provides no additional information about $\theta$ that isn’t already in $T(X)$.
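For concreteness, here is a minimal simulation sketch of this idea for a Bernoulli sample with $T(X) = \sum_i X_i$ (the function name `sample_given_sufficient` and the specific numbers are illustrative, not from the text): the conditional law of the data given $T$ is uniform over arrangements of the ones and involves no $p$ at all.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_given_sufficient(t, n):
    """Draw a Bernoulli sample of size n conditioned on sum = t.

    The conditional law is uniform over all arrangements of t ones
    among n slots -- it does not involve the parameter p.
    """
    x = np.zeros(n, dtype=int)
    ones = rng.choice(n, size=t, replace=False)
    x[ones] = 1
    return x

# "Real" data generated with an unknown p...
p_true, n = 0.3, 12
x_real = rng.binomial(1, p_true, size=n)
t = x_real.sum()

# ...can be re-simulated using only T(x) = t, with no knowledge of p.
x_fake = sample_given_sufficient(t, n)
print("observed sum:", t, " simulated sum:", x_fake.sum())
```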
2. The Halmos-Savage Factorization Theorem
Checking the conditional-distribution definition directly is rarely tractable. In practice we rely on the Factorization Criterion.
Theorem (Halmos-Savage, 1949): Let $\{P_\theta : \theta \in \Theta\}$ be a family of distributions dominated by a $\sigma$-finite measure $\mu$. $T$ is sufficient for $\theta$ if and only if the Radon–Nikodym density factorizes as:
$$p_\theta(x) = \frac{dP_\theta}{d\mu}(x) = g_\theta\big(T(x)\big)\, h(x),$$
where $g_\theta$ depends on $x$ only through $T(x)$, and $h$ does not depend on $\theta$.
Proof of Sufficiency ($\Rightarrow$): If $T$ is sufficient, then for any measurable set $A$:
$$P_\theta(X \in A) = \int P(X \in A \mid T = t)\, dP_\theta^{T}(t).$$
Let $g_\theta(t)$ be the density of $T(X)$ with respect to an induced measure, and let $h(x)$ come from the conditional density of $X$ given $T$. Heuristically, $p_\theta(x) = g_\theta(T(x))\, h(x)$: the factor $h$ is $\theta$-free (by sufficiency), while $g_\theta$ carries all the dependence on $\theta$ (through $T(x)$ alone).
Measure-Theoretic Proof Sketch: Let $\mathcal{B} = \sigma(T)$ be the sufficient sub-sigma-algebra generated by $T$. Sufficiency means the derivative $dP_\theta/d\mu$ can be effectively computed on $\mathcal{B}$. Specifically, we define the dominating mixture $\lambda = \sum_i c_i P_{\theta_i}$ with $c_i > 0$, $\sum_i c_i = 1$. Then $dP_\theta/d\lambda$ is $\mathcal{B}$-measurable. Since $\frac{dP_\theta}{d\mu} = \frac{dP_\theta}{d\lambda}\cdot\frac{d\lambda}{d\mu}$, we identify: $g_\theta(T(x)) = \frac{dP_\theta}{d\lambda}(x)$ (which is $\mathcal{B}$-measurable, hence a function of $T(x)$), and $h(x) = \frac{d\lambda}{d\mu}(x)$ (which depends on the mixture, not on any specific $\theta$).
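As a numeric sanity check of the factorization (a sketch under the Poisson model, not part of the proof): for an i.i.d. Poisson($\lambda$) sample, $p_\lambda(x) = e^{-n\lambda}\lambda^{\sum_i x_i}\big/\prod_i x_i!$, so $g_\lambda(T(x)) = \lambda^{T(x)} e^{-n\lambda}$ with $T(x) = \sum_i x_i$ and $h(x) = 1/\prod_i x_i!$. The remainder $\log p_\lambda(x) - \log g_\lambda(T(x))$ should not move as $\lambda$ varies.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
x = rng.poisson(4.0, size=8)           # one fixed data set
T = x.sum()                            # candidate sufficient statistic
n = len(x)

for lam in [0.5, 2.0, 4.0, 9.0]:
    log_p = poisson.logpmf(x, lam).sum()      # log p_lambda(x)
    log_g = T * np.log(lam) - n * lam         # log g_lambda(T(x))
    # The remainder log h(x) = -sum(log x_i!) must not depend on lambda.
    print(f"lambda={lam:4.1f}  log h(x) = {log_p - log_g:+.6f}")
```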
3. Fisher Information and Efficiency
Why do we care? Because of Fisher Information.
Theorem (Data Processing Inequality): For any statistic $S = S(X)$, $I_S(\theta) \le I_X(\theta)$. Processing data cannot create information. Theorem: $T$ is sufficient if and only if $I_T(\theta) = I_X(\theta)$ for all $\theta$. (This holds for dominated families with suitably regular densities.)
Proof: Using the factorization $p_\theta(x) = g_\theta(T(x))\, h(x)$, we have $\log p_\theta(x) = \log g_\theta(T(x)) + \log h(x)$. The score function $\partial_\theta \log p_\theta(x) = \partial_\theta \log g_\theta(T(x))$ depends on $x$ only through $T(x)$. Thus the variance of the score (the Fisher information) is identical for $X$ and $T(X)$. Sufficiency = Conservation of Fisher Information.
4. Minimal Sufficiency & The Likelihood Ratio
Usually there are many sufficient statistics. The whole data $X$ is sufficient. We want the smallest one: the statistic that compresses the most. Definition: $T$ is minimal sufficient if, for any other sufficient statistic $S$, $T$ is a function of $S$ (i.e., $T$ partitions the sample space at least as coarsely as any other sufficient statistic).
Lehmann–Scheffé Method: $T$ is minimal sufficient if and only if:
$$T(x) = T(y) \iff \frac{p_\theta(x)}{p_\theta(y)} \text{ does not depend on } \theta.$$
Basically, two data points $x$ and $y$ are “equivalent” (map to the same value of $T$) iff their likelihood ratio is constant in $\theta$.
Example 1: Uniform $U(0, \theta)$. Here $p_\theta(x) = \theta^{-n}\,\mathbf{1}\{\max_i x_i \le \theta\}$, so the ratio $p_\theta(x)/p_\theta(y)$ is free of $\theta$ for all $\theta$ iff $\max_i x_i = \max_i y_i$. Thus $T(X) = X_{(n)} = \max_i X_i$ is minimal sufficient.
Example 2: Cauchy Distribution $p_\theta(x) = \prod_{i=1}^{n} \frac{1}{\pi\,(1 + (x_i - \theta)^2)}$. The likelihood ratio $p_\theta(x)/p_\theta(y)$ is a rational function of $\theta$ of degree $2n$. For the ratio to be constant in $\theta$, the polynomials in the numerator and denominator must share roots. This implies the multiset of values $\{x_1,\dots,x_n\}$ must match $\{y_1,\dots,y_n\}$. Thus, for the Cauchy location family, the minimal sufficient statistic is the Order Statistics (sorted data). No compression below $n$ numbers is possible. Lesson: Heavy tails often destroy sufficiency properties.
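A quick numerical illustration of the Lehmann–Scheffé criterion (a sketch; the specific samples are made up): take two different samples $x$ and $y$ with the same mean. For the Gaussian location model the log likelihood ratio $\log p_\theta(x) - \log p_\theta(y)$ is constant in $\theta$ (the mean is sufficient), while for the Cauchy location model it is not (only the order statistic is).

```python
import numpy as np
from scipy.stats import norm, cauchy

x = np.array([-1.0, 0.0, 1.0])
y = np.array([-2.0, 0.0, 2.0])        # same mean as x, different order statistics

thetas = np.linspace(-2, 2, 5)
for dist, name in [(norm, "Gaussian"), (cauchy, "Cauchy")]:
    ratios = [dist.logpdf(x, loc=t).sum() - dist.logpdf(y, loc=t).sum()
              for t in thetas]
    print(f"{name:8s} log-ratio over theta: "
          + " ".join(f"{r:+.3f}" for r in ratios))

# Gaussian: the log-ratio is the same for every theta -> x ~ y, the mean suffices.
# Cauchy:   the log-ratio varies with theta           -> x and y are NOT equivalent.
```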
5. The Pitman-Koopman-Darmois Theorem
When can we compress $n$ samples into $k$ numbers (where $k$ stays fixed as $n \to \infty$)? Only in very special cases.
Theorem: Under regularity conditions (support independent of $\theta$), if a family admits a sufficient statistic of fixed dimension $k$ (independent of the sample size $n$), then the family is an Exponential Family.
Proof Sketch: Consider the log-likelihood for $n$ i.i.d. samples:
$$\ell_n(\theta) = \sum_{i=1}^{n} \log p_\theta(x_i).$$
If $T(x_1,\dots,x_n)$ is sufficient, then by Factorization:
$$\ell_n(\theta) = \log g_\theta\big(T(x_1,\dots,x_n)\big) + \log h(x_1,\dots,x_n).$$
Differentiating with respect to distinct samples $x_i$ and $x_j$, we obtain a constraint on the cross-derivatives that forces the $\theta$-dependence to separate into a product form $\eta(\theta)^\top T(x)$. Specifically, consider the map from data space to the parameter gradients $\nabla_\theta \log p_\theta(x_i)$. For the rank of this collection to stay bounded by $k$ as $n \to \infty$, the gradients must lie in a fixed $k$-dimensional subspace. This forces the structure:
$$\partial_\theta \log p_\theta(x) = \sum_{j=1}^{k} \eta_j'(\theta)\, T_j(x) - A'(\theta).$$
Integration over $\theta$ yields the exponential family form $p_\theta(x) = h(x)\exp\big(\sum_{j} \eta_j(\theta)\, T_j(x) - A(\theta)\big)$.
6. Exponential Families & Convex Duality
The “Canonical Form” $p_\eta(x) = h(x)\,\exp\big(\eta^\top T(x) - A(\eta)\big)$ is critical for derivations.
Property 1: Log-Partition is Convex. Since the density must integrate to 1, $A(\eta) = \log \int h(x)\, e^{\eta^\top T(x)}\, d\mu(x)$. Hölder’s inequality implies this log-integral-of-exponentials is convex in $\eta$.
Property 2: Moments are Gradients.
$$\nabla A(\eta) = E_\eta[T(X)], \qquad \nabla^2 A(\eta) = \mathrm{Cov}_\eta\big(T(X)\big).$$
Since the Covariance is Positive Semi-Definite, $A$ is convex, and strictly convex if the representation is minimal.
Property 3: MLE is Moment Matching. The Log-Likelihood for data $x_1,\dots,x_n$ is:
$$\ell(\eta) = \eta^\top \sum_{i=1}^{n} T(x_i) - n\,A(\eta) + \text{const}.$$
Taking the gradient and setting it to 0:
$$\nabla A(\hat\eta) = \frac{1}{n}\sum_{i=1}^{n} T(x_i).$$
The Maximum Likelihood Estimator is the unique parameter that makes the model’s expected sufficient statistics $E_{\hat\eta}[T(X)]$ match the observed empirical average of the sufficient statistics. This is called the Dual Matching Condition. Information Geometry interprets this as a Projection of the empirical distribution onto the model manifold.
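A tiny numerical illustration of the dual matching condition (a sketch; the Poisson canonical parameterization is the one derived in Appendix C): with $T(x) = x$ and $A(\eta) = e^\eta$, solving $A'(\hat\eta) = \bar{T}$ by one-dimensional root finding recovers the closed-form MLE $\hat\lambda = \bar{x}$.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
x = rng.poisson(3.5, size=200)
T_bar = x.mean()                    # empirical average of the sufficient statistic

def moment_gap(eta):
    # grad A(eta) - mean T: the dual matching residual for the Poisson family.
    return np.exp(eta) - T_bar

eta_hat = brentq(moment_gap, -10.0, 10.0)   # solve grad A(eta) = mean T
print("lambda_hat via moment matching:", np.exp(eta_hat))
print("sample mean (closed-form MLE): ", T_bar)
```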
7. Basu’s Theorem & Ancillarity
An Ancillary Statistic is one whose distribution does not depend on $\theta$ (e.g., the sample size if it is fixed, or the range $X_{(n)} - X_{(1)}$ for a location family $p(x - \theta)$). Ancillaries seem useless? No, they describe the “precision” of the experiment. Basu’s Theorem: If $T$ is boundedly complete and sufficient, and $A$ is ancillary, then $T$ and $A$ are Independent under every $P_\theta$. Independence?!? $T$ has all the info. $A$ has no info. Independence seems trivial? No, usually info sets overlap. But here they are orthogonal.
Proof
Let $B$ be an event defined by the ancillary statistic $A$; $p = P(A \in B)$ is constant in $\theta$. Consider $\psi(t) = P(A \in B \mid T = t)$. Since $T$ is sufficient, $\psi$ does not depend on $\theta$. Moreover $E_\theta[\psi(T)] = P_\theta(A \in B) = p$, so $E_\theta[\psi(T) - p] = 0$ for all $\theta$. Because $T$ is boundedly Complete (no non-zero bounded function of $T$ has mean 0 for all $\theta$), we must have $\psi(T) = p$ a.s. Thus $P(A \in B \mid T) = P(A \in B)$. This implies independence.
Application: Geary’s Theorem
Let $X_1,\dots,X_n \sim N(\mu, \sigma^2)$. $\bar{X}$ is complete sufficient for $\mu$ when $\sigma^2$ is known (otherwise $(\bar{X}, S^2)$ is jointly sufficient). $S^2 = \frac{1}{n-1}\sum_i (X_i - \bar{X})^2$ is ancillary for $\mu$ (location invariant). Is $\bar{X}$ complete? Yes. Therefore $\bar{X}$ and $S^2$ are independent; since this holds for every fixed $\sigma^2$, the independence survives when $\sigma^2$ is unknown. This is why the t-test works (numerator and denominator are independent). Note: This independence of $\bar{X}$ and $S^2$ holds ONLY for Gaussians (Geary’s Theorem).
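A simulation sketch of this independence (illustrative only; the exponential alternative is my choice of a skewed comparison): across repeated samples, the sample mean and sample variance are uncorrelated for Gaussian data but clearly correlated for exponential data.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 20, 20000

for name, sampler in [("Gaussian",    lambda: rng.normal(0.0, 1.0, size=n)),
                      ("Exponential", lambda: rng.exponential(1.0, size=n))]:
    means, variances = [], []
    for _ in range(trials):
        x = sampler()
        means.append(x.mean())
        variances.append(x.var(ddof=1))
    corr = np.corrcoef(means, variances)[0, 1]
    print(f"{name:12s} corr(x_bar, s^2) = {corr:+.3f}")

# Gaussian:    correlation ~ 0  (Basu: x_bar and s^2 are independent).
# Exponential: correlation > 0  (independence fails; Geary's characterization).
```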
8. Rao-Blackwellization
Suppose we have a rough unbiased estimator $\delta(X)$ and a sufficient statistic $T(X)$. Define $\delta^*(T) = E[\delta(X) \mid T]$.
- Unbiased: $E_\theta[\delta^*(T)] = E_\theta\big[E[\delta(X) \mid T]\big] = E_\theta[\delta(X)]$ by the tower property, so unbiasedness is preserved.
- Computable: Since $T$ is sufficient, the conditional distribution of $X$ given $T$ is $\theta$-free, so we can actually calculate the conditional expectation without knowing $\theta$.
- Variance Reduction: By the law of total variance, $\mathrm{Var}_\theta(\delta) = \mathrm{Var}_\theta(\delta^*) + E_\theta[\mathrm{Var}(\delta \mid T)]$, thus $\mathrm{Var}_\theta(\delta^*) \le \mathrm{Var}_\theta(\delta)$. Smoothing away noise by conditioning on the sufficient statistic improves the estimator (strictly, unless $\delta$ is already a function of $T$). This is the Rao-Blackwell Theorem. A compact worked instance follows below.
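A compact worked instance (a sketch; compare with the fuller simulation in Appendix A): to estimate $e^{-\lambda} = P(X_1 = 0)$ from a Poisson sample, start from the crude unbiased estimator $\delta = \mathbf{1}\{X_1 = 0\}$ and condition on $T = \sum_i X_i$; since $X_1 \mid T = t \sim \mathrm{Bin}(t, 1/n)$, the Rao-Blackwellized estimator is $\delta^* = (1 - 1/n)^T$.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n, trials = 2.0, 10, 50000

delta_raw, delta_rb = [], []
for _ in range(trials):
    x = rng.poisson(lam, size=n)
    delta_raw.append(float(x[0] == 0))          # crude unbiased estimator of e^{-lambda}
    delta_rb.append((1 - 1 / n) ** x.sum())     # E[delta | sum X] = (1 - 1/n)^T

print("target e^{-lambda}      :", np.exp(-lam))
print("mean / var (raw)        :", np.mean(delta_raw), np.var(delta_raw))
print("mean / var (Rao-Blackw.):", np.mean(delta_rb), np.var(delta_rb))
# Both estimators are unbiased; conditioning on T slashes the variance.
```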
9. Information Bottleneck & Deep Learning
Tishby’s Information Bottleneck principle: a representation $Z$ of the input $X$ should maximize $I(Z; Y)$ while minimizing $I(Z; X)$. Ideally $Z$ is a sufficient statistic for $Y$. Minimal sufficient statistic = Perfect compression. In Deep Learning, it is argued that SGD finds such representations. However, for deterministic networks, $I(Z; X)$ is infinite (or undefined) for continuous variables; we need to add noise to the weights/activations to make this rigorous. If the weights are random (Bayesian NN), the posterior predictive distribution depends on sufficient statistics of the training data. Since a finite neural network is not an exponential family, the sufficient statistic is essentially the whole dataset. But as the width goes to infinity, in the NTK regime, the sufficient statistics become the kernel (Gram) matrix of the data.
10. Sufficient Statistics in Time Series: The State
In i.i.d. settings, sufficiency is about compressing static points. In Time Series, we process data streams $X_{1:t} = (X_1, \dots, X_t)$. We seek a summary $S_t = f(X_{1:t})$ such that:
$$P(X_{t+1:\infty} \mid X_{1:t}) = P(X_{t+1:\infty} \mid S_t).$$
This is the State of the system. If such a finite-dimensional $S_t$ exists, the process is a Hidden Markov Model (or State Space Model).
The Kalman Filter: For linear Gaussian systems, the sufficient statistics for the future are the posterior mean and covariance $(\hat{x}_t, P_t)$ of the latent state. The Kalman Filter is simply the recursive update of these sufficient statistics.
The fact that we can track a dynamic system with fixed memory is a direct consequence of the Pitman-Koopman-Darmois theorem applied to the conditional transition densities. If the noise were Cauchy, we would need to store the entire history $X_{1:t}$.
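Here is a minimal scalar Kalman filter sketch (illustrative; the model is a random walk observed in Gaussian noise, with made-up noise levels `q` and `r`): the posterior mean and variance are the only quantities carried from step to step, i.e., the sufficient statistics of the past for the future.

```python
import numpy as np

rng = np.random.default_rng(5)
T, q, r = 100, 0.1, 1.0           # steps, process noise var, observation noise var

# Simulate a latent random walk and noisy observations of it.
x = np.cumsum(rng.normal(0, np.sqrt(q), size=T))
y = x + rng.normal(0, np.sqrt(r), size=T)

# Kalman filter: carry only the posterior mean and variance (the "state").
mean, var = 0.0, 1.0
for t in range(T):
    # Predict: random-walk dynamics inflate the posterior variance.
    var += q
    # Update: fold in the new observation y[t].
    gain = var / (var + r)
    mean = mean + gain * (y[t] - mean)
    var = (1 - gain) * var

print(f"final truth {x[-1]:+.3f}, filtered mean {mean:+.3f}, posterior var {var:.3f}")
```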
11. Approximate Sufficiency: Le Cam’s Deficiency
What if a statistic is “almost” sufficient? Lucien Le Cam formalized this using Decision Theory. Two experiments $\mathcal{E}$ (observing $X \sim P_\theta$) and $\mathcal{F}$ (observing $Y \sim Q_\theta$) are equivalent if for any decision rule in $\mathcal{E}$, there exists a rule in $\mathcal{F}$ with the same risk (and vice versa).
Deficiency Distance:
$$\delta(\mathcal{E}, \mathcal{F}) = \inf_{K}\, \sup_{\theta}\, \big\| K P_\theta - Q_\theta \big\|_{\mathrm{TV}},$$
where $K$ is a Markov kernel (randomization). If $\delta = 0$, the statistic is sufficient. If $\delta \le \epsilon$, it is $\epsilon$-sufficient. This is crucial for Privacy (Differential Privacy): we want statistics that are sufficient for the signal but insufficient for the user’s identity.
12. Conclusion: The Conservation of Information
R.A. Fisher’s original insight remains one of the most profound in statistics: Inference is Data Reduction. We start with a massive, high-dimensional object $X$ and attempt to distill it into a tiny object $T(X)$. The theory of Sufficient Statistics tells us exactly what we can throw away. The Factorization Theorem gives us the algebraic tool to identify these summaries. Exponential Families provide the geometric structure where these summaries are finite-dimensional. And finally, the interplay between sufficiency, ancillarity (Basu), and completeness (Lehmann-Scheffé) provides the roadmap for optimal estimation. In the modern era of Deep Learning, “Sufficiency” has morphed into “Representation Learning”. But the goal remains the same: to find the minimal coordinates of the manifold on which the data lives.
Historical Timeline
| Year | Event | Significance |
|---|---|---|
| 1922 | R.A. Fisher | Defines “Sufficiency” in his foundational paper. |
| 1935 | Koopman, Pitman, Darmois | Independently link Sufficiency to Generalized Exponential Families. |
| 1945 | C.R. Rao | Proves the Rao-Blackwell Theorem (Variance Reduction). |
| 1949 | Halmos & Savage | Prove the Factorization Theorem using Measure Theory. |
| 1955 | D. Basu | Proves Basu’s Theorem (Independence of Ancillary Statistics). |
| 1964 | Le Cam | Introduces Deficiency Distance (Approximate Sufficiency). |
| 1999 | Tishby | Information Bottleneck Principle. |
Appendix A: Python Simulation of Rao-Blackwellization & Fisher Information
Let’s empirically verify two things:
- Rao-Blackwellization reduces variance (The “Improvement” Theorem).
- Fisher Information is lost if we use a non-sufficient statistic.
We estimate $\lambda$ for a Poisson($\lambda$) sample $X_1,\dots,X_n$.
- Estimator 1 (Raw): $\hat\lambda_{\text{raw}} = X_1$ (use only the first observation). Unbiased.
- Estimator 2 (Rao-Blackwell): $\hat\lambda_{\text{RB}} = E[X_1 \mid \sum_i X_i] = \bar{X}$. MVUE.
- Estimator 3 (Bad Statistic): Estimate $\lambda$ from $k = \#\{i : X_i > 0\}$ (binary compression).
```python
import numpy as np


def fisher_info_loss_demo(true_lambda=5.0, n_samples=10, n_trials=5000):
    estimates_raw = []
    estimates_rb = []
    estimates_bad = []

    for _ in range(n_trials):
        X = np.random.poisson(true_lambda, n_samples)

        # 1. Raw estimator: only the first sample (unbiased, but ignores the rest).
        estimates_raw.append(X[0])

        # 2. Rao-Blackwellized estimator: E[X_1 | sum(X)] = sample mean.
        estimates_rb.append(np.mean(X))

        # 3. Bad statistic (lossy compression): we only keep k = #{X_i > 0}.
        #    k ~ Binomial(n, p) with p = 1 - e^{-lambda}, so lambda = -log(1 - p)
        #    and the plug-in estimate uses p_hat = k/n.
        k = np.sum(X > 0)
        if k == n_samples:
            # Smooth the boundary case to avoid log(0).
            est_bad = -np.log(1 - (n_samples - 0.5) / n_samples)
        else:
            est_bad = -np.log(1 - k / n_samples)
        estimates_bad.append(est_bad)

    print(f"Theory Variance (Raw):    {true_lambda:.4f}")
    print(f"Empirical Var (Raw):      {np.var(estimates_raw):.4f}")
    # Cramer-Rao lower bound for n samples: lambda / n.
    crlb = true_lambda / n_samples
    print(f"CRLB (Optimal):           {crlb:.4f}")
    print(f"Empirical Var (RB):       {np.var(estimates_rb):.4f}")
    print(f"Empirical Var (Bad Stat): {np.var(estimates_bad):.4f}")
    # The 'bad' statistic throws away the exact counts, keeping only binary
    # information, so its variance sits well above the CRLB.


if __name__ == "__main__":
    fisher_info_loss_demo()
```
Appendix B: Completeness and Basu’s Theorem Proof Details
B.1 Completeness of Exponential Families
The exponential family $p_\eta(x) = h(x)\exp\big(\eta^\top T(x) - A(\eta)\big)$ is Complete if the natural parameter space contains an open rectangle. Proof Idea: Suppose $E_\eta[f(T)] = 0$ for all $\eta$. Then $\int f(t)\, e^{\eta^\top t}\, d\nu_T(t) = 0$, where $\nu_T$ is the base measure induced on $T$. This is the (two-sided) Laplace Transform of $f\, d\nu_T$. If a Laplace transform vanishes on an open neighborhood, the function is 0 a.e. Thus $f = 0$ a.e. This implies Completeness.
B.2 Counter-Example to Completeness
Consider the Uniform distribution $U(\theta, \theta + 1)$ with a single observation $X$ (for $n$ observations the minimal sufficient statistic is the pair of extreme order statistics). Consider $f(X) = \sin(2\pi X)$. Then $E_\theta[f(X)] = \int_\theta^{\theta+1} \sin(2\pi x)\, dx = 0$: the integral of sine over a full period is always 0, unrelated to the shift $\theta$. But $f$ is not zero. Thus, the Uniform$(\theta, \theta+1)$ family is Incomplete. Basu’s theorem fails here (ancillaries might be dependent on sufficient statistics).
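A one-line numerical confirmation of this counter-example (illustrative): the integral of $\sin(2\pi x)$ over any unit-length interval is zero, regardless of where the interval starts.

```python
import numpy as np
from scipy.integrate import quad

for theta in [0.0, 0.3, 1.7, -2.2]:
    # E_theta[sin(2*pi*X)] for X ~ U(theta, theta + 1).
    val, _ = quad(lambda x: np.sin(2 * np.pi * x), theta, theta + 1)
    print(f"theta={theta:+.1f}  E[sin(2*pi*X)] = {val:+.2e}")   # all ~ 0
```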
Appendix C: Deriving Moments from the Partition Function
The power of Exponential Families lies in the log-partition function $A(\eta)$. Let’s derive the mean and variance for common distributions using $E_\eta[T] = A'(\eta)$ and $\mathrm{Var}_\eta(T) = A''(\eta)$.
1. Bernoulli Distribution $\mathrm{Ber}(p)$. Canonical parameter: $\eta = \log\frac{p}{1-p}$ (Logit). Inverse: $p = \frac{1}{1 + e^{-\eta}}$. Log-partition: $A(\eta) = \log(1 + e^\eta)$. Mean: $A'(\eta) = \frac{e^\eta}{1 + e^\eta} = p$. Variance: $A''(\eta) = p(1-p)$.
2. Poisson Distribution $\mathrm{Poi}(\lambda)$. Canonical parameter: $\eta = \log\lambda$. Inverse: $\lambda = e^\eta$. Log-partition: $A(\eta) = e^\eta$. Mean: $A'(\eta) = e^\eta = \lambda$. Variance: $A''(\eta) = e^\eta = \lambda$.
3. Gaussian Distribution (Known Variance $\sigma^2$) $N(\mu, \sigma^2)$. Canonical parameter: $\eta = \mu/\sigma^2$ with $T(x) = x$. Log-partition: $A(\eta) = \frac{\sigma^2\eta^2}{2} = \frac{\mu^2}{2\sigma^2}$. Mean: $A'(\eta) = \sigma^2\eta = \mu$. Variance: $A''(\eta) = \sigma^2$.
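A finite-difference check of these derivations (a sketch; the helper functions `num_grad` and `num_hess` are mine): numerically differentiating the log-partition functions of the Bernoulli and Poisson families recovers the stated means and variances.

```python
import numpy as np

def num_grad(f, eta, h=1e-5):
    # Central-difference approximation of A'(eta).
    return (f(eta + h) - f(eta - h)) / (2 * h)

def num_hess(f, eta, h=1e-4):
    # Central-difference approximation of A''(eta).
    return (f(eta + h) - 2 * f(eta) + f(eta - h)) / h ** 2

def A_bernoulli(eta):
    return np.log1p(np.exp(eta))      # A(eta) = log(1 + e^eta)

def A_poisson(eta):
    return np.exp(eta)                # A(eta) = e^eta

eta = 0.7
p, lam = 1 / (1 + np.exp(-eta)), np.exp(eta)
print("Bernoulli:", num_grad(A_bernoulli, eta), "vs mean p =", p)
print("          ", num_hess(A_bernoulli, eta), "vs variance p(1-p) =", p * (1 - p))
print("Poisson:  ", num_grad(A_poisson, eta),   "vs mean lambda =", lam)
print("          ", num_hess(A_poisson, eta),   "vs variance lambda =", lam)
```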
Appendix D: Quantum Sufficiency
The concept of sufficiency extends to Quantum Mechanics. Let a quantum system be described by a density matrix $\rho_\theta$ acting on a Hilbert space $\mathcal{H}$. A “statistic” is replaced by a Quantum Channel (CPTP map) $\mathcal{E}$. When is a channel sufficient? When there exists a recovery channel $\mathcal{R}$ such that:
$$\mathcal{R}\big(\mathcal{E}(\rho_\theta)\big) = \rho_\theta \quad \text{for all } \theta.$$
Factorization in Quantum Mechanics: This is related to the Petz Recovery Map. A measurement (POVM) is sufficient for a family of states if the states commute with the measurement operators, or more generally, if the Quantum Fisher Information is preserved. Theorem (Monotonicity of Relative Entropy):
$$D\big(\mathcal{E}(\rho)\,\big\|\,\mathcal{E}(\sigma)\big) \le D(\rho\,\|\,\sigma).$$
Equality holds if and only if $\mathcal{E}$ is a sufficient statistic (sufficient channel) for the family $\{\rho, \sigma\}$. This connects sufficiency to the Second Law of Thermodynamics: data processing can never increase the distinguishability (relative entropy) between states.
Appendix E: Sufficiency in Latent Variable Models (The EM Algorithm)
Consider a model with observed data $X$ and latent variables $Z$. The complete-data likelihood $p_\theta(X, Z)$ often belongs to an exponential family with sufficient statistics $T(X, Z)$. However, we only observe $X$. The Expectation-Maximization (EM) algorithm rests on the observation that we can compute the expected sufficient statistics. E-Step: Compute the posterior of the latent variables $p_{\theta^{(t)}}(Z \mid X)$. Compute the expected sufficient statistics:
$$\bar{T} = E_{Z \sim p_{\theta^{(t)}}(\cdot \mid X)}\big[T(X, Z)\big].$$
M-Step: Update $\theta$ by Maximum Likelihood, matching the model moments to the expected statistics: $E_{\theta^{(t+1)}}[T] = \bar{T}$.
For GMMs, with responsibilities $r_{ik} = p(z_i = k \mid x_i)$, the expected sufficient statistics per component are $\sum_i r_{ik}$ (mass), $\sum_i r_{ik}\, x_i$ (first moment), and $\sum_i r_{ik}\, x_i x_i^\top$ (second moment). This shows that Sufficiency is the “Computational Engine” of the EM algorithm. Without the factorization theorem, EM would be computationally intractable for most models.
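A compact 1-D, two-component GMM fit written to make the expected sufficient statistics explicit (a sketch with hard-coded initial values and a fixed number of iterations, which are my choices, not from the text):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

K = 2
w, mu, var = np.full(K, 1 / K), np.array([-1.0, 1.0]), np.ones(K)

for _ in range(50):
    # E-step: responsibilities r[i, k] = p(z_i = k | x_i).
    dens = np.stack([w[k] * norm.pdf(x, mu[k], np.sqrt(var[k])) for k in range(K)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)

    # Expected sufficient statistics: mass, first and second moments per component.
    N = r.sum(axis=0)                  # sum_i r_ik
    S1 = r.T @ x                       # sum_i r_ik * x_i
    S2 = r.T @ (x ** 2)                # sum_i r_ik * x_i^2

    # M-step: moment matching against the expected statistics.
    w = N / len(x)
    mu = S1 / N
    var = S2 / N - mu ** 2

print("weights:", np.round(w, 3), " means:", np.round(mu, 3), " vars:", np.round(var, 3))
```

Each M-step is exactly the moment-matching update of Section 6, applied to the expected statistics rather than to fully observed ones.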
Appendix F: Glossary of Definitions
- Ancillary Statistic: A statistic whose distribution does not depend on $\theta$.
- Completeness: A family where no non-zero function has expectation zero for all parameters.
- Exponential Family: A family of distributions where the log-likelihood is linear in the sufficient statistics.
- Fisher Information: Measures the amount of information that an observable random variable $X$ carries about an unknown parameter $\theta$.
- Interaction Information: Can be positive or negative (Synergy/Redundancy).
- Minimal Sufficient Statistic: The coarsest possible sufficient statistic.
- Rao-Blackwellization: Improving an estimator by conditioning on a sufficient statistic.
- Sufficient Statistic: A statistic that captures all information about $\theta$ contained in the data $X$.
References
1. Fisher, R. A. (1922). “On the mathematical foundations of theoretical statistics”.
2. Halmos, P. R., & Savage, L. J. (1949). “Application of the Radon-Nikodym theorem to the theory of sufficient statistics”.
3. Basu, D. (1955). “On statistics independent of a complete sufficient statistic”.
4. Lehmann, E. L., & Scheffé, H. (1950). “Completeness, similar regions, and unbiased estimation”.
5. Pitman, E. J. G. (1936). “Sufficient statistics and intrinsic accuracy”.
6. Csiszar, I. (1975). “I-divergence geometry of probability distributions and minimization problems”.