Sufficient Statistics & Information

1. When can you throw data away?

You watch $X$ sitting in some space $\mathcal{X}$ and it comes from $P_\theta$ with $\theta \in \Theta$, and maybe $X$ is a million video frames and maybe $\theta$ is a single number like the gravitational constant that made the motion happen.

A statistic $T(X)$ takes $\mathcal{X}$ and sends it to $\mathcal{T}$ and squeezes it down, and the real question is whether this squeezing loses anything.

$T(X)$ is sufficient for $\theta$ when the conditional distribution of $X$ given $T(X)=t$ does not lean on $\theta$ at all.

$$P_\theta(X \in A \mid T(X)=t) = P(X \in A \mid T(X)=t)$$

Once you know $T(X)=t$ you can cook up fake data that looks exactly like the real thing using just $t$ and a random number generator and no knowledge of $\theta$ at all, and so $X$ tells you nothing about $\theta$ that $T(X)$ has not already told you.
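
Here is a tiny simulation of that idea, a sketch assuming i.i.d. $\text{Bernoulli}(p)$ data with $T(X) = \sum_i X_i$: conditional on the total $t$, every arrangement of $t$ ones among $n$ slots is equally likely, so you can resample the data without ever touching $p$.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_data_given_t(t, n):
    """Draw X | T = t for i.i.d. Bernoulli data with T = sum(X).

    Given t, every placement of t ones among n slots is equally likely,
    so no knowledge of the true p is needed."""
    x = np.zeros(n, dtype=int)
    x[rng.choice(n, size=t, replace=False)] = 1
    return x

p_true, n = 0.3, 20
x_real = rng.binomial(1, p_true, size=n)          # real data from an unknown p
x_fake = fake_data_given_t(int(x_real.sum()), n)  # fake data from t alone
print(x_real.sum(), x_fake.sum())                 # same sufficient statistic
```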

2. Halmos-Savage factorization

In practice nobody ever checks sufficiency by working out conditional distributions by hand and the tool everyone reaches for is factorization.

Theorem (Halmos-Savage, 1949). Take $\{P_\theta : \theta \in \Theta\}$ dominated by a $\sigma$-finite measure $\mu$; then $T(X)$ is sufficient for $\theta$ exactly when the density $f_\theta(x) = \frac{dP_\theta}{d\mu}(x)$ splits up as

$$f_\theta(x) = g_\theta(T(x)) \, h(x)$$

where $g_\theta \ge 0$ touches $x$ only through $T(x)$ and $h(x) \ge 0$ has no $\theta$ in it.
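
A quick worked case to see the split: for $n$ i.i.d. $\text{Poisson}(\lambda)$ observations,

$$f_\lambda(x) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \underbrace{\exp\!\Big( \log\lambda \sum_i x_i - n\lambda \Big)}_{g_\lambda(T(x))} \; \underbrace{\prod_{i=1}^n \frac{1}{x_i!}}_{h(x)}$$

so $T(x) = \sum_i x_i$ is sufficient by inspection.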

Why it works ($\implies$ direction). When $T$ is sufficient you can write $P_\theta(A) = \int_{\mathcal{T}} P(A \mid T=t) \, dP_\theta^T(t)$, and the conditional piece $P(X=x \mid T=t)$ has no $\theta$ in it and that piece becomes $h(x)$, and the marginal of $T$ carries all the $\theta$-dependence and that becomes $g_\theta(t)$.

For the full measure-theoretic argument take $\mathcal{A} = \sigma(T)$ and build a dominating mixture $\lambda = \sum_i 2^{-i} P_{\theta_i}$, and then $\frac{dP_\theta}{d\lambda}$ is $\mathcal{A}$-measurable and so a function of $T$, and $\frac{d\lambda}{d\mu}$ leans on the mixture and not on any specific $\theta$. The chain rule $\frac{dP_\theta}{d\mu} = \frac{dP_\theta}{d\lambda} \cdot \frac{d\lambda}{d\mu}$ hands you the factorization.

3. Fisher Information

Sufficiency has a clean story in terms of Fisher Information.

$$I_X(\theta) = \mathbb{E}_\theta \left[ \left( \nabla_\theta \log f_\theta(X) \right)^2 \right]$$

Data processing cannot make new information and so $I_T(\theta) \le I_X(\theta)$ for any statistic $T$.

And $T$ is sufficient exactly when equality holds everywhere, and the proof is short. With factorization you get $\ell(\theta) = \log g_\theta(T) + \log h(X)$ and so the score $\nabla_\theta \ell(\theta) = \nabla_\theta \log g_\theta(T)$ leans only on $T$, and the variance of the score (which is the Fisher Information) comes out the same whether you work it out from $X$ or from $T$.

Sufficiency holds exactly when no information is lost at all.
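
Concretely, take $n$ i.i.d. $\text{Bernoulli}(p)$ observations. Then

$$I_X(p) = \frac{n}{p(1-p)}, \qquad I_{\sum_i X_i}(p) = \frac{n}{p(1-p)}, \qquad I_{X_1}(p) = \frac{1}{p(1-p)},$$

so the sum (sufficient) keeps every bit of Fisher information while $X_1$ alone keeps only a $1/n$ fraction of it.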


4. Minimal sufficiency and the likelihood ratio

There are always lots of sufficient statistics and $X$ itself is sufficient in a trivial way, and what you want is the most squeezed one and the coarsest partition of the sample space that still loses nothing.

$T$ is minimal sufficient when, for any other sufficient statistic $S$, $T$ is a function of $S$.

Lehmann-Scheffe criterion: $T(x)$ is minimal sufficient iff

$$\frac{L(\theta; x)}{L(\theta; y)} \text{ is constant in } \theta \iff T(x) = T(y)$$

Two data points land on the same summary exactly when their likelihood ratio goes flat.

Uniform $U[0, \theta]$. $L(\theta; x) = \theta^{-n} \mathbb{I}(\max(x) \le \theta)$. The ratio stays constant exactly when $x_{(n)} = y_{(n)}$ and so $T(X) = \max(X)$ is minimal sufficient, and one number holds everything about $\theta$ that $n$ observations have in them.

Cauchy. $f(x; \theta) = \frac{1}{\pi(1+(x-\theta)^2)}$. The likelihood ratio is a degree-$2n$ rational function of $\theta$ and for it to stay constant the polynomials have to share all their roots, and that forces $\{x_i\} = \{y_i\}$ as sets. The minimal sufficient statistic is the full vector of order statistics and no squeezing is possible.

Cauchy tails are so heavy that every single observation tells you something about $\theta$ on its own and no lower-dimensional summary exists.

5. Pitman-Koopman-Darmois

When can you squeeze $n$ samples into a fixed number of summaries no matter how big $n$ gets? Almost never.

Theorem. Under regularity conditions where the support does not lean on $\theta$, when a family has a sufficient statistic of dimension $k$ that does not grow with $n$ the family has to be an exponential family.

$$f(x \mid \theta) = h(x) \exp\big( \eta(\theta) \cdot T(x) - A(\theta) \big)$$

Sketch. For $n$ i.i.d. samples factorization hands you $\sum_{i=1}^n \log f(x_i \mid \theta) = \alpha(T(\mathbf{x}), \theta) + \beta(\mathbf{x})$, and you take derivatives in both $x_i$ and $x_j$ to get cross-derivative constraints. For the Jacobian that maps data to parameter gradients to have rank at most $k$ as $n$ grows you need $\nabla_\theta \log f(x \mid \theta) = \sum_{j=1}^k w_j'(\theta)\, t_j(x)$, and you integrate and out pops the exponential family form.

Fixed-dimensional sufficiency is so tight that it basically picks out exponential families and nothing else.

6. Exponential families and convex duality

Write the canonical form:

$$p(x \mid \eta) = h(x) \exp\big( \eta^T T(x) - A(\eta) \big)$$

Three facts make exponential families so easy to work with.

$A(\eta)$ is convex. Because $A(\eta) = \log \int h(x)\, e^{\eta^T T(x)}\, dx$ is a log-moment-generating function, and Hölder's inequality does the rest.

Moments from derivatives.

$$\nabla_\eta A(\eta) = \mathbb{E}[T(X)], \qquad \nabla^2_\eta A(\eta) = \text{Cov}(T(X))$$

Covariance is PSD and so $A$ is strictly convex as long as you have a minimal representation.

MLE is moment matching. The log-likelihood for data $D$ is $\mathcal{L}(\eta) = n \eta^T \bar{T} - n A(\eta) + \text{const}$ with $\bar{T} = \frac{1}{n} \sum_i T(x_i)$, and you set the gradient to zero.

$$\nabla \mathcal{L} = n \bar{T} - n \nabla A(\eta) = 0 \implies \mathbb{E}_\eta[T(X)] = \frac{1}{n} \sum_i T(x_i)$$

The MLE is the parameter that makes the model’s expected sufficient statistics line up with the empirical ones, and information geometry reads this as projecting the empirical distribution onto the model manifold.
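
As a sketch of the moment-matching view in code, take the canonical Poisson case ($\eta = \log\lambda$, $A(\eta) = e^\eta$): Newton's method on $\nabla A(\eta) = \bar{T}$ lands on the same answer as the closed form $\hat{\lambda} = \bar{x}$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(3.5, size=1000)
T_bar = x.mean()                 # empirical average of T(x) = x

# Canonical Poisson: A(eta) = exp(eta), so A'(eta) = A''(eta) = exp(eta).
eta = 0.0
for _ in range(25):
    eta -= (np.exp(eta) - T_bar) / np.exp(eta)   # Newton step on A'(eta) = T_bar

print(np.exp(eta), T_bar)        # MLE lambda-hat matches the empirical mean
```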


7. Basu’s Theorem

An ancillary statistic $A(X)$ has a distribution that does not lean on $\theta$ at all. Take $X_1 - X_2$ when $X_i \sim \mathcal{N}(\theta, 1)$ and you get one.

Ancillaries hold no likelihood information about θ\theta but they do tell you how precise your experiment was.

Basu's Theorem. When $T$ is boundedly complete and sufficient and $A$ is ancillary then $T \perp A$.

Proof. Take $B \in \sigma(A)$ and because $A$ is ancillary $P_\theta(B)$ stays constant in $\theta$. Let $g(t) = P(B \mid T=t)$ and sufficiency strips the $\theta$ out. Then $\mathbb{E}_\theta[g(T)] = P_\theta(B)$ is constant and so $\mathbb{E}_\theta[g(T) - P(B)] = 0$ for all $\theta$. Completeness of $T$ forces $g(T) = P(B)$ almost surely. And so $P(B \mid T) = P(B)$ and that is independence. $\square$

Why this matters for the $t$-test. For $X_i \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ treated as fixed, $\bar{X}$ is complete sufficient for $\mu$ and $S^2$ is ancillary for $\mu$, and Basu hands you $\bar{X} \perp S^2$. The independence of $\bar{X}$ and $S^2$ is why the $t$-test works, because the top and the bottom are independent. And it only works for Gaussians by Geary's theorem: try it with any other distribution and the independence falls apart.
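
A quick empirical check of that last claim (a sketch, plain NumPy): under Gaussian sampling the sample mean and sample variance come out uncorrelated, while a skewed family like the exponential makes the dependence obvious.

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_var_correlation(sampler, n=10, trials=20000):
    """Correlation between the sample mean and sample variance across trials."""
    data = sampler((trials, n))
    xbar = data.mean(axis=1)
    s2 = data.var(axis=1, ddof=1)
    return np.corrcoef(xbar, s2)[0, 1]

print("Gaussian:   ", mean_var_correlation(lambda size: rng.normal(0.0, 1.0, size)))
print("Exponential:", mean_var_correlation(lambda size: rng.exponential(1.0, size)))
# Near zero for the Gaussian (Basu gives full independence); clearly positive
# for the exponential, where big means come with big spreads.
```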


8. Rao-Blackwellization

You have a bad unbiased estimator $\hat{\theta}$ and a sufficient statistic $T$ and you set $\tilde{\theta} = \mathbb{E}[\hat{\theta} \mid T]$.

It is still unbiased because $\mathbb{E}[\tilde{\theta}] = \mathbb{E}[\mathbb{E}[\hat{\theta} \mid T]] = \theta$.

And it is computable because sufficiency means the conditional distribution has no θ\theta in it.

And by the law of total variance you get

$$\text{Var}(\hat{\theta}) = \text{Var}(\tilde{\theta}) + \mathbb{E}[\text{Var}(\hat{\theta} \mid T)] \ge \text{Var}(\tilde{\theta})$$

Conditioning on the sufficient statistic always cuts variance down and it cuts it strictly unless the estimator was already a function of TT.
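
A one-line worked case tying this to the Poisson simulation at the end of the post: take $\hat{\theta} = X_1$ and $T = \sum_i X_i$ for i.i.d. Poisson data. By symmetry

$$\tilde{\theta} = \mathbb{E}\Big[X_1 \,\Big|\, \textstyle\sum_i X_i\Big] = \frac{1}{n} \sum_i X_i = \bar{X},$$

so Rao-Blackwellizing the wasteful single-observation estimator hands you the sample mean, with variance $\lambda/n$ instead of $\lambda$.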


9. Information Bottleneck and neural networks

Tishby's Information Bottleneck asks you to find a $T$ that pushes $I(T; Y)$ up and pulls $I(T; X)$ down, and the ideal $T$ is a sufficient statistic for $Y$ and minimal sufficient means perfect compression.

People have argued that SGD quietly finds these representations. The trouble is that for deterministic networks with continuous inputs $I(T; X)$ is infinite or undefined and you need noise in the weights or activations to make any of this well-defined.

For Bayesian neural networks the posterior predictive leans on sufficient statistics of the training data. But neural networks are not exponential families because they have finite width, and so the sufficient statistic is the whole dataset and no compression actually happens.

In the NTK regime when the width goes to infinity the sufficient statistics collapse down to the kernel matrix and that gives you a real finite-dimensional summary.


10. Sufficient statistics in time series

For i.i.d. data sufficiency squeezes $N$ static points down. In time series you are chewing through a stream $x_1, x_2, \ldots$ and you want a running summary $h_t = T(x_{1:t})$ with

$$P(x_{t+1:\infty} \mid x_{1:t}) = P(x_{t+1:\infty} \mid h_t)$$

This $h_t$ is the state, and when a finite-dimensional $h_t$ exists you have a hidden Markov model or a state space model.

Kalman filter. For linear Gaussian systems the sufficient statistics for the future are $(\mu_t, \Sigma_t)$ and the Kalman filter is just their recursive update.

$$\mu_t = A \mu_{t-1} + K_t (y_t - C A \mu_{t-1}), \qquad \Sigma_t = (I - K_t C)\, \Sigma_{t \mid t-1}$$

Fixed-memory tracking of a dynamic system. This is Pitman-Koopman-Darmois pushed onto conditional transition densities, and if the noise were Cauchy and not Gaussian you would have to hold onto the whole history.
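
A minimal one-dimensional sketch of that recursion (the scalar system constants $a$, $c$, $Q$, $R$ below are made up for illustration):

```python
import numpy as np

# Hypothetical scalar linear-Gaussian system: x_t = a x_{t-1} + w,  y_t = c x_t + v.
a, c, Q, R = 0.95, 1.0, 0.1, 0.5

rng = np.random.default_rng(3)
x, mu, sigma2 = 0.0, 0.0, 1.0   # true state, filter mean, filter variance

for t in range(100):
    x = a * x + rng.normal(0.0, np.sqrt(Q))   # latent dynamics
    y = c * x + rng.normal(0.0, np.sqrt(R))   # observation

    # Predict: push (mu, sigma2) through the dynamics.
    mu_pred = a * mu
    sigma2_pred = a * a * sigma2 + Q

    # Update: Kalman gain, then the recursion from the text.
    K = sigma2_pred * c / (c * c * sigma2_pred + R)
    mu = mu_pred + K * (y - c * mu_pred)
    sigma2 = (1.0 - K * c) * sigma2_pred

print(mu, sigma2, x)  # two numbers summarize the whole observation history
```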

11. Approximate sufficiency and Le Cam’s deficiency

What happens when a statistic is only almost sufficient?

Le Cam pinned this down with decision theory. Two experiments $\mathcal{E}$ where you watch $X$ and $\mathcal{F}$ where you watch $T(X)$ are equivalent when every decision rule in one can be matched in the other with the same risk.

The deficiency distance is

$$\delta(\mathcal{E}, \mathcal{F}) = \inf_K \sup_\theta \| P_\theta - K Q_\theta \|_{TV}$$

where $K$ is a Markov kernel, and $\delta = 0$ means $T$ is sufficient and $\delta < \epsilon$ means $T$ is $\epsilon$-sufficient.

And this hooks up to differential privacy. You want statistics that are sufficient for the signal and insufficient for picking out the user.


12. Fisher’s program

Fisher's 1922 insight was that inference is data reduction. You start with a high-dimensional $X$ and you boil it down to a tiny $\hat{\theta}$, and sufficient statistics tell you exactly what you can throw away and factorization gives you the tool to check, and exponential families are the spot where finite-dimensional summaries actually exist, and Basu and Lehmann-Scheffe fill in the links between sufficiency and ancillarity and invariance.

In modern ML the word “sufficiency” goes by “representation learning” and the goal is the same. Find the smallest coordinates of whatever manifold the data actually lives on.


Simulation of Rao-Blackwellization and Fisher Information

We estimate $\lambda$ for a Poisson using three estimators.

  • $X_1$ alone, unbiased and wasteful.
  • $\bar{X}$, which is the Rao-Blackwell improvement and the MVUE.
  • $\hat{\lambda}$ from $T(X) = \mathbb{I}(X > 0)$, which is binary compression and throws the counts away.

The third estimator throws the count information away and retains far less Fisher information, so its estimate of $\lambda$ should come out much worse.

```python
import numpy as np

def fisher_info_loss_demo(true_lambda=5.0, n_samples=10, n_trials=5000):
    estimates_raw = []
    estimates_rb = []
    estimates_bad = []
    for _ in range(n_trials):
        X = np.random.poisson(true_lambda, n_samples)

        # 1. Raw estimator: only the first sample.
        estimates_raw.append(X[0])

        # 2. Rao-Blackwellized estimator: the sample mean.
        estimates_rb.append(np.mean(X))

        # 3. Bad statistic (lossy compression): we only know how many are non-zero.
        #    k = count(X > 0), k ~ Binomial(n, 1 - e^-lambda).
        #    p = 1 - e^-lambda  =>  lambda = -log(1 - p), and the MLE for p is k/n.
        k = np.sum(X > 0)
        if k == n_samples:
            # Avoid log(0) when every sample is non-zero (smoothed plug-in).
            est_bad = -np.log(1 - (n_samples - 0.5) / n_samples)
        else:
            est_bad = -np.log(1 - k / n_samples)
        estimates_bad.append(est_bad)

    print(f"Theory Variance (Raw): {true_lambda:.4f}")
    print(f"Empirical Var (Raw): {np.var(estimates_raw):.4f}")
    # Cramer-Rao lower bound = lambda / n.
    crlb = true_lambda / n_samples
    print(f"CRLB (Optimal): {crlb:.4f}")
    print(f"Empirical Var (RB): {np.var(estimates_rb):.4f}")
    print(f"Empirical Var (Bad Stat): {np.var(estimates_bad):.4f}")
    # The 'Bad Stat' throws away the exact counts, keeping only binary info,
    # so it retains far less Fisher information. (With lambda = 5 the indicator
    # is nearly always 1, so the damage also shows up as severe bias.)
```
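
Calling it with the defaults (a sketch of what the theory predicts; printed numbers vary run to run):

```python
fisher_info_loss_demo()
# Theory predicts Var(raw) = 5.0 and CRLB = 0.5, with the Rao-Blackwellized
# estimator sitting near the CRLB. With lambda = 5 the binary statistic is
# nearly always 1, so its estimate pins near -log(0.05) ~ 3.0 and the damage
# shows up mostly as bias; rerun with true_lambda=1.0 to see the variance gap.
```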

Completeness and Basu’s Theorem

Completeness of exponential families

The family $f(x \mid \theta) = h(x) \exp(\theta \cdot T(x) - A(\theta))$ is complete when $\Theta$ contains an open rectangle.

Suppose $\mathbb{E}_\theta[g(T)] = 0$ for all $\theta$, and then $\int g(t)\, e^{\theta t - A(\theta)}\, dt = 0$ and that hands you $\int g(t)\, e^{\theta t}\, dt = 0$. This is the Laplace transform of $g$ and when a Laplace transform vanishes on an open set the function is zero almost everywhere, and the result follows.

When completeness fails

Take $U[\theta, \theta+1]$ and pick $g(X) = \sin(2\pi X)$. Then $\mathbb{E}_\theta[\sin(2\pi X)] = \int_\theta^{\theta+1} \sin(2\pi x) \, dx = 0$ for every $\theta$ because it is a full-period integral. But $\sin(2\pi X) \ne 0$ and so the uniform family is incomplete and Basu's theorem does not apply, and ancillaries can depend on sufficient statistics here.


Moments from the partition function

Every moment calculation you would ever want falls out by taking derivatives of A(η)A(\eta).

Bernoulli. $P(x) = \mu^x (1-\mu)^{1-x} = \exp\!\big( x \log \frac{\mu}{1-\mu} + \log(1-\mu) \big)$. The canonical parameter is $\eta = \log \frac{\mu}{1-\mu}$, which is the logit, and the inverse is $\mu = \sigma(\eta) = \frac{1}{1+e^{-\eta}}$. The log-partition is $A(\eta) = \log(1+e^\eta)$. Mean: $A'(\eta) = \sigma(\eta) = \mu$. Variance: $A''(\eta) = \mu(1-\mu)$.

Poisson. $P(x) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!} \exp( x \log \lambda - \lambda )$. The canonical parameter is $\eta = \log \lambda$ and the log-partition is $A(\eta) = e^\eta$. Mean: $A'(\eta) = e^\eta = \lambda$. Variance: $A''(\eta) = e^\eta = \lambda$.

Gaussian (known $\sigma^2 = 1$). $P(x) \propto \exp( x\mu - \mu^2/2 )$. The canonical parameter is $\eta = \mu$ and the log-partition is $A(\eta) = \eta^2/2$. Mean: $A'(\eta) = \eta = \mu$. Variance: $A''(\eta) = 1$.
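
These identities are easy to sanity-check with automatic differentiation; a sketch using jax.grad on the Bernoulli log-partition $A(\eta) = \log(1+e^\eta)$ (assuming jax is available, as in the simulation above):

```python
import jax
import jax.numpy as jnp

def A(eta):
    # Bernoulli log-partition function.
    return jnp.log1p(jnp.exp(eta))

eta = 0.7
mu = jax.grad(A)(eta)             # A'(eta)  = sigmoid(eta) = mean
var = jax.grad(jax.grad(A))(eta)  # A''(eta) = mu * (1 - mu) = variance

print(mu, jax.nn.sigmoid(eta))    # these agree
print(var, mu * (1 - mu))         # and so do these
```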


Quantum sufficiency

Sufficiency carries over to quantum mechanics and the objects just change shape. A density matrix $\rho_\theta$ on a Hilbert space $\mathcal{H}$ takes the place of the distribution and a quantum channel (CPTP map) $\mathcal{E}$ takes the place of the statistic.

And $\mathcal{E}$ is sufficient when there is a recovery channel $\mathcal{R}$ with

$$(\mathcal{R} \circ \mathcal{E})(\rho_\theta) = \rho_\theta \quad \forall \theta$$

This hooks into the Petz recovery map. A measurement (POVM) is sufficient for a family of states when the quantum Fisher information stays put.

The monotonicity of relative entropy gives the criterion

$$D(\rho_\theta \| \rho_{\theta'}) \ge D(\mathcal{E}(\rho_\theta) \| \mathcal{E}(\rho_{\theta'}))$$

and equality holds exactly when $\mathcal{E}$ is sufficient for $\{\rho_\theta, \rho_{\theta'}\}$. The data processing inequality stops any channel from making states easier to tell apart, and this is the same principle as in the classical case.


Sufficiency in latent variable models (EM)

Take a model with observed $X$ and latent $Z$. The complete-data likelihood $P(X, Z \mid \theta)$ often sits in an exponential family with sufficient statistics $T(X, Z)$. But we only get to see $X$.

EM works by computing expected sufficient statistics.

E-step: compute $P(Z \mid X, \theta_{\text{old}})$ and then

$$\bar{T} = \mathbb{E}_{Z \mid X} [ T(X, Z) ]$$

M-step: find $\theta_{\text{new}}$ by moment matching:

$$\mathbb{E}_{\theta_{\text{new}}} [ T(X, Z) ] = \bar{T}$$

For Gaussian mixtures the sufficient statistics are $\sum_i \gamma_{ik}$ for the cluster masses and $\sum_i \gamma_{ik} x_i$ for the first moments and $\sum_i \gamma_{ik} x_i x_i^T$ for the second moments. Without factorization EM would be unworkable for most models.
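
A compact sketch of that loop for a two-component 1-D Gaussian mixture, with the variances fixed at 1 for brevity (illustrative, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

# Parameters: mixing weights pi and component means mu; variances fixed at 1.
pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities gamma_{ik} = P(z_i = k | x_i, theta_old).
    log_w = np.log(pi) - 0.5 * (x[:, None] - mu[None, :]) ** 2
    gamma = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Expected sufficient statistics: cluster masses and first moments.
    N_k = gamma.sum(axis=0)
    S_k = gamma.T @ x

    # M-step: moment matching.
    pi = N_k / len(x)
    mu = S_k / N_k

print(pi, mu)  # close to the generating weights (0.3, 0.7) and means (-2, 3)
```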


Milestones

  • 1922. R.A. Fisher lays down “sufficiency” in his foundational paper on mathematical statistics and sets up the idea that a statistic can hold all the information about a parameter.
  • 1935. Koopman and Pitman and Darmois each prove on their own that fixed-dimensional sufficiency forces exponential family structure.
  • 1945. C.R. Rao proves what becomes the Rao-Blackwell theorem (Blackwell gives an independent proof in 1947) and shows that conditioning on a sufficient statistic always cuts variance down.
  • 1949. Halmos and Savage prove the factorization theorem with measure theory and put sufficiency on solid footing.
  • 1955. D. Basu proves that complete sufficient statistics come out independent of ancillary statistics.
  • 1972. Le Cam brings in deficiency distance and lets you compare approximate sufficiency between experiments.
  • 1999. Tishby puts forward the Information Bottleneck principle and hooks sufficiency up to representation learning and deep networks.

Terminology used in this post

A sufficient statistic holds everything $X$ knows about $\theta$, and formally the conditional distribution of $X$ given $T(X)$ does not lean on $\theta$ at all. A minimal sufficient statistic is the coarsest such statistic and it squeezes the data as far as it can go with no information loss. Completeness means no non-trivial function of $T$ has expectation zero for all parameter values and it is the condition that makes Basu's theorem work. The exponential family is the class of distributions whose log-likelihood is linear in the sufficient statistics and the Pitman-Koopman-Darmois theorem says these are basically the only families that allow fixed-dimensional sufficiency. Fisher Information measures how much an observation tells you about $\theta$ and Rao-Blackwellization is the move of conditioning an estimator on a sufficient statistic to cut variance down. An ancillary statistic has a distribution that does not lean on $\theta$ at all.


References

1. Fisher, R. A. (1922). “On the mathematical foundations of theoretical statistics”.

2. Halmos, P. R., & Savage, L. J. (1949). “Application of the Radon-Nikodym theorem to the theory of sufficient statistics”.

3. Basu, D. (1955). “On statistics independent of a complete sufficient statistic”.

4. Lehmann, E. L., & Scheffe, H. (1950). “Completeness, similar regions, and unbiased estimation”.

5. Pitman, E. J. G. (1936). “Sufficient statistics and intrinsic accuracy”.

6. Csiszar, I. (1975). “I-divergence geometry of probability distributions and minimization problems”.