Sufficient Statistics & Information

1. When can you throw data away?

You watch $X$ sitting in some space $\mathcal{X}$ and it comes from $P_\theta$ with $\theta \in \Theta$, and maybe $X$ is a million video frames and maybe $\theta$ is a single number like the gravitational constant that made the motion happen.

A statistic $T(X)$ takes $\mathcal{X}$ and sends it to $\mathcal{T}$ and squeezes it down, and the real question is whether this squeezing loses anything.

$T(X)$ is sufficient for $\theta$ when the conditional distribution of $X$ given $T(X)=t$ does not lean on $\theta$ at all.

$$P_\theta(X \in A \mid T(X)=t) = P(X \in A \mid T(X)=t)$$

Once you know $T(X)=t$ you can cook up fake data that looks exactly like the real thing using just $t$ and a random number generator and no knowledge of $\theta$ at all, and so $X$ tells you nothing about $\theta$ that $T(X)$ has not already told you.
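
Here is a tiny simulation of that idea, a sketch assuming i.i.d. $\text{Bernoulli}(p)$ data with $T(X) = \sum_i X_i$: conditional on the total $t$, every arrangement of $t$ ones among $n$ slots is equally likely, so you can resample the data without ever touching $p$.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_data_given_t(t, n):
    """Draw X | T = t for i.i.d. Bernoulli data with T = sum(X).

    Given t, every placement of t ones among n slots is equally likely,
    so no knowledge of the true p is needed."""
    x = np.zeros(n, dtype=int)
    x[rng.choice(n, size=t, replace=False)] = 1
    return x

p_true, n = 0.3, 20
x_real = rng.binomial(1, p_true, size=n)          # real data from an unknown p
x_fake = fake_data_given_t(int(x_real.sum()), n)  # fake data from t alone
print(x_real.sum(), x_fake.sum())                 # same sufficient statistic
```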

2. Halmos-Savage factorization

In practice nobody ever checks sufficiency by working out conditional distributions by hand and the tool everyone reaches for is factorization.

Theorem (Halmos-Savage, 1949). Take $\{P_\theta : \theta \in \Theta\}$ dominated by a $\sigma$-finite measure $\mu$; then $T(X)$ is sufficient for $\theta$ exactly when the density $f_\theta(x) = \frac{dP_\theta}{d\mu}(x)$ splits up as

$$f_\theta(x) = g_\theta(T(x)) \, h(x)$$

where $g_\theta \ge 0$ touches $x$ only through $T(x)$ and $h(x) \ge 0$ has no $\theta$ in it.
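
A quick worked case to see the split: for $n$ i.i.d. $\text{Poisson}(\lambda)$ observations,

$$f_\lambda(x) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \underbrace{\exp\!\Big( \log\lambda \sum_i x_i - n\lambda \Big)}_{g_\lambda(T(x))} \; \underbrace{\prod_{i=1}^n \frac{1}{x_i!}}_{h(x)}$$

so $T(x) = \sum_i x_i$ is sufficient by inspection.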

Why it works ($\implies$ direction). When $T$ is sufficient you can write $P_\theta(A) = \int_{\mathcal{T}} P(A \mid T=t) \, dP_\theta^T(t)$, and the conditional piece $P(X=x \mid T=t)$ has no $\theta$ in it and that piece becomes $h(x)$, and the marginal of $T$ carries all the $\theta$-dependence and that becomes $g_\theta(t)$.

For the full measure-theoretic argument take $\mathcal{A} = \sigma(T)$ and build a dominating mixture $\lambda = \sum_i 2^{-i} P_{\theta_i}$, and then $\frac{dP_\theta}{d\lambda}$ is $\mathcal{A}$-measurable and so a function of $T$, and $\frac{d\lambda}{d\mu}$ leans on the mixture and not on any specific $\theta$. The chain rule $\frac{dP_\theta}{d\mu} = \frac{dP_\theta}{d\lambda} \cdot \frac{d\lambda}{d\mu}$ hands you the factorization.

3. Fisher Information

Sufficiency has a clean story in terms of Fisher Information.

$$I_X(\theta) = \mathbb{E}_\theta \left[ \left( \nabla_\theta \log f_\theta(X) \right)^2 \right]$$

Data processing cannot make new information and so $I_T(\theta) \le I_X(\theta)$ for any statistic $T$.

And $T$ is sufficient exactly when equality holds everywhere, and the proof is short. With factorization you get $\ell(\theta) = \log g_\theta(T) + \log h(X)$ and so the score $\nabla_\theta \ell(\theta) = \nabla_\theta \log g_\theta(T)$ leans only on $T$, and the variance of the score (which is the Fisher Information) comes out the same whether you work it out from $X$ or from $T$.

Sufficiency holds exactly when no information is lost at all.
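
Concretely, take $n$ i.i.d. $\text{Bernoulli}(p)$ observations. Then

$$I_X(p) = \frac{n}{p(1-p)}, \qquad I_{\sum_i X_i}(p) = \frac{n}{p(1-p)}, \qquad I_{X_1}(p) = \frac{1}{p(1-p)},$$

so the sum (sufficient) keeps every bit of Fisher information while $X_1$ alone keeps only a $1/n$ fraction of it.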


4. Minimal sufficiency and the likelihood ratio

There are always lots of sufficient statistics and $X$ itself is sufficient in a trivial way, and what you want is the most squeezed one and the coarsest partition of the sample space that still loses nothing.

$T$ is minimal sufficient when, for any other sufficient statistic $S$, $T$ is a function of $S$.

Lehmann-Scheffe criterion: $T(x)$ is minimal sufficient iff

$$\frac{L(\theta; x)}{L(\theta; y)} \text{ is constant in } \theta \iff T(x) = T(y)$$

Two data points land on the same summary exactly when their likelihood ratio goes flat.

Uniform $U[0, \theta]$. $L(\theta; x) = \theta^{-n} \mathbb{I}(\max(x) \le \theta)$. The ratio stays constant exactly when $x_{(n)} = y_{(n)}$ and so $T(X) = \max(X)$ is minimal sufficient, and one number holds everything about $\theta$ that $n$ observations have in them.

Cauchy. $f(x; \theta) = \frac{1}{\pi(1+(x-\theta)^2)}$. The likelihood ratio is a degree-$2n$ rational function of $\theta$ and for it to stay constant the polynomials have to share all their roots, and that forces $\{x_i\} = \{y_i\}$ as sets. The minimal sufficient statistic is the full vector of order statistics and no squeezing is possible.

Cauchy tails are so heavy that every single observation tells you something about $\theta$ on its own and no lower-dimensional summary exists.

5. Pitman-Koopman-Darmois

When can you squeeze $n$ samples into a fixed number of summaries no matter how big $n$ gets? Almost never.

Theorem. Under regularity conditions where the support does not lean on $\theta$, when a family has a sufficient statistic of dimension $k$ that does not grow with $n$ the family has to be an exponential family.

$$f(x \mid \theta) = h(x) \exp\big( \eta(\theta) \cdot T(x) - A(\theta) \big)$$

Sketch. For $n$ i.i.d. samples factorization hands you $\sum_{i=1}^n \log f(x_i \mid \theta) = \alpha(T(\mathbf{x}), \theta) + \beta(\mathbf{x})$, and you take derivatives in both $x_i$ and $x_j$ to get cross-derivative constraints. For the Jacobian that maps data to parameter gradients to have rank at most $k$ as $n$ grows you need $\nabla_\theta \log f(x \mid \theta) = \sum_{j=1}^k w_j'(\theta)\, t_j(x)$, and you integrate and out pops the exponential family form.

Fixed-dimensional sufficiency is so tight that it basically picks out exponential families and nothing else.

6. Exponential families and convex duality

Write the canonical form:

$$p(x \mid \eta) = h(x) \exp\big( \eta^T T(x) - A(\eta) \big)$$

Three facts make exponential families so easy to work with.

$A(\eta)$ is convex. Because $A(\eta) = \log \int h(x)\, e^{\eta^T T(x)}\, dx$ is a log-moment-generating function, and Hölder's inequality does the rest.

Moments from derivatives.

$$\nabla_\eta A(\eta) = \mathbb{E}[T(X)], \qquad \nabla^2_\eta A(\eta) = \text{Cov}(T(X))$$

Covariance is PSD and so $A$ is strictly convex as long as you have a minimal representation.

MLE is moment matching. The log-likelihood for data $D$ is $\mathcal{L}(\eta) = n \eta^T \bar{T} - n A(\eta) + \text{const}$ with $\bar{T} = \frac{1}{n} \sum_i T(x_i)$, and you set the gradient to zero.

$$\nabla \mathcal{L} = n \bar{T} - n \nabla A(\eta) = 0 \implies \mathbb{E}_\eta[T(X)] = \frac{1}{n} \sum_i T(x_i)$$

The MLE is the parameter that makes the model’s expected sufficient statistics line up with the empirical ones, and information geometry reads this as projecting the empirical distribution onto the model manifold.
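
As a sketch of the moment-matching view in code, take the canonical Poisson case ($\eta = \log\lambda$, $A(\eta) = e^\eta$): Newton's method on $\nabla A(\eta) = \bar{T}$ lands on the same answer as the closed form $\hat{\lambda} = \bar{x}$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(3.5, size=1000)
T_bar = x.mean()                 # empirical average of T(x) = x

# Canonical Poisson: A(eta) = exp(eta), so A'(eta) = A''(eta) = exp(eta).
eta = 0.0
for _ in range(25):
    eta -= (np.exp(eta) - T_bar) / np.exp(eta)   # Newton step on A'(eta) = T_bar

print(np.exp(eta), T_bar)        # MLE lambda-hat matches the empirical mean
```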


7. Basu’s Theorem

An ancillary statistic $A(X)$ has a distribution that does not lean on $\theta$ at all. Take $X_1 - X_2$ when $X_i \sim \mathcal{N}(\theta, 1)$ and you get one.

Ancillaries hold no likelihood information about θ\theta but they do tell you how precise your experiment was.

Basu's Theorem. When $T$ is boundedly complete and sufficient and $A$ is ancillary then $T \perp A$.

Proof. Take $B \in \sigma(A)$ and because $A$ is ancillary $P_\theta(B)$ stays constant in $\theta$. Let $g(t) = P(B \mid T=t)$ and sufficiency strips the $\theta$ out. Then $\mathbb{E}_\theta[g(T)] = P_\theta(B)$ is constant and so $\mathbb{E}_\theta[g(T) - P(B)] = 0$ for all $\theta$. Completeness of $T$ forces $g(T) = P(B)$ almost surely. And so $P(B \mid T) = P(B)$ and that is independence. $\square$

Why this matters for the $t$-test. For $X_i \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ treated as fixed, $\bar{X}$ is complete sufficient for $\mu$ and $S^2$ is ancillary for $\mu$, and Basu hands you $\bar{X} \perp S^2$. The independence of $\bar{X}$ and $S^2$ is why the $t$-test works, because the top and the bottom are independent. And it only works for Gaussians by Geary's theorem: try it with any other distribution and the independence falls apart.
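
A quick empirical check of that last claim (a sketch, plain NumPy): under Gaussian sampling the sample mean and sample variance come out uncorrelated, while a skewed family like the exponential makes the dependence obvious.

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_var_correlation(sampler, n=10, trials=20000):
    """Correlation between the sample mean and sample variance across trials."""
    data = sampler((trials, n))
    xbar = data.mean(axis=1)
    s2 = data.var(axis=1, ddof=1)
    return np.corrcoef(xbar, s2)[0, 1]

print("Gaussian:   ", mean_var_correlation(lambda size: rng.normal(0.0, 1.0, size)))
print("Exponential:", mean_var_correlation(lambda size: rng.exponential(1.0, size)))
# Near zero for the Gaussian (Basu gives full independence); clearly positive
# for the exponential, where big means come with big spreads.
```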


8. Rao-Blackwellization

You have a bad unbiased estimator $\hat{\theta}$ and a sufficient statistic $T$ and you set $\tilde{\theta} = \mathbb{E}[\hat{\theta} \mid T]$.

It is still unbiased because $\mathbb{E}[\tilde{\theta}] = \mathbb{E}[\mathbb{E}[\hat{\theta} \mid T]] = \theta$.

And it is computable because sufficiency means the conditional distribution has no θ\theta in it.

And by the law of total variance you get

$$\text{Var}(\hat{\theta}) = \text{Var}(\tilde{\theta}) + \mathbb{E}[\text{Var}(\hat{\theta} \mid T)] \ge \text{Var}(\tilde{\theta})$$

Conditioning on the sufficient statistic always cuts variance down and it cuts it strictly unless the estimator was already a function of TT.
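
A one-line worked case tying this to the Poisson simulation at the end of the post: take $\hat{\theta} = X_1$ and $T = \sum_i X_i$ for i.i.d. Poisson data. By symmetry

$$\tilde{\theta} = \mathbb{E}\Big[X_1 \,\Big|\, \textstyle\sum_i X_i\Big] = \frac{1}{n} \sum_i X_i = \bar{X},$$

so Rao-Blackwellizing the wasteful single-observation estimator hands you the sample mean, with variance $\lambda/n$ instead of $\lambda$.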


9. Information Bottleneck and neural networks

Tishby's Information Bottleneck asks you to find a $T$ that pushes $I(T; Y)$ up and pulls $I(T; X)$ down, and the ideal $T$ is a sufficient statistic for $Y$ and minimal sufficient means perfect compression.

People have argued that SGD quietly finds these representations. The trouble is that for deterministic networks with continuous inputs $I(T; X)$ is infinite or undefined and you need noise in the weights or activations to make any of this well-defined.

For Bayesian neural networks the posterior predictive leans on sufficient statistics of the training data. But neural networks are not exponential families because they have finite width, and so the sufficient statistic is the whole dataset and no compression actually happens.

In the NTK regime when the width goes to infinity the sufficient statistics collapse down to the kernel matrix and that gives you a real finite-dimensional summary.


10. Sufficient statistics in time series

For i.i.d. data sufficiency squeezes $N$ static points down. In time series you are chewing through a stream $x_1, x_2, \ldots$ and you want a running summary $h_t = T(x_{1:t})$ with

$$P(x_{t+1:\infty} \mid x_{1:t}) = P(x_{t+1:\infty} \mid h_t)$$

This $h_t$ is the state, and when a finite-dimensional $h_t$ exists you have a hidden Markov model or a state space model.

Kalman filter. For linear Gaussian systems the sufficient statistics for the future are $(\mu_t, \Sigma_t)$ and the Kalman filter is just their recursive update.

$$\mu_t = A \mu_{t-1} + K_t (y_t - C A \mu_{t-1}), \qquad \Sigma_t = (I - K_t C)\, \Sigma_{t \mid t-1}$$

Fixed-memory tracking of a dynamic system. This is Pitman-Koopman-Darmois pushed onto conditional transition densities, and if the noise were Cauchy and not Gaussian you would have to hold onto the whole history.
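
A minimal one-dimensional sketch of that recursion (the scalar system constants $a$, $c$, $Q$, $R$ below are made up for illustration):

```python
import numpy as np

# Hypothetical scalar linear-Gaussian system: x_t = a x_{t-1} + w,  y_t = c x_t + v.
a, c, Q, R = 0.95, 1.0, 0.1, 0.5

rng = np.random.default_rng(3)
x, mu, sigma2 = 0.0, 0.0, 1.0   # true state, filter mean, filter variance

for t in range(100):
    x = a * x + rng.normal(0.0, np.sqrt(Q))   # latent dynamics
    y = c * x + rng.normal(0.0, np.sqrt(R))   # observation

    # Predict: push (mu, sigma2) through the dynamics.
    mu_pred = a * mu
    sigma2_pred = a * a * sigma2 + Q

    # Update: Kalman gain, then the recursion from the text.
    K = sigma2_pred * c / (c * c * sigma2_pred + R)
    mu = mu_pred + K * (y - c * mu_pred)
    sigma2 = (1.0 - K * c) * sigma2_pred

print(mu, sigma2, x)  # two numbers summarize the whole observation history
```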

11. Approximate sufficiency and Le Cam’s deficiency

What happens when a statistic is only almost sufficient?

Le Cam pinned this down with decision theory. Two experiments $\mathcal{E}$ where you watch $X$ and $\mathcal{F}$ where you watch $T(X)$ are equivalent when every decision rule in one can be matched in the other with the same risk.

The deficiency distance is

$$\delta(\mathcal{E}, \mathcal{F}) = \inf_K \sup_\theta \| P_\theta - K Q_\theta \|_{TV}$$

where $K$ is a Markov kernel, and $\delta = 0$ means $T$ is sufficient and $\delta < \epsilon$ means $T$ is $\epsilon$-sufficient.

And this hooks up to differential privacy. You want statistics that are sufficient for the signal and insufficient for picking out the user.


12. Fisher’s program

Fisher's 1922 insight was that inference is data reduction. You start with a high-dimensional $X$ and you boil it down to a tiny $\hat{\theta}$, and sufficient statistics tell you exactly what you can throw away and factorization gives you the tool to check, and exponential families are the spot where finite-dimensional summaries actually exist, and Basu and Lehmann-Scheffe fill in the links between sufficiency and ancillarity and invariance.

In modern ML the word “sufficiency” goes by “representation learning” and the goal is the same. Find the smallest coordinates of whatever manifold the data actually lives on.


Simulation of Rao-Blackwellization and Fisher Information

We estimate $\lambda$ for a Poisson using three estimators.

  • $X_1$ alone, unbiased and wasteful.
  • $\bar{X}$, which is the Rao-Blackwell improvement and the MVUE.
  • $\hat{\lambda}$ from $T(X) = \mathbb{I}(X > 0)$, which is binary compression and throws the counts away.

The third estimator throws the count information away and retains far less Fisher information, so its estimate of $\lambda$ should come out much worse.

```python
import numpy as np

def fisher_info_loss_demo(true_lambda=5.0, n_samples=10, n_trials=5000):
    estimates_raw = []
    estimates_rb = []
    estimates_bad = []
    for _ in range(n_trials):
        X = np.random.poisson(true_lambda, n_samples)

        # 1. Raw estimator: only the first sample.
        estimates_raw.append(X[0])

        # 2. Rao-Blackwellized estimator: the sample mean.
        estimates_rb.append(np.mean(X))

        # 3. Bad statistic (lossy compression): we only know how many are non-zero.
        #    k = count(X > 0), k ~ Binomial(n, 1 - e^-lambda).
        #    p = 1 - e^-lambda  =>  lambda = -log(1 - p), and the MLE for p is k/n.
        k = np.sum(X > 0)
        if k == n_samples:
            # Avoid log(0) when every sample is non-zero (smoothed plug-in).
            est_bad = -np.log(1 - (n_samples - 0.5) / n_samples)
        else:
            est_bad = -np.log(1 - k / n_samples)
        estimates_bad.append(est_bad)

    print(f"Theory Variance (Raw): {true_lambda:.4f}")
    print(f"Empirical Var (Raw): {np.var(estimates_raw):.4f}")
    # Cramer-Rao lower bound = lambda / n.
    crlb = true_lambda / n_samples
    print(f"CRLB (Optimal): {crlb:.4f}")
    print(f"Empirical Var (RB): {np.var(estimates_rb):.4f}")
    print(f"Empirical Var (Bad Stat): {np.var(estimates_bad):.4f}")
    # The 'Bad Stat' throws away the exact counts, keeping only binary info,
    # so it retains far less Fisher information. (With lambda = 5 the indicator
    # is nearly always 1, so the damage also shows up as severe bias.)
```
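
Calling it with the defaults (a sketch of what the theory predicts; printed numbers vary run to run):

```python
fisher_info_loss_demo()
# Theory predicts Var(raw) = 5.0 and CRLB = 0.5, with the Rao-Blackwellized
# estimator sitting near the CRLB. With lambda = 5 the binary statistic is
# nearly always 1, so its estimate pins near -log(0.05) ~ 3.0 and the damage
# shows up mostly as bias; rerun with true_lambda=1.0 to see the variance gap.
```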

Completeness and Basu’s Theorem

Completeness of exponential families

The family $f(x \mid \theta) = h(x) \exp(\theta \cdot T(x) - A(\theta))$ is complete when $\Theta$ contains an open rectangle.

Suppose $\mathbb{E}_\theta[g(T)] = 0$ for all $\theta$, and then $\int g(t)\, e^{\theta t - A(\theta)}\, dt = 0$ and that hands you $\int g(t)\, e^{\theta t}\, dt = 0$. This is the Laplace transform of $g$ and when a Laplace transform vanishes on an open set the function is zero almost everywhere, and the result follows.

When completeness fails

Take $U[\theta, \theta+1]$ and pick $g(X) = \sin(2\pi X)$. Then $\mathbb{E}_\theta[\sin(2\pi X)] = \int_\theta^{\theta+1} \sin(2\pi x) \, dx = 0$ for every $\theta$ because it is a full-period integral. But $\sin(2\pi X) \ne 0$ and so the uniform family is incomplete and Basu's theorem does not apply, and ancillaries can depend on sufficient statistics here.


Moments from the partition function

Every moment calculation you would ever want falls out by taking derivatives of A(η)A(\eta).

Bernoulli. $P(x) = \mu^x (1-\mu)^{1-x} = \exp\!\big( x \log \frac{\mu}{1-\mu} + \log(1-\mu) \big)$. The canonical parameter is $\eta = \log \frac{\mu}{1-\mu}$, which is the logit, and the inverse is $\mu = \sigma(\eta) = \frac{1}{1+e^{-\eta}}$. The log-partition is $A(\eta) = \log(1+e^\eta)$. Mean: $A'(\eta) = \sigma(\eta) = \mu$. Variance: $A''(\eta) = \mu(1-\mu)$.

Poisson. $P(x) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!} \exp( x \log \lambda - \lambda )$. The canonical parameter is $\eta = \log \lambda$ and the log-partition is $A(\eta) = e^\eta$. Mean: $A'(\eta) = e^\eta = \lambda$. Variance: $A''(\eta) = e^\eta = \lambda$.

Gaussian (known $\sigma^2 = 1$). $P(x) \propto \exp( x\mu - \mu^2/2 )$. The canonical parameter is $\eta = \mu$ and the log-partition is $A(\eta) = \eta^2/2$. Mean: $A'(\eta) = \eta = \mu$. Variance: $A''(\eta) = 1$.
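
These identities are easy to sanity-check with automatic differentiation; a sketch using jax.grad on the Bernoulli log-partition $A(\eta) = \log(1+e^\eta)$ (assuming jax is available, as in the simulation above):

```python
import jax
import jax.numpy as jnp

def A(eta):
    # Bernoulli log-partition function.
    return jnp.log1p(jnp.exp(eta))

eta = 0.7
mu = jax.grad(A)(eta)             # A'(eta)  = sigmoid(eta) = mean
var = jax.grad(jax.grad(A))(eta)  # A''(eta) = mu * (1 - mu) = variance

print(mu, jax.nn.sigmoid(eta))    # these agree
print(var, mu * (1 - mu))         # and so do these
```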


Quantum sufficiency

Sufficiency carries over to quantum mechanics and the objects just change shape. A density matrix $\rho_\theta$ on a Hilbert space $\mathcal{H}$ takes the place of the distribution and a quantum channel (CPTP map) $\mathcal{E}$ takes the place of the statistic.

And $\mathcal{E}$ is sufficient when there is a recovery channel $\mathcal{R}$ with

$$(\mathcal{R} \circ \mathcal{E})(\rho_\theta) = \rho_\theta \quad \forall \theta$$

This hooks into the Petz recovery map. A measurement (POVM) is sufficient for a family of states when the quantum Fisher information stays put.

The monotonicity of relative entropy gives the criterion

$$D(\rho_\theta \| \rho_{\theta'}) \ge D(\mathcal{E}(\rho_\theta) \| \mathcal{E}(\rho_{\theta'}))$$

and equality holds exactly when $\mathcal{E}$ is sufficient for $\{\rho_\theta, \rho_{\theta'}\}$. The data processing inequality stops any channel from making states easier to tell apart, and this is the same principle as in the classical case.


Sufficiency in latent variable models (EM)

Take a model with observed $X$ and latent $Z$. The complete-data likelihood $P(X, Z \mid \theta)$ often sits in an exponential family with sufficient statistics $T(X, Z)$. But we only get to see $X$.

EM works by computing expected sufficient statistics.

E-step: compute $P(Z \mid X, \theta_{\text{old}})$ and then

$$\bar{T} = \mathbb{E}_{Z \mid X} [ T(X, Z) ]$$

M-step: find $\theta_{\text{new}}$ by moment matching:

$$\mathbb{E}_{\theta_{\text{new}}} [ T(X, Z) ] = \bar{T}$$

For Gaussian mixtures the sufficient statistics are $\sum_i \gamma_{ik}$ for the cluster masses and $\sum_i \gamma_{ik} x_i$ for the first moments and $\sum_i \gamma_{ik} x_i x_i^T$ for the second moments. Without factorization EM would be unworkable for most models.
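
A compact sketch of that loop for a two-component 1-D Gaussian mixture, with the variances fixed at 1 for brevity (illustrative, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

# Parameters: mixing weights pi and component means mu; variances fixed at 1.
pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities gamma_{ik} = P(z_i = k | x_i, theta_old).
    log_w = np.log(pi) - 0.5 * (x[:, None] - mu[None, :]) ** 2
    gamma = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Expected sufficient statistics: cluster masses and first moments.
    N_k = gamma.sum(axis=0)
    S_k = gamma.T @ x

    # M-step: moment matching.
    pi = N_k / len(x)
    mu = S_k / N_k

print(pi, mu)  # close to the generating weights (0.3, 0.7) and means (-2, 3)
```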


Milestones

  • 1922. R.A. Fisher lays down “sufficiency” in his foundational paper on mathematical statistics and sets up the idea that a statistic can hold all the information about a parameter.
  • 1935. Koopman and Pitman and Darmois each prove on their own that fixed-dimensional sufficiency forces exponential family structure.
  • 1945. C.R. Rao proves what becomes the Rao-Blackwell theorem (Blackwell gives an independent proof in 1947) and shows that conditioning on a sufficient statistic always cuts variance down.
  • 1949. Halmos and Savage prove the factorization theorem with measure theory and put sufficiency on solid footing.
  • 1955. D. Basu proves that complete sufficient statistics come out independent of ancillary statistics.
  • 1972. Le Cam brings in deficiency distance and lets you compare approximate sufficiency between experiments.
  • 1999. Tishby puts forward the Information Bottleneck principle and hooks sufficiency up to representation learning and deep networks.

Terminology used in this post

A sufficient statistic holds everything $X$ knows about $\theta$, and formally the conditional distribution of $X$ given $T(X)$ does not lean on $\theta$ at all. A minimal sufficient statistic is the coarsest such statistic and it squeezes the data as far as it can go with no information loss. Completeness means no non-trivial function of $T$ has expectation zero for all parameter values and it is the condition that makes Basu's theorem work. The exponential family is the class of distributions whose log-likelihood is linear in the sufficient statistics and the Pitman-Koopman-Darmois theorem says these are basically the only families that allow fixed-dimensional sufficiency. Fisher Information measures how much an observation tells you about $\theta$ and Rao-Blackwellization is the move of conditioning an estimator on a sufficient statistic to cut variance down. An ancillary statistic has a distribution that does not lean on $\theta$ at all.


References

1. Fisher, R. A. (1922). “On the mathematical foundations of theoretical statistics”.

2. Halmos, P. R., & Savage, L. J. (1949). “Application of the Radon-Nikodym theorem to the theory of sufficient statistics”.

3. Basu, D. (1955). “On statistics independent of a complete sufficient statistic”.

4. Lehmann, E. L., & Scheffe, H. (1950). “Completeness, similar regions, and unbiased estimation”.

5. Pitman, E. J. G. (1936). “Sufficient statistics and intrinsic accuracy”.

6. Csiszar, I. (1975). “I-divergence geometry of probability distributions and minimization problems”.