Sufficient Statistics & Information

1. The Concept of Lossless Compression

The Problem: We observe a random variable $X$ taking values in a space $\mathcal{X}$, distributed according to $P_\theta$ where $\theta \in \Theta$. The raw data $X$ is often high-dimensional (e.g., $N$ video frames). The parameter $\theta$ is low-dimensional (e.g., the physics constant causing the motion). All inference about $\theta$ must be based on $X$. A statistic $T(X)$ is a map $\mathcal{X} \to \mathcal{T}$. Intuitively, $T$ compresses the data. When is this compression lossless?

Definition (Sufficiency): A statistic $T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $T(X)=t$ is independent of $\theta$ for all $t$:

$$P_\theta(X \in A \mid T(X)=t) = P(X \in A \mid T(X)=t)$$

If this holds, we can simulate the original data $X$ knowing only $t$ and a random number generator, without knowing $\theta$. Therefore, keeping $X$ provides no additional information about $\theta$ that isn't already in $T(X)$.
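To make this concrete, here is a minimal simulation sketch, assuming i.i.d. Bernoulli($p$) data with $T(X) = \sum_i X_i$: conditional on $T = t$, the data are just $t$ ones placed uniformly at random among $n$ slots, so a statistically equivalent dataset can be regenerated from $t$ alone. The function name below is illustrative, not standard.

```python
import numpy as np

rng = np.random.default_rng(0)

def regenerate_from_sufficient_stat(t, n):
    """Given only T = sum(X) = t, sample a dataset with the same
    conditional distribution as the original Bernoulli data.
    Note: no knowledge of p is required."""
    x = np.zeros(n, dtype=int)
    x[rng.choice(n, size=t, replace=False)] = 1
    return x

p, n = 0.3, 10
X = rng.binomial(1, p, size=n)                          # original data (needs p)
X_sim = regenerate_from_sufficient_stat(X.sum(), n)     # reconstruction (p-free)
print(X, X_sim, X.sum() == X_sim.sum())
```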

2. The Halmos-Savage Factorization Theorem

Checking the conditional distribution definition is practically impossible. We rely on the Factorization Criterion.

Theorem (Halmos-Savage, 1949): Let $\{P_\theta : \theta \in \Theta\}$ be a family of distributions dominated by a $\sigma$-finite measure $\mu$. $T(X)$ is sufficient for $\theta$ if and only if the Radon-Nikodym density $f_\theta(x) = \frac{dP_\theta}{d\mu}(x)$ factorizes as:

$$f_\theta(x) = g_\theta(T(x))\, h(x)$$

where $g_\theta \ge 0$ depends on $x$ only through $T(x)$, and $h(x) \ge 0$ is independent of $\theta$.
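As a quick sanity check of the criterion (a sketch, assuming i.i.d. Poisson($\lambda$) data), the joint density factorizes as $f_\lambda(x) = e^{-n\lambda} \lambda^{\sum_i x_i} \cdot \prod_i (x_i!)^{-1} = g_\lambda(T(x))\, h(x)$ with $T(x) = \sum_i x_i$:

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import gammaln

rng = np.random.default_rng(1)
x = rng.poisson(4.0, size=8)
lam = 2.5

log_f = poisson.logpmf(x, lam).sum()             # full log-likelihood
log_g = -len(x) * lam + x.sum() * np.log(lam)    # g_lambda(T(x)), T = sum(x)
log_h = -gammaln(x + 1).sum()                    # h(x) = 1 / prod(x_i!)
assert np.isclose(log_f, log_g + log_h)          # factorization holds numerically
```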

Proof of Sufficiency ($\Rightarrow$): If $T$ is sufficient, then for any set $A$:

$$P_\theta(A) = \int_{\mathcal{T}} P(A \mid T=t)\, dP_\theta^T(t)$$

Let $g_\theta(t)$ be the density of $T$ with respect to an induced measure, and let $h(x)$ come from the conditional density. In the discrete case this is transparent: $P_\theta(X=x) = P(X=x \mid T=t)\, P_\theta(T=t)$, where $P(X=x \mid T=t)$ is $\theta$-free (this is $h(x)$) and $P_\theta(T=t)$ carries the $\theta$-dependence (this is $g_\theta(t)$).

Measure-Theoretic Proof Sketch: Let $\mathcal{A}$ be the sufficient sub-$\sigma$-algebra generated by $T$. Sufficiency means the density $\frac{dP_\theta}{d\mu}$ can effectively be computed on $\mathcal{A}$. Specifically, define the dominating mixture $\lambda = \sum_i 2^{-i} P_{\theta_i}$. Then $\frac{dP_\theta}{d\lambda}$ is $\mathcal{A}$-measurable. Since $\frac{dP_\theta}{d\mu} = \frac{dP_\theta}{d\lambda} \frac{d\lambda}{d\mu}$, we identify $g_\theta(T(x)) = \frac{dP_\theta}{d\lambda}(x)$ (which is $\mathcal{A}$-measurable, hence a function of $T(x)$) and $h(x) = \frac{d\lambda}{d\mu}(x)$ (which depends on the mixture but not on any particular $\theta$).

3. Fisher Information and Efficiency

Why do we care? Because of Fisher Information.

$$I_X(\theta) = \mathbb{E}_\theta \left[ \left( \nabla_\theta \log f_\theta(X) \right)^2 \right]$$

Theorem (Data Processing Inequality): For any statistic $T(X)$, $I_T(\theta) \le I_X(\theta)$: processing data cannot create information. Theorem: $T(X)$ is sufficient if and only if $I_T(\theta) = I_X(\theta)$ for all $\theta$ (for dominated families with suitably regular densities).

Proof: Using the factorization, $\ell(\theta) = \log g_\theta(T) + \log h(X)$, so $\nabla_\theta \ell(\theta) = \nabla_\theta \log g_\theta(T)$. The score function depends on the data only through $T$, so the variance of the score (the Fisher information) is identical. Sufficiency = conservation of Fisher information.
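A minimal numerical illustration of this conservation, assuming $X_1, \dots, X_n \sim \mathcal{N}(\theta, 1)$ with sufficient statistic $T = \sum_i X_i$: the score computed from the full sample equals the score computed from $T$ alone, so no Fisher information is lost.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n = 1.5, 20
x = rng.normal(theta, 1.0, size=n)

# Score of the full sample: d/dtheta sum_i log N(x_i; theta, 1) = sum_i (x_i - theta)
score_full = np.sum(x - theta)

# Score based only on T = sum(x): T ~ N(n*theta, n), so the score is (T - n*theta)
T = x.sum()
score_T = T - n * theta

print(score_full, score_T)   # identical: the score depends on the data only through T
```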


4. Minimal Sufficiency & The Likelihood Ratio

Usually there are many sufficient statistics; the whole data $X$ is trivially sufficient. We want the smallest sufficient statistic: the one that compresses the most. Definition: $T$ is minimal sufficient if for any other sufficient statistic $S$, $T$ is a function of $S$ (i.e., $T$ partitions the sample space at least as coarsely as $S$).

Lehmann-Scheffé Method: $T(x)$ is minimal sufficient if and only if:

$$\frac{L(\theta; x)}{L(\theta; y)} \text{ is independent of } \theta \iff T(x) = T(y)$$

Basically, two datasets $x$ and $y$ are "equivalent" (map to the same $T$) iff their likelihood ratio is constant in $\theta$.

Example 1: Uniform $U[0, \theta]$. $L(\theta; x) = \theta^{-n}\, \mathbb{I}(\max_i x_i \le \theta)$. The ratio is free of $\theta$ iff $\mathbb{I}(x_{(n)} \le \theta) = \mathbb{I}(y_{(n)} \le \theta)$ for all $\theta$, which requires $x_{(n)} = y_{(n)}$. Thus $T(X) = \max_i X_i$ is minimal sufficient.

Example 2: Cauchy Distribution. $f(x; \theta) = \frac{1}{\pi(1+(x-\theta)^2)}$. The likelihood ratio is a rational function of $\theta$ of degree $2n$. For the ratio to be constant in $\theta$, the polynomials in the numerator and denominator must share roots, which forces the multiset $\{x_i\}$ to equal $\{y_i\}$. Thus, for the Cauchy family, the minimal sufficient statistic is the order statistics (the sorted data). No compression is possible. Lesson: heavy tails often destroy sufficiency properties.
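A rough numerical check of the Lehmann-Scheffé criterion for the Cauchy example (a sketch; the data below are simulated): the log-likelihood ratio is flat in $\theta$ when $y$ is a permutation of $x$ (same order statistics), and varies with $\theta$ otherwise.

```python
import numpy as np
from scipy.stats import cauchy

rng = np.random.default_rng(3)
x = rng.standard_cauchy(5)
y_perm = rng.permutation(x)            # same order statistics as x
y_diff = x + rng.normal(0, 0.1, 5)     # genuinely different data

thetas = np.linspace(-2, 2, 9)

def log_ratio(a, b):
    """log L(theta; a) - log L(theta; b) evaluated on a grid of theta."""
    return np.array([cauchy.logpdf(a, loc=t).sum() - cauchy.logpdf(b, loc=t).sum()
                     for t in thetas])

print(np.ptp(log_ratio(x, y_perm)))    # 0: ratio constant in theta => same T
print(np.ptp(log_ratio(x, y_diff)))    # > 0: ratio depends on theta => different T
```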

5. The Pitman-Koopman-Darmois Theorem

When can we compress $n$ samples into $k$ numbers (where $k$ is fixed as $n \to \infty$)? Only in very special cases.

Theorem: Under regularity conditions (support independent of $\theta$), if a family admits a sufficient statistic of fixed dimension $k$ (independent of sample size $n$), then the family is an Exponential Family.

$$f(x \mid \theta) = h(x) \exp\left( \eta(\theta) \cdot T(x) - A(\theta) \right)$$

Proof Sketch: Consider the log-likelihood for $n$ i.i.d. samples:

$$\ell(\theta; \mathbf{x}) = \sum_{i=1}^n \log f(x_i \mid \theta)$$

If $T(\mathbf{x}) = (t_1, \dots, t_k)$ is sufficient, then by factorization:

$$\sum_{i=1}^n \log f(x_i \mid \theta) = \alpha(T(\mathbf{x}), \theta) + \beta(\mathbf{x})$$

Differentiating with respect to distinct samples $x_i$ and $x_j$, we obtain a constraint on the cross-derivatives that forces $\log f$ to separate into the product form $\eta(\theta)\, t(x)$. Specifically, consider the Jacobian of the mapping from data space to parameter gradients. For its rank to remain bounded by $k$ as $n \to \infty$, the gradients must lie in a fixed low-dimensional subspace. This forces the structure:

$$\nabla_\theta \log f(x \mid \theta) = \sum_{j=1}^k w_j'(\theta)\, t_j(x)$$

Integration yields the exponential family form.

6. Exponential Families & Convex Duality

The “Canonical Form” is critical for derivations.

$$p(x \mid \eta) = h(x) \exp\left( \eta^T T(x) - A(\eta) \right)$$

Property 1: The Log-Partition Function $A(\eta)$ is Convex. Since $\int p\, dx = 1$, we have $A(\eta) = \log \int h(x)\, e^{\eta^T T(x)}\, dx$. Hölder's inequality implies that this log-integral-exp (the continuous analogue of log-sum-exp) is convex in $\eta$.

Property 2: Moments are Gradients.

$$\nabla_\eta A(\eta) = \mathbb{E}[T(X)], \qquad \nabla^2_\eta A(\eta) = \text{Cov}(T(X))$$

Since the covariance is positive semi-definite, $A$ is convex; it is strictly convex if the representation is minimal.

Property 3: MLE is Moment Matching. The log-likelihood for data $D$ is:

$$\mathcal{L}(\eta) = \sum_i \log p(x_i \mid \eta) = n\, \eta^T \bar{T} - n A(\eta) + \text{const}, \qquad \bar{T} = \frac{1}{n} \sum_i T(x_i)$$

Taking the gradient and setting to 0:

$$\nabla \mathcal{L} = n \bar{T} - n \nabla A(\eta) = 0 \implies \mathbb{E}_\eta[T(X)] = \frac{1}{n} \sum_i T(x_i)$$

The Maximum Likelihood Estimator is the unique parameter $\hat{\eta}$ that makes the model's expected sufficient statistics match the observed empirical sufficient statistics. This is called the Dual Matching Condition. Information Geometry interprets this as a projection of the empirical distribution onto the model manifold.
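A minimal sketch of the moment-matching condition, assuming Poisson data in canonical form (so $A(\eta) = e^\eta$): solving $\nabla A(\eta) = \bar{T}$ numerically recovers the closed-form MLE $\hat{\eta} = \log \bar{x}$.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(4)
x = rng.poisson(3.0, size=500)
T_bar = x.mean()                                # empirical sufficient statistic

# For the Poisson family in canonical form, A(eta) = exp(eta), so grad A(eta) = exp(eta).
# The MLE solves grad A(eta) = T_bar.
eta_hat = brentq(lambda eta: np.exp(eta) - T_bar, -10, 10)

print(np.exp(eta_hat), T_bar)                   # moment matching: model mean == sample mean
print(np.isclose(eta_hat, np.log(T_bar)))       # matches the closed-form MLE log(x_bar)
```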


7. Basu’s Theorem & Ancillarity

An Ancillary Statistic $A(X)$ is one whose distribution does not depend on $\theta$ (e.g., the sample size $n$ if fixed, or $X_1 - X_2$ for the location family $\mathcal{N}(\theta, 1)$). Ancillaries seem useless? No, they define the "precision" of the experiment. Basu's Theorem: If $T$ is boundedly complete and sufficient, and $A$ is ancillary, then $T$ and $A$ are independent. Independence?! $T$ has all the info, $A$ has no info, so independence seems trivial? No: usually information sets overlap, but here they are orthogonal.

Proof

Let $B \in \sigma(A)$ be an event defined by the ancillary statistic; $P(B)$ is constant in $\theta$. Let $g(t) = P(B \mid T=t)$, so that $\mathbb{E}_\theta[g(T)] = P(B)$. Since $T$ is sufficient, $g(t)$ does not depend on $\theta$. So $\mathbb{E}_\theta[g(T) - P(B)] = 0$ for all $\theta$. Because $T$ is complete (no non-zero function of $T$ has mean zero for all $\theta$), we must have $g(T) - P(B) = 0$ a.s. Thus $P(B \mid T) = P(B)$, which implies independence. $\square$

Application: Geary’s Theorem

Let $X_i \sim \mathcal{N}(\mu, \sigma^2)$. Fix $\sigma^2$ and treat $\mu$ as the parameter: $\bar{X}$ is complete and sufficient for $\mu$, while $S^2$ is ancillary for $\mu$ (location invariant). By Basu's theorem, $\bar{X}$ and $S^2$ are independent, and since this holds for every fixed $\sigma^2$, the independence holds in general. This is why the t-test works (the numerator and denominator are independent). Note: this independence holds ONLY for Gaussians (Geary's Theorem).
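A rough empirical check (a sketch, not a proof): the sample correlation between $\bar{X}$ and $S^2$ is near zero for Gaussian samples, but clearly positive for a skewed family such as the exponential, where this particular independence fails.

```python
import numpy as np

rng = np.random.default_rng(5)

def corr_mean_var(sampler, n=10, trials=20000):
    """Correlation between the sample mean and sample variance across many datasets."""
    X = sampler(size=(trials, n))
    means = X.mean(axis=1)
    vars_ = X.var(axis=1, ddof=1)
    return np.corrcoef(means, vars_)[0, 1]

print(corr_mean_var(lambda size: rng.normal(0.0, 1.0, size)))     # ~ 0 (independent)
print(corr_mean_var(lambda size: rng.exponential(1.0, size)))     # clearly > 0 (dependent)
```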


8. Rao-Blackwellization

Suppose we have a rough unbiased estimator $\hat{\theta}$ and a sufficient statistic $T$. Define $\tilde{\theta} = \mathbb{E}[\hat{\theta} \mid T]$.

  1. Unbiased: $\mathbb{E}[\tilde{\theta}] = \mathbb{E}[\mathbb{E}[\hat{\theta} \mid T]] = \mathbb{E}[\hat{\theta}] = \theta$.
  2. Computable: Since $T$ is sufficient, the conditional distribution is $\theta$-free, so we can actually calculate the expectation.
  3. Variance Reduction: $\text{Var}(\hat{\theta}) = \text{Var}(\mathbb{E}[\hat{\theta} \mid T]) + \mathbb{E}[\text{Var}(\hat{\theta} \mid T)] = \text{Var}(\tilde{\theta}) + (\text{non-negative term})$, so $\text{Var}(\tilde{\theta}) \le \text{Var}(\hat{\theta})$. Smoothing noise by conditioning on the sufficient statistic improves the estimator (strictly, unless $\hat{\theta}$ is already a function of $T$). This is the Rao-Blackwell Theorem; a worked example follows below.
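Here is a classical worked sketch, assuming Poisson($\lambda$) data and targeting $e^{-\lambda} = P(X = 0)$: start with the crude unbiased estimator $\hat{\theta} = \mathbb{I}(X_1 = 0)$ and condition on $T = \sum_i X_i$, which gives $\tilde{\theta} = \mathbb{E}[\hat{\theta} \mid T] = \left(\frac{n-1}{n}\right)^T$ (since $X_1 \mid T = t \sim \text{Binomial}(t, 1/n)$).

```python
import numpy as np

rng = np.random.default_rng(6)
lam, n, trials = 2.0, 10, 50000

X = rng.poisson(lam, size=(trials, n))
crude = (X[:, 0] == 0).astype(float)       # theta_hat = I(X_1 = 0), unbiased but noisy
T = X.sum(axis=1)
rb = ((n - 1) / n) ** T                    # E[theta_hat | T] = ((n-1)/n)^T

print("target:        ", np.exp(-lam))
print("crude mean/var:", crude.mean(), crude.var())
print("RB    mean/var:", rb.mean(), rb.var())   # same mean, much smaller variance
```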

9. Information Bottleneck & Deep Learning

Tishby's Information Bottleneck principle: $T$ should maximize $I(T; Y)$ while minimizing $I(T; X)$. Ideally $T$ is a sufficient statistic for $Y$; a minimal sufficient statistic is perfect compression. In Deep Learning, we argue that SGD finds such representations. However, for deterministic networks, $I(T; X)$ is infinite (or undefined) for continuous variables; we need to add noise to the weights or activations to make this rigorous. If the weights are random (a Bayesian NN), the posterior predictive distribution depends on sufficient statistics of the training data. Since a finite neural network is not an exponential family, the sufficient statistic is the whole dataset. But as width $\to \infty$, in the NTK regime, the sufficient statistics become the kernel matrix of the data.


10. Sufficient Statistics in Time Series: The State

In i.i.d. settings, sufficiency is about compressing $N$ static points. In time series, we process data streams $x_1, x_2, \dots$. We seek a summary $h_t = T(x_{1:t})$ such that:

$$P(x_{t+1:\infty} \mid x_{1:t}) = P(x_{t+1:\infty} \mid h_t)$$

This $h_t$ is the State of the system. If such a finite-dimensional $h_t$ exists, the process is a Hidden Markov Model (or State Space Model).

The Kalman Filter: For linear Gaussian systems, the sufficient statistics for the future are $(\mu_t, \Sigma_t)$. The Kalman Filter is simply the recursive update of these sufficient statistics.

$$\mu_t = A \mu_{t-1} + K_t (y_t - C A \mu_{t-1}), \qquad \Sigma_t = (I - K_t C)\, \Sigma_{t \mid t-1}$$

The fact that we can track a dynamic system with fixed memory is a direct consequence of the Pitman-Koopman-Darmois theorem applied to the conditional transition densities. If the noise were Cauchy, we would need to store the entire history $x_{1:t}$.
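A minimal scalar Kalman filter sketch (the parameter names a, c, q, r below are illustrative assumptions, not from the text): the entire observation history is summarized by the two numbers $(\mu_t, \Sigma_t)$, which are updated recursively.

```python
import numpy as np

def kalman_1d(ys, a=0.9, c=1.0, q=0.1, r=0.5, mu0=0.0, sigma0=1.0):
    """Scalar Kalman filter: the pair (mu, sigma) is a fixed-size
    sufficient statistic for the future given all past observations."""
    mu, sigma = mu0, sigma0
    states = []
    for y in ys:
        # Predict
        mu_pred = a * mu
        sigma_pred = a * sigma * a + q
        # Update (Kalman gain); note we never touch older observations
        k = sigma_pred * c / (c * sigma_pred * c + r)
        mu = mu_pred + k * (y - c * mu_pred)
        sigma = (1 - k * c) * sigma_pred
        states.append((mu, sigma))
    return states

rng = np.random.default_rng(7)
x, ys = 0.0, []
for _ in range(50):                          # simulate the linear Gaussian system
    x = 0.9 * x + rng.normal(0, np.sqrt(0.1))
    ys.append(x + rng.normal(0, np.sqrt(0.5)))
print(kalman_1d(ys)[-1])                     # final (mu_t, Sigma_t)
```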

11. Approximate Sufficiency: Le Cam’s Deficiency

What if a statistic is "almost" sufficient? Lucien Le Cam formalized this using decision theory. Two experiments $\mathcal{E}$ (observing $X$) and $\mathcal{F}$ (observing $T(X)$) are equivalent if for any decision rule in $\mathcal{E}$, there exists a rule in $\mathcal{F}$ with the same risk (and vice versa).

Deficiency Distance:

$$\delta(\mathcal{E}, \mathcal{F}) = \inf_K \sup_\theta \| P_\theta - K Q_\theta \|_{TV}$$

where $K$ is a Markov kernel (a randomization). If $\delta = 0$, $T$ is sufficient; if $\delta < \epsilon$, $T$ is $\epsilon$-sufficient. This is crucial for privacy (Differential Privacy): we want statistics that are sufficient for the signal but insufficient for the user's identity.


12. Conclusion: The Conservation of Information

R.A. Fisher's original insight remains one of the most profound in statistics: Inference is Data Reduction. We start with a massive, high-dimensional object $X$ and attempt to distill it into a tiny object $\hat{\theta}$. The theory of sufficient statistics tells us exactly what we can throw away. The Factorization Theorem gives us the algebraic tool to identify these summaries. Exponential families provide the geometric structure in which these summaries are finite-dimensional. And finally, the interplay between sufficiency, ancillarity (Basu), and minimality (Lehmann-Scheffé) provides the roadmap for optimal estimation. In the modern era of Deep Learning, "sufficiency" has morphed into "representation learning", but the goal remains the same: to find the minimal coordinates of the manifold on which the data lives.


Historical Timeline

| Year | Event | Significance |
|------|-------|--------------|
| 1922 | R.A. Fisher | Defines "Sufficiency" in his foundational paper. |
| 1935 | Koopman, Pitman, Darmois | Independently link Sufficiency to Generalized Exponential Families. |
| 1945 | C.R. Rao | Proves the Rao-Blackwell Theorem (Variance Reduction). |
| 1949 | Halmos & Savage | Prove the Factorization Theorem using Measure Theory. |
| 1955 | D. Basu | Proves Basu's Theorem (Independence of Ancillary Statistics). |
| 1972 | Le Cam | Introduces Deficiency Distance (Approximate Sufficiency). |
| 1999 | Tishby | Information Bottleneck Principle. |

Appendix A: Python Simulation of Rao-Blackwellization & Fisher Information

Let’s empirically verify two things:

  1. Rao-Blackwellization reduces variance (The “Improvement” Theorem).
  2. Fisher Information is lost if we use a non-sufficient statistic.

We estimate $\lambda$ for a Poisson distribution.

  • Estimator 1 (Raw): $X_1$. Unbiased.
  • Estimator 2 (Rao-Blackwell): $\bar{X}$. MVUE.
  • Estimator 3 (Bad Statistic): Estimate $\lambda$ from $T(X) = \mathbb{I}(X > 0)$ (binary compression).
```python
import numpy as np


def fisher_info_loss_demo(true_lambda=5.0, n_samples=10, n_trials=5000):
    estimates_raw = []
    estimates_rb = []
    estimates_bad = []

    for _ in range(n_trials):
        X = np.random.poisson(true_lambda, n_samples)

        # 1. Raw estimator: only the first sample
        estimates_raw.append(X[0])

        # 2. Rao-Blackwellized estimator: the sample mean
        estimates_rb.append(np.mean(X))

        # 3. Bad statistic (lossy compression):
        #    we only know how many samples are non-zero.
        #    k = count(X > 0), k ~ Binomial(n, 1 - e^{-lambda})
        #    p = 1 - e^{-lambda}  =>  lambda = -log(1 - p); the MLE for p is k/n.
        k = np.sum(X > 0)
        if k == n_samples:
            # Avoid log(0): smooth the boundary case k = n
            est_bad = -np.log(1 - (n_samples - 0.5) / n_samples)
        else:
            est_bad = -np.log(1 - k / n_samples)
        estimates_bad.append(est_bad)

    print(f"Theory Variance (Raw):    {true_lambda:.4f}")
    print(f"Empirical Var (Raw):      {np.var(estimates_raw):.4f}")
    # Cramer-Rao Lower Bound = lambda / n
    crlb = true_lambda / n_samples
    print(f"CRLB (Optimal):           {crlb:.4f}")
    print(f"Empirical Var (RB):       {np.var(estimates_rb):.4f}")
    print(f"Empirical Var (Bad Stat): {np.var(estimates_bad):.4f}")
    # The 'bad' statistic throws away the exact counts, keeping only
    # binary information, so its variance explodes.


if __name__ == "__main__":
    fisher_info_loss_demo()
```

Appendix B: Completeness and Basu’s Theorem Proof Details

B.1 Completeness of Exponential Families

The family $f(x \mid \theta) = h(x) \exp(\theta \cdot T(x) - A(\theta))$ is Complete if the parameter space $\Theta$ contains an open rectangle. Proof Idea: Suppose $\mathbb{E}_\theta[g(T)] = 0$ for all $\theta$. Then $\int g(t)\, e^{\theta t - A(\theta)}\, dt = 0$, so $\int g(t)\, e^{\theta t}\, dt = 0$. This is the Laplace transform of $g$. If the Laplace transform is $0$ in a neighborhood, the function is $0$ a.e. Thus $g(t) = 0$. This implies completeness.

B.2 Counter-Example to Completeness

Consider the uniform distribution $U[\theta, \theta+1]$ and a single observation $X$ (for $n$ samples the order statistics would be sufficient; for $n=1$, $T(X) = X$ is trivially sufficient). Consider $g(X) = \sin(2\pi X)$. Then $\mathbb{E}_\theta[\sin(2\pi X)] = \int_\theta^{\theta+1} \sin(2\pi x)\, dx = 0$, since the integral of sine over a full period is always $0$, regardless of the shift $\theta$. But $\sin(2\pi X)$ is not identically zero. Thus, the uniform family is incomplete, and Basu's theorem fails here (ancillaries might be dependent).
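A quick Monte Carlo check of this counter-example (a sketch): the expectation of $\sin(2\pi X)$ is numerically zero for every shift $\theta$, even though the function itself is not zero.

```python
import numpy as np

rng = np.random.default_rng(8)
for theta in [0.0, 0.3, 1.7, -2.25]:
    X = rng.uniform(theta, theta + 1, size=1_000_000)
    # ~ 0 for every theta, yet sin(2*pi*X) != 0: the family is incomplete
    print(theta, np.sin(2 * np.pi * X).mean())
```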


Appendix C: Deriving Moments from the Partition Function

The power of exponential families lies in $A(\eta)$. Let's derive the mean and variance for common distributions using $\nabla A$ and $\nabla^2 A$.

1. Bernoulli Distribution. $P(x) = \mu^x (1-\mu)^{1-x} = \exp\left( x \log \frac{\mu}{1-\mu} + \log(1-\mu) \right)$. Canonical parameter: $\eta = \log \frac{\mu}{1-\mu}$ (the logit). Inverse: $\mu = \sigma(\eta) = \frac{1}{1+e^{-\eta}}$. Log-partition: $A(\eta) = -\log(1-\mu) = \log(1+e^\eta)$. Mean: $A'(\eta) = \frac{e^\eta}{1+e^\eta} = \sigma(\eta) = \mu$. Variance: $A''(\eta) = \sigma(\eta)(1-\sigma(\eta)) = \mu(1-\mu)$.

2. Poisson Distribution. $P(x) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!} \exp\left( x \log \lambda - \lambda \right)$. Canonical parameter: $\eta = \log \lambda$. Inverse: $\lambda = e^\eta$. Log-partition: $A(\eta) = \lambda = e^\eta$. Mean: $A'(\eta) = e^\eta = \lambda$. Variance: $A''(\eta) = e^\eta = \lambda$.

3. Gaussian Distribution (Known Variance $\sigma^2 = 1$). $P(x) \propto \exp\left( -\frac{1}{2}(x-\mu)^2 \right) \propto \exp\left( x\mu - \mu^2/2 \right)$. Canonical parameter: $\eta = \mu$. Log-partition: $A(\eta) = \mu^2/2 = \eta^2/2$. Mean: $A'(\eta) = \eta = \mu$. Variance: $A''(\eta) = 1$.
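A small numerical check of these identities (a sketch using central finite differences; the helper derivs is illustrative, not a standard routine): the first and second derivatives of $A(\eta)$ reproduce the stated means and variances.

```python
import numpy as np

def derivs(A, eta, h=1e-4):
    """Central finite-difference approximations of A'(eta) and A''(eta)."""
    d1 = (A(eta + h) - A(eta - h)) / (2 * h)
    d2 = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2
    return d1, d2

# Bernoulli: A(eta) = log(1 + e^eta); mean = sigma(eta), var = sigma(1 - sigma)
eta = 0.7
mu = 1 / (1 + np.exp(-eta))
print(derivs(lambda e: np.log1p(np.exp(e)), eta), (mu, mu * (1 - mu)))

# Poisson: A(eta) = e^eta; mean = var = lambda = e^eta
eta = np.log(3.0)
print(derivs(np.exp, eta), (3.0, 3.0))
```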


Appendix D: Quantum Sufficiency

The concept of sufficiency extends to quantum mechanics. Let a quantum system be described by a density matrix $\rho_\theta$ acting on a Hilbert space $\mathcal{H}$. A "statistic" is replaced by a quantum channel (a CPTP map) $\mathcal{E}(\rho)$. When is a channel sufficient? When there exists a recovery channel $\mathcal{R}$ such that:

$$(\mathcal{R} \circ \mathcal{E})(\rho_\theta) = \rho_\theta \quad \forall \theta$$

Factorization in Quantum Mechanics: This is related to the Petz Recovery Map. A measurement (POVM) is sufficient for a family of states if the states commute with the measurement operators, or more generally, if the Quantum Fisher Information is preserved. Theorem (Monotonicity of Relative Entropy):

$$D(\rho_\theta \| \rho_{\theta'}) \ge D(\mathcal{E}(\rho_\theta) \| \mathcal{E}(\rho_{\theta'}))$$

Equality holds if and only if $\mathcal{E}$ is a sufficient statistic (a sufficient channel) for the family $\{\rho_\theta, \rho_{\theta'}\}$. This connects sufficiency to the Second Law of Thermodynamics (data processing cannot increase the distinguishability of states).


Appendix E: Sufficiency in Latent Variable Models (The EM Algorithm)

Consider a model with observed data $X$ and latent variables $Z$. The complete-data likelihood $P(X, Z \mid \theta)$ often belongs to an exponential family with sufficient statistics $T(X, Z)$. However, we only observe $X$. The Expectation-Maximization (EM) algorithm rests on the observation that we can compute the expected sufficient statistics. E-Step: Compute the posterior of the latent variables $P(Z \mid X, \theta_{\text{old}})$ and the expected sufficient statistics:

$$\bar{T} = \mathbb{E}_{Z \mid X} [ T(X, Z) ]$$

M-Step: Update $\theta$ by maximum likelihood, matching the expected statistics $\bar{T}$ to the model moments:

$$\mathbb{E}_{\theta_{\text{new}}} [ T(X, Z) ] = \bar{T}$$

For GMMs, the sufficient statistics are $\sum_i \gamma_{ik}$ (mass), $\sum_i \gamma_{ik} x_i$ (first moment), and $\sum_i \gamma_{ik} x_i x_i^T$ (second moment), where $\gamma_{ik}$ are the posterior responsibilities. This shows that sufficiency is the "computational engine" of the EM algorithm. Without the factorization theorem, EM would be computationally intractable for most models.
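A minimal sketch of EM for a one-dimensional, two-component GMM (illustrative variable names, not from the text), written explicitly in terms of the three expected sufficient statistics above:

```python
import numpy as np

rng = np.random.default_rng(9)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

# Initial parameters: mixing weights, means, variances
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities gamma[i, k] = P(z_i = k | x_i, theta_old)
    log_w = (np.log(pi)
             - 0.5 * np.log(2 * np.pi * var)
             - 0.5 * (x[:, None] - mu) ** 2 / var)
    gamma = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Expected sufficient statistics: mass, first moment, second moment
    N_k = gamma.sum(axis=0)
    S1 = gamma.T @ x
    S2 = gamma.T @ (x ** 2)

    # M-step: moment matching against the expected sufficient statistics
    pi = N_k / len(x)
    mu = S1 / N_k
    var = S2 / N_k - mu ** 2

print(pi, mu, var)   # should approach weights (0.3, 0.7), means (-2, 3), variances (1, 1)
```

Each M-step is exactly the moment-matching condition of Section 6, applied to expected rather than observed sufficient statistics.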


Appendix F: Glossary of Definitions

  • Ancillary Statistic: A statistic whose distribution does not depend on $\theta$.
  • Completeness: A family where no non-zero function has expectation zero for all parameters.
  • Exponential Family: A family of distributions where the log-likelihood is linear in the sufficient statistics.
  • Fisher Information: Measures the amount of information that an observable random variable $X$ carries about an unknown parameter $\theta$.
  • Interaction Information: Can be positive or negative (Synergy/Redundancy).
  • Minimal Sufficient Statistic: The coarsest possible sufficient statistic.
  • Rao-Blackwellization: Improving an estimator by conditioning on a sufficient statistic.
  • Sufficient Statistic: A statistic that captures all information about $\theta$ contained in $X$.

References

1. Fisher, R. A. (1922). “On the mathematical foundations of theoretical statistics”.

2. Halmos, P. R., & Savage, L. J. (1949). “Application of the Radon-Nikodym theorem to the theory of sufficient statistics”.

3. Basu, D. (1955). “On statistics independent of a complete sufficient statistic”.

4. Lehmann, E. L., & Scheffé, H. (1950). “Completeness, similar regions, and unbiased estimation”.

5. Pitman, E. J. G. (1936). “Sufficient statistics and intrinsic accuracy”.

6. Csiszar, I. (1975). “I-divergence geometry of probability distributions and minimization problems”.