The Geometry of Superposition

Superposition as a packing problem

A transformer’s residual stream has dimension $d_{\text{model}}$, typically 768 or 4096. The number of features the network must represent is far larger, perhaps by orders of magnitude: syntax, entities, sentiment, factual associations, positional information. There are more features than dimensions.

The network packs $n \gg d$ features into $d$ dimensions by exploiting sparsity. On any given input, most features are inactive. If two features rarely co-activate, their representations can overlap without interfering at inference time. The question is how much overlap geometry permits before the interference becomes unrecoverable.

Concentration of measure, random matrix theory, optimal transport, and information geometry are the right tools.

The toy model

Following Elhage et al. (2022), take a one-layer linear network with weights $W \in \mathbb{R}^{m \times n}$ where $m < n$. The network receives a feature vector $x \in \mathbb{R}^n$, compresses to $Wx \in \mathbb{R}^m$, and reconstructs via $W^T W x \in \mathbb{R}^n$. Each component $x_i$ is nonzero independently with probability $p$.

The loss is the expected weighted reconstruction error:

$$L(W) = \sum_{i=1}^{n} s_i \, \mathbb{E}\left[\left\| W^T W e_i - e_i \right\|^2 \cdot \mathbf{1}[x_i \neq 0]\right]$$

where $s_i$ is the importance of feature $i$ and $e_i$ is the $i$-th standard basis vector. Features $i$ and $j$ are simultaneously active with probability $p^2$, so the loss decomposes:

$$L(W) = \sum_i s_i \, p \left(1 - \|w_i\|^2\right)^2 + \sum_{i \neq j} s_i \, p^2 \left(w_i \cdot w_j\right)^2$$

where $w_i = W e_i$ is the $i$-th column of $W$. The first term drives each column toward unit norm (faithful representation). The second penalizes interference between columns (cross-talk).
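The decomposed loss is direct to compute from the Gram matrix. A minimal sketch (the function name and the equal-importance example are illustrative, not from the original):

```python
import numpy as np

def toy_loss(W, s, p):
    """Decomposed toy-model loss: reconstruction term plus interference term.

    W : (m, n) weight matrix whose columns w_i are feature directions
    s : (n,) feature importances
    p : per-feature activation probability
    """
    G = W.T @ W                         # Gram matrix, G[i, j] = w_i . w_j
    norms2 = np.diag(G)                 # ||w_i||^2 on the diagonal
    recon = np.sum(s * p * (1.0 - norms2) ** 2)
    off = G - np.diag(norms2)           # zero out the diagonal
    interference = np.sum(s[:, None] * p ** 2 * off ** 2)
    return recon + interference

# orthonormal dedicated features (n = m) incur zero loss at any sparsity
print(toy_loss(np.eye(3), np.ones(3), 0.5))  # -> 0.0
```

The broadcast `s[:, None]` weights each row of the squared off-diagonal entries by the importance of feature $i$, matching the $s_i p^2 (w_i \cdot w_j)^2$ sum.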

When $p = 1$, all features are always active, interference dominates, and the network can only faithfully represent $m$ features by assigning orthogonal columns and abandoning the rest. As $p \to 0$, interference vanishes relative to reconstruction, and the network can represent far more than $m$ features simultaneously.

The transition is sharp. Below a critical sparsity, the optimal solution changes qualitatively from dedicated orthogonal representations to dense, superposed packing.

Fig 1.1: Superposition Phase Diagram (Drag to Explore)

Theorem (Superposition Phase Transition (Elhage et al., 2022)).

For the toy model with $n$ features of equal importance in $m$ dimensions with sparsity $1-p$, the optimal representation transitions from dedicated (at most $m$ features represented) to superposed (up to $n$ features represented with controlled interference) as $p$ decreases below a critical threshold $p_c$ that depends on $n/m$.

At high sparsity and moderate compression, the optimal configurations minimize worst-case interference while maximizing represented features. These configurations are frames.

Almost-orthogonality

High-dimensional space is far more accommodating than R3\mathbb{R}^3 suggests. Volume concentrates in thin shells, random projections preserve distances, and typical configurations are more structured than arbitrary ones. The concentration of measure essay develops the general theory, and the consequence for superposition is immediate.

In $\mathbb{R}^2$, at most 3 unit vectors can achieve pairwise angles of $120°$ (the Mercedes-Benz frame), and $\mathbb{R}^3$ admits at most 6 equiangular lines. The growth looks linear, but once exact angles are relaxed to almost-orthogonality, the count in $\mathbb{R}^m$ grows exponentially in $m$.

Theorem (Johnson-Lindenstrauss, 1984).

For any $0 < \epsilon < 1$ and any integer $n$, there exist $n$ unit vectors in $\mathbb{R}^m$ with $m = O(\epsilon^{-2} \log n)$ such that all pairwise inner products satisfy $|v_i \cdot v_j| \leq \epsilon$ for $i \neq j$.

The proof is probabilistic. Draw $n$ vectors uniformly from the unit sphere in $\mathbb{R}^m$. For any fixed pair, $v_i \cdot v_j$ is sub-Gaussian with parameter $O(1/m)$, so by a union bound over $\binom{n}{2}$ pairs

$$\mathbb{P}\left[\max_{i \neq j} |v_i \cdot v_j| > \epsilon\right] \leq n^2 \exp\left(-c \, m \epsilon^2\right)$$

This is less than 1 whenever $m > C \epsilon^{-2} \log n$, so a good packing exists. Equivalently, $\mathbb{R}^m$ admits $n = \exp(c \, m \epsilon^2)$ pairwise $\epsilon$-orthogonal vectors.
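The probabilistic argument is easy to check numerically. A quick sketch (the function and sample sizes are illustrative choices):

```python
import numpy as np

def max_coherence(m, n, seed=0):
    """Largest |v_i . v_j| among n random unit vectors in R^m."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((n, m))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # project to the unit sphere
    G = np.abs(V @ V.T)                              # all pairwise |inner products|
    np.fill_diagonal(G, 0.0)                         # ignore self-products
    return G.max()

# coherence of a random packing decays roughly like sqrt(log n / m)
for m in (16, 256, 4096):
    print(m, max_coherence(m, n=1000))
```

At $m = 4096$ the worst pair among a thousand random directions is already nearly orthogonal, which is the JL phenomenon in miniature.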

In a residual stream of dimension $d = 4096$, this means millions of almost-orthogonal feature directions. Interference between any two superposed features is $O(1/\sqrt{d})$, small enough that ReLU nonlinearities and activation sparsity suppress cross-talk.

But almost-orthogonal is not optimally packed. The best packing satisfies a lower bound.

Theorem (Welch Bound, 1974).

For $n$ unit vectors $\{w_1, \ldots, w_n\}$ in $\mathbb{R}^m$ with $n > m$, the maximum coherence $\mu = \max_{i \neq j} |w_i \cdot w_j|$ satisfies:

$$\mu \geq \sqrt{\frac{n - m}{m(n-1)}}$$

Equality holds if and only if the vectors form an equiangular tight frame, meaning all pairwise $|w_i \cdot w_j|$ are equal and $\sum_i w_i w_i^T = \frac{n}{m} I$.

Equiangular tight frames are the densest possible packings. A network learning to superpose features is implicitly solving a frame design problem, and the Welch bound is the floor.
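One classical construction that meets the bound exactly is the simplex ETF of $m+1$ vectors in $\mathbb{R}^m$. A sketch (the helper names are mine):

```python
import numpy as np

def welch_bound(n, m):
    return np.sqrt((n - m) / (m * (n - 1)))

def simplex_etf(m):
    """m + 1 unit vectors in R^m with all pairwise inner products -1/m."""
    n = m + 1
    E = np.eye(n) - np.ones((n, n)) / n   # center the basis vectors: rank m
    _, _, Vt = np.linalg.svd(E)
    V = E @ Vt[:m].T                      # coordinates in the m-dim row space
    return V / np.linalg.norm(V, axis=1, keepdims=True)

W = simplex_etf(4)                        # 5 unit vectors in R^4
G = np.abs(W @ W.T)
np.fill_diagonal(G, 0.0)
print(G.max(), welch_bound(5, 4))         # coherence matches the bound, 1/m = 0.25
```

Centering the $n$ standard basis vectors of $\mathbb{R}^n$ leaves $n$ points in an $m$-dimensional hyperplane whose pairwise cosines are all $-1/m$, exactly the Welch-optimal coherence for $n = m+1$.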

Fig 2.1: Packing Explorer (Drag Vectors to Minimize Coherence)

The Gram matrix $W^T W$ makes the tension visible. Off-diagonal entries are the interferences, and the optimal configuration is the one that minimizes the worst case. In two dimensions the packing is tightly constrained, but the JL lemma says this constraint relaxes exponentially with dimension.

Phase transitions in representation

The transition from dedicated to superposed is sharp, and the sharpness has structure that random matrix theory can detect.

The Gram matrix $G = W^T W \in \mathbb{R}^{n \times n}$ encodes the global geometry of the representation. Diagonal entries $G_{ii} = \|w_i\|^2$ measure how faithfully each feature is represented, off-diagonal entries $G_{ij} = w_i \cdot w_j$ measure interference, and the eigenvalues of $G$ are the observable.

In the dedicated regime $W$ has at most $m$ nonzero columns, so $G$ has rank $\leq m$ and the spectrum is $m$ values near 1 and $n - m$ values at 0, cleanly bimodal.

In the superposition regime, $W$ has $n > m$ nonzero columns packed almost-orthogonally. The eigenvalues spread into a continuous distribution, governed by the Marchenko-Pastur law.

Theorem (Marchenko-Pastur Law, 1967).

Let $W \in \mathbb{R}^{m \times n}$ have i.i.d. entries with mean 0 and variance $\sigma^2/m$. As $m, n \to \infty$ with $n/m \to \gamma$, the empirical spectral distribution of $W^T W$ converges to the Marchenko-Pastur distribution with density:

$$f_{\gamma}(\lambda) = \frac{1}{2\pi \gamma \sigma^2} \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{\lambda} \, \mathbf{1}_{[\lambda_-, \lambda_+]}(\lambda)$$

where $\lambda_{\pm} = \sigma^2(1 \pm \sqrt{\gamma})^2$.

The learned weight matrix is not random (it has been optimized), but the eigenvalue distribution of $W^T W$ still reveals structure. Eigenvalues above the Marchenko-Pastur bulk edge $\lambda_+$ correspond to features with dedicated dimensions (the signal eigenvalues), and eigenvalues within the bulk correspond to noise-like interference from superposition.
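The bulk edges are easy to verify by simulation with an i.i.d. Gaussian $W$. A sketch (the dimensions are arbitrary illustrative choices):

```python
import numpy as np

m, n = 400, 1600                              # aspect ratio gamma = n/m = 4
rng = np.random.default_rng(1)
W = rng.standard_normal((m, n)) / np.sqrt(m)  # entries with variance 1/m
eigs = np.linalg.eigvalsh(W.T @ W)            # n eigenvalues of the Gram matrix

gamma = n / m
lam_minus = (1 - np.sqrt(gamma)) ** 2         # lower bulk edge, here 1
lam_plus = (1 + np.sqrt(gamma)) ** 2          # upper bulk edge, here 9

nonzero = eigs[eigs > 1e-8]                   # n - m eigenvalues sit at exactly 0
print(len(nonzero), nonzero.min(), nonzero.max())  # m values near the bulk [1, 9]
```

Since $n > m$, the Gram matrix is rank-deficient: the spectrum splits into an atom of $n - m$ zeros plus $m$ eigenvalues filling the Marchenko-Pastur bulk.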

This is the spiked covariance model from random matrix theory, and it connects to the BBP phase transition. A feature is individually detectable if and only if its importance exceeds a critical threshold that depends on $\gamma = n/m$.

Fig 3.1: Eigenvalue Anatomy (Adjust Sparsity)

At high sparsity the eigenvalues cluster at 0 and 1, each feature either cleanly represented or not, and as sparsity decreases they smear into a continuous distribution. The Marchenko-Pastur curve predicts the bulk edge, with eigenvalues above it corresponding to signal and eigenvalues within it to the geometric cost of packing.

Sparse autoencoders as geometric decoders

Superposition makes individual neurons polysemantic, meaning a single neuron responds to a linear combination of superposed features rather than a single one. Given the compressed representation $z = Wx \in \mathbb{R}^m$, the problem is to recover the original features $x \in \mathbb{R}^n$.

A sparse autoencoder (SAE) attempts this by learning an encoder $h = \text{ReLU}(W_{\text{enc}} z + b)$ and a decoder $\hat{z} = W_{\text{dec}} h$, with decoder columns $d_k = W_{\text{dec}}[:,k]$ as the learned feature directions. The loss is

$$\mathcal{L} = \|z - \hat{z}\|^2 + \lambda \|h\|_1$$

This is dictionary learning under a sparsity constraint. The decoder $W_{\text{dec}}$ is the dictionary, its columns are the atoms, and $h$ is the sparse code. For each input $z$, the SAE finds a sparse linear combination of dictionary atoms that reconstructs $z$.

The $L^1$ penalty serves a geometric role. Without it, the dictionary learns an arbitrary basis. With it, the code is forced sparse (most atoms off for any given input), which pushes dictionary atoms toward the true feature directions.
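The encoder, decoder, and penalty fit in a few lines. A minimal numpy sketch of the forward pass and loss (shapes and names are my own; real SAEs are trained by gradient descent on batches of activations):

```python
import numpy as np

def sae_forward(z, W_enc, b, W_dec):
    """z: (batch, m) activations; W_enc: (k, m); b: (k,); W_dec: (m, k)."""
    h = np.maximum(0.0, z @ W_enc.T + b)   # ReLU encoder -> sparse code (batch, k)
    z_hat = h @ W_dec.T                    # decode: columns of W_dec are the atoms
    return h, z_hat

def sae_loss(z, W_enc, b, W_dec, lam):
    h, z_hat = sae_forward(z, W_enc, b, W_dec)
    return np.sum((z - z_hat) ** 2) + lam * np.sum(np.abs(h))

# an identity dictionary reconstructs nonnegative inputs perfectly,
# so only the L1 term remains: 0.5 * (1 + 2) = 1.5
z = np.array([[1.0, 2.0]])
I = np.eye(2)
print(sae_loss(z, I, np.zeros(2), I, lam=0.5))  # -> 1.5
```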

Theorem (Recovery Condition (Compressed Sensing)).

Let $x \in \mathbb{R}^n$ be $s$-sparse and let $W \in \mathbb{R}^{m \times n}$ satisfy the Restricted Isometry Property of order $2s$:

$$(1-\delta)\|x\|^2 \leq \|Wx\|^2 \leq (1+\delta)\|x\|^2$$

for all $2s$-sparse vectors $x$ and some $\delta < \sqrt{2} - 1$. Then the $L^1$-minimization problem $\min \|h\|_1$ subject to $Wh = Wx$ recovers $x$ exactly.

The restricted isometry property connects to the packing geometry. A matrix satisfies RIP of order $s$ if every submatrix of $s$ columns is nearly an isometry, i.e. the columns are nearly orthogonal. The Johnson-Lindenstrauss lemma guarantees random matrices satisfy RIP with high probability.

Recovery succeeds if superposition is sufficiently sparse relative to coherence. If features activate with probability $p$ and directions have coherence $\mu$:

$$p \lesssim \frac{\mu^{-2}}{\log n}$$
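The $L^1$ program in the theorem can be solved in its Lagrangian form with iterative soft-thresholding (ISTA). A sketch under illustrative dimensions; the solver and the setup are mine, not from the original:

```python
import numpy as np

def ista(W, y, lam=0.01, iters=2000):
    """Approximately solve min_h 0.5 * ||W h - y||^2 + lam * ||h||_1."""
    L = np.linalg.norm(W, 2) ** 2              # Lipschitz constant of the gradient
    h = np.zeros(W.shape[1])
    for _ in range(iters):
        g = W.T @ (W @ h - y)                  # gradient of the smooth term
        h = h - g / L                          # gradient step
        h = np.sign(h) * np.maximum(np.abs(h) - lam / L, 0.0)  # soft-threshold
    return h

# a 3-sparse signal in R^200 observed through a random 60 x 200 map
rng = np.random.default_rng(2)
m, n = 60, 200
W = rng.standard_normal((m, n)) / np.sqrt(m)
x = np.zeros(n)
x[rng.choice(n, 3, replace=False)] = [1.5, -2.0, 1.0]
h = ista(W, W @ x)
print(np.linalg.norm(h - x) / np.linalg.norm(x))  # small relative error
```

With 60 measurements of a 3-sparse signal in $\mathbb{R}^{200}$, the random map comfortably satisfies the recovery condition and ISTA lands near the true features.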
Fig 4.1: SAE Dictionary Learning (Click Arrows to Isolate)

The faint dashed lines are the true feature directions and the solid arrows are the learned dictionary atoms. The $L^1$ penalty controls convergence. Too weak and the atoms wander in an over-complete dictionary, too strong and the dictionary collapses to a few directions. Selecting a dictionary atom highlights which data points activate it.

Circuits as transport maps

Features at a single layer are one level of structure. Mechanistic interpretability also studies circuits, subgraphs of the computational graph that implement identifiable algorithms. A circuit traces how input features compose through attention heads and MLP layers to produce output features.

Each layer implements a map $f_\ell: \mathbb{R}^d \to \mathbb{R}^d$ on the residual stream. The data distribution at layer $\ell$, $\mu_\ell$, is pushed forward to $\mu_{\ell+1} = (f_\ell)_\# \mu_\ell$. The full network is a composition of pushforward maps. The optimal transport cost between successive layers is

$$\mathcal{T}(\mu_\ell, \mu_{\ell+1}) = \inf_{\pi \in \Pi(\mu_\ell, \mu_{\ell+1})} \int \|x - y\|^2 \, d\pi(x, y)$$

The Wasserstein distance $W_2(\mu_\ell, \mu_{\ell+1}) = \mathcal{T}(\mu_\ell, \mu_{\ell+1})^{1/2}$ measures how much the representation rearranges at each layer. A circuit, a sparse interpretable subcomputation, is a low-rank approximation to this transport. It captures the dominant movement of probability mass from input to output features and ignores the rest.
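For small equal-size point clouds the transport cost can be computed by brute force over matchings. A toy sketch (illustration only; realistic layer distributions call for proper OT solvers):

```python
import numpy as np
from itertools import permutations

def w2_discrete(X, Y):
    """W2 between uniform measures on two equal-size point clouds (brute force)."""
    n = len(X)
    best = min(
        sum(np.sum((X[i] - Y[j]) ** 2) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return np.sqrt(best / n)

# a layer that purely translates the representation moves mass by ||shift||
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(w2_discrete(X, X + np.array([2.0, 0.0])))  # -> 2.0
```

A pure translation is the cheapest possible rearrangement: the optimal plan matches each point to its own image, so $W_2$ equals the shift length.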

Two circuits are close if they induce similar transport plans, moving similar features in similar directions. The Fisher information metric on the statistical manifold of layer representations captures this by measuring how sensitively the representation responds to input changes. A circuit is a geodesic on this manifold.

Fig 5.1: Layer Flow (Click Features to Trace Circuits)

The dense view shows all information flow at once, and the circuit view isolates the sparse interpretable subgraph. The computation is dense but the meaningful structure is sparse.

The information-theoretic limit

The network has $n$ features to encode in $m$ dimensions. Each feature is binary with activation probability $p$. Total information content is $n \cdot H(p)$ bits, where $H(p) = -p \log p - (1-p) \log(1-p)$. The representation has $m$ real-valued dimensions, in principle infinite capacity. But the features must be decodable.

The rate-distortion function gives the tradeoff. For distortion DD:

$$R(D) = n \cdot H(p) \cdot \left(1 - \frac{D}{D_{\max}}\right)^+$$

where $D_{\max}$ is the distortion of the trivial representation. No scheme with $m$ dimensions can achieve distortion below $R^{-1}(m)$.
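The capacity numbers are easy to tabulate. A sketch of $n \cdot H(p)$ in bits (the helper name is mine):

```python
import numpy as np

def bernoulli_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# sparser features carry fewer bits each, so more of them fit per dimension
n = 100
for p in (0.5, 0.1, 0.01):
    print(p, n * bernoulli_entropy(p))
```

At $p = 0.01$ each feature carries under a tenth of a bit, which is why high sparsity makes superposition so cheap in this accounting.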

The gap between this bound and SAE performance has three sources. Dead features never activate, wasting capacity. Feature splitting distributes a single true feature across multiple dictionary atoms. Absorption folds together correlated features that should be distinct. All three are geometric failures where the learned dictionary does not align with the true feature directions.

Fig 6.1: Rate-Distortion Landscape

As sparsity increases the theoretical bound drops and SAE performance improves, so the gap narrows but does not close. The critical dictionary size $k = n \cdot H(p)$ marks the transition between recoverable and unrecoverable. Below it information is irreversibly lost; above it recovery is possible in principle and the remaining gap is algorithmic.

Structured superposition

The Welch bound says how many features can coexist with bounded interference, the Marchenko-Pastur law separates signal from noise in the eigenspectrum, RIP says when sparse recovery is possible, and the rate-distortion function gives the fundamental limit of that recovery.

Fig 7.1: Concept Map (Click to Explore Neighborhoods)

But the theory above treats features as independent, binary, and uniform. Real features have hierarchical structure (syntactic features depend on lexical features, which depend on character-level features), correlations (subject detection and agreement co-activate), and varying dimensionalities (some one-dimensional, others spanning subspaces).

The geometry of structured superposition, where packing respects semantic relationships, where importance follows power laws, where correlations create low-dimensional manifolds within the full feature space, is mostly unexplored. The tools exist (structured random matrices, non-uniform concentration, Riemannian optimization on manifolds with symmetry) but whether real networks learn structure regular enough for these tools to apply is not yet known. Sparse autoencoders do recover interpretable features from models trained on natural data, which suggests regularity, but proving it requires understanding how training dynamics, data structure, and representation geometry interact.

References

Elhage, N., et al. “Toy Models of Superposition.” Transformer Circuits Thread, 2022.

Bricken, T., et al. “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.” Anthropic Research, 2023.

Templeton, A., et al. “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” Anthropic Research, 2024.

Johnson, W. B. and Lindenstrauss, J. “Extensions of Lipschitz mappings into a Hilbert space.” Contemp. Math., 26:189-206, 1984.

Welch, L. “Lower bounds on the maximum cross correlation of signals.” IEEE Trans. Inform. Theory, 20(3):397-399, 1974.

Marchenko, V. A. and Pastur, L. A. “Distribution of eigenvalues for some sets of random matrices.” Matematicheskii Sbornik, 114(4):507-536, 1967.

Donoho, D. L. “Compressed Sensing.” IEEE Trans. Inform. Theory, 52(4):1289-1306, 2006.

Candes, E. J. and Tao, T. “Near-optimal signal recovery from random projections.” IEEE Trans. Inform. Theory, 52(12):5406-5425, 2006.

Villani, C. Optimal Transport: Old and New. Springer, 2009.

Amari, S. Information Geometry and Its Applications. Springer, 2016.