The Geometry of Superposition
Superposition as a packing problem
A transformer’s residual stream has dimension $d$, typically 768 or 4096. The number of features the network must represent is far larger, perhaps by orders of magnitude: syntax, entities, sentiment, factual associations, positional information. There are more features than dimensions.
The network packs features into dimensions by exploiting sparsity. On any given input, most features are inactive. If two features rarely co-activate, their representations can overlap without interfering at inference time. The question is how much overlap geometry permits before the interference becomes unrecoverable.
Concentration of measure, random matrix theory, optimal transport, and information geometry are the right tools.
The toy model
Following Elhage et al. (2022), take a one-layer linear network with weights $W \in \mathbb{R}^{m \times n}$ where $m \ll n$. The network receives a feature vector $x \in \mathbb{R}^n$, compresses to $h = Wx \in \mathbb{R}^m$, and reconstructs via $\hat{x} = W^\top h = W^\top W x$. Each component $x_i$ is nonzero independently with probability $p$.
The loss is the expected weighted reconstruction error:
$$L = \mathbb{E}_x \sum_{i=1}^{n} I_i \left(x_i - e_i^\top W^\top W x\right)^2$$
where $I_i$ is the importance of feature $i$ and $e_i$ is the $i$-th standard basis vector. Features $i$ and $j$ are simultaneously active with probability $p^2$, so the loss decomposes:
$$L \approx \sum_i I_i\, p \left(1 - \|W_i\|^2\right)^2 + \sum_{i \neq j} I_i\, p^2 \left(W_i^\top W_j\right)^2$$
where $W_i$ is the $i$-th column of $W$. The first term drives each column toward unit norm (faithful representation). The second penalizes interference between columns (cross-talk).
When $p = 1$, all features are always active, interference dominates, and the network can only faithfully represent $m$ features by assigning them orthogonal columns and abandoning the rest. As $p \to 0$, interference vanishes relative to reconstruction, and the network can represent far more than $m$ features simultaneously.
The transition is sharp. Below a critical sparsity, the optimal solution changes qualitatively from dedicated orthogonal representations to dense, superposed packing.
For the toy model with $n$ features of equal importance in $m$ dimensions with sparsity $p$, the optimal representation transitions from dedicated (at most $m$ features represented) to superposed (up to $n$ features represented with controlled interference) as $p$ decreases below a critical threshold that depends on $n/m$.
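The crossover can be seen numerically. A minimal sketch, assuming equal importance $I_i = 1$ and the decomposed loss above: compare a dedicated packing against a random superposed packing at two sparsity levels (the sizes $n = 40$, $m = 10$ are illustrative choices, not from the text).

```python
import numpy as np

def toy_loss(W, p):
    # decomposed toy-model loss with equal importance I_i = 1:
    #   sum_i p (1 - ||W_i||^2)^2  +  sum_{i != j} p^2 (W_i . W_j)^2
    norms2 = np.sum(W**2, axis=0)                    # ||W_i||^2 per column
    G = W.T @ W                                      # Gram matrix
    interference = np.sum(G**2) - np.sum(np.diag(G)**2)
    return p * np.sum((1 - norms2)**2) + p**2 * interference

rng = np.random.default_rng(0)
n, m = 40, 10

# dedicated: m orthogonal columns, remaining n - m features dropped
W_ded = np.zeros((m, n))
W_ded[:, :m] = np.eye(m)

# superposed: all n features as random unit columns in R^m
V = rng.standard_normal((m, n))
W_sup = V / np.linalg.norm(V, axis=0)

for p in (1.0, 0.01):
    print(f"p={p}: dedicated={toy_loss(W_ded, p):.3f}, "
          f"superposed={toy_loss(W_sup, p):.3f}")
```

At $p = 1$ interference makes the superposed packing far worse; at $p = 0.01$ the superposed packing wins despite representing all $n$ features.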
At high sparsity and moderate compression, the optimal configurations minimize worst-case interference while maximizing represented features. These configurations are frames.
Almost-orthogonality
High-dimensional space is far more accommodating than low-dimensional intuition suggests. Volume concentrates in thin shells, random projections preserve distances, and typical configurations are more structured than arbitrary ones. The concentration of measure essay develops the general theory, and the consequence for superposition is immediate.
In $\mathbb{R}^2$, at most 3 unit vectors can achieve pairwise angles of $120°$ (the Mercedes-Benz frame), and in $\mathbb{R}^3$ at most 6. The growth looks linear, but in $\mathbb{R}^d$ with large $d$ the number of almost-orthogonal directions is exponential.
For any $\epsilon \in (0,1)$ and any integer $d$, there exist $N \geq e^{c\epsilon^2 d}$ unit vectors in $\mathbb{R}^d$, for an absolute constant $c > 0$, such that all pairwise inner products satisfy $|\langle v_i, v_j\rangle| \leq \epsilon$ for $i \neq j$.
The proof is probabilistic. Draw $N$ vectors uniformly from the unit sphere in $\mathbb{R}^d$. For any fixed pair, $\langle v_i, v_j\rangle$ is sub-Gaussian with parameter $O(1/\sqrt{d})$, so by a union bound over $\binom{N}{2}$ pairs
$$\Pr\left[\max_{i \neq j} |\langle v_i, v_j\rangle| > \epsilon\right] \leq N^2 e^{-c\epsilon^2 d}.$$
This is less than 1 whenever $N < e^{c\epsilon^2 d/2}$, so a good packing exists: $\mathbb{R}^d$ admits exponentially many pairwise $\epsilon$-orthogonal vectors.
In a residual stream of dimension $d = 4096$, this means millions of almost-orthogonal feature directions. Interference between any two superposed features is at most $\epsilon$, small enough that ReLU nonlinearities and activation sparsity suppress cross-talk.
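The probabilistic argument is easy to check empirically. A sketch with illustrative sizes: draw many more random unit vectors than dimensions and measure the worst pairwise interference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 512, 2000                      # N >> d; sizes are illustrative
V = rng.standard_normal((N, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # project to unit sphere

G = V @ V.T                           # all pairwise inner products
np.fill_diagonal(G, 0.0)
mu = np.abs(G).max()
print(f"{N} vectors in R^{d}: max |<v_i, v_j>| = {mu:.3f}")
```

Despite packing four times more vectors than dimensions, the worst-case inner product stays far below 1, on the order of $\sqrt{\log N / d}$.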
But almost-orthogonal is not optimally packed. How tight the best packing can be is governed by a lower bound on coherence.
For $N$ unit vectors in $\mathbb{R}^d$ with $N > d$, the maximum coherence $\mu = \max_{i \neq j} |\langle v_i, v_j\rangle|$ satisfies:
$$\mu \geq \sqrt{\frac{N - d}{d(N - 1)}}$$
Equality holds if and only if the vectors form an equiangular tight frame, meaning all pairwise $|\langle v_i, v_j\rangle|$ are equal and the frame operator satisfies $\sum_i v_i v_i^\top = \frac{N}{d} I$.
Equiangular tight frames are the densest possible packings. A network learning to superpose features is implicitly solving a frame design problem, and the Welch bound is the floor.
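The Welch floor is simple to compute and compare against a random packing (illustrative sizes; a random configuration sits well above the floor, which is exactly what equiangular tight frame constructions improve on):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 16, 40                          # illustrative: N > d unit vectors
V = rng.standard_normal((d, N))
V /= np.linalg.norm(V, axis=0)

G = np.abs(V.T @ V)
np.fill_diagonal(G, 0.0)
coherence = G.max()                    # worst pairwise |<v_i, v_j>|
welch = np.sqrt((N - d) / (d * (N - 1)))   # Welch lower bound
print(f"coherence = {coherence:.3f}, Welch floor = {welch:.3f}")
```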
The Gram matrix $G = V^\top V$ makes the tension visible. Off-diagonal entries are the interferences, and the optimal configuration is the one that minimizes the worst case. In two dimensions the packing is tightly constrained, but the JL lemma says this constraint relaxes exponentially with dimension.
Phase transitions in representation
The transition from dedicated to superposed is sharp, and the sharpness has structure that random matrix theory can detect.
The Gram matrix $G = W^\top W$ encodes the global geometry of the representation. Diagonal entries measure how faithfully each feature is represented, off-diagonal entries measure interference, and the eigenvalues of $G$ are the observable.
In the dedicated regime $W$ has at most $m$ nonzero columns, so $G$ has rank at most $m$ and the spectrum is $m$ values near 1 and $n - m$ values at 0, cleanly bimodal.
In the superposition regime, $W$ has $n$ nonzero columns packed almost-orthogonally. The eigenvalues spread into a continuous distribution, governed by the Marchenko-Pastur law.
Let $W \in \mathbb{R}^{m \times n}$ have i.i.d. entries with mean 0 and variance $1/m$. As $n, m \to \infty$ with $n/m \to \gamma > 0$, the empirical spectral distribution of $W^\top W$ converges to the Marchenko-Pastur distribution with density:
$$\rho(\lambda) = \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi\gamma\lambda}, \qquad \lambda \in [\lambda_-, \lambda_+],$$
plus a point mass at 0 when $\gamma > 1$, where $\lambda_\pm = (1 \pm \sqrt{\gamma})^2$.
The learned weight matrix is not random, it has been optimized, but the eigenvalue distribution of $W^\top W$ still reveals structure. Eigenvalues above the Marchenko-Pastur bulk edge correspond to features with dedicated dimensions (the signal eigenvalues); eigenvalues within the bulk correspond to noise-like interference from superposition.
This is the spiked covariance model from random matrix theory, and it connects to the BBP phase transition: a feature is individually detectable if and only if its importance exceeds a critical threshold that depends on the aspect ratio $\gamma = n/m$.
At high sparsity the eigenvalues cluster at 0 and 1, each feature either cleanly represented or not, and as sparsity decreases they smear into a continuous distribution. The Marchenko-Pastur curve predicts the bulk edge, with eigenvalues above it corresponding to signal and eigenvalues within it to the geometric cost of packing.
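A quick empirical check, with illustrative dimensions: the spectrum of $W^\top W$ for a random $W$ lands inside the predicted Marchenko-Pastur bulk.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 400, 200                        # aspect ratio gamma = n/m = 0.5
W = rng.standard_normal((m, n)) / np.sqrt(m)   # i.i.d. entries, variance 1/m
eigs = np.linalg.eigvalsh(W.T @ W)

gamma = n / m
edge_lo = (1 - np.sqrt(gamma)) ** 2    # lower bulk edge, ~0.086
edge_hi = (1 + np.sqrt(gamma)) ** 2    # upper bulk edge, ~2.914
print(f"spectrum in [{eigs.min():.3f}, {eigs.max():.3f}], "
      f"MP bulk [{edge_lo:.3f}, {edge_hi:.3f}]")
```

Eigenvalues of a trained $W$ that escape above `edge_hi` are the candidates for dedicated feature directions.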
Sparse autoencoders as geometric decoders
Superposition makes individual neurons polysemantic, meaning a single neuron responds to a linear combination of superposed features rather than a single one. Given the compressed representation $h = Wx$, the problem is to recover the original features $x$.
A sparse autoencoder (SAE) attempts this by learning an encoder $z = f(h) = \mathrm{ReLU}(W_{\text{enc}} h + b_{\text{enc}})$ and a decoder $\hat{h} = W_{\text{dec}} z$, with the columns of $W_{\text{dec}}$ as the learned feature directions. The loss is
$$L = \left\|h - W_{\text{dec}} f(h)\right\|_2^2 + \lambda \left\|f(h)\right\|_1$$
This is dictionary learning under a sparsity constraint. The decoder is the dictionary, its columns are the atoms, and $z = f(h)$ is the sparse code. For each input $h$, the SAE finds the sparsest linear combination of dictionary atoms that reconstructs $h$.
The $\ell_1$ penalty serves a geometric role. Without it, the dictionary learns an arbitrary basis. With it, the code is forced sparse, most atoms off for any given input, which pushes dictionary atoms toward the true feature directions.
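A minimal numpy sketch of this objective on synthetic superposed data. The dictionary size, activation probability, penalty $\lambda$, and learning rate are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, lam, lr = 16, 64, 1e-3, 0.05     # stream dim, dict size, l1 weight, step

# synthetic superposed activations: sparse codes over random true directions
D_true = rng.standard_normal((d, k))
D_true /= np.linalg.norm(D_true, axis=0)

def batch(b=256):
    z = (rng.random((b, k)) < 0.05) * rng.random((b, k))
    return z @ D_true.T

W_enc = 0.1 * rng.standard_normal((k, d))
b_enc = np.zeros(k)
W_dec = 0.1 * rng.standard_normal((d, k))

def forward(H):
    Z = np.maximum(H @ W_enc.T + b_enc, 0.0)        # code f(h)
    return Z, Z @ W_dec.T                            # code, reconstruction

H_eval = batch(1024)
mse_before = np.mean((forward(H_eval)[1] - H_eval) ** 2)

for _ in range(2000):
    H = batch()
    Z, H_hat = forward(H)
    R = H_hat - H
    gZ = (2 * R @ W_dec + lam) * (Z > 0)            # grad through relu + l1
    W_dec -= lr * 2 * R.T @ Z / len(H)
    W_enc -= lr * gZ.T @ H / len(H)
    b_enc -= lr * gZ.mean(0)

mse_after = np.mean((forward(H_eval)[1] - H_eval) ** 2)
print(mse_before, mse_after)
```

Reconstruction error on the held-out batch drops over training while the $\ell_1$ term pressures most code units toward zero.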
Let $x$ be $k$-sparse and let $A \in \mathbb{R}^{m \times n}$ satisfy the Restricted Isometry Property of order $2k$:
$$(1 - \delta)\|z\|_2^2 \leq \|Az\|_2^2 \leq (1 + \delta)\|z\|_2^2$$
for all $2k$-sparse vectors $z$ and some $\delta < \sqrt{2} - 1$. Then the $\ell_1$-minimization problem $\min_z \|z\|_1$ subject to $Az = Ax$ recovers $x$ exactly.
The restricted isometry property connects to the packing geometry. A matrix satisfies RIP of order $k$ if every submatrix of $k$ columns is nearly an isometry, i.e. the columns are nearly orthogonal. Johnson-Lindenstrauss-style concentration guarantees that random matrices satisfy RIP with high probability.
Recovery succeeds if superposition is sufficiently sparse relative to coherence. If features activate with probability $p$ and directions have coherence $\mu$, a classical sufficient condition is that the number of simultaneously active features, $k \approx np$, satisfies:
$$k < \frac{1}{2}\left(1 + \frac{1}{\mu}\right)$$
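A sketch of sparse recovery in this regime, substituting greedy orthogonal matching pursuit for the $\ell_1$ program (both succeed when few features are active relative to coherence; all sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 128, 512, 3                  # dims, dictionary size, active features
A = rng.standard_normal((d, n))
A /= np.linalg.norm(A, axis=0)         # unit-norm atoms

x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.uniform(1.0, 2.0, size=k)
h = A @ x                              # superposed observation in R^d

# orthogonal matching pursuit: greedily pick the atom most correlated
# with the residual, then re-fit coefficients on the chosen support
residual, chosen = h.copy(), []
for _ in range(k):
    chosen.append(int(np.argmax(np.abs(A.T @ residual))))
    coef, *_ = np.linalg.lstsq(A[:, chosen], h, rcond=None)
    residual = h - A[:, chosen] @ coef

x_hat = np.zeros(n)
x_hat[chosen] = coef
print(sorted(chosen) == sorted(support.tolist()))
```

With 3 active features in 128 dimensions the greedy pass finds the exact support, even though the dictionary is 4x overcomplete.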
[Figure: faint dashed lines show the true feature directions; solid arrows show the learned dictionary atoms, each highlighting the data points that activate it.] The $\ell_1$ penalty $\lambda$ controls convergence: too weak and the atoms wander in an over-complete dictionary, too strong and the dictionary collapses to a few directions.
Circuits as transport maps
Features at a single layer are one level of structure. Mechanistic interpretability also studies circuits, subgraphs of the computational graph that implement identifiable algorithms. A circuit traces how input features compose through attention heads and MLP layers to produce output features.
Each layer $\ell$ implements a map $T_\ell$ on the residual stream. The data distribution at layer $\ell$, $\mu_\ell$, is pushed forward to $\mu_{\ell+1} = (T_\ell)_\# \mu_\ell$. The full network is a composition of pushforward maps.
The Wasserstein distance $W_2(\mu_\ell, \mu_{\ell+1})$ measures how much the representation rearranges at each layer. A circuit, a sparse interpretable subcomputation, is a low-rank approximation to this transport: it captures the dominant movement of probability mass from input to output features and ignores the rest.
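One cheap, purely illustrative proxy for this quantity: the 1-D Wasserstein distance between activations before and after a layer, averaged over random projections (a sliced approximation to $W_2$, computed here on synthetic data with `scipy.stats.wasserstein_distance`).

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
d, b = 32, 2000
X = rng.standard_normal((b, d))            # stand-in layer-l activations
T = np.eye(d) + 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)
Y = np.maximum(X @ T.T, 0)                 # stand-in layer l+1 (linear + relu)

moves = []
for _ in range(16):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                 # random unit direction
    moves.append(wasserstein_distance(X @ u, Y @ u))
print(np.mean(moves))                      # avg mass moved along projections
```

A layer that mostly routes information without reshaping the distribution scores low on this probe; a layer doing heavy feature construction scores high.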
Two circuits are close if they induce similar transport plans, moving similar features in similar directions. The Fisher information metric on the statistical manifold of layer representations captures this by measuring how sensitively the representation responds to input changes. A circuit is a geodesic on this manifold.
The dense view shows all information flow at once, and the circuit view isolates the sparse interpretable subgraph. The computation is dense but the meaningful structure is sparse.
The information-theoretic limit
The network has $n$ features to encode in $m$ dimensions. Each feature is binary with activation probability $p$. Total information content is $n H(p)$ bits, where $H(p) = -p \log_2 p - (1 - p)\log_2(1 - p)$. The representation has $m$ real-valued dimensions, in principle infinite capacity. But the features must be decodable.
The rate-distortion function gives the tradeoff. For per-feature distortion $D$:
$$R(D) = n\left[H(p) - H(D)\right], \qquad 0 \leq D \leq D_0,$$
where $D_0 = p$ is the distortion of the trivial representation that always outputs zero. No scheme can achieve error below $D$ while conveying fewer than $R(D)$ bits through the $m$ dimensions.
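The Bernoulli rate-distortion curve is easy to tabulate (the specific $n$ and $p$ below are illustrative):

```python
import numpy as np

def H(q):
    # binary entropy in bits, clipped away from 0 and 1 for stability
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return float(-q * np.log2(q) - (1 - q) * np.log2(1 - q))

def rate(n, p, D):
    # R(D) = n [H(p) - H(D)] bits, valid for 0 <= D <= p
    return n * max(H(p) - H(D), 0.0)

n, p = 100_000, 0.01
for D in (1e-4, 1e-3, 5e-3, 1e-2):
    print(f"D={D:g}: R(D) = {rate(n, p, D):,.0f} bits")
```

The curve falls to zero at $D = p$, where the trivial all-zeros decoder already achieves the distortion; below that, every bit of slack must be carried by the representation.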
The gap between this bound and SAE performance has three sources. Dead features never activate, wasting capacity. Feature splitting distributes a single true feature across multiple dictionary atoms. Absorption folds together correlated features that should be distinct. All three are geometric failures where the learned dictionary does not align with the true feature directions.
As sparsity increases ($p$ decreases), the theoretical bound drops and SAE performance improves, so the gap narrows but does not close. The critical dictionary size marks the transition between recoverable and unrecoverable: below it information is irreversibly lost, above it recovery is possible in principle and the remaining gap is algorithmic.
Structured superposition
The Welch bound says how many features can coexist with bounded interference, the Marchenko-Pastur law separates signal from noise in the eigenspectrum, RIP says when sparse recovery is possible, and the rate-distortion function gives the fundamental limit of that recovery.
But the theory above treats features as independent, binary, and uniform. Real features have hierarchical structure (syntactic features depend on lexical features, which depend on character-level features), correlations (subject detection and agreement co-activate), and varying dimensionalities (some one-dimensional, others spanning subspaces).
The geometry of structured superposition, where packing respects semantic relationships, where importance follows power laws, where correlations create low-dimensional manifolds within the full feature space, is mostly unexplored. The tools exist (structured random matrices, non-uniform concentration, Riemannian optimization on manifolds with symmetry) but whether real networks learn structure regular enough for these tools to apply is not yet known. Sparse autoencoders do recover interpretable features from models trained on natural data, which suggests regularity, but proving it requires understanding how training dynamics, data structure, and representation geometry interact.
References
Elhage, N., et al. “Toy Models of Superposition.” Transformer Circuits Thread, 2022.
Bricken, T., et al. “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.” Anthropic Research, 2023.
Templeton, A., et al. “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” Anthropic Research, 2024.
Johnson, W. B. and Lindenstrauss, J. “Extensions of Lipschitz mappings into a Hilbert space.” Contemp. Math., 26:189-206, 1984.
Welch, L. “Lower bounds on the maximum cross correlation of signals.” IEEE Trans. Inform. Theory, 20(3):397-399, 1974.
Marchenko, V. A. and Pastur, L. A. “Distribution of eigenvalues for some sets of random matrices.” Matematicheskii Sbornik, 114(4):507-536, 1967.
Donoho, D. L. “Compressed Sensing.” IEEE Trans. Inform. Theory, 52(4):1289-1306, 2006.
Candès, E. J. and Tao, T. “Near-optimal signal recovery from random projections.” IEEE Trans. Inform. Theory, 52(12):5406-5425, 2006.
Villani, C. Optimal Transport: Old and New. Springer, 2009.
Amari, S. Information Geometry and Its Applications. Springer, 2016.