The Geometry of Superposition
Superposition as a packing problem
A transformer’s residual stream has a dimension $d$ of about 768 or 4096, but the number of features $n$ the network has to represent is far larger, maybe by orders of magnitude, since it needs to track syntax, entities, sentiment, factual associations, and positional information all at once. There are simply more features than dimensions to hold them.
The network squeezes features into dimensions by using sparsity. Most features are silent on any given input, and when two features rarely fire together their representations can overlap without interfering at inference time. The real question is how much overlap the geometry allows before the interference becomes impossible to untangle.
The right tools for answering this are concentration of measure, random matrix theory, optimal transport, and information geometry.
The toy model
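Following Elhage et al. (2022), take a one-layer linear network with weights $W \in \mathbb{R}^{d \times n}$ where $d < n$, so the network takes a feature vector $x \in \mathbb{R}^n$, squeezes it down to $h = Wx \in \mathbb{R}^d$, and then rebuilds it as $\hat{x} = W^\top h = W^\top W x$. Each component $x_i$ is nonzero independently with probability $p$.
The loss is the expected weighted reconstruction error:
$$L(W) = \mathbb{E}_x\left[\sum_{i=1}^{n} I_i \left(x_i - e_i^\top W^\top W x\right)^2\right]$$
where $I_i$ is the importance of feature $i$ and $e_i$ is the $i$-th standard basis vector. Two features $i$ and $j$ are active at the same time with probability $p^2$, so the loss splits apart cleanly:
$$L(W) \approx p \sum_{i} I_i \left(1 - \|w_i\|^2\right)^2 + p^2 \sum_{i \neq j} I_i \left(w_i^\top w_j\right)^2$$
where $w_i$ is the $i$-th column of $W$. The first term pushes each column toward unit norm so each feature is faithfully represented, and the second term punishes interference between columns, which is the cross-talk.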
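To make the cross-talk concrete, here is a minimal sketch of the compress-then-reconstruct pass, assuming the model is $h = Wx$ followed by $\hat{x} = W^\top h$. The choice of three features in two dimensions and the helper names (`mercedes_benz_frame`, `compress`, `reconstruct`) are illustrative assumptions, not taken from Elhage et al.

```python
import math

def mercedes_benz_frame():
    """Three unit vectors in R^2, 120 degrees apart, as columns of a 2x3 matrix W."""
    angles = [math.radians(a) for a in (90.0, 210.0, 330.0)]
    return [[math.cos(a) for a in angles],
            [math.sin(a) for a in angles]]

def compress(W, x):
    """h = W x: project n features down to d dimensions."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in W]

def reconstruct(W, h):
    """x_hat = W^T h: read each feature back out along its own column."""
    n = len(W[0])
    return [sum(W[k][i] * h[k] for k in range(len(h))) for i in range(n)]

W = mercedes_benz_frame()
x = [1.0, 0.0, 0.0]            # only feature 0 is active
h = compress(W, x)
x_hat = reconstruct(W, h)
print([round(v, 3) for v in x_hat])
```

The active feature comes back at full strength, while the two silent features pick up an interference of about $-0.5$ each, which is exactly the off-diagonal inner product between their columns and the active one.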
When $p = 1$ all features fire at once and interference dominates, so the network can only represent $d$ features faithfully by picking $d$ orthogonal columns and abandoning the rest. As $p$ drops toward zero the interference fades relative to reconstruction, and the network can carry far more than $d$ features at once.
The transition between these regimes is sharp, and below a critical activation probability the optimal solution flips from dedicated orthogonal representations to dense superposed packing.
For the toy model with $n$ features of equal importance in $d$ dimensions with activation probability $p$, the optimal representation transitions from dedicated (at most $d$ features represented) to superposed (up to $n$ features represented with controlled interference) as $p$ decreases below a critical threshold that depends on $n/d$.
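The flip between regimes can be seen in a tiny worked case. The sketch below assumes the decomposed loss has the form $p \sum_i I_i (1 - \|w_i\|^2)^2 + p^2 \sum_{i \neq j} I_i (w_i^\top w_j)^2$ and compares only two hand-picked candidates for three equally important features in two dimensions, a dedicated layout and a Mercedes-Benz layout. It does not search for the true optimum, and the function and variable names are hypothetical.

```python
import math

def decomposed_loss(W, p, importance=None):
    """Assumed form: p * fidelity term + p^2 * pairwise interference term."""
    d, n = len(W), len(W[0])
    imp = importance or [1.0] * n
    cols = [[W[k][i] for k in range(d)] for i in range(n)]

    def dot(a, b):
        return sum(u * v for u, v in zip(a, b))

    fidelity = sum(imp[i] * (1.0 - dot(cols[i], cols[i])) ** 2 for i in range(n))
    interference = sum(imp[i] * dot(cols[i], cols[j]) ** 2
                       for i in range(n) for j in range(n) if i != j)
    return p * fidelity + p * p * interference

# Dedicated: two orthonormal columns, third feature dropped (zero column).
W_dedicated = [[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]]

# Superposed: three unit columns at 120 degrees (Mercedes-Benz frame).
angles = [math.radians(a) for a in (90.0, 210.0, 330.0)]
W_superposed = [[math.cos(a) for a in angles],
                [math.sin(a) for a in angles]]

for p in (0.9, 0.5, 0.1):
    ded = decomposed_loss(W_dedicated, p)
    sup = decomposed_loss(W_superposed, p)
    winner = "dedicated" if ded < sup else "superposed"
    print(f"p={p}: dedicated={ded:.4f} superposed={sup:.4f} -> {winner}")
```

For these two fixed candidates the losses are $p$ and $1.5\,p^2$, so the superposed layout wins once $p < 2/3$. The true critical value depends on what the optimizer can reach, so this crossover is only indicative.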
At high sparsity and moderate compression the best configurations squeeze in as many features as possible while keeping worst-case interference low, and these configurations are frames.
Almost-orthogonality
High-dimensional space is far roomier than intuition from $\mathbb{R}^3$ would suggest, because volume concentrates in thin shells, random projections preserve distances, and typical configurations are more structured than arbitrary ones. The concentration of measure essay works out the general theory, and the consequence for superposition falls out immediately.
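In $\mathbb{R}^2$ at most 3 unit vectors can sit at equal pairwise angles of $120^\circ$ (the Mercedes-Benz frame), and in $\mathbb{R}^3$ at most 6 lines can be pairwise equiangular (the diagonals of an icosahedron). The growth looks linear at first, but once a small tolerance on the inner products is allowed, the count in $\mathbb{R}^d$ with large $d$ becomes exponential in $d$.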
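For any $\epsilon \in (0, 1)$ and any integer $n$, there exist $n$ unit vectors $v_1, \dots, v_n$ in $\mathbb{R}^d$ with $d = O(\epsilon^{-2} \log n)$ such that all pairwise inner products satisfy $|\langle v_i, v_j \rangle| \le \epsilon$ for $i \neq j$.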
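The proof is probabilistic. Draw $n$ vectors uniformly from the unit sphere in $\mathbb{R}^d$, and for any fixed pair the inner product $\langle v_i, v_j \rangle$ is sub-Gaussian with parameter $1/\sqrt{d}$, so a union bound over the $\binom{n}{2}$ pairs gives
$$\Pr\left[\max_{i \neq j} |\langle v_i, v_j \rangle| > \epsilon\right] \le 2 \binom{n}{2} e^{-d\epsilon^2/2} \le n^2 e^{-d\epsilon^2/2}$$
This is less than 1 whenever $d > 4 \epsilon^{-2} \ln n$, so a good packing exists, and $\mathbb{R}^d$ fits $e^{d\epsilon^2/4}$ pairwise $\epsilon$-orthogonal vectors.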
In a residual stream of dimension $d$ in the thousands this means millions of almost-orthogonal feature directions, and the interference between any two superposed features is about $1/\sqrt{d}$, small enough that ReLU nonlinearities and activation sparsity wipe out the cross-talk.
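The probabilistic argument is easy to check empirically. The sketch below draws twice as many random unit vectors as there are dimensions and measures their worst pairwise inner product. The sizes ($d = 100$, $n = 200$), the fixed seed, and the tolerance $\epsilon = \sqrt{6 \ln n / d}$ (chosen so that the union bound $n^2 e^{-d\epsilon^2/2}$ equals $1/n$) are assumptions made for illustration.

```python
import math
import random

def random_unit_vector(d, rng):
    """Normalised Gaussian vector, which is uniform on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def max_coherence(vectors):
    """Largest absolute inner product over all distinct pairs."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            ip = abs(sum(a * b for a, b in zip(vectors[i], vectors[j])))
            worst = max(worst, ip)
    return worst

rng = random.Random(0)           # fixed seed so the run is repeatable
d, n = 100, 200                  # twice as many vectors as dimensions
vectors = [random_unit_vector(d, rng) for _ in range(n)]

mu = max_coherence(vectors)
eps = math.sqrt(6.0 * math.log(n) / d)
failure_bound = n ** 2 * math.exp(-d * eps ** 2 / 2.0)

print(f"max coherence = {mu:.3f}, tolerance eps = {eps:.3f}")
print(f"union bound on the failure probability = {failure_bound:.4f}")
```

At this small scale the tolerance is still loose. The point is only that the measured coherence sits under the bound, and that the bound tightens as $d$ grows with $n$ held fixed.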
Almost-orthogonal is not the same as optimally packed, and the best packing has to obey a lower bound.
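For $n$ unit vectors $v_1, \dots, v_n$ in $\mathbb{R}^d$ with $n \ge d$, the maximum coherence $\mu = \max_{i \neq j} |\langle v_i, v_j \rangle|$ satisfies:
$$\mu \ge \sqrt{\frac{n - d}{d\,(n - 1)}}$$
Equality holds if and only if the vectors form an equiangular tight frame, meaning all pairwise $|\langle v_i, v_j \rangle|$ are equal and $\sum_{i} v_i v_i^\top = \frac{n}{d} I_d$.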
Equiangular tight frames are the densest possible packings, so a network learning to superpose features is quietly solving a frame design problem and the Welch bound is the floor it can never drop below.
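As a quick sanity check, the Welch bound $\sqrt{(n-d)/(d(n-1))}$ can be evaluated for three vectors in the plane and compared with the coherence of the Mercedes-Benz frame. This is a minimal sketch with hypothetical helper names.

```python
import math

def welch_bound(n, d):
    """Lower bound on the maximum coherence of n unit vectors in R^d (n >= d, n >= 2)."""
    if n < d or n < 2:
        raise ValueError("the bound is stated for n >= d and n >= 2")
    return math.sqrt((n - d) / (d * (n - 1)))

def coherence(vectors):
    """Largest absolute inner product between distinct unit vectors."""
    return max(abs(sum(a * b for a, b in zip(u, v)))
               for i, u in enumerate(vectors)
               for v in vectors[i + 1:])

# Mercedes-Benz frame: three unit vectors in R^2 at 120 degrees.
frame = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
         for a in (90.0, 210.0, 330.0)]

print(f"Welch bound (n=3, d=2): {welch_bound(3, 2):.4f}")
print(f"frame coherence:        {coherence(frame):.4f}")
```

Both numbers come out to $0.5$, so this frame meets the bound with equality, as expected for an equiangular tight frame.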
The Gram matrix makes the tension visible, since the off-diagonal entries are the interferences and the best configuration is the one that keeps the worst case small. In two dimensions the packing is tightly constrained, but the JL lemma says this constraint loosens exponentially as dimension grows.
Phase transitions in representation
The jump from dedicated to superposed is sharp, and the sharpness has a structure that random matrix theory can pick out.
The Gram matrix $G = W^\top W$ holds the global geometry of the representation, with diagonal entries showing how faithfully each feature is represented and off-diagonal entries showing the interference, and the eigenvalues of $G$ are what we actually observe.
In the dedicated regime $W$ has at most $d$ nonzero columns, so $G$ has rank at most $d$ and the spectrum is $d$ values near 1 and $n - d$ values at 0, a clean bimodal split.
In the superposition regime $W$ has $n \gg d$ nonzero columns packed almost-orthogonally, and the nonzero eigenvalues smear out into a continuous bulk whose shape is governed by the Marchenko-Pastur law.
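Let $X \in \mathbb{R}^{d \times n}$ have i.i.d. entries with mean 0 and variance $\sigma^2$. As $d, n \to \infty$ with $d/n \to \gamma \in (0, 1]$, the empirical spectral distribution of $\frac{1}{n} X X^\top$ converges to the Marchenko-Pastur distribution with density:
$$\rho(\lambda) = \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi \sigma^2 \gamma \lambda}, \qquad \lambda \in [\lambda_-, \lambda_+]$$
where $\lambda_\pm = \sigma^2 (1 \pm \sqrt{\gamma})^2$.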
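The bulk edges and density are simple to evaluate numerically. The sketch below assumes the standard convention in which $\gamma \le 1$ is the ratio of dimensions to samples and $\sigma^2$ is the entry variance, so the edges are $\sigma^2 (1 \pm \sqrt{\gamma})^2$. It checks that the density integrates to one and has mean $\sigma^2$. The function names and the choice $\gamma = 0.25$ are illustrative.

```python
import math

def mp_edges(gamma, sigma2=1.0):
    """Lower and upper edges of the Marchenko-Pastur bulk."""
    root = math.sqrt(gamma)
    return sigma2 * (1.0 - root) ** 2, sigma2 * (1.0 + root) ** 2

def mp_density(lam, gamma, sigma2=1.0):
    """Marchenko-Pastur density for aspect ratio gamma in (0, 1]."""
    lo, hi = mp_edges(gamma, sigma2)
    if lam <= lo or lam >= hi:
        return 0.0
    return math.sqrt((hi - lam) * (lam - lo)) / (2.0 * math.pi * sigma2 * gamma * lam)

def integrate(f, a, b, steps=20000):
    """Midpoint rule, adequate for the square-root behaviour at the edges."""
    width = (b - a) / steps
    return width * sum(f(a + (k + 0.5) * width) for k in range(steps))

gamma = 0.25
lo, hi = mp_edges(gamma)
mass = integrate(lambda t: mp_density(t, gamma), lo, hi)
mean = integrate(lambda t: t * mp_density(t, gamma), lo, hi)

print(f"bulk edges: [{lo:.4f}, {hi:.4f}]")
print(f"total mass = {mass:.4f}, mean = {mean:.4f}")
```

An eigenvalue of a learned Gram matrix that lands clearly above the upper edge is a candidate signal direction. Anything inside the interval is indistinguishable, by this test alone, from random packing.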
The learned weight matrix $W$ has been optimized rather than drawn at random, but the eigenvalue distribution of $G = W^\top W$ still shows structure. Eigenvalues above the Marchenko-Pastur bulk edge $\lambda_+$ are the signal eigenvalues and correspond to features with dedicated dimensions, and eigenvalues inside the bulk are the noise-like interference that comes from superposition.
This is the spiked covariance model from random matrix theory, and it is the same story as the BBP phase transition. A feature is individually detectable exactly when its importance crosses a critical threshold that depends on $\gamma$.
When features are dense ($p$ near 1) the eigenvalues cluster at 0 and 1 because each feature is either cleanly represented or not, and as $p$ drops they smear into a continuous distribution. The Marchenko-Pastur curve predicts the bulk edge, so eigenvalues above it are signal and eigenvalues inside it are the geometric cost of packing.
Sparse autoencoders as geometric decoders
Superposition makes individual neurons polysemantic because a single neuron ends up responding to a linear combination of superposed features rather than just one. Given the compressed representation $h = Wx$, the problem is to pull the original features $x$ back out.
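A sparse autoencoder (SAE) tries to do this by learning an encoder $f(h) = \mathrm{ReLU}(W_{\text{enc}} h + b_{\text{enc}})$ and a decoder $\hat{h} = W_{\text{dec}} f(h) + b_{\text{dec}}$, where the decoder columns are the learned feature directions. The loss is
$$L = \mathbb{E}_h\left[\|h - \hat{h}\|_2^2 + \lambda \|f(h)\|_1\right]$$
This is dictionary learning with a sparsity constraint, where the decoder $W_{\text{dec}}$ is the dictionary, its columns are the atoms, and $f(h)$ is the sparse code. For each input $h$ the SAE hunts for the sparsest linear combination of dictionary atoms that rebuilds $h$.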
The $\ell_1$ penalty plays a geometric role. Without it the dictionary settles on some arbitrary basis, but with it the code is forced sparse so most atoms are off for any given input, which pushes the dictionary atoms toward the true feature directions.
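A minimal sketch of the SAE objective for a single input may help fix ideas. It assumes a ReLU encoder, an affine decoder, and a loss of squared reconstruction error plus $\lambda$ times the $\ell_1$ norm of the code. The weights below are hand-picked toy values and all names are hypothetical. Nothing is trained here.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]

def sae_loss(h, W_enc, b_enc, W_dec, b_dec, lam):
    """Return (loss, code, reconstruction) for one activation vector h."""
    pre = [a + b for a, b in zip(matvec(W_enc, h), b_enc)]
    code = [max(0.0, c) for c in pre]                      # ReLU encoder
    h_hat = [a + b for a, b in zip(matvec(W_dec, code), b_dec)]
    recon = sum((a - b) ** 2 for a, b in zip(h, h_hat))    # squared error
    l1 = sum(abs(c) for c in code)                         # sparsity penalty
    return recon + lam * l1, code, h_hat

# Toy setup: 2-dimensional activations, 3 dictionary atoms.
W_enc = [[1.0, 0.0],
         [0.0, 1.0],
         [-1.0, 0.0]]           # one row per latent
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0],
         [0.0, 1.0, 0.0]]       # one column per atom
b_dec = [0.0, 0.0]

loss, code, h_hat = sae_loss([1.0, 0.0], W_enc, b_enc, W_dec, b_dec, lam=0.1)
print(code, h_hat, loss)
```

Here only the first atom fires, the reconstruction is exact, and the whole loss comes from the sparsity penalty on that one active latent.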
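Let $x \in \mathbb{R}^n$ be $k$-sparse and let $W \in \mathbb{R}^{d \times n}$ satisfy the Restricted Isometry Property of order $2k$:
$$(1 - \delta)\,\|z\|_2^2 \le \|Wz\|_2^2 \le (1 + \delta)\,\|z\|_2^2$$
for all $2k$-sparse vectors $z$ and some $\delta < \sqrt{2} - 1$. Then the $\ell_1$-minimization problem $\min_z \|z\|_1$ subject to $Wz = Wx$ recovers $x$ exactly.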
RIP is the same packing condition in disguise. A matrix has RIP of order $k$ when every choice of $k$ columns is nearly an isometry, which is exactly the almost-orthogonality from before, and the Johnson-Lindenstrauss lemma guarantees random matrices satisfy it with high probability once $d$ is on the order of $k \log(n/k)$.
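Sparse recovery can be tried out on the smallest interesting dictionary. The sketch below does not solve the equality-constrained $\ell_1$ problem itself. It swaps in ISTA (iterative soft thresholding) on the lasso relaxation $\tfrac{1}{2}\|Wz - h\|_2^2 + \lambda \|z\|_1$, which needs no solver library. The dictionary (a Mercedes-Benz frame), the penalty $\lambda = 0.01$, the step size, and the iteration count are all assumptions chosen for illustration.

```python
import math

def soft_threshold(v, t):
    """Proximal operator of t * |v|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def ista(W, h, lam, steps=2000):
    """Minimise 0.5 * ||W z - h||^2 + lam * ||z||_1 by iterative soft thresholding."""
    d, n = len(W), len(W[0])
    # 1 / ||W||_F^2 is a safe step: the Frobenius norm bounds the spectral norm.
    step = 1.0 / sum(W[k][i] ** 2 for k in range(d) for i in range(n))
    z = [0.0] * n
    for _ in range(steps):
        residual = [sum(W[k][i] * z[i] for i in range(n)) - h[k] for k in range(d)]
        grad = [sum(W[k][i] * residual[k] for k in range(d)) for i in range(n)]
        z = [soft_threshold(z[i] - step * grad[i], lam * step) for i in range(n)]
    return z

# Dictionary: three unit atoms in R^2 at 120 degrees (coherence 0.5).
angles = [math.radians(a) for a in (90.0, 210.0, 330.0)]
W = [[math.cos(a) for a in angles],
     [math.sin(a) for a in angles]]

true_code = [1.0, 0.0, 0.0]                      # 1-sparse ground truth
h = [sum(W[k][i] * true_code[i] for i in range(3)) for k in range(2)]

z = ista(W, h, lam=0.01)
print([round(v, 4) for v in z])
```

The recovered code has the right support, with the active coefficient shrunk slightly below 1 by the penalty. That shrinkage is a property of the lasso relaxation, not of exact $\ell_1$ minimization.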
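Recovery works when the features fire sparsely enough that the small interferences between them stay below the noise floor. If features activate with probability $p$ and the directions have coherence $\mu$ then roughly $k \approx pn$ features are active at once, and one standard sufficient condition for exact recovery is
$$pn < \frac{1}{2}\left(1 + \frac{1}{\mu}\right)$$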
In the accompanying figure, the faint dashed lines are the true feature directions and the solid arrows are the learned dictionary atoms, and the strength of the $\ell_1$ penalty controls convergence. Too weak and the atoms wander around an over-complete dictionary, too strong and the dictionary collapses to a handful of directions.
Circuits as transport maps
Features at a single layer are one level of structure, but mechanistic interpretability also studies circuits, the subgraphs of the computational graph that carry out identifiable algorithms. A circuit traces how input features combine through attention heads and MLP layers to produce output features.
Each layer is a map $T_\ell$ on the residual stream, and the data distribution at layer $\ell$, written $\nu_\ell$, is pushed forward to $\nu_{\ell+1} = (T_\ell)_\# \nu_\ell$, so the full network is a composition of pushforward maps.
The Wasserstein distance $\mathcal{W}_2(\nu_\ell, \nu_{\ell+1})$ measures how much the representation rearranges itself at each layer, and a circuit, which is a sparse interpretable subcomputation, is a low-rank approximation to this transport. It tracks the dominant movement of probability mass from input features to output features and throws away the rest.
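In full dimension the Wasserstein distance needs an optimal transport solver, but along a single direction it reduces to sorting. The sketch below computes the 2-Wasserstein distance between two equal-size one-dimensional samples, which could stand for activations projected onto one feature direction before and after a layer. That projection, the toy data, and the function name are assumptions made for illustration.

```python
import math

def wasserstein2_1d(xs, ys):
    """2-Wasserstein distance between two equal-size empirical samples on the line.

    In one dimension the optimal coupling matches sorted values in order.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    xs, ys = sorted(xs), sorted(ys)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs))

before = [0.0, 1.0, 2.0, 3.0]
shifted = [v + 2.0 for v in before]        # layer moves every point by 2
shuffled = [3.0, 1.0, 0.0, 2.0]            # same distribution, reordered

print(wasserstein2_1d(before, shifted))    # pure shift
print(wasserstein2_1d(before, shuffled))   # no distributional change
```

A pure shift by 2 costs exactly 2, while a reshuffling of the same values costs nothing, since the distance compares distributions, not individual data points.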
Two circuits are close when they induce similar transport plans, moving similar features in similar directions. The Fisher information metric on the statistical manifold of layer representations shows this by measuring how sensitively the representation responds to input changes, and a circuit is a geodesic on this manifold.
The dense view shows all the information flow at once and the circuit view pulls out just the sparse interpretable subgraph, so the computation is dense but the meaningful structure is sparse.
The information-theoretic limit
The network has $n$ features to encode in $d$ dimensions, and each feature is binary with activation probability $p$, so the total information content is $n H(p)$ bits where $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$. The representation has $d$ real-valued dimensions and so in principle infinite capacity, but the features still have to be decodable.
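The rate-distortion function gives the tradeoff, and for distortion $D$ it is
$$R(D) = n\left[H(p) - H(D)\right], \qquad 0 \le D \le D_{\max}$$
where $D_{\max} = \min(p, 1 - p)$ is the distortion of the trivial representation, and no scheme can push the error below $D$ with $d$ dimensions unless those dimensions carry at least $R(D)$ bits.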
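A small calculation shows the scale of these quantities. The sketch assumes independent binary features under Hamming distortion, for which the textbook per-feature rate-distortion function is $H(p) - H(D)$ up to $D_{\max} = \min(p, 1-p)$ and zero beyond. The feature count and activation probability are made-up illustrative values.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def rate_distortion_bernoulli(p, D):
    """Bits per feature needed to reach Hamming distortion D (assumed source model)."""
    d_max = min(p, 1.0 - p)      # error of always guessing the likelier value
    if D >= d_max:
        return 0.0
    return binary_entropy(p) - binary_entropy(D)

n, p = 10000, 0.01               # hypothetical: 10k features, each active 1% of the time
total_bits = n * binary_entropy(p)

print(f"H(p) = {binary_entropy(p):.4f} bits per feature")
print(f"total information = {total_bits:.1f} bits")
print(f"rate for D = 0.001: {n * rate_distortion_bernoulli(p, 0.001):.1f} bits")
```

Ten thousand features that each fire one percent of the time carry only about 808 bits in total, which is why sparse features are so compressible.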
The gap between this bound and what SAEs actually achieve has three sources. Dead features never activate and so waste capacity, feature splitting spreads a single true feature across multiple dictionary atoms, and absorption folds together correlated features that should stay distinct. All three are geometric failures where the learned dictionary does not line up with the true feature directions.
As sparsity rises the theoretical bound drops and SAE performance improves, so the gap shrinks but never closes. The critical dictionary size marks the line between recoverable and unrecoverable, and below it information is lost for good while above it recovery is possible in principle and the remaining gap is algorithmic.
Structured superposition
The Welch bound says how many features can live together with bounded interference, the Marchenko-Pastur law pulls signal from noise in the eigenspectrum, RIP says when sparse recovery is possible, and the rate-distortion function gives the fundamental limit of that recovery.
The theory above treats features as independent, binary, and uniform, but real features have hierarchical structure, since syntactic features depend on lexical features which depend on character-level features, and they have correlations, since subject detection and agreement fire together, and they have varying dimensionalities, since some are one-dimensional and others span whole subspaces.
The geometry of structured superposition, where packing respects semantic relationships and importance follows power laws and correlations create low-dimensional manifolds inside the full feature space, is mostly unexplored. The tools for it exist, including structured random matrices and non-uniform concentration and Riemannian optimization on manifolds with symmetry, but whether real networks learn structure regular enough for these tools to apply is still not known. Sparse autoencoders do recover interpretable features from models trained on natural data and that hints at regularity, but proving it means working out how training dynamics, data structure, and representation geometry interact.
References
Elhage, N., et al. “Toy Models of Superposition.” Transformer Circuits Thread, 2022.
Bricken, T., et al. “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.” Anthropic Research, 2023.
Templeton, A., et al. “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” Anthropic Research, 2024.
Johnson, W. B. and Lindenstrauss, J. “Extensions of Lipschitz mappings into a Hilbert space.” Contemp. Math., 26:189-206, 1984.
Welch, L. “Lower bounds on the maximum cross correlation of signals.” IEEE Trans. Inform. Theory, 20(3):397-399, 1974.
Marchenko, V. A. and Pastur, L. A. “Distribution of eigenvalues for some sets of random matrices.” Matematicheskii Sbornik, 114(4):507-536, 1967.
Donoho, D. L. “Compressed Sensing.” IEEE Trans. Inform. Theory, 52(4):1289-1306, 2006.
Candes, E. J. and Tao, T. “Near-optimal signal recovery from random projections.” IEEE Trans. Inform. Theory, 52(12):5406-5425, 2006.
Villani, C. Optimal Transport: Old and New. Springer, 2009.
Amari, S. Information Geometry and Its Applications. Springer, 2016.