The Geometry of Superposition

Superposition as a packing problem

A transformer’s residual stream has dimension $d_{\text{model}}$ of about 768 or 4096, but the number of features the network has to represent is far larger, maybe by orders of magnitude, since it needs to track syntax, entities, sentiment, factual associations, and positional information all at once. There are simply more features than dimensions to hold them.

The network squeezes $n \gg d$ features into $d$ dimensions by using sparsity. Most features are silent on any given input, and when two features rarely fire together their representations can overlap without interfering at inference time. The real question is how much overlap the geometry allows before the interference becomes impossible to untangle.

The right tools for answering this are concentration of measure, random matrix theory, optimal transport, and information geometry.

The toy model

Following Elhage et al. (2022), take a one-layer linear network with weights $W \in \mathbb{R}^{m \times n}$ where $m < n$. The network takes a feature vector $x \in \mathbb{R}^n$, squeezes it down to $Wx \in \mathbb{R}^m$, and then rebuilds it as $W^T W x \in \mathbb{R}^n$. Each component $x_i$ is nonzero independently with probability $p$.

The loss is the expected weighted reconstruction error:

$$L(W) = \sum_{i=1}^{n} s_i \, \mathbb{E}\left[\left\| W^T W e_i - e_i \right\|^2 \cdot \mathbf{1}[x_i \neq 0]\right]$$

where $s_i$ is the importance of feature $i$ and $e_i$ is the $i$-th standard basis vector. Two features $i$ and $j$ are active at the same time with probability $p^2$, so the loss splits apart cleanly:

$$L(W) = \sum_i s_i \, p \left(1 - \|w_i\|^2\right)^2 + \sum_{i \neq j} s_i \, p^2 \left(w_i \cdot w_j\right)^2$$

where $w_i = W e_i$ is the $i$-th column of $W$. The first term pushes each column toward unit norm so each feature is faithfully represented, and the second term punishes interference between columns, which is the cross-talk.

When $p = 1$ all features fire at once and interference dominates, so the network can only represent $m$ features faithfully by picking orthogonal columns and abandoning the rest. As $p$ drops toward zero the interference fades relative to reconstruction, and the network can carry far more than $m$ features at once.

The transition between these regimes is sharp: once the sparsity $1 - p$ rises above a critical value, the optimal solution flips from dedicated orthogonal representations to dense superposed packing.
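The two regimes can be compared directly by evaluating the decomposed loss on two hand-picked configurations. This is a minimal sketch, assuming equal importances $s_i = 1$ and the smallest interesting case of $n = 3$ features in $m = 2$ dimensions; the function name `toy_loss` and both weight layouts are illustrative choices, not taken from Elhage et al.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def toy_loss(columns, p, importance=None):
    """Feature term plus interference term of the decomposed loss.

    columns: the columns w_i of W, each a list of m floats.
    p: probability that a feature is active.
    importance: optional weights s_i (assumed to be all ones if omitted).
    """
    n = len(columns)
    s = importance if importance is not None else [1.0] * n
    feature_term = sum(
        s[i] * p * (1.0 - dot(columns[i], columns[i])) ** 2 for i in range(n)
    )
    interference_term = sum(
        s[i] * p ** 2 * dot(columns[i], columns[j]) ** 2
        for i in range(n) for j in range(n) if i != j
    )
    return feature_term + interference_term

# Dedicated: two orthogonal columns, third feature dropped (zero column).
dedicated = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

# Superposed: three unit vectors at 120 degrees in the plane.
superposed = [[math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)]
              for k in range(3)]

for p in (1.0, 2 / 3, 0.1):
    print(f"p={p:.3f}  dedicated={toy_loss(dedicated, p):.4f}  "
          f"superposed={toy_loss(superposed, p):.4f}")
```

For these two candidates the losses work out to $p$ and $\tfrac{3}{2} p^2$, so they cross at $p = 2/3$. The sketch only compares two fixed layouts; it does not search for the optimal $W$.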

Fig 1.1: Superposition Phase Diagram

Theorem (Superposition Phase Transition (Elhage et al., 2022)).

For the toy model with $n$ features of equal importance in $m$ dimensions with sparsity $1-p$, the optimal representation transitions from dedicated (at most $m$ features represented) to superposed (up to $n$ features represented with controlled interference) as $p$ decreases below a critical threshold $p_c$ that depends on $n/m$.

At high sparsity and moderate compression the best configurations squeeze in as many features as possible while keeping worst-case interference low, and these configurations are frames.

Almost-orthogonality

High-dimensional space is far roomier than R3\mathbb{R}^3 would suggest, because volume concentrates in thin shells, random projections preserve distances, and typical configurations are more structured than arbitrary ones. The concentration of measure essay works out the general theory, and the consequence for superposition falls out immediately.

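In $\mathbb{R}^2$ at most 3 lines through the origin can be mutually equiangular (unit vectors at pairwise angles of $120^\circ$, the Mercedes-Benz frame), and in $\mathbb{R}^3$ at most 6 (the diagonals of an icosahedron). Exactly equiangular sets grow only polynomially with dimension, with at most $m(m+1)/2$ lines in $\mathbb{R}^m$, but once a small inner product $\epsilon$ is tolerated the count in $\mathbb{R}^m$ becomes exponential in $m$.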

Theorem (Johnson-Lindenstrauss, 1984).

For any $0 < \epsilon < 1$ and any integer $n$, there exist $n$ unit vectors in $\mathbb{R}^m$ with $m = O(\epsilon^{-2} \log n)$ such that all pairwise inner products satisfy $|v_i \cdot v_j| \leq \epsilon$ for $i \neq j$.

The proof is probabilistic. Draw $n$ vectors uniformly from the unit sphere in $\mathbb{R}^m$. For any fixed pair the inner product $v_i \cdot v_j$ is sub-Gaussian with variance proxy $O(1/m)$, so a union bound over the $\binom{n}{2}$ pairs gives

$$\mathbb{P}\left[\max_{i \neq j} |v_i \cdot v_j| > \epsilon\right] \leq n^2 \exp\left(-c \, m \epsilon^2\right)$$

This is less than 1 whenever $m > C \epsilon^{-2} \log n$, so a good packing exists. Read the other way, $\mathbb{R}^m$ fits $n = \exp(c' \, m \epsilon^2)$ pairwise $\epsilon$-orthogonal vectors for some constant $c' > 0$.

In a residual stream of dimension $d = 4096$ this leaves room for millions of almost-orthogonal feature directions. The interference between two typical superposed features is about $1/\sqrt{d} = 1/64 \approx 0.016$, small enough that ReLU nonlinearities and activation sparsity can suppress most of the cross-talk.
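The probabilistic argument above is easy to try numerically. The following is a rough sketch, assuming small illustrative sizes ($n = 100$ vectors in $m = 128$ dimensions) and a fixed seed; the helper names are hypothetical. It draws random unit vectors and compares the worst pairwise inner product with the $\sqrt{\log n / m}$ scale that the union bound suggests.

```python
import math
import random

def random_unit_vector(m, rng):
    """Draw a direction uniformly from the unit sphere by normalizing a Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_coherence(vectors):
    """Largest |v_i . v_j| over all distinct pairs."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            inner = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(inner))
    return worst

rng = random.Random(0)  # fixed seed so the run is repeatable
n, m = 100, 128         # illustrative sizes, not taken from the text
vectors = [random_unit_vector(m, rng) for _ in range(n)]

mu = max_coherence(vectors)
scale = math.sqrt(math.log(n) / m)  # the sqrt(log n / m) scale from the union bound
print(f"max coherence = {mu:.3f}, sqrt(log n / m) = {scale:.3f}")
```

The worst-case inner product should come out as a modest multiple of the printed scale, well below 1, even though there are almost as many vectors as dimensions. Increasing `n` at fixed `m` raises it only logarithmically.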

Almost-orthogonal is not the same as optimally packed, and the best packing has to obey a lower bound.

Theorem (Welch Bound, 1974).

For $n$ unit vectors $\{w_1, \ldots, w_n\}$ in $\mathbb{R}^m$ with $n > m$, the maximum coherence $\mu = \max_{i \neq j} |w_i \cdot w_j|$ satisfies:

$$\mu \geq \sqrt{\frac{n - m}{m(n-1)}}$$

Equality holds if and only if the vectors form an equiangular tight frame, meaning all pairwise $|w_i \cdot w_j|$ are equal and $\sum_i w_i w_i^T = \frac{n}{m} I$.

Equiangular tight frames, where they exist, are the densest possible packings, so a network learning to superpose features is quietly solving a frame design problem and the Welch bound is the floor its coherence can never drop below.
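Both conditions in the theorem can be checked by hand on small frames. Below is a sketch, assuming two classical examples: the Mercedes-Benz frame ($n = 3$, $m = 2$) and the six diagonals of an icosahedron ($n = 6$, $m = 3$). The function names are illustrative.

```python
import math

def welch_bound(n, m):
    """Lower bound on the coherence of n unit vectors in R^m (n > m)."""
    return math.sqrt((n - m) / (m * (n - 1)))

def coherence(vectors):
    """Maximum |w_i . w_j| over distinct pairs."""
    return max(abs(sum(a * b for a, b in zip(u, v)))
               for i, u in enumerate(vectors)
               for j, v in enumerate(vectors) if i < j)

def frame_operator(vectors):
    """The m x m matrix sum_i w_i w_i^T."""
    m = len(vectors[0])
    return [[sum(w[r] * w[c] for w in vectors) for c in range(m)]
            for r in range(m)]

# Mercedes-Benz frame: n = 3 unit vectors in R^2.
mercedes = [[math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)]
            for k in range(3)]

# Six icosahedron diagonals: n = 6 unit vectors in R^3.
phi = (1 + math.sqrt(5)) / 2
raw = [(0, 1, phi), (0, 1, -phi), (1, phi, 0),
       (1, -phi, 0), (phi, 0, 1), (-phi, 0, 1)]
scale = math.sqrt(1 + phi ** 2)
icosahedral = [[x / scale for x in v] for v in raw]

for name, frame in (("Mercedes-Benz", mercedes), ("icosahedral", icosahedral)):
    n, m = len(frame), len(frame[0])
    print(name, round(coherence(frame), 4), round(welch_bound(n, m), 4))
    print("  frame operator:", [[round(x, 4) for x in row]
                                for row in frame_operator(frame)])
```

In both cases the coherence matches the bound ($1/2$ and $1/\sqrt{5}$) and the frame operator is $\frac{n}{m} I$. For most pairs $(n, m)$ no such frame exists: five unit vectors in the plane, for instance, cannot reach `welch_bound(5, 2)`.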

Fig 2.1: Packing Explorer

The Gram matrix $W^T W$ makes the tension visible, since the off-diagonal entries are the interferences and the best configuration is the one that keeps the worst case small. In two dimensions the packing is tightly constrained, but the JL lemma says this constraint loosens exponentially as dimension grows.

Phase transitions in representation

The jump from dedicated to superposed is sharp, and the sharpness has a structure that random matrix theory can pick out.

The Gram matrix $G = W^T W \in \mathbb{R}^{n \times n}$ holds the global geometry of the representation. Its diagonal entries $G_{ii} = \|w_i\|^2$ show how faithfully each feature is represented, its off-diagonal entries $G_{ij} = w_i \cdot w_j$ show the interference, and the eigenvalues of $G$ are what we actually observe.
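Reading a Gram matrix this way takes only a few lines. The sketch below assumes the same two illustrative layouts of $n = 3$ features in $m = 2$ dimensions (one dedicated, one superposed); `gram` and `summarize` are hypothetical helper names.

```python
import math

def gram(columns):
    """G = W^T W, with W given by its columns w_1, ..., w_n."""
    return [[sum(a * b for a, b in zip(u, v)) for v in columns] for u in columns]

def summarize(columns):
    """Per-feature fidelity ||w_i||^2 and the worst interference |w_i . w_j|."""
    G = gram(columns)
    n = len(G)
    fidelity = [G[i][i] for i in range(n)]
    worst = max(abs(G[i][j]) for i in range(n) for j in range(n) if i != j)
    return fidelity, worst

# Dedicated: two features get their own axis, the third is dropped.
dedicated = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
# Superposed: three unit vectors at 120 degrees share two dimensions.
superposed = [[math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)]
              for k in range(3)]

for name, cols in (("dedicated", dedicated), ("superposed", superposed)):
    fidelity, worst = summarize(cols)
    print(name, [round(f, 3) for f in fidelity], round(worst, 3))
```

The dedicated layout has perfect fidelity on two features, none on the third, and zero interference. The superposed layout represents all three at full norm and pays for it with an interference of $1/2$ between every pair.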

In the dedicated regime $W$ has at most $m$ nonzero columns, so $G$ has rank at most $m$ and the spectrum is $m$ values near 1 and $n - m$ values at 0, a clean bimodal split.

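In the superposition regime $W$ has $n > m$ nonzero columns packed almost-orthogonally. $G$ still has rank at most $m$, so $n - m$ eigenvalues stay at zero, but the $m$ nonzero eigenvalues no longer sit at 1. With unit-norm columns they sum to $\operatorname{tr} G = n$, so they average $n/m$ and smear out into a continuous bulk whose shape, for random-like $W$, is governed by the Marchenko-Pastur law.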

Theorem (Marchenko-Pastur Law, 1967).

Let $W \in \mathbb{R}^{m \times n}$ have i.i.d. entries with mean 0 and variance $\sigma^2/m$. As $m, n \to \infty$ with $n/m \to \gamma$, the empirical spectral distribution of $W^T W$ converges to the Marchenko-Pastur distribution with density:

$$f_{\gamma}(\lambda) = \frac{1}{2\pi \gamma \sigma^2} \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{\lambda} \mathbf{1}_{[\lambda_-, \lambda_+]}$$

where $\lambda_{\pm} = \sigma^2(1 \pm \sqrt{\gamma})^2$. When $\gamma > 1$, as in superposition, the limit also carries a point mass of weight $1 - 1/\gamma$ at $\lambda = 0$, reflecting the rank deficiency of $W^T W$.

The learned weight matrix has been optimized rather than drawn at random, but the eigenvalue distribution of $W^T W$ still shows structure. Eigenvalues above the Marchenko-Pastur bulk edge $\lambda_+$ are the signal eigenvalues and correspond to features with dedicated dimensions, and eigenvalues inside the bulk are the noise-like interference that comes from superposition.

This is the spiked covariance model from random matrix theory, and it is the same story as the Baik-Ben Arous-Péché (BBP) phase transition. A feature is individually detectable exactly when its importance crosses a critical threshold that depends on $\gamma = n/m$.
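The bulk edges and the signal/bulk split are simple to compute once $\gamma$ and $\sigma^2$ are fixed. The sketch below assumes $\gamma = 4$ and $\sigma^2 = 1$ purely for illustration, and the eigenvalues passed to `split_spectrum` are made up to show the mechanics. It also integrates the density numerically: for $\gamma > 1$ the continuous part carries mass $1/\gamma$, with the remainder sitting at zero.

```python
import math

def mp_edges(gamma, sigma2=1.0):
    """Bulk edges lambda_- and lambda_+ of the Marchenko-Pastur law."""
    lo = sigma2 * (1 - math.sqrt(gamma)) ** 2
    hi = sigma2 * (1 + math.sqrt(gamma)) ** 2
    return lo, hi

def mp_density(lam, gamma, sigma2=1.0):
    """Density of the continuous part; zero outside [lambda_-, lambda_+]."""
    lo, hi = mp_edges(gamma, sigma2)
    if lam <= lo or lam >= hi:
        return 0.0
    return math.sqrt((hi - lam) * (lam - lo)) / (2 * math.pi * gamma * sigma2 * lam)

def split_spectrum(eigenvalues, gamma, sigma2=1.0):
    """Separate eigenvalues above the bulk edge (signal) from the rest."""
    _, hi = mp_edges(gamma, sigma2)
    signal = [lam for lam in eigenvalues if lam > hi]
    rest = [lam for lam in eigenvalues if lam <= hi]
    return signal, rest

gamma = 4.0                      # illustrative ratio n/m, not from the text
lo, hi = mp_edges(gamma)         # (1.0, 9.0) when sigma^2 = 1

# Midpoint rule for the mass of the continuous part.
steps = 20000
width = (hi - lo) / steps
mass = sum(mp_density(lo + (k + 0.5) * width, gamma) for k in range(steps)) * width
print(lo, hi, round(mass, 3))

# Hypothetical eigenvalues, made up to show the split.
signal, rest = split_spectrum([12.5, 8.0, 3.1, 0.0], gamma)
print(signal, rest)
```

On a trained model the eigenvalues would come from the actual $W^T W$, and $\sigma^2$ would have to be estimated, so the edge is a reference line rather than a sharp classifier.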

Fig 3.1: Eigenvalue Anatomy

At high sparsity the eigenvalues cluster at 0 and 1 because each feature is either cleanly represented or not, and as sparsity drops they smear into a continuous distribution. The Marchenko-Pastur curve predicts the bulk edge, so eigenvalues above it are signal and eigenvalues inside it are the geometric cost of packing.

Sparse autoencoders as geometric decoders

Superposition makes individual neurons polysemantic because a single neuron ends up responding to a linear combination of superposed features rather than just one. Given the compressed representation $z = Wx \in \mathbb{R}^m$, the problem is to pull the original features $x \in \mathbb{R}^n$ back out.

A sparse autoencoder (SAE) tries to do this by learning an encoder $h = \text{ReLU}(W_{\text{enc}} z + b)$ and a decoder $\hat{z} = W_{\text{dec}} h$, where the decoder columns $d_k = W_{\text{dec}}[:,k]$ are the learned feature directions. The loss is

$$\mathcal{L} = \|z - \hat{z}\|^2 + \lambda \|h\|_1$$

This is dictionary learning with a sparsity constraint, where the decoder $W_{\text{dec}}$ is the dictionary, its columns are the atoms, and $h$ is the sparse code. For each input $z$ the SAE hunts for the sparsest linear combination of dictionary atoms that rebuilds $z$.

The $L^1$ penalty plays a geometric role. Without it the dictionary settles on some arbitrary basis, but with it the code is forced sparse so most atoms are off for any given input, which pushes the dictionary atoms toward the true feature directions.
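The encoder, decoder, and loss above fit in a few lines. This is a sketch of a single forward pass only, with no training loop. It assumes a hand-set dictionary of three atoms at $120^\circ$ in $\mathbb{R}^2$, tied weights ($W_{\text{enc}} = W_{\text{dec}}^T$), zero bias, and $\lambda = 0.05$, all of which are illustrative choices.

```python
import math

def sae_forward(z, W_enc, b, W_dec):
    """One forward pass: h = ReLU(W_enc z + b), z_hat = W_dec h."""
    h = [max(0.0, sum(w * x for w, x in zip(row, z)) + bias)
         for row, bias in zip(W_enc, b)]
    z_hat = [sum(w * a for w, a in zip(row, h)) for row in W_dec]
    return h, z_hat

def sae_loss(z, h, z_hat, lam):
    """Squared reconstruction error plus lam times the L1 norm of the code."""
    recon = sum((a - b) ** 2 for a, b in zip(z, z_hat))
    sparsity = sum(abs(a) for a in h)
    return recon + lam * sparsity

# Hypothetical dictionary: three atoms at 120 degrees in R^2 (columns of W_dec).
atoms = [[math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)]
         for k in range(3)]
W_dec = [[atom[r] for atom in atoms] for r in range(2)]   # shape 2 x 3
W_enc = atoms                                             # tied weights, shape 3 x 2
b = [0.0, 0.0, 0.0]

z = atoms[0]                     # an input lying exactly on the first atom
h, z_hat = sae_forward(z, W_enc, b, W_dec)
print(h, z_hat, sae_loss(z, h, z_hat, lam=0.05))
```

Because the other two atoms have negative inner product with the input, the ReLU switches them off, the code has a single active entry, and the reconstruction is exact. The only remaining loss is the $L^1$ charge on that one activation.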

Theorem (Recovery Condition (Compressed Sensing)).

Let $x \in \mathbb{R}^n$ be $s$-sparse and let $W \in \mathbb{R}^{m \times n}$ satisfy the Restricted Isometry Property of order $2s$:

$$(1-\delta)\|v\|^2 \leq \|Wv\|^2 \leq (1+\delta)\|v\|^2$$

for all $2s$-sparse vectors $v$ and some $\delta < \sqrt{2} - 1$. Then the $L^1$-minimization problem $\min \|h\|_1$ subject to $Wh = Wx$ recovers $x$ exactly.

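RIP is the same packing condition in disguise. A matrix has RIP of order $s$ when every choice of $s$ columns is nearly an isometry, which is a multi-column version of the almost-orthogonality from before: by Gershgorin's theorem, unit-norm columns with coherence $\mu$ satisfy RIP of order $s$ with $\delta \leq (s-1)\mu$. The same concentration argument that proves the Johnson-Lindenstrauss lemma shows that random matrices satisfy RIP with high probability.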
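The link between coherence and RIP can be made concrete through the Gershgorin bound: for unit-norm columns with coherence $\mu$, the restricted isometry constant of order $k$ is at most $(k-1)\mu$. The sketch below uses that worst-case bound to ask how large $s$ can be before the theorem's condition $\delta_{2s} < \sqrt{2} - 1$ is no longer certified. The function names are hypothetical, and the guarantee is known to be pessimistic compared with what random matrices achieve.

```python
import math

def rip_bound_from_coherence(mu, order):
    """Gershgorin bound: unit-norm columns with coherence mu give
    a restricted isometry constant of at most (order - 1) * mu."""
    return (order - 1) * mu

def guaranteed_sparsity(mu):
    """Largest s for which the coherence bound certifies delta_{2s} < sqrt(2) - 1."""
    if mu <= 0:
        raise ValueError("coherence must be positive for this bound to be finite")
    threshold = math.sqrt(2) - 1
    s = 0
    while rip_bound_from_coherence(mu, 2 * (s + 1)) < threshold:
        s += 1
    return s

for mu in (0.5, 0.1, 0.01):
    print(mu, guaranteed_sparsity(mu))
```

A coherence of $0.5$ certifies nothing, $0.1$ certifies recovery of 2-sparse signals, and $0.01$ certifies 21-sparse signals, so the certified sparsity scales like $1/\mu$ under this bound.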

Recovery works when the features fire sparsely enough that the small interferences between them stay below the noise floor. If features activate independently with probability $p$, about $pn$ of them are active at once, and for directions with coherence $\mu$ the average-case condition from compressed sensing scales as

$$pn \lesssim \frac{\mu^{-2}}{\log n}$$

Fig 4.1: SAE Dictionary Learning

The faint dashed lines are the true feature directions and the solid arrows are the learned dictionary atoms, and the $L^1$ penalty controls convergence. Too weak and the atoms wander around an over-complete dictionary; too strong and the dictionary collapses to a handful of directions. In the interactive figure, clicking a dictionary atom lights up the data points that activate it.

Circuits as transport maps

Features at a single layer are one level of structure, but mechanistic interpretability also studies circuits, the subgraphs of the computational graph that carry out identifiable algorithms. A circuit traces how input features combine through attention heads and MLP layers to produce output features.

Each layer is a map $f_\ell: \mathbb{R}^d \to \mathbb{R}^d$ on the residual stream. The data distribution at layer $\ell$, written $\mu_\ell$, is pushed forward to $\mu_{\ell+1} = (f_\ell)_\# \mu_\ell$, so the full network is a composition of pushforward maps. The quadratic transport cost between consecutive layers is

$$\mathcal{T}(\mu_\ell, \mu_{\ell+1}) = \inf_{\pi \in \Pi(\mu_\ell, \mu_{\ell+1})} \int \|x - y\|^2 \, d\pi(x, y)$$

where $\Pi(\mu_\ell, \mu_{\ell+1})$ is the set of couplings with marginals $\mu_\ell$ and $\mu_{\ell+1}$.

The Wasserstein distance $W_2(\mu_\ell, \mu_{\ell+1})$, the square root of this optimal cost, measures how much the representation rearranges itself at each layer. A circuit, which is a sparse interpretable subcomputation, can then be read as a low-rank approximation to this transport: it tracks the dominant movement of probability mass from input features to output features and throws away the rest.
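The definition is easiest to see in one dimension, where the optimal coupling simply matches sorted samples in order. The sketch below assumes equally sized empirical samples on the real line and a made-up pair of point clouds. Residual streams are high-dimensional, where a general optimal transport solver would be needed, so this illustrates the quantity rather than a practical tool.

```python
import math

def wasserstein2_1d(xs, ys):
    """W_2 between two equal-size empirical distributions on the real line.

    In one dimension the optimal coupling matches sorted samples in order,
    so the infimum over transport plans reduces to a sort.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many samples on each side")
    cost = sum((x - y) ** 2 for x, y in zip(sorted(xs), sorted(ys))) / len(xs)
    return math.sqrt(cost)

# Hypothetical one-dimensional "activations" before and after a layer.
before = [0.0, 2.0, 1.0]
after = [4.0, 3.0, 5.0]          # the same cloud shifted by 3
print(wasserstein2_1d(before, after))   # 3.0
```

A pure shift moves every point by the same amount, so the distance equals the size of the shift. A layer that leaves the distribution unchanged has distance zero no matter how it permutes individual samples.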

Two circuits are close when they induce similar transport plans, moving similar features in similar directions. The Fisher information metric on the statistical manifold of layer representations makes this precise by measuring how sensitively the representation responds to input changes, and in this picture a circuit corresponds to a geodesic on the manifold.

Fig 5.1: Layer Flow

The dense view shows all the information flow at once and the circuit view pulls out just the sparse interpretable subgraph, so the computation is dense but the meaningful structure is sparse.

The information-theoretic limit

The network has $n$ features to encode in $m$ dimensions, and each feature is binary with activation probability $p$, so the total information content is $n \cdot H(p)$ bits, where $H(p) = -p \log_2 p - (1-p) \log_2(1-p)$ is the binary entropy. The representation has $m$ real-valued dimensions and so in principle infinite capacity, but the features still have to be decodable.

The rate-distortion function gives the tradeoff. In the simplified linear form used here, for distortion $D$ it is

$$R(D) = n \cdot H(p) \cdot \left(1 - \frac{D}{D_{\max}}\right)^+$$

where $D_{\max}$ is the distortion of the trivial representation and $(\cdot)^+ = \max(\cdot, 0)$. Treating the $m$ dimensions as the available rate, no scheme can push the error below $R^{-1}(m)$.
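The numbers behind this bound are quick to compute. The sketch below takes the linear form of $R(D)$ above at face value and assumes illustrative values: $n = 100$ features, $p = 0.1$ (sparsity 0.90), $D_{\max} = 1$, and a rate budget of 20. The function names are hypothetical.

```python
import math

def binary_entropy(p):
    """H(p) in bits; H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate(D, n, p, D_max):
    """R(D) = n H(p) (1 - D / D_max)^+ as written above."""
    return n * binary_entropy(p) * max(0.0, 1 - D / D_max)

def min_distortion(budget, n, p, D_max):
    """Invert R(D): the smallest D reachable with a given rate budget."""
    total = n * binary_entropy(p)
    if total == 0.0:
        return 0.0
    return D_max * max(0.0, 1 - budget / total)

n, p = 100, 0.1                  # 100 features at sparsity 0.90 (illustrative)
total_bits = n * binary_entropy(p)
print(round(total_bits, 1))
print(round(min_distortion(20, n, p, D_max=1.0), 3))
```

With these values the features carry about 46.9 bits in total, so a budget of 20 leaves a distortion floor of roughly 57% of $D_{\max}$. Once the budget reaches $n \cdot H(p)$ the floor drops to zero.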

The gap between this bound and what SAEs actually achieve has three sources. Dead features never activate and so waste capacity, feature splitting spreads a single true feature across multiple dictionary atoms, and absorption folds together correlated features that should stay distinct. All three are geometric failures where the learned dictionary does not line up with the true feature directions.

Fig 6.1: Rate-Distortion Landscape

As sparsity rises the theoretical bound drops and SAE performance improves, so the gap shrinks but never closes. The critical dictionary size $k = n \cdot H(p)$ marks the line between recoverable and unrecoverable: below it information is lost for good, while above it recovery is possible in principle and the remaining gap is algorithmic.

Structured superposition

The Welch bound says how many features can live together with bounded interference, the Marchenko-Pastur law pulls signal from noise in the eigenspectrum, RIP says when sparse recovery is possible, and the rate-distortion function gives the fundamental limit of that recovery.

Fig 7.1: Concept Map

The theory above treats features as independent, binary, and uniform, but real features are none of these. They have hierarchical structure, since syntactic features depend on lexical features, which in turn depend on character-level features. They are correlated, since subject detection and agreement fire together. And they vary in dimensionality, since some are one-dimensional and others span whole subspaces.

The geometry of structured superposition, where packing respects semantic relationships, importance follows power laws, and correlations create low-dimensional manifolds inside the full feature space, is mostly unexplored. The tools for it exist, including structured random matrices, non-uniform concentration, and Riemannian optimization on manifolds with symmetry, but whether real networks learn structure regular enough for these tools to apply is still not known. Sparse autoencoders do recover interpretable features from models trained on natural data, which hints at regularity, but proving it means working out how training dynamics, data structure, and representation geometry interact.

References

Elhage, N., et al. “Toy Models of Superposition.” Transformer Circuits Thread, 2022.

Bricken, T., et al. “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.” Anthropic Research, 2023.

Templeton, A., et al. “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” Anthropic Research, 2024.

Johnson, W. B. and Lindenstrauss, J. “Extensions of Lipschitz mappings into a Hilbert space.” Contemp. Math., 26:189-206, 1984.

Welch, L. “Lower bounds on the maximum cross correlation of signals.” IEEE Trans. Inform. Theory, 20(3):397-399, 1974.

Marchenko, V. A. and Pastur, L. A. “Distribution of eigenvalues for some sets of random matrices.” Matematicheskii Sbornik, 114(4):507-536, 1967.

Donoho, D. L. “Compressed Sensing.” IEEE Trans. Inform. Theory, 52(4):1289-1306, 2006.

Candes, E. J. and Tao, T. “Near-optimal signal recovery from random projections.” IEEE Trans. Inform. Theory, 52(12):5406-5425, 2006.

Villani, C. Optimal Transport: Old and New. Springer, 2009.

Amari, S. Information Geometry and Its Applications. Springer, 2016.