The Geometry of Superposition

4/10/2026

Superposition as a packing problem

A transformer’s residual stream has a fixed width $d_{\text{model}}$, for example 768 or 4096, but the number of features the network has to represent is far larger, perhaps by orders of magnitude, since it needs to track syntax, entities, sentiment, factual associations, and positional information all at once. There are simply more features than dimensions to hold them.

The network squeezes $n \gg d$ features into $d = d_{\text{model}}$ dimensions by exploiting sparsity. Most features are silent on any given input, and when two features rarely fire together their representations can overlap with little interference on a typical input. The real question is how much overlap the geometry allows before the interference becomes impossible to untangle.

The right tools for answering this are concentration of measure, random matrix theory, optimal transport, and information geometry.

The toy model

Following Elhage et al. (2022), take a one-layer linear network with weights $W \in \mathbb{R}^{m \times n}$ where $m < n$ (here $m$ plays the role of the residual width $d$). The network takes a feature vector $x \in \mathbb{R}^n$, compresses it to $Wx \in \mathbb{R}^m$, and then reconstructs it as $W^T Wx \in \mathbb{R}^n$. Each component $x_i$ is nonzero independently with probability $p$.

The loss is the expected weighted reconstruction error:

L(W) = \sum_{i=1}^{n} s_i \, \mathbb{E}\left[\left\| W^T W e_i - e_i \right\|^2 \cdot \mathbf{1}[x_i \neq 0]\right]

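where $s_i$ is the importance of feature $i$ and $e_i$ is the $i$-th standard basis vector. Feature $i$ is active with probability $p$, and interference from feature $j$ only matters when $i$ and $j$ are active together, which happens with probability $p^2$. Weighting each cross term by that co-activation probability, the loss splits cleanly into a reconstruction part and an interference part: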

L(W) = \sum_i s_i \, p \left(1 - \|w_i\|^2\right)^2 + \sum_{i \neq j} s_i \, p^2 \left(w_i \cdot w_j\right)^2

where $w_i = W e_i$ is the $i$-th column of $W$. The first term pushes each column toward unit norm so that each feature is faithfully represented, and the second term penalizes interference (cross-talk) between columns.
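The decomposed loss can be evaluated directly from the Gram matrix. The following is a minimal NumPy sketch: the function name `toy_loss` and the two example matrices are illustrative choices rather than anything taken from Elhage et al., and the columns are held at unit norm instead of being optimized.

```python
import numpy as np

def toy_loss(W, s, p):
    """Decomposed toy-model loss: per-feature reconstruction plus pairwise interference."""
    G = W.T @ W                          # Gram matrix, G[i, j] = w_i . w_j
    norms_sq = np.diag(G)                # ||w_i||^2 for each column
    recon = np.sum(s * p * (1.0 - norms_sq) ** 2)
    cross = G - np.diag(norms_sq)        # keep only the off-diagonal entries
    interference = np.sum(s[:, None] * p**2 * cross**2)
    return float(recon + interference)

# Hypothetical example: n = 3 equally important features in m = 2 dimensions.
s = np.ones(3)

# Dedicated layout: two features get orthogonal directions, the third is dropped.
W_dedicated = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0]])

# Superposed layout: three unit vectors at 120 degrees to each other.
angles = np.deg2rad([90.0, 210.0, 330.0])
W_superposed = np.vstack([np.cos(angles), np.sin(angles)])

for p in (1.0, 0.1):
    print(p, toy_loss(W_dedicated, s, p), toy_loss(W_superposed, s, p))
```

With these fixed unit-norm columns the dedicated layout costs $p$ and the superposed layout costs $1.5\,p^2$, so superposition wins once $p < 2/3$. An optimizer that is free to rescale the columns would generally move that crossover.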

When $p = 1$ all features fire at once and interference dominates, so the network can only represent $m$ features faithfully by picking orthogonal columns and abandoning the rest. As $p$ drops toward zero the interference fades relative to reconstruction, and the network can carry far more than $m$ features at once.

The transition between these regimes is sharp: once sparsity rises above a critical value (equivalently, once $p$ falls below a critical threshold), the optimal solution flips from dedicated orthogonal representations to dense superposed packing.

Fig 1.1: Superposition Phase Diagram (Drag to Explore)

Theorem (Superposition Phase Transition, Elhage et al., 2022).

For the toy model with $n$ features of equal importance in $m$ dimensions with sparsity $1-p$, the optimal representation transitions from dedicated (at most $m$ features represented) to superposed (up to $n$ features represented with controlled interference) as $p$ decreases below a critical threshold $p_c$ that depends on $n/m$.

At high sparsity and moderate compression the best configurations squeeze in as many features as possible while keeping worst-case interference low, and these configurations are frames.

Almost-orthogonality

High-dimensional space is far roomier than $\mathbb{R}^3$ would suggest, because volume concentrates in thin shells, random projections preserve distances, and typical configurations are more structured than arbitrary ones. The concentration of measure essay works out the general theory, and the consequence for superposition falls out immediately.

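In $\mathbb{R}^2$ at most 3 lines through the origin can be mutually equiangular (the Mercedes-Benz frame, three unit vectors at pairwise angles of $120°$), and in $\mathbb{R}^3$ at most 6 (the diagonals of an icosahedron). Exactly equiangular sets stay small, bounded by $m(m+1)/2$ in $\mathbb{R}^m$, but once a small tolerance $\epsilon$ on the inner products is allowed, the number of vectors that fit grows exponentially in $m$.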

Theorem (Johnson-Lindenstrauss, 1984).

For any $0 < \epsilon < 1$ and any integer $n$, there exist $n$ unit vectors in $\mathbb{R}^m$ with $m = O(\epsilon^{-2} \log n)$ such that all pairwise inner products satisfy $|v_i \cdot v_j| \leq \epsilon$ for $i \neq j$.

The proof is probabilistic. Draw $n$ vectors independently and uniformly from the unit sphere in $\mathbb{R}^m$. For any fixed pair the inner product $v_i \cdot v_j$ is sub-Gaussian with variance proxy $O(1/m)$, so a union bound over the $\binom{n}{2}$ pairs gives

\mathbb{P}\left[\max_{i \neq j} |v_i \cdot v_j| > \epsilon\right] \leq n^2 \exp\left(-c \, m \epsilon^2\right)

This is less than 1 whenever $m > C \epsilon^{-2} \log n$, so a good packing exists. Read the other way, $\mathbb{R}^m$ fits on the order of $n = \exp(c \, m \epsilon^2)$ pairwise $\epsilon$-orthogonal vectors.
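The probabilistic argument is easy to check numerically. The sketch below draws random unit vectors and reports their worst-case inner product next to the $\sqrt{\log n / m}$ scale suggested by the bound. The helper names, the seed, and the particular values of $n$ and $m$ are arbitrary choices made for illustration.

```python
import numpy as np

def random_unit_vectors(n, m, rng):
    """n points drawn uniformly from the unit sphere in R^m (normalized Gaussians)."""
    V = rng.standard_normal((n, m))
    return V / np.linalg.norm(V, axis=1, keepdims=True)

def max_coherence(V):
    """Largest |v_i . v_j| over distinct pairs of rows of V (rows assumed unit norm)."""
    G = V @ V.T
    np.fill_diagonal(G, 0.0)             # ignore the trivial i == j entries
    return float(np.max(np.abs(G)))

rng = np.random.default_rng(0)           # seed chosen arbitrarily for reproducibility
n = 1000
for m in (64, 256, 1024):
    mu = max_coherence(random_unit_vectors(n, m, rng))
    # Second column is the sqrt(log n / m) scale, up to an unspecified constant.
    print(m, round(mu, 3), round(float(np.sqrt(np.log(n) / m)), 3))
```

The worst-case coherence should shrink roughly like $1/\sqrt{m}$ as the dimension grows while $n$ stays fixed, which is the behavior the union bound predicts.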

In a residual stream of dimension $d = 4096$ this leaves room for millions of almost-orthogonal feature directions. The typical interference between two superposed features is about $1/\sqrt{d} = 1/64 \approx 0.016$, small enough that ReLU nonlinearities and activation sparsity can suppress most of the cross-talk.

Almost-orthogonal is not the same as optimally packed, and the best packing has to obey a lower bound.

Theorem (Welch Bound, 1974).

For $n$ unit vectors $\{w_1, \ldots, w_n\}$ in $\mathbb{R}^m$ with $n > m$ , the maximum coherence $\mu = \max_{i \neq j} |w_i \cdot w_j|$ satisfies:

\mu \geq \sqrt{\frac{n - m}{m(n-1)}}

Equality holds if and only if the vectors form an equiangular tight frame, meaning all pairwise $|w_i \cdot w_j|$ are equal and $\sum_i w_i w_i^T = \frac{n}{m} I$.

Equiangular tight frames, when they exist for a given $(n, m)$, achieve the smallest possible worst-case interference. A network learning to superpose features is quietly solving a frame design problem, and the Welch bound is the floor its coherence can never drop below.
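As a concrete check, the Mercedes-Benz frame in $\mathbb{R}^2$ meets the Welch bound with equality. The sketch below computes the bound and the coherence and tests the tight-frame condition. The function names are illustrative, and the columns of `W` are assumed to have unit norm.

```python
import math
import numpy as np

def welch_bound(n, m):
    """Lower bound on the maximum coherence of n unit vectors in R^m, for n > m."""
    if n <= m:
        raise ValueError("the Welch bound as stated requires n > m")
    return math.sqrt((n - m) / (m * (n - 1)))

def coherence(W):
    """Maximum |w_i . w_j| over distinct columns of W (columns assumed unit norm)."""
    G = W.T @ W
    np.fill_diagonal(G, 0.0)
    return float(np.max(np.abs(G)))

# Mercedes-Benz frame: three unit vectors in the plane, 120 degrees apart.
angles = np.deg2rad([90.0, 210.0, 330.0])
W = np.vstack([np.cos(angles), np.sin(angles)])

print(welch_bound(3, 2), coherence(W))               # both equal 0.5 up to rounding
print(np.allclose(W @ W.T, (3 / 2) * np.eye(2)))     # tight-frame condition sum_i w_i w_i^T = (n/m) I
print(round(welch_bound(5, 2), 4))                   # five vectors in the plane
```

For five vectors in the plane the bound evaluates to $\sqrt{3/8} \approx 0.6124$.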

Fig 2.1: Packing Explorer (Drag Vectors to Minimize Coherence). Shown with 5 vectors, Welch bound 0.6124.

The Gram matrix $W^T W$ makes the tension visible, since the off-diagonal entries are the interferences and the best configuration is the one that keeps the worst case small. In two dimensions the packing is tightly constrained, but the JL lemma says this constraint loosens exponentially as dimension grows.

Phase transitions in representation

The jump from dedicated to superposed is sharp, and the sharpness has a structure that random matrix theory can pick out.

The Gram matrix $G = W^T W \in \mathbb{R}^{n \times n}$ holds the global geometry of the representation. Its diagonal entries $G_{ii} = \|w_i\|^2$ show how faithfully each feature is represented, its off-diagonal entries $G_{ij} = w_i \cdot w_j$ show the interference, and its eigenvalues are what we actually observe.

In the dedicated regime $W$ has at most $m$ nonzero columns, chosen orthonormal, so the spectrum of $G$ is $m$ values near 1 and $n - m$ values at 0, a clean bimodal split.

In the superposition regime $W$ has $n > m$ nonzero columns packed almost-orthogonally. The matrix $G$ still has rank at most $m$, so at least $n - m$ eigenvalues remain at 0, but the nonzero eigenvalues smear out into a continuous bulk. For unit-norm columns the trace of $G$ is $n$, so the bulk sits around $n/m$ rather than 1, and its shape is governed by the Marchenko-Pastur law.

Theorem (Marchenko-Pastur Law, 1967).

Let $W \in \mathbb{R}^{m \times n}$ have i.i.d. entries with mean 0 and variance $\sigma^2/m$. As $m, n \to \infty$ with $n/m \to \gamma$, the empirical spectral distribution of $W^T W$ converges to the Marchenko-Pastur distribution with density:

f_{\gamma}(\lambda) = \frac{1}{2\pi \gamma \sigma^2} \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{\lambda} \mathbf{1}_{[\lambda_-, \lambda_+]}

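where $\lambda_{\pm} = \sigma^2(1 \pm \sqrt{\gamma})^2$. When $\gamma > 1$, which is the case of interest here since $n > m$, the limit also carries a point mass of weight $1 - 1/\gamma$ at $\lambda = 0$.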

The learned weight matrix has been optimized rather than drawn at random, but the eigenvalue distribution of $W^T W$ still shows structure. Eigenvalues above the Marchenko-Pastur bulk edge $\lambda_+$ are the signal eigenvalues and correspond to features with dedicated dimensions, and eigenvalues inside the bulk are the noise-like interference that comes from superposition.

This is the spiked covariance model from random matrix theory, and it is the same story as the Baik-Ben Arous-Péché (BBP) phase transition. A feature is individually detectable exactly when its importance crosses a critical threshold that depends on $\gamma = n/m$.
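The random baseline is straightforward to simulate. The sketch below draws a Gaussian $W$ with entry variance $1/m$ (so $\sigma^2 = 1$), computes the spectrum of $W^T W$, and compares the nonzero eigenvalues against the predicted edges $\lambda_\pm$. The sizes $m = 200$, $n = 800$ and the zero threshold `1e-8` are arbitrary illustrative choices, and a learned weight matrix would need to be substituted for `W` to look for eigenvalues that escape the bulk.

```python
import numpy as np

def mp_edges(gamma, sigma2=1.0):
    """Marchenko-Pastur bulk edges lambda_- and lambda_+ for aspect ratio gamma = n/m."""
    root = np.sqrt(gamma)
    return sigma2 * (1.0 - root) ** 2, sigma2 * (1.0 + root) ** 2

rng = np.random.default_rng(0)                   # arbitrary seed
m, n = 200, 800                                  # gamma = n/m = 4
W = rng.standard_normal((m, n)) / np.sqrt(m)     # i.i.d. entries with variance 1/m

eigs = np.linalg.eigvalsh(W.T @ W)               # spectrum of the n x n Gram matrix
nonzero = eigs[eigs > 1e-8]                      # drop the numerically zero eigenvalues
lam_minus, lam_plus = mp_edges(n / m)

print("zero eigenvalues:", len(eigs) - len(nonzero), "expected:", n - m)
print("nonzero range:", float(nonzero.min()), float(nonzero.max()))
print("predicted edges:", lam_minus, lam_plus)
```

At finite size the extreme eigenvalues only approximate the predicted edges, so a practical test for signal eigenvalues would leave some margin above $\lambda_+$.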

Fig 3.1: Eigenvalue Anatomy (Adjust Sparsity). Shown at sparsity 0.90, where high sparsity gives clean eigenvalues.

At high sparsity the eigenvalues cluster at 0 and 1 because each feature is either cleanly represented or not, and as sparsity drops they smear into a continuous distribution. The Marchenko-Pastur curve predicts the bulk edge, so eigenvalues above it are signal and eigenvalues inside it are the geometric cost of packing.

Sparse autoencoders as geometric decoders

Superposition makes individual neurons polysemantic because a single neuron ends up responding to a linear combination of superposed features rather than just one. Given the compressed representation $z = Wx \in \mathbb{R}^m$, the problem is to pull the original features $x \in \mathbb{R}^n$ back out.

A sparse autoencoder (SAE) tries to do this by learning an encoder $h = \text{ReLU}(W_{\text{enc}} z + b)$ and a decoder $\hat{z} = W_{\text{dec}} h$, where the decoder columns $d_k = W_{\text{dec}}[:,k]$ are the learned feature directions. The loss is

\mathcal{L} = \|z - \hat{z}\|^2 + \lambda \|h\|_1

This is dictionary learning with a sparsity constraint, where the decoder $W_{\text{dec}}$ is the dictionary, its columns are the atoms, and $h$ is the sparse code. For each input $z$ the SAE hunts for the sparsest linear combination of dictionary atoms that rebuilds $z$.

The $L^1$ penalty plays a geometric role. Without it the dictionary settles on some arbitrary basis, but with it the code is forced sparse so most atoms are off for any given input, which pushes the dictionary atoms toward the true feature directions.
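The encoder, decoder, and loss fit in a few lines. This is a minimal forward-pass sketch only, with no training loop. The tied encoder (`W_enc = W_dec.T`), the zero bias, and the three-atom dictionary are assumptions made to keep the example small, not properties of any particular SAE.

```python
import numpy as np

def sae_forward(z, W_enc, b, W_dec):
    """Encode z into a nonnegative code h, then decode back to z_hat."""
    h = np.maximum(W_enc @ z + b, 0.0)           # ReLU encoder
    z_hat = W_dec @ h                            # linear decoder: sum_k h_k * d_k
    return h, z_hat

def sae_loss(z, W_enc, b, W_dec, lam):
    """Squared reconstruction error plus an L1 penalty on the code."""
    h, z_hat = sae_forward(z, W_enc, b, W_dec)
    return float(np.sum((z - z_hat) ** 2) + lam * np.sum(np.abs(h)))

# Hypothetical dictionary: three unit-norm atoms in R^2, 120 degrees apart.
angles = np.deg2rad([90.0, 210.0, 330.0])
W_dec = np.vstack([np.cos(angles), np.sin(angles)])   # shape (m, k) = (2, 3)
W_enc = W_dec.T                                       # tied weights (assumption)
b = np.zeros(3)

z = W_dec[:, 0]                                       # input lying exactly on the first atom
h, z_hat = sae_forward(z, W_enc, b, W_dec)
print(h, sae_loss(z, W_enc, b, W_dec, lam=0.05))
```

Because the input lies on a single atom and the other two atoms have negative inner product with it, the ReLU zeroes their codes, the reconstruction is exact, and the loss reduces to the $L^1$ term alone.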

Theorem (Recovery Condition (Compressed Sensing)).

Let $x \in \mathbb{R}^n$ be $s$-sparse and let $W \in \mathbb{R}^{m \times n}$ satisfy the Restricted Isometry Property of order $2s$:

(1-\delta)\|v\|^2 \leq \|Wv\|^2 \leq (1+\delta)\|v\|^2

for all $2s$-sparse vectors $v$ and some $\delta < \sqrt{2} - 1$. Then the $L^1$-minimization problem $\min \|h\|_1$ subject to $Wh = Wx$ recovers $x$ exactly.

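RIP is the same packing condition in a stronger, collective form. A matrix has RIP of order $s$ when every set of $s$ columns acts as a near-isometry, which is almost-orthogonality required of whole groups of columns rather than just pairs. For unit-norm columns, pairwise coherence $\mu$ already gives $\delta \leq (s-1)\mu$ by Gershgorin's circle theorem, and the same concentration argument behind the Johnson-Lindenstrauss lemma shows that random matrices satisfy RIP with high probability once $m$ is on the order of $s \log(n/s)$.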
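The link between coherence and RIP can be probed numerically. For unit-norm columns with coherence $\mu$, Gershgorin's circle theorem places every eigenvalue of an $s$-column Gram submatrix within $(s-1)\mu$ of 1, so the restricted isometry constant of order $s$ is at most $(s-1)\mu$. The sketch below samples random supports and compares the worst observed deviation with that bound. Sampling only gives a lower estimate of the true constant, and the sizes and helper name are illustrative assumptions.

```python
import numpy as np

def restricted_deviation(W, support):
    """Largest |eigenvalue - 1| of the Gram matrix of the selected columns."""
    sub = W[:, support]
    eigs = np.linalg.eigvalsh(sub.T @ sub)
    return float(np.max(np.abs(eigs - 1.0)))

rng = np.random.default_rng(0)                   # arbitrary seed
m, n, s = 128, 512, 5
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=0, keepdims=True)    # normalize columns to unit norm

G = W.T @ W
np.fill_diagonal(G, 0.0)
mu = float(np.max(np.abs(G)))                    # pairwise coherence

# Worst deviation over randomly sampled supports (not an exhaustive search).
worst = max(
    restricted_deviation(W, rng.choice(n, size=s, replace=False))
    for _ in range(2000)
)
print("sampled deviation:", worst, "Gershgorin bound:", (s - 1) * mu)
```

The sampled deviation is typically well below the Gershgorin bound, which reflects how pessimistic coherence-only guarantees are compared with what random matrices actually deliver.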

Recovery works when the features fire sparsely enough that the accumulated interference between active features stays small. If features activate with probability $p$, so that about $pn$ are active at once, and the directions have coherence $\mu$, then recovery of typical sparse inputs requires roughly

p \, n \lesssim \frac{\mu^{-2}}{\log n}

Fig 4.1: SAE Dictionary Learning (Click Arrows to Isolate). Shown with $L^1$ penalty 0.050 (moderate).

The faint dashed lines are the true feature directions and the solid arrows are the learned dictionary atoms, and the $L^1$ penalty controls convergence. Too weak and the atoms wander around an over-complete dictionary, too strong and the dictionary collapses to a handful of directions. Clicking a dictionary atom lights up the data points that activate it.

Circuits as transport maps

Features at a single layer are one level of structure, but mechanistic interpretability also studies circuits, the subgraphs of the computational graph that carry out identifiable algorithms. A circuit traces how input features combine through attention heads and MLP layers to produce output features.

Each layer is a map $f_\ell: \mathbb{R}^d \to \mathbb{R}^d$ on the residual stream, and the data distribution at layer $\ell$, written $\mu_\ell$, is pushed forward to $\mu_{\ell+1} = (f_\ell)_\# \mu_\ell$, so the full network is a composition of pushforward maps. The cost of moving $\mu_\ell$ onto $\mu_{\ell+1}$ is the quadratic optimal transport cost

\mathcal{T}(\mu_\ell, \mu_{\ell+1}) = \inf_{\pi \in \Pi(\mu_\ell, \mu_{\ell+1})} \int \|x - y\|^2 \, d\pi(x, y)

The Wasserstein distance $W_2(\mu_\ell, \mu_{\ell+1}) = \sqrt{\mathcal{T}(\mu_\ell, \mu_{\ell+1})}$ measures how much the representation rearranges itself at each layer. On this view a circuit, which is a sparse interpretable subcomputation, acts as a low-rank approximation to the transport: it tracks the dominant movement of probability mass from input features to output features and discards the rest.
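The transport cost has a closed form when both layer distributions are approximated by Gaussians, which is an assumption introduced here purely for illustration: $W_2^2 = \|a - b\|^2 + \operatorname{tr}\big(A + B - 2 (A^{1/2} B A^{1/2})^{1/2}\big)$ for means $a, b$ and covariances $A, B$. The sketch below estimates it from two sets of activations. The arrays `X` and `Y` stand in for residual-stream samples at consecutive layers, and the "layer" used here is a hypothetical pure translation.

```python
import numpy as np

def sqrtm_psd(A):
    """Symmetric square root of a positive semidefinite matrix via eigendecomposition."""
    A = (A + A.T) / 2.0                          # symmetrize against rounding error
    vals, vecs = np.linalg.eigh(A)
    vals = np.clip(vals, 0.0, None)              # clip tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def gaussian_w2_squared(X, Y):
    """Squared 2-Wasserstein distance between Gaussian fits to the rows of X and Y."""
    mean_x, mean_y = X.mean(axis=0), Y.mean(axis=0)
    cov_x = np.cov(X, rowvar=False)
    cov_y = np.cov(Y, rowvar=False)
    root_x = sqrtm_psd(cov_x)
    cross = sqrtm_psd(root_x @ cov_y @ root_x)
    return float(np.sum((mean_x - mean_y) ** 2) + np.trace(cov_x + cov_y - 2.0 * cross))

rng = np.random.default_rng(0)                   # arbitrary seed
X = rng.standard_normal((5000, 4))               # stand-in for layer-l activations
t = np.array([1.0, 0.0, 0.0, 0.0])
Y = X + t                                        # hypothetical layer: a pure translation

print(gaussian_w2_squared(X, X), gaussian_w2_squared(X, Y))
```

A pure translation leaves the covariance unchanged, so the cost reduces to the squared length of the shift. Real layers change the covariance as well, and the Gaussian fit ignores any non-Gaussian structure in the activations.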

Two circuits are close when they induce similar transport plans, moving similar features in similar directions. The Fisher information metric on the statistical manifold of layer representations makes this precise by measuring how sensitively the representation responds to input changes, and a circuit can then be read as a geodesic on this manifold.

Fig 5.1: Layer Flow (Click Features to Trace Circuits)

The dense view shows all the information flow at once and the circuit view pulls out just the sparse interpretable subgraph, so the computation is dense but the meaningful structure is sparse.

The information-theoretic limit

The network has $n$ features to encode in $m$ dimensions, and each feature is binary with activation probability $p$, so the total information content is $n \cdot H(p)$ bits, where $H(p) = -p \log_2 p - (1-p) \log_2(1-p)$ is the binary entropy. The representation has $m$ real-valued dimensions and so in principle infinite capacity, but the features still have to be decodable.

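The rate-distortion function gives the tradeoff. A simple linear approximation, for distortion $D$, is

R(D) = n \cdot H(p) \cdot \left(1 - \frac{D}{D_{\max}}\right)^+

where $D_{\max}$ is the distortion of the trivial representation and $(\cdot)^+ = \max(\cdot, 0)$. Under this approximation, a budget of $m$ effective bits cannot push the error below $R^{-1}(m)$. The exact rate-distortion function of a Bernoulli($p$) source under Hamming distortion, $H(p) - H(D)$ per feature for $D \leq \min(p, 1-p)$, is convex and lies below this line, so the linear form is best read as a time-sharing approximation rather than a tight converse.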
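The quantities in this section are easy to tabulate. The sketch below evaluates the binary entropy, the total information $n \cdot H(p)$, and the $R(D)$ expression exactly as written above. The function names and the value `D_max = 1.0` are illustrative assumptions, and the example uses $n = 100$ features at sparsity 0.90.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def linear_rate(D, n, p, D_max):
    """R(D) = n * H(p) * (1 - D / D_max)^+ as written in the text."""
    return n * binary_entropy(p) * max(1.0 - D / D_max, 0.0)

n, p = 100, 0.1                                  # 100 features, sparsity 1 - p = 0.90
total_bits = n * binary_entropy(p)
print(round(total_bits, 1))                      # 46.9

D_max = 1.0                                      # hypothetical distortion of the trivial code
for D in (0.0, 0.25, 0.5, 1.0):
    print(D, round(linear_rate(D, n, p, D_max), 2))
```

At these settings the features carry about 47 bits in total, far fewer than the 100 bits a dense binary code would need, which is the slack that superposition exploits.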

The gap between this bound and what SAEs actually achieve has three sources. Dead features never activate and so waste capacity, feature splitting spreads a single true feature across multiple dictionary atoms, and absorption folds together correlated features that should stay distinct. All three are geometric failures where the learned dictionary does not line up with the true feature directions.

Fig 6.1: Rate-Distortion Landscape. Shown at sparsity 0.90 with $n = 100$ total features.

As sparsity rises the theoretical bound drops and SAE performance improves, so the gap shrinks but never closes. The critical dictionary size $k = n \cdot H(p)$ marks the line between recoverable and unrecoverable. Below it information is lost for good, while above it recovery is possible in principle and the remaining gap is algorithmic.

Structured superposition

The Welch bound says how many features can live together with bounded interference, the Marchenko-Pastur law pulls signal from noise in the eigenspectrum, RIP says when sparse recovery is possible, and the rate-distortion function gives the fundamental limit of that recovery.

Fig 7.1: Concept Map (Click to Explore Neighborhoods)

The theory above treats features as independent, binary, and uniform, but real features are none of these. They have hierarchical structure, since syntactic features depend on lexical features, which in turn depend on character-level features. They are correlated, since subject detection and agreement fire together. And they vary in dimensionality, since some are one-dimensional and others span whole subspaces.

The geometry of structured superposition, where packing respects semantic relationships, importance follows power laws, and correlations create low-dimensional manifolds inside the full feature space, is mostly unexplored. The tools for it exist, including structured random matrices, non-uniform concentration, and Riemannian optimization on manifolds with symmetry, but whether real networks learn structure regular enough for these tools to apply is still not known. Sparse autoencoders do recover interpretable features from models trained on natural data, which hints at regularity, but proving it means working out how training dynamics, data structure, and representation geometry interact.

References

Elhage, N., et al. “Toy Models of Superposition.” Transformer Circuits Thread, 2022.

Bricken, T., et al. “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.” Anthropic Research, 2023.

Templeton, A., et al. “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” Anthropic Research, 2024.

Johnson, W. B. and Lindenstrauss, J. “Extensions of Lipschitz mappings into a Hilbert space.” Contemp. Math., 26:189-206, 1984.

Welch, L. “Lower bounds on the maximum cross correlation of signals.” IEEE Trans. Inform. Theory, 20(3):397-399, 1974.

Marchenko, V. A. and Pastur, L. A. “Distribution of eigenvalues for some sets of random matrices.” Matematicheskii Sbornik, 114(4):507-536, 1967.

Donoho, D. L. “Compressed Sensing.” IEEE Trans. Inform. Theory, 52(4):1289-1306, 2006.

Candès, E. J. and Tao, T. “Near-optimal signal recovery from random projections.” IEEE Trans. Inform. Theory, 52(12):5406-5425, 2006.

Villani, C. Optimal Transport: Old and New. Springer, 2009.

Amari, S. Information Geometry and Its Applications. Springer, 2016.