
Towards Scalable Parameter Decomposition
Authors
Dan Braun *†
* Co-first Author
† Goodfire – Work primarily carried out while at Apollo Research
Correspondence to lucius@goodfire.ai
Post written by Michael Byun

In 2013, researchers discovered that word2vec embeddings encoded analogies in their parameters: the now-famous result that vector arithmetic like king - man + woman = queen
worked directly in embedding space. A decade later, we're still better at interpreting the information that flows through neural networks than at understanding the machinery itself.
The mechanistic interpretability community has made real progress decomposing neural networks into understandable parts—we can identify representations, trace circuits, even edit specific behaviors. But the most successful methods so far, like SAEs, have focused largely on the activations that flow through a model, rather than directly inspecting the weights that transform inputs into those flows. It's a bit like trying to understand a program by only looking at its runtime variables, but never its source code.
This matters because we're ultimately after mechanistic stories—causal explanations of how models implement their capabilities. Knowing that a model has a "Paris feature" tells us something, but knowing how it constructs that feature from lower-level components, how it routes information to build up this representation, and what computations it performs on it would tell us much more. True understanding means grasping not just the "what" but the "how".
Parameter decomposition offers a way to decompose a model's parameters—the 'source code'—into components that reveal not only what the network computes, but how it computes it. Today, we're releasing a paper on Stochastic Parameter Decomposition (SPD), which removes key barriers to the scalability of prior methods. Where previous parameter decomposition methods were too brittle and computationally expensive to scale beyond toy models, SPD successfully identifies ground-truth mechanisms with far greater stability—opening a path toward understanding the computational structure of real neural networks.
Why focus on parameters?
In contrast to most previous methods, which focus on decomposing and understanding model activations, our new method focuses on decomposing model parameters (i.e., weights). Why might we want to do that?
First, some background. At a high level, reverse-engineering a neural network happens in three steps:
- Decomposition: Breaking down the model into simpler parts
- Description of components (interpretation): Formulating hypotheses about the functional role of component parts and how they interact
- Validation of descriptions: Testing if our hypotheses are correct
(Adapted from Open Problems in Mechanistic Interpretability.)
That first step, decomposition, is what our new paper addresses. The most popular current methods for decomposition, like sparse autoencoders (SAEs), are a part of a family of strategies called sparse dictionary learning (SDL) which operate on the activations in a model.
SAEs and related SDL-based methods have given us meaningful, and often fascinating, decompositions of neural networks. We've used them to find interpretable features in language models, to understand how genomics models process DNA sequences, and to steer image generation.
However, they also have some limitations. Of particular interest to us: SAEs don't explain feature geometry—why features arrange themselves in particular ways in activation space. And training bigger SAEs on the same model doesn't converge on a "true" decomposition—it gives us more, finer-grained features. Scaling SAEs describes a model's behaviors at different resolutions, rather than approaching a maximally parsimonious explanation of how the model works.
Newer methods, like CLTs and Matryoshka SAEs, aim to solve some of these issues, with promising results on circuit analysis and reducing feature splitting. But, in a sense, all SDL methods treat the parameters of each layer as a black box—observing only the inputs and outputs while remaining agnostic to how the transformation actually works. As one of our researchers put it: "You're getting a lot about the structure of the dataset, and not so much the computations." We see the shadows that computations cast on activations, but not the computations themselves. That's why we think it's worth exploring alternative approaches.
Parameter decomposition
The natural move is to look directly at the parameters—where the computations actually live. If we could decompose a model's weights into components based on their causal role in the computation, we might find the "atoms" of neural network computation. This is the vision for parameter decomposition: to try to find a set of components that parsimoniously recreate the model's true computational structure. Importantly, while parameter decomposition methods don't directly decompose activations, they actually imply a direct translation into activation space, ultimately allowing us to decompose activations as well as parameters.
Parameter decomposition works by searching for components that are faithful (sum to recreate the original model), minimal (use as few components as possible for any given input), and simple (as "small" as possible). The previous paper in this family of methods, Attribution-based Parameter Decomposition (APD), took a first crack at this problem, using gradient-based attribution to identify which parameter components matter for each input. But it proved too sensitive to hyperparameters and too computationally expensive to scale—bringing us to SPD.
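To make those three criteria concrete, here is a minimal sketch of how each could be expressed as a loss term for a single weight matrix. This is our own illustration in PyTorch, not code from either paper, and all function and variable names are placeholders.

```python
import torch

# Illustrative sketch only (not code from the paper): the three criteria
# expressed as loss terms for decomposing one weight matrix W into
# parameter components P_1, ..., P_C. All names here are our own.

def faithfulness_loss(W, components):
    # Faithful: the components should sum back to the original weights.
    return ((W - sum(components)) ** 2).sum()

def minimality_loss(attributions):
    # Minimal: for a given input, as few components as possible should be
    # active; `attributions` holds per-component importance scores.
    return attributions.abs().sum()  # an L1-style sparsity penalty

def simplicity_loss(components):
    # Simple: each component should be as "small" as possible; a nuclear-norm
    # (low-rank) penalty is one illustrative choice.
    return sum(torch.linalg.matrix_norm(P, ord='nuc') for P in components)
```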
Our new method: Stochastic Parameter Decomposition

SPD rethinks how to find parameter components. Instead of using gradient-based attribution to guess which components matter, it directly learns which parts of the network can be deleted for any given input.
The core insight is deceptively simple: if a parameter component isn't causally necessary for a computation, you should be able to remove it without changing the output. SPD trains a system that learns to predict exactly how much each component can be removed—playing a sort of Jenga with the components, while ensuring the model's behavior remains stable.
The method decomposes weights into rank-one subcomponents (the simplest possible building blocks), then trains a causal importance function for each one. During training, SPD learns to mask out unimportant components. One can think of it as akin to learning a super-granular Mixture-of-Experts distillation of the model, where each expert is as atomic as possible in its function.
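As a rough illustration of that setup, the sketch below parameterizes one weight matrix as a sum of rank-one subcomponents, with a small MLP predicting each subcomponent's causal importance. This is our own simplification, not the released implementation; names and sizes are placeholders.

```python
import torch
import torch.nn as nn

# Minimal sketch (our own illustration, not the released implementation):
# one weight matrix parameterized as a sum of C rank-one subcomponents
# u_c v_c^T, plus a small MLP that predicts each subcomponent's causal
# importance for a given input.

class RankOneDecomposition(nn.Module):
    def __init__(self, d_in, d_out, n_subcomponents, d_hidden=64):
        super().__init__()
        self.U = nn.Parameter(torch.randn(n_subcomponents, d_out) * 0.02)
        self.V = nn.Parameter(torch.randn(n_subcomponents, d_in) * 0.02)
        self.importance_mlp = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, n_subcomponents), nn.Sigmoid(),
        )

    def causal_importance(self, x):
        # Predicted importance in [0, 1] for each rank-one subcomponent.
        return self.importance_mlp(x)

    def weight(self, mask=None):
        # Reassemble the weight matrix, optionally scaling each subcomponent
        # by a mask value in [0, 1] (1 = fully kept, 0 = ablated).
        if mask is None:
            mask = torch.ones_like(self.U[:, 0])
        return torch.einsum('c,co,ci->oi', mask, self.U, self.V)
```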
This approach solves several problems with APD. It's computationally tractable—we no longer use full parameter vectors for each component. It's much less sensitive to hyperparameter choice. And it more directly measures the causal importance of its components, rather than relying on gradient-based proxies.
Technical details: How stochastic masking works
For researchers interested in implementation: SPD masks components based on their predicted importance and optimizes to preserve the original model's outputs while ablating as much as possible. SPD trains small MLPs to predict each subcomponent's "causal importance" on a given input. These importance scores determine a masking distribution—components predicted to be unimportant are stochastically masked out more aggressively (in expectation), so SPD learns to be indifferent to whether unimportant components are partially active or fully ablated. The key innovation is training through these stochastic masks: by sampling many possible ablation patterns, SPD discovers which components truly matter for a given datapoint without relying on potentially misleading gradient approximations. See Section 2 of the paper for the full algorithm.
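The sketch below shows what one training step could look like under our reading of this scheme. `decomposed_model`, its `decomposed_layers` attribute, the mask-sampling choice, and the loss coefficient are illustrative placeholders rather than the paper's implementation, and the faithfulness term that keeps subcomponents summing to the original weights is omitted for brevity.

```python
import torch

# Simplified sketch of one SPD-style training step, under our reading of the
# method; names and hyperparameters here are illustrative placeholders.

def spd_training_step(x, original_model, decomposed_model, optimizer, n_samples=4):
    with torch.no_grad():
        target = original_model(x)  # outputs the decomposition must preserve

    loss = torch.zeros(())
    for _ in range(n_samples):
        layer_masks = []
        for layer in decomposed_model.decomposed_layers:
            # For simplicity every importance head sees the model input x here;
            # in practice each head would see its own layer's input.
            g = layer.causal_importance(x)      # predicted importance in [0, 1]
            r = torch.rand_like(g)
            # Sample a mask between g and 1: subcomponents predicted to be
            # unimportant (g near 0) can land anywhere from fully ablated to
            # fully on; important ones (g near 1) stay switched on.
            layer_masks.append(g + (1.0 - g) * r)
        output = decomposed_model(x, masks=layer_masks)
        loss = loss + ((output - target) ** 2).mean()

    # Encourage minimality: few subcomponents should be predicted important.
    importance_penalty = sum(
        layer.causal_importance(x).sum()
        for layer in decomposed_model.decomposed_layers
    )
    loss = loss + 1e-3 * importance_penalty  # coefficient is illustrative

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```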
What we found
We tested SPD on a suite of toy models where we know the ground truth mechanisms—essential for validating that our decomposition method actually works.
In the Toy Model of Superposition, SPD correctly identified each individual feature direction without the parameter shrinkage that was found in APD. When we added identity matrices (a deliberate challenge for the previous parameter decomposition algorithm), SPD found exactly the components we expected: one per input feature, plus the minimum needed to represent the identity transformation. SPD also handled models beyond APD's reach. It successfully decomposed three-layer networks where mechanisms span multiple layers—something APD couldn't reliably do.
Most important for future scaling: SPD proved more robust to hyperparameter choices than APD, which required careful tuning to avoid pathological decompositions. While it still exhibits non-trivial sensitivity, SPD's relative stability makes it a much more practical tool and basis for further research than APD.
Looking forward
SPD's relative stability opens the door to scaling parameter decomposition beyond toy models. The obvious next target is language models—can we find the parameter components that implement specific capabilities like arithmetic, syntax, or factual recall? This will require modifications, but is an area we're actively working on, and we're excited for others to refine and build on these methods as well.
The method also suggests new research directions. We could investigate how models allocate their parameters—do they dedicate specific components to memorization versus generalization? Can we identify which parameters implement safety-relevant behaviors? There's also the intriguing possibility of training intrinsically decomposed models, where parameter components are built in from the start rather than discovered post-hoc.
SPD isn't a complete solution: it still requires a clustering step to group rank-one subcomponents into full mechanisms; the computational cost is nontrivial (though possibly competitive with SDL methods); and we've only tested it on models where we know what to look for. But we think it represents something important: the beginnings of a practical method for decomposing neural networks in parameter space.
Activation-space interpretability methods like SAEs have unlocked some fascinating insights into the internals of frontier models. We hope that parameter decomposition will help us take the next step in understanding the most important technology of our time.
Acknowledgments
We thank the Goodfire team for valuable discussions and feedback on this work. Special thanks to our colleagues at Apollo Research, where much of the foundational work was carried out.
References
For a complete list of references, please see our paper on arXiv.