
Under the Hood of a Reasoning Model
We have trained the first ever sparse autoencoders (SAEs) on the 671B parameter DeepSeek R1 model and open-sourced the SAEs.
Published: Apr. 15, 2025
‡ Independent researcher, work done while visiting Goodfire
* Correspondence to dan@goodfire.ai

Today, we’re excited to share some early work on mechanistic interpretability for DeepSeek’s reasoning model R1—including two state-of-the-art, open-sourced sparse autoencoders (SAEs). These are the first public interpreter models trained on a true reasoning model, and on any model of this scale.
Our early experiments indicate that R1 is qualitatively different from non-reasoning language models, and required some novel insights to steer it. While we plan to do a lot more work to understand the internal workings of reasoning models such as R1, we’re excited to share these initial contributions with the research community.
Feature examples
We have trained both a general reasoning SAE and a math-specific SAE. Because R1 is a very large model that is difficult for most independent researchers to run, we've also uploaded SQL databases containing the max activating examples for each feature. Usage instructions are available on the project's GitHub.
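As a rough illustration, and assuming the databases are standard SQLite files, browsing a feature's top activating examples might look like the sketch below. The filename, table, and column names are placeholders rather than the real schema; the actual usage instructions live on the GitHub repository.

```python
# Hypothetical sketch of browsing one of the max-activating-example databases.
# The filename, table, and column names below are placeholders, NOT the real
# schema -- see the project's GitHub for the actual usage instructions.
import sqlite3

conn = sqlite3.connect("r1_general_reasoning_sae.db")  # placeholder path
rows = conn.execute(
    """
    SELECT text, max_activation
    FROM feature_activations        -- placeholder table name
    WHERE feature_id = ?
    ORDER BY max_activation DESC
    LIMIT 5
    """,
    (12345,),  # an arbitrary feature index
)
for text, activation in rows:
    print(f"{activation:.3f}  {text[:120]}")
conn.close()
```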
Our SAEs have learned many features that capture behaviors core to reasoning models, such as backtracking. Here are a few curated features from our general reasoning SAE:
Reasoning models and mechanistic interpretability
At Goodfire, we build tools for mechanistic interpretability: the scientific effort to understand how neural networks process information by reverse-engineering their internal components. Recent work in this field, like Anthropic's research on circuit tracing in Claude, continues to uncover computational pathways and features underlying model behaviors, from mental math to hallucinations. We believe that developing this deeper understanding is essential both for scientific progress and for ensuring these increasingly powerful systems are reliable and aligned with human intentions.
As part of that mission, we’re building interpretability tooling for the frontier of generative AI capabilities. While we don’t believe SAEs solve the overarching problem of mechanistic interpretability—they have a number of known limitations—they are nonetheless a central component of today’s toolbox for studying model mechanisms. Further advances in unsupervised interpretability techniques could eventually allow for more reliable alignment, the ability to enhance or suppress specific reasoning capabilities on demand, and potentially even the correction of specific failure modes without disrupting overall model performance.
Our SAEs for R1
We’re releasing two SAEs for R1: the first was trained on R1’s activations on a custom reasoning dataset (which we’re also open-sourcing), and the second used OpenR1-Math, a large dataset for mathematical reasoning. These datasets allow us to discover the features that R1 uses to answer challenging problems that exercise its reasoning chops.
At 671B parameters, the undistilled R1 model is an engineering challenge to run at scale. Our SAE training leveraged our custom inference engine and interpreter-model training infrastructure, which together enable interpretability at the half-trillion-parameter scale.
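For readers less familiar with the setup, the sketch below shows a minimal, generic sparse autoencoder of the kind trained to reconstruct a model's residual-stream activations. It is illustrative only: the width, activation function, and training loss of the released SAEs are not specified here and may differ.

```python
# Minimal, generic sparse autoencoder sketch (illustrative only): a ReLU SAE
# with an L1 sparsity penalty, trained to reconstruct model activations. The
# released SAEs' exact width, activation, and training loss may differ.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        reconstruction = self.decoder(features)  # reconstructed activations
        return reconstruction, features


def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    mse = (reconstruction - x).pow(2).mean()
    return mse + l1_coeff * features.abs().mean()


# Toy usage with random activations standing in for the model's residual
# stream; for R1 the input width would be its residual dimension and the
# dictionary far wider than in this toy example.
sae = SparseAutoencoder(d_model=1024, d_dict=8192)
acts = torch.randn(16, 1024)
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()
```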
Feature map
We used DataMapPlot to create an interactive UMAP visualization of our general reasoning SAE's features.
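As a rough sketch of how such a map can be produced, the snippet below embeds SAE feature vectors with UMAP and renders them with DataMapPlot. Using decoder directions as the feature vectors, the file paths, and the hyperparameters are illustrative assumptions rather than the exact pipeline behind the released map.

```python
# Sketch of building an interactive feature map: embed each SAE feature with
# UMAP and render the result with DataMapPlot. The decoder-direction feature
# vectors, file paths, and hyperparameters are illustrative assumptions.
import numpy as np
import umap
import datamapplot

decoder_directions = np.load("sae_decoder_weights.npy")  # (n_features, d_model), placeholder path
cluster_labels = np.load("feature_cluster_labels.npy", allow_pickle=True)       # placeholder labels
feature_descriptions = np.load("feature_descriptions.npy", allow_pickle=True)   # placeholder hover text

coords = umap.UMAP(
    n_neighbors=15, min_dist=0.1, metric="cosine"
).fit_transform(decoder_directions)

plot = datamapplot.create_interactive_plot(
    coords,
    cluster_labels,
    hover_text=feature_descriptions,
)
plot.save("feature_map.html")
```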
Two initial insights into steering R1
Many of our initial experiments with our R1 SAEs have focused on steering the model's responses by altering the strength of particular features. While we haven't yet systematically investigated their occurrence or causes, we wanted to share two steering behaviors we've observed in R1 that we haven't encountered in non-reasoning models.
Steering after “Okay, so the user has asked a question about…”
We usually steer a model (e.g. using Ember) starting from the first token of its response. However, naively steering R1 at the beginning of its chain of thought fails. Instead, we have to wait until after the model first begins its response with something like "Okay, so the user has asked a question about…" to steer effectively.
We also observe attention sinks, or tokens on which the average activation strength is far higher than normal, towards the end of this “response prefix”. Normally, we see attention sinks at the very beginning of model responses. This suggests that R1 doesn’t model that it has begun its true response until after the “Okay, so…” prefix.
Our hypothesis is that phrases like “Okay, so the user has asked a question about…” were so common in the reasoning traces on which R1 was trained that the model effectively treats them as part of the prompt. (Such prefixes are extremely common in R1’s reasoning traces: more than 95% of its English reasoning traces begin with "Okay,".) We observed a strong feature distribution shift between the prompt (including this prefix to the thinking trace), the thinking trace, and the assistant’s response.
This subtle, unintuitive characteristic of R1's internal processing shows how conceptual boundaries that seem intuitive to an external user may not match the ones the model itself uses.
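To make the prefix-offset idea concrete, here is a hedged sketch of steering only after the response prefix. It uses a small open model (GPT-2) as a stand-in, since R1 itself is far too large for a quick demo, and a random vector in place of a real SAE decoder direction; the layer choice, hook placement, and steering strength are all illustrative assumptions.

```python
# Hedged sketch: steer only after the "Okay, so..." response prefix by adding
# a scaled feature direction to the residual stream via a forward hook.
# GPT-2 stands in for R1 (which is far too large for a demo), and the feature
# direction is random; in practice it would be a row of an SAE decoder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the layer and strength below are arbitrary
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

d_model = model.config.hidden_size
feature_direction = torch.randn(d_model)
feature_direction /= feature_direction.norm()
steering_strength = 8.0

prompt = "Okay, so the user has asked a question about dogs."
prefix_len = len(tokenizer("Okay, so the user has asked a question about")["input_ids"])

def steer_after_prefix(module, inputs, output):
    hidden = output[0]  # block output is a tuple; hidden states come first
    if hidden.shape[1] == 1:
        # incremental decoding step: every newly generated token is past the prefix
        hidden += steering_strength * feature_direction
    else:
        # full prompt pass: leave the prefix untouched, steer everything after it
        hidden[:, prefix_len:, :] += steering_strength * feature_direction
    # hidden is modified in place, so no return value is needed

# Hook a middle block of the stand-in model (the layer choice is arbitrary).
handle = model.transformer.h[6].register_forward_hook(steer_after_prefix)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()
```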
Steering example #1 (Swapping the operator in a math problem)
Oversteering R1 makes it revert to its original behavior
When we steer a model, we vary the strength of the feature(s) we’re manipulating, controlling how salient that feature is to the downstream model output. For instance, if we increase the activation of a feature that represents dogs, then the model’s outputs become more dog-related. If we oversteer by pushing that feature’s activation higher and higher, we typically observe that the model becomes increasingly fixated on dogs until its outputs become incoherent.
However, when steering R1 on some features, we’ve observed that oversteering paradoxically causes the model to revert to its original behavior. (If we continue to amp up the feature, outputs do become incoherent—but only after the behavior we steered towards disappears.)
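One simple way to probe this effect, continuing the stand-in setup from the sketch above, is to sweep the steering strength and read off where the steered behavior appears, vanishes, and finally gives way to incoherence. The strengths below are arbitrary; the relevant range depends on the feature and the layer being steered.

```python
# Sketch (continuing the stand-in setup above): sweep the steering strength
# and inspect the generations at each value. The strengths are arbitrary.
handle = model.transformer.h[6].register_forward_hook(steer_after_prefix)
for strength in [0.0, 2.0, 8.0, 32.0, 128.0]:
    steering_strength = strength  # read by the hook at call time
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    print(f"strength={strength:>6.1f}: {text!r}")
handle.remove()
```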
Steering example #2 (Decreasing time spent thinking)
Our working hypothesis is that the model implicitly recognizes a state of confusion or incoherence when its internal activations are overly perturbed, causing it to stop and course-correct. Why would this “rebalancing” effect appear in reasoning models specifically? We think their training might incentivize greater implicit “awareness” of their internal state. Empirically, reasoning models often backtrack and try an alternative approach when a line of reasoning on a challenging problem isn’t working out, which suggests some internal understanding of when they’re “confused”.
If this phenomenon is a general property of reasoning models, it suggests that attempts to change their behavior—e.g., to suppress dishonest responses—may require more sophisticated techniques, as models find ways to route around modifications. More work is needed here: we’re continuing to investigate these phenomena, but we’re also excited to see what insights others are able to find with our tools.
Why this matters
As reasoning models increasingly become the foundation for advanced AI systems across industries—and potentially begin to interact more agentically with the world—understanding their internal mechanisms is crucial. Mechanistic interpretability provides stronger assurances about model behavior by revealing how models craft their responses, enabling us to:
- better understand model capabilities and limitations
- identify, monitor for, and debug unexpected behaviors and failure modes
- develop more targeted safety interventions
- enable greater transparency and trust for users
With this release, the field of mechanistic interpretability is taking another step toward realizing that vision. Our experience building and using interpretability tools for a range of models, such as for the Arc Institute’s genomic foundation model Evo, leads us to believe that this approach is a practical and necessary strategy for developing both safer and more useful foundation models.
As we share these SAEs for a large reasoning model at the frontier of model capabilities, we're excited to see how the wider research community will build upon these findings and develop new techniques for understanding and aligning powerful AI systems. As reasoning models continue to grow in capability and adoption, tools like these will be essential for ensuring they remain reliable, transparent, and aligned with human intentions.
Citation
Hazra et al., "Under the Hood of a Reasoning Model", Goodfire Research, 2025.
References
- McInnes, L., Healy, J., and Melville, J., "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction", arXiv preprint arXiv:1802.03426, 2018. [link]
- Ameisen, et al., "Circuit Tracing: Revealing Computational Graphs in Language Models", Transformer Circuits, 2025. [HTML]
- Ben Allal, L., Tunstall, L., Lozhkov, A., Bakouch, E., Penedo, G., Kydlicek, H., and Martín Blázquez, G., "Open R1: Update #2", Hugging Face, Feb. 10, 2025. [HTML]
- Deng, M., Balsam, D., Gorton, L., Wang, N., Nguyen, N., Ho, E., and McGrath, T., "Interpreting Evo 2: Arc Institute's Next-Generation Genomic Foundation Model", Goodfire Research, Feb. 20, 2025. [HTML]
- Goodfire Research, "Our Approach to Safety at Goodfire", Dec. 22, 2024. [HTML]