
On Optimism for Interpretability

The most powerful technology of our time is also the most inscrutable.

ChatGPT's recent sycophantic update illustrates this well: in April, the chatbot inexplicably began engaging in extreme flattery, urging impulsive actions, and reinforcing negative emotions. Despite pre-release testing, the issues became apparent only through user reports after the model was deployed. OpenAI's post-training and black-box evaluation process had given the team little insight into what they were actually changing inside the model.

Other AI companies face the same fundamental problem of undesired, hard-to-anticipate model behavior: Anthropic has found that its chatbot Claude has learned to scheme and blackmail when threatened with shutdown, while a recent update to xAI's Grok caused it to declare itself “MechaHitler”.

This is the reality of building with frontier AI models today: we throw curated data at opaque systems, crossing our fingers that they'll learn what we want them to learn. When they break, we don't get error messages pointing to specific problems; we simply retrain them with different data. We're achieving remarkable results, but through methods that remain fundamentally mysterious to us.

Comic showing a person standing on a big pile of linear algebra with a data funnel, explaining machine learning as pouring data in and stirring until answers look right
Relevant xkcd.

It doesn't have to be this way.

The information we need to understand what's happening inside these models isn't hidden: we have complete access to their weights, activations, and attention patterns. Whatever is happening is right there in the computational graph, fully observable if we just knew how to look.

This is the promise of interpretability, the subfield of machine learning devoted to understanding precisely how neural networks process information and why they produce specific outputs. In other words, the field develops tools to peer inside and map models' internal logic. Current interpretability methods include sparse autoencoders and transcoders, which can break down model activations into interpretable “features” (representations of concepts) and “circuits” (which combine features into algorithms); other approaches promise alternative ways of understanding model internals. Researchers have so far found interpretable circuits, meaningful structure in activation space, and elegant patterns that govern how language and image models handle different types of information.
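To make the idea of “features” concrete, here is a minimal, illustrative sparse autoencoder in PyTorch. It is a toy sketch of the general sparse dictionary-learning recipe, not any particular production implementation; the dimensions and loss coefficient are placeholder assumptions.

```python
# Toy sparse autoencoder (SAE): decomposes a model activation vector into a
# sparse combination of learned "feature" directions. Illustrative only --
# real SAEs are trained on enormous activation datasets with more careful
# initialization, normalization, and sparsity schedules.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activation -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)   # feature coefficients -> reconstruction
        self.relu = nn.ReLU()

    def forward(self, activations: torch.Tensor):
        features = self.relu(self.encoder(activations))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

# Training objective: reconstruct the activation while keeping features sparse.
def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    reconstruction_error = (reconstruction - activations).pow(2).mean()
    sparsity_penalty = features.abs().mean()
    return reconstruction_error + l1_coeff * sparsity_penalty
```

Each column of the decoder weight matrix can then be read as a candidate feature direction in activation space, and the mostly-zero encoder outputs indicate which features fire on a given input.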

At Goodfire, we believe we can engineer frontier AI systems that are understandable. We reject the idea that AI needs to be a black box, and we work every day to articulate the underlying science of how neural networks represent information. I'm confident that we can build the tools needed to move from powerful-but-unreliable AI to intentionally designed systems – the question is whether we can do so before we're widely deploying superintelligent models. I believe we can.

This piece lays out the case for my optimism: what I think interpretability could unlock, why progress toward that vision is tractable, and what we'll need to get there in time.

The Vision

Interpretability is the science of what goes on inside neural networks. Like any new science, it will create a tree of scientific discovery that bears fruit in both expected and unexpected ways. The core capabilities we're building toward are clear, though: we want to understand how representations form in models, surface problems when they arise, trace them to their root causes, and make surgical edits to robustly fix them.

There's a parallel here with software engineering: when building an application, you need to be able to understand, edit, and debug it. These capabilities feel almost laughably basic until you realize we don't really have them for neural networks. Creating scientific engineering tools (to replace our current blunt instruments) would enable us to build AI systems that are more useful, more reliable, and more aligned with what we actually want them to do. Ultimately, we expect this new science to yield new ways of spending compute to predictably improve the alignment of AI systems.

Below, I outline some of the key applications that I see these tools enabling.

Alignment & Auditing

Sufficient progress in interpretability will enable us to test for crucial alignment and reliability concerns; it could be especially useful for detecting sycophancy, scheming, and unfaithful chain-of-thought reasoning, which are difficult to monitor without white-box methods. We can also use interp-based “model diffing” – analogous to a git diff – to understand what has changed in a model, even when we're not sure exactly what to look for. This is particularly helpful when post-training yields unexpected behaviors, as with the ChatGPT and Grok incidents.
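To give a flavor of what model diffing might look like in practice, here is a toy sketch that compares how often each learned feature fires in a base model versus a post-trained one. It assumes feature activations have already been extracted (for example with a sparse autoencoder); the array shapes, thresholds, and synthetic data are illustrative assumptions, not a description of any real pipeline.

```python
# Toy "model diff": compare how often each interpretable feature fires in a
# base model versus a post-trained model on the same prompts, and surface the
# features whose firing rate shifted most. Assumes feature activations have
# already been extracted; names and thresholds are illustrative placeholders.
import numpy as np

def diff_feature_firing_rates(base_acts: np.ndarray,
                              tuned_acts: np.ndarray,
                              top_k: int = 20):
    """base_acts, tuned_acts: [n_examples, n_features] feature activations."""
    base_rate = (base_acts > 0).mean(axis=0)     # fraction of examples where each feature fires
    tuned_rate = (tuned_acts > 0).mean(axis=0)
    shift = tuned_rate - base_rate
    ranked = np.argsort(-np.abs(shift))[:top_k]  # features with the largest change
    return [(int(i), float(base_rate[i]), float(tuned_rate[i])) for i in ranked]

# Example: flag a (hypothetical) feature that fires far more often after post-training.
rng = np.random.default_rng(0)
base = rng.exponential(0.1, size=(1000, 4096)) * (rng.random((1000, 4096)) < 0.02)
tuned = base.copy()
tuned[:, 123] += rng.exponential(1.0, size=1000)  # pretend feature 123 now fires constantly
print(diff_feature_firing_rates(base, tuned)[:3])
```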

For applications where transparency matters, such as medical diagnostics or financial decision-making, we need tools to trace how our model came to any given answer – understanding the reasoning path that led to specific outputs. In the future, this may be key to deploying AI models in production environments.

Debugging

When something looks wrong or feels off, we need tooling analogous to stack traces and debugger insight into runtime variables. These tools should allow us to trace problematic behavior back to specific parts of the model and sets of training data, helping point out how to fix the underlying problem. That includes specific issues in individual rollouts, not just behavior in aggregate (which current methods are limited to).

Consider the ChatGPT sycophancy example: OpenAI's expert testers had actually flagged that the model's behavior felt “off” before it was released. With good interpretability-based debugging tools, the team might have been able to follow up on those vague qualitative signals far more effectively – tracing the testers' subjective concerns to concrete behavioral changes and understanding the root causes before deployment.

Editing & Intentional Design

Beyond just understanding what's wrong, interpretability should enable us to fix it. This means developing capabilities to modify or remove specific features and mechanisms within trained models. It also means richer control over the training process itself, and more efficient distillation methods that preserve only the behaviors we want.

This represents a shift from our current reactive approach toward proactive, intentional design of AI systems, via the deliberate engineering of the internal representations and mechanisms that drive model behavior.

Scientific Discovery

Seeking to understand our models is not only a means to audit or modify them; the understanding itself promises to be a huge reward. Scientific models, like Arc Institute's genomics model Evo 2, are already serving as an excellent application of our interpretability techniques. We can peer inside these models to extract and use the conceptual frameworks they've developed – frameworks that may transcend current human understanding in their respective fields and lead to breakthroughs across those domains.

In genomics, we can imagine recovering novel biological principles from models that have internalized patterns across millions of sequences. Learning to harness these patterns from nature could allow us to do incredible things, like de novo protein design. Materials engineering systems could uncover new principles of structural design, potentially creating a path to inverse design for new materials. In the future, extracting insights from superhuman models trained on vast data will be one of the primary ways we do science.

Moreover, scientific models serve as an excellent testbed for new interpretability methods. Unlike language models, they're usually trained on data gathered from the physical world, giving us ground truth to benchmark against. And if they have narrowly superhuman capabilities (as many scientific models do), their internal representations must be fairly sophisticated, providing a demanding stress test for our interpretability techniques.

Unlocking New Methods

Beyond understanding and debugging individual models, interpretability unlocks, refines, and validates new methods across AI development. A much richer and more granular view of how different techniques affect a model – including across training steps, rather than relying on black-box evaluations after the fact – lets you better understand, refine, and discover useful approaches to model training, editing, alignment, and efficiency.

In this way, creating interpretability tools is like inventing the microscope. While microscopy is an important diagnostic technique in its own right, its more important contribution has been in unlocking a much deeper understanding of biology, which has in turn led to all manner of new methods and applications that reach far beyond the domain of optics. I expect interpretability to play a similar role for AI development.

Reasons for Optimism

Models are complex systems, and understanding them is a genuine research challenge. But several converging factors suggest not only that this challenge is solvable, but that we've barely scratched the surface of what's possible.

Models exhibit rich structure across domains

Models appear to converge on hierarchical representations of concepts that we can meaningfully parse and understand. Existing tools for discovering this structure, like sparse autoencoders (SAEs), empirically demonstrate that models encode many human-interpretable concepts, from simple features to complex abstractions.

For example, we've found that we can recover a species' place on the tree of life from the activations of the Evo 2 genomics model, suggesting that models might be learning “natural ontologies.” Others have found, using other methods, that LLMs represent numbers along helices in latent space, allowing them to use an elegant rotation-based algorithm to do addition.
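To see why a circular or helical number representation makes addition easy, here is a toy numerical illustration – a simplified version of the intuition behind that result, not the paper's actual analysis. The choice of periods is arbitrary.

```python
# Toy illustration: if a number n is encoded as angles on circles of several
# periods (the circular part of a helix), then addition is just rotation --
# rotating n's point by m's angle lands exactly on the encoding of n + m.
# Simplified intuition for the cited result, not the paper's actual method.
import numpy as np

PERIODS = [2, 5, 10, 100]  # illustrative periods

def encode(n: int) -> np.ndarray:
    """Encode n as (cos, sin) pairs on circles with the given periods."""
    angles = np.array([2 * np.pi * n / T for T in PERIODS])
    return np.concatenate([np.cos(angles), np.sin(angles)])

def rotate_add(enc_n: np.ndarray, m: int) -> np.ndarray:
    """Rotate each circle of enc_n by m's angle, implementing n -> n + m."""
    k = len(PERIODS)
    cos_n, sin_n = enc_n[:k], enc_n[k:]
    theta = np.array([2 * np.pi * m / T for T in PERIODS])
    cos_sum = cos_n * np.cos(theta) - sin_n * np.sin(theta)  # angle addition formulas
    sin_sum = sin_n * np.cos(theta) + cos_n * np.sin(theta)
    return np.concatenate([cos_sum, sin_sum])

assert np.allclose(rotate_add(encode(27), 58), encode(27 + 58))  # rotation == addition
```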

These findings suggest that, in practice, models converge on concise representations that are more interpretable than their apparent complexity might suggest.

Other complex systems are understandable, and models might be unusually so

We've made remarkable progress studying other complex systems, like biological systems, by understanding their structure at different levels of abstraction. Complexity (which is a key feature of neural networks!) is understandable, given the right analytical framework and tools.

And unlike in biology, where we must work with imperfect measurements and indirect observations, we have perfect access to AI systems and an unlimited ability to experiment. Given enough compute, researchers can probe any layer, trace any computation, and run as many controlled interventions as they want. This unprecedented experimental control, at huge scale, suggests that interpretability should be more tractable than understanding biological systems – a domain where science has nonetheless made tremendous progress.
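As a small illustration of that experimental access, the sketch below records an intermediate activation and then causally intervenes on it mid-forward-pass, using a toy two-layer network as a stand-in for a real model.

```python
# Toy example of the experimental access described above: with full access to
# a network, we can record any intermediate activation and causally intervene
# on it during the forward pass. The tiny MLP here is a stand-in for a real model.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

captured = {}

def record_hook(module, inputs, output):
    captured["layer0"] = output.detach().clone()  # read: probe the activation

def patch_hook(module, inputs, output):
    patched = output.clone()
    patched[:, 3] = 0.0            # write: ablate one hidden unit (a controlled intervention)
    return patched

x = torch.randn(1, 8)
h1 = model[0].register_forward_hook(record_hook)
baseline = model(x)
h1.remove()

h2 = model[0].register_forward_hook(patch_hook)
intervened = model(x)
h2.remove()

print("effect of ablating unit 3:", (intervened - baseline).abs().max().item())
```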

Interpretability can be unsupervised and scalable

Unsupervised, scalable methods have proven themselves repeatedly in machine learning, and I don't expect interpretability to be any different. The pattern is clear: use compute-scalable methods that let the data and models reveal their own structure, rather than imposing our preconceptions.

We already have such methods in the form of sparse dictionary learning (SAEs, transcoders, and the like), and we are developing new ones like linear parameter decomposition. Furthermore, we can use language models themselves to automate the tedious work of parsing and categorizing model subcomponents, allowing our interpretability tools to keep pace with frontier model capabilities (and go far beyond what manual analysis allows).
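One way to picture that automation: show a language model the text snippets where a feature fires most strongly and ask it to propose a label. In the sketch below, `call_llm` is a hypothetical stand-in for whatever LLM client you use; the prompt format and data structures are illustrative, not a specific production pipeline.

```python
# Sketch of automated feature interpretation: show a language model the text
# snippets where a feature fires most strongly and ask it for a label.
# `call_llm` is a hypothetical stand-in for an LLM API call; the prompt format
# and data structures are illustrative.
from typing import Callable, List, Tuple

def build_label_prompt(top_examples: List[Tuple[str, float]]) -> str:
    """top_examples: (text snippet, feature activation) pairs, strongest first."""
    lines = [f"{act:6.2f} | {text}" for text, act in top_examples]
    return (
        "The following text snippets most strongly activate one internal feature "
        "of a language model (activation strength on the left).\n"
        + "\n".join(lines)
        + "\nIn a short phrase, what concept does this feature represent?"
    )

def label_feature(top_examples: List[Tuple[str, float]],
                  call_llm: Callable[[str], str]) -> str:
    return call_llm(build_label_prompt(top_examples))

# Usage sketch (hypothetical inputs):
# label = label_feature(top_snippets_for_feature_123, call_llm=my_llm_client)
```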

At Goodfire, we're already applying these principles internally, using increasingly powerful automated agents for interpretability research, and seeing promising results that suggest this approach can handle the complexity of frontier systems.

Interpretability is still a young field, with big upside

In the field of AI, five years feels like a lifetime. Other areas of AI research have repeatedly demonstrated rapid advancement when scaled up: computer vision, natural language processing, and generative modeling all went from academic curiosities to world-changing technologies within just a few years.

Interpretability itself is a nascent science: the number of researchers seriously pursuing it remains remarkably small (my best guess is that there are fewer than 150 full-time interpretability researchers in the world), concentrated in a handful of companies and academic labs, and working on these problems for only a few years. We're still in the early days of new techniques like linear parameter decomposition.

The prize we're after is a fundamental science of neural networks that could unlock far more reliable, safer, and controllable AI systems. Given the promising early results and how few people have seriously tackled these questions, it would be remarkable to walk away from such transformative potential. Doing so would be like looking up at the stars and saying, “Well, we'll never have a chance of understanding why those are there.”

How We Get There

Realizing this vision requires tackling fundamental technical challenges across multiple fronts. Our full research agenda is a topic for another post, but we need breakthroughs in understanding how models encode and process information at many levels: from disentangling representational geometry, to decoding attention mechanisms, to linking representations and computations, to scaling circuit discovery. Each of these represents a deep technical puzzle where progress could uncover new ways of understanding and intentionally designing AI systems – or help us think about the problem in an entirely different way.

The timeline is uncertain, but I agree with Dario Amodei that we are in “a race between interpretability and model intelligence.” Leaders in the field believe AGI could arrive within the next few years, and we want to make sure we can understand the systems that will be present in every part of our lives. Failing to understand the most important technology of our time would be deeply irresponsible.

To be clear, success is not guaranteed. Interpretability is not a solved science – that's the point – and nobody knows exactly how much model behavior can be explained in the limit. But I think the reasons for optimism are compelling, and the most impactful science often happens where success is uncertain.

Success will require ambition and ingenuity to achieve the breakthroughs we need. I would like to see radically more investment, energy, and talent turned towards what I see as the field's north star: explaining, debugging, and intentionally designing the most intelligent models.

Looking Forward

Training a model feels like growing a tree.

Our current practices are like throwing sunlight and supercharged fertilizer at the tree, letting it grow wild. It yields incredible fruit – but sometimes it grows too fast, and branches break. Sometimes it becomes diseased. Sometimes it just grows into an awkward and gangly thing that's not quite what we wanted.

Interpretability promises to equip us with tools to practice something more like bonsai: methodically shaping, pruning, and sculpting exactly what we want, built on a deep understanding of how growth works.

This isn't just about making AI safer (though that alone would justify the effort), but rather about unlocking the full potential of artificial intelligence. When we understand these systems deeply, we can build AI that's not just powerful but reliable, not just capable but aligned with our intentions. We can extract scientific insights from models that have learned patterns beyond human comprehension. We can debug problems before they manifest, optimize for the behaviors we want, and create AI systems that augment human capability in ways we can trust.

The seeds of transformative AI are already planted and growing rapidly. The question is whether we'll develop the tools to tend them properly before they grow beyond our ability to shape. At Goodfire, we're betting that we can – and we believe the early results show this bet is worth making. The technical challenges are not trivial, but neither is the opportunity: to help build the science that will define how humanity develops and deploys its most powerful technology.

Bonsai tree representing methodical shaping and pruning
“Bonsai” by A.Davey, licensed under CC BY 2.0.

Acknowledgments

Thanks to the Goodfire team for ideas and discussions that shaped this piece, especially Tom McGrath, Dan Balsam, and Myra Deng for key insights, and Michael Byun for help with drafting. I'm grateful for thoughtful feedback on earlier drafts from Chris Olah, Daniel Kang, Lee Sharkey, Dan Braun, Liv Gorton, and Mark Bissell.

References

  1. Expanding on what we missed with sycophancy [link]
    OpenAI, 2025.
  2. Anthropic’s new AI model shows ability to deceive and blackmail [link]
    Ina Fried, 2025. Axios.
  3. Musk’s AI firm forced to delete posts praising Hitler from Grok chatbot [link]
    Josh Taylor, 2025. The Guardian.
  4. Scheming reasoning evaluations [link]
    Marius Hobbhahn, 2024. Apollo Research.
  5. Reasoning models don’t always say what they think [link]
    Anthropic, 2025.
  6. Interpreting Evo 2: Arc Institute’s Next-Generation Genomic Foundation Model [link]
    Myra Deng et al., 2025.
  7. Language Models Use Trigonometry to Do Addition [link]
    Subhash Kantamneni, 2025. arXiv.
  8. Towards Scalable Parameter Decomposition [link]
    Lucius Bushnaq et al., 2025.
  9. Open Problems in Mechanistic Interpretability [link]
    Lee Sharkey et al., 2025. arXiv.
  10. The Urgency of Interpretability [link]
    Dario Amodei, 2025.
  11. Interpretability Dreams [link]
    Chris Olah, 2023. Transformer Circuits Note.
