
Stanford Guest Lectures: AP293 (Fall 2025)
At Goodfire, we're pursuing a portfolio of fundamental research directions in interpretability, ranging from understanding memorization to linear parameter decomposition to interpretable post-training.
Because interpretability is a relatively young and fast-moving field, some of these high-priority research directions haven't been publicly articulated outside of "related work" sections of various papers. In the spirit of broadly sharing our research, we're working on educational resources to help more folks understand the field and the exciting work that's happening within it.
As part of those efforts, we gave three guest lectures in Surya Ganguli's course on interpretability at Stanford last fall. We're sharing recordings of the lectures here so that researchers and students in the broader community can benefit from them.
Stay tuned for more educational content — including clips from the new Stanford course we're helping to organize!
Causal Mechanistic Interpretability (Atticus Geiger)
How can we use the language of causality to understand and edit the internal mechanisms of AI models? Atticus Geiger discusses applying frameworks and tools from causal modeling to understand LLMs and other neural networks.
00:00 — Intro
01:51 — Activation steering (e.g. Golden Gate Claude)
10:23 — Causal mediation analysis (understanding the contribution of an intermediate component)
21:42 — Causal abstraction methods (explaining a complex causal system with a simple one)
26:11 — Interchange interventions
40:46 — Distributed Alignment Search
54:54 — Lookback mechanisms: a case study in designing counterfactuals
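If you want a concrete feel for the interchange interventions covered above, here is a minimal sketch (our illustration, not code from the lecture). It uses a toy PyTorch MLP as a stand-in for a real model: we cache a hidden activation from a "source" run, then patch it into a "base" run and see how the output changes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer MLP standing in for a network under study.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

base_input = torch.randn(1, 4)    # the "base" run
source_input = torch.randn(1, 4)  # the "source" run

# Step 1: record the hidden activation from the source run.
cached = {}

def cache_hook(module, inputs, output):
    cached["hidden"] = output.detach()

handle = model[1].register_forward_hook(cache_hook)
model(source_input)
handle.remove()

# Step 2: rerun on the base input, but swap in the source run's
# hidden activation. This is the interchange intervention.
def patch_hook(module, inputs, output):
    return cached["hidden"]

handle = model[1].register_forward_hook(patch_hook)
patched_output = model(base_input)
handle.remove()

# If the patched output behaves as a simple high-level causal model
# predicts, that is evidence the hidden layer implements the
# corresponding high-level variable.
print("base output:   ", model(base_input))
print("patched output:", patched_output)
```

Distributed Alignment Search, also covered above, generalizes this idea by learning a rotation of the representation space so the intervention can target a subspace rather than a whole layer.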
Computational Motifs: The Algorithmic Primitives of Transformers (Jack Merullo)
What algorithmic primitives do transformers use? Certain "computational motifs" show up over and over when we do interpretability across different models, tasks, and circuits. Jack Merullo discusses these motifs and how they can help us understand models in ways that generalize.
00:53 — Intro: defining "computational motifs"
05:48 — Induction heads (a classic motif)
08:31 — Motifs in the Indirect Object Identification circuit
44:33 — More examples
51:15 — Challenges and open problems
1:03:12 — Conclusion & questions
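To make "motif" concrete: an induction head implements roughly the rule "if the current token appeared earlier in the context, attend to the token that followed it and copy it." Here is that rule written out as plain Python (our illustrative sketch, not code from the lecture):

```python
def induction_prediction(tokens):
    """Predict the next token with the induction-head rule: find the
    most recent earlier occurrence of the current token and copy the
    token that followed it."""
    current = tokens[-1]
    # Scan earlier positions from right to left.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence, so the rule abstains

# On repeated patterns like [A][B] ... [A], the rule completes [B].
print(induction_prediction(["A", "B", "C", "A"]))          # -> B
print(induction_prediction(["the", "cat", "sat", "the"]))  # -> cat
```

In a real transformer, this behavior is implemented across two attention heads (a previous-token head composing with an induction head), which is part of what makes it such a recognizable, recurring motif.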
In-Context Learning & "Model Systems" Interpretability (Ekdeep Singh Lubana)
What counts as an explanation of how an LLM works? Ekdeep Singh Lubana argues that different answers to this question define three distinct levels of analysis in interpretability, and outlines his neuro-inspired "model systems approach". As a case study for that approach, he shows how in-context learning and many-shot jailbreaking can be explained by LLM representations changing in-context.
00:33 — What counts as an explanation?
04:47 — Levels of analysis & standard interpretability approaches
18:19 — The "model systems" approach to interp
(Case study on in-context learning)
23:36 — How LLM representations change in-context
44:10 — Modeling ICL with rational analysis
1:10:54 — Conclusion & questions
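One way to poke at the claim that representations change in-context is to track a model's hidden states as shots are added to a prompt. Below is a rough sketch using GPT-2 via Hugging Face transformers; the model choice, prompt, and layer are our arbitrary assumptions, and the lecture's experiments are far more careful than this:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

# Hypothetical few-shot prompt: each added shot gives the model more
# in-context evidence about the sentiment-labeling task.
shots = [
    "great movie -> positive\n",
    "terrible plot -> negative\n",
    "loved it -> positive\n",
    "boring mess -> negative\n",
]
query = "what a film ->"

prev = None
for n in range(len(shots) + 1):
    prompt = "".join(shots[:n]) + query
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Representation of the final prompt token at a middle layer.
    rep = out.hidden_states[6][0, -1]
    if prev is not None:
        drift = (1 - torch.cosine_similarity(rep, prev, dim=0)).item()
        print(f"{n} shots: distance from {n - 1}-shot representation = {drift:.4f}")
    prev = rep
```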