Stanford Guest Lectures: AP293 (Fall 2025)

At Goodfire, we're pursuing a portfolio of fundamental research directions in interpretability, ranging from understanding memorization to linear parameter decomposition to interpretable post-training.

Because interpretability is a relatively young and fast-moving field, some of these high-priority research directions haven't been publicly articulated outside of "related work" sections of various papers. In the spirit of broadly sharing our research, we're working on educational resources to help more folks understand the field and the exciting work that's happening within it.

As part of those efforts, we gave three guest lectures in Surya Ganguli's course on interpretability at Stanford last fall. We're sharing recordings of the lectures here so that researchers and students in the broader community can benefit from them.

Stay tuned for more educational content — including clips from the new Stanford course we're helping to organize!

Causal Mechanistic Interpretability (Atticus Geiger)

How can we use the language of causality to understand and edit the internal mechanisms of AI models? Atticus Geiger discusses applying frameworks and tools from causal modeling to understand LLMs and other neural networks.

00:00 — Intro

01:51 — Activation steering (e.g. Golden Gate Claude)

10:23 — Causal mediation analysis (understanding the contribution of an intermediate component)

21:42 — Causal abstraction methods (explaining a complex causal system with a simple one)

26:11 — Interchange interventions

40:46 — Distributed Alignment Search

54:54 — Lookback mechanisms: a case study in designing counterfactuals
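To make the idea of an interchange intervention concrete, here is a minimal sketch in PyTorch: we cache a mid-layer activation from a "source" prompt, patch it into a "base" prompt's forward pass, and check whether the prediction changes the way a simple high-level causal model would predict. This assumes a GPT-2-style Hugging Face model; the layer index and prompts are illustrative choices, not the examples from the lecture, and real interchange interventions typically target specific positions or subspaces rather than a whole layer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6                                   # hypothetical layer to intervene on
base_text = "The capital of France is"      # base run
source_text = "The capital of Italy is"     # source run (same token length)

cached = {}

def cache_hook(module, inputs, output):
    # Save the block's residual-stream output from the source run.
    hidden = output[0] if isinstance(output, tuple) else output
    cached["resid"] = hidden.detach()

def patch_hook(module, inputs, output):
    # Replace the base run's activations with the cached source activations
    # (a full-layer patch, for simplicity).
    if isinstance(output, tuple):
        return (cached["resid"],) + output[1:]
    return cached["resid"]

block = model.transformer.h[LAYER]

# 1) Source run: cache the activation we want to transplant.
handle = block.register_forward_hook(cache_hook)
with torch.no_grad():
    model(**tokenizer(source_text, return_tensors="pt"))
handle.remove()

# 2) Base run with the source activation patched in (the interchange intervention).
handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(**tokenizer(base_text, return_tensors="pt")).logits
handle.remove()

# If the intervened layer carries the "which country" variable, the completion
# should now track the source prompt rather than the base prompt.
print(tokenizer.decode([logits[0, -1].argmax().item()]))
```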

Computational Motifs: The Algorithmic Primitives of Transformers (Jack Merullo)

What algorithmic primitives do transformers use? Certain "computational motifs" show up over and over again when we do interpretability on different models, tasks, and circuits. Jack Merullo discusses these computational motifs, and how they can help us understand models in more generalizable ways.

00:53 — Intro: defining "computational motifs"

05:48 — Induction heads (a classic motif)

08:31 — Motifs in the Indirect Object Identification circuit

44:33 — More examples

51:15 — Challenges and open problems

1:03:12 — Conclusion & questions
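As a concrete illustration of one motif, here is a minimal sketch of the standard repeated-random-tokens diagnostic for induction heads: on a sequence repeated twice, an induction head at a position in the second half attends back to the token that followed the previous occurrence of the current token. This assumes a GPT-2-style Hugging Face model; the sequence length and score threshold are illustrative, not taken from the lecture.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

SEQ = 50
# A random token sequence repeated twice: [t_1 .. t_50, t_1 .. t_50].
half = torch.randint(1000, 10000, (1, SEQ))
tokens = torch.cat([half, half], dim=1)

with torch.no_grad():
    attentions = model(tokens, output_attentions=True).attentions  # one tensor per layer

# In the second half, the token that followed the previous occurrence of the
# current token sits SEQ - 1 positions back; induction heads attend there.
rows = torch.arange(SEQ, 2 * SEQ)
cols = rows - (SEQ - 1)
for layer, attn in enumerate(attentions):          # attn: (1, heads, seq, seq)
    for head in range(attn.shape[1]):
        score = attn[0, head, rows, cols].mean().item()
        if score > 0.3:                            # arbitrary illustrative threshold
            print(f"layer {layer}, head {head}: induction score {score:.2f}")
```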

In-Context Learning & "Model Systems" Interpretability (Ekdeep Singh Lubana)

What counts as an explanation of how an LLM works? Ekdeep Singh Lubana argues that different answers to this question define three different levels of analysis in interpretability, and outlines his neuro-inspired "model systems" approach. As a case study of that approach, he shows how in-context learning and many-shot jailbreaking can be explained by LLM representations changing in-context.

00:33 — What counts as an explanation?

04:47 — Levels of analysis & standard interpretability approaches

18:19 — The "model systems" approach to interpretability

(Case study on in-context learning)

23:36 — How LLM representations change in-context

44:10 — Modeling ICL with rational analysis

1:10:54 — Conclusion & questions
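As a rough illustration of the kind of measurement involved, here is a minimal sketch that tracks how the representation of a query token moves as more in-context examples are prepended. The model, prompts, layer, and similarity metric are illustrative assumptions, not the setup from the lecture.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 8                                      # hypothetical layer to read from
demo = "good -> positive\nbad -> negative\n"   # one block of two in-context examples
query = "great ->"

def query_representation(n_blocks: int) -> torch.Tensor:
    """Hidden state of the final query token after n_blocks of example pairs."""
    prompt = demo * n_blocks + query
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**ids, output_hidden_states=True).hidden_states
    return hidden_states[LAYER][0, -1]

zero_shot = query_representation(0)
for n in (1, 2, 4, 8):
    sim = F.cosine_similarity(zero_shot, query_representation(n), dim=0).item()
    print(f"{2 * n} shots: cosine similarity to zero-shot representation = {sim:.3f}")
```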
