
Research
Fundamental interpretability research to understand and intentionally design advanced AI systems
Priors in Time: Missing Inductive Biases for Language Model Interpretability
Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering
Deploying Interpretability to Production with Rakuten: SAE Probes for PII Detection

Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context
Understanding Sparse Autoencoder Scaling in the Presence of Feature Manifolds
Adversarial Examples Are Not Bugs, They Are Superposition
Discovering Undesired Rare Behaviors via Model Diff Amplification
The Circuits Research Landscape: Results and Perspectives

Replicating Circuit Tracing for a Simple Known Mechanism
Painting With Concepts Using Diffusion Model Latents





