
David Bau
Professor of the Practice, Khoury College of Computer Sciences, Affiliate Appointment · Northeastern University · Artificial Intelligence and Data Science
Active 1994–2026
About
David Bau is an assistant professor in the Khoury College of Computer Sciences at Northeastern University, based in Boston. His research focuses on human-computer interaction and machine learning. Before joining Northeastern, he worked as a software engineer at Google, BEA, and Crossgain. Bau has published in journals and at conferences including CVPR, NeurIPS, ICCV, ECCV, and SIGGRAPH. Outside of research, he enjoys astronomy and puzzle collecting.
Research topics
- Computer Science
- Artificial Intelligence
- Machine Learning
- Programming language
- Natural Language Processing
- Psychology
- Econometrics
- Cognitive psychology
- Computer vision
Selected publications
Distilling Diversity and Control in Diffusion Models
2026-03-06
Article · Open access · Senior author
Distilled diffusion models generate images in far fewer timesteps but suffer from reduced sample diversity when generating multiple outputs from the same prompt. To understand this phenomenon, we first investigate whether distillation damages concept representations by examining if the required diversity is properly learned. Surprisingly, distilled models retain the base model’s representational structure: control mechanisms like Concept Sliders and LoRAs transfer seamlessly without retraining, and Slider-Space analysis reveals distilled models possess variational directions needed for diversity yet fail to activate them. This redirects our investigation to understanding how the generation dynamics differ between base and distilled models. Using $\hat{\mathbf{x}}_0$ trajectory visualization, we discover distilled models commit to their final image structure almost immediately at the first timestep, while base models distribute structural decisions across many steps. To test whether this first-step commitment causes the diversity loss, we introduce diversity distillation, a hybrid approach using the base model for only the first critical timestep before switching to the distilled model. This single intervention restores sample diversity while maintaining computational efficiency. We provide both causal validation and theoretical support showing why the very first timestep concentrates the diversity bottleneck in distilled models. Our code and data are available at distillation.baulab.info
Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare
arXiv (Cornell University) · 2025-02-18
Preprint · Open access
We know from prior work that LLMs encode social biases, and that this manifests in clinical tasks. In this work we adopt tools from mechanistic interpretability to unveil sociodemographic representations and biases within LLMs in the context of healthcare. Specifically, we ask: Can we identify activations within LLMs that encode sociodemographic information (e.g., gender, race)? We find that gender information is highly localized in MLP layers and can be reliably manipulated at inference time via patching. Such interventions can surgically alter generated clinical vignettes for specific conditions, and also influence downstream clinical predictions which correlate with gender, e.g., patient risk of depression. We find that representation of patient race is somewhat more distributed, but can also be intervened upon, to a degree. To our knowledge, this is the first application of mechanistic interpretability methods to LLMs for healthcare.
Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research
SSRN Electronic Journal · 2025-01-01
Preprint · Open access
Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare
2025-01-01
Article · Open access
We know from prior work that LLMs encode social biases, and that this manifests in clinical tasks (Gerszberg, 2024; Zack et al., 2024; Zhang et al., 2020). In this work we adopt tools from mechanistic interpretability to unveil sociodemographic representations and biases within LLMs in the context of healthcare. Specifically, we ask: Can we identify activations within LLMs that encode sociodemographic information (e.g., gender, race)? We find that, in three open weight LLMs, gender information is highly localized in MLP layers and can be reliably manipulated at inference time via patching. Such interventions can surgically alter generated clinical vignettes for specific conditions, and also influence downstream clinical predictions which correlate with gender, e.g., patient risk of depression. We find that representation of patient race is somewhat more distributed, but can also be intervened upon, to a degree. To our knowledge, this is the first application of mechanistic interpretability methods to LLMs for healthcare.
LLMs Encode Harmfulness and Refusal Separately
ArXiv.org · 2025-07-16
Preprint · Open access
LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment on harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the model's internal belief of harmfulness. We also find that adversarially finetuning models to accept harmful instructions has minimal impact on the model's internal belief of harmfulness. These insights lead to a practical safety application: The model's latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) for detecting unsafe inputs and reducing over-refusals that is robust to finetuning attacks. For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated finetuned safeguard model, across different jailbreak methods. Our findings suggest that LLMs' internal understanding of harmfulness is more robust than their refusal decision to diverse input instructions, offering a new perspective to study AI safety.
MIB: A Mechanistic Interpretability Benchmark
UvA-DARE (University of Amsterdam) · 2025-04-17
Preprint · Open access
How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark, with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or causal variables in neural language models. The circuit localization track compares methods that locate the model components - and connections between them - most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and align those features to a task-relevant causal variable. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., non-featurized hidden vectors. These findings illustrate that MIB enables meaningful comparisons, and increases our confidence that there has been real progress in the field.
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
2025-10-19 · 1 citation
Preprint · Open access
We present SliderSpace, a framework for automatically decomposing the visual capabilities of diffusion models into controllable and human-understandable directions. Unlike existing control methods that require a user to specify attributes for each edit direction individually, SliderSpace discovers multiple interpretable and diverse directions simultaneously from a single text prompt. Each direction is trained as a low-rank adaptor, enabling compositional control and the discovery of surprising possibilities in the model's latent space. Through extensive experiments on state-of-the-art diffusion models, we demonstrate SliderSpace's effectiveness across three applications: concept decomposition, artistic style exploration, and diversity enhancement. Our quantitative evaluation shows that SliderSpace-discovered directions decompose the visual structure of model's knowledge effectively, offering insights into the latent capabilities encoded within diffusion models. User studies further validate that our method produces more diverse and useful variations compared to baselines. Our code, data and trained weights are available at https://sliderspace.baulab.info
Leveraging AI for Productive and Trustworthy HPC Software: Challenges and Research Directions
Lecture Notes in Computer Science · 2025-11-23 · 1 citation
Book chapter · Open access
Language Models use Lookbacks to Track Beliefs
ArXiv.org · 2025-05-20
Preprint · Open access
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze LMs' ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset, CausalToM, consisting of simple stories where two characters independently change the state of two objects, potentially unaware of each other's actions. Our investigation uncovers a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating their reference information, represented as Ordering IDs (OIs), in low-rank subspaces of the state token's residual stream. When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the correct state OI and then the answer lookback retrieves the corresponding state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs. Our work provides insights into belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.
Computational Linguistics · 2025-09-22
Article · Open access
Interpretability provides a toolset for understanding how and why language models behave in certain ways. However, there is little unity in the field: Most studies use ad-hoc evaluations and do not share theoretical foundations, making it difficult to measure progress and compare the pros and cons of different techniques. Furthermore, while mechanistic understanding is frequently discussed, the basic causal units underlying these mechanisms are often not explicitly defined. In this article, we propose a perspective on interpretability research grounded in causal mediation analysis. Specifically, we describe the history and current state of interpretability taxonomized according to the types of causal units (mediators) utilized, as well as methods used to search over mediators. We discuss the pros and cons of each mediator, providing insights as to when particular kinds of mediators and search methods are most appropriate. We argue that this framing yields a more cohesive narrative of the field and helps researchers select appropriate methods based on their research objective. Our analysis yields actionable recommendations for future work, including the discovery of new mediators and the development of standardized evaluations tailored to these goals.
Frequent coauthors
- Antonio Torralba · 48 shared
- Jun-Yan Zhu · 25 shared
- Bolei Zhou · 19 shared
- Gary Hachfeld · 16 shared
- Robert Holcomb (University of Rochester) · 16 shared
- William J. Craig (University of Edinburgh) · 16 shared
- Joanna Materzyńska · 13 shared
- Hendrik Strobelt · 12 shared
Education
- 2016 · Ph.D., Computer Science · Massachusetts Institute of Technology
- 2012 · M.S., Computer Science · Massachusetts Institute of Technology
- 2011 · B.S., Electrical Engineering and Computer Science · Massachusetts Institute of Technology