Kate Saenko

Professor · Verified

Boston University · Computer Science

Active 2004–2025

h-index: 89

Citations: 52.8k

Papers: 445 (209 in the last 5 years)

Funding: $1.1M

About

Kate Saenko is a Professor in the Department of Computer Science at Boston University, where she directs the Computer Vision and Learning Group. She is also a member of the IVC Group. She received her PhD from MIT and has held academic and research positions including Assistant Professor at UMass Lowell, Postdoctoral Researcher at the International Computer Science Institute, Visiting Scholar at UC Berkeley EECS, and Visiting Postdoctoral Fellow in the School of Engineering and Applied Science at Harvard University. Her research interests span Artificial Intelligence, with a focus on Adaptive Machine Learning, Learning for Vision and Language Understanding, and Deep Learning.

Research topics

Computer Science
Artificial Intelligence
Machine Learning
Natural Language Processing
Mathematics
Psychology
Cognitive psychology
Theoretical computer science
Computer vision

Selected publications

SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models
2025-06-10
article
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Existing approaches require prompt tuning or architectural adaptations, limiting zero-shot applicability. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. We make two contributions. First, we find that VLM scores suffer from image- and prompt-specific biases, and that simple standardization is surprisingly effective at removing these and boosting MLR performance. And second, we introduce compound prompts grounded in realistic object combinations. Our analysis reveals "AND"/"OR" signal ambiguities that cause maximum compound scores to be surprisingly suboptimal compared to second-highest scores. We introduce an adaptive fusion method to address this issue. Our method enhances other zero-shot approaches, consistently improving their results. Experiments show superior mean Average Precision (mAP) compared to methods requiring training data, achieved through refined object ranking for robust zero-shot MLR. Code can be found at https://github.com/kjmillerCURIS/SPARC.
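The standardization idea in this abstract can be illustrated with a minimal sketch. The abstract does not spell out the exact procedure, so the two-pass z-scoring below (per image, then per prompt), the function names, and the toy score matrix are all assumptions made for illustration rather than the authors' implementation.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores of a list; a constant list maps to all zeros."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def debias_scores(scores):
    """Standardize a score matrix per image (row), then per prompt (column).

    scores[i][j] is the raw VLM similarity of image i to prompt j.
    This ordering is an assumption, not taken from the paper.
    """
    # Remove each image's own offset and scale.
    rows = [standardize(row) for row in scores]
    # Remove each prompt's offset and scale across images.
    cols = [standardize(list(col)) for col in zip(*rows)]
    # Transpose back to the image-major layout.
    return [list(row) for row in zip(*cols)]


# Hypothetical raw scores for 3 images and 3 prompts.
raw = [
    [0.31, 0.29, 0.35],
    [0.22, 0.18, 0.21],
    [0.27, 0.30, 0.26],
]
debiased = debias_scores(raw)
```

After this step every prompt column has zero mean across images, so a ranking built on `debiased` is less affected by prompts or images that score uniformly high or low.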
SAM 3: Segment Anything with Concepts
arXiv (Cornell University) · 2025-11-20 · 3 citations
preprint · Open access
We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.
BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
arXiv (Cornell University) · 2025-12-11
preprint · Open access
Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.
ERM++: An Improved Baseline for Domain Generalization
2025-02-26 · 8 citations
article
Domain Generalization (DG) aims to develop classifiers that can generalize to new, unseen data distributions, a critical capability when collecting new domain-specific data is impractical. A common DG baseline minimizes the empirical risk on the source domains. Recent studies have shown that this approach, known as Empirical Risk Minimization (ERM), can outperform most more complex DG methods when properly tuned. However, these studies have primarily focused on a narrow set of hyperparameters, neglecting other factors that can enhance robustness and prevent overfitting and catastrophic forgetting, properties which are critical for strong DG performance. In our investigation of training data utilization (i.e., duration and setting validation splits), initialization, and additional regularizers, we find that tuning these previously overlooked factors significantly improves model generalization across diverse datasets without adding much complexity. We call this improved, yet simple baseline ERM++. Despite its ease of implementation, ERM++ improves DG performance by over 5% compared to prior ERM baselines on a standard benchmark of 5 datasets with a ResNet-50 and over 15% with a ViT-B/16. It also outperforms all state-of-the-art methods on DomainBed datasets with both architectures. Importantly, ERM++ is easy to integrate into existing frameworks like DomainBed, making it a practical and powerful tool for researchers and practitioners. Overall, ERM++ challenges the need for more complex DG methods by providing a stronger, more reliable baseline that maintains simplicity and ease of use. Code is available at https://github.com/piotr-teterwak/erm_plusplus
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models
ArXiv.org · 2025-02-24
preprint · Open access
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Existing approaches require prompt tuning or architectural adaptations, limiting zero-shot applicability. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Using large language model insights on object co-occurrence, we introduce compound prompts grounded in realistic object combinations. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum compound scores are surprisingly suboptimal compared to second-highest scores. We address these through a debiasing and score-fusion algorithm that corrects image bias and clarifies VLM response behaviors. Our method enhances other zero-shot approaches, consistently improving their results. Experiments show superior mean Average Precision (mAP) compared to methods requiring training data, achieved through refined object ranking for robust zero-shot MLR.
Scaling Up Temporal Domain Generalization via Temporal Experts Averaging
2025-01-01
article · Open access
Temporal Domain Generalization (TDG) aims to generalize across temporal distribution shifts, e.g., lexical change over time. Prior work often addresses this by predicting future model weights. However, full model prediction is prohibitively expensive for even reasonably sized models. Thus, recent methods only predict the classifier layer, limiting generalization by failing to adjust other model components. To address this, we propose Temporal Experts Averaging (TEA), a novel and scalable TDG framework that updates the entire model using weight averaging to maximize generalization potential while minimizing computational costs. Our theoretical analysis guides us to two steps that enhance generalization to future domains. First, we create expert models with functional diversity yet parameter similarity by fine-tuning a domain-agnostic base model on individual temporal domains while constraining weight changes. Second, we optimize the bias-variance tradeoff through adaptive averaging coefficients derived from modeling temporal weight trajectories in a principal component subspace. Experts' contributions are based on their projected proximity to future domains. Extensive experiments across 7 TDG benchmarks, 5 models, and 2 TDG settings show that TEA outperforms prior TDG methods by up to 69% while being up to 60x more efficient.
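The weight-averaging step can be pictured with a small sketch. The abstract derives its coefficients from temporal weight trajectories in a principal component subspace; the code below does not reproduce that. It substitutes a plain softmax over hypothetical distances to a future domain, and all names and numbers are illustrative assumptions.

```python
import math


def proximity_coefficients(distances, temperature=1.0):
    """Softmax over negative distances: closer experts get larger weights.

    This rule is a stand-in assumption, not the coefficient scheme of TEA.
    """
    logits = [-d / temperature for d in distances]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_experts(expert_weights, coefficients):
    """Coefficient-weighted average of flat parameter vectors."""
    size = len(expert_weights[0])
    return [
        sum(c * w[k] for c, w in zip(coefficients, expert_weights))
        for k in range(size)
    ]


# Hypothetical flattened weights of three experts, oldest domain first.
experts = [[1.0, 0.0], [0.8, 0.2], [0.5, 0.5]]
# Hypothetical distances of each expert to the projected future domain.
distances = [3.0, 2.0, 1.0]

coeffs = proximity_coefficients(distances)
merged = average_experts(experts, coeffs)
```

Because the coefficients sum to one, `merged` stays inside the convex hull of the expert weights, and the expert closest to the future domain contributes most.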
Generalized prompt-driven zero-shot domain adaptive segmentation with feature rectification and semantic modulation
Computer Vision and Image Understanding · 2025-12-17
article
OP-LoRA: The Blessing of Dimensionality
arXiv (Cornell University) · 2024-12-13
preprint · Open access
Low-rank adapters enable fine-tuning of large models with only a small number of parameters, thus reducing storage costs and minimizing the risk of catastrophic forgetting. However, they often pose optimization challenges, with poor convergence. To overcome these challenges, we introduce an over-parameterized approach that accelerates training without increasing inference costs. This method reparameterizes low-rank adaptation by employing a separate MLP and learned embedding for each layer. The learned embedding is input to the MLP, which generates the adapter parameters. Such overparameterization has been shown to implicitly function as an adaptive learning rate and momentum, accelerating optimization. At inference time, the MLP can be discarded, leaving behind a standard low-rank adapter. To study the effect of MLP overparameterization on a small yet difficult proxy task, we implement it for matrix factorization, and find it achieves faster convergence and lower final loss. Extending this approach to larger-scale tasks, we observe consistent performance gains across domains. We achieve improvements in vision-language tasks and especially notable increases in image generation, with CMMD scores improving by up to 15 points.
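The reparameterization described above can be sketched as follows: a per-layer embedding is fed to a small MLP whose output is reshaped into the two low-rank factors, and the factors are materialized once so the MLP is not needed at inference. Layer sizes, the tanh activation, the initialization, and the class and method names are assumptions made for this illustration; training is omitted.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class GeneratedAdapter:
    """Low-rank adapter whose factors come from a small MLP (training-time view)."""

    def __init__(self, d_in, d_out, rank, embed_dim=4, hidden=8, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        # Learned per-layer embedding (randomly initialized here).
        self.embedding = [rng.uniform(-1, 1) for _ in range(embed_dim)]
        # Two-layer MLP: embedding -> hidden -> all adapter parameters.
        self.w1 = random_matrix(hidden, embed_dim, rng)
        self.w2 = random_matrix(rank * d_in + d_out * rank, hidden, rng)

    def factors(self):
        """Run the MLP and reshape its output into A (rank x d_in) and B (d_out x rank)."""
        hidden = [math.tanh(h) for h in matvec(self.w1, self.embedding)]
        flat = matvec(self.w2, hidden)
        split = self.rank * self.d_in
        a = [flat[i * self.d_in:(i + 1) * self.d_in] for i in range(self.rank)]
        rest = flat[split:]
        b = [rest[i * self.rank:(i + 1) * self.rank] for i in range(self.d_out)]
        return a, b

    def export(self):
        """Materialize the factors once; the MLP can then be dropped."""
        return self.factors()


def apply_adapter(a, b, x):
    """Compute the low-rank update B (A x)."""
    return matvec(b, matvec(a, x))


adapter = GeneratedAdapter(d_in=3, d_out=2, rank=1)
a, b = adapter.export()  # plain low-rank factors, no MLP required from here on
x = [1.0, 2.0, 3.0]
delta = apply_adapter(a, b, x)
```

Only `a` and `b` need to be stored after training, which is why the extra parameters of the MLP add no inference cost in this setup.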
Koala: Key Frame-Conditioned Long Video-LLM
2024-06-16 · 16 citations
article · Senior author
Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short seconds-long videos, vLLMs are unable to understand minutes-long videos and accurately answer questions about them. To address this limitation, we propose a lightweight and self-supervised approach, Key frame-conditioned long video-LLM (Koala), that introduces learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to longer videos. Our approach introduces two new tokenizers that condition on visual tokens computed from sparse video key frames for understanding short and long video moments. We train our proposed approach on HowTo100M and demonstrate its effectiveness on zero-shot long video understanding benchmarks, where it outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across all tasks. Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
Learning to Compose SuperWeights for Neural Parameter Allocation Search
2024-01-03 · 3 citations
article
Neural parameter allocation search (NPAS) automates parameter sharing by obtaining weights for a network given an arbitrary, fixed parameter budget. Prior work has two major drawbacks we aim to address. First, there is a disconnect in the sharing pattern between the search and training steps, where weights are warped for layers of different sizes during the search to measure similarity, but not during training, resulting in reduced performance. To address this, we generate layer weights by learning to compose sets of SuperWeights, which represent a group of trainable parameters. These SuperWeights are created to be large enough so they can be used to represent any layer in the network, but small enough that they are computationally efficient. The second drawback we address is the method of measuring similarity between shared parameters. Whereas prior work compared the weights themselves, we argue this does not take into account the amount of conflict between the shared weights. Instead, we use gradient information to identify layers with shared weights that wish to diverge from each other. We demonstrate that our SuperWeight Networks consistently boost performance over the state-of-the-art on the ImageNet and CIFAR datasets in the NPAS setting. We further show that our approach can generate parameters for many network architectures using the same set of weights. This enables us to support tasks like efficient ensembling and anytime prediction, outperforming fully-parameterized ensembles with 17% fewer parameters.
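One way to picture the gradient-based conflict check is to compare, for each pair of layers that share weights, the gradients each layer sends to those weights. The abstract does not say which measure is used; cosine similarity, the zero threshold, the layer names, and the gradient values below are assumptions chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def conflicting_pairs(layer_grads, threshold=0.0):
    """Return layer-name pairs whose gradients on the shared weights disagree."""
    names = sorted(layer_grads)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            similarity = cosine_similarity(layer_grads[first], layer_grads[second])
            if similarity < threshold:
                pairs.append((first, second))
    return pairs


# Hypothetical gradients of three layers with respect to the same shared weights.
grads = {
    "conv1": [0.5, -0.2, 0.1],
    "conv2": [0.4, -0.1, 0.2],
    "conv3": [-0.5, 0.3, -0.1],
}
conflicts = conflicting_pairs(grads)
```

In this toy input, `conv3` pulls the shared weights in the opposite direction from the other two layers, so it is the candidate to receive its own parameters.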

Recent grants

CI-NEW: Collaborative Research: COVE-Computer Vision Exchange for Data, Annotations and Tools
NSF · $206k · 2016–2020
EAGER: Quantifying and Reducing Data Bias in Object Detection Using Physics-based Image Synthesis
NSF · $186k · 2014–2017
AitF: FULL: Collaborative Research: PEARL: Perceptual Adaptive Representation Learning in the Wild
NSF · $200k · 2015–2017
AitF: FULL: Collaborative Research: PEARL: Perceptual Adaptive Representation Learning in the Wild
NSF · $174k · 2016–2020
S&AS: FND: COLLAB: Learning Manipulation Skills Using Deep Reinforcement Learning with Domain Transfer
NSF · $316k · 2017–2022

Frequent coauthors

Trevor Darrell
147 shared
Bryan A. Plummer
70 shared
Judy Hoffman
65 shared
Stan Sclaroff
Boston University
56 shared
Marcus Rohrbach
43 shared
Rogério Feris
IBM (United States)
42 shared
Kuniaki Saito
41 shared
Rameswar Panda
33 shared

Labs

Machine Learning · PI

Education

Ph.D. · MIT
Other · UMass Lowell
Other · International Computer Science Institute
Other · UC Berkeley EECS
Other · School of Engineering and Applied Science at Harvard University