
Kate Saenko
Professor · Boston University · Computer Science
Active 2004–2025
About
Kate Saenko is a Professor in the Department of Computer Science at Boston University, where she directs the Computer Vision and Learning Group and is a member of the IVC Group. She received her PhD from MIT and has previously held positions as Assistant Professor at UMass Lowell, Postdoctoral Researcher at the International Computer Science Institute, Visiting Scholar at UC Berkeley EECS, and Visiting Postdoctoral Fellow in the School of Engineering and Applied Science at Harvard University. Her research spans Artificial Intelligence, with a focus on Adaptive Machine Learning, Learning for Vision and Language Understanding, and Deep Learning.
Research topics
- Computer Science
- Artificial Intelligence
- Machine Learning
- Natural Language Processing
- Mathematics
- Psychology
- Cognitive psychology
- Theoretical computer science
- Computer vision
Selected publications
2025-06-10
article
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Existing approaches require prompt tuning or architectural adaptations, limiting zero-shot applicability. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. We make two contributions. First, we find that VLM scores suffer from image- and prompt-specific biases, and that simple standardization is surprisingly effective at removing these and boosting MLR performance. Second, we introduce compound prompts grounded in realistic object combinations. Our analysis reveals "AND"/"OR" signal ambiguities that cause maximum compound scores to be surprisingly suboptimal compared to second-highest scores. We introduce an adaptive fusion method to address this issue. Our method enhances other zero-shot approaches, consistently improving their results. Experiments show superior mean Average Precision (mAP) compared to methods requiring training data, achieved through refined object ranking for robust zero-shot MLR. Code can be found at https://github.com/kjmillerCURIS/SPARC.
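The standardization idea in the abstract above can be illustrated with a minimal sketch. The abstract does not specify the exact procedure, so the choices here are assumptions: a score matrix indexed as `scores[image][prompt]`, a z-score over each prompt column followed by a z-score over each image row, and the hypothetical helper names `zscore` and `standardize_scores`. It is not the authors' implementation.

```python
from statistics import mean, pstdev


def zscore(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant vector carries no ranking signal; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def standardize_scores(scores):
    """scores[i][j] is the raw VLM score of image i for prompt j (assumed layout)."""
    # Step 1 (assumed): remove prompt-specific bias by z-scoring each prompt
    # column across all images.
    columns = [zscore(list(col)) for col in zip(*scores)]
    by_prompt = [list(row) for row in zip(*columns)]
    # Step 2 (assumed): remove image-specific bias by z-scoring each image
    # row across all prompts.
    return [zscore(row) for row in by_prompt]


if __name__ == "__main__":
    raw = [
        [0.30, 0.20, 0.10],
        [0.35, 0.22, 0.18],
        [0.28, 0.25, 0.05],
    ]
    for row in standardize_scores(raw):
        print([round(v, 3) for v in row])
```

Because the sketch only reads scores, it treats the model as a black box, which matches the setting the abstract describes.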
SAM 3: Segment Anything with Concepts
arXiv (Cornell University) · 2025-11-20 · 3 citations
preprint · Open access
We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.
BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
arXiv (Cornell University) · 2025-12-11
preprint · Open access
Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.
ERM++: An Improved Baseline for Domain Generalization
2025-02-26 · 8 citations
article
Domain Generalization (DG) aims to develop classifiers that can generalize to new, unseen data distributions, a critical capability when collecting new domain-specific data is impractical. A common DG baseline minimizes the empirical risk on the source domains. Recent studies have shown that this approach, known as Empirical Risk Minimization (ERM), can outperform most more complex DG methods when properly tuned. However, these studies have primarily focused on a narrow set of hyperparameters, neglecting other factors that can enhance robustness and prevent overfitting and catastrophic forgetting, properties which are critical for strong DG performance. In our investigation of training data utilization (i.e., duration and setting validation splits), initialization, and additional regularizers, we find that tuning these previously overlooked factors significantly improves model generalization across diverse datasets without adding much complexity. We call this improved, yet simple baseline ERM++. Despite its ease of implementation, ERM++ improves DG performance by over 5% compared to prior ERM baselines on a standard benchmark of 5 datasets with a ResNet-50 and over 15% with a ViT-B/16. It also outperforms all state-of-the-art methods on DomainBed datasets with both architectures. Importantly, ERM++ is easy to integrate into existing frameworks like DomainBed, making it a practical and powerful tool for researchers and practitioners. Overall, ERM++ challenges the need for more complex DG methods by providing a stronger, more reliable baseline that maintains simplicity and ease of use. Code is available at https://github.com/piotr-teterwak/erm_plusplus
ArXiv.org · 2025-02-24
preprint · Open access
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Existing approaches require prompt tuning or architectural adaptations, limiting zero-shot applicability. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Using large language model insights on object co-occurrence, we introduce compound prompts grounded in realistic object combinations. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum compound scores are surprisingly suboptimal compared to second-highest scores. We address these through a debiasing and score-fusion algorithm that corrects image bias and clarifies VLM response behaviors. Our method enhances other zero-shot approaches, consistently improving their results. Experiments show superior mean Average Precision (mAP) compared to methods requiring training data, achieved through refined object ranking for robust zero-shot MLR.
Scaling Up Temporal Domain Generalization via Temporal Experts Averaging
2025-01-01
article · Open access
Temporal Domain Generalization (TDG) aims to generalize across temporal distribution shifts, e.g., lexical change over time. Prior work often addresses this by predicting future model weights. However, full model prediction is prohibitively expensive for even reasonably sized models. Thus, recent methods only predict the classifier layer, limiting generalization by failing to adjust other model components. To address this, we propose Temporal Experts Averaging (TEA), a novel and scalable TDG framework that updates the entire model using weight averaging to maximize generalization potential while minimizing computational costs. Our theoretical analysis guides us to two steps that enhance generalization to future domains. First, we create expert models with functional diversity yet parameter similarity by fine-tuning a domain-agnostic base model on individual temporal domains while constraining weight changes. Second, we optimize the bias-variance tradeoff through adaptive averaging coefficients derived from modeling temporal weight trajectories in a principal component subspace. Experts' contributions are based on their projected proximity to future domains. Extensive experiments across 7 TDG benchmarks, 5 models, and 2 TDG settings show that TEA outperforms prior TDG methods by up to 69% while being up to 60x more efficient.
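As a rough illustration of proximity-weighted averaging of expert weights, the sketch below combines flattened expert weight vectors using coefficients that grow as an expert's distance to the target domain shrinks. The abstract does not state how the coefficients are computed, so the softmax over negative distances, the `temperature` parameter, and the function names are all assumptions made for this example. The paper's principal component trajectory modeling is not reproduced here.

```python
import math


def proximity_coefficients(distances, temperature=1.0):
    """Turn per-expert distances into averaging coefficients that sum to 1.

    Assumption: a softmax over negative distances, so closer experts
    receive larger coefficients.
    """
    logits = [-d / temperature for d in distances]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_experts(expert_weights, coefficients):
    """Coefficient-weighted average of flattened expert weight vectors."""
    size = len(expert_weights[0])
    return [
        sum(c * w[i] for c, w in zip(coefficients, expert_weights))
        for i in range(size)
    ]


if __name__ == "__main__":
    experts = [[1.0, 0.0], [3.0, 2.0]]
    # Hypothetical distances of each expert to the predicted future domain.
    coeffs = proximity_coefficients([2.0, 0.5])
    print(coeffs, average_experts(experts, coeffs))
```

With equal distances the coefficients are uniform and the result reduces to plain weight averaging.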
Computer Vision and Image Understanding · 2025-12-17
article
OP-LoRA: The Blessing of Dimensionality
arXiv (Cornell University) · 2024-12-13
preprint · Open access
Low-rank adapters enable fine-tuning of large models with only a small number of parameters, thus reducing storage costs and minimizing the risk of catastrophic forgetting. However, they often pose optimization challenges, with poor convergence. To overcome these challenges, we introduce an over-parameterized approach that accelerates training without increasing inference costs. This method reparameterizes low-rank adaptation by employing a separate MLP and learned embedding for each layer. The learned embedding is input to the MLP, which generates the adapter parameters. Such overparameterization has been shown to implicitly function as an adaptive learning rate and momentum, accelerating optimization. At inference time, the MLP can be discarded, leaving behind a standard low-rank adapter. To study the effect of MLP overparameterization on a small yet difficult proxy task, we implement it for matrix factorization, and find it achieves faster convergence and lower final loss. Extending this approach to larger-scale tasks, we observe consistent performance gains across domains. We achieve improvements in vision-language tasks and especially notable increases in image generation, with CMMD scores improving by up to 15 points.
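The reparameterization described above (a per-layer embedding fed to an MLP that emits the adapter parameters, with the MLP dropped at inference) can be sketched structurally in plain Python. Everything named here is hypothetical: the class `GeneratedAdapter`, the single hidden layer with ReLU, the embedding and hidden sizes, and the random initialization are illustrative assumptions, and no training loop is shown.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class GeneratedAdapter:
    """Hypothetical per-layer generator: embedding -> MLP -> (A, B)."""

    def __init__(self, d_in, d_out, rank, embed_dim=4, hidden=8, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        # In training, the embedding and both MLP layers would be learned.
        self.embedding = [rng.uniform(-1.0, 1.0) for _ in range(embed_dim)]
        self.w1 = rand_matrix(hidden, embed_dim, rng)
        self.w2 = rand_matrix(rank * d_in + d_out * rank, hidden, rng)

    def generate(self):
        """Produce A (rank x d_in) and B (d_out x rank) from the embedding."""
        hidden = [max(0.0, h) for h in matvec(self.w1, self.embedding)]  # ReLU
        flat = matvec(self.w2, hidden)
        split = self.rank * self.d_in
        a_flat, b_flat = flat[:split], flat[split:]
        A = [a_flat[i * self.d_in:(i + 1) * self.d_in] for i in range(self.rank)]
        B = [b_flat[i * self.rank:(i + 1) * self.rank] for i in range(self.d_out)]
        return A, B


def export_adapter(generator):
    """Run the MLP once and keep only A and B; the generator can be discarded."""
    return generator.generate()


def adapted_forward(w0, A, B, x):
    """Standard low-rank adapter forward pass: y = W0 x + B (A x)."""
    base = matvec(w0, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


if __name__ == "__main__":
    gen = GeneratedAdapter(d_in=3, d_out=2, rank=1)
    A, B = export_adapter(gen)
    w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(adapted_forward(w0, A, B, [1.0, 2.0, 3.0]))
```

After `export_adapter`, the forward pass touches only `A` and `B`, which is why the extra MLP parameters add no inference cost in this sketch.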
Koala: Key Frame-Conditioned Long Video-LLM
2024-06-16 · 16 citations
article · Senior author
Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short seconds-long videos, vLLMs are unable to understand minutes-long videos and accurately answer questions about them. To address this limitation, we propose a lightweight and self-supervised approach, Key frame-conditioned long video-LLM (Koala), that introduces learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to longer videos. Our approach introduces two new tokenizers that condition on visual tokens computed from sparse video key frames for understanding short and long video moments. We train our proposed approach on HowTo100M and demonstrate its effectiveness on zero-shot long video understanding benchmarks, where it outperforms state-of-the-art large models by 3–6% in absolute accuracy across all tasks. Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
Learning to Compose SuperWeights for Neural Parameter Allocation Search
2024-01-03 · 3 citations
article
Neural parameter allocation search (NPAS) automates parameter sharing by obtaining weights for a network given an arbitrary, fixed parameter budget. Prior work has two major drawbacks we aim to address. First, there is a disconnect in the sharing pattern between the search and training steps, where weights are warped for layers of different sizes during the search to measure similarity, but not during training, resulting in reduced performance. To address this, we generate layer weights by learning to compose sets of SuperWeights, which represent a group of trainable parameters. These SuperWeights are created to be large enough so they can be used to represent any layer in the network, but small enough that they are computationally efficient. The second drawback we address is the method of measuring similarity between shared parameters. Whereas prior work compared the weights themselves, we argue this does not take into account the amount of conflict between the shared weights. Instead, we use gradient information to identify layers with shared weights that wish to diverge from each other. We demonstrate that our SuperWeight Networks consistently boost performance over the state-of-the-art on the ImageNet and CIFAR datasets in the NPAS setting. We further show that our approach can generate parameters for many network architectures using the same set of weights. This enables us to support tasks like efficient ensembling and anytime prediction, outperforming fully-parameterized ensembles with 17% fewer parameters.
Recent grants
CI-NEW: Collaborative Research: COVE-Computer Vision Exchange for Data, Annotations and Tools
NSF · $206k · 2016–2020
EAGER: Quantifying and Reducing Data Bias in Object Detection Using Physics-based Image Synthesis
NSF · $186k · 2014–2017
AitF: FULL: Collaborative Research: PEARL: Perceptual Adaptive Representation Learning in the Wild
NSF · $200k · 2015–2017
AitF: FULL: Collaborative Research: PEARL: Perceptual Adaptive Representation Learning in the Wild
NSF · $174k · 2016–2020
NSF · $316k · 2017–2022
Frequent coauthors
- 147 shared
Trevor Darrell
- 70 shared
Bryan A. Plummer
- 65 shared
Judy Hoffman
- 56 shared
Stan Sclaroff
Boston University
- 43 shared
Marcus Rohrbach
- 42 shared
Rogério Feris
IBM (United States)
- 41 shared
Kuniaki Saito
- 33 shared
Rameswar Panda
Education
Ph.D.
MIT
Other
UMass Lowell
Other
International Computer Science Institute
Other
UC Berkeley EECS
Other
School of Engineering and Applied Science at Harvard University