
Judy Hoffman
VerifiedGeorgia Institute of Technology · Computer Science
Active 1972–2026
About
Dr. Judy Hoffman is an Associate Professor in the School of Interactive Computing at Georgia Tech, where she is also a member of the Machine Learning Center and a Diversity and Inclusion Fellow. Her research focuses on the intersection of computer vision and machine learning, with specialization in domain adaptation, transfer learning, adversarial robustness, and algorithmic fairness. She has received numerous awards, including NSF CAREER, Google Research Scholar Award, Samsung AI Researcher of the Year Award, NVIDIA female leader in computer vision award, AIMiner top 100 most influential scholars in Machine Learning, MIT EECS Rising Star, and multiple best paper awards. In addition to her research, she co-founded and continues to advise Women in Computer Vision, an organization that provides mentorship and travel support for early-career women in the computer vision community. Prior to joining Georgia Tech, she was a Research Scientist at Facebook AI Research. She earned her PhD in Electrical Engineering and Computer Science from UC Berkeley, followed by Postdoctoral research at Stanford University and UC Berkeley.
Research topics
- Computer science
- Artificial intelligence
- Machine learning
- Computer vision
- Data mining
Selected publications
EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
arXiv (Cornell University) · 2026-04-08
articleOpen accessRobot learning increasingly depends on large and diverse data, yet robot data collection remains expensive and difficult to scale. Egocentric human data offer a promising alternative by capturing rich manipulation behavior across everyday environments. However, existing human datasets are often limited in scope, difficult to extend, and fragmented across institutions. We introduce EgoVerse, a collaborative platform for human data-driven robot learning that unifies data collection, processing, and access under a shared framework, enabling contributions from individual researchers, academic labs, and industry partners. The current release includes 1,362 hours (80k episodes) of human demonstrations spanning 1,965 tasks, 240 scenes, and 2,087 unique demonstrators, with standardized formats, manipulation-relevant annotations, and tooling for downstream learning. Beyond the dataset, we conduct a large-scale study of human-to-robot transfer with experiments replicated across multiple labs, tasks, and robot embodiments under shared protocols. We find that policy performance generally improves with increased human data, but that effective scaling depends on alignment between human data and robot learning objectives. Together, the dataset, platform, and study establish a foundation for reproducible progress in human data-driven robot learning. Videos and additional information can be found at https://egoverse.ai/
EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
arXiv (Cornell University) · 2026-04-08
preprintOpen accessRobot learning increasingly depends on large and diverse data, yet robot data collection remains expensive and difficult to scale. Egocentric human data offer a promising alternative by capturing rich manipulation behavior across everyday environments. However, existing human datasets are often limited in scope, difficult to extend, and fragmented across institutions. We introduce EgoVerse, a collaborative platform for human data-driven robot learning that unifies data collection, processing, and access under a shared framework, enabling contributions from individual researchers, academic labs, and industry partners. The current release includes 1,362 hours (80k episodes) of human demonstrations spanning 1,965 tasks, 240 scenes, and 2,087 unique demonstrators, with standardized formats, manipulation-relevant annotations, and tooling for downstream learning. Beyond the dataset, we conduct a large-scale study of human-to-robot transfer with experiments replicated across multiple labs, tasks, and robot embodiments under shared protocols. We find that policy performance generally improves with increased human data, but that effective scaling depends on alignment between human data and robot learning objectives. Together, the dataset, platform, and study establish a foundation for reproducible progress in human data-driven robot learning. Videos and additional information can be found at https://egoverse.ai/
Resolving Interference (RI): Disentangling Models for Improved Model Merging
arXiv (Cornell University) · 2026-03-13
preprintOpen accessSenior authorModel merging has shown that multitask models can be created by directly combining the parameters of different models that are each specialized on tasks of interest. However, models trained independently on distinct tasks often exhibit interference that degrades the merged model's performance. To solve this problem, we formally define the notion of Cross-Task Interference as the drift in the representation of the merged model relative to its constituent models. Reducing cross-task interference is key to improving merging performance. To address this issue, we propose our method, Resolving Interference (RI), a light-weight adaptation framework which disentangles expert models to be functionally orthogonal to the space of other tasks, thereby reducing cross-task interference. RI does this whilst using only unlabeled auxiliary data as input (i.e., no task-data is needed), allowing it to be applied in data-scarce scenarios. RI consistently improves the performance of state-of-the-art merging methods by up to 3.8% and generalization to unseen domains by up to 2.3%. We also find RI to be robust to the source of auxiliary input while being significantly less sensitive to tuning of merging hyperparameters. Our codebase is available at: https://github.com/pramesh39/resolving_interference
EgoMimic: Scaling Imitation Learning via Egocentric Video
2025-05-19 · 9 citations
articleThe scale and diversity of demonstration data required for imitation learning is a significant challenge. We present EgoMimic, a full-stack framework which scales manipulation via human embodiment data, specifically egocentric human videos paired with 3D hand tracking. EgoMimic achieves this through: (1) a system to capture human embodiment data using the ergonomic Project Aria glasses, (2) a low-cost bimanual manipulator that minimizes the kinematic gap to human data, (3) cross-domain data alignment techniques, and (4) an imitation learning architecture that co-trains on human and robot data. Compared to prior works that only extract high-level intent from human videos, our approach treats human and robot data equally as embodied demonstration data and learns a unified policy from both data sources. EgoMimic achieves significant improvement on a diverse set of long-horizon, single-arm and bimanual manipulation tasks over state-of-the-art imitation learning methods and enables generalization to entirely new scenes. Finally, we show a favorable scaling trend for EgoMimic, where adding 1 hour of additional hand data is significantly more valuable than 1 hour of additional robot data. Videos and additional information can be found at https://egomimic.github.io/
Improving Personalized Search with Regularized Low-Rank Parameter Updates
ArXiv.org · 2025-06-11
preprintOpen accessPersonalized vision-language retrieval seeks to recognize new concepts (e.g. "my dog Fido") from only a few examples. This task is challenging because it requires not only learning a new concept from a few images, but also integrating the personal and general knowledge together to recognize the concept in different contexts. In this paper, we show how to effectively adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaption of a small set of parameters in the language encoder's final layer serves as a highly effective alternative to textual inversion for recognizing the personal concept while preserving general knowledge. Additionally, we explore strategies for combining parameters of multiple learned personal concepts, finding that parameter addition is effective. To evaluate how well general knowledge is preserved in a finetuned representation, we introduce a metric that measures image retrieval accuracy based on captions generated by a vision language model (VLM). Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries - DeepFashion2 and ConCon-Chi - outperforming the prior art by 4%-22% on personal retrievals.
Emergence of Human to Robot Transfer in Vision-Language-Action Models
arXiv (Cornell University) · 2025-12-27
preprintOpen accessVision-language-action (VLA) models can enable broad open world generalization, but require large and diverse datasets. It is appealing to consider whether some of this data can come from human videos, which cover diverse real-world situations and are easy to obtain. However, it is difficult to train VLAs with human videos alone, and establishing a mapping between humans and robots requires manual engineering and presents a major research challenge. Drawing inspiration from advances in large language models, where the ability to learn from diverse supervision emerges with scale, we ask whether a similar phenomenon holds for VLAs that incorporate human video data. We introduce a simple co-training recipe, and find that human-to-robot transfer emerges once the VLA is pre-trained on sufficient scenes, tasks, and embodiments. Our analysis suggests that this emergent capability arises because diverse pretraining produces embodiment-agnostic representations for human and robot data. We validate these findings through a series of experiments probing human to robot skill transfer and find that with sufficiently diverse robot pre-training our method can nearly double the performance on generalization settings seen only in human data.
Emergence of Human to Robot Transfer in Vision-Language-Action Models
ArXiv.org · 2025-12-27
articleOpen accessVision-language-action (VLA) models can enable broad open world generalization, but require large and diverse datasets. It is appealing to consider whether some of this data can come from human videos, which cover diverse real-world situations and are easy to obtain. However, it is difficult to train VLAs with human videos alone, and establishing a mapping between humans and robots requires manual engineering and presents a major research challenge. Drawing inspiration from advances in large language models, where the ability to learn from diverse supervision emerges with scale, we ask whether a similar phenomenon holds for VLAs that incorporate human video data. We introduce a simple co-training recipe, and find that human-to-robot transfer emerges once the VLA is pre-trained on sufficient scenes, tasks, and embodiments. Our analysis suggests that this emergent capability arises because diverse pretraining produces embodiment-agnostic representations for human and robot data. We validate these findings through a series of experiments probing human to robot skill transfer and find that with sufficiently diverse robot pre-training our method can nearly double the performance on generalization settings seen only in human data.
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
International Journal of Computer Vision · 2025-11-24 · 1 citations
articleImproving Personalized Search with Regularized Low-Rank Parameter Updates
2025-06-10
articlePersonalized vision-language retrieval seeks to recognize new concepts (e.g., “my dog Fido”) from only a few examples. This task is challenging because it requires not only learning a new concept from a few images, but also integrating the personal and general knowledge together to recognize the concept in different contexts. In this paper, we show how to effectively adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaption of a small set of parameters in the language encoder’s final layer serves as a highly effective alternative to textual inversion for recognizing the personal concept while preserving general knowledge. Additionally, we explore strategies for combining parameters of multiple learned personal concepts, finding that parameter addition is effective. To evaluate how well general knowledge is preserved in a finetuned representation, we introduce a metric that measures image retrieval accuracy based on captions generated by a vision language model (VLM). Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries – DeepFashion2 and ConCon-Chi – outperforming the prior art by 4% − 22% on personal retrievals.
Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
2025-06-10 · 14 citations
articleWe address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person’s gaze target requires reasoning both about the person’s appearance and the contents of the scene. Prior works have developed increasingly complex, handcrafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices. Our code and models are available at: http://github.com/fkryan/gazelle.
Frequent coauthors
- 88 shared
Trevor Darrell
- 65 shared
Kate Saenko
- 37 shared
Eric Tzeng
- 27 shared
Jeff Donahue
- 22 shared
Viraj Prabhu
R.M.D. Engineering College
- 20 shared
Daniel Bolya
- 19 shared
Prithvijit Chattopadhyay
Georgia Institute of Technology
- 14 shared
Dhruv Batra
Awards & honors
- NSF CAREER
- Google Research Scholar Award
- Samsung AI Researcher of the Year Award
- NVIDIA female leader in computer vision award
- AIMiner top 100 most influential scholars in Machine Learnin…
- Resume-aware match score
- Save to shortlist
- AI-drafted outreach
See your match with Judy Hoffman
PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.
- Free to start
- No credit card
- 30-second signup