Dieter Fox
Professor · University of Washington · Computer Science & Engineering
Active 1996–2025
About
Dieter Fox is a Professor in the Allen School of Computer Science & Engineering at the University of Washington. He grew up in Bonn, Germany, and received his Ph.D. in 1998 from the Computer Science Department at the University of Bonn. He joined the University of Washington faculty in the fall of 2000 and currently divides his time between the University of Washington and the Allen Institute for AI (Ai2). He leads the UW Robotics and State Estimation Lab (RSE-Lab). His research interests focus on robotics, artificial intelligence, and state estimation, with the goal of enabling systems to interact intelligently with people and their environment. Much of his work centers on perception and its connection to control, developing techniques to extract relevant information from raw sensor data. Application areas of his research include human activity recognition, 3D mapping and tracking, and robot manipulation and control. Dieter Fox is recognized as a Fellow of the AAAI, ACM, and IEEE, and has received prestigious awards such as the IEEE RAS Pioneer Award and the IJCAI John McCarthy Award. He has served as an editor of the IEEE Transactions on Robotics and has held leadership roles including Program Chair for the Robotics Science and Systems conference and the AAAI Conference. He teaches courses in robotics and AI at both undergraduate and graduate levels, including undergraduate capstone courses on robotics and interactive systems enabled by RGB-D cameras. He works closely with graduate students, undergraduate students, and postdoctoral researchers in his lab, fostering a collaborative research environment through weekly meetings.
Research topics
- Computer Science
- Artificial Intelligence
- Computer vision
- Machine Learning
- Programming language
- Human–computer interaction
- Data Mining
- Natural Language Processing
- Engineering
- Mathematics
- Telecommunications
- Geometry
- Industrial engineering
- Theoretical computer science
- Software engineering
- Psychology
- Mechanical engineering
- Physics
- Embedded system
- Systems engineering
- Engineering drawing
- Mathematical optimization
- Control engineering
- Simulation
Selected publications
RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation
ArXiv.org · 2025-07-01 · 1 citation
Preprint · Open access
We introduce RoboEval, a structured evaluation framework and benchmark for robotic manipulation that augments binary success with principled behavioral and outcome metrics. Existing evaluations often collapse performance into outcome counts, masking differences in execution quality and obscuring failure structure. RoboEval provides eight bimanual tasks with systematically controlled variations, more than three thousand expert demonstrations, and a modular simulation platform for reproducible experimentation. All tasks are instrumented with standardized metrics that quantify efficiency, coordination, and safety/stability, as well as outcome measures that trace stagewise progress and localize failure modes. Through extensive experiments with state-of-the-art visuomotor policies, we validate these metrics by analyzing their stability under variation, discriminative power across policies with similar success rates, and correlation with task success. Project Page: https://robo-eval.github.io
HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation
ArXiv.org · 2025-02-08 · 1 citation
Preprint · Open access
Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is the lack of robotic data, which are typically obtained through expensive on-robot operation. A promising remedy is to leverage cheaper, off-domain data such as action-free videos, hand-drawn sketches or simulation data. In this work, we posit that hierarchical vision-language-action (VLA) models can be more effective in utilizing off-domain data than standard monolithic VLA models that directly finetune vision-language models (VLMs) to predict actions. In particular, we study a class of hierarchical VLA models, where the high-level VLM is finetuned to produce a coarse 2D path indicating the desired robot end-effector trajectory given an RGB image and a task description. The intermediate 2D path prediction is then served as guidance to the low-level, 3D-aware control policy capable of precise manipulation. Doing so alleviates the high-level VLM from fine-grained action prediction, while reducing the low-level policy's burden on complex task-level reasoning. We show that, with the hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain finetuning data and real-robot testing scenarios, including differences on embodiments, dynamics, visual appearances and task semantics, etc. In the real-robot experiments, we observe an average of 20% improvement in success rate across seven different axes of generalization over OpenVLA, representing a 50% relative gain. Visual results, code, and dataset are provided at: https://hamster-robot.github.io/
MolmoAct: Action Reasoning Models that can Reason in Space
ArXiv.org · 2025-08-11
Preprint · Open access
Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of robotic foundation models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves strong performance across simulation and real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1.5; 86.6% average success on LIBERO, including an additional 6.3% gain over ThinkAct on long-horizon tasks; and in real-world fine-tuning, an additional 10% (single-arm) and an additional 22.7% (bimanual) task progression over Pi-0-FAST. It also outperforms baselines by an additional 23.3% on out-of-distribution generalization and achieves top human-preference scores for open-ended instruction following and trajectory steering. Furthermore, we release, for the first time, the MolmoAct Dataset -- a mid-training robot dataset comprising over 10,000 high quality robot trajectories across diverse scenarios and tasks. Training with this dataset yields an average 5.5% improvement in general performance over the base model. We release all model weights, training code, our collected dataset, and our action reasoning dataset, establishing MolmoAct as both a state-of-the-art robotics foundation model and an open blueprint for building ARMs that transform perception into purposeful action through structured reasoning. Blogpost: https://allenai.org/blog/molmoact
ACGD: Visual Multitask Policy Learning with Asymmetric Critic Guided Distillation
2025-10-19
Article
We present Asymmetric Critic Guided Distillation, ACGD, a framework for learning multi-task dexterous manipulation policies that can manipulate articulated objects using images as input. ACGD is a scalable student-teacher distillation approach that utilizes behavior cloning to distill multiple expert policies into a single vision-based, multi-task student policy for dexterous manipulation. The expert policies are trained with traditional RL techniques with access to privileged state information of both the robot and the manipulated object, while the distilled student policy operates under realistic sensory constraints, specifically using only camera images and robot proprioception. During distillation, we use an expert-critic that provides action labels and value estimates to refine the student’s action sampling through a dual IL/RL objective. In the multi-task setting, we achieve this through an aggregate critic for different single-task experts. Our approach exhibits strong performance compared to a number of state-of-the-art imitation learning (IL) and reinforcement learning (RL) baselines. We evaluate across a variety of multi-task dexterous manipulation benchmarks including bimanual manipulation, single-hand object articulation tasks, and a tendon-actuated hand and achieves state-of-the-art performance with 10-15% improvement over the baseline algorithms. Visit our website for more details.
SRSA: Skill Retrieval and Adaptation for Robotic Assembly Tasks
ArXiv.org · 2025-03-06
Preprint · Open access
Enabling robots to learn novel tasks in a data-efficient manner is a long-standing challenge. Common strategies involve carefully leveraging prior experiences, especially transition data collected on related tasks. Although much progress has been made for general pick-and-place manipulation, far fewer studies have investigated contact-rich assembly tasks, where precise control is essential. We introduce SRSA (Skill Retrieval and Skill Adaptation), a novel framework designed to address this problem by utilizing a pre-existing skill library containing policies for diverse assembly tasks. The challenge lies in identifying which skill from the library is most relevant for fine-tuning on a new task. Our key hypothesis is that skills showing higher zero-shot success rates on a new task are better suited for rapid and effective fine-tuning on that task. To this end, we propose to predict the transfer success for all skills in the skill library on a novel task, and then use this prediction to guide the skill retrieval process. We establish a framework that jointly captures features of object geometry, physical dynamics, and expert actions to represent the tasks, allowing us to efficiently learn the transfer success predictor. Extensive experiments demonstrate that SRSA significantly outperforms the leading baseline. When retrieving and fine-tuning skills on unseen tasks, SRSA achieves a 19% relative improvement in success rate, exhibits 2.6x lower standard deviation across random seeds, and requires 2.4x fewer transition samples to reach a satisfactory success rate, compared to the baseline. Furthermore, policies trained with SRSA in simulation achieve a 90% mean success rate when deployed in the real world. Please visit our project webpage https://srsa2024.github.io/.
DexMachina: Functional Retargeting for Bimanual Dexterous Manipulation
ArXiv.org · 2025-05-30
Preprint · Open access
We study the problem of functional retargeting: learning dexterous manipulation policies to track object states from human hand-object demonstrations. We focus on long-horizon, bimanual tasks with articulated objects, which is challenging due to large action space, spatiotemporal discontinuities, and embodiment gap between human and robot hands. We propose DexMachina, a novel curriculum-based algorithm: the key idea is to use virtual object controllers with decaying strength: an object is first driven automatically towards its target states, such that the policy can gradually learn to take over under motion and contact guidance. We release a simulation benchmark with a diverse set of tasks and dexterous hands, and show that DexMachina significantly outperforms baseline methods. Our algorithm and benchmark enable a functional comparison for hardware designs, and we present key findings informed by quantitative and qualitative results. With the recent surge in dexterous hand development, we hope this work will provide a useful platform for identifying desirable hardware capabilities and lower the barrier for contributing to future research. Videos and more at https://project-dexmachina.github.io/
SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation
arXiv (Cornell University) · 2025-01-30
Preprint · Open access
Robotic manipulation systems operating in diverse, dynamic environments must exhibit three critical abilities: multitask interaction, generalization to unseen scenarios, and spatial memory. While significant progress has been made in robotic manipulation, existing approaches often fall short in generalization to complex environmental variations and addressing memory-dependent tasks. To bridge this gap, we introduce SAM2Act, a multi-view robotic transformer-based policy that leverages multi-resolution upsampling with visual representations from large-scale foundation model. SAM2Act achieves a state-of-the-art average success rate of 86.8% across 18 tasks in the RLBench benchmark, and demonstrates robust generalization on The Colosseum benchmark, with only a 4.3% performance gap under diverse environmental perturbations. Building on this foundation, we propose SAM2Act+, a memory-based architecture inspired by SAM2, which incorporates a memory bank, an encoder, and an attention mechanism to enhance spatial memory. To address the need for evaluating memory-dependent tasks, we introduce MemoryBench, a novel benchmark designed to assess spatial memory and action recall in robotic manipulation. SAM2Act+ achieves an average success rate of 94.3% on memory-based tasks in MemoryBench, significantly outperforming existing approaches and pushing the boundaries of memory-based robotic systems. Project page: sam2act.github.io.
ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training
ArXiv.org · 2025-09-01
Preprint · Open access · Senior author
This paper introduces ManiFlow, a visuomotor imitation learning policy for general robot manipulation that generates precise, high-dimensional actions conditioned on diverse visual, language and proprioceptive inputs. We leverage flow matching with consistency training to enable high-quality dexterous action generation in just 1-2 inference steps. To handle diverse input modalities efficiently, we propose DiT-X, a diffusion transformer architecture with adaptive cross-attention and AdaLN-Zero conditioning that enables fine-grained feature interactions between action tokens and multi-modal observations. ManiFlow demonstrates consistent improvements across diverse simulation benchmarks and nearly doubles success rates on real-world tasks across single-arm, bimanual, and humanoid robot setups with increasing dexterity. The extensive evaluation further demonstrates the strong robustness and generalizability of ManiFlow to novel objects and background changes, and highlights its strong scaling capability with larger-scale datasets. Our website: maniflow-policy.github.io.
RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills
ArXiv.org · 2025-06-17
Preprint · Open access
Endowing robots with tool design abilities is critical for enabling them to solve complex manipulation tasks that would otherwise be intractable. While recent generative frameworks can automatically synthesize task settings, such as 3D scenes and reward functions, they have not yet addressed the challenge of tool-use scenarios. Simply retrieving human-designed tools might not be ideal since many tools (e.g., a rolling pin) are difficult for robotic manipulators to handle. Furthermore, existing tool design approaches either rely on predefined templates with limited parameter tuning or apply generic 3D generation methods that are not optimized for tool creation. To address these limitations, we propose RobotSmith, an automated pipeline that leverages the implicit physical knowledge embedded in vision-language models (VLMs) alongside the more accurate physics provided by physics simulations to design and use tools for robotic manipulation. Our system (1) iteratively proposes tool designs using collaborative VLM agents, (2) generates low-level robot trajectories for tool use, and (3) jointly optimizes tool geometry and usage for task performance. We evaluate our approach across a wide range of manipulation tasks involving rigid, deformable, and fluid objects. Experiments show that our method consistently outperforms strong baselines in terms of both task success rate and overall performance. Notably, our approach achieves a 50.0% average success rate, significantly surpassing other baselines such as 3D generation (21.4%) and tool retrieval (11.1%). Finally, we deploy our system in real-world settings, demonstrating that the generated tools and their usage plans transfer effectively to physical execution, validating the practicality and generalization capabilities of our approach.
GraspGen: A Diffusion-based Framework for 6-DOF Grasping with On-Generator Training
ArXiv.org · 2025-07-17
Preprint · Open access
Grasping is a fundamental robot skill, yet despite significant research advancements, learning-based 6-DOF grasping approaches are still not turnkey and struggle to generalize across different embodiments and in-the-wild settings. We build upon the recent success on modeling the object-centric grasp generation process as an iterative diffusion process. Our proposed framework, GraspGen, consists of a DiffusionTransformer architecture that enhances grasp generation, paired with an efficient discriminator to score and filter sampled grasps. We introduce a novel and performant on-generator training recipe for the discriminator. To scale GraspGen to both objects and grippers, we release a new simulated dataset consisting of over 53 million grasps. We demonstrate that GraspGen outperforms prior methods in simulations with singulated objects across different grippers, achieves state-of-the-art performance on the FetchBench grasping benchmark, and performs well on a real robot with noisy visual observations.
Recent grants
NRI: Rich Task Perception for Programming by Demonstration
NSF · $1.2M · 2015–2019
Collaborative Research: NRI: FND: Graph Neural Networks for Multi-Object Manipulation
NSF · $429k · 2020–2023
NRI: Collaborative Research: Experiential Learning for Robots: From Physics to Actions to Tasks
NSF · $760k · 2016–2020
RI-Small: Statistical Relational Models for Semantic Robot Mapping
NSF · $400k · 2008–2012
Frequent coauthors
- Arsalan Mousavian · 99 shared
- Byron Boots · 85 shared
- Fábio Ramos · 79 shared
- Nathan Ratliff · 62 shared
- Chris Paxton · 59 shared
- Balakumar Sundaralingam · 58 shared
- Yashraj Narang · 58 shared
- Clemens Eppner · 51 shared
Education
- 1998 · Ph.D. · University of Bonn
Awards & honors
- Fellow of the AAAI
- Fellow of the ACM
- Fellow of the IEEE
- IEEE RAS Pioneer Award
- IJCAI John McCarthy Award