Jim Rehg
Professor, Director of the Health Care Engineering Systems Center · University of Illinois Urbana-Champaign · Industrial and Enterprise Systems Engineering
Active 1992–2026
About
Jim Rehg is a professor and the Director of the Health Care Engineering Systems Center at the University of Illinois Urbana-Champaign, within the Grainger College of Engineering. His research areas include human factors and health technology, and he develops computational tools for health-related behaviors. His recent courses include CS 598 CVH (Computer Vision for Health).
Research topics
- Artificial Intelligence
- Computer Science
- Human–computer interaction
- Machine Learning
- Multimedia
- Computer vision
- Psychology
Selected publications
IAM: Identity-Aware Human Motion and Shape Joint Generation
arXiv (Cornell University) · 2026-04-28
preprint · Open access
Recent advances in text-driven human motion generation enable models to synthesize realistic motion sequences from natural language descriptions. However, most existing approaches assume identity-neutral motion and generate movements using a canonical body representation, ignoring the strong influence of body morphology on motion dynamics. In practice, attributes such as body proportions, mass distribution, and age significantly affect how actions are performed, and neglecting this coupling often leads to physically inconsistent motions. We propose an identity-aware motion generation framework that explicitly models the relationship between body morphology and motion dynamics. Instead of relying on explicit geometric measurements, identity is represented using multimodal signals, including natural language descriptions and visual cues. We further introduce a joint motion-shape generation paradigm that simultaneously synthesizes motion sequences and body shape parameters, allowing identity cues to directly modulate motion dynamics. Extensive experiments on motion capture datasets and large-scale in-the-wild videos demonstrate improved motion realism and motion-identity consistency while maintaining high motion quality. Project page: https://vjwq.github.io/IAM
Naturalistic Language Recordings Reveal “Hypervocal” Infants at High Familial Risk for Autism
UNC Libraries · 2026-02-07
article · Open access
Children's early language environments are related to later development. Little is known about this association in siblings of children with autism spectrum disorder (ASD), who often experience language delays or have ASD. Fifty-nine 9-month-old infants at high or low familial risk for ASD contributed full-day in-home language recordings. High-risk infants produced more vocalizations than low-risk peers; conversational turns and adult words did not differ by group. Vocalization differences were driven by a subgroup of "hypervocal" infants. Despite more vocalizations overall, these infants engaged in less social babbling during a standardized clinic assessment, and they experienced fewer conversational turns relative to their rate of vocalizations. Two ways in which these individual and environmental differences may relate to subsequent development are discussed.
Narrative-Driven Paper-to-Slide Generation via ArcDeck
ArXiv.org · 2026-04-13
article · Open access · Senior author
We introduce ArcDeck, a multi-agent framework that formulates paper-to-slide generation as a structured narrative reconstruction task. Unlike existing methods that directly summarize raw text into slides, ArcDeck explicitly models the source paper's logical flow. It first parses the input to construct a discourse tree and establish a global commitment document, ensuring the high-level intent is preserved. These structural priors then guide an iterative multi-agent refinement process, where specialized agents iteratively critique and revise the presentation outline before rendering the final visual layouts and designs. To evaluate our approach, we also introduce ArcBench, a newly curated benchmark of academic paper-slide pairs. Experimental results demonstrate that explicit discourse modeling, combined with role-specific agent coordination, significantly improves the narrative flow and logical coherence of the generated presentations.
EgoForge: Goal-Directed Egocentric World Simulator
arXiv (Cornell University) · 2026-03-20
preprint · Open access
Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches either focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision, such as camera trajectories, long video prefixes, synchronized multicamera capture, etc. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent, first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory-level reward-guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, and robust performance in real-world smart-glasses experiments.
Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
arXiv (Cornell University) · 2026-03-31
preprint · Open access
We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech input. The task involves perceiving identity-attributed social cues (e.g., who is speaking what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that operate on oracle-preprocessed social cues, Omni-MMSI reflects realistic scenarios where AI assistants must perceive and reason from raw data. However, existing pipelines and multi-modal LLMs perform poorly on Omni-MMSI because they lack reliable identity attribution capabilities, which leads to inaccurate social interaction understanding. To address this challenge, we propose Omni-MMSI-R, a reference-guided pipeline that produces identity-attributed social cues with tools and conducts chain-of-thought social reasoning. To facilitate this pipeline, we construct participant-level reference pairs and curate reasoning annotations on top of the existing datasets. Experiments demonstrate that Omni-MMSI-R outperforms advanced LLMs and counterparts on Omni-MMSI. Project page: https://sampson-lee.github.io/omni-mmsi-project-page.
How Much 3D Do Video Foundation Models Encode?
arXiv (Cornell University) · 2025-12-23
preprint · Open access · Senior author
Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
International Journal of Computer Vision · 2025-11-24 · 1 citation
article
Recent grants
NSF · $225k · 2018–2022
Comp Cog: Collaborative Research on the Development of Visual Object Recognition
NSF · $314k · 2015–2019
ITR: Analysis of Complex Audio-Visual Events Using Spatially Distributed Sensors
NSF · $1.1M · 2002–2008
Collaborative Research: Creating Dynamic Social Network Models from Sensor Data
NSF · $161k · 2004–2007
NIH · $11.9M · 2020–2026
Frequent coauthors
- 31 shared
Miao Liu
Shandong University
- 27 shared
Agata Rozga
Georgia Institute of Technology
- 26 shared
Stefan Stojanov
- 24 shared
Eunji Chong
Amazon (Germany)
- 24 shared
Audrey Southerland
Georgia Institute of Technology
- 23 shared
Fiona Ryan
Georgia Institute of Technology
- 23 shared
Yin Li
Southwest University
- 21 shared
Anh Thai
Education
- 1991
Ph.D., Computer Science
University of California, Berkeley
- 1986
M.S., Computer Science
University of California, Berkeley
- 1983
B.S., Computer Science
University of California, Santa Barbara