Jim Rehg

Professor, Director of the Health Care Engineering Systems Center · Verified

University of Illinois Urbana-Champaign · Industrial and Enterprise Systems Engineering

Active 1992–2026

h-index: 74

Citations: 23.9k

Papers: 429 (118 last 5y)

Funding: $37.4M (1 active)

Faculty page

About

Jim Rehg is a professor and the Director of the Health Care Engineering Systems Center at the University of Illinois Urbana-Champaign. His research areas include human factors and health technology, and his recent courses include CS 598 CVH - Computer Vision for Health. Rehg develops computational tools for studying health-related behaviors and is involved in health-related research and education within the Grainger College of Engineering.

Research topics

Artificial Intelligence
Computer Science
Human–computer interaction
Machine Learning
Multimedia
Computer vision
Psychology

Selected publications

IAM: Identity-Aware Human Motion and Shape Joint Generation
arXiv (Cornell University) · 2026-04-28
preprint · Open access
Recent advances in text-driven human motion generation enable models to synthesize realistic motion sequences from natural language descriptions. However, most existing approaches assume identity-neutral motion and generate movements using a canonical body representation, ignoring the strong influence of body morphology on motion dynamics. In practice, attributes such as body proportions, mass distribution, and age significantly affect how actions are performed, and neglecting this coupling often leads to physically inconsistent motions. We propose an identity-aware motion generation framework that explicitly models the relationship between body morphology and motion dynamics. Instead of relying on explicit geometric measurements, identity is represented using multimodal signals, including natural language descriptions and visual cues. We further introduce a joint motion-shape generation paradigm that simultaneously synthesizes motion sequences and body shape parameters, allowing identity cues to directly modulate motion dynamics. Extensive experiments on motion capture datasets and large-scale in-the-wild videos demonstrate improved motion realism and motion-identity consistency while maintaining high motion quality. Project page: https://vjwq.github.io/IAM
Naturalistic Language Recordings Reveal “Hypervocal” Infants at High Familial Risk for Autism
UNC Libraries · 2026-02-07
article · Open access
Children's early language environments are related to later development. Little is known about this association in siblings of children with autism spectrum disorder (ASD), who often experience language delays or have ASD. Fifty-nine 9-month-old infants at high or low familial risk for ASD contributed full-day in-home language recordings. High-risk infants produced more vocalizations than low-risk peers; conversational turns and adult words did not differ by group. Vocalization differences were driven by a subgroup of "hypervocal" infants. Despite more vocalizations overall, these infants engaged in less social babbling during a standardized clinic assessment, and they experienced fewer conversational turns relative to their rate of vocalizations. Two ways in which these individual and environmental differences may relate to subsequent development are discussed.
Narrative-Driven Paper-to-Slide Generation via ArcDeck
ArXiv.org · 2026-04-13
article · Open access · Senior author
We introduce ArcDeck, a multi-agent framework that formulates paper-to-slide generation as a structured narrative reconstruction task. Unlike existing methods that directly summarize raw text into slides, ArcDeck explicitly models the source paper's logical flow. It first parses the input to construct a discourse tree and establish a global commitment document, ensuring the high-level intent is preserved. These structural priors then guide an iterative multi-agent refinement process, where specialized agents iteratively critique and revise the presentation outline before rendering the final visual layouts and designs. To evaluate our approach, we also introduce ArcBench, a newly curated benchmark of academic paper-slide pairs. Experimental results demonstrate that explicit discourse modeling, combined with role-specific agent coordination, significantly improves the narrative flow and logical coherence of the generated presentations.
EgoForge: Goal-Directed Egocentric World Simulator
arXiv (Cornell University) · 2026-03-20
preprint · Open access
Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches either focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision, such as camera trajectories, long video prefixes, synchronized multicamera capture, etc. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent, first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory-level reward-guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, and robust performance in real-world smart-glasses experiments.
Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
arXiv (Cornell University) · 2026-03-31
preprint · Open access
We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech input. The task involves perceiving identity-attributed social cues (e.g., who is speaking what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that operate on oracle-preprocessed social cues, Omni-MMSI reflects realistic scenarios where AI assistants must perceive and reason from raw data. However, existing pipelines and multi-modal LLMs perform poorly on Omni-MMSI because they lack reliable identity attribution capabilities, which leads to inaccurate social interaction understanding. To address this challenge, we propose Omni-MMSI-R, a reference-guided pipeline that produces identity-attributed social cues with tools and conducts chain-of-thought social reasoning. To facilitate this pipeline, we construct participant-level reference pairs and curate reasoning annotations on top of the existing datasets. Experiments demonstrate that Omni-MMSI-R outperforms advanced LLMs and counterparts on Omni-MMSI. Project page: https://sampson-lee.github.io/omni-mmsi-project-page.
How Much 3D Do Video Foundation Models Encode?
arXiv (Cornell University) · 2025-12-23
preprint · Open access · Senior author
Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
International Journal of Computer Vision · 2025-11-24 · 1 citation
article

Recent grants

CRI: CI-EN: Collaborative Research: mResearch: A platform for Reproducible and Extensible Mobile Sensor Big Data Research
NSF · $225k · 2018–2022
Comp Cog: Collaborative Research on the Development of Visual Object Recognition
NSF · $314k · 2015–2019
ITR: Analysis of Complex Audio-Visual Events Using Spatially Distributed Sensors
NSF · $1.1M · 2002–2008
Collaborative Research: Creating Dynamic Social Network Models from Sensor Data
NSF · $161k · 2004–2007
mHealth Center for Discovery, Optimization, and Translation of Temporally-Precise Interventions (mDOT)
NIH · $11.9M · 2020–2026

Frequent coauthors

Miao Liu
Shandong University
31 shared
Agata Rozga
Georgia Institute of Technology
27 shared
Stefan Stojanov
26 shared
Eunji Chong
Amazon (Germany)
24 shared
Audrey Southerland
Georgia Institute of Technology
24 shared
Fiona Ryan
Georgia Institute of Technology
23 shared
Yin Li
Southwest University
23 shared
Anh Thai
21 shared

Education

Ph.D., Computer Science
University of California, Berkeley
1991
M.S., Computer Science
University of California, Berkeley
1986
B.S., Computer Science
University of California, Santa Barbara
1983
