Romit Roy Choudhury

Professor, Electrical and Computer Engineering

University of Illinois Urbana-Champaign · Computer Science

Active 2002–2026

h-index 51

Citations 11.4k

Papers 201 · 19 last 5y

Funding $3.7M

About

Romit Roy Choudhury is a professor in the Electrical and Computer Engineering (ECE) and Computer Science (CS) departments at the University of Illinois Urbana-Champaign, where he earned his Ph.D. in Computer Science in 2006. His research interests include generative models, inverse problems, neural imaging such as NeRFs, blackbox optimization, wireless sensing, and signal processing. He has been an Amazon Scholar since 2022, was a Visiting Principal Scientist at the Samsung AI Center in Cambridge, UK, in Fall 2019, and has been a faculty member at the University of Illinois since August 2017. Before that, he was an associate professor at Duke University and held research positions at Microsoft Research and Intel. His research group works on artificial intelligence, systems, and networking, with an emphasis on imaging and sensing technologies. He has been recognized for teaching excellence, appearing on the campus list of teachers ranked as excellent by their students in multiple recent years.

Research topics

Computer Science
Human–computer interaction
Speech recognition
Acoustics
Data science
Embedded system
Telecommunications
Computer vision
Business
Engineering
Computer network
Physics

Selected publications

Personalized Image Generation via Human-in-the-loop Bayesian Optimization
Open MIND · 2026-02-02
preprint · Senior author
Imagine Alice has a specific image $x^\ast$ in her mind, say, the view of the street in which she grew up during her childhood. To generate that exact image, she guides a generative model with multiple rounds of prompting and arrives at an image $x^{p*}$. Although $x^{p*}$ is reasonably close to $x^\ast$, Alice finds it difficult to close that gap using language prompts. This paper aims to narrow this gap by observing that even after language has reached its limits, humans can still tell when a new image $x^+$ is closer to $x^\ast$ than $x^{p*}$. Leveraging this observation, we develop MultiBO (Multi-Choice Preferential Bayesian Optimization) that carefully generates $K$ new images as a function of $x^{p*}$, gets preferential feedback from the user, uses the feedback to guide the diffusion model, and ultimately generates a new set of $K$ images. We show that within $B$ rounds of user feedback, it is possible to arrive much closer to $x^\ast$, even though the generative model has no information about $x^\ast$. Qualitative scores from $30$ users, combined with quantitative metrics compared across $5$ baselines, show promising results, suggesting that multi-choice feedback from humans can be effectively harnessed for personalized image generation.
Unified Diffusion Refinement for Multi-Channel Speech Enhancement and Separation
ArXiv.org · 2026-03-25
article · Open access · Senior author
We propose Uni-ArrayDPS, a novel diffusion-based refinement framework for unified multi-channel speech enhancement and separation. Existing methods for multi-channel speech enhancement/separation are mostly discriminative and are highly effective at producing high-SNR outputs. However, they can still generate unnatural speech with non-linear distortions caused by the neural network and regression-based objectives. To address this issue, we propose Uni-ArrayDPS, which refines the outputs of any strong discriminative model using a speech diffusion prior. Uni-ArrayDPS is generative, array-agnostic, and training-free, and supports both enhancement and separation. Given a discriminative model's enhanced/separated speech, we use it, together with the noisy mixtures, to estimate the noise spatial covariance matrix (SCM). We then use this SCM to compute the likelihood required for diffusion posterior sampling of the clean speech source(s). Uni-ArrayDPS requires only a pre-trained clean-speech diffusion model as a prior and does not require additional training or fine-tuning, allowing it to generalize directly across tasks (enhancement/separation), microphone array geometries, and discriminative model backbones. Extensive experiments show that Uni-ArrayDPS consistently improves a wide range of discriminative models for both enhancement and separation tasks. We also report strong results on a real-world dataset. Audio demos are provided at https://xzwy.github.io/Uni-ArrayDPS/.
Discrete Langevin-Inspired Posterior Sampling
ArXiv.org · 2026-05-10
article · Open access · Senior author
We study posterior sampling for inverse problems in discrete state spaces using discrete diffusion models as generative priors. While continuous diffusion models have become widely used for inverse problems, their discrete counterparts remain comparatively underexplored. Existing discrete posterior samplers often rely on continuous relaxations of discrete variables, Gibbs-style updates, or mechanisms specialized to particular corruption processes, which can limit scalability or generality. We propose $Δ$LPS, a Discrete Langevin-Inspired Posterior Sampler that uses gradient information to identify promising discrete moves without leaving the discrete state space. The resulting approach enables efficient parallel updates across all token dimensions and is agnostic to the training paradigm of the discrete diffusion prior, including masked and uniform-state diffusion. We evaluate our method on image restoration tasks across MNIST, CIFAR, and FFHQ, as well as spatial mapping, covering linear, nonlinear, and blind inverse problems. Across these settings, we improve over recent discrete diffusion posterior samplers and are competitive with strong continuous diffusion-based inverse solvers. Our results suggest that fully discrete, gradient-informed posterior samplers offer a scalable and general path toward solving inverse problems over discrete representations.
Dependency-Aware Discrete Diffusion for Scene Graph Generation
arXiv (Cornell University) · 2026-05-09
preprint · Open access · Senior author
Scene graphs (SGs) represent objects and their relationships as structured graphs, enabling applications in image generation, robotics, and 3D understanding. Recent work suggests that conditioning image generation on scene graphs improves compositional fidelity compared to text-only prompting. However, since users typically provide text rather than structured graphs, a key challenge is to generate scene graphs from natural language. Prior work on discrete diffusion has demonstrated success in generating generic graphs such as molecules and circuits, but fails to account for the hierarchical structure and strong dependencies between objects, edges, and relations in scene graphs. We address this limitation by introducing a dependency-aware, hierarchically constrained discrete diffusion model for scene graph generation. Our approach decouples structure and semantics across the forward and reverse processes, enabling the model to capture conditional dependencies. At inference time, we perform training-free conditioning to sample text-aligned scene graphs. We evaluate our method on standard SG benchmarks and demonstrate improvements over both continuous and discrete graph generation baselines across graph and layout metrics. When fed to downstream image generation, our approach yields improved compositional alignment compared to text-to-image models, particularly in multi-object scenarios.
AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking
arXiv (Cornell University) · 2026-01-25
preprint · Open access
Internet audio-visual clips convey meaning through time-varying sound and motion, which extend beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with a unique Q&A assessing levels of understanding from surface content to context and emotion to usage and world knowledge, along with metadata such as original year, transcript, summary, and sensitivity. We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants using this benchmark. Our results reveal a consistent limitation: current models perform poorly on textless music and sound effects, and struggle to think in context and in culture compared to surface content. These findings highlight a key gap in human-aligned multimodal intelligence and call for models that can perceive contextually and culturally beyond the surface of what they hear and see. Project page: avmemeexam.github.io/public
Inferring Indoor Layouts using Audio
2026-02-25
article · Open access · Senior author
Cameras and LiDARs underlie today's established tools for inferring indoor layouts. This paper explores audio as a complementary modality for this task. Our system emits short audio beacons from a handheld device (e.g., a smartphone) and records the resulting echoes at multiple, known locations along a user's path. Given these multi-position measurements, we infer the indoor layout, in the form of a 2D floorplan, using a generative approach. Our method employs a conditional GAN (CGAN) to synthesize feasible layouts while incorporating knowledge of indoor acoustic signal propagation to regularize training and avoid overfitting. We train on large-scale, high-fidelity simulations spanning diverse geometries, materials, and noise, then evaluate zero-shot in real homes and offices with no additional training. Results show accurate 2D floorplans with strong precision and recall, demonstrating audio's promise as a robust, privacy-preserving complement to vision and LiDAR.
Poster: Inferring Floorplans from Walking Trajectories via Contrastive Diffusion Guidance
2026-02-25
article · Open access · Senior author
Introduction. Consider a user walking around her home for a few minutes. Using some sensor, e.g., a smartphone, the user's trajectory has been recorded. This trajectory is a sequence of location measurements inside the home, $y = [y_1, y_2, \dots, y_n]$, along which the user has walked (shown in Fig. 1). We ask: given this trajectory measurement, is it possible to infer the floorplan $x$ of the home? The problem is non-trivial because an infinite number of floorplans are candidate solutions for the given trajectories; how can one identify the correct floorplan? We tackle this using generative diffusion models that learn realistic floorplan structures from data, and then generate a layout compatible with the observed trajectory. Our method Diff2Plan builds on DDPM [2] and shows robustness to sparse, medium, and dense trajectories. Across synthetic and real-world UWB trajectories, Diff2Plan outperforms baselines such as DPS [1] and CFG [3] and degrades gracefully when measurements are limited. The resulting capability can enable practical applications such as home digital twins, context-aware assistants, and AR/VR.

Recent grants

NeTS: Small: Collaborative Research: Logical Localization for Mobile Devices through Ambience Sensing
NSF · $111k · 2010–2012
NIH Grant R21DA034471
NIH · $385k · 2015
NetSE: Large: Collaborative Research: Platys: From Position to Place in Next Generation Networks
NSF · $358k · 2009–2014
NetSE: Large: Collaborative Research: Platys: From Position to Place in Next Generation Networks
NSF · $217k · 2013–2015
NeTS: Small: Collaborative: PHY-Informed Networking (PHY-IN): Rethinking Wireless Protocol Design with the Knowledge of PHY
NSF · $240k · 2010–2014

Frequent coauthors

Srihari Nelakuditi
University of South Carolina
37 shared
Souvik Sen
Baker Hughes (Germany)
29 shared
Xuan Bao
Tianjin University
21 shared
Justin Manweiler
20 shared
Pier Giorgio Masci
20 shared
Giovanni Donato Aquaro
18 shared
Mahanth Gowda
Pennsylvania State University
17 shared
Perry Elliott
St Bartholomew's Hospital
16 shared

Labs

Siebel School of Computing and Data Science · PI

Awards & honors

Campus List of Teachers Ranked as Excellent by their Student… (listed five times)
