Indranil Gupta
Professor · University of Illinois Urbana-Champaign · Computer Science
Active 1983–2026
About
Professor Indranil Gupta is a faculty member in the Department of Computer Science at the University of Illinois at Urbana-Champaign, where he leads the Distributed Protocols Research Group. His research focuses on large-scale distributed systems, including datacenters, cloud computing, Internet of Things (IoT), and distributed machine learning. He aims to build systems that are reliable, predictable, and manageable, working on core topics such as batch processing, stream processing, and distributed storage, with solutions that are widely used in industry. His work also extends to verification of distributed systems and to "distributed systems + X" collaborations that integrate distributed systems with other domains.
Gupta is an ACM Distinguished Scientist and IEEE Senior Member. His awards include the NSF CAREER award, the Junior Xerox Award for Faculty Research, the CAS/Beckman Fellowship, and the Academy for Entrepreneurial Leadership Faculty Fellowship. He has received multiple teaching recognitions at UIUC, has served as General Chair and program committee co-chair for several prominent conferences, and holds editorial roles for major journals.
His contributions include novel systems techniques that improve the reliability and manageability of distributed systems, with practical implementations that influence industry. He also works to democratize education in distributed systems, offering the world's largest online course on the subject, which is used globally by students, companies, and startups.
Research topics
- Artificial Intelligence
- Computer Science
- Distributed computing
- Theoretical computer science
- Algorithm
- Telecommunications
Selected publications
RASC: Enhancing Observability & Programmability in Smart Spaces
arXiv (Cornell University) · 2026-01-20
Preprint · Open access · Senior author
While RPCs form the bedrock of systems stacks, we posit that IoT device collections in smart spaces like homes, warehouses, and office buildings, which are all "user-facing", require a more expressive abstraction. Orthogonal to prior work, which improved the reliability of IoT communication, our work focuses on improving the observability and programmability of IoT actions. We present the RASC (Request-Acknowledge-Start-Complete) abstraction, which provides acknowledgments at critical points after an IoT device action is initiated. RASC is a better fit for IoT actions, which naturally vary in length spatially (across devices) and temporally (across time, for a given device). RASC also enables the design of several new features: predicting action completion times accurately, detecting failures of actions faster, allowing fine-grained dependencies in programming, and scheduling. RASC is intended to be implemented atop today's available RPC mechanisms, rather than as a replacement. We integrated RASC into a popular and open-source IoT framework called Home Assistant. Our trace-driven evaluation finds that RASC meets latency SLOs, especially for long actions that last O(mins), which are common in smart spaces. Our scheduling policies for home automations (e.g., routines) outperform state-of-the-art counterparts by 10%-55%.
2026-04-13 · 2 citations
Article · Open access · Senior author
Short video streaming systems such as TikTok, YouTube Shorts, Instagram Reels, etc., have reached billions of active users worldwide. At the core of such systems are (proprietary) recommendation algorithms which recommend a sequence of videos to each user, in a personalized way. We aim to understand the temporal evolution of recommendations made by such algorithms, as well as the interplay between the recommendations and user experience. While past work has studied recommendation algorithms using textual data (e.g., titles, hashtags, etc.) as well as user studies and interviews, we add a third modality of analysis: we perform automated analysis of the videos themselves. To perform such multimodal analysis, we develop a new HCI measurement approach that starts with our new tool called VCA (Video Content Analysis) that leverages recent advances in Vision Language Models (VLMs). We apply VCA on a trifecta of HCI methodologies: real user studies, interviews, and data donation. This allows us to understand temporal aspects of how well TikTok’s recommendation algorithm is perceived by users, is affected by user interactions, and aligns with user history; how users are sensitive to the order of videos recommended; and how the algorithm’s effectiveness itself may be predictable in the future. While it is not our goal to reverse-engineer TikTok’s recommendation algorithm, our new findings indicate behavioral aspects that the TikTok user community can benefit from.
Control in Context: How Smart Home Users Navigate Proxy-based Schemes
2026-04-13 · 1 citation
Article · Open access
A homeowner controls their smart home devices along a spectrum of approaches, ranging from physical device control to various proxy-based control modalities. This paper studies how and why users move along this spectrum in their day-to-day lives, building upon existing research that focused only on specific interactions. We surveyed smart home owners (N = 43 users), and conducted follow-up interviews with a subset of the survey participants (N = 8). Our studies allow us both to distill specific contexts and experiences of smart home owners as they navigate the control spectrum, and to describe how their experiences (both positive and negative) shape their tendencies to control devices in a particular way. These insights lead us to propose practical implications for designers and researchers of smart home management systems, including the need to support flexible control scheme transitions, reduce switching costs, and account for temporal and spatial heterogeneity in the evaluation and design of control systems.
The VLDB Journal · 2025-05-08
Article
Topology and density control of satellite-defined photonic quantum networks
Physical Review Research · 2025-03-25
Article · Open access
Creating photonic quantum networks by distributing entangled photon pairs through low Earth orbit satellites is a promising technological advance. A recent work has studied a model of such networks and reported the presence of a heavy-tailed degree distribution. This heterogeneous structure is highly undesirable when it comes to the quantum memory utilization efficiency and network robustness under malicious attack. In this work, however, we show that such a topology is not necessarily inherent to satellite-based quantum networks. We theoretically analyze factors that determine the connection probability between two nodes and propose a principled design methodology to control both the topology and density of the resulting network. We present numerical evidence that our method can continuously transform the heterogeneous heavy-tailed network into a more homogeneous Erdős-Rényi-like random network with the prescribed level of density as characterized by the average or maximal degree. Such results are in good agreement with our theoretical analysis, which not only predicts the qualitative structural transition but also provides a quantitative way to estimate important network features during the process. Under our control strategies, the resulting network can achieve various desirable properties, such as a short path length and diameter, high quantum memory utilization efficiency, and enhanced robustness against attack. We believe the design principle proposed in this work represents an important step towards building and controlling functionally efficient satellite-photonic quantum networks in the future.
Wainscot: Tailoring Model Parallelism to Fit Device Memory Limits
2025-07-21
Article · Senior author
With increasing sizes of DNN (Deep Neural Network) models making them exceed the memory of a single device (GPU), model parallelism-based training has become paramount, splitting a model across multiple devices. Unfortunately, today’s model parallelism approaches often result in memory-unbalanced allocations of the model across the multiple devices, with some devices’ memory heavily utilized while others remain underused. This imbalance limits deployments from reaching high batch sizes, triggers Out of Memory (OOM) errors earlier, and underutilizes resources. We present Wainscot, a model parallelism solution that produces memory-balanced placements of a DNN model across multiple devices, without noticeable increases in step time. We explore and empirically compare different granularities of rebalancing: operators, operator groups, and subgraphs. Experiments with diverse DNNs across a wide range of batch sizes demonstrate that compared to state-of-the-art model parallelism systems, Wainscot reduces maximal peak memory (across devices) by 47.94%, with a modest increase of 1.62% in step time.
arXiv (Cornell University) · 2025-03-25
Preprint · Open access · Senior author
Short video streaming systems such as TikTok, YouTube Shorts, Instagram Reels, etc., have reached billions of active users worldwide. At the core of such systems are (proprietary) recommendation algorithms which recommend a sequence of videos to each user, in a personalized way. We aim to understand the temporal evolution of recommendations made by such algorithms, as well as the interplay between the recommendations and user experience. While past work has studied recommendation algorithms using textual data (e.g., titles, hashtags, etc.) as well as user studies and interviews, we add a third modality of analysis - we perform automated analysis of the videos themselves. To perform such multimodal analysis, we develop a new HCI measurement approach that starts with our new tool called VCA (Video Content Analysis) that leverages recent advances in Vision Language Models (VLMs). We apply VCA on a trifecta of HCI methodologies - real user studies, interviews, and data donation. This allows us to understand temporal aspects of how well TikTok's recommendation algorithm is perceived by users, is affected by user interactions, and aligns with user history; how users are sensitive to the order of videos recommended; and how the algorithm's effectiveness itself may be predictable in the future. While it is not our goal to reverse-engineer TikTok's recommendation algorithm, our new findings indicate behavioral aspects that the TikTok user community can benefit from.
2025-03-26
Preprint · Open access · Senior author
This paper tackles the challenge of running multiple ML inference jobs (models) under time-varying workloads, on a constrained on-premises production cluster. Our system Faro takes in latency Service Level Objectives (SLOs) for each job, auto-distills them into utility functions, "sloppifies" these utility functions to make them amenable to mathematical optimization, automatically predicts workload via probabilistic prediction, and dynamically makes implicit cross-job resource allocations, in order to satisfy cluster-wide objectives, e.g., total utility, fairness, and other hybrid variants. A major challenge Faro tackles is that using precise utilities and high-fidelity predictors can be too slow (and in a sense too precise!) for the fast adaptation we require. Faro's solution is to "sloppify" (relax) its multiple design components to achieve fast adaptation without overly degrading solution quality. Faro is implemented in a stack consisting of Ray Serve running atop a Kubernetes cluster. Trace-driven cluster deployments show that Faro achieves 2.3×-23× lower SLO violations compared to state-of-the-art systems.
There is More Control in Egalitarian Edge IoT Meshes
IEEE Transactions on Network and Service Management · 2025-09-11
Article · Open access · Senior author
While mesh networking for edge settings (e.g., smart buildings, farms, battlefields, etc.) has received much attention, the layer of control over such meshes remains largely centralized and cloud-based. This paper focuses on applications with commonplace sense-trigger-actuate (STA) workloads, like the abstraction of routines popular now in smart homes, but applied to larger-scale edge IoT deployments. We present CoMesh, which tackles the challenge of building a decentralized mesh-based control plane for local, non-cloud, and hubless management of sense-trigger-actuate applications. CoMesh builds atop an abstraction called the coterie, which spreads STA load in a fine-grained way both across space and across time. A coterie uses a novel combination of techniques such as zero-message-exchange protocols (for fast proactive member selection), quorum-based agreement, and locality-sensitive hashing. We analyze and theoretically prove safety and liveness properties of CoMesh. Our evaluation with both a Raspberry Pi-4 deployment and larger-scale simulations, using real building maps and real routine workloads, shows that CoMesh is load-balanced, fast, fault-tolerant, and scalable.
Recent grants
NSF · $600k · 2010–2015
CSR: Medium: Availability-Consistency Tradeoffs in Key-Value and NoSQL Storage Systems
NSF · $585k · 2014–2018
CSR: Small: Online Global Reconfigurations in Key-Value and NoSQL Cloud Storage Systems
NSF · $480k · 2013–2017
CAREER: Systematic Design of Distributed Protocols - from Methodologies and Toolkits to Systems
NSF · $456k · 2005–2010
CNS Core: Small: GoT -- Groups of Things Abstractions for Distributed IoT
NSF · $500k · 2019–2023
Frequent coauthors
- 22 shared
Klara Nahrstedt
University of Illinois Urbana-Champaign
- 20 shared
Cong Xie
- 16 shared
Roy H. Campbell
- 16 shared
Steven Y. Ko
Simon Fraser University
- 14 shared
Muntasir Raihan Rahman
Microsoft Research (United Kingdom)
- 13 shared
Kenneth P. Birman
- 13 shared
Oluwasanmi Koyejo
- 12 shared
Brian Cho
Education
- 2007
Ph.D., Computer Science
University of Illinois at Urbana-Champaign
- 2003
M.S., Computer Science
University of Illinois at Urbana-Champaign
- 2001
B.S., Computer Science and Engineering
University of Calcutta
Awards & honors
- ACM Distinguished Scientist (2018)
- IEEE Senior Member
- NSF CAREER award (2005)
- Junior Xerox Award for Faculty Research (2008)
- CAS/Beckman Fellowship (2009)