Saugata Ghose
Assistant Professor · University of Illinois Urbana-Champaign · Computer Science
Active 1962–2026
About
Saugata Ghose is an Assistant Professor in the Department of Computer Science at the University of Illinois Urbana-Champaign, with a focus on data-centric computer architectures and systems. His research interests include processing-in-memory (PIM), memory and storage systems, and computing platforms for applications such as smart cities, autonomous vehicles, robotics, and genomics. He holds a Ph.D. and M.S. in Computer Engineering from Cornell University, obtained in 2014, and a B.S. in Computer Science and a B.S. in Computer Engineering from the State University of New York at Binghamton, both earned in 2007. Ghose has held positions at Carnegie Mellon University as a Systems Scientist and Postdoctoral Research Associate, and he has been an affiliate faculty member at the University of Illinois since April 2022. His work has been recognized through an NSF grant for developing faster and more efficient semiconductors, the 2023 Intel Rising Star Faculty Award, and induction into the ISCA and HPCA Halls of Fame.
Research topics
- Computer Science
- Computer hardware
- Embedded system
- Artificial Intelligence
- Biology
- Distributed computing
- Computer architecture
- Parallel computing
- Engineering
- Operating system
- Genetics
Selected publications
The Memory Processing Unit: A Generalized Interface for End-to-End In-Memory Execution
2026-01-31 · 1 citation
article · Senior author
The processing-using-memory (PUM; a.k.a. in-memory computing) paradigm aims to eliminate data movement energy and performance costs by using memory cell interactions to directly perform computation. Given PUM's potential for large savings, prior works have proposed many different datapath microarchitectures to demonstrate how general-purpose PUM benefits a wide range of application kernels. Unfortunately, these efforts largely depend on microarchitecture-specific vector-like interfaces that (1) force many of an application's operations to be offloaded to a CPU, (2) require significant programmer effort to scale up applications to an entire memory chip, and (3) make it impractical to develop badly-needed systems software and programming tools for PUM. To address these three issues, we propose the memory processing unit (MPU), a microarchitecture-agnostic interface layer for general-purpose PUM with three components. First, we develop an MPU instruction set architecture (ISA) with instructions to facilitate application scaling and task coordination. Second, we propose an ensemble execution model that coordinates execution across millions of PUM vector function units and maps to most general-purpose PUM microarchitectures. Third, we design a comprehensive MPU control path that efficiently executes MPU ISA binaries across multiple ensembles, and can enable CPU-free execution of complex end-to-end applications with PUM. We demonstrate how the MPU maps to multiple previously-proposed PUM datapaths, and how it achieves average performance/energy improvements of 1.79×/3.23× for 21 data-intensive kernels over these prior works (67×/47× vs. a modern GPU), while also achieving performance and energy improvements for the complex end-to-end applications.
Special Issue on Data-Centric Computing
IEEE Micro · 2026-03-01
article · Senior author
DARTH-PUM: A Hybrid Processing-Using-Memory Architecture
ArXiv.org · 2026-02-17
article · Open access · Senior author
Analog processing-using-memory (PUM; a.k.a. in-memory computing) makes use of electrical interactions inside memory arrays to perform bulk matrix-vector multiplication (MVM) operations. However, many popular matrix-based kernels need to execute non-MVM operations, which analog PUM cannot directly perform. To retain its energy efficiency, analog PUM architectures augment memory arrays with CMOS-based domain-specific fixed-function hardware to provide complete kernel functionality, but the difficulty of integrating such specialized CMOS logic with memory arrays has largely limited analog PUM to being an accelerator for machine learning inference, or for closely related kernels. An opportunity exists to harness analog PUM for general-purpose computation: recent works have shown that memory arrays can also perform Boolean PUM operations, albeit with very different supporting hardware and electrical signals than analog PUM. We propose DARTH-PUM, a general-purpose hybrid PUM architecture that tackles key hardware and software challenges to integrating analog PUM and digital PUM. We propose optimized peripheral circuitry, coordinating hardware to manage and interface between both types of PUM, an easy-to-use programming interface, and low-cost support for flexible data widths. These design elements allow us to build a practical PUM architecture that can execute kernels fully in memory, and can scale easily to cater to domains ranging from embedded applications to large-scale data-driven computing. We show how three popular applications (AES encryption, convolutional neural networks, large language models) can map to and benefit from DARTH-PUM, with speedups of 59.4x, 14.8x, and 40.8x over an analog+CPU baseline.
DARTH-PUM: A Hybrid Processing-Using-Memory Architecture
Open MIND · 2026-02-17
preprint · Senior author
CRAVE: Analyzing Cross-Resource Interaction to Improve Energy Efficiency in Systems-on-Chip
2025-03-26 · 2 citations
article
Mobile platforms make use of dynamic voltage and frequency scaling (DVFS) to trade off runtime performance and power consumption for their systems-on-chip (SoCs). State-of-the-art governors in the OS use application-based characteristics to control the SoC's DVFS settings for CPU cores, as well as the GPU in some SoCs. Through experimental characterization of real-world mobile platforms, we find that key SoC components have a complex relationship with one another, which directly affects their performance and power usage. This relationship is dependent on the architecture of the SoC as it is caused by the interaction of processing elements such as the CPU and GPU through a shared main memory. Unfortunately, existing application-oriented governors do not explicitly capture this design-induced relationship.
CPU Autoscaling With a Kernel of Truth
2025-10-09
article · Open access · Senior author
Cloud computing paradigms such as microservices and functions-as-a-service have made autoscaling an essential component of cloud application management. However, existing autoscalers struggle at capturing application dynamics, and have difficulties with precisely allocating quotas of shared system resources to the applications. We argue that one fundamental issue is the gap between native OS resource interfaces and surrogate user metrics that existing autoscalers use. In this paper, we take CPU autoscaling as an example: the cloud interface treats CPU resources as a percentage of the host CPU (e.g., millicore), while the OS kernel interprets CPU resources as time-shared quota slices allowed to run within a set period. We advocate for OS kernel support for CPU autoscaling to close the semantic gap, as it allows the autoscaler to perform precise, highly responsive resource allocation. We demonstrate the idea by developing Kscaler, a millisecond-scale CPU autoscaler for Linux. With kernel-level observability of fine-grained scheduler behavior, Kscaler outperforms state-of-the-art CPU autoscalers in responsiveness, precision, and efficiency while employing simple statistical methods.
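The semantic gap described in this abstract can be illustrated with a small sketch. It is not Kscaler's code. It only shows the arithmetic that relates a cloud-style limit in millicores to a kernel-style time quota per scheduling period. The function names are hypothetical, and the 100 ms period is an assumption (it matches the usual Linux CFS bandwidth default).

```python
# Assumed scheduling period: 100 ms, the usual Linux CFS bandwidth default.
CFS_PERIOD_US = 100_000


def millicores_to_quota_us(millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Convert a cloud-style CPU limit (1000 millicores = 1 core) into the
    run time, in microseconds, allowed within one scheduling period."""
    if millicores < 0 or period_us <= 0:
        raise ValueError("millicores must be >= 0 and period_us must be > 0")
    return millicores * period_us // 1000


def quota_us_to_millicores(quota_us: int, period_us: int = CFS_PERIOD_US) -> int:
    """Inverse mapping: express a per-period time quota as millicores."""
    if quota_us < 0 or period_us <= 0:
        raise ValueError("quota_us must be >= 0 and period_us must be > 0")
    return quota_us * 1000 // period_us


if __name__ == "__main__":
    # Half a core becomes 50 ms of run time in every 100 ms period.
    print(millicores_to_quota_us(500))
    # 2.5 cores becomes 250 ms per period, spread across multiple CPUs.
    print(millicores_to_quota_us(2500))
```

The same millicore value can correspond to very different runtime behavior depending on how the quota is consumed within each period, which is the kind of detail a percentage-style metric hides.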
2025-06-08 · 1 citation
article
POSTER: DaPPA: A Data-Parallel Programming Framework for Processing-in-Memory Architectures
2025-11-03
article
The increasing prevalence and growing size of data in modern applications have led to high costs for computation in traditional processor-centric computing systems [1–20]. To mitigate these costs, the processing-in-memory (PIM) [1, 2, 5, 6, 8, 9, 20–27] paradigm moves computation closer to where the data resides, reducing the need to move data between memory and the processor. Even though the concept of PIM was first proposed in the 1960s [24, 28], and various PIM architectures have been proposed since then [10, 17, 18, 20, 29–65], real-world PIM systems have only recently been manufactured [66–70], among which the UPMEM PIM system [66, 67, 71] is the first PIM architecture to become commercially available. A general-purpose PIM system is often composed of regular DRAM (as its main memory) and specialized PIM DRAM DIMMs. A PIM module is a standard DDRx DIMM (module) with multiple PIM chips. Inside each PIM chip, there are multiple (e.g., 8) general-purpose in-order PIM cores, which have exclusive access to a DRAM bank and SRAM-based instruction/scratchpad memories.
ArXiv.org · 2025-01-29
preprint · Open access
Processing-using-DRAM (PUD) is a paradigm where the analog operational properties of DRAM are used to perform bulk logic operations. While PUD promises high throughput at low energy and area cost, we uncover three limitations of existing PUD approaches that lead to significant inefficiencies: (i) static data representation, i.e., two's complement with fixed bit-precision, leading to unnecessary computation over useless (i.e., inconsequential) data; (ii) support for only throughput-oriented execution, where the high latency of individual PUD operations can only be hidden in the presence of bulk data-level parallelism; and (iii) high latency for high-precision (e.g., 32-bit) operations. To address these issues, we propose Proteus, the first hardware framework that addresses the high execution latency of bulk bitwise PUD operations by implementing a data-aware runtime engine for PUD. Proteus reduces the latency of PUD operations in three different ways: (i) Proteus dynamically reduces the bit-precision (and thus the latency and energy consumption) of PUD operations by exploiting narrow values (i.e., values with many leading zeros or ones); (ii) Proteus concurrently executes independent in-DRAM primitives belonging to a single PUD operation across multiple DRAM arrays; (iii) Proteus chooses and uses the most appropriate data representation and arithmetic algorithm implementation for a given PUD instruction transparently to the programmer.
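The idea of narrow values in this abstract can be made concrete with a short sketch. It is not the Proteus implementation. It only computes, in software, the smallest two's-complement width that still represents a set of values exactly, which is the quantity a precision-reducing scheme would exploit. The function names are hypothetical.

```python
from typing import Iterable


def min_signed_bits(value: int) -> int:
    """Smallest two's-complement width that represents `value` exactly.

    Leading zeros (non-negative values) or leading ones (negative values)
    beyond this width carry no information.
    """
    if value >= 0:
        return value.bit_length() + 1  # magnitude bits plus a sign bit of 0
    return (~value).bit_length() + 1   # ~value strips the leading ones


def required_bits(values: Iterable[int]) -> int:
    """Width needed for a whole operand vector: the widest element decides."""
    return max((min_signed_bits(v) for v in values), default=1)


if __name__ == "__main__":
    operands = [3, -2, 17]
    width = required_bits(operands)
    # A bit-serial operation over these operands would need `width` bit
    # positions instead of a fixed 32.
    print(width)
```

In a bit-serial design, latency grows with the number of bit positions processed, so a vector that fits in 6 bits has far fewer positions to process than a fixed 32-bit representation.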
ANVIL: An In-Storage Accelerator for Name–Value Data Stores
2025-06-20 · 2 citations
article · Open access
Name-value pairs (NVPs) are a widely-used abstraction to organize data in millions of applications. At a high level, an NVP associates a name (e.g., array index, key, hash) with each value in a collection of data. Specific NVP data store formats can vary widely, ranging from simple arrays/dictionaries and lookup tables to key-value stores and data mining workloads. Despite their importance, existing optimizations for NVPs are limited to only a single data store format, as the broad definition of NVPs allows for significant heterogeneity in encoding and implementation. We propose ANVIL, the first end-to-end system that allows programmers to broadly accelerate most formats of NVPs. With a conventional solid-state drive (SSD), large-scale NVP lookups can saturate both external and internal SSD bandwidth, as every NVP in the data store needs to be sent back to the host CPU to check for a matching name. ANVIL makes use of in-storage processing to avoid reading out any data for names that do not match, by performing name match checks directly inside the SSD's NAND flash chips. We demonstrate that ANVIL can substantially reduce disk I/O, reduce metadata overheads, and provide speedups of 4.0, 25, and 14.6% over a conventional SSD, for three different NVP workloads (database transactions, analytics, and graph processing).
Frequent coauthors
- 183 shared
Onur Mutlu
- 48 shared
Juan Gómez-Luna
- 41 shared
Rachata Ausavarungnirun
King Mongkut's University of Technology North Bangkok
- 36 shared
Hasan Hassan
- 32 shared
Amirali Boroumand
Islamic Azad University Sari Branch
- 30 shared
Donghyuk Lee
Seoul National University of Science and Technology
- 28 shared
Jeremie S. Kim
- 26 shared
Minesh Patel
Labs
Siebel School of Computing and Data Science · PI
Education
- 2014
Ph.D. and M.S., Computer Engineering
Cornell University
- 2007
B.S., Computer Science and B.S., Computer Engineering
State University of New York at Binghamton
Awards & honors
- Induction into the ISCA Hall of Fame (2026)
- Induction into the HPCA Hall of Fame (2025)