Saugata Ghose

Assistant Professor · Verified

University of Illinois Urbana-Champaign · Computer Science

Active 1962–2026

h-index: 46
Citations: 5.8k
Papers: 173 (58 last 5y)

About

Saugata Ghose is an Assistant Professor in the Department of Computer Science at the University of Illinois Urbana-Champaign, where he focuses on data-centric computer architectures and systems. His research interests include processing-in-memory (PIM), memory and storage systems, and computing platforms for applications such as smart cities, autonomous vehicles, robotics, and genomics. He holds a Ph.D. and an M.S. in Computer Engineering from Cornell University, obtained in 2014, and a B.S. in Computer Science and a B.S. in Computer Engineering from the State University of New York at Binghamton, both earned in 2007. Ghose has held positions at Carnegie Mellon University as a Systems Scientist and Postdoctoral Research Associate, and he has been an affiliate faculty member at the University of Illinois since April 2022. His recognition includes an NSF grant for developing faster and more efficient semiconductors, the 2023 Intel Rising Star Faculty Award, and induction into the ISCA and HPCA Halls of Fame.

Research topics

  • Computer Science
  • Computer hardware
  • Embedded system
  • Artificial Intelligence
  • Biology
  • Distributed computing
  • Computer architecture
  • Parallel computing
  • Engineering
  • Operating system
  • Genetics

Selected publications

  • The Memory Processing Unit: A Generalized Interface for End-to-End In-Memory Execution

    2026-01-31 · 1 citations

    article · Senior author

    The processing-using-memory (PUM; a.k.a. in-memory computing) paradigm aims to eliminate data movement energy and performance costs by using memory cell interactions to directly perform computation. Given PUM's potential for large savings, prior works have proposed many different datapath microarchitectures to demonstrate how general-purpose PUM benefits a wide range of application kernels. Unfortunately, these efforts largely depend on microarchitecture-specific vector-like interfaces that (1) force many of an application's operations to be offloaded to a CPU, (2) require significant programmer effort to scale up applications to an entire memory chip, and (3) make it impractical to develop badly-needed systems software and programming tools for PUM. To address these three issues, we propose the memory processing unit (MPU), a microarchitecture-agnostic interface layer for general-purpose PUM with three components. First, we develop an MPU instruction set architecture (ISA) with instructions to facilitate application scaling and task coordination. Second, we propose an ensemble execution model that coordinates execution across millions of PUM vector function units and maps to most general-purpose PUM microarchitectures. Third, we design a comprehensive MPU control path that efficiently executes MPU ISA binaries across multiple ensembles, and can enable CPU-free execution of complex end-to-end applications with PUM. We demonstrate how the MPU maps to multiple previously-proposed PUM datapaths, and how it achieves average performance/energy improvements of $1.79\times$/$3.23\times$ for 21 data-intensive kernels over these prior works ($67\times$/$47\times$ vs. a modern GPU), while also achieving performance and energy improvements for the complex end-to-end applications.

  • Special Issue on Data-Centric Computing

    IEEE Micro · 2026-03-01

    article · Senior author
  • DARTH-PUM: A Hybrid Processing-Using-Memory Architecture

    ArXiv.org · 2026-02-17

    article · Open access · Senior author

    Analog processing-using-memory (PUM; a.k.a. in-memory computing) makes use of electrical interactions inside memory arrays to perform bulk matrix-vector multiplication (MVM) operations. However, many popular matrix-based kernels need to execute non-MVM operations, which analog PUM cannot directly perform. To retain its energy efficiency, analog PUM architectures augment memory arrays with CMOS-based domain-specific fixed-function hardware to provide complete kernel functionality, but the difficulty of integrating such specialized CMOS logic with memory arrays has largely limited analog PUM to being an accelerator for machine learning inference, or for closely related kernels. An opportunity exists to harness analog PUM for general-purpose computation: recent works have shown that memory arrays can also perform Boolean PUM operations, albeit with very different supporting hardware and electrical signals than analog PUM. We propose DARTH-PUM, a general-purpose hybrid PUM architecture that tackles key hardware and software challenges to integrating analog PUM and digital PUM. We propose optimized peripheral circuitry, coordinating hardware to manage and interface between both types of PUM, an easy-to-use programming interface, and low-cost support for flexible data widths. These design elements allow us to build a practical PUM architecture that can execute kernels fully in memory, and can scale easily to cater to domains ranging from embedded applications to large-scale data-driven computing. We show how three popular applications (AES encryption, convolutional neural networks, large language models) can map to and benefit from DARTH-PUM, with speedups of 59.4x, 14.8x, and 40.8x over an analog+CPU baseline.

  • DARTH-PUM: A Hybrid Processing-Using-Memory Architecture

    Open MIND · 2026-02-17

    preprint · Senior author

  • CRAVE: Analyzing Cross-Resource Interaction to Improve Energy Efficiency in Systems-on-Chip

    2025-03-26 · 2 citations

    article

    Mobile platforms make use of dynamic voltage and frequency scaling (DVFS) to trade off runtime performance and power consumption for their systems-on-chip (SoCs). State-of-the-art governors in the OS use application-based characteristics to control the SoC's DVFS settings for CPU cores, as well as the GPU in some SoCs. Through experimental characterization of real-world mobile platforms, we find that key SoC components have a complex relationship with one another, which directly affects their performance and power usage. This relationship is dependent on the architecture of the SoC as it is caused by the interaction of processing elements such as the CPU and GPU through a shared main memory. Unfortunately, existing application-oriented governors do not explicitly capture this design-induced relationship.

  • CPU Autoscaling With a Kernel of Truth

    2025-10-09

    article · Open access · Senior author

    Cloud computing paradigms such as microservices and functions-as-a-service have made autoscaling an essential component of cloud application management. However, existing autoscalers struggle at capturing application dynamics, and have difficulties with precisely allocating quotas of shared system resources to the applications. We argue that one fundamental issue is the gap between native OS resource interfaces and surrogate user metrics that existing autoscalers use. In this paper, we take CPU autoscaling as an example: the cloud interface treats CPU resources as a percentage of the host CPU (e.g., millicore), while the OS kernel interprets CPU resources as time-shared quota slices allowed to run within a set period. We advocate for OS kernel support for CPU autoscaling to close the semantic gap, as it allows the autoscaler to perform precise, highly responsive resource allocation. We demonstrate the idea by developing Kscaler, a millisecond-scale CPU autoscaler for Linux. With kernel-level observability of fine-grained scheduler behavior, Kscaler outperforms state-of-the-art CPU autoscalers in responsiveness, precision, and efficiency while employing simple statistical methods.

  • Proteus: Achieving High-Performance Processing-Using-DRAM with Dynamic Bit-Precision, Adaptive Data Representation, and Flexible Arithmetic

    2025-06-08 · 1 citations

    article
  • POSTER: DaPPA: A Data-Parallel Programming Framework for Processing-in-Memory Architectures

    2025-11-03

    article

    The increasing prevalence and growing size of data in modern applications have led to high costs for computation in traditional processor-centric computing systems [1–20]. To mitigate these costs, the processing-in-memory (PIM) [1, 2, 5, 6, 8, 9, 20–27] paradigm moves computation closer to where the data resides, reducing the need to move data between memory and the processor. Even though the concept of PIM was first proposed in the 1960s [24, 28], and various PIM architectures have been proposed since then [10, 17, 18, 20, 29–65], real-world PIM systems have only recently been manufactured [66–70], among which the UPMEM PIM system [66, 67, 71] is the first PIM architecture to become commercially available. A general-purpose PIM system is often composed of regular DRAM (as its main memory) and specialized PIM DRAM DIMMs. A PIM module is a standard DDRx DIMM (module) with multiple PIM chips. Inside each PIM chip, there are multiple (e.g., 8) general-purpose in-order PIM cores, which have exclusive access to a DRAM bank and SRAM-based instruction/scratchpad memories.

  • Proteus: Enabling High-Performance Processing-Using-DRAM with Dynamic Bit-Precision, Adaptive Data Representation, and Flexible Arithmetic

    ArXiv.org · 2025-01-29

    preprint · Open access

    Processing-using-DRAM (PUD) is a paradigm where the analog operational properties of DRAM are used to perform bulk logic operations. While PUD promises high throughput at low energy and area cost, we uncover three limitations of existing PUD approaches that lead to significant inefficiencies: (i) static data representation, i.e., two's complement with fixed bit-precision, leading to unnecessary computation over useless (i.e., inconsequential) data; (ii) support for only throughput-oriented execution, where the high latency of individual PUD operations can only be hidden in the presence of bulk data-level parallelism; and (iii) high latency for high-precision (e.g., 32-bit) operations. To address these issues, we propose Proteus, the first hardware framework that addresses the high execution latency of bulk bitwise PUD operations by implementing a data-aware runtime engine for PUD. Proteus reduces the latency of PUD operations in three different ways: (i) Proteus dynamically reduces the bit-precision (and thus the latency and energy consumption) of PUD operations by exploiting narrow values (i.e., values with many leading zeros or ones); (ii) Proteus concurrently executes independent in-DRAM primitives belonging to a single PUD operation across multiple DRAM arrays; (iii) Proteus chooses and uses the most appropriate data representation and arithmetic algorithm implementation for a given PUD instruction transparently to the programmer.

  • ANVIL: An In-Storage Accelerator for Name–Value Data Stores

    2025-06-20 · 2 citations

    article · Open access

    Name-value pairs (NVPs) are a widely-used abstraction to organize data in millions of applications. At a high level, an NVP associates a name (e.g., array index, key, hash) with each value in a collection of data. Specific NVP data store formats can vary widely, ranging from simple arrays/dictionaries and lookup tables to key-value stores and data mining workloads. Despite their importance, existing optimizations for NVPs are limited to only a single data store format, as the broad definition of NVPs allows for significant heterogeneity in encoding and implementation. We propose ANVIL, the first end-to-end system that allows programmers to broadly accelerate most formats of NVPs. With a conventional solid-state drive (SSD), large-scale NVP lookups can saturate both external and internal SSD bandwidth, as every NVP in the data store needs to be sent back to the host CPU to check for a matching name. ANVIL makes use of in-storage processing to avoid reading out any data for names that do not match, by performing name match checks directly inside the SSD's NAND flash chips. We demonstrate that ANVIL can substantially reduce disk I/O, reduce metadata overheads, and provide speedups of 4.0, 25, and 14.6% over a conventional SSD, for three different NVP workloads (database transactions, analytics, and graph processing).

Frequent coauthors

  • Onur Mutlu

    183 shared
  • Juan Gómez-Luna

    48 shared
  • Rachata Ausavarungnirun

    King Mongkut's University of Technology North Bangkok

    41 shared
  • Hasan Hassan

    36 shared
  • Amirali Boroumand

    Islamic Azad University Sari Branch

    32 shared
  • Donghyuk Lee

    Seoul National University of Science and Technology

    30 shared
  • Jeremie S. Kim

    28 shared
  • Minesh Patel

    26 shared

Labs

  • Siebel School of Computing and Data Science · PI

Education

  • Ph.D., Computer Science

    University of Illinois at Urbana-Champaign

    2000
  • M.S., Computer Science

    University of Illinois at Urbana-Champaign

    1995
  • B.S., Computer Science and Engineering

    Indian Institute of Technology, Kharagpur

    1993

Awards & honors

  • Induction into the ISCA Hall of Fame (2026)
  • Induction into the HPCA Hall of Fame (2025)