Haryadi Gunawi

Associate Professor of Computer Science

University of Chicago · Computer Science

Active 2003–2025

h-index: 28

Citations: 2.8k

Papers: 94 (20 in the last 5 years)

Funding: $2.7M

About

Professor Haryadi S. Gunawi is a faculty member in the Computer Science Department at the University of Chicago. His research focuses on operating and storage systems, distributed (cloud) systems, and the application of machine learning techniques to storage and systems problems. He leads research groups including UCARE, the Systems Group, and the Chameleon project. Professor Gunawi's work addresses challenges in optimizing storage I/O, improving system reliability and performance, and scaling cloud and distributed systems. He has contributed extensively to the field through numerous publications and has advised multiple PhD students. Prior to joining the University of Chicago, he was a postdoctoral fellow at UC Berkeley from 2010 to 2012 and completed his PhD at the University of Wisconsin-Madison in 2009, where his dissertation on reliable storage systems earned the ACM Doctoral Dissertation Award Honorable Mention and a departmental best thesis award.

Research topics

Computer Science
Embedded system
World Wide Web
Operating system

Selected publications

Alchemist: Towards the Design of Efficient Online Continual Learning System
ArXiv.org · 2025-03-03
preprint · Open access
Continual learning has become a promising solution to refine large language models incrementally by leveraging user feedback. In particular, online continual learning - iteratively training the model with small batches of user feedback - has demonstrated notable performance improvements. However, the existing practice of separating training and serving processes forces the online trainer to recompute the intermediate results already done during serving. Such redundant computations can account for 30%-42% of total training time. In this paper, we propose Alchemist, to the best of our knowledge, the first online continual learning system that efficiently reuses serving activations to increase training throughput. Alchemist introduces two key techniques: (1) recording and storing activations and KV cache only during the prefill phase to minimize latency and memory overhead; and (2) smart activation offloading and hedging. Evaluations with inputs of varied token length sampled from the ShareGPT dataset show that compared with a separate training cluster, Alchemist significantly increases training throughput by up to 1.72x, reduces memory usage during training by up to 47%, and supports up to 2x more training tokens - all while maintaining negligible impact on serving latency.
GPEmu: A GPU Emulator for Faster and Cheaper Prototyping and Evaluation of Deep Learning System Research
Proceedings of the VLDB Endowment · 2025-02-01
article · Senior author
Deep learning (DL) system research is often impeded by the limited availability and expensive costs of GPUs. In this paper, we introduce GPEmu, a GPU emulator for faster and cheaper prototyping and evaluation of deep learning system research without using real GPUs. GPEmu comes with four novel features: time emulation, memory emulation, distributed system support, and sharing support. We support over 30 DL models and 6 GPU models, the largest scale to date. We demonstrate the power of GPEmu by successfully reproducing the main results of nine recent publications and easily prototyping three new micro-optimizations.
Heimdall: Optimizing Storage I/O Admission with Extensive Machine Learning Pipeline
2025-03-26 · 3 citations
article · Senior author
This paper introduces Heimdall, a highly accurate and efficient machine learning-powered I/O admission policy for flash storage, designed to operate in a black-box manner. We make domain-specific innovations in various ML stages by introducing accurate period-based labeling, 3-stage noise filtering, in-depth feature engineering, and fine-grained tuning, which together improve the decision accuracy from 67% up to 93%. We perform various deployment optimizations to reach a sub-μs inference latency and a small, 28KB, memory overhead. With 500 unbiased random experiments derived from production traces, we show Heimdall delivers 15-35% lower average I/O latency compared to the state of the art and is up to 2x faster than a baseline. Heimdall is ready for user-level, in-kernel, and distributed deployments.
Concierge: Towards Accuracy-Driven Bandwidth Allocation for Video Analytics Applications in Edge Network
2024-07-07
article
When performing inference on sensor data, edge video analytics applications may not always need high-fidelity data, since important information may not appear all the time. Consequently, each edge AI application’s bandwidth demand is highly dynamic. Thus, a shared edge system should dynamically allocate more bandwidth to the applications in need to reach high accuracy at each moment. However, previous bandwidth allocators are ill-suited because they are agnostic to the time-varying impact of bandwidth on each application’s accuracy. This short paper explores a new accuracy-driven approach to bandwidth allocation, which periodically re-allocates bandwidth across edge AI applications based on the sensitivity of each application’s accuracy to its bandwidth share. To examine its practical benefit and technical challenges, we present a concrete accuracy-driven bandwidth allocator called Concierge, which exposes a simple yet efficient interface to estimate each application’s sensitivity to a small change in its bandwidth share. We run Concierge on state-of-the-art video-analytics applications with real video streams and show its early promise in greatly improving the inference accuracy of video analytics.
RHIK: Re-configurable Hash-based Indexing for KVSSD
2023-08-07 · 1 citation
article
Key-Value Solid State Drive (KV-SSD), a key addressable SSD technology, promises to simplify storage management for unstructured data and improve system performance with minimal host-side intervention. However, we find that the current state-of-the-art KV-SSD exhibits indexing peculiarities that limit its widespread adoption. Through experiments, we observe that the performance degrades as more data are stored, and the KV-SSD can only store a limited number of key-value pairs even though the amount of data stored on the device is significantly lower than its capacity. We introduce RHIK, a reconfigurable hash-based indexing for KV-SSD, for high performance and high occupancy. We implement our proposed indexing scheme on the open-source KV-SSD emulator that is validated against a real KV-SSD, and demonstrate its effectiveness using real workload traces and synthetic microbenchmarks.
Performance Bug Analysis and Detection for Distributed Storage and Computing Systems
ACM Transactions on Storage · 2023-01-18 · 10 citations
article
This article systematically studies 99 distributed performance bugs from five widely deployed distributed storage and computing systems (Cassandra, HBase, HDFS, Hadoop MapReduce and ZooKeeper). We present the TaxPerf database, which collectively organizes the analysis results as over 400 classification labels and over 2,500 lines of bug re-description. TaxPerf is classified into six bug categories (and 18 bug subcategories) by their root causes: resource, blocking, synchronization, optimization, configuration, and logic. TaxPerf can be used as a benchmark for performance bug studies and debug tool designs. Although it is impractical to automatically detect all categories of performance bugs in TaxPerf, we find that an important category of blocking bugs can be effectively solved by analysis tools. We analyze the cascading nature of blocking bugs and design an automatic detection tool called PCatch, which (i) performs program analysis to identify code regions whose execution time can potentially increase dramatically with the workload size; (ii) adapts the traditional happens-before model to reason about software resource contention and performance dependency relationship; and (iii) uses dynamic tracking to identify whether the slowdown propagation is contained in one job. Evaluation shows that PCatch can accurately detect blocking bugs of representative distributed storage and computing systems by observing system executions under small-scale workloads.
Towards Continually Learning Application Performance Models
arXiv (Cornell University) · 2023-10-25
preprint · Open access
Machine learning-based performance models are increasingly being used to build critical job scheduling and application optimization decisions. Traditionally, these models assume that data distribution does not change as more samples are collected over time. However, owing to the complexity and heterogeneity of production HPC systems, they are susceptible to hardware degradation, replacement, and/or software patches, which can lead to drift in the data distribution that can adversely affect the performance models. To this end, we develop continually learning performance models that account for the distribution drift, alleviate catastrophic forgetting, and improve generalizability. Our best model was able to retain accuracy, regardless of having to learn the new distribution of data inflicted by system changes, while demonstrating a 2x improvement in the prediction accuracy of the whole data sequence in comparison to the naive approach.
EVStore: Storage and Caching Capabilities for Scaling Embedding Tables in Deep Recommendation Systems
2023-01-27 · 13 citations
article · Senior author
Modern recommendation systems, primarily driven by deep-learning models, depend on fast model inferences to be useful. To tackle the sparsity in the input space, particularly for categorical variables, such inferences are made by storing increasingly large embedding vector (EV) tables in memory. A core challenge is that the inference operation has an all-or-nothing property: each inference requires multiple EV table lookups, but if any memory access is slow, the whole inference request is slow. In our paper, we design, implement and evaluate EVStore, a 3-layer EV table lookup system that harnesses both structural regularity in inference operations and domain-specific approximations to provide optimized caching, yielding up to 23% and 27% reduction on the average and p90 latency while quadrupling throughput at 0.2% loss in accuracy. Finally, we show that at a minor cost of accuracy, EVStore can reduce the Deep Recommendation System (DRS) memory usage by up to 94%, yielding potentially enormous savings for these costly, pervasive systems.
Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers
2023-10-30 · 6 citations
article · Open access · Senior author
Multi-level erasure coding (MLEC) has seen large deployments in the field, but there is no in-depth study of design considerations for MLEC at scale. In this paper, we provide comprehensive design considerations and analysis of MLEC at scale. We introduce the design space of MLEC in multiple dimensions, including various code parameter selections, chunk placement schemes, and various repair methods. We quantify their performance and durability, and show which MLEC schemes and repair methods can provide the best tolerance against independent/correlated failures and reduce repair network traffic by orders of magnitude. To achieve this, we use various evaluation strategies including simulation, splitting, dynamic programming, and mathematical modeling. We also compare the performance and durability of MLEC with other EC schemes such as SLEC and LRC and show that MLEC can provide high durability with higher encoding throughput and less repair network traffic over both SLEC and LRC.
CNT: Semi-Automatic Translation from CWL to Nextflow for Genomic Workflows
2023-12-04 · 1 citation
article
With the rise of advanced workflow languages for scientific computations, Nextflow has gained increased attention from the bioinformatics community. Nextflow offers native support for advanced parallelism, which can greatly enhance resource utilization and throughput. Still, a significant portion of bioinformatics workflows are developed with the Common Workflow Language (CWL). Transitioning from CWL to Nextflow poses a significant challenge due to the differences in programming models, scripting language compatibilities, and the prerequisite for in-depth knowledge in both languages. To address this challenge, we present CNT, a novel, semi-automated translator converting CWL workflows into Nextflow ones. At its core, CNT uses an automated translation mechanism that converts the CommandLineTool, the most basic unit of CWL, into Nextflow's Process class. This component integrates tool-level conversion, graph dependency analysis, and correctness checks to provide highly automated translation coverage, significantly reducing the development time while satisfying language-specific requirements like building a proper dataflow model when creating workflows. Furthermore, CNT incorporates a module for aiding manual translation. Specifically, it can identify three common JavaScript patterns in CWL workflows, offering further guidance for developers during the translation phase. We evaluated CNT with production-grade workflows and found that it can cover up to 81% of the original workflows, substantially reducing development time. Additionally, transitioning from a cwltool-based system to Nextflow with CNT can result in a 72% speedup and 85% increased CPU utilization.

Recent grants

XPS: CLCCA: LigHTS: Lagging-Hardware Tolerant Systems
NSF · $750k · 2013–2017
CSR: Small: BreezeFS: File System Transformation for Cloud and Multistore Era
NSF · $498k · 2015–2019
PPoSS: Planning: CP2: Towards Systems Correctness Checkability and Performance Predictability at Scale
NSF · $248k · 2020–2022
CSR: Medium: Combating Distributed Concurrency Bugs in Cloud Systems
NSF · $800k · 2016–2021
CAREER: DrCloud: Drill-Ready Cloud Computing
NSF · $449k · 2014–2020

Frequent coauthors

Tanakorn Leesatapornwongsa
Microsoft (United States)
28 shared
Shan Lu
Microsoft (United States)
25 shared
Andrea C. Arpaci-Dusseau
University of Wisconsin–Madison
19 shared
Mingzhe Hao
China Three Gorges University
18 shared
Remzi H. Arpaci-Dusseau
18 shared
Jeffrey F. Lukman
18 shared
Huaicheng Li
Virginia Tech
16 shared
Pallavi Joshi
Amrita Vishwa Vidyapeetham
13 shared

Labs

Haryadi Gunawi Lab · PI

Education

Ph.D., Computer Science
University of Wisconsin, Madison
2009
Postdoctoral Fellow
University of California, Berkeley
2010–2012

Awards & honors

NSF CAREER Award (2014)
NSF Computing Innovation Fellowship (2012)
Google Faculty Research Award (2015)
NetApp Faculty Fellowship (2015)
NetApp Faculty Fellowship (2013)
