Haryadi Gunawi

Associate Professor of Computer Science

University of Chicago · Computer Science

Active 2003–2025

h-index: 28
Citations: 2.8k
Papers: 94 (20 in the last 5 years)
Funding: $2.7M

About

Professor Haryadi S. Gunawi is a faculty member in the Computer Science Department at the University of Chicago. His research focuses on operating and storage systems, distributed (cloud) systems, and the application of machine learning techniques to storage and systems problems. He leads research groups including UCARE, the Systems Group, and the Chameleon project. Professor Gunawi's work addresses challenges in optimizing storage I/O, improving system reliability and performance, and scaling cloud and distributed systems. He has contributed extensively to the field through numerous publications and has advised multiple PhD students. Prior to joining the University of Chicago, he was a postdoctoral fellow at UC Berkeley from 2010 to 2012 and completed his PhD at the University of Wisconsin-Madison in 2009, where his dissertation on reliable storage systems earned the ACM Doctoral Dissertation Award Honorable Mention and a departmental best thesis award.

Research topics

  • Computer Science
  • Embedded system
  • World Wide Web
  • Operating system

Selected publications

  • Alchemist: Towards the Design of Efficient Online Continual Learning System

    arXiv · 2025-03-03

    preprint · Open access

    Continual learning has become a promising solution to refine large language models incrementally by leveraging user feedback. In particular, online continual learning - iteratively training the model with small batches of user feedback - has demonstrated notable performance improvements. However, the existing practice of separating training and serving processes forces the online trainer to recompute the intermediate results already done during serving. Such redundant computations can account for 30%-42% of total training time. In this paper, we propose Alchemist, to the best of our knowledge, the first online continual learning system that efficiently reuses serving activations to increase training throughput. Alchemist introduces two key techniques: (1) recording and storing activations and KV cache only during the prefill phase to minimize latency and memory overhead; and (2) smart activation offloading and hedging. Evaluations with inputs of varied token length sampled from the ShareGPT dataset show that compared with a separate training cluster, Alchemist significantly increases training throughput by up to 1.72x, reduces memory usage during training by up to 47%, and supports up to 2x more training tokens - all while maintaining negligible impact on serving latency.

  • GPEmu: A GPU Emulator for Faster and Cheaper Prototyping and Evaluation of Deep Learning System Research

    Proceedings of the VLDB Endowment · 2025-02-01

    article · Senior author

    Deep learning (DL) system research is often impeded by the limited availability and expensive costs of GPUs. In this paper, we introduce GPEmu, a GPU emulator for faster and cheaper prototyping and evaluation of deep learning system research without using real GPUs. GPEmu comes with four novel features: time emulation, memory emulation, distributed system support, and sharing support. We support over 30 DL models and 6 GPU models, the largest scale to date. We demonstrate the power of GPEmu by successfully reproducing the main results of nine recent publications and easily prototyping three new micro-optimizations.

  • Heimdall: Optimizing Storage I/O Admission with Extensive Machine Learning Pipeline

    2025-03-26 · 3 citations

    article · Senior author

    This paper introduces Heimdall, a highly accurate and efficient machine learning-powered I/O admission policy for flash storage, designed to operate in a black-box manner. We make domain-specific innovations in various ML stages by introducing accurate period-based labeling, 3-stage noise filtering, in-depth feature engineering, and fine-grained tuning, which together improve the decision accuracy from 67% up to 93%. We perform various deployment optimizations to reach a sub-μs inference latency and a small, 28KB, memory overhead. With 500 unbiased random experiments derived from production traces, we show Heimdall delivers 15-35% lower average I/O latency compared to the state of the art and is up to 2x faster than a baseline. Heimdall is ready for user-level, in-kernel, and distributed deployments.

  • Concierge: Towards Accuracy-Driven Bandwidth Allocation for Video Analytics Applications in Edge Network

    2024-07-07

    article

    When performing inference on sensor data, edge video analytics applications may not always need high-fidelity data, since important information may not appear all the time. Consequently, each edge AI application’s bandwidth demand is highly dynamic. Thus, a shared edge system should dynamically allocate more bandwidth to the applications in need to reach high accuracy at each moment. However, previous bandwidth allocators are ill-suited because they are agnostic to the time-varying impact of bandwidth on each application’s accuracy. This short paper explores a new accuracy-driven approach to bandwidth allocation, which periodically re-allocates bandwidth across edge AI applications based on the sensitivity of each application’s accuracy to its bandwidth share. To examine its practical benefit and technical challenges, we present a concrete accuracy-driven bandwidth allocator called Concierge, which exposes a simple yet efficient interface to estimate each application’s sensitivity to a small change in its bandwidth share. We run Concierge on state-of-the-art video-analytics applications with real video streams and show its early promise in greatly improving the inference accuracy of video analytics.

  • RHIK: Re-configurable Hash-based Indexing for KVSSD

    2023-08-07 · 1 citation

    article

    Key-Value Solid State Drive (KV-SSD), a key addressable SSD technology, promises to simplify storage management for unstructured data and improve system performance with minimal host-side intervention. However, we find that the current state-of-the-art KV-SSD exhibits indexing peculiarities that limit its widespread adoption. Through experiments, we observe that the performance degrades as more data are stored, and the KV-SSD can only store a limited number of key-value pairs even though the amount of data stored on the device is significantly lower than its capacity. We introduce RHIK, a reconfigurable hash-based indexing for KV-SSD, for high performance and high occupancy. We implement our proposed indexing scheme on the open-source KV-SSD emulator that is validated against a real KV-SSD, and demonstrate its effectiveness using real workload traces and synthetic microbenchmarks.

  • Performance Bug Analysis and Detection for Distributed Storage and Computing Systems

    ACM Transactions on Storage · 2023-01-18 · 10 citations

    article

    This article systematically studies 99 distributed performance bugs from five widely deployed distributed storage and computing systems (Cassandra, HBase, HDFS, Hadoop MapReduce and ZooKeeper). We present the TaxPerf database, which collectively organizes the analysis results as over 400 classification labels and over 2,500 lines of bug re-description. TaxPerf is classified into six bug categories (and 18 bug subcategories) by their root causes: resource, blocking, synchronization, optimization, configuration, and logic. TaxPerf can be used as a benchmark for performance bug studies and debug tool designs. Although it is impractical to automatically detect all categories of performance bugs in TaxPerf, we find that an important category of blocking bugs can be effectively solved by analysis tools. We analyze the cascading nature of blocking bugs and design an automatic detection tool called PCatch, which (i) performs program analysis to identify code regions whose execution time can potentially increase dramatically with the workload size; (ii) adapts the traditional happens-before model to reason about software resource contention and performance dependency relationship; and (iii) uses dynamic tracking to identify whether the slowdown propagation is contained in one job. Evaluation shows that PCatch can accurately detect blocking bugs of representative distributed storage and computing systems by observing system executions under small-scale workloads.

  • Towards Continually Learning Application Performance Models

    arXiv (Cornell University) · 2023-10-25

    preprint · Open access

    Machine learning-based performance models are increasingly being used to build critical job scheduling and application optimization decisions. Traditionally, these models assume that data distribution does not change as more samples are collected over time. However, owing to the complexity and heterogeneity of production HPC systems, they are susceptible to hardware degradation, replacement, and/or software patches, which can lead to drift in the data distribution that can adversely affect the performance models. To this end, we develop continually learning performance models that account for the distribution drift, alleviate catastrophic forgetting, and improve generalizability. Our best model was able to retain accuracy, regardless of having to learn the new distribution of data inflicted by system changes, while demonstrating a 2x improvement in the prediction accuracy of the whole data sequence in comparison to the naive approach.

  • EVStore: Storage and Caching Capabilities for Scaling Embedding Tables in Deep Recommendation Systems

    2023-01-27 · 13 citations

    article · Senior author

    Modern recommendation systems, primarily driven by deep-learning models, depend on fast model inferences to be useful. To tackle the sparsity in the input space, particularly for categorical variables, such inferences are made by storing increasingly large embedding vector (EV) tables in memory. A core challenge is that the inference operation has an all-or-nothing property: each inference requires multiple EV table lookups, but if any memory access is slow, the whole inference request is slow. In our paper, we design, implement and evaluate EVStore, a 3-layer EV table lookup system that harnesses both structural regularity in inference operations and domain-specific approximations to provide optimized caching, yielding up to 23% and 27% reduction on the average and p90 latency while quadrupling throughput at 0.2% loss in accuracy. Finally, we show that at a minor cost of accuracy, EVStore can reduce the Deep Recommendation System (DRS) memory usage by up to 94%, yielding potentially enormous savings for these costly, pervasive systems.

  • Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers

    2023-10-30 · 6 citations

    article · Open access · Senior author

    Multi-level erasure coding (MLEC) has seen large deployments in the field, but there is no in-depth study of design considerations for MLEC at scale. In this paper, we provide comprehensive design considerations and analysis of MLEC at scale. We introduce the design space of MLEC in multiple dimensions, including various code parameter selections, chunk placement schemes, and various repair methods. We quantify their performance and durability, and show which MLEC schemes and repair methods can provide the best tolerance against independent/correlated failures and reduce repair network traffic by orders of magnitude. To achieve this, we use various evaluation strategies including simulation, splitting, dynamic programming, and mathematical modeling. We also compare the performance and durability of MLEC with other EC schemes such as SLEC and LRC and show that MLEC can provide high durability with higher encoding throughput and less repair network traffic over both SLEC and LRC.

  • CNT: Semi-Automatic Translation from CWL to Nextflow for Genomic Workflows

    2023-12-04 · 1 citation

    article

    With the rise of advanced workflow languages for scientific computations, Nextflow has gained increased attention from the bioinformatics community. Nextflow offers native support for advanced parallelism, which can greatly enhance resource utilization and throughput. Still, a significant portion of bioinformatics workflows are developed with the Common Workflow Language (CWL). Transitioning from CWL to Nextflow poses a significant challenge due to the differences in programming models, scripting language compatibilities, and the prerequisite for in-depth knowledge in both languages. To address this challenge, we present CNT, a novel, semi-automated translator converting CWL workflows into Nextflow ones. At its core, CNT uses an automated translation mechanism that converts the CommandLineTool, the most basic unit of CWL, into Nextflow's Process class. This component integrates tool-level conversion, graph dependency analysis, and correctness checks to provide highly automated translation coverage, significantly reducing the development time while satisfying language-specific requirements like building a proper dataflow model when creating workflows. Furthermore, CNT incorporates a module for aiding manual translation. Specifically, it can identify three common JavaScript patterns in CWL workflows, offering further guidance for developers during the translation phase. We evaluated CNT with production-grade workflows and found that it can cover up to 81% of the original workflows, substantially reducing development time. Additionally, transitioning from a cwltool-based system to Nextflow with CNT can result in a 72% speedup and 85% increased CPU utilization.

Education

  • Ph.D., Computer Science

    University of Wisconsin-Madison

    2009
  • Postdoctoral Fellow

    University of California, Berkeley

    2010–2012

Awards & honors

  • NSF CAREER Award (2014)
  • NSF Computing Innovation Fellowship (2012)
  • Google Faculty Research Award (2015)
  • NetApp Faculty Fellowship (2015)
  • NetApp Faculty Fellowship (2013)