Haiying Shen
Verified · University of Virginia · Computer Science
Active 2004–2026
About
Professor Haiying Shen leads research focused on the characterization and adaptive determination of computing platforms for computational and data-enabled science and engineering (CDS&E). Her work addresses the effective use of traditional high performance computing clusters (HPCC) and Hadoop clusters, recognizing their distinct strengths in handling computational problems and large-scale data analysis, respectively. The research explores the integration of Hadoop MapReduce with HPC storage systems, including configurations such as Hadoop+HPCC and Hadoop/HPCC, to optimize platform selection and data placement based on application characteristics, performance objectives, and system cost metrics. This approach aims to enhance the suitability and efficiency of computing platforms for various CDS&E applications, contributing to the advancement of high performance computing systems. The broader impacts of her research include technology transfer to industry partners, dissemination through peer-reviewed publications and software releases, and fostering further research in cyberinfrastructure supporting CDS&E fields. Additionally, Professor Shen's project emphasizes comprehensive training and collaborative research opportunities for students at multiple levels, including underrepresented groups, integrating research outcomes into educational courses to promote STEM discipline engagement.
Research topics
- Computer Science
- Machine Learning
- Data Mining
- Distributed computing
- Computer network
- Simulation
- Engineering
- Operating system
- Mathematics
- Environmental economics
- Real-time computing
- Economics
- Electrical engineering
Selected publications
Development of a reference material for the accurate determination of multiple pesticide residues
Food Chemistry · 2026-02-04
article
ArXiv.org · 2025-02-05
preprint · Open access
Disaggregated Large Language Model (LLM) inference has gained popularity as it separates the computation-intensive prefill stage from the memory-intensive decode stage, avoiding the prefill-decode interference and improving resource utilization. However, transmitting Key-Value (KV) data between the two stages can be a bottleneck, especially for long prompts. Additionally, the computation time overhead for prefill and decode is key for optimizing Job Completion Time (JCT), and KV data size can become prohibitive for long prompts and sequences. Existing KV quantization methods can alleviate the transmission bottleneck and reduce memory requirements, but they introduce significant dequantization overhead, exacerbating the computation time. We propose Homomorphic Acceleration via Compression of the KV cache (HACK) for disaggregated LLM inference. HACK eliminates the heavy KV dequantization step, and directly performs computations on quantized KV data to approximate and reduce the cost of the expensive matrix-multiplication step. Extensive trace-driven experiments show that HACK reduces JCT by up to 70.9% compared to the disaggregated LLM inference baseline and by up to 52.3% compared to state-of-the-art KV quantization methods.
ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving
arXiv (Cornell University) · 2025-02-02
preprint · Open access
Large multimodal models (LMMs) demonstrate impressive capabilities in understanding images, videos, and audio beyond text. However, efficiently serving LMMs in production environments poses significant challenges due to their complex architectures and heterogeneous characteristics across their multi-stage inference pipelines. We present the first comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention, across six representative open-source models, revealing key systems design implications. We also present an in-depth analysis of production LMM inference traces, uncovering unique workload characteristics, including variable, heavy-tailed request distributions and bursty traffic patterns. Based on these insights, we propose ModServe, a modular LMM serving system that decouples stages for independent optimization and adaptive scaling. ModServe dynamically reconfigures stages and handles bursty traffic with modality-aware scheduling and autoscaling to meet tail latency SLOs while minimizing costs. ModServe achieves 3.3-5.5x higher throughput (leading to 25-41.3% cost saving) while meeting SLOs on a 128-GPU cluster with production traces.
MobiRescue: Optimal Dispatching of Rescue Teams Under Flooding Disasters
IEEE Transactions on Mobile Computing · 2025-05-12
article
Effective dispatching of rescue teams under flooding disasters is crucial. However, previous methods are either incapable of handling flooding disaster situations, or cannot accurately estimate the distribution of rescue requests and accordingly adjust the search of the rescue teams. We propose MobiRescue, a human Mobility based Rescue team dispatching system, which aims to maximize the total number of rescued people and to minimize the rescue delay and the number of serving rescue teams. We studied a city-scale human mobility dataset collected during Hurricane Florence, and observed that several natural and demographic factors are closely related to impact severity, and road segment passability must be considered. Accordingly, we first propose a Support Vector Machine based method to predict the distribution of rescue requests considering the disaster-related factors. Then, we design an Euler path based method to determine the search paths for rescue team dispatching. Subsequently, we develop a Reinforcement Learning based method to guide the search of the rescue teams. Finally, we design a multi-objective optimization problem based method to adapt to the changed road segment passability. Our experiments demonstrate that compared with the other methods, MobiRescue increases the total number of timely served rescue requests by 43.4% on average.
Deep Learning Training Job Scheduling for Proactive Straggler Reduction
2025-05-19
article · 1st author · Corresponding
In this work, from our trace-driven experimental measurements, we observed that despite employing homogeneous GPUs for distributed deep learning (DDL) training, stragglers persistently emerge, significantly prolonging time-to-accuracy (TTA) and squandering GPU resources. Previous approaches typically react to stragglers as they occur or are imminent during execution after scheduling, resulting in training delays before they are addressed and introducing additional overhead. To reduce the number of stragglers, this paper introduces a novel DDL training job scheduler for proactive straggler reduction (STRN), the first effort in mitigating stragglers during scheduling. STRN is devised based on our findings that various DDL jobs exhibit distinct sensitivities to straggling and a particular resource's overload, and the optimal synchronization strategy for a job hinges on its specific characteristics and operational environment. Thus, STRN assesses each job's sensitivity to each resource type and straggling. It aims to minimize the likelihood of resource overload for jobs with higher sensitivity to that resource and to ensure that jobs with higher sensitivity to straggling encounter less straggling. STRN initially runs a heuristic method and then transitions to a Reinforcement Learning (RL)-based method once trained, facilitating expedited and optimal job scheduling. STRN also optimizes synchronization strategies for each DDL job to reduce overall TTA. Trace-driven real experiments demonstrate that STRN reduces average TTA by up to 59% and enhances average accuracy by up to 91% compared to state-of-the-art methods. We have made the source code available for distribution.
HERA: Hybrid Edge-cloud Resource Allocation for Cost-Efficient AI Agents
ArXiv.org · 2025-04-01
preprint · Open access
In the realm of AI, large language models (LLMs) like GPT-4, central to the operation of AI agents, predominantly operate in the cloud, incurring high operational costs. With local-based small language models (SLMs) becoming more accurate, the necessity of cloud-exclusive processing is being reconsidered. An AI agent's response to a user's request comprises a series of subtasks or iterations. Existing approaches only allocate a single request between SLM and LLM to ensure their outputs are similar, but adopting this approach in the AI agent scenario for assigning each subtask is not effective since SLM will output a different subsequent subtask, which affects the accuracy of the final output. In this paper, we first conduct experimental analysis to understand the features of AI agent operations. Leveraging our findings, we propose the Adaptive Iteration-level Model Selector (AIMS), a lightweight scheduler to automatically partition AI agent's subtasks between local-based SLM and cloud-based LLM. AIMS considers the varying subtask features and strategically decides the location for each subtask in order to use SLM as much as possible while attaining the accuracy level. Our experimental results demonstrate that AIMS increases accuracy by up to 9.1% and SLM usage by up to 10.8% compared to HybridLLM. It offloads 45.67% of subtasks to a local SLM while attaining similar accuracy on average compared with the cloud-only LLM approach.
Straggler Tolerant and Resilient DL Training on Homogeneous GPUs
ArXiv.org · 2025-12-10
preprint · Open access · Senior author
Despite the popularity of homogeneous GPU-based deep learning (DL) training, the prevalence, causes and impact of stragglers and the effectiveness of existing straggler mitigation approaches are still not well understood in this scenario due to limited research on these questions. To fill this gap, we conducted comprehensive experiments and found that stragglers remain widespread due to CPU and bandwidth usage imbalances. Additionally, existing mitigation methods that switch from synchronous stochastic gradient descent (SSGD) to asynchronous SGD (ASGD) may not improve Time-To-Accuracy (TTA) and can even generate more stragglers due to its higher resource consumption. To address these newly found problems, we propose the Straggler Tolerant And Resilient DL training system (STAR). STAR includes new synchronization modes that group workers for each parameter updating. It has a heuristic and an ML method to choose the optimal synchronization mode for minimizing TTA, and reallocates resources to support the selected mode while minimizing the impact on co-located jobs. Moreover, it proactively prevents stragglers by avoiding overloading the CPU and bandwidth resources in allocating PSs (which consume high CPU and bandwidth) and in gradient transmission. Our trace-driven evaluation on AWS shows that STAR generates 48-84% and 51-70% lower TTA than state-of-the-art systems in the PS and all-reduce architectures, respectively, while maintaining the converged accuracy of SSGD. The code for STAR is open-sourced.
Ensuring Fair LLM Serving Amid Diverse Applications
arXiv (Cornell University) · 2024-11-24
preprint · Open access
In a multi-tenant large language model (LLM) serving platform hosting diverse applications, some users may submit an excessive number of requests, causing the service to become unavailable to other users and creating unfairness. Existing fairness approaches do not account for variations in token lengths across applications and multiple LLM calls, making them unsuitable for such platforms. To address the fairness challenge, this paper analyzes millions of requests from thousands of users on MS CoPilot, a real-world multi-tenant LLM platform hosted by Microsoft. Our analysis confirms the inadequacy of existing methods and guides the development of FairServe, a system that ensures fair LLM access across diverse applications. FairServe proposes application-characteristic aware request throttling coupled with a weighted service counter based scheduling technique to curb abusive behavior and ensure fairness. Our experimental results on real-world traces demonstrate FairServe's superior performance compared to the state-of-the-art method in ensuring fairness. We are actively working on deploying our system in production, expecting to benefit millions of customers world-wide.
QUART: Latency-Aware FaaS System for Pipelining Large Model Inference
2024-07-23 · 3 citations
article
Pipeline parallelism is a key mechanism to ensure the performance of large model serving systems. These systems need to deal with unpredictable online workloads with low latency and high goodput. However, due to the specific characteristics of large models and resource constraints in pipeline parallelism, existing systems struggle to balance resource allocation across pipeline stages. The primary challenge resides in the differential distribution of requests across various stages of the pipeline. We propose QUART, a large model serving system that focuses on optimizing the performance of key stages in pipeline parallelism. QUART dynamically identifies the key stages of the pipeline and introduces an innovative two-level model parameter caching system based on forks to achieve rapid scaling of key stages within seconds. In evaluations with real-world request workloads, QUART reduces average response latency by up to 87.1% and increases goodput by 2.37x compared to the baseline. The experiments demonstrate that QUART effectively reduces tail latency and the average queue length of the pipeline.
PecSched: Preemptive and Efficient Cluster Scheduling for LLM Inference
arXiv (Cornell University) · 2024-09-23
preprint · Open access · Senior author
The scaling of transformer-based Large Language Models (LLMs) has significantly expanded their context lengths, enabling applications where inputs exceed 100K tokens. Our analysis of a recent Azure LLM inference trace reveals a highly skewed long-tail distribution of input lengths, with approximately 80% of inputs shorter than 2K tokens. Long inputs constitute only a small fraction. Existing cluster-level LLM scheduling strategies, including First-In-First-Out (FIFO), reservation-based, and priority-based approaches, primarily target short-input requests with lengths below 2K and fail to address this heterogeneity, leading to inefficiencies such as head-of-line blocking, resource underutilization, and starvation of long-input requests. We propose PecSched, a Preemptive and Efficient Cluster SCHEDuling system for LLM inference. PecSched introduces the following key techniques: 1) preemptive scheduling that prioritizes short-input requests for their performance; 2) coordinated prefill-decode colocation and disaggregation, which reduces both the duration and frequency of preemptions; 3) fast Sequence Parallelism (SP) that minimizes the prefill time of long-input requests to further reduce the likelihood and frequency of preemptions. Evaluations based on Azure LLM inference trace show that, compared to state-of-the-art cluster-level LLM inference schedulers, PecSched reduces the 99th percentile queueing delay of short-input requests by up to 92% and improves their throughput by up to 595%, without significantly affecting the Job Completion Time (JCT) of long-input requests. We open-sourced our code.
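The first PecSched technique listed above, preemptive scheduling that favors short-input requests, can be illustrated with a toy simulation. This is only a sketch of the general idea and not PecSched's implementation: it assumes a single worker, unit time steps, and a fixed service time per request, and it ignores prefill-decode disaggregation and sequence parallelism. The names `Request`, `simulate`, and `SHORT_LIMIT` are hypothetical, and the 2048-token threshold is an assumption loosely based on the "2K" figure in the abstract.

```python
from collections import deque
from dataclasses import dataclass

SHORT_LIMIT = 2048  # assumed short/long boundary in tokens (hypothetical)


@dataclass
class Request:
    name: str
    arrival: int  # arrival time, in abstract time steps
    tokens: int   # input length in tokens
    work: int     # service time in time steps; must be >= 1


def simulate(requests):
    """Single-worker toy scheduler: short-input requests preempt long ones.

    Returns a dict mapping request name to completion time.
    """
    pending = deque(sorted(requests, key=lambda r: r.arrival))
    short_q, long_q = deque(), deque()  # FIFO order within each class
    remaining = {r.name: r.work for r in requests}
    finished = {}
    t = 0
    while pending or short_q or long_q:
        # Admit every request that has arrived by time t.
        while pending and pending[0].arrival <= t:
            r = pending.popleft()
            (short_q if r.tokens < SHORT_LIMIT else long_q).append(r)
        # Re-pick the queue at every step: a waiting short request
        # always wins, which preempts a partially served long request.
        queue = short_q or long_q
        if queue:
            r = queue[0]
            remaining[r.name] -= 1
            if remaining[r.name] == 0:
                queue.popleft()
                finished[r.name] = t + 1
        t += 1
    return finished
```

With one long request arriving at time 0 and two short requests arriving just after it, the short requests finish first and the long request resumes afterwards, whereas plain FIFO would make both short requests wait behind the long one.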
Recent grants
CIF21 DIBBs: PD: Building High-Availability Data Capabilities in Data-Centric Cyberinfrastructure
NSF · $532k · 2017–2020
NSF · $176k · 2009–2011
NEDG: Mechanisms for Efficient and Reliable Routing in Hybrid Wireless Networks
NSF · $221k · 2009–2013
EAGER: GENI Experiments on Pervasive Data Sharing over Heterogeneous Networks
NSF · $137k · 2010–2013
EAGER: An Efficient and Effective Distributed Information System
NSF · $300k · 2013–2016
Frequent coauthors
- 104 shared
Zhuozhao Li
Southern University of Science and Technology
- 75 shared
Jinwei Liu
Chongqing Vocational Institute of Engineering
- 64 shared
Wei Chang
Sichuan University of Arts and Science
- 64 shared
Zhanfeng Zhao
Collaborative Innovation Center of Chemical Science and Engineering Tianjin
- 64 shared
Jiangtao Wang
Communication University of China
- 64 shared
Ning Wang
- 48 shared
Ankur Sarker
University of California, Los Angeles
- 45 shared
Kang Chen
Education
- 2000
B.S., Computer Science and Engineering
Tongji University, China
- 2004
M.S., Computer Engineering
Wayne State University
- 2006
Ph.D., Computer Engineering
Wayne State University
Awards & honors
- TCSC Mid-career Award 2015
- IBM Faculty Award 2015
- NSF CAREER Award 2013
- Sigma Xi Clemson Chapter Young Investigator of the Year Awar…
- Microsoft Faculty Fellowship Award 2010