
Abhinav Bhatele
Associate Professor · Verified · University of Maryland, College Park · Computer Science
Active 2007–2026
About
Abhinav Bhatele is an associate professor in the Department of Computer Science at the University of Maryland, College Park, and the director of the Parallel Software and Systems Group. His research interests broadly encompass systems and networks, with a particular focus on parallel computing and large-scale data analytics. He has published research in areas including parallel programming models and runtimes, network design and simulation, applications of machine learning to parallel systems, parallel deep learning, and the analysis, visualization, modeling, and optimization of the performance of parallel software and systems. He holds a Ph.D. (2010) and an M.S. (2007) from the University of Illinois at Urbana-Champaign, and a B.Tech. (2005) from the Indian Institute of Technology, Kanpur. Bhatele has received numerous honors and awards, including the IEEE TCSC Award for Excellence in Scalable Computing in both the early-career and mid-career categories, the NSF CAREER Award, and the David J. Kuck Outstanding Ph.D. Thesis Award. He has also been recognized with awards from IPDPS, Euro-Par, and LLNL, and has advised several Ph.D. and master's students. His recent recognitions include an IEEE Technical Committee on Scalable Computing award in 2024 and the NSF CAREER Award in 2021.
Research topics
- Computer Science
- Machine Learning
- Artificial Intelligence
- Programming language
- Operating system
- Computer engineering
- Parallel computing
- Algorithm
- Computer network
- Real-time computing
- Mathematics
- Simulation
- Distributed computing
- Theoretical computer science
Selected publications
Speculating Experts Accelerates Inference for Mixture-of-Experts
arXiv (Cornell University) · 2026-03-09
preprint · Open access
Mixture-of-Experts (MoE) models have gained popularity as a means of scaling the capacity of large language models (LLMs) while maintaining sparse activations and reduced per-token compute. However, in memory-constrained inference settings, expert weights must be offloaded to CPU, creating a performance bottleneck from CPU-GPU transfers during decoding. We propose an expert prefetching scheme that leverages currently computed internal model representations to speculate future experts, enabling memory transfers to overlap with computation. Across multiple MoE architectures, we demonstrate that future experts can be reliably predicted by these internal representations. We also demonstrate that executing speculated experts generally maintains downstream task accuracy, thus preserving more effective compute-memory overlap by eliminating the need to re-fetch true router-selected experts. Integrated into an optimized inference engine, our approach achieves up to 14% reduction in time per output token (TPOT) over on-demand loading of experts from CPU memory. For MoEs where speculative execution alone yields suboptimal accuracy, we further examine lightweight estimators that improve expert prediction hit rates, thereby reducing performance degradation. Our code is released as open source at https://github.com/axonn-ai/yalis/tree/offload_prefetch.
Performance-Aligned LLMs for Generating Fast HPC Code
IEEE Transactions on Parallel and Distributed Systems · 2026-03-18
article · Senior author
Optimizing scientific software is a difficult task because codebases are often large and complex, and performance can depend upon several factors including the algorithm, its implementation, and the hardware, among others. Causes of poor performance can originate from disparate sources and be difficult to diagnose. Recent years have seen a multitude of work that uses large language models (LLMs) to assist in software development tasks. However, these tools are trained to model the distribution of code as text, and are not specifically designed to understand performance aspects of code. In this work, we introduce a reinforcement learning based methodology to align the outputs of code LLMs with performance. This allows us to build upon the current code modeling capabilities of LLMs and extend them to generate better performing code. We demonstrate that our fine-tuned model improves the expected speedup of generated code over base models for a set of benchmark tasks from 0.9 to 1.6 for serial code and 1.9 to 4.5 for OpenMP parallel code.
KEET: Explaining Performance of GPU Kernels Using LLM Agents
ArXiv.org · 2026-05-06
article · Open access · Senior author
Performance profiles of GPU kernels generated by tools such as Nsight Compute are rich in detail but are often challenging to interpret. To achieve the best performance possible on a given GPU architecture, kernel developers need to spend significant time analyzing and comparing profiles in the tool's graphical interface to identify and understand kernel performance bottlenecks. Large Language Models (LLMs) have shown promise in understanding complex data and generating natural language explanations. In this paper, we propose the Kernel Execution Explanation Toolkit (KEET), an LLM-based agentic framework for interpreting Nsight Compute profiles to generate useful and data-grounded natural language explanations of performance issues in GPU kernels, and suggestions for optimizations. We evaluate KEET using several CUDA kernels of varying complexity on NVIDIA H100 GPUs. We find that the generated explanations, when provided as context, improve the quality of LLM code optimization and multiple-choice question answering in downstream tasks. We further demonstrate that the tool can be used to interpret performance data from large sets of profiles to improve the quality of optimization suggestions.
Communication-free Sampling and 4D Hybrid Parallelism for Scalable Mini-batch GNN Training
arXiv (Cornell University) · 2026-04-03
preprint · Open access · Senior author
Graph neural networks (GNNs) are widely used for learning on graph datasets derived from various real-world scenarios. Learning from extremely large graphs requires distributed training, and mini-batching with sampling is a popular approach for parallelizing GNN training. Existing distributed mini-batch approaches have significant performance bottlenecks due to expensive sampling methods and limited scaling when using data parallelism. In this work, we present ScaleGNN, a 4D parallel framework for scalable mini-batch GNN training that combines communication-free distributed sampling, 3D parallel matrix multiplication (PMM), and data parallelism. ScaleGNN introduces a uniform vertex sampling algorithm, enabling each process (GPU device) to construct its local mini-batch, i.e., subgraph partitions, without any inter-process communication. 3D PMM enables scaling mini-batch training to much larger GPU counts than vanilla data parallelism with significantly lower communication overheads. We also present additional optimizations to overlap sampling with training, reduce communication overhead by sending data in lower precision, kernel fusion, and communication-computation overlap. We evaluate ScaleGNN on five graph datasets and demonstrate strong scaling up to 2048 GPUs on Perlmutter, 2048 GCDs on Frontier, and 1024 GPUs on Tuolumne. On Perlmutter, ScaleGNN achieves 3.5x end-to-end training speedup over the SOTA baseline on ogbn-products.
Understanding and Improving Communication Performance in Multi-node LLM Inference
2026-05-22
article · Open access · Senior author
As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Because all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9×–3.6× lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72× reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.
VAHRM: Variation-Aware Resource Management in Heterogeneous Supercomputing Systems
IEEE Transactions on Parallel and Distributed Systems · 2025-06-12 · 1 citation
article · Open access
In this paper, we propose a novel resource management technique for heterogeneous supercomputing systems affected by manufacturing variability. Our proposed technique, called VAHRM (Variation-Aware Heterogeneous Resource Management), takes a holistic approach to job scheduling on highly heterogeneous computing resources. VAHRM preferentially allocates energy-efficient computing resources to an energy-consuming job in a job queue, considering the impact on both the job turnaround time and the power consumption of individual resources. Furthermore, we have developed a novel approach to modeling the power consumption of computing resources that have manufacturing variability. Our approach, called TSMVA (Two-Stage Modeling with Variation Awareness), enables us to generate the first variation-aware GPU power models, which can correctly estimate the power consumption of each GPU for a given job. Our experimental results show that, compared to conventional first-come-first-serve (FCFS) and state-of-the-art variation-aware scheduling algorithms, VAHRM can improve system energy efficiency by up to 5.8% and 5.4% (4.5% and 4.2% on average) while reducing the average turnaround time by 21.2% and 11.9%, respectively, for various workloads obtained from a production system.
The Big Send-off: Scalable and Performant Collectives for Deep Learning
ArXiv.org · 2025-04-25
preprint · Open access · Senior author
Collective communication is becoming increasingly important in data center and supercomputer workloads with an increase in distributed AI related jobs. However, existing libraries that provide collective support such as NCCL, RCCL, and Cray-MPICH exhibit several performance and scalability limitations on modern GPU supercomputers. To address these challenges, we introduce the Performant Collective Communication Library (PCCL), specifically targeted for distributed deep learning (DL) workloads. PCCL provides highly optimized implementations of key collectives used in distributed DL: all-gather, reduce-scatter, and all-reduce. PCCL uses a hierarchical design with learning-based adaptive selection of the best performing algorithms to scale efficiently to thousands of GPUs. It achieves substantial performance speedups over RCCL on 2048 GCDs of Frontier: up to 168x for reduce-scatter, 33x for all-gather and 10x for all-reduce. More modest but still significant gains up to 5.7x over NCCL are observed on Perlmutter. These gains translate directly to performance improvement of production DL workloads: up to 4.9x speedup over RCCL in DeepSpeed ZeRO-3 training, and up to 2.4x speedup in DDP training.
Understanding and Improving Communication Performance in Multi-node LLM Inference
ArXiv.org · 2025-11-12
preprint · Open access · Senior author
As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Because all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9×–3.6× lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72× reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.
Recent grants
CAREER: Self-tuning Parallel Software and Systems
NSF · $619k · 2021–2027
Frequent coauthors
- Laxmikant V. Kalé · University of Illinois Urbana-Champaign · 78 shared
- Frank Olaf Sem-Jacobsen · 66 shared
- Olav Lysne · 64 shared
- David Padua · 62 shared
- Eric J. Bohm · University of Illinois Urbana-Champaign · 47 shared
- Nikhil Jain · University of Maryland, College Park · 41 shared
- Todd Gamblin · 40 shared
- Maurice Herlihy · 34 shared
Awards & honors
- Early Career Alumni Award Dept. of Computer Science, Univ. o…
- IEEE TCSC Award for Excellence in Scalable Computing for Mid…
- NSF CAREER (2021)
- LLNL Early and Mid-Career Recognition Award (2018)
- IPDPS Best Paper Award (2016)