
Nikos Hardavellas
Professor of Computer Science · Verified · Northwestern University · Chemical Engineering
Active 1993–2026
About
Nikos Hardavellas is a professor of Computer Science and Computer Engineering at Northwestern University, where he directs the Parallel Architecture Group at Northwestern (PARAG@N). His research focuses on computer architecture, specifically at the intersection of computer architecture with the computer systems stack, including programming languages, compilers, and operating systems. His work also encompasses memory systems, nanophotonics, energy-efficient computing, and quantum computing systems. Hardavellas serves on the Executive Committee of the Northwestern Institute for Quantum Information Research and Engineering (INQUIRE) and the Scientific Advisory Committee of the National Quantum Algorithms Center (NQAC). He has received numerous awards and recognitions, including an NSF CAREER award, being named a Future CRA Leader, and receiving best paper awards at various conferences. Prior to his academic career, he contributed to the design of several generations of Alpha microprocessors and high-end multiprocessor servers at Digital Equipment Corp., Compaq, and Hewlett-Packard.
Research topics
- Computer Science
- Parallel computing
- Programming language
- Artificial Intelligence
- Operating system
- Embedded system
- Database
- Distributed computing
- Algorithm
- Computer engineering
- Computer graphics (images)
Selected publications
Extrapolating Pauli Checks for Expectation Value Estimation on Noisy Quantum Devices
IEEE Transactions on Quantum Engineering · 2026-01-01
article · Open access
Pauli Check Sandwiching (PCS) is an error detection scheme that protects quantum circuits by inserting pairs of parity checks and discarding runs that signal errors. However, each additional check introduces noise and exponentially increases sampling costs. To address these limitations, we propose Pauli Check Extrapolation (PCE), an error mitigation technique that obtains measured expectation values from circuits with different numbers of checks and, analogous to ZNE, extrapolates to the “maximum check” limit, the theoretical number of checks required for unit fidelity. We test linear and exponential ansatzes, deriving the exponential form from the Markovian error model. Benchmarking PCE against ZNE on random Clifford circuits with simulated depolarizing noise shows PCE outperforming ZNE for larger circuits. On real IBM hardware, PCE achieves an accuracy of up to 99.2% (56.2% improvement over baseline), compared to ZNE's 82% accuracy (29.1% improvement over baseline), for 4-qubit circuits. To demonstrate a practical use case, we then apply PCE towards mitigating errors in classical shadow measurements. Our results show that PCE can achieve fidelities greater than the state-of-the-art Robust Shadow estimation, while significantly reducing the number of required samples by eliminating the need for a calibration procedure. We validate these findings on both fully connected topologies and simulated IBM hardware backends.
Practical Machine Learning Autotuning for Large-Scale Collective Communication
IEEE Transactions on Parallel and Distributed Systems · 2026-02-06
article · Senior author
QuantEM: The quantum error management compiler
ArXiv.org · 2025-09-19
preprint · Open access
As quantum computing advances toward fault-tolerant architectures, quantum error detection (QED) has emerged as a practical and scalable intermediate strategy in the transition from error mitigation to full error correction. By identifying and discarding faulty runs rather than correcting them, QED enables improved reliability with significantly lower overhead. Applying QED to arbitrary quantum circuits remains challenging, however, because of the need for manual insertion of detection subcircuits, ancilla allocation, and hardware-specific mapping and scheduling. We present QuantEM, a modular and extensible compiler designed to automate the integration of QED codes into arbitrary quantum programs. Our compiler consists of three key modules: (1) program analysis and transformation module to examine quantum programs in a QED-aware context and introduce checks and ancilla qubits, (2) error detection code integration module to map augmented circuits onto specific hardware backends, and (3) postprocessing and resource management for measurement results postprocessing and resource-efficient estimation techniques. The compiler accepts a high-level quantum circuit, a chosen error detection code, and a target hardware topology and then produces an optimized and executable circuit. It can also automatically select an appropriate detection code for the user based on circuit structure and resource estimates. QuantEM currently supports Pauli check sandwiching and Iceberg codes and is designed to support future QED schemes and hardware targets. By automating the complex QED compilation flow, this work reduces developer burden, enables fast code exploration, and ensures consistent and correct application of detection logic across architectures.
Modular Compilation for Quantum Chiplet Architectures
ArXiv.org · 2025-01-14 · 1 citation
preprint · Open access · Senior author
As quantum computing technology matures, industry is adopting modular quantum architectures to keep quantum scaling on the projected path and meet performance targets. However, the complexity of chiplet-based quantum devices, coupled with their growing size, presents an imminent scalability challenge for quantum compilation. Contemporary compilation methods are not well-suited to chiplet architectures; in particular, existing qubit allocation methods are often unable to contend with inter-chiplet links, which don't necessarily support a universal basis gate set. Furthermore, existing methods of logical-to-physical qubit placement, swap insertion (routing), unitary synthesis, and/or optimization are typically not designed for qubit links of significantly varying latency or fidelity. In this work, we propose SEQC, a hierarchical parallelized compilation pipeline optimized for chiplet-based quantum systems, including several novel methods for qubit placement, qubit routing, and circuit optimization. SEQC attains a 9.3% average increase in circuit fidelity (up to 49.99%). Additionally, owing to its ability to parallelize compilation, SEQC achieves 3.27× faster compilation on average (up to 6.74×) over a chiplet-unaware Qiskit baseline.
Pauli Check Extrapolation for Quantum Error Mitigation
2024-09-15 · 1 citation
article
Pauli Check Sandwiching (PCS) is an error mitigation scheme that uses pairs of parity checks to detect errors in the payload circuit. While increasing the number of check pairs improves error detection, it also introduces additional noise to the circuit and exponentially increases the required sampling size. To address these limitations, we propose a novel error mitigation scheme, Pauli Check Extrapolation (PCE), which integrates PCS with an extrapolation technique similar to Zero-Noise Extrapolation (ZNE). However, instead of extrapolating to the ‘zero-noise’ limit, as is done in ZNE, PCE extrapolates to the ‘maximum check’ limit, the number of check pairs theoretically required to achieve unit fidelity. In this study, we focus on applying a linear model for extrapolation and also derive a more general exponential ansatz based on the Markovian error model. We demonstrate the effectiveness of PCE by using it to mitigate errors in the shadow estimation protocol, particularly for states prepared by the variational quantum eigensolver (VQE). Our results show that this method can achieve higher fidelities than the state-of-the-art Robust Shadow (RS) estimation scheme, while significantly reducing the number of required samples by eliminating the need for a calibration procedure. We validate these findings on both fully-connected topologies and simulated IBM hardware backends.
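As a rough illustration of the linear extrapolation idea described in this abstract, the sketch below fits a straight line to expectation values measured at several check-pair counts and evaluates it at a chosen "maximum check" count. It is a generic least-squares sketch, not the authors' code: the function names, the sample data, and the value of `max_checks` are all hypothetical, and the exponential ansatz mentioned in the abstract is not reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("check counts must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate_to_max_checks(check_counts, expectation_values, max_checks):
    """Evaluate the fitted line at the assumed 'maximum check' count."""
    slope, intercept = fit_line(check_counts, expectation_values)
    return slope * max_checks + intercept


# Hypothetical measurements: expectation value observed with 1, 2, and 3
# check pairs. Real values would come from post-selected circuit runs.
check_counts = [1, 2, 3]
expectation_values = [0.70, 0.78, 0.86]

# max_checks=4 is an assumed limit chosen only for this example.
estimate = extrapolate_to_max_checks(check_counts, expectation_values, max_checks=4)
print(round(estimate, 3))
```

The fit uses only circuits with few checks, which is the point of extrapolating: circuits with many checks are the ones the abstract describes as noisy and expensive to sample.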
Dynamic Resource Allocation with Quantum Error Detection
arXiv (Cornell University) · 2024-08-10
preprint · Open access
Quantum processing units (QPUs) are highly heterogeneous in terms of physical qubit performance. To add even more complexity, drift in quantum noise landscapes has been well-documented. This makes resource allocation a challenging problem whenever a quantum program must be mapped to hardware. As a solution, we propose a novel resource allocation framework that applies Pauli checks. Pauli checks have demonstrated their efficacy at error mitigation in prior work, and in this paper, we highlight their potential to infer the noise characteristics of a quantum system. Circuits with embedded Pauli checks can be executed on different regions of qubits, and the syndrome data created by error-detecting Pauli checks can be leveraged to guide quantum program outcomes toward regions that produce higher-fidelity final distributions. Using noisy simulation and a real QPU testbed, we show that dynamic quantum resource allocation with Pauli checks can outperform state-of-the-art mapping techniques, such as those that are noise-aware. Further, when applied toward the Quantum Approximate Optimization Algorithm, techniques guided by Pauli checks demonstrate the ability to increase circuit fidelity by 11% on average, and up to 33%.
Extrapolating Pauli Checks for Expectation Value Estimation on Noisy Quantum Devices
arXiv (Cornell University) · 2024-06-20
preprint · Open access
Pauli Check Sandwiching (PCS) is an error detection scheme that protects quantum circuits by inserting pairs of parity checks and discarding runs that signal errors. However, each additional check introduces noise and exponentially increases sampling costs. To address these limitations, we propose Pauli Check Extrapolation (PCE), an error mitigation technique that obtains measured expectation values from circuits with different numbers of checks and, analogous to ZNE, extrapolates to the “maximum check” limit, the theoretical number of checks required for unit fidelity. We test linear and exponential ansatzes, deriving the exponential form from the Markovian error model. Benchmarking PCE against ZNE on random Clifford circuits with simulated depolarizing noise shows PCE outperforming ZNE for larger circuits. On real IBM hardware, PCE achieves an accuracy of up to 99.2% (56.2% improvement over baseline), compared to ZNE's 82% accuracy (29.1% improvement over baseline), for 4-qubit circuits. To demonstrate a practical use case, we then apply PCE towards mitigating errors in classical shadow measurements. Our results show that PCE can achieve fidelities greater than the state-of-the-art Robust Shadow estimation, while significantly reducing the number of required samples by eliminating the need for a calibration procedure. We validate these findings on both fully connected topologies and simulated IBM hardware backends.
Pauli Check Sandwiching for Quantum Characterization and Error Mitigation during Runtime
2024-09-15
article
This work presents a novel quantum system characterization and error mitigation framework that applies Pauli check sandwiching (PCS). We motivate our work with prior art in software optimizations for quantum programs like noise-adaptive mapping and multi-programming, and we introduce the concept of PCS while emphasizing design considerations for its practical use. We show that by carefully embedding Pauli checks within a target application (i.e., a quantum circuit), we can learn quantum system noise profiles. Further, PCS combined with multi-programming unlocks non-trivial fidelity improvements.
Generalized Collective Algorithms for the Exascale Era
2023-10-31 · 4 citations
article · Senior author
Exascale supercomputers have renewed the exigence of improving distributed communication, specifically MPI collectives. Previous works accelerated collectives for specific scenarios by changing the radix of the collective algorithms. However, these approaches fail to explore the interplay between modern hardware features, such as multi-port networks, and software features, such as message size. In this paper, we present a novel approach that uses system-agnostic, generalized (i.e., variable-radix) algorithms to capture relevant features and provide broad speedups for upcoming exascale-class supercomputers. We identify hardware commonalities found on announced exascale systems and three omnipresent communication kernels (binomial tree, ring, and recursive doubling) that can be generalized to better leverage these features, creating 10 total implementations. For each kernel, we develop analytical models to intuit algorithm performance with varying radix values. Experiments on the world’s first exascale supercomputer (Frontier at ORNL) and a pre-exascale system (Polaris at ANL) show that our generalized algorithms outperform the baseline open-source and proprietary vendor MPI implementations by a significant margin, up to over 4.5x. We empirically determine optimal algorithms and parameter values, identifying where the analytical models are accurate and where hardware features directly determine performance. Most notably, we show how a single, system-agnostic implementation of a generalized algorithm can optimize for multiple hardware/software features across multiple systems.
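To make "variable radix" concrete, here is a minimal sketch of a textbook k-nomial tree broadcast schedule, the usual radix-k generalization of a binomial tree. It is not taken from the paper: the function name and the rank layout (root at rank 0) are assumptions for illustration, and it only computes who sends to whom in each round rather than calling MPI.

```python
def knomial_broadcast_schedule(num_ranks, radix):
    """Return a list of rounds; each round is a list of (src, dst) sends.

    Rank 0 is assumed to be the root. With radix=2 this reduces to a
    binomial tree; larger radices use fewer rounds but more sends per
    source in each round.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    if radix < 2:
        raise ValueError("radix must be at least 2")

    # Largest power of the radix that is still smaller than num_ranks.
    stride = 1
    while stride * radix < num_ranks:
        stride *= radix

    rounds = []
    while stride >= 1:
        sends = []
        # Ranks that already hold the data are the multiples of stride * radix.
        for src in range(0, num_ranks, stride * radix):
            for j in range(1, radix):
                dst = src + j * stride
                if dst < num_ranks:
                    sends.append((src, dst))
        if sends:
            rounds.append(sends)
        stride //= radix
    return rounds


# Compare round counts for 8 ranks at two radices.
binomial = knomial_broadcast_schedule(8, 2)
quadnomial = knomial_broadcast_schedule(8, 4)
print(len(binomial), len(quadnomial))
```

In this sketch a higher radix shortens the tree while asking each source to issue up to `radix - 1` sends per round, which is the kind of trade-off where multi-port networks and message size would plausibly matter.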
Evaluating Functional Memory-Managed Parallel Languages for HPC using the NAS Parallel Benchmarks
2023-05-01 · 1 citation
article
Functional, memory-managed parallel languages (FMPLs) are a recent innovative approach to shared-memory parallel programming. Despite their rising prevalence in other areas, FMPLs have yet to gain traction in HPC. In this work, we explore the utility of FMPLs for HPC by re-implementing the NAS Parallel Benchmarks in an FMPL. For this study, we ported the benchmarks into the Parallel ML language. We discuss the advantages and disadvantages of using Parallel ML for HPC applications based on our development experience. We compare the performance of our Parallel ML implementation to the existing C/OpenMP version. The FMPL implementations are 1.02× to 5.76× slower compared to OpenMP. Our positive development experience combined with some competitive performance results suggest that FMPLs have the potential to become a viable choice for HPC applications. We conclude by describing our future work to automatically manage distributed memory within an FMPL, creating a compelling new programming model for HPC.
Frequent coauthors
- 28 shared
Babak Falsafi
- 22 shared
Ippokratis Pandis
Amazon (United States)
- 19 shared
Chris Wilkerson
Intel (United States)
- 19 shared
Tor M. Aamodt
University of British Columbia
- 17 shared
Anastasia Ailamaki
- 17 shared
Lieven Eeckhout
Ghent University
- 17 shared
Jared C. Smolens
Oracle (United States)
- 17 shared
Juanita Hoe
University of West London
Labs
Parallel Architecture Group at Northwestern (PARAG@N) · PI
Education
- 2009
PhD, CS
Carnegie Mellon University
- 2006
MSc, CS
Carnegie Mellon University
- 1997
MSc, CS
University of Rochester
- 1995
BSc, CS
University of Crete
Awards & honors
- Future CRA Leader by the Computing Research Association (202…
- NSF CAREER award (2015)
- Best paper awards, nominations and test-of-time awards at HP…
- IEEE Micro Top Picks Award (2010)
- IEEE Micro Top Picks Honorable Mention (2023)