
Jason Lowe-Power
Assistant Professor in Computer Science · University of California, Davis · Electrical and Computer Engineering
Active 2013–2025
About
Jason Lowe-Power is an Assistant Professor in the Department of Computer Science at UC Davis. His research focuses on computer architecture, high-performance computing, memory system architecture, and accelerator architecture. He aims to develop new hardware, software, and systems to improve performance and scalability of end-to-end applications such as big-data analytics, especially as traditional manufacturing advances like Moore's Law approach their physical limits. Lowe-Power is building an interdisciplinary research group to bridge the gap between architectural advances and important new applications, emphasizing cross-layer optimizations from application to hardware. He is actively recruiting motivated Ph.D. students interested in these areas.
Research topics
- Computer Science
- Operating system
- Embedded system
- Distributed computing
- Parallel computing
- Computer hardware
- Artificial Intelligence
- Mathematical optimization
- Computer architecture
- Mathematics
- Telecommunications
- Algorithm
- Data science
Selected publications
Characterizing GPU Energy Usage in Exascale-Ready Portable Science Applications
Lecture Notes in Computer Science · 2025-11-23 · 1 citation
book-chapter · Open access
Efficient Caching with A Tag-enhanced DRAM
2025-03-01 · 1 citation
article · Senior author
As SRAM-based caches are hitting a scaling wall, manufacturers are integrating DRAM-based caches into system designs to continue increasing cache sizes. While DRAM caches can improve the performance of memory systems, existing DRAM cache designs suffer from high miss penalties, wasted data movement, and interference between misses and demands. In this paper, we propose TDRAM, a novel DRAM microarchitecture tailored for caching. TDRAM enhances existing DRAM, such as HBM3, by adding small, low-latency mats to store tags and metadata on the same die as the data mats. These mats enable tag and data access in lockstep, in-DRAM tag comparison, and conditional data response based on the comparison result (reducing wasted data transfers), akin to SRAM cache mechanisms. TDRAM further optimizes hit and miss latencies through opportunistic early tag probing. Moreover, TDRAM introduces a flush buffer to store conflicting dirty data on write misses, eliminating data bus turnaround delays on write demands. We evaluate TDRAM in a full-system simulation using a set of HPC workloads with large memory footprints, showing that TDRAM, on average, provides $2.65 \times$ faster tag checks, $1.23 \times$ speedup, and 21% less energy consumption compared to state-of-the-art commercial and research designs.
Portable Targeted Sampling Framework Using LLVM
arXiv (Cornell University) · 2025-09-02
preprint · Open access · Senior author
Evaluating architectural ideas on realistic workloads is increasingly challenging due to the prohibitive cost of detailed simulation and the lack of portable sampling tools. Existing targeted sampling techniques are often tied to specific binaries, incur significant overhead, and make rapid validation across systems infeasible. To address these limitations, we introduce Nugget, a flexible framework that enables portable sampling across simulators, hardware, architectural differences, and libraries. Nugget leverages LLVM IR to perform binary-independent interval analysis, then generates lightweight, cross-platform executable snippets (nuggets), that can be validated natively on real hardware before use in simulation. This approach decouples samples from specific binaries, dramatically reduces analysis overhead, and allows researchers to iterate on sampling methodologies while efficiently validating samples across diverse systems.
Toward Reproducible and Standardized Computer Architecture Simulation with gem5
ArXiv.org · 2025-12-15
preprint · Open access · Senior author
Reproducibility in simulation-based computer architecture research requires coordinating artifacts like disk images, kernels, and benchmarks, but existing workflows are inconsistent. We improve gem5, an open-source simulator with over 1600 forks, and gem5 Resources, a centralized repository of over 2000 pre-packaged artifacts, to address these issues. While gem5 Resources enables artifact sharing, researchers still face challenges. Creating custom disk images is complex and time-consuming, with no standardized process across ISAs, making it difficult to extend and share images. gem5 provides limited guest-host communication features through a set of predefined exit events that restrict researchers' ability to dynamically control and monitor simulations. Lastly, running simulations with multiple workloads requires researchers to write custom external scripts to coordinate multiple gem5 simulations which creates error-prone and hard-to-reproduce workflows. To overcome this, we introduce several features in gem5 and gem5 Resources. We standardize disk-image creation across x86, ARM, and RISC-V using Packer, and provide validated base images with pre-annotated benchmark suites (NPB, GAPBS). We provide 12 new disk images, 6 new kernels, and over 200 workloads across three ISAs. We refactor the exit event system to a class-based model and introduce hypercalls for enhanced guest-host communication that allows researchers to define custom behavior for their exit events. We also provide a utility to remotely monitor simulations and the gem5-bridge driver for user-space m5 operations. Additionally, we implemented Suites and MultiSim to enable parallel full-system simulations from gem5 configuration scripts, eliminating the need for external scripting. These features reduce setup complexity and provide extensible, validated resources that improve reproducibility and standardization.
Choreographer: A Full-System Framework for Fine-Grained Tasks in Cache Hierarchies
ArXiv.org · 2025-10-30
preprint · Open access
In this paper, we introduce Choreographer, a simulation framework that enables a holistic system-level evaluation of fine-grained accelerators designed for latency-sensitive tasks. Unlike existing frameworks, Choreographer captures all hardware and software overheads in core-accelerator and cache-accelerator interactions, integrating a detailed gem5-based hardware stack featuring an AMBA coherent hub interface (CHI) mesh network and a complete Linux-based software stack. To facilitate rapid prototyping, it offers a C++ application programming interface and modular configuration options. Our detailed cache model provides accurate insights into performance variations caused by cache configurations, which are not captured by other frameworks. The framework is demonstrated through two case studies: a data-aware prefetcher for graph analytics workloads, and a quicksort accelerator. Our evaluation shows that the prefetcher achieves speedups between 1.08x and 1.88x by reducing memory access latency, while the quicksort accelerator delivers more than 2x speedup with minimal address translation overhead. These findings underscore the ability of Choreographer to model complex hardware-software interactions and optimize performance in small task offloading scenarios.
Implications of Full-System Modeling for Superconducting Architectures
2025-11-07
article · Open access · Senior author
As Moore’s Law slows, superconducting electronics offer ultra-low-power, high-speed computation potential. This paper presents the first full-system superconducting architecture modeling in gem5, evaluating superconducting cores, caches, and interconnects under realistic workloads. We extend gem5 with cryogenic semiconductor (4 GHz) and superconducting (100 GHz) RISC-V cores and multi-level caches, evaluating RISC-V benchmarks and SPEC CPU2006 applications. We also integrate SRNoC, a superconducting interconnect, with the NOVA graph accelerator.
TEGRA - Scaling Up Graph Processing with Disaggregated Computing
2025-11-07 · 1 citation
article · Open access
Graph processing workloads continue to grow in scale and complexity, demanding architectures that can adapt to diverse compute and memory requirements. Traditional scale-out accelerators couple compute and memory resources, resulting in resource underutilization when executing workloads with varying compute-to-memory intensities. In this paper, we present TEGRA, a composable, scale-up architecture for large-scale graph processing. TEGRA leverages disaggregated memory via CXL and a message-passing communication model to decouple compute and memory, enabling independent scaling of each. Through detailed evaluation using the gem5 simulator, we show that TEGRA improves memory bandwidth utilization by up to 15% over state-of-the-art accelerators by dynamically provisioning compute based on workload demands. Our results demonstrate that TEGRA provides a flexible and efficient foundation for supporting emerging graph analytics workloads across a wide range of arithmetic intensities.
NOVA: A Novel Vertex Management Architecture for Scalable Graph Processing
2025-03-01 · 2 citations
article
We propose a scalable graph processing hardware accelerator called NOVA that is based on a novel vertex management architecture that decouples the execution of reduction and propagation operations in the popular vertex-centric graph processing paradigm. This allows us to store the working set in off-chip memory and utilize the available on-chip memory as a buffer to hide the latency of DRAM accesses instead of a traditional cache. This overcomes one of the key drawbacks of almost all the prior works which require temporal partitioning of graphs to scale to large graphs. We develop a cycle-accurate model of the architecture in gem5 and demonstrate that NOVA exhibits near-perfect weak and strong scaling while scaling to large graphs by spatially tiling multiple nodes. In addition, our simulations show that NOVA is $2.35 \times$ better than a state-of-the-art graph accelerator (PolyGraph) while using a fraction of the on-chip memory on a synthetic graph with 134M vertices and over 2.14B edges.
FP-Rowhammer: DRAM-Based Device Fingerprinting
2025-08-13
article · Open access
TDRAM: Tag-enhanced DRAM for Efficient Caching
arXiv (Cornell University) · 2024-04-22
preprint · Open access · Senior author
As SRAM-based caches are hitting a scaling wall, manufacturers are integrating DRAM-based caches into system designs to continue increasing cache sizes. While DRAM caches can improve the performance of memory systems, existing DRAM cache designs suffer from high miss penalties, wasted data movement, and interference between misses and demand requests. In this paper, we propose TDRAM, a novel DRAM microarchitecture tailored for caching. TDRAM enhances HBM3 by adding a set of small low-latency mats to store tags and metadata on the same die as the data mats. These mats enable fast parallel tag and data access, on-DRAM-die tag comparison, and conditional data response based on comparison result (reducing wasted data transfers) akin to SRAM caches mechanism. TDRAM further optimizes the hit and miss latencies by performing opportunistic early tag probing. Moreover, TDRAM introduces a flush buffer to store conflicting dirty data on write misses, eliminating turnaround delays on data bus. We evaluate TDRAM using a full-system simulator and a set of HPC workloads with large memory footprints showing TDRAM provides at least 2.6$\times$ faster tag check, 1.2$\times$ speedup, and 21% less energy consumption, compared to the state-of-the-art commercial and research designs.
Frequent coauthors
- 20 shared
Ayaz Akram
- 17 shared
Venkatesh Akella
University of California, Davis
- 14 shared
Daniel Rodrigues Carvalho
Université de Rennes
- 13 shared
Marjan Fariborz
- 13 shared
Mahyar Samani
- 13 shared
Adrià Armejach
Barcelona Supercomputing Center
- 13 shared
Binh Thai Pham
University Of Transport Technology
- 11 shared
Pouya Fotouhi
University of California, Davis
Education
- 2017
PhD, Computer Sciences
University of Wisconsin-Madison
- 2015
Master of Science, Computer Science
University of Wisconsin-Madison
- 2010
Bachelor of Science, Computer Science
Georgia Institute of Technology