Jason Lowe-Power

Assistant Professor in Computer Science · Verified

University of California, Davis · Electrical and Computer Engineering

Active 2013–2025

h-index: 12

Citations: 799

Papers: 53 (35 in the last 5 years)

Funding: $2.4M


About

Jason Lowe-Power is an Assistant Professor in the Department of Computer Science at UC Davis. His research focuses on computer architecture, high-performance computing, memory system architecture, and accelerator architecture. He aims to develop new hardware, software, and systems that improve the performance and scalability of end-to-end applications such as big-data analytics, especially as the manufacturing advances described by Moore's Law approach their physical limits. Lowe-Power is building an interdisciplinary research group to bridge the gap between architectural advances and important new applications, with an emphasis on cross-layer optimizations from application to hardware. He is actively recruiting motivated Ph.D. students interested in these areas.

Research topics

Computer Science
Operating system
Embedded system
Distributed computing
Parallel computing
Computer hardware
Artificial Intelligence
Mathematical optimization
Computer architecture
Mathematics
Telecommunications
Algorithm
Data science

Selected publications

Characterizing GPU Energy Usage in Exascale-Ready Portable Science Applications
Lecture Notes in Computer Science · 2025-11-23 · 1 citation
book chapter · Open access
Efficient Caching with A Tag-enhanced DRAM
2025-03-01 · 1 citation
article · Senior author
As SRAM-based caches are hitting a scaling wall, manufacturers are integrating DRAM-based caches into system designs to continue increasing cache sizes. While DRAM caches can improve the performance of memory systems, existing DRAM cache designs suffer from high miss penalties, wasted data movement, and interference between misses and demands. In this paper, we propose TDRAM, a novel DRAM microarchitecture tailored for caching. TDRAM enhances existing DRAM, such as HBM3, by adding small, low-latency mats to store tags and metadata on the same die as the data mats. These mats enable tag and data access in lockstep, in-DRAM tag comparison, and conditional data response based on the comparison result (reducing wasted data transfers), akin to SRAM cache mechanisms. TDRAM further optimizes hit and miss latencies through opportunistic early tag probing. Moreover, TDRAM introduces a flush buffer to store conflicting dirty data on write misses, eliminating data bus turnaround delays on write demands. We evaluate TDRAM in a full-system simulation using a set of HPC workloads with large memory footprints, showing that TDRAM, on average, provides 2.65× faster tag checks, 1.23× speedup, and 21% less energy consumption compared to state-of-the-art commercial and research designs.
Portable Targeted Sampling Framework Using LLVM
arXiv (Cornell University) · 2025-09-02
preprint · Open access · Senior author
Evaluating architectural ideas on realistic workloads is increasingly challenging due to the prohibitive cost of detailed simulation and the lack of portable sampling tools. Existing targeted sampling techniques are often tied to specific binaries, incur significant overhead, and make rapid validation across systems infeasible. To address these limitations, we introduce Nugget, a flexible framework that enables portable sampling across simulators, hardware, architectural differences, and libraries. Nugget leverages LLVM IR to perform binary-independent interval analysis, then generates lightweight, cross-platform executable snippets (nuggets) that can be validated natively on real hardware before use in simulation. This approach decouples samples from specific binaries, dramatically reduces analysis overhead, and allows researchers to iterate on sampling methodologies while efficiently validating samples across diverse systems.
Toward Reproducible and Standardized Computer Architecture Simulation with gem5
arXiv.org · 2025-12-15
preprint · Open access · Senior author
Reproducibility in simulation-based computer architecture research requires coordinating artifacts like disk images, kernels, and benchmarks, but existing workflows are inconsistent. We improve gem5, an open-source simulator with over 1600 forks, and gem5 Resources, a centralized repository of over 2000 pre-packaged artifacts, to address these issues. While gem5 Resources enables artifact sharing, researchers still face challenges. Creating custom disk images is complex and time-consuming, with no standardized process across ISAs, making it difficult to extend and share images. gem5 provides limited guest-host communication features through a set of predefined exit events that restrict researchers' ability to dynamically control and monitor simulations. Lastly, running simulations with multiple workloads requires researchers to write custom external scripts to coordinate multiple gem5 simulations which creates error-prone and hard-to-reproduce workflows. To overcome this, we introduce several features in gem5 and gem5 Resources. We standardize disk-image creation across x86, ARM, and RISC-V using Packer, and provide validated base images with pre-annotated benchmark suites (NPB, GAPBS). We provide 12 new disk images, 6 new kernels, and over 200 workloads across three ISAs. We refactor the exit event system to a class-based model and introduce hypercalls for enhanced guest-host communication that allows researchers to define custom behavior for their exit events. We also provide a utility to remotely monitor simulations and the gem5-bridge driver for user-space m5 operations. Additionally, we implemented Suites and MultiSim to enable parallel full-system simulations from gem5 configuration scripts, eliminating the need for external scripting. These features reduce setup complexity and provide extensible, validated resources that improve reproducibility and standardization.
Choreographer: A Full-System Framework for Fine-Grained Tasks in Cache Hierarchies
arXiv.org · 2025-10-30
preprint · Open access
In this paper, we introduce Choreographer, a simulation framework that enables a holistic system-level evaluation of fine-grained accelerators designed for latency-sensitive tasks. Unlike existing frameworks, Choreographer captures all hardware and software overheads in core-accelerator and cache-accelerator interactions, integrating a detailed gem5-based hardware stack featuring an AMBA coherent hub interface (CHI) mesh network and a complete Linux-based software stack. To facilitate rapid prototyping, it offers a C++ application programming interface and modular configuration options. Our detailed cache model provides accurate insights into performance variations caused by cache configurations, which are not captured by other frameworks. The framework is demonstrated through two case studies: a data-aware prefetcher for graph analytics workloads, and a quicksort accelerator. Our evaluation shows that the prefetcher achieves speedups between 1.08x and 1.88x by reducing memory access latency, while the quicksort accelerator delivers more than 2x speedup with minimal address translation overhead. These findings underscore the ability of Choreographer to model complex hardware-software interactions and optimize performance in small task offloading scenarios.
Implications of Full-System Modeling for Superconducting Architectures
2025-11-07
article · Open access · Senior author
As Moore’s Law slows, superconducting electronics offer ultra-low-power, high-speed computation potential. This paper presents the first full-system superconducting architecture modeling in gem5, evaluating superconducting cores, caches, and interconnects under realistic workloads. We extend gem5 with cryogenic semiconductor (4 GHz) and superconducting (100 GHz) RISC-V cores and multi-level caches, evaluating RISC-V benchmarks and SPEC CPU2006 applications. We also integrate SRNoC, a superconducting interconnect, with the NOVA graph accelerator.
TEGRA - Scaling Up Graph Processing with Disaggregated Computing
2025-11-07 · 1 citation
article · Open access
Graph processing workloads continue to grow in scale and complexity, demanding architectures that can adapt to diverse compute and memory requirements. Traditional scale-out accelerators couple compute and memory resources, resulting in resource underutilization when executing workloads with varying compute-to-memory intensities. In this paper, we present TEGRA, a composable, scale-up architecture for large-scale graph processing. TEGRA leverages disaggregated memory via CXL and a message-passing communication model to decouple compute and memory, enabling independent scaling of each. Through detailed evaluation using the gem5 simulator, we show that TEGRA improves memory bandwidth utilization by up to 15% over state-of-the-art accelerators by dynamically provisioning compute based on workload demands. Our results demonstrate that TEGRA provides a flexible and efficient foundation for supporting emerging graph analytics workloads across a wide range of arithmetic intensities.
NOVA: A Novel Vertex Management Architecture for Scalable Graph Processing
2025-03-01 · 2 citations
article
We propose a scalable graph processing hardware accelerator called NOVA that is based on a novel vertex management architecture that decouples the execution of reduction and propagation operations in the popular vertex-centric graph processing paradigm. This allows us to store the working set in off-chip memory and utilize the available on-chip memory as a buffer to hide the latency of DRAM accesses instead of a traditional cache. This overcomes one of the key drawbacks of almost all the prior works, which require temporal partitioning of graphs to scale to large graphs. We develop a cycle-accurate model of the architecture in gem5 and demonstrate that NOVA exhibits near-perfect weak and strong scaling while scaling to large graphs by spatially tiling multiple nodes. In addition, our simulations show that NOVA is 2.35× better than a state-of-the-art graph accelerator (PolyGraph) while using a fraction of the on-chip memory on a synthetic graph with 134M vertices and over 2.14B edges.
FP-Rowhammer: DRAM-Based Device Fingerprinting
2025-08-13
article · Open access
TDRAM: Tag-enhanced DRAM for Efficient Caching
arXiv (Cornell University) · 2024-04-22
preprint · Open access · Senior author
As SRAM-based caches are hitting a scaling wall, manufacturers are integrating DRAM-based caches into system designs to continue increasing cache sizes. While DRAM caches can improve the performance of memory systems, existing DRAM cache designs suffer from high miss penalties, wasted data movement, and interference between misses and demand requests. In this paper, we propose TDRAM, a novel DRAM microarchitecture tailored for caching. TDRAM enhances HBM3 by adding a set of small, low-latency mats to store tags and metadata on the same die as the data mats. These mats enable fast parallel tag and data access, on-DRAM-die tag comparison, and conditional data response based on the comparison result (reducing wasted data transfers), akin to SRAM cache mechanisms. TDRAM further optimizes the hit and miss latencies by performing opportunistic early tag probing. Moreover, TDRAM introduces a flush buffer to store conflicting dirty data on write misses, eliminating turnaround delays on the data bus. We evaluate TDRAM using a full-system simulator and a set of HPC workloads with large memory footprints, showing that TDRAM provides at least 2.6× faster tag checks, 1.2× speedup, and 21% less energy consumption compared to state-of-the-art commercial and research designs.

Recent grants

CCRI: ENS: Collaborative Research: Modernizing gem5: Expanding the Reach of Computer Architecture Simulation
NSF · $1.9M · 2019–2024
CRII: CSR: Programmable Heterogeneous Memory Systems via Multiple Address Spaces and RAM Lake
NSF · $478k · 2019–2022

Frequent coauthors

Ayaz Akram
20 shared
Venkatesh Akella
University of California, Davis
17 shared
Daniel Rodrigues Carvalho
Université de Rennes
14 shared
Marjan Fariborz
13 shared
Mahyar Samani
13 shared
Adrià Armejach
Barcelona Supercomputing Center
13 shared
Binh Thai Pham
University Of Transport Technology
13 shared
Pouya Fotouhi
University of California, Davis
11 shared

Education

PhD, Computer Sciences
University of Wisconsin-Madison
2017
Master of Science, Computer Science
University of Wisconsin-Madison
2015
Bachelor of Science, Computer Science
Georgia Institute of Technology
2010
