
About
Joseph (Joe) Izraelevitz is an Assistant Professor in the Department of Electrical, Computer & Energy Engineering at the University of Colorado Boulder. His research interests include Systems Software, Parallel Programming, and Nonvolatile Memory. He is involved in academic activities within the College of Engineering and Applied Science, contributing to the advancement of knowledge in these technical areas.
Research topics
- Computer Science
- Parallel computing
- Programming language
Selected publications
A Verified High-Performance Composable Object Library for Remote Direct Memory Access
Proceedings of the ACM on Programming Languages · 2026-01-08 · 2 citations
Article · Open access
Remote Direct Memory Access (RDMA) is a memory technology that allows remote devices to directly write to and read from each other's memory, bypassing components such as the CPU and operating system. This enables low-latency high-throughput networking, as required for many modern data centres, HPC applications and AI/ML workloads. However, baseline RDMA comprises a highly permissive weak memory model that is difficult to use in practice and has only recently been formalised. In this paper, we introduce the Library of Composable Objects (LOCO), a formally verified library for building multi-node objects on RDMA, filling the gap between shared memory and distributed system programming. LOCO objects are well-encapsulated and take advantage of the strong locality and the weak consistency characteristics of RDMA. They have performance comparable to custom RDMA systems (e.g. distributed maps), but with a far simpler programming model amenable to formal proofs of correctness. To support verification, we develop a novel modular declarative verification framework, called Mowgli, that is flexible enough to model multinode objects and is independent of a memory consistency model. We instantiate Mowgli with the RDMA memory model, and use it to verify correctness of LOCO libraries.
NMP-PaK: Near-Memory Processing Acceleration of Scalable De Novo Genome Assembly
2025-06-20 · 2 citations
Article · Open access
De novo assembly enables investigations of unknown genomes, paving the way for personalized medicine and disease management. However, it faces immense computational challenges arising from the excessive data volumes and algorithmic complexity. While state-of-the-art de novo assemblers utilize distributed systems for extreme-scale genome assembly, they demand substantial computational and memory resources. They also fail to address the inherent challenges of de novo assembly, including a large memory footprint, memory-bound behavior, and irregular data patterns stemming from complex, interdependent data structures. Given these challenges, de novo assembly merits a custom hardware solution, though existing approaches have not fully addressed the limitations. We propose NMP-PaK, a hardware-software co-designed system that accelerates scalable de novo genome assembly through near-memory processing (NMP). Our channel-level NMP architecture addresses memory bottlenecks while providing sufficient scratchpad space for processing elements. Customized processing elements maximize parallelism while efficiently handling large data structures that are both dynamic and interdependent. Software optimizations include customized batch processing to reduce the memory footprint and hybrid CPU-NMP processing to address hardware underutilization caused by irregular data patterns. NMP-PaK conducts the same genome assembly while incurring a 14× smaller memory footprint compared to the state-of-the-art de novo assembly. Moreover, NMP-PaK delivers 16× and 5.7× performance improvements over the CPU and GPU baselines, respectively, with a 2.4× reduction in memory operations. Consequently, NMP-PaK achieves 8.3× greater throughput than state-of-the-art
Enabling Cost-Efficient LLM Inference on Mid-Tier GPUs With NMP DIMMs
IEEE Computer Architecture Letters · 2025-12-22
Article · Senior author
Large Language Models (LLMs) require substantial computational resources, making cost-efficient inference challenging. Scaling out with mid-tier GPUs (e.g., NVIDIA A10) appears attractive for LLMs, but our characterization shows that communication bottlenecks prevent them from matching high-end GPUs (e.g., 4×A100). Using 16×A10 GPUs, we find the decode stage, which dominates inference runtime, is memory-bandwidth-bound in matrix multiplications, I/O-bandwidth-bound in AllReduce, and underutilizes compute resources. These traits make it well-suited for both DIMM-based Near-Memory Processing (NMP) offloading and communication quantization. Analytical modeling shows that a 16×A10 with NMP DIMMs and INT8 communication quantization can match 4×A100 performance at 30% lower cost and even surpass it under equal cost. These results demonstrate the potential of our approach for cost-efficient LLM inference on mid-tier GPUs.
PrediPrune: Reducing Verification Overhead in Souper with Machine Learning Driven Pruning
ArXiv.org · 2025-09-20
Preprint · Open access · Senior author
Souper is a powerful enumerative superoptimizer that enhances the runtime performance of programs by optimizing LLVM intermediate representation (IR) code. However, its verification process, which relies on a computationally expensive SMT solver to validate optimization candidates, must explore a large search space. This large search space makes the verification process particularly expensive, increasing the burden to incorporate Souper into compilation tools. We propose PrediPrune, a stochastic candidate pruning strategy that effectively reduces the number of invalid candidates passed to the SMT solver. By utilizing machine learning techniques to predict the validity of candidates based on features extracted from the code, PrediPrune prunes unlikely candidates early, decreasing the verification workload. When combined with the state-of-the-art approach (Dataflow), PrediPrune decreases compilation time by 51% compared to the Baseline and by 12% compared to using only Dataflow, emphasizing the effectiveness of the combined approach that integrates a purely ML-based method (PrediPrune) with a purely non-ML-based method (Dataflow). Additionally, PrediPrune offers a flexible interface to trade off compilation time and optimization opportunities, allowing end users to adjust the balance according to their needs.
AutoSSD: CXL-Enhanced Autonomous SSDs for Low Tail Latency
2025-07-20
Article · Open access
All-SSD RAID arrays offer high performance but suffer from long tail latencies due to background processes like garbage collection. Globally scheduling accesses to avoid busy SSDs may mitigate this problem, but such a solution presents challenges in collecting real-time data about SSD performance and tracking the location of redirected blocks.
Baobab Merkle Tree for Efficient Secure Memory
IEEE Computer Architecture Letters · 2024-01-01 · 1 citation
Article
Secure memory is a natural solution to hardware vulnerabilities in memory, but it faces fundamental challenges of performance and memory overheads. While significant work has gone into optimizing the protocol for performance, far less work has gone into optimizing its memory overhead. In this work, we propose the Baobab Merkle Tree, in which counters are memoized in an on-chip table. The Baobab Merkle Tree reduces spatial overhead of a Bonsai Merkle Tree by 2-4X without incurring performance overhead.
Leroy: Library Learning for Imperative Programming Languages
arXiv (Cornell University) · 2024-10-09
Preprint · Open access · Senior author
Library learning is the process of building a library of common functionalities from a given set of programs. Typically, this process is applied in the context of aiding program synthesis: concise functions can help the synthesizer produce modularized code that is smaller in size. Previous work has focused on functional Lisp-like languages, as their regularity makes them more amenable to extracting repetitive structures. Our work introduces Leroy, which extends existing library learning techniques to imperative higher-level programming languages, with the goal of facilitating reusability and ease of maintenance. Leroy wraps the existing Stitch framework for library learning and converts imperative programs into a Lisp-like format using the AST. Our solution uses Stitch to do a top-down, corpus-guided extraction of repetitive expressions. Further, we prune abstractions that cannot be implemented in the programming language and convert the best abstractions back to the original language. We implement our technique in a tool for a subset of the Python programming language and evaluate it on a large corpus of programs. Leroy achieves a compression ratio of 1.04x of the original code base, with a slight expansion when the library is included. Additionally, we show that our technique prunes invalid abstractions.
A Midsummer Night’s Tree: Efficient and High Performance Secure SCM
2024-04-24 · 3 citations
Article · Open access
Secure memory is a highly desirable property to prevent memory corruption-based attacks. The emergence of nonvolatile, storage class memory (SCM) devices presents new challenges for secure memory. Metadata for integrity verification, organized in a Bonsai Merkle Tree (BMT), is cached on-chip in volatile caches, and may be lost on a power failure. As a consequence, care is required to ensure that metadata updates are always propagated into SCM. To optimize metadata updates, state-of-the-art approaches propose lazy update crash consistent metadata schemes. However, few consider the implications of their optimizations on on-chip area, which leads to inefficient utilization of scarce on-chip space. In this paper, we propose A Midsummer Night's Tree (AMNT), a novel "tree within a tree" approach to provide crash consistent integrity with low run-time overhead while limiting on-chip area for security metadata. Our approach offloads the potential hardware complexity of our technique to software to keep area overheads low. Our proposed mechanism results in significant improvements (a 41% reduction in execution overhead on average versus the state-of-the-art) for in-memory storage applications while significantly reducing the required on-chip area to implement our protocol.
Puddles: Application-Independent Recovery and Location-Independent Data for Persistent Memory
2024-04-18 · 2 citations
Article · Open access
In this paper, we argue that current work has failed to provide a comprehensive and maintainable in-memory representation for persistent memory. PM data should be easily mappable into a process address space, shareable across processes, shippable between machines, consistent after a crash, and accessible to legacy code with fast, efficient pointers as first-class abstractions.
Puddles: Application-Independent Recovery and Location-Independent Data for Persistent Memory
arXiv (Cornell University) · 2023-10-03
Preprint · Open access
In this paper, we argue that current work has failed to provide a comprehensive and maintainable in-memory representation for persistent memory. PM data should be easily mappable into a process address space, shareable across processes, shippable between machines, consistent after a crash, and accessible to legacy code with fast, efficient pointers as first-class abstractions. While existing systems have provided niceties like mmap()-based load/store access, they have not been able to support all these necessary properties due to conflicting requirements. We propose Puddles, a new persistent memory abstraction, to solve these problems. Puddles provide application-independent recovery after a power outage; they make recovery from a system failure a system-level property of the stored data rather than the responsibility of the programs that access it. Puddles use native pointers, so they are compatible with existing code. Finally, Puddles implement support for sharing and shipping of PM data between processes and systems without expensive serialization and deserialization. Compared to existing systems, Puddles are at least as fast as and up to 1.34× faster than PMDK while being competitive with other PM libraries across YCSB workloads. Moreover, to demonstrate Puddles' ability to relocate data, we showcase a sensor network data-aggregation workload that results in a 4.7× speedup over PMDK.
Frequent coauthors
- Steven Swanson · Brigham and Women's Hospital · 17 shared
- Michael L. Scott · University of Rochester · 13 shared
- Jian Yang · State Key Laboratory of Chemical Engineering · 10 shared
- Juno Kim · UNSW Sydney · 9 shared
- Tamara Silbergleit Lehman · University of Colorado Boulder · 7 shared
- Morteza Hoseinzadeh · University of California, San Diego · 6 shared
- Terence Kelly · 5 shared
- Gaukas Wang · University of Colorado Boulder · 5 shared