
Gene Cooperman
Northeastern University · Electrical and Energy Engineering
Active 1981–2025
About
Gene Cooperman is a professor at the Khoury College of Computer Sciences and an affiliated faculty member at the College of Engineering at Northeastern University. He has worked in a series of interdisciplinary research areas, including applied mathematics, computational and symbolic algebra, numerical analysis, computing in high energy physics, bioinformatics, high-performance computing, and computer systems. He has graduated 10 PhD students and has co-authored over 125 refereed publications. His ongoing DMTCP (Distributed MultiThreaded Checkpointing) project supports transparent checkpointing with no modification to the target application binary. The project extends support to external hardware and software environments, such as GPUs and network interconnects, to facilitate MPI for high-performance computing. It is being extended and validated for production use in collaboration with the DOE’s NERSC supercomputing center, with plans to deploy on NERSC’s Perlmutter supercomputer. The functionality provided by DMTCP, MANA, and CRAC aims to enable scientists to execute long-running computations by chaining multiple allocation time slots through checkpoint-restart technology, thus working around the current 48-hour maximum on allocation slots. He also leads the High Performance Computing Laboratory within the Khoury College of Computer Sciences, which focuses on computer systems, high-performance computing, and model checking.
Research topics
- Computer Science
- Distributed computing
- Operating system
- Computer Security
- Engineering
- Database
- Software engineering
Selected publications
The Case for ABI Interoperability in a Fault Tolerant MPI
ArXiv.org · 2025-03-14
preprint · Open access · Senior author
There is new momentum behind an interoperable ABI for MPI, which will be a major component of MPI-5. This capability brings true separation of concerns to a running MPI computation. The linking and compilation of an MPI application becomes completely independent of the choice of MPI library. The MPI application is compiled once, and runs everywhere. This ABI allows users to independently choose: the compiler for the MPI application; the MPI runtime library; and, with this work, the transparent checkpointing package. Arbitrary combinations of the above are supported. The result is a "three-legged stool", which supports performance, portability, and resilience for long-running computations. An experimental proof-of-concept is presented, using the MANA checkpointing package and the Mukautuva ABI library for MPI interoperability. The result demonstrates that the combination of an ABI-compliant MPI and transparent checkpointing can bring extra flexibility in portability and dynamic resource management at runtime without compromising performance. For example, an MPI application can execute and checkpoint under one MPI library, and later restart under another MPI library. The work is not specific to the MANA package, since the approach using Mukautuva can be adapted to other transparent checkpointing packages.
The Case for ABI Interoperability in a Fault Tolerant MPI
2025-06-03
article · Senior author
BootSeer: Analyzing and Mitigating Initialization Bottlenecks in Large-Scale LLM Training
ArXiv.org · 2025-07-16
preprint · Open access · Senior author
Large Language Models (LLMs) have become a cornerstone of modern AI, driving breakthroughs in natural language processing and expanding into multimodal jobs involving images, audio, and video. As with most computational software, it is important to distinguish between ordinary runtime performance and startup overhead. Prior research has focused on runtime performance: improving training efficiency and stability. This work focuses instead on the increasingly critical issue of startup overhead in training: the delay before training jobs begin execution. Startup overhead is particularly important in large, industrial-scale LLMs, where failures occur more frequently and multiple teams operate in iterative update-debug cycles. In one of our training clusters, more than 3.5% of GPU time is wasted due to startup overhead alone. In this work, we present the first in-depth characterization of LLM training startup overhead based on real production data. We analyze the components of startup cost, quantify its direct impact, and examine how it scales with job size. These insights motivate the design of Bootseer, a system-level optimization framework that addresses three primary startup bottlenecks: (a) container image loading, (b) runtime dependency installation, and (c) model checkpoint resumption. To mitigate these bottlenecks, Bootseer introduces three techniques: (a) hot block record-and-prefetch, (b) dependency snapshotting, and (c) striped HDFS-FUSE. Bootseer has been deployed in a production environment and evaluated on real LLM training workloads, demonstrating a 50% reduction in startup overhead.
HotSwap: Enabling Live Dependency Sharing in Serverless Computing
2025-07-07
article · Senior author
This work presents HotSwap, a novel provider-side cold-start optimization for serverless computing. This optimization reduces cold-start time when booting and loading dependencies at runtime inside a function container. Previous research has extensively focused on reducing cold-start latency for specific functions. However, little attention has been given to skewed production workloads. In such cases, cross-function optimization becomes essential. Without cross-function optimization, a cloud provider is left with two equally poor options: (i) either the cloud provider gives up optimization for each function in the long tail (which is slow); or (ii) the cloud provider applies function-specific optimizations (e.g., cache function images) to every function in the long tail (which violates the vendor's cache constraints). HotSwap demonstrates cross-function optimization using a novel pre-warming strategy. In this strategy, a pre-initialized live dependency image is migrated to the new function instance. At the same time, HotSwap respects the provider's cache constraints, because a single pre-warmed dependency image in the cache can be shared among all serverless functions that require that image. HotSwap has been tested on seven representative functions from FunctionBench. In those tests, HotSwap accelerates dependency loading for those serverless functions with large dependency requirements by a factor ranging from 2.2 to 3.2. Simulation experiments using Azure traces indicate that HotSwap can save 88% of space, compared with a previous function-specific method, PreBaking, when sharing a dependency image among ten different functions.
Enabling Practical Transparent Checkpointing for MPI: A Topological Sort Approach
2024-09-24 · 1 citation
article · Senior author
MPI is the de facto standard for parallel computing on a cluster of computers. Checkpointing is an important component in any strategy for software resilience and for long-running jobs that must be executed by chaining together time-bounded resource allocations. This work solves an old problem: a practical and general algorithm for transparent checkpointing of MPI that is both efficient and compatible with most of the latest network software. Transparent checkpointing is attractive due to its generality and ease of use for most MPI application developers. Earlier efforts at transparent checkpointing for MPI, one decade ago, faced two difficult problems: (i) reliance on a specific MPI implementation tied to a specific network technology; and (ii) failure to demonstrate sufficiently low runtime overhead. Problem (i) (network dependence) was already solved in 2019 by MANA's introduction of split processes. Problem (ii) (runtime overhead) is solved in this work. This paper introduces an approach that avoids these limitations, employing a novel topological sort to algorithmically determine a safe future synchronization point. The algorithm is valid for both blocking and non-blocking collective communication in MPI. We demonstrate the efficacy and scalability of our approach through both micro-benchmarks and a set of five real-world MPI applications, notably including the widely used VASP (Vienna Ab Initio Simulation Package), which is responsible for 11% of the workload on the Perlmutter supercomputer at Lawrence Berkeley National Laboratory. VASP was previously cited as a special challenge for checkpointing, in part due to its multi-algorithm codes.
HotSwap: Enabling Live Dependency Sharing in Serverless Computing
arXiv (Cornell University) · 2024-09-13
preprint · Open access · Senior author
Enabling Practical Transparent Checkpointing for MPI: A Topological Sort Approach
arXiv (Cornell University) · 2024-08-05
preprint · Open access · Senior author
McMini: A Programmable DPOR-Based Model Checker for Multithreaded Programs
The Art Science and Engineering of Programming · 2023-06-15
article · Open access · Senior author
Context: Model checking has become a key tool for gaining confidence in correctness of multi-threaded programs. Unit tests and functional tests do not suffice because of race conditions that are not discovered by those tests. This problem is addressed by model checking tools. A simple model checker is useful for detecting race conditions prior to production.
Inquiry: Current model checkers hardwire the behavior of common thread operations, and do not recognize application-dependent thread paradigms or functions using simpler primitive operations. This introduces additional operations, causing current model checkers to be excessively slow. In addition, there is no mechanism to model the semantics of the actual thread wakeup policies implemented in the underlying thread library or operating system. Eliminating these constraints can make model checkers faster.
Approach: McMini is an extensible model checker based on DPOR (Dynamic Partial Order Reduction). A mechanism was invented to declare to McMini new, primitive thread operations, typically in 100 lines or less of C code. The mechanism was extended to also allow a user of McMini to declare alternative thread wakeup policies, including spurious wakeups from condition variables.
Knowledge: In McMini, the user defines new thread operations. The user optimizes these operations by declaring to the DPOR algorithm information that reduces the number of thread schedules to be searched. One declares: (i) under what conditions an operation is enabled; (ii) which thread operations are independent of each other; and (iii) when two operations can be considered as co-enabled. An optional wakeup policy is implemented by defining when a wait operation (on a semaphore, condition variable, etc.) is enabled. A new enqueue thread operation is described, allowing a user to declare alternative wakeup policies.
Grounding: McMini was first confirmed to operate correctly and efficiently as a traditional, but extensible model checker for mutex, semaphore, condition variable, and reader-writer lock. McMini's extensibility was then tested on novel primitive operations, representing other useful paradigms for multithreaded operations. An example is readers-and-two-writers. The speed of model checking was found to be five times faster and more, as compared to traditional implementations on top of condition variables. Alternative wakeup policies (e.g., FIFO, LIFO, arbitrary, etc.) were then tested using an enqueue operation. Finally, spurious wakeups were tested with a program that exposes a bug only in the presence of a spurious wakeup.
Importance: Many applications employ functions for multithreaded paradigms that go beyond the traditional mutex, semaphore, and condition variables. They are defined on top of basic operations. The ability to directly define new primitives for these paradigms makes model checkers run faster by searching fewer thread schedules. The ability to model particular thread wakeup policies, including spurious wakeup for condition variables, is also important. Note that POSIX leaves undefined the wakeup policies of pthread_mutex_lock, sem_wait, and pthread_cond_wait. The POSIX thread implementation then chooses a particular policy (e.g., FIFO, arbitrary), which can be directly modeled by McMini.
Implementation-Oblivious Transparent Checkpoint-Restart for MPI
2023-11-10 · 3 citations
article · Open access · Senior author
This work presents experience with traditional use cases of checkpointing on a novel platform. A single codebase (MANA) transparently checkpoints production workloads for major available MPI implementations: “develop once, run everywhere”. The new platform enables application developers to compile their application against any of the available standards-compliant MPI implementations, and test each MPI implementation according to performance or other features.
Implementation-Oblivious Transparent Checkpoint-Restart for MPI
arXiv (Cornell University) · 2023-09-26
preprint · Open access · Senior author
Recent grants
SI2-SSE: Enhancement and Support of DMTCP for Adaptive, Extensible Checkpoint-Restart
NSF · $514k · 2014–2018
MRI: Enabling Research on Terabyte-Scale Datasets
NSF · $199k · 2006–2009
NSF · $408k · 2018–2021
Frequent coauthors
- Saïd Tazi, Université de Pau et des Pays de l'Adour · 52 shared
- Aya Omezzine, Université de Toulouse · 37 shared
- Rohan Garg, Nutanix (United States) · 23 shared
- Kapil Arya, Microsoft (United States) · 21 shared
- Larry Finkelstein, Philadelphia College of Osteopathic Medicine · 16 shared
- Narjès Bellamine Ben Saoud, Manouba University · 15 shared
- Leonard H. Finkelstein · 13 shared
- Daniel Kunkle, Google (United States) · 13 shared
Labs
High Performance Computing Laboratory · PI
Awards & honors
- FY13 TIER 1 Interdisciplinary Research Seed Grants (2012)