Resume-aware faculty matching

Find professors who actually fit you

Upload your resume. Four AI agents analyze your background, rank the faculty who fit, inspect their recent research, and help you draft outreach — grounded in their actual work, not templates.

Josep Torrellas

Saburo Muroga Professor · Verified

University of Illinois Urbana-Champaign · Computer Science

Active 1988–2026

h-index: 64
Citations: 14.0k
Papers: 643 (66 last 5y)
Funding: $16.3M (2 active)

About

Josep Torrellas is a Saburo Muroga Professor at the University of Illinois Urbana-Champaign, affiliated with the Siebel School of Computing and Data Science. His research interests include low power design, hardware and software reliability, parallel processing, shared-memory multiprocessors, and computer architecture. He has made pioneering contributions to shared-memory multiprocessor architectures and thread-level speculation, which have been recognized through numerous awards including the IEEE Computer Society Harry H. Goode Memorial Award in 2021, the IEEE Computer Society Technical Achievement Award in 2015, and fellowships from the American Association for the Advancement of Science, ACM, and IEEE.

Research topics

  • Computer Science
  • Operating system
  • Computer Security
  • Programming language
  • Embedded system
  • Parallel computing

Selected publications

  • HEAL: Online Incremental Recovery for Leaderless Distributed Systems Across Persistency Models

    Open MIND · 2026-02-09

    preprint · Senior author

    Ensuring resilience in distributed systems has become an acute concern. In today's environment, it is crucial to develop light-weight mechanisms that recover a distributed system from faults quickly and with only a small impact on the live-system throughput. To address this need, this paper proposes a new low-overhead, general recovery scheme for modern non-transactional leaderless distributed systems. We call our scheme HEAL. On a node failure, HEAL performs an optimized online incremental recovery. This paper presents HEAL's algorithms for settings with Linearizable consistency and different memory persistency models. We implement HEAL on a 6-node Intel cluster. Our experiments running TAOBench workloads show that HEAL is very effective. HEAL recovers the cluster in 120 milliseconds on average, while reducing the throughput of the running workload by an average of 8.7%. In contrast, a conventional recovery scheme for leaderless systems needs 360 seconds to recover, reducing the throughput of the system by 16.2%. Finally, compared to an incremental recovery scheme for a state-of-the-art leader-based system, HEAL reduces the average recovery latency by 20.7x and the throughput degradation by 62.4%.

  • Towards CXL Resilience to CPU Failures

    Open MIND · 2026-02-09

    preprint · Senior author

    Compute Express Link (CXL) 3.0 and beyond allows the compute nodes of a cluster to share data with hardware cache coherence and at the granularity of a cache line. This enables shared-memory semantics for distributed computing, but introduces new resilience challenges: a node failure leads to the loss of the dirty data in its caches, corrupting application state. Unfortunately, the CXL specification does not consider processor failures. Moreover, when a component fails, the specification tries to isolate it and continue application execution; there is no attempt to bring the application to a consistent state. To address these limitations, this paper extends the CXL specification to be resilient to node failures, and to correctly recover the application after node failures. We call the system ReCXL. To handle the failure of nodes, ReCXL augments the coherence transaction of a write with messages that propagate the update to a small set of other nodes (i.e., Replicas). Replicas save the update in a hardware Logging Unit. Such replication ensures resilience to node failures. Then, at regular intervals, the Logging Units dump the updates to memory. Recovery involves using the logs in the Logging Units to bring the directory and memory to a correct state. Our evaluation shows that ReCXL enables fault-tolerant execution with only a 30% slowdown over the same platform with no fault-tolerance support.

  • HEAL: Online Incremental Recovery for Leaderless Distributed Systems Across Persistency Models

    ArXiv.org · 2026-02-09

    article · Open access · Senior author

  • PhasedStore: Supporting High-Performance Write-Through Cache-Coherence Protocols Under TSO

    2026-01-31

    article · Senior author

    Current multiprocessors that support the total store order (TSO) memory consistency model invariably use writeback (WB) cache-coherence protocols. When their hardware needs to issue write-through (WT) stores as in uncached operations, their performance may suffer: writes to main memory have to be fully serialized, potentially forcing the program to observe the full latency of round trips to memory. To solve this problem, this paper presents a novel architecture that supports high-performance cache-coherent WT stores under TSO. The architecture, called PhasedStore, extends the store queue in the cores and the directory. Individual WT stores fully overlap with other stores and still satisfy TSO. PhasedStore is useful in environments that require a WT cache-coherence protocol. This can be the case in resilience-critical platforms where node failures should not cause the loss of shared program state, or platforms with CPUs and accelerators where programs follow a producer-consumer pattern. This paper evaluates PhasedStore in the first environment, namely a CXL-based distributed shared-memory platform where shared data in the program uses a WT protocol to enable recovery. Our evaluation shows that PhasedStore is very effective. Compared to using the conventional approach to implement WT under TSO, PhasedStore reduces the average execution time of a set of parallel applications by 1.88x.

  • Towards CXL Resilience to CPU Failures

    arXiv (Cornell University) · 2026-02-09

    article · Open access · Senior author

  • GRANII: Selection and Ordering of Primitives in GRAph Neural Networks using Input Inspection

    2026-01-31

    article

    Over the years, many frameworks and optimization techniques have been proposed to accelerate graph neural networks (GNNs). In contrast to the optimizations explored in these systems, we observe that different matrix re-associations of GNN computations lead to novel input-sensitive performance behavior. We leverage this observation to propose GRANII, a system that exposes different compositions of sparse and dense matrix primitives based on different matrix re-associations of GNN computations and selects the best among them based on input attributes. GRANII executes in two stages: (1) an offline compilation stage that enumerates all valid re-associations leading to different sparse-dense matrix compositions and uses input-oblivious pruning techniques to prune away clearly unprofitable candidates, and (2) an online runtime system that explores the remaining candidates and uses lightweight cost models to select the best re-association based on the input graph and the embedding sizes. On a wide range of configurations, GRANII achieves a geo-mean speedup of 1.56× for inference and 1.4× for training across multiple GNN models and systems. We also show GRANII’s technique functions on diverse implementations and with techniques such as sampling.

  • AccelFlow: Orchestrating an On-Package Ensemble of Fine-Grained Accelerators for Microservices

    2026-01-31

    article · Senior author

    Microservices suffer from the execution of auxiliary operations known as datacenter tax, such as RPC and TCP processing, and data (de)serialization, (de)encryption, and (de)compression. To minimize this tax, multiple hardware accelerators have been proposed. However, it is unclear how these accelerators should be orchestrated. Past work has focused only on orchestrating accelerators in coarse-grained environments with monolithic applications. In this paper, we characterize the needs of orchestrating an ensemble of on-package accelerators in microservice environments. We observe that orchestration frameworks need to be highly dynamic and nimble. The basic operations to be accelerated are fine grained, potentially taking only tens of µs. Moreover, the sequence of accelerators to use is often affected by “branch conditions” whose real-time resolution determines the set of subsequent accelerators needed. To address these challenges, we present AccelFlow, the first orchestration framework for on-package accelerators of microservices. In AccelFlow, CPU cores build software structures called Traces that contain sequences of accelerators to call. A core enqueues a trace in an accelerator in user mode and, from then on, the accelerators in the trace execute in sequence without CPU involvement. A trace can include branch conditions whose outcomes determine the trace control flow. Compared to state-of-the-art accelerator orchestrators, AccelFlow on average reduces P99 tail latency by 70%, reduces average latency by 38%, and increases throughput by 120%.

  • MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training

    2025-06-20 · 1 citation

    article · Open access · Senior author

    In distributed training of large DNN models, the scalability of one-dimensional (1D) tensor parallelism (TP) is limited because of its high communication cost. 2D TP attains extra scalability and efficiency because it reduces communication relative to 1D TP. Unfortunately, existing algorithms for general matrix multiplication (GeMM) in 2D TP suffer from inefficiencies. Indeed, Cannon's algorithm incurs high traffic, SUMMA suffers from high synchronization overhead, and a 2D GeMM with collective communication operations does not overlap communication with computation. In addition, it is difficult to optimize the numerous parameters of 2D TP, including the dataflow, mesh shape, and sharding. As a result, human experts are needed to find efficient configurations of 2D TP. To address these problems, this paper proposes MeshSlice, a novel 2D GeMM algorithm for efficient 2D TP in distributed DNN training. The MeshSlice algorithm slices the collective communications into multiple partial collectives that allow overlapping communication with computation. As a result, MeshSlice hides most of the communication latency. We also present the MeshSlice LLM autotuner, which automates finding an efficient 2D GeMM dataflow configuration, mesh shape, and communication granularity for Large Language Model (LLM) training using analytical cost models. To evaluate MeshSlice, we simulate TPUv4 clusters training LLM models. We show that MeshSlice maintains good efficiency up to at least 256-way 2D TP. In a cluster of 256 TPUs, MeshSlice trains the GPT-3 and Megatron-NLG models 12.0% and 23.4% faster, respectively, than the state-of-the-art algorithm.

  • Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems

    ArXiv.org · 2025-11-28

    preprint · Open access · Senior author

    Low-Rank Adaptation (LoRA) has become the de facto method for parameter-efficient fine-tuning of large language models (LLMs), enabling rapid adaptation to diverse domains. In production, LoRA-based models are served at scale, creating multi-tenant environments with hundreds of adapters sharing a base model. However, state-of-the-art serving systems co-batch heterogeneous adapters without accounting for rank (size) variability, leading to severe performance skew, which ultimately requires adding more GPUs to satisfy service-level objectives (SLOs). Existing optimizations, focused on loading, caching, and kernel execution, ignore this heterogeneity, leaving GPU resources underutilized. We present LoRAServe, a workload-aware dynamic adapter placement and routing framework designed to tame rank diversity in LoRA serving. By dynamically rebalancing adapters across GPUs and leveraging GPU Direct RDMA for remote access, LoRAServe maximizes throughput and minimizes tail latency under real-world workload drift. Evaluations on production traces from Company X show that LoRAServe elicits up to 2× higher throughput, up to 9× lower TTFT, while using up to 50% fewer GPUs under SLO constraints compared to state-of-the-art systems.

  • COGNATE: Acceleration of Sparse Tensor Programs on Emerging Hardware using Transfer Learning

    ArXiv.org · 2025-05-31

    preprint · Open access

    Sparse tensor programs are essential in deep learning and graph analytics, driving the need for optimized processing. To meet this demand, specialized hardware accelerators are being developed. Optimizing these programs for accelerators is challenging for two reasons: program performance is highly sensitive to variations in sparse inputs, and early-stage accelerators rely on expensive simulators. Therefore, ML-based cost models used for optimizing such programs on general-purpose hardware are often ineffective for early-stage accelerators, as they require large datasets for proper training. To this end, we introduce COGNATE, a novel framework that leverages inexpensive data samples from general-purpose hardware (e.g., CPUs) to train cost models, followed by few-shot fine-tuning on emerging hardware. COGNATE exploits the homogeneity of input features across hardware platforms while effectively mitigating heterogeneity, enabling cost model training with just 5% of the data samples needed by accelerator-specific models to achieve comparable performance. We conduct extensive experiments to demonstrate that COGNATE outperforms existing techniques, achieving average speedups of 1.47x (up to 5.46x) for SpMM and 1.39x (up to 4.22x) for SDDMM.

Education

  • Ph.D., Computer Science

    University of California, Berkeley

    1992
  • M.S., Computer Science

    University of California, Berkeley

    1988
  • B.S., Computer Science

    University of Barcelona

    1985

Awards & honors

  • IEEE Computer Society Harry H. Goode Memorial Award (2021)
  • Fellow of the American Association for the Advancement of Sc…
  • IEEE Computer Society Technical Achievement Award (2015)
  • High-Impact Paper Award, International Conference on Compute…
  • ACM Fellow (2010)