Peipei Zhou

Verified

Brown University · Civil Engineering

Active 2007–2026

h-index: 20
Citations: 1.9k
Papers: 93 (61 in the last 5 years)

About

Peipei Zhou is an Assistant Professor of Engineering at Brown University. Her work spans electronic design automation, computer architecture, reconfigurable and heterogeneous computing, compiler design, modeling and optimization for computing systems, chiplet-based systems, sustainable computing, precision medicine, and artificial intelligence. She works across the boundary between computer science and engineering, from physical hardware and operating systems through abstraction and optimization interfaces to software. Her research focuses on the scalable, verifiable co-design of heterogeneous reconfigurable computing systems, so that domain experts can efficiently build customized AI and data-intensive applications such as autonomous physical systems, adaptive intelligent agents, and advanced healthcare technologies. Her recognition includes the CAREER Award from the National Science Foundation and the 10-Year Most Influential Paper Award at ICCAD 2025.

Research topics

  • Computer Science
  • Operating system
  • Artificial Intelligence
  • Embedded system
  • Parallel computing
  • Computer engineering
  • Computer architecture
  • Engineering

Selected publications

  • PHAROS: Pipelined Heterogeneous Accelerators for Real-time Safety-critical Systems With Deadline Compliance

    arXiv (Cornell University) · 2026-04-07

    preprint · Open access · Senior author

    Spatially partitioned heterogeneous accelerators (HAs) are increasingly adopted in embedded systems for their performance and flexibility. Yet most existing HA design frameworks optimize primarily for throughput or quality-of-service (QoS) metrics. They often overlook safety-critical real-time requirements, including hardware support for predictable execution, real-time-aware design space exploration (DSE), and rigorous schedulability analysis. These requirements are essential in safety-critical applications such as smart transportation, where schedulability guarantees directly affect system safety. To address this gap, we present PHAROS, a real-time-centric HA design framework. PHAROS introduces preemption mechanisms and scheduler designs for spatially partitioned HAs under first-in-first-out (FIFO) and earliest-deadline-first (EDF) policies. Leveraging modern real-time theory, we further develop a soft real-time (SRT) schedulability-oriented DSE with objectives and constraints tailored to SRT schedulability. Through comprehensive modeling, analysis, and evaluation across diverse applications, we show that PHAROS's DSE discovers more feasible configurations for a broader range of task sets than throughput-oriented DSE baselines while delivering improved real-time performance. We also provide response-time analyses for the supported scheduling algorithms.

  • FILCO: Flexible Composing Architecture with Real-Time Reconfigurability for DNN Acceleration

    arXiv (Cornell University) · 2026-04-08

    preprint · Open access · Senior author

    With the development of deep neural network (DNN) enabled applications, achieving high hardware resource efficiency on diverse workloads is non-trivial in heterogeneous computing platforms. Prior works discuss dedicated architectures to achieve maximal resource efficiency. However, a mismatch between hardware and workloads always exists across diverse workloads. Other works discuss overlay architectures that can dynamically switch dataflow for different workloads. However, these works are still limited by flexibility granularity and induce significant resource inefficiency. To solve this problem, we propose a flexible composing architecture, FILCO, that can efficiently match diverse workloads to achieve the optimal storage and computation resource efficiency. FILCO can be reconfigured in real time and flexibly composed into a unified accelerator or multiple independent accelerators. We also propose the FILCO framework, including an analytical model with a two-stage DSE that can achieve the optimal design point. We evaluate the FILCO framework on the 7nm AMD Versal VCK190 board. Compared with prior works, our design can achieve 1.3x to 5x throughput and hardware efficiency on diverse workloads.

  • μ-ORCA: Optimizing Acceleration for Microsecond-Scale Deep Neural Network Inference on ACAP

    ArXiv.org · 2026-05-17

    article · Open access · Senior author

    Heterogeneous reconfigurable platforms with tensor cores, such as AMD ACAP, are increasingly adopted for deep neural network (DNN) inference due to their high throughput and flexibility. However, their suitability for microsecond-scale inference on small problem sizes remains underexplored. In jet-tagging applications in high-energy physics, inefficient on-chip communication and large inter-layer latency prevent existing frameworks from meeting the 1-μs latency budget. Moreover, hardware overheads such as synchronization and VLIW processor prologue are often overlooked, making it infeasible to optimize accelerators correctly. To address these problems, we propose μ-ORCA, a customized heterogeneous accelerator framework for ultra-low-latency model inference. μ-ORCA enables direct inter-layer communication between DNN layers on the AIE array, instead of using shared memory tiles or FPGA fabric. Moreover, a 512-bit/cycle cascade connection is applied instead of a 32-bit/cycle DMA connection. μ-ORCA also provides an overhead-aware performance model that adapts to different NN layer sizes, and conducts design space exploration to optimize end-to-end latency. μ-ORCA supports MLP and DeepSets models with non-MM kernels, including bias, ReLU, and global aggregation on AIE. We evaluate μ-ORCA on the AMD ACAP VEK280 platform. Experimental results show that μ-ORCA achieves average latency reduction of >1.70× and >1.83× compared with different state-of-the-art ACAP frameworks, and achieves 0.93 μs latency for a 6-layer real-world DeepSets model, satisfying the latency budget. We open source μ-ORCA at https://github.com/arc-research-lab/u-ORCA.

  • Targeting Regnase-1 unleashes CAR T cell antitumor activity for osteosarcoma and creates a proinflammatory tumor microenvironment

    bioRxiv (Cold Spring Harbor Laboratory) · 2025-05-23 · 2 citations

    preprint · Open access

    Negative regulators of T cell function represent promising targets to enhance the intrinsic antitumor activity of CAR T cells against solid tumors. However, the endogenous immune ecosystem in solid tumors often represents an immunosuppressive therapeutic barrier to CAR T cell therapy, and it is currently unknown whether deletion of negative regulators in CAR T cells reshapes the endogenous immune landscape. To address this knowledge gap, we developed CAR T cells targeting B7-H3 in immune-competent osteosarcoma models and evaluated the intrinsic and extrinsic effects of deleting a potent negative regulator called Regnase-1 (Reg-1). Deletion of Reg-1 not only improved the effector function of B7-H3-CAR T cells but also endowed them with the ability to create a proinflammatory landscape characterized by an influx of IFNγ-producing endogenous T cells and NK cells and a reduction of inhibitory myeloid cells, including M2 macrophages. Thus, deleting negative regulators in CAR T cells enforces a non-cell-autonomous state by creating a proinflammatory tumor microenvironment.

  • Digital discourse: we-media's impact on pragmatic competence in college students

    Interactive Learning Environments · 2025-06-02 · 1 citation

    article · Senior author

  • Ph.D. Project AIM: Accelerating Arbitrary-Precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP

    2025-05-04

    article · Senior author

    Arbitrary-precision integer multiplication serves as the core kernel in many applications such as cryptographic algorithms and scientific computing. To compute arbitrary-precision integer multiplication using low-bit function units (32/64-bit) on existing hardware, decomposition methods like Karatsuba and Schoolbook are usually adopted. In general, the decomposition methods use two steps to finish the calculation. First, they decompose the two large integers into many smaller integers and generate a group of low-bit multiplications that can be calculated in a spatial or sequential manner. Second, the results of low-bit multiplications are shifted and added together to get the final result. The first step involves massive parallel byte-level processing, while the second step requires a long propagation chain, which involves bit-level processing. Prior works have leveraged vector instructions on CPUs, CUDA cores on GPUs, and DSPs on FPGAs to accelerate arbitrary-precision multiplication. We use the state-of-the-art FPGA accelerator and libraries on GPUs and CPUs, and find that the FPGA has the lowest energy efficiency. We identify that the dedicated vector units on CPUs and GPUs bring the biggest energy efficiency in the first computation step. DSPs and LUTs on FPGAs introduce extra energy overhead in the first step compared to dedicated vector units, but they are more suitable for the second computation step. To benefit both steps, we propose the AIM framework to generate efficient arbitrary-precision integer multiplication accelerators on the AMD Versal adaptive compute acceleration platform (ACAP) VCK190, which comprises 400 AI Engine (AIE) ASIC processors, an FPGA, and an ARM CPU. AIM uses 400 AIEs to compute the first step and the FPGA to process the second step. Our experimental results show that AIM achieves up to 12.6x and 2.1x energy efficiency gain over the Intel Xeon Ice Lake 6346 CPU and Nvidia A5000 GPU, respectively, with respect to the multiplication kernel. We open-sourced AIM on GitHub: https://github.com/arc-research-lab/AIM. We also use three different applications, including large integer multiplication, RSA, and Mandelbrot, to demonstrate the usability of AIM.

  • DERCA: DetERministic Cycle-Level Accelerator on Reconfigurable Platforms in DNN-Enabled Real-Time Safety-Critical Systems

    2025-12-02 · 1 citation

    article · Senior author

    Deep neural network (DNN) models are increasingly deployed in real-time, safety-critical systems such as autonomous vehicles, driving the need for specialized AI accelerators. However, most existing accelerators support only non-preemptive execution or limited preemptive scheduling at the coarse granularity of DNN layers. This restriction leads to frequent priority inversion due to the scarcity of preemption points, resulting in unpredictable execution behavior and, ultimately, system failure. To address these limitations and improve the real-time performance of AI accelerators, we propose DERCA, a novel accelerator architecture that supports fine-grained, intra-layer flexible preemptive scheduling with cycle-level determinism. DERCA incorporates an on-chip Earliest Deadline First (EDF) scheduler to reduce both scheduling latency and variance, along with a customized dataflow design that enables intra-layer preemption points (PPs) while minimizing the overhead associated with preemption. Leveraging the limited preemptive task model, we perform a comprehensive predictability analysis of DERCA, enabling formal schedulability analysis and optimized placement of preemption points within the constraints of limited preemptive scheduling. We implement DERCA on the AMD ACAP VCK190 reconfigurable platform. Experimental results show that DERCA outperforms state-of-the-art designs using non-preemptive and layer-wise preemptive dataflows, with less than 5% overhead in worst-case execution time (WCET) and only 6% additional resource utilization. DERCA is open-sourced on GitHub: https://github.com/arc-research-lab/DERCA
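The PHAROS and DERCA abstracts above both rely on earliest-deadline-first (EDF) scheduling. As a rough illustration of the policy itself, here is a minimal textbook sketch of fully preemptive EDF on a single resource, simulated one tick at a time. It is not the scheduler from either paper: DERCA, for example, preempts only at designated preemption points, which this sketch does not model. The `Job` type, its fields, and `edf_schedule` are hypothetical names introduced only for this example.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    release: int   # tick at which the job becomes ready
    deadline: int  # absolute deadline, used as the EDF priority
    work: int      # execution time in ticks (assumed to be >= 1)


def edf_schedule(jobs):
    """Simulate preemptive EDF on one resource and return {name: finish tick}."""
    pending = sorted(jobs, key=lambda j: j.release)
    ready = []   # min-heap of (deadline, name, remaining work)
    finish = {}
    t = 0
    i = 0
    while i < len(pending) or ready:
        # Admit every job that has been released by the current tick.
        while i < len(pending) and pending[i].release <= t:
            job = pending[i]
            heapq.heappush(ready, (job.deadline, job.name, job.work))
            i += 1
        if not ready:
            # Nothing to run: jump to the next release time.
            t = pending[i].release
            continue
        # Run the ready job with the earliest deadline for one tick.
        deadline, name, remaining = heapq.heappop(ready)
        t += 1
        remaining -= 1
        if remaining == 0:
            finish[name] = t
        else:
            heapq.heappush(ready, (deadline, name, remaining))
    return finish


if __name__ == "__main__":
    jobs = [Job("A", release=0, deadline=10, work=4),
            Job("B", release=1, deadline=3, work=2)]
    print(edf_schedule(jobs))
```

In this example job B arrives while A is running and has the tighter deadline, so EDF preempts A and B finishes by its deadline. A first-in-first-out policy would run A to completion first and B would finish late.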
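The AIM abstract above describes a two-step decomposition for arbitrary-precision multiplication: first generate many independent low-bit multiplications, then shift and add the partial results along a carry chain. The sketch below illustrates that structure with the Schoolbook method and 32-bit limbs. It is only an illustration under stated assumptions: Python integers are already arbitrary precision, so the code shows the data flow rather than any speedup, and it does not reflect how AIM maps the two steps onto AIEs and the FPGA. The names `LIMB_BITS`, `to_limbs`, and `schoolbook_multiply` are hypothetical.

```python
LIMB_BITS = 32                      # width of one low-bit operand (assumption)
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(x):
    """Split a non-negative integer into 32-bit limbs, least significant first."""
    limbs = []
    while x:
        limbs.append(x & LIMB_MASK)
        x >>= LIMB_BITS
    return limbs or [0]


def schoolbook_multiply(a, b):
    """Multiply two non-negative integers via limb-wise Schoolbook decomposition."""
    a_limbs, b_limbs = to_limbs(a), to_limbs(b)

    # Step 1: independent low-bit multiplications. Each product depends only
    # on one limb of a and one limb of b, so these could run in parallel.
    partials = [(i + j, ai * bj)
                for i, ai in enumerate(a_limbs)
                for j, bj in enumerate(b_limbs)]

    # Step 2: shift and add. Products are accumulated per limb position,
    # then a single carry chain runs from the lowest position upward.
    columns = [0] * (len(a_limbs) + len(b_limbs))
    for position, product in partials:
        columns[position] += product

    result, carry = 0, 0
    for position, column in enumerate(columns):
        total = column + carry
        result |= (total & LIMB_MASK) << (position * LIMB_BITS)
        carry = total >> LIMB_BITS
    return result


if __name__ == "__main__":
    a = 2**100 + 12345
    b = 2**70 + 987654321
    print(schoolbook_multiply(a, b) == a * b)
```

Step 1 has no dependencies between products, while step 2 is inherently sequential because each carry feeds the next position, which matches the contrast the abstract draws between the two steps.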

Frequent coauthors

  • Jingtong Hu

    25 shared
  • Zhuoping Yang

    University of Pittsburgh

    24 shared
  • Alex K. Jones

    University of Pittsburgh

    23 shared
  • Jason Cong

    UCLA Health

    21 shared
  • Jinming Zhuang

    Brown University

    17 shared
  • Shixin Ji

    16 shared
  • Julius Mugweru

    University of Embu

    15 shared
  • Xiujie Yuan

    Shanghai CASB Biotechnology (China)

    12 shared

Education

  • PhD, Computer Science

    University of California, Los Angeles

    2019
  • Master of Science, Electrical and Computer Engineering

    University of California, Los Angeles

    2014
  • B.S., Chien-Shiung Wu Honor College, EECS track

    Southeast University

    2012

Awards & honors

  • CAREER Award from the National Science Foundation (2026)
  • 10-Year Retrospective Most Influential Paper Award at ICCAD…