Scott Beamer

Assistant Professor

University of California, Santa Cruz · Electrical and Computer Engineering

Active 1992–2025

h-index: 16
Citations: 2.1k
Papers356 last 5y

About

I design architectures for data-intensive applications, with a focus on improving communication efficiency. I am interested in computer architecture, agile and open-source hardware design, graph processing, and data movement optimization. I lead the Vertical Architectures, Memory, and Algorithms (VAMA) group, and we are part of the Hardware Systems Collective.

Research topics

  • Computer Science
  • Computer architecture
  • Embedded system
  • Operating system
  • Mathematics
  • Theoretical computer science
  • Software engineering
  • Distributed computing
  • Parallel computing
  • Engineering

Selected publications

  • SORS: Accelerating RTL simulation with a vertically-integrated approach

    UPCommons institutional repository (Universitat Politècnica de Catalunya) · 2025-07-03

    other · 1st author · Corresponding

    Simulation is a critical tool for hardware design but its current slow speed often bottlenecks the entire design process. Simulation speed becomes even more crucial for agile and open-source hardware design methodologies, because the designers not only want to iterate on designs quicker, but they may also have less resources with which to simulate them. In this work, we survey our various techniques for accelerating hardware (RTL) simulation. We explore the challenge of efficiently detecting opportunities for reuse due to low activity factors, and we demonstrate streamlined techniques to profitably exploit them. We take advantage of the replication that is common in large designs to increase scalability. We parallelize simulation for multicore and achieve superlinear speedups. Throughout our work, we leverage insights about both the application workload and the host platform. Many of our innovations are enabled by novel graph partitioning algorithms or optimizations for the host processor. Our simulators outperform both leading open-source and industrial simulators, and we use performance counters to analyze our performance advantages.

  • miniGiraffe: A Pangenomic Mapping Proxy App

    2025-10-12

    article · Senior author

    Large, real-world scientific applications are often complex, making them difficult to analyze, characterize, and optimize. Such applications typically involve intricate I/O patterns and library dependencies, which can make workflow analysis and tuning difficult. Proxy applications offer a practical solution by emulating the essential characteristics of the original application while significantly reducing complexity. In this work, we present miniGiraffe, a proxy application for Giraffe, a sophisticated genomics tool that operates over a pangenome, a graph-based structure capturing genetic variation across a species. We develop miniGiraffe using a principled methodology: carefully characterizing Giraffe’s behavior and validating that our proxy faithfully reproduces its key computational features. miniGiraffe contains only 2% of Giraffe’s codebase, while producing identical outputs for the most computationally intensive code components and closely matches Giraffe’s execution time and scaling behavior in these regions. The simplified design of miniGiraffe enabled rapid experimentation across multiple architectures, which we utilize to perform an autotuning experiment of the mapping workflow; we found that specializing parameters to inputs and architectures provided a geometric mean speedup of 1.15× and up to a 3.32× speedup over the default parameters.

  • PivotScale: A Holistic Approach for Scalable Clique Counting

    2025-06-03

    article · Senior author

    Counting cliques of size $k$ (k-cliques) in a graph is an important problem in graph pattern mining. Due to a combinatorial explosion in the amount of work, counting large cliques in real-world networks is challenging, as leading parallel approaches become untenable for even modestly large clique sizes (e.g. $k=10$). Pivoter, a recent algorithm, is able to scale to much larger clique sizes due to its superior algorithmic complexity. While it has leading single-thread performance, its naive parallel implementation results in poor parallel speedups. Efforts to optimize its parallel performance on CPUs are absent in the current literature. We present PivotScale, a scalable approach to accelerate exact clique counting. Our approach scales with both the number of cores as well as the clique size $k$. During the initial ordering phase, we introduce a heuristic to select which parallel ordering approach will result in the fastest overall execution time. In the subsequent counting phase, we increase scalability by reducing memory usage. Our high-performance parallel implementation outperforms prior work and demonstrates near-linear parallel scaling for up to 64 threads on large real-world social networks.

  • Scala defined hardware generators for Chisel

    Microprocessors and Microsystems · 2025-07-21 · 1 citation

    article · Open access · Senior author

    We describe digital hardware designs in hardware description languages such as VHDL and SystemVerilog. Both languages were developed in the 1980s and, although regularly updated, are still in the style of their time. They lack the constructs to write more configurable generators than just the number of bits for an operation. Based on Scala, Chisel is a hardware construction language that helps to write hardware generators. Hardware generators are not a new idea. Scripting languages, such as Perl and TCL, are often used to generate VHDL or Verilog code from other sources of system description. However, mixing two languages and embedding VHDL or Verilog strings in generator code is not scalable. As Chisel is embedded in Scala, we can write the generators using the same language/environment as we use to describe the digital logic. This paper explores different examples and patterns to describe parameterizable hardware generators. We are confident that practices from software development can improve the productivity of hardware designers to build and test the next billion transistor chips.

  • Teaching Agile Hardware Design with Chisel

    2024-08-28 · 2 citations

    article · Open access · 1st author · Corresponding

    Agile hardware design techniques take the best of software engineering methods and apply them to improve hardware design productivity. Agile approaches not only reduce the time to solution, but they can also produce solutions which are better tailored to their target problems. Chisel provides the perfect vehicle to teach these techniques as it allows for the creation of reusable hardware generators. In this work, we outline our experiences creating an agile hardware design course using Chisel, and the lessons learned from teaching it four times. All of the course materials are available as open source.

  • Don't Repeat Yourself! Coarse-Grained Circuit Deduplication to Accelerate RTL Simulation

    2024-04-27 · 5 citations

    article · Open access · Senior author

    Designing a digital integrated circuit requires many register transfer level (RTL) simulations for design, debugging, and especially verification. To cope with the slow speed of RTL simulation, industry frequently uses private server farms to run many simulations in parallel. Surprisingly, the implications of parallel runs of different RTL simulations have not been extensively explored. Moreover, in modern digital hardware, there is a growing trend to replicate components to scale out. However, the potential for circuit deduplication has been mostly overlooked.

  • RepCut: Superlinear Parallel RTL Simulation with Replication-Aided Partitioning

    2023-03-20 · 20 citations

    article · Open access · Senior author

    Register transfer level (RTL) simulation is an invaluable tool for developing, debugging, verifying, and validating hardware designs. Despite the parallel nature of hardware, existing parallel RTL simulators yield speedups unattractive for practical application due to high communication and synchronization costs incurred by typical circuit topologies.

  • Accelerating Clique Counting in Sparse Real-World Graphs via Communication-Reducing Optimizations

    arXiv (Cornell University) · 2021-12-21 · 1 citation

    preprint · Open access · Senior author

    Counting instances of specific subgraphs in a larger graph is an important problem in graph mining. Finding cliques of size k (k-cliques) is one example of this NP-hard problem. Different algorithms for clique counting avoid counting the same clique multiple times by pivoting or ordering the graph. Ordering-based algorithms include an ordering step to direct the edges in the input graph, and a counting step, which is dominated by building node or edge-induced subgraphs. Of the ordering-based algorithms, kClist is the state-of-the-art algorithm designed to work on sparse real-world graphs. Despite its leading overall performance, kClist's vertex-parallel implementation does not scale well in practice on graphs with a few million vertices. We present CITRON (Clique counting with Traffic Reducing Optimizations) to improve the parallel scalability and thus overall performance of clique counting. We accelerate the ordering phase by abandoning kClist's sequential core ordering and using a parallelized degree ordering. We accelerate the counting phase with our reorganized subgraph data structures that reduce memory traffic to improve scaling bottlenecks. Our sorted, compact neighbor lists improve locality and communication efficiency which results in near-linear parallel scaling. CITRON significantly outperforms kClist while counting moderately sized cliques, and thus increases the size of graph practical for clique counting. We have recently become aware of ArbCount (arXiv:2002.10047), which often outperforms us. However, we believe that the analysis included in this paper will be helpful for anyone who wishes to understand the performance characteristics of k-clique counting.

  • A Case for Accelerating Software RTL Simulation

    IEEE Micro · 2020 · 15 citations

    1st author · Corresponding
    • Computer Science
    • Embedded system

    RTL simulation is a critical tool for hardware design but its current slow speed often bottlenecks the whole design process. Simulation speed becomes even more crucial for agile and open-source hardware design methodologies, because the designers not only want to iterate on designs quicker, but they may also have less resources with which to simulate them. In this article, we execute multiple simulators and analyze them with hardware performance counters. We find some open-source simulators not only outperform a leading commercial simulator, they also achieve comparable or higher instruction throughput on the host processor. Although advanced optimizations may increase the complexity of the simulator, they do not significantly hinder instruction throughput. Our findings make the case that there is significant room to accelerate software simulation and open-source simulators are a great starting point for researchers.

  • Evaluation of Graph Analytics Frameworks Using the GAP Benchmark Suite

    2020 · 25 citations

    • Computer Science
    • Theoretical computer science

    Graphs play a key role in data analytics. Graphs and the software systems used to work with them are highly diverse. Algorithms interact with hardware in different ways and which graph solution works best on a given platform changes with the structure of the graph. This makes it difficult to decide which graph programming framework is the best for a given situation. In this paper, we try to make sense of this diverse landscape. We evaluate five different frameworks for graph analytics: SuiteSparse GraphBLAS, Galois, the NWGraph library, the Graph Kernel Collection, and GraphIt. We use the GAP Benchmark Suite to evaluate each framework. GAP consists of 30 tests: six graph algorithms (breadth-first search, single-source shortest path, PageRank, betweenness centrality, connected components, and triangle counting) on five graphs. The GAP Benchmark Suite includes high-performance reference implementations to provide a performance baseline for comparison. Our results show the relative strengths of each framework, but also serve as a case study for the challenges of establishing objective measures for comparing graph frameworks.

Education

  • Ph.D., Computer Science

    UC Berkeley

    2016
  • Other

    GAP Project

  • Other

    RISC-V

Awards & honors

  • NSF CAREER Award