Scott Beamer

Assistant Professor

University of California, Santa Cruz · Electrical and Computer Engineering

Active 1992–2025

h-index 16

Citations 2.1k

Papers356 last 5y

Funding—


About

I design architectures for data-intensive applications, with a focus on improving communication efficiency. I am interested in computer architecture, agile and open-source hardware design, graph processing, and data movement optimization. I lead the Vertical Architectures, Memory, and Algorithms (VAMA) group, and we are part of the Hardware Systems Collective.

Research topics

Computer Science
Computer architecture
Embedded system
Operating system
Mathematics
Theoretical computer science
Software engineering
Distributed computing
Parallel computing
Engineering

Selected publications

SORS: Accelerating RTL simulation with a vertically-integrated approach
UPCommons institutional repository (Universitat Politècnica de Catalunya) · 2025-07-03
other · 1st author · Corresponding
Simulation is a critical tool for hardware design but its current slow speed often bottlenecks the entire design process. Simulation speed becomes even more crucial for agile and open-source hardware design methodologies, because the designers not only want to iterate on designs quicker, but they may also have less resources with which to simulate them. In this work, we survey our various techniques for accelerating hardware (RTL) simulation. We explore the challenge of efficiently detecting opportunities for reuse due to low activity factors, and we demonstrate streamlined techniques to profitably exploit them. We take advantage of the replication that is common in large designs to increase scalability. We parallelize simulation for multicore and achieve superlinear speedups. Throughout our work, we leverage insights about both the application workload and the host platform. Many of our innovations are enabled by novel graph partitioning algorithms or optimizations for the host processor. Our simulators outperform both leading open-source and industrial simulators, and we use performance counters to analyze our performance advantages.
miniGiraffe: A Pangenomic Mapping Proxy App
2025-10-12
article · Senior author
Large, real-world scientific applications are often complex, making them difficult to analyze, characterize, and optimize. Such applications typically involve intricate I/O patterns and library dependencies, which can make workflow analysis and tuning difficult. Proxy applications offer a practical solution by emulating the essential characteristics of the original application while significantly reducing complexity. In this work, we present miniGiraffe, a proxy application for Giraffe, a sophisticated genomics tool that operates over a pangenome, a graph-based structure capturing genetic variation across a species. We develop miniGiraffe using a principled methodology: carefully characterizing Giraffe’s behavior and validating that our proxy faithfully reproduces its key computational features. miniGiraffe contains only 2% of Giraffe’s codebase, yet it produces identical outputs for the most computationally intensive code components and closely matches Giraffe’s execution time and scaling behavior in these regions. The simplified design of miniGiraffe enabled rapid experimentation across multiple architectures, which we utilize to perform an autotuning experiment of the mapping workflow; we found that specializing parameters to inputs and architectures provided a geometric mean speedup of 1.15× and up to a 3.32× speedup over the default parameters.
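As a rough illustration of how the two summary figures in this abstract are typically computed, the sketch below derives a geometric mean speedup and a maximum speedup from paired run times. The function name and the sample timings are hypothetical and are not taken from the miniGiraffe paper.

```python
import math


def summarize_speedups(baseline_times, tuned_times):
    """Return (geometric mean speedup, maximum speedup) over paired runs.

    Each pair is one hypothetical (input, architecture) configuration:
    the run time with default parameters and with tuned parameters.
    All times must be positive.
    """
    if not baseline_times or len(baseline_times) != len(tuned_times):
        raise ValueError("need two non-empty lists of equal length")
    # Speedup of a configuration = default time / tuned time.
    speedups = [b / t for b, t in zip(baseline_times, tuned_times)]
    # Geometric mean computed in log space to avoid overflow on long lists.
    log_sum = sum(math.log(s) for s in speedups)
    return math.exp(log_sum / len(speedups)), max(speedups)


# Made-up timings (seconds) for two configurations, for illustration only.
geo_mean, best = summarize_speedups([2.0, 8.0], [1.0, 2.0])
print(f"geometric mean speedup: {geo_mean:.2f}x, best: {best:.2f}x")
```

The geometric mean is the usual choice for averaging ratios such as speedups, because it treats a 2× gain and a 2× loss as cancelling out, which an arithmetic mean does not.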
PivotScale: A Holistic Approach for Scalable Clique Counting
2025-06-03
article · Senior author
Counting cliques of size k (k-cliques) in a graph is an important problem in graph pattern mining. Due to a combinatorial explosion in the amount of work, counting large cliques in real-world networks is challenging, as leading parallel approaches become untenable for even modestly large clique sizes (e.g., k = 10). Pivoter, a recent algorithm, is able to scale to much larger clique sizes due to its superior algorithmic complexity. While it has leading single-thread performance, its naive parallel implementation results in poor parallel speedups. Efforts to optimize its parallel performance on CPUs are absent in the current literature. We present PivotScale, a scalable approach to accelerate exact clique counting. Our approach scales with both the number of cores as well as the clique size k. During the initial ordering phase, we introduce a heuristic to select which parallel ordering approach will result in the fastest overall execution time. In the subsequent counting phase, we increase scalability by reducing memory usage. Our high-performance parallel implementation outperforms prior work and demonstrates near-linear parallel scaling for up to 64 threads on large real-world social networks.
Scala defined hardware generators for Chisel
Microprocessors and Microsystems · 2025-07-21 · 1 citation
article · Open access · Senior author
We describe digital hardware designs in hardware description languages such as VHDL and SystemVerilog. Both languages were developed in the 1980s and, although regularly updated, are still in the style of their time. They lack the constructs to write more configurable generators than just the number of bits for an operation. Based on Scala, Chisel is a hardware construction language that helps to write hardware generators. Hardware generators are not a new idea. Scripting languages, such as Perl and TCL, are often used to generate VHDL or Verilog code from other sources of system description. However, mixing two languages and embedding VHDL or Verilog strings in generator code is not scalable. As Chisel is embedded in Scala, we can write the generators using the same language/environment as we use to describe the digital logic. This paper explores different examples and patterns to describe parameterizable hardware generators. We are confident that practices from software development can improve the productivity of hardware designers to build and test the next billion transistor chips.
Teaching Agile Hardware Design with Chisel
2024-08-28 · 2 citations
article · Open access · 1st author · Corresponding
Agile hardware design techniques take the best of software engineering methods and apply them to improve hardware design productivity. Agile approaches not only reduce the time to solution, but they can also produce solutions which are better tailored to their target problems. Chisel provides the perfect vehicle to teach these techniques as it allows for the creation of reusable hardware generators. In this work, we outline our experiences creating an agile hardware design course using Chisel, and the lessons learned from teaching it four times. All of the course materials are available as open source.
Don't Repeat Yourself! Coarse-Grained Circuit Deduplication to Accelerate RTL Simulation
2024-04-27 · 5 citations
article · Open access · Senior author
Designing a digital integrated circuit requires many register transfer level (RTL) simulations for design, debugging, and especially verification. To cope with the slow speed of RTL simulation, industry frequently uses private server farms to run many simulations in parallel. Surprisingly, the implications of parallel runs of different RTL simulations have not been extensively explored. Moreover, in modern digital hardware, there is a growing trend to replicate components to scale out. However, the potential for circuit deduplication has been mostly overlooked.
RepCut: Superlinear Parallel RTL Simulation with Replication-Aided Partitioning
2023-03-20 · 20 citations
article · Open access · Senior author
Register transfer level (RTL) simulation is an invaluable tool for developing, debugging, verifying, and validating hardware designs. Despite the parallel nature of hardware, existing parallel RTL simulators yield speedups unattractive for practical application due to high communication and synchronization costs incurred by typical circuit topologies.
Accelerating Clique Counting in Sparse Real-World Graphs via Communication-Reducing Optimizations
arXiv (Cornell University) · 2021-12-21 · 1 citation
preprint · Open access · Senior author
Counting instances of specific subgraphs in a larger graph is an important problem in graph mining. Finding cliques of size k (k-cliques) is one example of this NP-hard problem. Different algorithms for clique counting avoid counting the same clique multiple times by pivoting or ordering the graph. Ordering-based algorithms include an ordering step to direct the edges in the input graph, and a counting step, which is dominated by building node- or edge-induced subgraphs. Of the ordering-based algorithms, kClist is the state-of-the-art algorithm designed to work on sparse real-world graphs. Despite its leading overall performance, kClist's vertex-parallel implementation does not scale well in practice on graphs with a few million vertices. We present CITRON (Clique counting with Traffic Reducing Optimizations) to improve the parallel scalability and thus overall performance of clique counting. We accelerate the ordering phase by abandoning kClist's sequential core ordering and using a parallelized degree ordering. We accelerate the counting phase with our reorganized subgraph data structures that reduce memory traffic to relieve scaling bottlenecks. Our sorted, compact neighbor lists improve locality and communication efficiency, which results in near-linear parallel scaling. CITRON significantly outperforms kClist while counting moderately sized cliques, and thus increases the size of graph practical for clique counting. We have recently become aware of ArbCount (arXiv:2002.10047), which often outperforms us. However, we believe that the analysis included in this paper will be helpful for anyone who wishes to understand the performance characteristics of k-clique counting.
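The two phases this abstract describes, an ordering step that directs edges and a counting step that builds induced subgraphs, can be sketched in a few lines. This is a minimal sequential illustration that assumes a simple degree ordering and plain Python sets; it is not the CITRON or kClist implementation, and the function name is hypothetical.

```python
from itertools import combinations


def count_k_cliques(edges, k):
    """Count k-cliques in an undirected simple graph using a degree ordering."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Build undirected adjacency sets, ignoring self-loops.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    if k == 1:
        return len(adj)

    # Ordering step: rank vertices by (degree, id) and keep only edges that
    # point from lower rank to higher rank, so each clique is found once.
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: i for i, v in enumerate(order)}
    out = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}

    # Counting step: each recursion level intersects the candidate set with a
    # vertex's out-neighbors, i.e. it builds a smaller induced subgraph.
    def extend(candidates, remaining):
        if remaining == 0:
            return 1
        if remaining == 1:
            return len(candidates)
        return sum(extend(candidates & out[v], remaining - 1) for v in candidates)

    return sum(extend(out[v], k - 1) for v in adj)


# Usage: the complete graph on 4 vertices contains four triangles.
k4_edges = list(combinations(range(4), 2))
print(count_k_cliques(k4_edges, 3))
```

Directing every edge from lower to higher rank means each clique is enumerated only from its lowest-ranked vertex, which is how ordering-based approaches avoid counting the same clique multiple times.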
A Case for Accelerating Software RTL Simulation
IEEE Micro · 2020 · 15 citations
1st author · Corresponding
- Computer Science
- Embedded system
RTL simulation is a critical tool for hardware design but its current slow speed often bottlenecks the whole design process. Simulation speed becomes even more crucial for agile and open-source hardware design methodologies, because the designers not only want to iterate on designs quicker, but they may also have less resources with which to simulate them. In this article, we execute multiple simulators and analyze them with hardware performance counters. We find some open-source simulators not only outperform a leading commercial simulator, they also achieve comparable or higher instruction throughput on the host processor. Although advanced optimizations may increase the complexity of the simulator, they do not significantly hinder instruction throughput. Our findings make the case that there is significant room to accelerate software simulation and open-source simulators are a great starting point for researchers.
Evaluation of Graph Analytics Frameworks Using the GAP Benchmark Suite
2020 · 25 citations
- Computer Science
- Theoretical computer science
Graphs play a key role in data analytics. Graphs and the software systems used to work with them are highly diverse. Algorithms interact with hardware in different ways and which graph solution works best on a given platform changes with the structure of the graph. This makes it difficult to decide which graph programming framework is the best for a given situation. In this paper, we try to make sense of this diverse landscape. We evaluate five different frameworks for graph analytics: SuiteSparse GraphBLAS, Galois, the NWGraph library, the Graph Kernel Collection, and GraphIt. We use the GAP Benchmark Suite to evaluate each framework. GAP consists of 30 tests: six graph algorithms (breadth-first search, single-source shortest path, PageRank, betweenness centrality, connected components, and triangle counting) on five graphs. The GAP Benchmark Suite includes high-performance reference implementations to provide a performance baseline for comparison. Our results show the relative strengths of each framework, but also serve as a case study for the challenges of establishing objective measures for comparing graph frameworks.

Frequent coauthors

Krste Asanović
University of California, Berkeley
25 shared
Ajay Joshi
12 shared
Vladimir Stojanović
11 shared
Yong-Jin Kwon
Hongik University
11 shared
David A. Patterson
Google (United States)
10 shared
Christopher Batten
9 shared
Imran Shamim
Massachusetts Institute of Technology
4 shared
Aydın Buluç
University of California, Berkeley
3 shared

Labs

Scott Beamer Lab · PI

Education

Ph.D., Computer Science
UC Berkeley
2016
Other
GAP Project
Other
RISC-V

Awards & honors

NSF CAREER Award
