Christopher Batten
Professor · Verified · Cornell University · Microelectronic Engineering
Active 1975–2025
About
Christopher Batten is a Professor of Electrical and Computer Engineering at Cornell University and a graduate field member of Computer Science. His research group is part of the Computer Systems Laboratory and focuses on the intersection of computer architecture, electronic design automation, and digital VLSI. His work includes projects on parallel programming frameworks, programmable accelerator design, interconnection networks, productive VLSI chip design methodologies, and architectures for emerging technologies. Building prototype systems is a key aspect of his research, serving to validate assumptions, understand physical design issues, and create platforms for future software research. Batten's contributions have been recognized through numerous awards, including the ACM/IEEE MICRO Hall of Fame, a Cornell Engineering Research Excellence Award, and several teaching awards such as the Ruth and Joel Spira Award for Excellence in Teaching and the Canaan Family Award for Excellence in Academic Advising. He has held visiting positions at prestigious institutions, including the University of California, Berkeley, the University of Cambridge, and NVIDIA. His academic background includes a Ph.D. in Electrical Engineering and Computer Science from MIT, an M.Phil. in Engineering from the University of Cambridge, and a B.S. in Electrical Engineering from the University of Virginia.
Research topics
- Computer Science
- Embedded system
- Parallel computing
- Operating system
- Database
- Programming language
- Software engineering
- Telecommunications
- Computer hardware
- Computer architecture
Selected publications
EntoBench: A Benchmark Suite and Evaluation Framework for Insect-Scale Robotics
2025-10-12
article · Senior author
EntoBench is the first open, MCU-ready benchmark suite and evaluation framework that captures the full insect-scale robot pipeline. Thirty-one kernels, each configurable for float, double, or fixed-point arithmetic, map how milliwatt power budgets and tight memory constraints can reshape algorithmic trade-offs. A synchronized GPIO harness couples logic-analyzer timing with inline current sensing, allowing any Cortex-M0+, M4, M33, or M7 board to report latency, energy, and peak power under cache-on/off conditions and varied kernel parameters. More than 400 experiment runs reveal systematic patterns linking architecture features to achievable autonomy, providing a rigorous baseline for future software optimization and hardware-software co-design in this emerging cyber-physical domain. The full framework and benchmark suite are released as open source.
SMX: Heterogeneous Architecture for Universal Sequence Alignment Acceleration
2025-10-17 · 4 citations
article · Open access
Sequence alignment is a fundamental building block for critical applications across multiple fields, such as computational biology and information retrieval. The rapid advancement of genome sequencing technologies and breakthrough generative AI tools, like AlphaFold, has driven an exponential increase in sequence-data production, creating a pressing need for fast and efficient sequence alignment tools to analyze ever-growing biological sequence databases. Notwithstanding the numerous accelerators proposed, from general-purpose architectures (CPUs and GPUs) to domain-specific designs (FPGAs and ASICs), the most efficient solutions suffer from over-specialization and fail to adapt to the wide variety of irregular use cases demanded by practical sequence alignment applications. Thus, it remains a challenge to design an architecture that can balance efficiency and flexibility to meet the demands of real-world alignment applications. This work introduces SMX, a heterogeneous architecture designed for high-performance sequence alignment that supports various configurations for different sequence types (DNA, protein, and ASCII text) and alignment models (including weighted gaps and substitution matrices). SMX integrates an ISA extension (SMX-1D) for irregular and sequential tasks and a specialized coprocessor (SMX-2D) to accelerate regular and parallel tasks, both orchestrated by the general-purpose core to enable seamless integration with state-of-the-art sequence alignment algorithms. Our results demonstrate that SMX’s heterogeneous architecture accelerates different sequence alignment use cases by 256–744× compared to state-of-the-art software implementations when aligning real datasets. Compared to specialized hardware accelerators, SMX delivers up to 18.5× more peak performance per area added while providing greater flexibility to accelerate different use cases. Physical design results targeting a 22 nm technology node estimate SMX’s area at 0.34 mm², which is only 30% of a single-issue in-order CPU. SMX offers a high-performance and efficient heterogeneous architecture for accelerating practical sequence alignment algorithms, providing a scalable and flexible solution tailored to meet the needs of modern sequence-analysis tools. Furthermore, an SMX case study explores the frontier between flexibility and efficiency in domain-specific architectures and accelerators.
Creating a biomedical knowledge base by addressing GPT inaccurate responses and benchmarking context
Open Research Africa · 2025-09-02
article · Open access
Background: We created the GeneNetwork Question Answer system (GNQA), a generative pre-trained transformer (GPT) knowledge base driven by a performant retrieval augmented generation (RAG) with a focus on aging, dementia, Alzheimer’s, and diabetes. Methods: We uploaded a corpus of three thousand peer-reviewed publications on these topics into the RAG. To address concerns about inaccurate responses and GPT ‘hallucinations’, we implemented a context provenance tracking mechanism that enables researchers to validate responses against the original material and to get references to the original papers. To assess the effectiveness of contextual information we collected evaluations and feedback from both domain expert users and ‘citizen scientists’ on the relevance of GPT responses. Results: When evaluating the responses to their questions, human respondents give a “thumbs-up” 76% of the time. Meanwhile, RAGAS scores 90% on answer relevance on questions posed by experts. And when GPT generates questions, RAGAS scores 74% on answer relevance. Discussion: A key innovation of our study is automated evaluation by way of a RAG assessment system (RAGAS). RAGAS combines human expert assessment with AI-driven evaluation to measure the effectiveness of RAG systems. With RAGAS, we created a benchmark that can be used to continuously assess our knowledge base's performance. Full GNQA functionality is embedded in the free GeneNetwork.org web service, an open-source system containing over 25 years of experimental data on model organisms and humans. The code developed for this study is published under a free and open-source software license at https://git.genenetwork.org/gn-ai/tree/README.md.
Optically Connected Multi-Stack HBM Modules for Large Language Model Training and Inference
IEEE Computer Architecture Letters · 2025-01-01 · 1 citation
article · Senior author
Large language models (LLMs) have grown exponentially in size, presenting significant challenges to traditional memory architectures. Current high bandwidth memory (HBM) systems are constrained by chiplet I/O bandwidth and the limited number of HBM stacks that can be integrated due to packaging constraints. In this letter, we propose a novel memory system architecture that leverages silicon photonic interconnects to increase memory capacity and bandwidth for compute devices. By introducing optically connected multi-stack HBM modules, we extend the HBM memory system off the compute chip, significantly increasing the number of HBM stacks. Our evaluations show that this architecture can improve training efficiency for a trillion-parameter model by 1.4× compared to a modeled A100 baseline, while also enhancing inference performance by 4.2× if the L2 is modified to provide sufficient bandwidth.
Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation
ACM Transactions on Design Automation of Electronic Systems · 2025-02-19 · 13 citations
article · Open access
The application of large language models (LLMs) to digital hardware code generation is an emerging field, with most LLMs primarily trained on natural language and software code. Hardware code like Verilog constitutes a small portion of training data, and few hardware benchmarks exist. The open-source VerilogEval benchmark, released in November 2023, provided a consistent evaluation framework for LLMs on code completion tasks. Since then, both commercial and open models have seen significant development. In this work, we evaluate new commercial and open models released since VerilogEval’s original release (including GPT-4o, GPT-4 Turbo, Llama3.1 (8B/70B/405B), Llama3 70B, Mistral Large, DeepSeek Coder (33B and 6.7B), CodeGemma 7B, and RTL-Coder) against an improved VerilogEval benchmark suite. We find measurable improvements in state-of-the-art models: GPT-4o achieves a 63% pass rate on specification-to-RTL tasks. The recently released and open Llama3.1 405B achieves a 58% pass rate, almost matching GPT-4o, while the smaller domain-specific RTL-Coder 6.7B models achieve an impressive 34% pass rate. Additionally, we enhance VerilogEval’s infrastructure by automatically classifying failures, introducing in-context learning support, and extending the tasks to specification-to-RTL translation. We find that prompt engineering remains crucial for achieving good pass rates and varies widely with model and task. A benchmark infrastructure that allows for prompt engineering and failure analysis is essential for continued model development and deployment.
Scaling Co-Packaged Optical Interconnects Using Hybrid 2.5D/3D Integration
2025-05-25 · 4 citations
article · Senior author
Tightly integrated optical interconnects can provide high-bandwidth, energy-efficient inter-node communication. We describe a novel system that uses hybrid 2.5D/3D integration to compose a state-of-the-art FPGA compute chiplet, three electrical interface chiplets, and three photonic interface chiplets. We use register-transfer-, gate-, transistor-, and device-level simulations to demonstrate the potential for this system to achieve 96 Tb/s of bi-directional bandwidth, and we experimentally demonstrate key components including a complete opto-electrical channel. Our results provide a strong case for hybrid 2.5D/3D integration as the key enabler for scaling co-packaged optical interconnects.
PangenomicsBench: A Benchmark Suite and Characterization of Pangenomics
2025-10-12
article
Cheaper and more accurate sequencing technologies have led to a large volume of genetic data that poses significant computational challenges and requires novel computing solutions to keep pace. This increased volume has also enabled the use of pangenome graph references, which provide better quality alignments because they represent variation, but they require new algorithms that are usually slower than those using a traditional reference genome, and exhibit different computational characteristics. We introduce PangenomicsBench, the first benchmark suite targeting computational pangenomics, with six CPU and two GPU kernels extracted from popular tools, designed to guide future research in pangenomics software and hardware acceleration. We characterize these workloads to reveal the following key insights: (a) Seq2Graph mapping algorithms are limited by control complexity rather than memory access to the reference graph because they process small, cache-friendly subgraphs. (b) GPUs have the potential for large speedups, but are limited by control divergence for mapping workloads. (c) Pangenomics introduces computational patterns different from traditional genomics like stochastic gradient descent. (d) Pangenomic mapping algorithms are highly sensitive to reference graph structures. (e) There are opportunities for optimizing existing software.
Characterizing and Optimizing Realistic Workloads on a Commercial Compute-in-SRAM Device
ArXiv.org · 2025-09-05
preprint · Open access
Compute-in-SRAM architectures offer a promising approach to achieving higher performance and energy efficiency across a range of data-intensive applications. However, prior evaluations have largely relied on simulators or small prototypes, limiting the understanding of their real-world potential. In this work, we present a comprehensive performance and energy characterization of a commercial compute-in-SRAM device, the GSI APU, under realistic workloads. We compare the GSI APU against established architectures, including CPUs and GPUs, to quantify its energy efficiency and performance potential. We introduce an analytical framework for general-purpose compute-in-SRAM devices that reveals fundamental optimization principles by modeling performance trade-offs, thereby guiding program optimizations. Exploiting the fine-grained parallelism of tightly integrated memory-compute architectures requires careful data management. We address this by proposing three optimizations: communication-aware reduction mapping, coalesced DMA, and broadcast-friendly data layouts. When applied to retrieval-augmented generation (RAG) over large corpora (10 GB–200 GB), these optimizations enable our compute-in-SRAM system to accelerate retrieval by 4.8×–6.6× over an optimized CPU baseline, improving end-to-end RAG latency by 1.1×–1.8×. The shared off-chip memory bandwidth is modeled using a simulated HBM, while all other components are measured on the real compute-in-SRAM device. Critically, this system matches the performance of an NVIDIA A6000 GPU for RAG while being significantly more energy-efficient (54.4×–117.9× reduction). These findings validate the viability of compute-in-SRAM for complex, real-world applications and provide guidance for advancing the technology.
SparseZipper: Enhancing Matrix Extensions to Accelerate SpGEMM on CPUs
ArXiv.org · 2025-02-17
preprint · Open access · Senior author
The importance of general matrix multiplication (GEMM) is motivating new instruction set extensions for multiplying dense matrices in almost all contemporary ISAs, and these extensions are often implemented using high-performance systolic arrays. However, matrices in emerging workloads are not always dense, and sparse matrices where the vast majority of values are zeros are becoming more common. Existing matrix extensions and micro-architectures cannot efficiently process highly sparse matrices for two reasons: (1) wasted work when one or both input values are zero; and (2) incompatibility with sparse matrix formats. This work proposes SparseZipper, which minimally modifies existing matrix extensions and systolic-array-based micro-architectures specialized for dense-dense GEMM to accelerate sparse-sparse GEMM operating on highly sparse matrices with unstructured sparsity structures. Our performance evaluation shows SparseZipper achieves 5.98x and 2.61x speedup over a scalar hash-based implementation of SpGEMM and a state-of-the-art vectorized SpGEMM version, respectively. Our component-level area evaluation shows SparseZipper increases the area of a baseline 16x16 systolic array by only 12.7%, resulting in an area overhead for an entire system-on-chip of just a few percent.
Characterizing and Optimizing Realistic Workloads on a Commercial Compute-in-SRAM Device
2025-10-17 · 1 citation
article · Open access
Recent grants
NSF · $500k · 2015–2020
NSF · $500k · 2012–2017
II-New: PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research
NSF · $444k · 2015–2020
SHF: Small: EVE: Ephemeral Vector Engines
NSF · $400k · 2020–2024
XPS: DSD: Polymorphic Hardware Specialization for Domain-Specific Algorithms and Data Structures
NSF · $698k · 2013–2018
Frequent coauthors
- 26 shared
Krste Asanović
University of California, Berkeley
- 25 shared
J. Christopher Anderson
- 25 shared
Will DeLoache
University of California, Berkeley
- 25 shared
Joshua T. Kittleson
National Institute on Aging
- 25 shared
Timothy H.-C. Hsiau
University of California, Berkeley
- 25 shared
Douglas Densmore
Boston University
- 15 shared
W. E. Wentworth
University of Houston
- 14 shared
Ajay Joshi
Labs
Education
- 1999
B.S.
University of Virginia
- 2000
M.Phil.
University of Cambridge
Ph.D.
Massachusetts Institute of Technology
Awards & honors
- ACM/IEEE MICRO Hall of Fame
- Cornell Engineering Research Excellence Award
- AFOSR Young Investigator Program award
- Intel Early Career Faculty Honor Program award
- NSF CAREER award