
Caiwen Ding
Verified · University of Minnesota · Computer Science and Engineering
Active 2015–2026
About
Caiwen Ding is an Associate Professor in the Department of Computer Science & Engineering at the University of Minnesota, Twin Cities. His research interests include algorithm-system co-design of ML/AI, computer architecture and heterogeneous computing, privacy-preserving machine learning, machine learning for electronic design automation, neuromorphic computing, computer vision, and natural language processing.
Research topics
- Computer Science
- Artificial Intelligence
- Parallel computing
- Computer engineering
- Algorithm
- Embedded system
- Engineering
- Electrical engineering
- Computer architecture
- Electronic engineering
Selected publications
FPTC: A Fast Parallel Transform-based Codec for Efficient Asymmetric Signal Compression
ArXiv.org · 2026-05-01
article · Open access · Senior author
Modern high-performance computing and Internet-of-Things deployments increasingly generate large volumes of signal data that must be compressed efficiently on resource-constrained acquisition devices and decompressed at scale on centralized servers. Lossy compression is widely adopted to minimize storage and transmission costs on low-power hardware sensors, yet existing methods rarely optimize for both reconstruction quality and decompression throughput simultaneously, nor do they apply methods that generalize across signal domains. In this work, we introduce FPTC, a high-throughput asymmetric signal codec that pairs a lightweight sequential encoder with a massively parallel GPU decoder designed for server-side batch decompression. FPTC applies a windowed discrete cosine transform (DCT) to exploit frequency-domain sparsity, quantizes spectral coefficients with a hybrid three-zone mapping, and entropy codes the result using Huffman coding with a novel packing scheme. The pipeline used in FPTC is designed to be throughput oriented on the GPU, maximizing performance without sacrificing reconstruction quality. We evaluate FPTC on ten datasets spanning four signal domains: biomedical diagnostic, seismic reflections, power-grid production metrics, and meteorological recordings. Our results demonstrate that FPTC outperforms existing frameworks in compression ratio while maintaining competitive throughput, achieving multiplicative compression performance of 3.6x (power), 3.1x (meteorological), 1.5x (biomedical), and 1.2x (seismic) over existing frameworks.
ArXiv.org · 2025-09-10
preprint · Open access · Senior author
Large Language Models (LLMs) are gaining prominence in various fields, thanks to their ability to generate high-quality content from human instructions. This paper delves into the field of chip design using LLMs, specifically in Power-Performance-Area (PPA) optimization and the generation of accurate Verilog code for circuit designs. We introduce a novel framework, VeriPPA, designed to optimize PPA and generate Verilog code using LLMs. Our method includes a two-stage process where the first stage focuses on improving the functional and syntactic correctness of the generated Verilog code, while the second stage focuses on optimizing the Verilog code to meet PPA constraints of circuit designs, a crucial element of chip design. On the RTLLM dataset, our framework achieves an 81.37% success rate in syntactic correctness and 62.06% in functional correctness for code generation, outperforming current state-of-the-art (SOTA) methods. On the VerilogEval dataset, our framework achieves 99.56% syntactic correctness and 43.79% functional correctness, also surpassing SOTA, which stands at 92.11% for syntactic correctness and 33.57% for functional correctness. Furthermore, our framework is able to optimize the PPA of the designs. These results highlight the potential of LLMs in handling complex technical areas and indicate an encouraging development in the automation of chip design processes.
Attacking all tasks at once using adversarial examples in multi-task learning
Neurocomputing · 2025-09-14
preprint · Open access
Selected topics in electronics and systems · 2025-12-17
book-chapter
Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation
ArXiv.org · 2025-11-29
preprint · Open access
Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency, and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran -> C++ and C++ -> CUDA, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. We show this data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics like compilation success.
Hardware Architecture for Convolutional Neural Network with Memristor-Bridges
Selected topics in electronics and systems · 2025-12-17
book-chapter
Hardware Architecture for Convolutional Neural Network with Memristor-Bridges
International Journal of High Speed Electronics and Systems · 2025-07-23
article
Integration of memristors into neuromorphic systems is receiving substantial attention due to their potential to facilitate energy-efficient and highly parallel in-memory computation. In this paper, a memristor-bridge based design for the convolution operation of a convolutional neural network (CNN) and its crossbar realization are developed. A LeNet-5 network is realized using the proposed design and tested on the MNIST dataset. The architecture includes circuit configurations for activation and pooling operations. The weight-mapping procedure for the memristor-bridges is developed in relation to the exact physics of the memristor's conduction mechanism. Efficient modeling of the devices results in excellent performance of the network, achieving up to 99.08% inference accuracy. Isolation among the bridges and parallelization of the convolution operation lead to a rapid mapping within 0.11[Formula: see text][Formula: see text] and fast response in less than 20[Formula: see text][Formula: see text]. The overall energy consumption by the memristor units during mapping and inference remains well below [Formula: see text].
Attacking the spike: On the security of spiking neural networks to adversarial examples
Neurocomputing · 2025-09-12 · 2 citations
article · Open access
Graph Convolutional Network Acceleration Using Adiabatic Superconductor Josephson Devices
2025-06-08
article · Open access · Senior author
Frequent coauthors
- Yanzhi Wang, Sichuan University · 74 shared
- Hongwu Peng · 34 shared
- Shanglin Zhou · 32 shared
- Qinru Qiu · 24 shared
- Bo Yuan, Zhejiang University of Science and Technology · 24 shared
- Xuehai Qian, Purdue University System · 24 shared
- Geng Yuan · 24 shared
- Chenghong Wang · 23 shared
Labs
UMN APEX (Algorithm-Platform Exploration for Efficient AI) Lab · PI
Research in ML/AI systems, computer architecture, privacy-preserving ML, and EDA.
Education
- Ph.D., Northeastern University, 2019
Awards & honors
- NSF CAREER Award
- Amazon Research Award
- CISCO Research Award
- Best Paper Award at 2025 ICLAD
- Outstanding Student Paper Award at 2023 HPEC