
Mattia Fazzini
Verified · University of Minnesota · Computer Science and Engineering
Active 2012–2026
About
Mattia Fazzini is an Assistant Professor in the Department of Computer Science & Engineering at the University of Minnesota, which he joined in 2019. His research aims to improve software quality through techniques for software testing and maintenance, and it extends to verification and security, including investigating software attacks and securing software systems. He holds a Ph.D. in Computer Science from the Georgia Institute of Technology, obtained in 2019, and multiple master's degrees in related fields from institutions including the University of Illinois at Chicago, Politecnico di Milano, and Politecnico di Torino. He has served as a panelist at the National Science Foundation and as a program committee member for the IEEE/ACM International Conference on Automated Software Engineering. His contributions to software engineering have been recognized through awards such as the IEEE TCSE Distinguished Paper Award and the Facebook Testing and Verification Research Award.
Research topics
- Computer Science
- World Wide Web
- Data science
- Artificial Intelligence
- Software engineering
- Operating system
- Human–computer interaction
Selected publications
Human-Agent versus Human Pull Requests: A Testing-Focused Characterization and Comparison
ArXiv.org · 2026-01-29
Article · Open access · Senior author
AI-based coding agents are increasingly integrated into software development workflows, collaborating with developers to create pull requests (PRs). Despite their growing adoption, the role of human-agent collaboration in software testing remains poorly understood. This paper presents an empirical study of 6,582 human-agent PRs (HAPRs) and 3,122 human PRs (HPRs) from the AIDev dataset. We compare HAPRs and HPRs along three dimensions: (i) testing frequency and extent, (ii) types of testing-related changes (code-and-test co-evolution vs. test-focused), and (iii) testing quality, measured by test smells. Our findings reveal that, although the likelihood of including tests is comparable (42.9% for HAPRs vs. 40.0% for HPRs), HAPRs exhibit a larger extent of testing, nearly doubling the test-to-source line ratio found in HPRs. While test-focused task distributions are comparable, HAPRs are more likely to add new tests during co-evolution (OR=1.79), whereas HPRs prioritize modifying existing tests. Finally, although some test smell categories differ statistically, negligible effect sizes suggest no meaningful differences in quality. These insights provide the first characterization of how human-agent collaboration shapes testing practices.
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
arXiv (Cornell University) · 2026-05-08
Preprint · Open access
Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests with repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons the original optimization structure. We introduce CUDABEAVER, a benchmark for CUDA debugging from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. CUDABEAVER evaluates whether a fixer truly repairs the failing CUDA code or merely finds a slower test-passing replacement, reporting results by failure category, debugging trajectory, stagnation mode, and performance preservation. We further propose pass@k(M,C,A), a protocol-conditional CUDA debugging metric that makes the fixer M, corpus C, and protocol axes A explicit. Using this metric across 213 tasks and seven frontier LLMs, we show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability: when performance-loss tolerance is high, fixers appear much stronger, but even a slightly stricter performance requirement can sharply reduce measured success, shifting scores by up to 40 percentage points.
Improving LLM-Driven Test Generation by Learning from Mocking Information (AIST 2026): Replication Package
Zenodo (CERN European Organization for Nuclear Research) · 2026-04-10
Dataset · Open access
Authors: Jamie Lee*, Flynn Teh*, Hengcheng Zhu†, Mengzhen Li‡, Mattia Fazzini‡, Valerio Terragni*
Affiliations: * University of Auckland, Auckland, New Zealand; † The Hong Kong University of Science and Technology, Hong Kong SAR; ‡ University of Minnesota, Minneapolis, USA
This repository is the replication package for the paper “Improving LLM-Driven Test Generation by Learning from Mocking Information,” presented at AIST 2026. It contains the MOCKMILL tool implementation, raw experimental results, subject programs, and analysis scripts required to reproduce the study. MOCKMILL is a Java-based approach for unit test generation with large language models that leverages mocking information to improve test quality and coverage. The package includes materials for tool setup, experiment execution, and result analysis, and is intended to support transparency, reproducibility, and follow-up research in AI-based software testing.
Citation (BibTeX):
@inproceedings{lee2026mockmill,
  title={Improving LLM-Driven Test Generation by Learning from Mocking Information},
  author={Lee, Jamie and Teh, Flynn and Zhu, Hengcheng and Li, Mengzhen and Fazzini, Mattia and Terragni, Valerio},
  booktitle={Proceedings of the 19th International Conference on Software Testing, Verification and Validation Workshops (AIST)},
  year={2026},
  organization={IEEE}
}
Human-Agent versus Human Pull Requests: A Testing-Focused Characterization and Comparison
Open MIND · 2026-01-29
Preprint · Senior author
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
ArXiv.org · 2026-05-08
Article · Open access
AndroT: A dataset of Android Apps with Tests
Zenodo (CERN European Organization for Nuclear Research) · 2026-01-23
Dataset · Open access · Senior author
Characterizing Installation- and Run-time Compatibility Issues in Android Benign Apps and Malware
ACM Transactions on Software Engineering and Methodology · 2025-03-25 · 3 citations
Article · Open access
The Android ecosystem has experienced rapid growth, resulting in a diverse range of platforms and devices. This expansion has also brought about compatibility issues that negatively impact user experiences and hinder app development productivity. Existing relevant studies are focused on and limited to the “static” sense of those issues (in terms of potentialities and proneness), while only addressing compatibility issues that possibly occur during app executions. In this article, we present an extensive and longitudinal study on app compatibility issues that is disparate from yet complementary to prior studies, characterizing the incompatibilities based on actual, exercised observations and evidence at both installation and run-time. With a dataset of 74,545 benign apps and 56,919 malicious apps over a span of 12 years (2010 through 2021) and 10 Android versions, we extensively examine the prevalence, symptoms/effects, and causes of, as well as the contributing factors to, installation-time and run-time compatibility issues. Our study reveals 12 major novel findings regarding Android app incompatibilities. First (Findings 1, 2), installation-time incompatibilities persisted significantly over the 12 years, even more so in malware than in benign apps. Second (Findings 7, 8), run-time compatibility issues were also seen persistently over time but only on specific Android platforms (such as API 26, 27, etc.) and much less by malware than benign apps. Third (Findings 5, 6, 11, 12), there is a significant (moderate/stronger) correlation between an app’s specified minSdkVersion and its incompatibilities (over all symptoms and/or with respect to one of its dominating symptoms), with stronger correlations seen in malware than in benign apps, for both installation-time and run-time incompatibilities. Similar observations hold (although with much stronger correlation in absolute terms) when considering, instead of the minSdkVersion itself, the gap between the app’s minSdkVersion and the SDK API level of the platform the app is installed to or runs on. Last (Findings 3, 4, 9, 10), installation-time incompatibilities are primarily caused by the utilization of architecture-incompatible native libraries within apps, while run-time incompatibilities are mainly attributed to API changes during the evolution of the Android SDK; the symptoms of run-time failures seen by malware are much more diverse than those seen by benign apps. In addition to these insights, we provide practical recommendations for both app developers and end users on how to effectively address compatibility issues in Android apps, as well as how to devise effective defenses against malware from the compatibility perspective.
Human-Agent versus Human Pull Requests: A Testing-Focused Characterization and Comparison
Zenodo (CERN European Organization for Nuclear Research) · 2025-12-31
Dataset · Open access · Senior author
Replication package
X2J Object Reconstruction Replication Package
Zenodo (CERN European Organization for Nuclear Research) · 2025-12-22
Other · Open access
This replication package provides complete implementations to fully reproduce the experiments from our paper “Reconstructing Objects for Software Testing.”
Frequent coauthors
- Alessandra Gorla (IMDEA Software): 27 shared
- Konstantin Kuznetsov: 25 shared
- Rui Abreu: 25 shared
- Daniel Dominguez Alvarez (University of Minnesota): 25 shared
- Alessandro Orso (Georgia Institute of Technology): 22 shared
- John Grundy: 10 shared
- Tyler Wendland (University of Minnesota): 8 shared
- Kevin Moran (University of Central Florida): 8 shared
Awards & honors
- Moka: Improving App Testing with Automated Mocking Fazzini,…