Brendan Dolan-Gavitt

· Assistant Professor of Computer Science and EngineeringVerified

New York University · Computer Science and Engineering

Active 2007–2026

h-index24

Citations4.4k

Papers6440 last 5y

Funding$175k

Faculty page

See your match with Brendan Dolan-Gavitt — sign in to PhdFit.Sign in

Research topics

Computer Science
Computer Security
Artificial Intelligence
Programming language
Software engineering
Operating system
Nuclear physics
Embedded system
Linguistics
World Wide Web
Physics

Selected publications

AI In Cybersecurity Education -- Scalable Agentic CTF Design Principles and Educational Outcomes
arXiv (Cornell University) · 2026-03-23
articleOpen access
Large language models are rapidly changing how learners acquire and demonstrate cybersecurity skills. However, when human--AI collaboration is allowed, educators still lack validated competition designs and evaluation practices that remain fair and evidence-based. This paper presents a cross-regional study of LLM-centered Capture-the-Flag competitions built on the Cyber Security Awareness Week competition system. To understand how autonomy levels and participants' knowledge backgrounds influence problem-solving performance and learning-related behaviors, we formalize three autonomy levels: human-in-the-loop, autonomous agent frameworks, and hybrid. To enable verification, we require traceable submissions including conversation logs, agent trajectories, and agent code. We analyze multi-region competition data covering an in-class track, a standard track, and a year-long expert track, each targeting participants with different knowledge backgrounds. Using data from the 2025 competition, we compare solve performance across autonomy levels and challenge categories, and observe that autonomous agent frameworks and hybrid achieve higher completion rates on challenges requiring iterative testing and tool interactions. In the in-class track, we classify participants' agent designs and find a preference for lightweight, tool-augmented prompting and reflection-based retries over complex multi-agent architectures. Our results offer actionable guidance for designing LLM-assisted cybersecurity competitions as learning technologies, including autonomy-specific scoring criteria, evidence requirements that support solution verification, and track structures that improve accessibility while preserving reliable evaluation and engagement.
Publisher OA PDF
Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions
Communications of the ACM · 2025-01-20 · 57 citations
article
There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these comes in the form of the first self-described “AI pair programmer,” GitHub Copilot, which is a language model trained over open-source GitHub code. However, code often contains bugs—and so, given the vast quantity of unvetted code that Copilot has processed, it is certain that the language model will have learned from exploitable, buggy code. This raises concerns on the security of Copilot’s code contributions. In this work, we systematically investigate the prevalence and conditions that can cause GitHub Copilot to recommend insecure code. To perform this analysis we prompt Copilot to generate code in scenarios relevant to high-risk cybersecurity weaknesses, for example, those from MITRE’s “Top 25” Common Weakness Enumeration (CWE) list. We explore Copilot’s performance on three distinct code generation axes—examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we produce 89 different scenarios for Copilot to complete, producing 1,689 programs. Of these, we found approximately 40% to be vulnerable.
Publisher DOI
D-CIPHER: Dynamic Collaborative Intelligent Multi-Agent System with Planner and Heterogeneous Executors for Offensive Security
ArXiv.org · 2025-02-15
preprintOpen access
Large Language Models (LLMs) have been used in cybersecurity such as autonomous security analysis or penetration testing. Capture the Flag (CTF) challenges serve as benchmarks to assess automated task-planning abilities of LLM agents for cybersecurity. Early attempts to apply LLMs for solving CTF challenges used single-agent systems, where feedback was restricted to a single reasoning-action loop. This approach was inadequate for complex CTF tasks. Inspired by real-world CTF competitions, where teams of experts collaborate, we introduce the D-CIPHER LLM multi-agent framework for collaborative CTF solving. D-CIPHER integrates agents with distinct roles with dynamic feedback loops to enhance reasoning on complex tasks. It introduces the Planner-Executor agent system, consisting of a Planner agent for overall problem-solving along with multiple heterogeneous Executor agents for individual tasks, facilitating efficient allocation of responsibilities among the agents. Additionally, D-CIPHER incorporates an Auto-prompter agent to improve problem-solving by auto-generating a highly relevant initial prompt. We evaluate D-CIPHER on multiple CTF benchmarks and LLM models via comprehensive studies to highlight the impact of our enhancements. Additionally, we manually map the CTFs in NYU CTF Bench to MITRE ATT&CK techniques that apply for a comprehensive evaluation of D-CIPHER's offensive security capability. D-CIPHER achieves state-of-the-art performance on three benchmarks: 22.0% on NYU CTF Bench, 22.5% on Cybench, and 44.0% on HackTheBox, which is 2.5% to 8.5% better than previous work. D-CIPHER solves 65% more ATT&CK techniques compared to previous work, demonstrating stronger offensive capability.
Publisher OA PDF DOI
ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space
ArXiv.org · 2025-06-12
preprintOpen access
Generation-based fuzzing produces appropriate test cases according to specifications of input grammars and semantic constraints to test systems and software. However, these specifications require significant manual effort to construct. This paper proposes a new approach, ELFuzz (Evolution Through Large Language Models for Fuzzing), that automatically synthesizes generation-based fuzzers tailored to a system under test (SUT) via LLM-driven synthesis over fuzzer space. At a high level, it starts with minimal seed fuzzers and propels the synthesis by fully automated LLM-driven evolution with coverage guidance. Compared to previous approaches, ELFuzz can 1) seamlessly scale to SUTs of real-world sizes -- up to 1,791,104 lines of code in our evaluation -- and 2) synthesize efficient fuzzers that catch interesting grammatical structures and semantic constraints in a human-understandable way. Our evaluation compared ELFuzz with specifications manually written by domain experts and synthesized by state-of-the-art approaches. It shows that ELFuzz achieves up to 434.8% more coverage over the second best and triggers up to 216.7% more artificially injected bugs, compared to the state-of-the-art. We also used ELFuzz to conduct a real-world fuzzing campaign on the newest version of cvc5 for 14 days, and encouragingly, it found five 0-day bugs (three are exploitable). Moreover, we conducted an ablation study, which shows that the fuzzer space model, the key component of ELFuzz, contributes the most (up to 62.5%) to the effectiveness of ELFuzz. Further analysis of the fuzzers synthesized by ELFuzz confirms that they catch interesting grammatical structures and semantic constraints in a human-understandable way. The results present the promising potential of ELFuzz for more automated, efficient, and extensible input generation for fuzzing.
Publisher OA PDF DOI
From Trace to Line: LLM Agent for Real-World OSS Vulnerability Localization
arXiv (Cornell University) · 2025-09-30
preprintOpen access
Large language models show promise for vulnerability discovery, yet prevailing methods inspect code in isolation, struggle with long contexts, and focus on coarse function- or file-level detections that offer limited guidance to engineers who need precise line-level localization for targeted patches. We introduce T2L, an executable framework for project-level, line-level vulnerability localization that progressively narrows scope from repository modules to exact vulnerable lines via AST-based chunking and evidence-guided refinement. We provide a baseline agent with an Agentic Trace Analyzer (ATA) that fuses runtime evidence such as crash points and stack traces to translate failure symptoms into actionable diagnoses. To enable rigorous evaluation, we introduce T2L-ARVO, an expert-verified 50-case benchmark spanning five crash families in real-world projects. On T2L-ARVO, our baseline achieves up to 58.0% detection and 54.8% line-level localization rate. Together, T2L framework advance LLM-based vulnerability detection toward deployable, precision diagnostics in open-source software workflows.
Publisher OA PDF DOI
(Security) Assertions by Large Language Models
IEEE Transactions on Information Forensics and Security · 2024-01-01 · 60 citations
articleOpen access
The security of computer systems typically relies on a hardware root of trust. As vulnerabilities in hardware can have severe implications on a system, there is a need for techniques to support security verification activities. Assertion-based verification is a popular verification technique that involves capturing design intent in a set of assertions that can be used in formal verification or testing-based checking. However, writing security-centric assertions is a challenging task. In this work, we investigate the use of emerging large language models (LLMs) for code generation in hardware assertion generation for security, where primarily natural language prompts, such as those one would see as code comments in assertion files, are used to produce SystemVerilog assertions. We focus our attention on a popular LLM and characterize its ability to write assertions out of the box, given varying levels of detail in the prompt. We design an evaluation framework that generates a variety of prompts, and we create a benchmark suite comprising real-world hardware designs and corresponding golden reference assertions that we want to generate with the LLM.
Publisher OA PDF DOI
Fuzz to the Future: Uncovering Occluded Future Vulnerabilities via Robust Fuzzing
2024-12-02 · 1 citations
articleOpen access
The security landscape of software systems has witnessed considerable advancements through dynamic testing methodologies, especially fuzzing. Traditionally, fuzzing involves a sequential, cyclic process where software is tested to identify crashes. These crashes are then triaged and patched, leading to subsequent cycles that uncover further vulnerabilities. While effective, this method is not efficient as each cycle potentially reveals new issues previously obscured by earlier crashes, thus resulting in vulnerabilities being discovered sequentially.
Publisher OA PDF DOI
Model Cascading for Code: A Cascaded Black-Box Multi-Model Framework for Cost-Efficient Code Completion with Self-Testing
arXiv (Cornell University) · 2024-05-24
preprintOpen access
The rapid advancement of large language models (LLMs) has significantly improved code completion tasks, yet the trade-off between accuracy and computational cost remains a critical challenge. While using larger models and incorporating inference-time self-testing algorithms can significantly improve output accuracy, they incur substantial computational expenses at the same time. Furthermore, servers in real-world scenarios usually have a dynamic preference on the cost-accuracy tradeoff, depending on the budget, bandwidth, the concurrent user volume, and users' sensitivity to wrong answers. In this work, we introduce a novel framework combining model cascading and inference-time self-feedback algorithms to find multiple near-optimal self-testing options on the cost-accuracy tradeoff in LLM-based code generation. Our approach leverages self-generated tests to both enhance accuracy and evaluate model cascading decisions. As a blackbox inference-time method, it requires no access to internal model parameters. We further propose a threshold-based algorithm to determine when to deploy larger models and a heuristic to optimize the number of solutions, test cases, and test lines generated per model, based on budget constraints. Experimental results show that our cascading approach reduces costs by an average of 26%, and up to 70% in the best case, across various model families and datasets, while maintaining or improving accuracy in natural language generation tasks compared to both random and optimal single-model self-testing schemes. To our knowledge, this is the first work to provide a series of choices for optimizing the cost-accuracy trade-off in LLM code generation with self-testing.
Publisher OA PDF DOI
NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security
arXiv (Cornell University) · 2024-06-08 · 1 citations
preprintOpen access
Large Language Models (LLMs) are being deployed across various domains today. However, their capacity to solve Capture the Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. To address this, we develop a novel method to assess LLMs in solving CTF challenges by creating a scalable, open-source benchmark database specifically designed for these applications. This database includes metadata for LLM testing and adaptive learning, compiling a diverse range of CTF challenges from popular competitions. Utilizing the advanced function calling capabilities of LLMs, we build a fully automated system with an enhanced workflow and support for external tool calls. Our benchmark dataset and automated framework allow us to evaluate the performance of five LLMs, encompassing both black-box and open-source models. This work lays the foundation for future research into improving the efficiency of LLMs in interactive cybersecurity tasks and automated task planning. By providing a specialized benchmark, our project offers an ideal platform for developing, testing, and refining LLM-based approaches to vulnerability detection and resolution. Evaluating LLMs on these challenges and comparing with human performance yields insights into their potential for AI-driven cybersecurity solutions to perform real-world threat management. We make our benchmark dataset open source to public https://github.com/NYU-LLM-CTF/NYU_CTF_Bench along with our playground automated framework https://github.com/NYU-LLM-CTF/llm_ctf_automation.
Publisher OA PDF DOI
An Empirical Evaluation of LLMs for Solving Offensive Security Challenges
arXiv (Cornell University) · 2024-02-19 · 4 citations
preprintOpen access
Capture The Flag (CTF) challenges are puzzles related to computer security scenarios. With the advent of large language models (LLMs), more and more CTF participants are using LLMs to understand and solve the challenges. However, so far no work has evaluated the effectiveness of LLMs in solving CTF challenges with a fully automated workflow. We develop two CTF-solving workflows, human-in-the-loop (HITL) and fully-automated, to examine the LLMs' ability to solve a selected set of CTF challenges, prompted with information about the question. We collect human contestants' results on the same set of questions, and find that LLMs achieve higher success rate than an average human participant. This work provides a comprehensive evaluation of the capability of LLMs in solving real world CTF challenges, from real competition to fully automated workflow. Our results provide references for applying LLMs in cybersecurity education and pave the way for systematic evaluation of offensive cybersecurity capabilities in LLMs.
Publisher OA PDF DOI

Recent grants

CRII: SaTC: ExHume: An Empirical Approach to Program Analysis for Security
NSF · $175k · 2017–2021

Frequent coauthors

Ramesh Karri
29 shared
Hammond Pearce
UNSW Sydney
24 shared
Tim Leek
24 shared
Benjamin Tan
22 shared
Andrew Fasano
20 shared
Siddharth Garg
California State University, Long Beach
20 shared
Joshua Bundt
Universidad del Noreste
14 shared
Baleegh Ahmad
New York University
14 shared

Resume-aware match score
Save to shortlist
AI-drafted outreach

See your match with Brendan Dolan-Gavitt

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

Join the waitlist How it works

Free to start
No credit card
30-second signup