Benjamin Van Durme

Joint Appointment; Associate Professor, Whiting School of Engineering, Computer Science

Johns Hopkins University · Neuroscience

Active 2003–2026

h-index 54

Citations 11.8k

Papers 445 · 207 last 5y

Funding $721k

About

Benjamin Van Durme is an Associate Professor in Computer Science and Cognitive Science at Johns Hopkins University. He is a member of the Center for Language and Speech Processing (CLSP) and leads Natural Language Understanding research at the Human Language Technology Center of Excellence (HLTCOE). His research focuses on helping people work with large amounts of information by understanding the content of documents and images, assisting in information retrieval, and enabling systems to answer questions about that content. Van Durme collaborates on research topics including natural language processing, data mining, social media analysis, machine learning, linguistic semantics, and broader areas within Artificial Intelligence and Cognitive Science. His work in decompositional semantics is organized through Decomp.io. Additionally, he serves as the research lead at Microsoft Semantic Machines.

Research topics

Computer Science
Artificial Intelligence
Natural Language Processing
Information Retrieval
Philosophy
History
Art history
Linguistics
Geology
Epistemology
Library science
Physics
Chemistry
Programming language
Algorithm

Selected publications

Synthetic Function Demonstrations Improve Generation in Low-Resource Programming Languages
2026-04-30
article · Open access
A key consideration when training an LLM is whether the target language is more or less resourced, for example English compared to Welsh, or Python compared to Excel. Typical training data for programming languages consists of real program demonstrations coupled with explanatory human-written comments. In this work we present a novel approach to the creation of such data for low-resource programming languages, which lack naturally occurring data. Our process generates synthetic, textbook-quality demonstrations of how to use library functions, which we show makes for good model finetuning data. We demonstrate this in the example domain of Excel formulas. First, we collate language documentation, then we use this to augment a powerful teacher model which generates synthetic training data, and finally finetune student models on the demonstrations. Our technique improves student performance on 2 question-answering datasets: WikiTQ and TAT-QA. We also show advantages of finetuning over standard RAG approaches, which can offer only modest improvement due to the unfamiliarity of the target domain to student models.
Does Reasoning Make Search More Fair? Comparing Fairness in Reasoning and Non-Reasoning Rerankers
arXiv (Cornell University) · 2026-03-11
preprint · Open access
While reasoning rerankers, such as Rank1, have demonstrated strong abilities in improving ranking relevance, it is unclear how they perform on other retrieval qualities such as fairness. We conduct the first systematic comparison of fairness between reasoning and non-reasoning rerankers. Using the TREC 2022 Fair Ranking Track dataset, we evaluate six reranking models across multiple retrieval settings and demographic attributes. Our findings demonstrate that reasoning neither improves nor harms fairness compared to non-reasoning approaches. Our fairness metric, Attention-Weighted Rank Fairness (AWRF), remained stable (0.33-0.35) across all models, even as relevance varied substantially (nDCG 0.247-1.000). Demographic breakdown analysis revealed fairness gaps for geographic attributes regardless of model architecture. These results indicate that future work in specializing reasoning models to be aware of fairness attributes could lead to improvements, as current implementations preserve the fairness characteristics of their input ranking.
Can LLMs Identify Tax Abuse?
Proceedings of the AAAI Conference on Artificial Intelligence · 2026-03-14
article · Open access · Senior author
We investigate whether large language models can discover and analyze U.S. tax-minimization strategies. This real-world domain challenges even seasoned human experts, and progress can reduce tax revenue lost from well-advised, wealthy taxpayers. We evaluate the most advanced LLMs on their ability to (1) interpret and verify tax strategies, (2) fill in gaps in partially specified strategies, and (3) generate complete, end-to-end strategies from scratch. This domain should be of particular interest to the LLM reasoning community: unlike synthetic challenge problems or scientific reasoning tasks, U.S. tax law involves navigating hundreds of thousands of pages of statutes, case law, and administrative guidance, all updated regularly. Notably, an LLM identified an apparently novel tax strategy, highlighting these models' potential to revolutionize tax agencies' fight against tax abuse.
ChatGPT Generates a Novel Tax Strategy
SSRN Electronic Journal · 2026-01-01
preprint · Open access · Senior author
SocialNLI: A Dialogue-Centric Social Inference Dataset
ArXiv.org · 2025-10-06
preprint · Open access · Senior author
Making theory-of-mind inferences from human dialogue is a strong indicator of a model's underlying social abilities, which are fundamental for adept AI assistants. However, large language and reasoning models struggle to understand sophisticated social phenomena in transcript data, such as sarcasm and irony. To assess the weaknesses of current models and to identify their solutions, we introduce SocialNLI (SoNLI) -- the first social dialogue inference dataset. SoNLI consists of a collection of dialogue transcripts hand-picked to center complex social nuances like irony and sarcasm, paired with inferences, corresponding likelihood scores, and human-written explanations. We explore social inference analysis as a facet of theory-of-mind, and evaluate LLM and reasoning model theory-of-mind ability through multi-step counterfactual reasoning.
Certified Mitigation of Worst-Case LLM Copyright Infringement
ArXiv.org · 2025-04-22
preprint · Open access
The exposure of large language models (LLMs) to copyrighted material during pre-training raises concerns about unintentional copyright infringement post-deployment. This has driven the development of "copyright takedown" methods, post-training approaches aimed at preventing models from generating content substantially similar to copyrighted ones. While current mitigation approaches are somewhat effective for average-case risks, we demonstrate that they overlook worst-case copyright risks exhibited by the existence of long, verbatim quotes from copyrighted sources. We propose BloomScrub, a remarkably simple yet highly effective inference-time approach that provides certified copyright takedown. Our method repeatedly interleaves quote detection with rewriting techniques to transform potentially infringing segments. By leveraging efficient data sketches (Bloom filters), our approach enables scalable copyright screening even for large-scale real-world corpora. When quotes beyond a length threshold cannot be removed, the system can abstain from responding, offering certified risk reduction. Experimental results show that BloomScrub reduces infringement risk, preserves utility, and accommodates different levels of enforcement stringency with adaptive abstention. Our results suggest that lightweight, inference-time methods can be surprisingly effective for copyright prevention.
All Claims Are Equal, but Some Claims Are More Equal Than Others: Importance-Sensitive Factuality Evaluation of LLM Generations
ArXiv.org · 2025-10-08
preprint · Open access
Existing methods for evaluating the factuality of large language model (LLM) responses treat all claims as equally important. This results in misleading evaluations when vital information is missing or incorrect as it receives the same weight as peripheral details, raising the question: how can we reliably detect such differences when there are errors in key information? Current approaches that measure factuality tend to be insensitive to omitted or false key information. To investigate this lack of sensitivity, we construct VITALERRORS, a benchmark of 6,733 queries with minimally altered LLM responses designed to omit or falsify key information. Using this dataset, we demonstrate the insensitivities of existing evaluation metrics to key information errors. To address this gap, we introduce VITAL, a set of metrics that provide greater sensitivity in measuring the factuality of responses by incorporating the relevance and importance of claims with respect to the query. Our analysis demonstrates that VITAL metrics more reliably detect errors in key information than previous methods. Our dataset, metrics, and analysis provide a foundation for more accurate and robust assessment of LLM factuality.
arXiv2Table: Toward Realistic Benchmarking and Evaluation for LLM-Based Literature-Review Table Generation
arXiv (Cornell University) · 2025-04-14
preprint · Open access
Literature review tables are essential for summarizing and comparing collections of scientific papers. In this paper, we study the automatic generation of such tables from a pool of papers to satisfy a user's information need. Building on recent work (Newman et al., 2024), we move beyond oracle settings by (i) simulating well-specified yet schema-agnostic user demands that avoid leaking gold column names or values, (ii) explicitly modeling retrieval noise via semantically related but out-of-scope distractor papers verified by human annotators, and (iii) introducing a lightweight, annotation-free, utilization-oriented evaluation that decomposes utility into schema coverage, unary cell fidelity, and pairwise relational consistency, while measuring paper selection through a two-way QA procedure (gold to system and system to gold) with recall, precision, and F1. To support reproducible evaluation, we introduce arXiv2Table, a benchmark of 1,957 tables referencing 7,158 papers, with human-verified distractors and rewritten, schema-agnostic user demands. We also develop an iterative, batch-based generation method that co-refines paper filtering and schema over multiple rounds. We validate the evaluation protocol with human audits and cross-evaluator checks. Extensive experiments show that our method consistently improves over strong baselines, while absolute scores remain modest, underscoring the task's difficulty. Our data and code are available at https://github.com/JHU-CLSP/arXiv2Table.
Rank-K: Test-Time Reasoning for Listwise Reranking
ArXiv.org · 2025-05-20 · 1 citation
preprint · Open access
Retrieve-and-rerank is a popular retrieval pipeline because of its ability to make slow but effective rerankers efficient enough at query time by reducing the number of comparisons. Recent works in neural rerankers take advantage of large language models for their capability in reasoning between queries and passages and have achieved state-of-the-art retrieval effectiveness. However, such rerankers are resource-intensive, even after heavy optimization. In this work, we introduce Rank-K, a listwise passage reranking model that leverages the reasoning capability of a reasoning language model at query time, providing test-time scalability to serve hard queries. We show that Rank-K improves retrieval effectiveness by 23% over RankZephyr, the state-of-the-art listwise reranker, when reranking a BM25 initial ranked list and by 19% when reranking strong retrieval results from SPLADE-v3. Since Rank-K is inherently a multilingual model, we found that it ranks passages based on queries in different languages as effectively as it does in monolingual retrieval.

Recent grants

Computational Statutory Reasoning
NSF · $597k · 2022–2025
Collaborative Research: The MegaAttitude Project: Investigating selection and polysemy at the scale of the lexicon
NSF · $124k · 2018–2023

Frequent coauthors

Adam Poliak
72 shared
Patrick Xia
67 shared
Catherine Havasi
66 shared
Felipe Meneguzzi
65 shared
Antoine Raux
Honda (United States)
65 shared
Gita Sukthankar
65 shared
William F. Lawless
Paine College
65 shared
Mirsad Hadžikadić
University of North Carolina at Charlotte
65 shared

Education

Ph.D., Computer Science
University of California, Berkeley
2008
M.S., Computer Science
University of California, Berkeley
2003
B.S., Computer Science
University of California, Berkeley
2001

Awards & honors

Celebrating Women in Data Science and AI Symposium recogniti…
Amazon AI fellowship program for Johns Hopkins students
