
Benjamin Van Durme
· Joint Appointment; Associate Professor, Whiting School of Engineering, Computer ScienceVerifiedJohns Hopkins University · Neuroscience
Active 2003–2026
About
Benjamin Van Durme is an Associate Professor in Computer Science and Cognitive Science at Johns Hopkins University. He is a member of the Center for Language and Speech Processing (CLSP) and leads Natural Language Understanding research at the Human Language Technology Center of Excellence (HLTCOE). His research focuses on helping people work with large amounts of information by understanding the content of documents and images, assisting in information retrieval, and enabling systems to answer questions about that content. Van Durme collaborates on research topics including natural language processing, data mining, social media analysis, machine learning, linguistic semantics, and broader areas within Artificial Intelligence and Cognitive Science. His work in decompositional semantics is organized through Decomp.io. Additionally, he serves as the research lead at Microsoft Semantic Machines.
Research topics
- Computer Science
- Artificial Intelligence
- Natural Language Processing
- Information Retrieval
- Philosophy
- History
- Art history
- Linguistics
- Geology
- Epistemology
- Library science
- Physics
- Chemistry
- Programming language
- Algorithm
Selected publications
Synthetic Function Demonstrations Improve Generation in Low-Resource Programming Languages
2026-04-30
articleOpen accessA key consideration when training an LLM is whether the target language is more or less resourced, for example English compared to Welsh, or Python compared to Excel. Typical training data for programming languages consists of real program demonstrations coupled with explanatory human-written comments. In this work we present a novel approach to the creation of such data for low resource programming languages, which lack naturally occurring data. Our process generates synthetic, textbook-quality demonstrations of how to use library functions, which we show makes for good model finetuning data. We demonstrate in an example domain of Excel Formulas. First, we collate language documentation, then we use this to augment a powerful teacher model which generates synthetic training data, and finally finetune student models on the demonstrations. Our technique improves student performance on 2 question-answering datasets: WikiTQ and TAT-QA. We also show advantages of finetuning over standard RAG approaches, which can offer only modest improvement due to the unfamiliarity of the target domain to student models.
Does Reasoning Make Search More Fair? Comparing Fairness in Reasoning and Non-Reasoning Rerankers
arXiv (Cornell University) · 2026-03-11
preprintOpen accessWhile reasoning rerankers, such as Rank1, have demonstrated strong abilities in improving ranking relevance, it is unclear how they perform on other retrieval qualities such as fairness. We conduct the first systematic comparison of fairness between reasoning and non-reasoning rerankers. Using the TREC 2022 Fair Ranking Track dataset, we evaluate six reranking models across multiple retrieval settings and demographic attributes. Our findings demonstrate reasoning neither improve nor harm fairness compared to non-reasoning approaches. Our fairness metric, Attention-Weighted Rank Fairness (AWRF) remained stable (0.33-0.35) across all models, even as relevance varies substantially (nDCG 0.247-1.000). Demographic breakdown analysis revealed fairness gaps for geographic attributes regardless of model architecture. These results indicate that future work in specializing reasoning models to be aware of fairness attributes could lead to improvements, as current implementations preserve the fairness characteristics of their input ranking.
Does Reasoning Make Search More Fair? Comparing Fairness in Reasoning and Non-Reasoning Rerankers
ArXiv.org · 2026-03-11
articleOpen accessWhile reasoning rerankers, such as Rank1, have demonstrated strong abilities in improving ranking relevance, it is unclear how they perform on other retrieval qualities such as fairness. We conduct the first systematic comparison of fairness between reasoning and non-reasoning rerankers. Using the TREC 2022 Fair Ranking Track dataset, we evaluate six reranking models across multiple retrieval settings and demographic attributes. Our findings demonstrate reasoning neither improve nor harm fairness compared to non-reasoning approaches. Our fairness metric, Attention-Weighted Rank Fairness (AWRF) remained stable (0.33-0.35) across all models, even as relevance varies substantially (nDCG 0.247-1.000). Demographic breakdown analysis revealed fairness gaps for geographic attributes regardless of model architecture. These results indicate that future work in specializing reasoning models to be aware of fairness attributes could lead to improvements, as current implementations preserve the fairness characteristics of their input ranking.
Proceedings of the AAAI Conference on Artificial Intelligence · 2026-03-14
articleOpen accessSenior authorWe investigate whether large language models can discover and analyze U.S. tax-minimization strategies. This real-world domain challenges even seasoned human experts, and progress can reduce tax revenue lost from well-advised, wealthy taxpayers. We evaluate the most advanced LLMs on their ability to (1) interpret and verify tax strategies, (2) fill in gaps in partially specified strategies, and (3) generate complete, end-to-end strategies from scratch. This domain should be of particular interest to the LLM reasoning community: unlike synthetic challenge problems or scientific reasoning tasks, U.S. tax law involves navigating hundreds of thousands of pages of statutes, case law, and administrative guidance, all updated regularly. Notably, an LLM identified an apparently novel tax strategy, highlighting these models' potential to revolutionize tax agencies' fight against tax abuse.
ChatGPT Generates a Novel Tax Strategy
SSRN Electronic Journal · 2026-01-01
preprintOpen accessSenior authorSocialNLI: A Dialogue-Centric Social Inference Dataset
ArXiv.org · 2025-10-06
preprintOpen accessSenior authorMaking theory-of-mind inferences from human dialogue is a strong indicator of a model's underlying social abilities, which are fundamental for adept AI assistants. However, large language and reasoning models struggle to understand sophisticated social phenomena in transcript data, such as sarcasm and irony. To assess the weaknesses of current models and to identify their solutions, we introduce SocialNLI (SoNLI) -- the first social dialogue inference dataset. SoNLI consists of a collection of dialogue transcripts hand-picked to center complex social nuances like irony and sarcasm, paired with inferences, corresponding likelihood scores, and human-written explanations. We explore social inference analysis as a facet of theory-of-mind, and evaluate LLM and reasoning model theory-of-mind ability through multi-step counterfactual reasoning.
Certified Mitigation of Worst-Case LLM Copyright Infringement
ArXiv.org · 2025-04-22
preprintOpen accessThe exposure of large language models (LLMs) to copyrighted material during pre-training raises concerns about unintentional copyright infringement post deployment. This has driven the development of "copyright takedown" methods, post-training approaches aimed at preventing models from generating content substantially similar to copyrighted ones. While current mitigation approaches are somewhat effective for average-case risks, we demonstrate that they overlook worst-case copyright risks exhibits by the existence of long, verbatim quotes from copyrighted sources. We propose BloomScrub, a remarkably simple yet highly effective inference-time approach that provides certified copyright takedown. Our method repeatedly interleaves quote detection with rewriting techniques to transform potentially infringing segments. By leveraging efficient data sketches (Bloom filters), our approach enables scalable copyright screening even for large-scale real-world corpora. When quotes beyond a length threshold cannot be removed, the system can abstain from responding, offering certified risk reduction. Experimental results show that BloomScrub reduces infringement risk, preserves utility, and accommodates different levels of enforcement stringency with adaptive abstention. Our results suggest that lightweight, inference-time methods can be surprisingly effective for copyright prevention.
ArXiv.org · 2025-10-08
preprintOpen accessExisting methods for evaluating the factuality of large language model (LLM) responses treat all claims as equally important. This results in misleading evaluations when vital information is missing or incorrect as it receives the same weight as peripheral details, raising the question: how can we reliably detect such differences when there are errors in key information? Current approaches that measure factuality tend to be insensitive to omitted or false key information. To investigate this lack of sensitivity, we construct VITALERRORS, a benchmark of 6,733 queries with minimally altered LLM responses designed to omit or falsify key information. Using this dataset, we demonstrate the insensitivities of existing evaluation metrics to key information errors. To address this gap, we introduce VITAL, a set of metrics that provide greater sensitivity in measuring the factuality of responses by incorporating the relevance and importance of claims with respect to the query. Our analysis demonstrates that VITAL metrics more reliably detect errors in key information than previous methods. Our dataset, metrics, and analysis provide a foundation for more accurate and robust assessment of LLM factuality.
arXiv (Cornell University) · 2025-04-14
preprintOpen accessLiterature review tables are essential for summarizing and comparing collections of scientific papers. In this paper, we study the automatic generation of such tables from a pool of papers to satisfy a user's information need. Building on recent work (Newman et al., 2024), we move beyond oracle settings by (i) simulating well-specified yet schema-agnostic user demands that avoid leaking gold column names or values, (ii) explicitly modeling retrieval noise via semantically related but out-of-scope distractor papers verified by human annotators, and (iii) introducing a lightweight, annotation-free, utilization-oriented evaluation that decomposes utility into schema coverage, unary cell fidelity, and pairwise relational consistency, while measuring paper selection through a two-way QA procedure (gold to system and system to gold) with recall, precision, and F1. To support reproducible evaluation, we introduce arXiv2Table, a benchmark of 1,957 tables referencing 7,158 papers, with human-verified distractors and rewritten, schema-agnostic user demands. We also develop an iterative, batch-based generation method that co-refines paper filtering and schema over multiple rounds. We validate the evaluation protocol with human audits and cross-evaluator checks. Extensive experiments show that our method consistently improves over strong baselines, while absolute scores remain modest, underscoring the task's difficulty. Our data and code is available at https://github.com/JHU-CLSP/arXiv2Table.
Rank-K: Test-Time Reasoning for Listwise Reranking
ArXiv.org · 2025-05-20 · 1 citations
preprintOpen accessRetrieve-and-rerank is a popular retrieval pipeline because of its ability to make slow but effective rerankers efficient enough at query time by reducing the number of comparisons. Recent works in neural rerankers take advantage of large language models for their capability in reasoning between queries and passages and have achieved state-of-the-art retrieval effectiveness. However, such rerankers are resource-intensive, even after heavy optimization. In this work, we introduce Rank-K, a listwise passage reranking model that leverages the reasoning capability of the reasoning language model at query time that provides test time scalability to serve hard queries. We show that Rank-K improves retrieval effectiveness by 23\% over the RankZephyr, the state-of-the-art listwise reranker, when reranking a BM25 initial ranked list and 19\% when reranking strong retrieval results by SPLADE-v3. Since Rank-K is inherently a multilingual model, we found that it ranks passages based on queries in different languages as effectively as it does in monolingual retrieval.
Recent grants
Computational Statutory Reasoning
NSF · $597k · 2022–2025
NSF · $124k · 2018–2023
Frequent coauthors
- 72 shared
Adam Poliak
- 67 shared
Patrick Xia
- 66 shared
Catherine Havasi
- 65 shared
Felipe Meneguzzi
- 65 shared
Antoine Raux
Honda (United States)
- 65 shared
Gita Sukthankar
- 65 shared
William F. Lawless
Paine College
- 65 shared
Mirsad Hadžikadić
University of North Carolina at Charlotte
Education
- 2008
Ph.D., Computer Science
University of California, Berkeley
- 2003
M.S., Computer Science
University of California, Berkeley
- 2001
B.S., Computer Science
University of California, Berkeley
Awards & honors
- Celebrating Women in Data Science and AI Symposium recogniti…
- Amazon AI fellowship program for Johns Hopkins students
- Resume-aware match score
- Save to shortlist
- AI-drafted outreach
See your match with Benjamin Van Durme
PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.
- Free to start
- No credit card
- 30-second signup