James Allan

Associate Dean of Research & Engagement · Distinguished Professor · Verified

University of Massachusetts Amherst · Information Science and Technology

Active 1829–2025

h-index: 60

Citations: 15.6k

Papers: 407 · 60 last 5y

Funding: $4.9M



About

James Allan is a Distinguished Professor and Associate Dean of Research and Engagement at the University of Massachusetts Amherst, where he has been a faculty member since 1994. He directs the Center for Intelligent Information Retrieval (CIIR). His research focuses on information retrieval, event-based information organization, minimally interactive retrieval and organization, and incorporating novelty into retrieval algorithms. His work also explores techniques for querying across languages and methods for recognizing controversial or misleading information in internet texts. Professor Allan has served on the organizing and program committees for major conferences such as SIGIR, CIKM, and WSDM, and has been an associate editor for ACM's Transactions on Information Systems and Elsevier's Information Processing and Management. He is currently on the editorial board of Foundations and Trends in Information Retrieval. His contributions have been recognized with Best Paper awards at SIGIR and CHIIR and a SIGIR Test of Time Award, and he has been elected to the CRA Board of Directors and its Executive Committee as Treasurer. He holds a PhD and MS in computer science from Cornell University and an AB in mathematics from Grinnell College.

Research topics

Computer Science
Artificial Intelligence
Information Retrieval
Data Mining
Machine Learning
Business
Microeconomics
World Wide Web
Economics
Marketing
Data science

Selected publications

Reducing the Emotional Distress of Content Moderators through LLM-based Target Substitution in Implicit and Explicit Hate-Speech
2025-06-12 · 3 citations
article · Open access · Senior author
Hate speech is often subtle and context-dependent, making it especially difficult to detect, particularly when it requires contextual familiarity related to the targeted group. Exposure to hate speech and toxic content can lead to significant psychological harm, including increased stress and anxiety levels, and content moderators are particularly vulnerable due to exposure to such harmful material. This work explores the role of personalization in content moderation by examining how alignment between a moderator's background and the targeted group affects emotional and cognitive responses. We propose a target substitution method that replaces references to real communities in hate speech with fictional characters, aiming to reduce emotional distress while preserving the semantic integrity necessary for accurate moderation. Through both automated and human evaluations, we find that substitution significantly reduces emotional distress across all groups with a trade-off in accuracy. Moreover, we observe that moderators demonstrate higher accuracy when moderating content aligned with their own demographic background, even after substitution. This suggests the key role of contextual familiarity in interpreting implicit hate. Additionally, our study highlights the cumulative impact of prolonged exposure to hate speech, showing that moderators experience increased emotional distress over time, particularly in non-targeted scenarios. Despite this, target substitution consistently mitigates distress while maintaining moderation efficacy.
Publisher OA PDF DOI
Probing Ranking LLMs: A Mechanistic Analysis for Information Retrieval
2025-07-18 · 3 citations
article · Senior author
Transformer networks, particularly those achieving performance comparable to GPT models, are well known for their robust feature extraction abilities. However, the nature of these extracted features and their alignment with human-engineered ones remain largely unexplored. In this work, we investigate the internal mechanisms of state-of-the-art, fine-tuned LLMs for passage reranking. We employ a probing-based analysis to examine neuron activations in ranking LLMs, identifying the presence of known human-engineered and semantic features. Our study spans a broad range of feature categories, including lexical signals, document structure, query-document interactions, and complex semantic representations, to uncover underlying patterns influencing ranking decisions. Through experiments on four different ranking LLMs, we identify statistical IR features that are prominently encoded in LLM activations, as well as others that are notably missing. Furthermore, we analyze how these models respond to out-of-distribution queries and documents, revealing distinct generalization behaviors. By dissecting the latent representations within LLM activations, we provide an important initial step toward better interpretability and aim to improve both the interpretability and effectiveness of ranking models. Our findings offer crucial insights for developing more transparent and reliable retrieval systems. We release all necessary scripts and code to support further exploration.
Publisher DOI
RankSHAP: Shapley Value Based Feature Attributions for Learning to Rank
arXiv (Cornell University) · 2024-05-03
preprint · Open access · Senior author
Numerous works propose post-hoc, model-agnostic explanations for learning to rank, focusing on ordering entities by their relevance to a query through feature attribution methods. However, these attributions often weakly correlate or contradict each other, confusing end users. We adopt an axiomatic game-theoretic approach, popular in the feature attribution community, to identify a set of fundamental axioms that every ranking-based feature attribution method should satisfy. We then introduce RankSHAP, extending classical Shapley values to ranking. We evaluate the RankSHAP framework through extensive experiments on two datasets, multiple ranking methods and evaluation metrics. Additionally, a user study confirms RankSHAP's alignment with human intuition. We also perform an axiomatic analysis of existing rank attribution algorithms to determine their compliance with our proposed axioms. Ultimately, our aim is to equip practitioners with a set of axiomatically backed feature attribution methods for studying IR ranking models that ensure generality as well as consistency.
Publisher OA PDF DOI
Target Span Detection for Implicit Harmful Content
2024-08-02 · 3 citations
article · Open access
Identifying the targets of hate speech is a crucial step in grasping the nature of such speech and, ultimately, improving the detection of offensive posts on online forums. Much harmful content on online platforms uses implicit language -- especially when targeting vulnerable and protected groups -- such as using stereotypical characteristics instead of explicit target names, making it harder to detect and mitigate the language. In this study, we focus on identifying implied targets of hate speech, essential for recognizing subtler hate speech and enhancing the detection of harmful content on digital platforms. We define a new task aimed at identifying the targets even when they are not explicitly stated. To address that task, we collect and annotate target spans in three prominent implicit hate speech datasets: SBIC, DynaHate, and IHC. We call the resulting merged collection Implicit-Target-Span. The collection is achieved using an innovative pooling method with matching scores based on human annotations and Large Language Models (LLMs). Our experiments indicate that Implicit-Target-Span provides a challenging test bed for target span detection methods.
Publisher OA PDF DOI
Language Concept Erasure for Language-invariant Dense Retrieval
2024-01-01 · 1 citation
article · Open access · Senior author
Multilingual models aim for language-invariant representations but still prominently encode language identity. This, along with the scarcity of high-quality parallel retrieval data, limits their performance in retrieval. We introduce LANCER, a multi-task learning framework that improves language-invariant dense retrieval by reducing language-specific signals in the embedding space. Leveraging the notion of linear concept erasure, we design a loss function that penalizes cross-correlation between representations and their language labels. LANCER leverages only English retrieval data and general multilingual corpora, training models to focus on language-invariant retrieval by semantic similarity without necessitating a vast parallel corpus. Experimental results on various datasets show our method consistently improves over baselines, with extensive analyses demonstrating greater language agnosticism.
Publisher OA PDF DOI
Discovering Biases in Information Retrieval Models Using Relevance Thesaurus as Global Explanation
arXiv (Cornell University) · 2024-10-04
preprint · Open access · Senior author
Most efforts in interpreting neural relevance models have focused on local explanations, which explain the relevance of a document to a query but are not useful in predicting the model's behavior on unseen query-document pairs. We propose a novel method to globally explain neural relevance models by constructing a "relevance thesaurus" containing semantically relevant query and document term pairs. This thesaurus is used to augment lexical matching models such as BM25 to approximate the neural model's predictions. Our method involves training a neural relevance model to score the relevance of partial query and document segments, which is then used to identify relevant terms across the vocabulary space. We evaluate the obtained thesaurus explanation based on ranking effectiveness and fidelity to the target neural ranking model. Notably, our thesaurus reveals the existence of brand name bias in ranking models, demonstrating one advantage of our explanation method.
Publisher OA PDF DOI
Discovering Biases in Information Retrieval Models Using Relevance Thesaurus as Global Explanation
2024-01-01
article · Open access · Senior author
Publisher OA PDF DOI
Future of Information Retrieval Research in the Age of Generative AI
arXiv (Cornell University) · 2024-12-03 · 3 citations
preprint · Open access · 1st author · Corresponding
In the fast-evolving field of information retrieval (IR), the integration of generative AI technologies such as large language models (LLMs) is transforming how users search for and interact with information. Recognizing this paradigm shift at the intersection of IR and generative AI (IR-GenAI), a visioning workshop supported by the Computing Community Consortium (CCC) was held in July 2024 to discuss the future of IR in the age of generative AI. This workshop convened 44 experts in information retrieval, natural language processing, human-computer interaction, and artificial intelligence from academia, industry, and government to explore how generative AI can enhance IR and vice versa, and to identify the major challenges and opportunities in this rapidly advancing field. This report contains a summary of discussions as potentially important research topics and contains a list of recommendations for academics, industry practitioners, institutions, evaluation campaigns, and funding agencies.
Publisher OA PDF DOI
Soft Prompt Decoding for Multilingual Dense Retrieval
2023-07-18 · 10 citations
article · Open access · Senior author
In this work, we explore a Multilingual Information Retrieval (MLIR) task, where the collection includes documents in multiple languages. We demonstrate that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance. This is due to the heterogeneous and imbalanced nature of multilingual collections -- some languages are better represented in the collection and some benefit from large-scale training data. To address this issue, we present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space. To address the challenges of data scarcity and imbalance, we introduce a knowledge distillation strategy. The teacher model is trained on rich English retrieval data, and by leveraging bi-text data, our distillation framework transfers its retrieval knowledge to the multilingual document encoder. Therefore, our approach does not require any multilingual retrieval training data. Extensive experiments on three MLIR datasets with a total of 15 languages demonstrate that KD-SPD significantly outperforms competitive baselines in all cases. We conduct extensive analyses to show that our method has less language bias and better zero-shot transfer ability towards new languages.
Publisher OA PDF DOI
Rank-LIME: Local Model-Agnostic Feature Attribution for Learning to Rank
2023-08-09 · 17 citations
article · Senior author
Understanding why a model makes certain predictions is crucial when adapting it for real-world decision making. LIME is a popular model-agnostic feature attribution method for the tasks of classification and regression. However, the task of learning to rank in information retrieval is more complex in comparison with either classification or regression. In this work, we extend LIME to propose Rank-LIME, a model-agnostic, local, post-hoc linear feature attribution method for the task of learning to rank that generates explanations for ranked lists. We employ novel correlation-based perturbations and differentiable ranking loss functions, and introduce new metrics to evaluate ranking-based additive feature attribution models. We compare Rank-LIME with a variety of competing systems, with models trained on the MS MARCO datasets, and observe that Rank-LIME outperforms existing explanation algorithms in terms of Model Fidelity and Explain-NDCG. With this, we propose one of the first algorithms to generate additive feature attributions for explaining ranked lists.
Publisher DOI

Recent grants

EAGER: Dynamic Contextual Explanation of Search Results
NSF · $219k · 2020–2022
III: Small: Interactive Construction of Complex Query Models
NSF · $516k · 2016–2020
III: Small: Mirador: Explainable Computational Models for Recognizing and Understanding Controversial Topics Encountered Online
NSF · $500k · 2018–2022
III: Small: Topical Positioning System (TPS) for Informed Reading of Web Pages
NSF · $500k · 2012–2017
DC: Large: Collaborative Research: Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR
NSF · $2.1M · 2009–2016

Frequent coauthors

Robert J. Kelly
University of Exeter
32 shared
Mammo Yewondwossen
Nova Scotia Health Authority
21 shared
Derek Wilke
Dalhousie University
21 shared
Sheikh Muhammad Sarwar
20 shared
Razieh Rahimi
19 shared
Krista Chytyk‐Praznik
Queen Elizabeth II Health Sciences Centre
19 shared
Victor Lavrenko
VIR Biotechnology (United States)
18 shared
Mark D. Smucker
University of Waterloo
18 shared

Education

PhD, Computer Science
Cornell University
1995

Awards & honors

Best Paper awards at SIGIR in 2001 and 2006
Best Student Paper award at CHIIR in 2017
SIGIR Test of Time Award for a 1998 paper on event detection…
Elected to the CRA Board of Directors (2018)
Elected to the CRA Executive Committee as Treasurer (2018)
