Hamed Zamani
Associate Professor and CIIR Associate Director · University of Massachusetts Amherst · Manning College of Information and Computer Sciences
Active 2014–2026
About
Hamed Zamani is an Associate Professor at the Manning College of Information and Computer Sciences at the University of Massachusetts Amherst, where he also serves as the Associate Director of the Center for Intelligent Information Retrieval (CIIR). His research focuses on designing and evaluating statistical and machine learning models with applications to interactive information access systems, including search engines, recommender systems, and question answering. His current research interests include Neural Information Retrieval, Conversational Search, and Retrieval-Enhanced Machine Learning. Prior to his position at UMass, Zamani was a Researcher at Microsoft, working on a wide range of problems related to search engines. He received his Ph.D. in 2019 from UMass under the supervision of W. Bruce Croft, and was awarded the UMass CICS Outstanding Dissertation Award for his thesis on weakly supervised neural information retrieval. He holds M.Sc. and B.Sc. degrees from the University of Tehran. Zamani is actively involved in organizing workshops and conferences in the field, has received multiple awards including the ACM SIGIR Early Career Excellence in Research and Excellence in Community Engagement Awards, and has been recognized for his contributions to the research community.
Research topics
- Computer Science
- Machine Learning
- Data Mining
- Artificial Intelligence
- Information Retrieval
- World Wide Web
- Data science
Selected publications
Evaluation of Agents under Simulated AI Marketplace Dynamics
arXiv (Cornell University) · 2026-04-15
preprint · Open access
Modern information access ecosystems consist of mixtures of systems, such as retrieval systems and large language models, and increasingly rely on marketplaces to mediate access to models, tools, and data, making competition between systems inherent to deployment. In such settings, outcomes are shaped not only by benchmark quality but also by competitive pressure, including user switching, routing decisions, and operational constraints. Yet evaluation is still largely conducted on static benchmarks with accuracy-focused measures that assume systems operate in isolation. This mismatch makes it difficult to predict post-deployment success and obscures competitive effects such as early-adoption advantages and market dominance. We introduce Marketplace Evaluation, a simulation-based paradigm that evaluates information access systems as participants in a competitive marketplace. By simulating repeated interactions and evolving user and agent preferences, the framework enables longitudinal evaluation and marketplace-level metrics, such as retention and market share, that complement and can extend beyond traditional accuracy-based metrics. We formalize the framework and outline a research agenda, motivated by business and economics, around marketplace simulation, metrics, optimization, and adoption in evaluation campaigns like TREC.
Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering
2026-04-12
article · Open access
Personalization is well studied in search and recommendation, but personalized question answering remains underexplored due to challenges in inferring preferences from long, noisy, implicit contexts and generating responses that are both accurate and aligned with user expectations. To address this, we propose Pathways of Thoughts (PoT), an inference-stage method that applies to any large language model (LLM) without task-specific fine-tuning. PoT models the thinking as an iterative decision process, where the model dynamically selects among cognitive operations such as reasoning, revision, personalization, and clarification. This enables exploration of multiple reasoning trajectories, producing diverse candidate responses that capture different perspectives. PoT then aggregates and reweights these candidates according to inferred user preferences, yielding a final personalized response that benefits from the complementary strengths of diverse reasoning paths. Experiments on the LaMP-QA benchmark show that PoT consistently outperforms competitive baselines, achieving up to a 10.8% relative improvement. Human evaluation further validates these improvements, with annotators preferring PoT in 66% of cases compared to the best-performing baseline and reporting ties in 15% of cases.
Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking
2025-07-18
preprint · Open access · Senior author
We present a novel approach for training small language models for reasoning-intensive document ranking that combines knowledge distillation with reinforcement learning optimization. While existing methods often rely on expensive human annotations or large black-box language models, our methodology leverages web data and a teacher LLM to automatically generate high-quality training examples with relevance explanations. By framing document ranking as a reinforcement learning problem and incentivizing explicit reasoning capabilities, we train a compact 3B parameter language model that achieves state-of-the-art performance on the BRIGHT benchmark. Our model ranks third on the leaderboard while using substantially fewer parameters than other approaches, outperforming models that are over 20 times larger. Through extensive experiments, we demonstrate that generating explanations during inference, rather than directly predicting relevance scores, enables more effective reasoning with smaller language models. The self-supervised nature of our method offers a scalable and interpretable solution for modern information retrieval systems.
Hypencoder: Hypernetworks for Information Retrieval
2025-07-13 · 3 citations
article · Open access · Senior author
Existing information retrieval systems are largely constrained by their reliance on vector inner products to assess query-document relevance, which naturally limits the expressiveness of the relevance score they can produce. We propose a new paradigm: instead of representing a query as a vector, we use a small neural network that acts as a learned query-specific relevance function. This small neural network takes a document representation as input (in this work we use a single vector) and produces a scalar relevance score. To produce the small neural network we use a hypernetwork, a network that produces the weights of other networks, as our query encoder. We name this category of encoder models Hypencoders. Experiments on in-domain search tasks show that Hypencoders significantly outperform strong dense retrieval models and even surpass reranking models and retrieval models with an order of magnitude more parameters. To assess the extent of Hypencoders' capabilities, we evaluate on a set of hard retrieval tasks including tip-of-the-tongue and instruction-following retrieval tasks. On harder tasks, we find that the performance gap widens substantially compared to standard retrieval tasks. Furthermore, to demonstrate the practicality of our method, we implement an approximate search algorithm and show that our model is able to retrieve from a corpus of 8.8M documents in under 60 milliseconds.
Pre-Trained Models for Search and Recommendation: Introduction to the Special Issue—Part 2
ACM Transactions on Information Systems · 2025-05-27
article
Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
ArXiv.org · 2025-10-31
preprint · Open access
Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M token context windows (with and without retrieval-augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5% to 12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.
Learning to Rank for Multiple Retrieval-Augmented Models through Iterative Utility Maximization
2025-07-18
article · Open access · Senior author
This paper investigates the design of a unified search engine to serve multiple retrieval-augmented generation (RAG) agents, each with a distinct task, backbone large language model (LLM), and RAG strategy. We introduce an iterative approach where the search engine generates retrieval results for the RAG agents and gathers feedback on the quality of the retrieved documents during an offline phase. This feedback is then used to iteratively optimize the search engine using an expectation-maximization algorithm, with the goal of maximizing each agent's utility function. Additionally, we adapt this to an online setting, allowing the search engine to refine its behavior based on real-time feedback from individual agents to better serve results to each of them. Experiments on datasets from the Knowledge-Intensive Language Tasks (KILT) benchmark demonstrate that, on average, our approach significantly outperforms baselines across 18 RAG models. We demonstrate that our method effectively "personalizes" the retrieval for each RAG agent based on the collected feedback. Finally, we provide a comprehensive ablation study to explore various aspects of our method.
Reliable Annotations with Less Effort: Evaluating LLM-Human Collaboration in Search Clarifications
2025-07-18
preprint · Open access · Senior author
Despite growing interest in using large language models (LLMs) to automate annotation, their effectiveness in complex, nuanced, and multi-dimensional labelling tasks remains relatively underexplored. This study focuses on annotation for the search clarification task, leveraging a high-quality, multi-dimensional dataset that includes five distinct fine-grained annotation subtasks. Although LLMs have shown impressive capabilities in general settings, our study reveals that even state-of-the-art models struggle to replicate human-level performance in subjective or fine-grained evaluation tasks. Through a systematic assessment, we demonstrate that LLM predictions are often inconsistent, poorly calibrated, and highly sensitive to prompt variations. To address these limitations, we propose a simple yet effective human-in-the-loop (HITL) workflow that uses confidence thresholds and inter-model disagreement to selectively involve human review. Our findings show that this lightweight intervention significantly improves annotation reliability while reducing human effort by up to 45%, offering a relatively scalable and cost-effective yet accurate path forward for deploying LLMs in real-world evaluation settings.
Open-Ended and Knowledge-Intensive Video Question Answering
ArXiv.org · 2025-02-17
preprint · Open access · Senior author
Video question answering that requires external knowledge beyond the visual content remains a significant challenge in AI systems. While models can effectively answer questions based on direct visual observations, they often falter when faced with questions requiring broader contextual knowledge. To address this limitation, we investigate knowledge-intensive video question answering (KI-VideoQA) through the lens of multi-modal retrieval-augmented generation, with a particular focus on handling open-ended questions rather than just multiple-choice formats. Our comprehensive analysis examines various retrieval augmentation approaches using cutting-edge retrieval and vision language models, testing both zero-shot and fine-tuned configurations. We investigate several critical dimensions: the interplay between different information sources and modalities, strategies for integrating diverse multi-modal contexts, and the dynamics between query formulation and retrieval result utilization. Our findings reveal that while retrieval augmentation shows promise in improving model performance, its success is heavily dependent on the chosen modality and retrieval methodology. The study also highlights the critical role of query construction and retrieval depth optimization in effective knowledge integration. Through our proposed approach, we achieve a substantial 17.5% improvement in accuracy on multiple choice questions in the KnowIT VQA dataset, establishing new state-of-the-art performance levels.
Frequent coauthors
- Nick Craswell · Seattle University · 70 shared
- W. Bruce Croft · University of Massachusetts Amherst · 45 shared
- Bhaskar Mitra · 44 shared
- Sebastian Hofstätter · 31 shared
- Azadeh Shakery · University of Tehran · 27 shared
- Andrew McCallum · Queen Elizabeth University Hospital · 19 shared
- Gord Lueck · Microsoft (United States) · 19 shared
- Alireza Salemi · University of Massachusetts Amherst · 17 shared
Education
- Ph.D., University of Massachusetts Amherst (2019)
- M.S., University of Tehran
- B.S., University of Tehran
Awards & honors
- ACM SIGIR Early Career Excellence in Research (2023)
- ACM SIGIR Excellence in Community Engagement Award (2023)
- Best Short Paper Award at ACM SIGIR 2024
- Best Student Paper Award at ACM SIGIR 2023
- Best Short Paper Award at ACM SIGIR 2022