Andrew McCallum
Distinguished Professor · University of Massachusetts Amherst · Computer Science
Active 1980–2026
About
Andrew McCallum is a Professor in the Computer Science Department at the University of Massachusetts Amherst. His research focuses on machine learning, natural language processing, and information extraction: he develops models and algorithms for extracting structured information from unstructured data, with applications in question answering, entity modeling, and document provenance. He has also contributed to probabilistic graphical models, data mining, and information retrieval. His collaborators include David Blei, David Jensen, and Bruce Croft, along with colleagues at UMass, elsewhere in academia, and in industry, including Google, Yahoo, and Microsoft Research. He has supervised numerous graduate students and postdoctoral researchers, many of whom have gone on to prominent positions in academia and industry.
Research topics
- Computer Science
- Artificial Intelligence
- Machine Learning
- Natural Language Processing
- Engineering
- Data science
- History
- Archaeology
Selected publications
Data and Code for "Tariffs and Goods-Market Search Frictions"
Mendeley Data · 2026-04-20
dataset · Open access · Senior author
We study tariffs in a general equilibrium dynamic model with search frictions between heterogeneous exporting producers and importing retailers. We show the model has a unique equilibrium and analytically characterize home unilateral import tariffs that maximize welfare given a passive foreign country. Search frictions add two terms to the standard optimal tariff expression: One lowers tariffs when contact rates are low; another when private export costs exceed social opportunity costs. Search frictions also introduce new incentives to subsidize imports due to market thickness effects. We calibrate our baseline to U.S. and Chinese 2016 data. We compare this baseline to a counterfactual with international search costs reduced to domestic levels but with all other parameters fixed. We find that higher baseline search costs reduce optimal U.S. unilateral and Nash tariffs and attenuate welfare responses to tariff changes.
E040 Pre-biologic cascade of care for people with positive IGRA results in Oxford
Lara D. Veeken · 2025-04-01 · 1 citations
article · Open access · Senior author
Background/Aims: In low-TB-incidence countries, the interferon gamma release assay (IGRA) plays a key role in screening for tuberculosis (TB) infection prior to immunosuppressants, and in new entrants and contacts of people with active TB. Biologic therapies are increasingly used in Rheumatology, and several agents require TB screening in advance to minimise the risk of TB reactivation. The risk of reactivation varies considerably by agent. The cascade of care from a positive IGRA result to TB preventive treatment (TPT) is vital.
Methods: We conducted a retrospective audit analysing IGRA requesting and testing results over 5 years. This project was approved as an audit locally. We collected clinical data from the electronic patient records and only routinely collected data as part of NHS care was utilised for this. Data were analysed using MS Excel and Sankey diagrams were created using Python to illustrate the pathway.
Results: We identified 3350 IGRA tests (excluding duplicates) of which 465 (13.9%) gave positive results. Adequate clinical records were not available for some patients, leaving us with 405 people. 140 (35%) of these were under investigation for active TB, 161 (40%) were tested as part of preventive screening (146 contacts of people with active TB disease, 11 new entrants to the UK and 4 People Living with HIV), and 100 (25%) were tested prior to receiving biologic (85) or non-biologic (15) immunosuppressants. 85/100 (85%) of patients with a positive IGRA prior to immunosuppression were referred to TB clinic, of whom 84/85 (99%) were seen. 77/84 (92%) were offered TPT and 73/77 (95%) completed. Overall, 73/100 (73%) completed TPT after a positive IGRA test. One patient developed active TB infection following a positive but unactioned IGRA and receiving adalimumab. None of the patients who completed TPT developed active TB despite treatment with biologics.
Conclusion: We demonstrate significant attrition in the number of patients receiving TPT prior to significant immunosuppression and one case of avoidable morbidity. Clear local pathways are needed for TB screening pre-biologics and significant immunosuppression. Strengthening each step in the cascade of care is indicated to increase the proportion of people completing TPT and reducing morbidity.
Disclosure: V. Tobert: None. A. O’Reilly: None. E. Bateman: None. J. Hoy: None. S. Dubey: Other; Advisory board - Abbvie, Boehringer Ingelheim. A. McCallum: None.
Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions
ArXiv.org · 2025-05-09
preprint · Open access · Senior author
Autoregressive models (ARMs), which predict subsequent tokens one-by-one "from left to right," have achieved significant success across a wide range of sequence generation tasks. However, they struggle to accurately represent sequences that require satisfying sophisticated constraints or whose sequential dependencies are better addressed by out-of-order generation. Masked Diffusion Models (MDMs) address some of these limitations, but the process of unmasking multiple tokens simultaneously in MDMs can introduce incoherences, and MDMs cannot handle arbitrary infilling constraints when the number of tokens to be filled in is not known in advance. In this work, we introduce Insertion Language Models (ILMs), which learn to insert tokens at arbitrary positions in a sequence; that is, they select jointly both the position and the vocabulary element to be inserted. By inserting tokens one at a time, ILMs can represent strong dependencies between tokens, and their ability to generate sequences in arbitrary order allows them to accurately model sequences where token dependencies do not follow a left-to-right sequential structure. To train ILMs, we propose a tailored network parameterization and use a simple denoising objective. Our empirical evaluation demonstrates that ILMs outperform both ARMs and MDMs on common planning tasks. Furthermore, we show that ILMs outperform MDMs and perform on par with ARMs in an unconditional text generation task while offering greater flexibility than MDMs in arbitrary-length text infilling. The code is available at: https://dhruveshp.com/projects/ilm .
ArXiv.org · 2025-02-15
preprint · Open access · Senior author
Personalized item recommendation typically suffers from data sparsity, which is most often addressed by learning vector representations of users and items via low-rank matrix factorization. While this effectively densifies the matrix by assuming users and movies can be represented by linearly dependent latent features, it does not capture more complicated interactions. For example, vector representations struggle with set-theoretic relationships, such as negation and intersection, e.g. recommending a movie that is "comedy and action, but not romance". In this work, we formulate the problem of personalized item recommendation as matrix completion where rows are set-theoretically dependent. To capture this set-theoretic dependence we represent each user and attribute by a hyper-rectangle or box (i.e. a Cartesian product of intervals). Box embeddings can intuitively be understood as trainable Venn diagrams, and thus not only inherently represent similarity (via the Jaccard index), but also naturally and faithfully support arbitrary set-theoretic relationships. Queries involving set-theoretic constraints can be efficiently computed directly on the embedding space by performing geometric operations on the representations. We empirically demonstrate the superiority of box embeddings over vector-based neural methods on both simple and complex item recommendation queries by up to 30% overall.
P033 Optimising management of latent tuberculosis prior to biologics (OPTITAB)
Lara D. Veeken · 2025-04-01
article · Open access
Background/Aims: The British Society for Rheumatology (BSR) recommend that patients with inflammatory arthritis should be tested for tuberculosis (TB) (tuberculin skin test or interferon gamma release assay) as part of biologic pre-treatment screening. Local practice was inconsistent, so we used a multi-pronged approach to understand the issues and auditing practice. We intended to improve the management of patients with latent TB and therefore initiated OPTITAB - Optimisation of latent TB management in inflammatory arthritis starting biologics.
Methods: A quality improvement program was formulated and approved locally, including baseline audit. We collected retrospective data for control patients and patients who had positive T spots. The primary aim was to understand the management of these patients between decision to start a biologic to the actual first administration. Anonymised routine data was collected and analysis performed using MS Excel.
Results: We analysed data on 28 control patients and 20 patients with positive T spot.
Control group: Median age was 53.5 years (range 21-80 years), 20 patients were female. 15 patients had rheumatoid arthritis (RA), 6 psoriatic arthritis (PsA), 6 ankylosing spondylitis (AS), 1 enteropathic arthritis. Methotrexate was the DMARD most frequently used. Leflunomide, hydroxychloroquine, sulfasalazine were also used. 4 patients had oral steroids, 7 had IM depomedrone 120 mg injections (total of 10 injections). Only 4 out of 28 patients had full screening as per BSR recommendation, including IGRA testing. 1 patient did not start treatment, 21 patients had documented dates. Median time from clinical decision to start biologic and first administration was 41 days (mean 39). 1 patient had a positive unactioned IGRA, and they subsequently developed TB following biologic initiation. The patient had quadruple TB therapy and recovered fully.
Positive T spot group: Median age was 57.5 years, 11 of 20 patients were female. 12 patients had RA, 4 PsA and 4 AS. There were 28 co-morbidities in 19 patients. MTX and HCQ were the most frequently used DMARDs in combination, MTX monotherapy was used in 12 patients. 3 patients did not start biologics. In 2 patients, the date for start of biologics was not documented. In the remaining 15 patients, median delay starting biologics was 157 days (mean 146 days). 9 patients had IM or oral corticosteroid rescue. 1 patient declined latent TB treatment and biologics. For 1 patient a clinical decision was made not to treat latent TB. 4 patients had INH monotherapy, and 14 a combination of rifampicin and INH. 1 patient died from unrelated infective causes; none developed TB.
Conclusion: Patients with positive T spot have inordinate delays in care and lack a structured approach. Development and implementation of efficient pathways are needed for this cohort of patients.
Disclosure: H. Bunting: None. S. Manderson: None. A. McCallum: None. S. Dubey: None.
MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time
ArXiv.org · 2025-08-12
preprint · Open access · Senior author
Large language models (LLMs) are increasingly being applied to black-box optimization tasks, from program synthesis to molecule design. Prior work typically leverages in-context learning to iteratively guide the model towards better solutions. Such methods, however, often struggle to balance exploration of new solution spaces with exploitation of high-reward ones. Recently, test-time training (TTT) with synthetic data has shown promise in improving solution quality. However, the need for hand-crafted training data tailored to each task limits feasibility and scalability across domains. To address this problem, we introduce MiGrATe, a method for online TTT that uses GRPO as a search algorithm to adapt LLMs at inference without requiring external training data. MiGrATe operates via a mixed-policy group construction procedure that combines on-policy sampling with two off-policy data selection techniques: greedy sampling, which selects top-performing past completions, and neighborhood sampling (NS), which generates completions structurally similar to high-reward ones. Together, these components bias the policy gradient towards exploitation of promising regions in solution space, while preserving exploration through on-policy sampling. We evaluate MiGrATe on three challenging domains, namely word search, molecule optimization, and hypothesis+program induction on the Abstraction and Reasoning Corpus (ARC), and find that it consistently outperforms both inference-only and TTT baselines, demonstrating the potential of online TTT as a solution for complex search tasks without external supervision.
AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise
arXiv (Cornell University) · 2025-06-30
preprint · Open access
The promise of autonomous scientific discovery (ASD) hinges not only on answering questions, but also on knowing which questions to ask. Most recent works in ASD explore the use of large language models (LLMs) in goal-driven settings, relying on human-specified research questions to guide hypothesis generation. However, scientific discovery may be accelerated further by allowing the AI system to drive exploration by its own criteria. The few existing approaches in open-ended ASD select hypotheses based on diversity heuristics or subjective proxies for human interestingness, but the former struggles to meaningfully navigate the typically vast hypothesis space, and the latter suffers from imprecise definitions. This paper presents AutoDiscovery, a method for open-ended ASD that instead drives scientific exploration using Bayesian surprise. Here, we quantify the epistemic shift from the LLM's prior beliefs about a hypothesis to its posterior beliefs after gathering experimental results. To efficiently explore the space of nested hypotheses, our method employs a Monte Carlo tree search (MCTS) strategy with progressive widening using surprisal as the reward function. We evaluate AutoDiscovery in the setting of data-driven discovery across 21 real-world datasets spanning domains such as biology, economics, finance, and behavioral science. Our results demonstrate that under a fixed budget, AutoDiscovery substantially outperforms competitors by producing 5-29% more discoveries deemed surprising by the LLM. Our human evaluation further reveals that two-thirds of discoveries made by our system are surprising to domain experts as well, suggesting this is an important step towards building open-ended ASD systems.
OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics
ArXiv.org · 2025-06-14
preprint · Open access
Robust unlearning is crucial for safely deploying large language models (LLMs) in environments where data privacy, model safety, and regulatory compliance must be ensured. Yet the task is inherently challenging, partly due to difficulties in reliably measuring whether unlearning has truly occurred. Moreover, fragmentation in current methodologies and inconsistent evaluation metrics hinder comparative analysis and reproducibility. To unify and accelerate research efforts, we introduce OpenUnlearning, a standardized and extensible framework designed explicitly for benchmarking both LLM unlearning methods and metrics. OpenUnlearning integrates 13 unlearning algorithms and 16 diverse evaluations across 3 leading benchmarks (TOFU, MUSE, and WMDP) and also enables analyses of forgetting behaviors across 450+ checkpoints we publicly release. Leveraging OpenUnlearning, we propose a novel meta-evaluation benchmark focused specifically on assessing the faithfulness and robustness of evaluation metrics themselves. We also benchmark diverse unlearning methods and provide a comparative analysis against an extensive evaluation suite. Overall, we establish a clear, community-driven pathway toward rigorous development in LLM unlearning research.
2024-01-01 · 6 citations
article · Open access · Senior author
Jiachen Zhao, Wenlong Zhao, Andrew Drozdov, Benjamin Rozonoyer, Md Arafat Sultan, Jay-Yoon Lee, Mohit Iyyer, Andrew McCallum. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.
Recent grants
NSF · $1.0M · 2015–2020
CI-ADDO-EN: Flexible Machine Learning for Natural Language in the MALLET Toolkit
NSF · $650k · 2010–2016
DMREF: Collaborative Research: The Synthesis Genome: Data Mining for Synthesis of New Materials
NSF · $364k · 2015–2019
RI: Medium: Extreme Clustering
NSF · $1.1M · 2018–2023
NSF · $3.9M · 2003–2011
Frequent coauthors
- Nicholas Monath · 79 shared
- Manzil Zaheer · 49 shared
- Emma Strubell · 42 shared
- Rajarshi Das · International Institute of Information Technology Bangalore · 40 shared
- Michael Boratko · 36 shared
- Luke Vilnis · 35 shared
- Patrick Verga · University of Massachusetts Amherst · 34 shared
- Ameya Godbole · 29 shared
Awards & honors
- General Chair of ICML 2012