
Abdullah Almaatouq
Douglas Drane Career Development Associate Professor in Information Technology and Management · Massachusetts Institute of Technology · Information Technology
Active 2013–2026
About
Abdullah Almaatouq is the Douglas Drane Career Development Associate Professor in Information Technology and Management at the MIT Sloan School of Management. He is a computational social scientist whose research focuses on improving cooperation, coordination, and collective intelligence in decision-making systems such as teams, committees, crowds, markets, and elections. Abdullah develops new research designs and theory-building strategies to advance social and behavioral research methodology, with the goal of understanding how collective decision systems work and how to design them effectively across contexts. He is affiliated with the MIT Center for Computational Engineering, the MIT Center for Collective Intelligence, and the MIT Connection Science Research Initiative. Abdullah holds a PhD in computational science and engineering and two master's degrees from MIT, one in media arts and sciences (MIT Media Lab) and one in computational science and engineering. Before his graduate studies at MIT, he earned his undergraduate degree at the University of Southampton in the United Kingdom.
Research topics
- Computer Science
- Artificial Intelligence
- Machine Learning
- Psychology
- Social psychology
- Statistics
- Mathematics
- Engineering
- Medicine
- Theoretical computer science
- Cognitive psychology
- Communication
- Algorithm
- Geography
Selected publications
Integrative experiments identify how punishment affects welfare in public goods games
Science · 2026-04-09 · 1 citation
Article · Senior author · Corresponding author
Despite decades of research, the conditions under which punishment promotes cooperation remain unclear. Through an integrative experiment varying 14 design parameters of public goods games across 360 experimental conditions (147,618 decisions from 7100 participants), we reveal substantial heterogeneity in punishment effectiveness: Its impact on welfare ranges from 43% improvement to 44% reduction depending on the game parameters. To characterize these patterns, we developed models that outperformed human forecasters in predicting punishment effectiveness in new experiments. Communication emerges as the most important factor, followed by contribution framing (opt out versus opt in), contribution type (variable versus all-or-nothing), game length, and outcome visibility, though these factors often interact. The results reframe the debate from whether punishment works to when it does, demonstrating how integrative experiments enable discovery of generalizable patterns in social phenomena.
Post-training makes large language models less human-like
arXiv (Cornell University) · 2026-05-08
Preprint · Open access
Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction -- a popular technique for eliciting human-like behavior by conditioning models on participant-specific information -- does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.
Evaluating Human-AI Safety: A Framework for Measuring Harmful Capability Uplift
arXiv (Cornell University) · 2026-03-06
Preprint · Open access
Current frontier AI safety evaluations emphasize static benchmarks, third-party annotations, and red-teaming. In this position paper, we argue that AI safety research should focus on human-centered evaluations that measure harmful capability uplift: the marginal increase in a user's ability to cause harm with a frontier model beyond what conventional tools already enable. We frame harmful capability uplift as a core AI safety metric, ground it in prior social science research, and provide concrete methodological guidance for systematic measurement. We conclude with actionable steps for developers, researchers, funders, and regulators to make harmful capability uplift evaluation a standard practice.
The Integration of Explanation and Prediction in Behavioral Science
Current Directions in Psychological Science · 2026-04-10
Article · 1st author · Corresponding author
Behavioral scientists aim to explain and predict behavior. In principle, these goals align; in practice, common approaches to pursuing them have become distinct traditions in tension with one another. The explanatory tradition often examines causal factors in isolation, establishing that they have some effect but not how much or how they combine. The predictive tradition learns how factors combine, but these patterns may not reflect a causal structure or hold when conditions change. Answering how much each factor matters, and how they combine across settings, requires both predictive accuracy and causal interpretation. This article examines three developments toward this integration: evaluation frameworks that emphasize generalization, systematic experimentation and flexible models, and interpretation tools. We present recent empirical examples that demonstrate how this integration enables the discovery of generalizable patterns and provides a path toward cumulative behavioral science.
The Task Space: An Integrative Framework for Team Research
2025-12-01
Preprint · Open access · Senior author
Research on teams spans many contexts, but integrating knowledge from heterogeneous sources is challenging because studies typically examine different tasks that cannot be directly compared. Most investigations involve teams working on just one or a handful of tasks, and researchers lack principled ways to quantify how similar or different these tasks are from one another. We address this challenge by introducing the “Task Space,” a multidimensional space in which tasks, and the distances between them, can be represented formally, and use it to create a “Task Map” of 102 crowd-annotated tasks from the published experimental literature. We then demonstrate the Task Space’s utility by performing an integrative experiment that addresses a fundamental question in team research: when do interacting groups outperform individuals? Our experiment samples 20 diverse tasks from the Task Map at three complexity levels and recruits 1,231 participants to work either individually or in groups of three or six (180 experimental conditions). We find striking heterogeneity in group advantage, with groups performing anywhere from three times worse to 60% better than the best individual working alone, depending on the task context. Critically, the Task Space makes this heterogeneity predictable: it significantly outperforms traditional typologies in predicting group advantage on unseen tasks. Our models also reveal theoretically meaningful interactions between task features; for example, group advantage on creative tasks depends on whether the answers are objectively verifiable. We conclude by arguing that the Task Space enables researchers to integrate findings across different experiments, thereby building cumulative knowledge about team performance.
Integrative Experiments Identify How Punishment Affects Welfare in Public Goods Games
Open MIND · 2025-01-01
Other · Senior author
Reproducibility package for "Integrative Experiments Identify How Punishment Affects Welfare in Public Goods Games" (https://www.science.org/doi/10.1126/science.aeb5280)
The Task Space: An Integrative Framework for Team Research
PsyArXiv (OSF Preprints) · 2025-10-14
Other · Open access
Studying collective intelligence in the lab
Edward Elgar Publishing Limited eBooks · 2025-12-11
Book chapter · 1st author · Corresponding author
Frequent coauthors
- P. M. Krafft · University of the Arts London · 43 shared publications
- Mehdi Moussaïd · Max Planck Institute for Human Development · 31 shared publications
- Alex Pentland · Massachusetts Institute of Technology · 30 shared publications
- Abdulrahman Alotaibi · Moscow Institute of Thermal Technology · 30 shared publications
- Alejandro Noriega-Campero · Moscow Institute of Thermal Technology · 30 shared publications
- Alex Pentland · Human Media · 27 shared publications
- Duncan J. Watts · University of Pennsylvania · 15 shared publications
- David G. Rand · Massachusetts Institute of Technology · 14 shared publications
Education
- Master of Science, Media Lab · Massachusetts Institute of Technology
- Master of Science, Center for Computational Engineering · Massachusetts Institute of Technology
- Computational Science & Engineering, Computational Engineering · Massachusetts Institute of Technology · 2019
- Bachelor of Science, Electronics and Computer Science · University of Southampton · 2012