Sandy Pentland

Toshiba Professor of Media Arts & Science

Massachusetts Institute of Technology · Information Technology

Active 1994–2025

h-index: 5

Citations: 265

Papers: 207 (last 5y)

Funding—



About

Sandy Pentland is the Toshiba Professor of Media Arts & Science and a Professor of Information Technology at MIT Sloan. He directs MIT’s Human Dynamics Laboratory and the MIT Media Lab Entrepreneurship Program. He co-leads the World Economic Forum Big Data and Personal Data initiatives and is a founding member of the Advisory Boards for Nissan, Motorola Mobility, Telefonica, and various start-up firms. Pentland has helped create and direct MIT’s Media Laboratory, the Media Lab Asia laboratories at the Indian Institutes of Technology, and Strong Hospital’s Center for Future Health. Recognized as one of the 'seven most powerful data scientists in the world' by Forbes in 2012, he is a pioneer in computational social science, organizational engineering, wearable computing, image understanding, and biometrics. His research has been featured in prominent scientific journals and media outlets, and his most recent book is 'Honest Signals.' He has advised over 50 PhD students, many of whom hold faculty positions, lead industry research groups, or have founded their own companies. His research group and entrepreneurship efforts have spun off more than 30 companies, including publicly listed firms and organizations serving millions in Africa and South Asia.

Research topics

Computer Science
Computer Security
Data science
Sociology
World Wide Web
Artificial Intelligence
Business
Engineering
Biology
Geography
Accounting
Philosophy
Ecology
Demography
Engineering ethics
Internet privacy
Psychology
Geology
Epistemology

Selected publications

Analyzing sequential activity and travel decisions with interpretable deep inverse reinforcement learning
Travel Behaviour and Society · 2025-11-20 · 1 citation
article · Senior author
Measuring risks inherent to our digital economies using Amazon purchase histories from US consumers
ArXiv.org · 2025-02-26
preprint · Open access
What do pickles and trampolines have in common? In this paper we show that while purchases for these products may seem innocuous, they risk revealing clues about customers' personal attributes - in this case, their race. As online retail and digital purchases become increasingly common, consumer data has become increasingly valuable, raising the risks of privacy violations and online discrimination. This work provides the first open analysis measuring these risks, using purchase histories crowdsourced from (N=4248) US Amazon.com customers and survey data on their personal attributes. With this limited sample and simple models, we demonstrate how easily consumers' personal attributes, such as health and lifestyle information, gender, age, and race, can be inferred from purchases. For example, our models achieve AUC values over 0.9 for predicting gender and over 0.8 for predicting diabetes status. To better understand the risks that highly resourced firms like Amazon, data brokers, and advertisers present to consumers, we measure how our models' predictive power scales with more data. Finally, we measure and highlight how different product categories contribute to inference risk in order to make our findings more interpretable and actionable for future researchers and privacy advocates.
Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?
arXiv (Cornell University) · 2024 · 3 citations
- Computer Science
- Internet privacy
- Psychology
New capabilities in foundation models are owed in large part to massive, widely-sourced, and under-documented training data collections. Existing practices in data collection have led to challenges in tracing authenticity, verifying consent, preserving privacy, addressing representation and bias, respecting copyright, and overall developing ethical and trustworthy foundation models. In response, regulation is emphasizing the need for training data transparency to understand foundation models' limitations. Based on a large-scale analysis of the foundation model training data landscape and existing solutions, we identify the missing infrastructure to facilitate responsible foundation model development practices. We examine the current shortcomings of common tools for tracing data authenticity, consent, and documentation, and outline how policymakers, developers, and data creators can facilitate responsible foundation model development by adopting universal data provenance standards.
City mobility patterns during the COVID-19 pandemic: analysis of a global natural experiment
The Lancet Public Health · 2024-10-31 · 9 citations
article · Open access
BACKGROUND: During the COVID-19 pandemic, changes were seen in city mobility patterns around the world, including in active transportation (walking, cycling, micromobility, and public transit use), creating a unique opportunity for global public health lessons and action. We aimed to analyse a global natural experiment exploring city mobility patterns during the pandemic and how they related to the implementation of COVID-19-related policies. METHODS: We obtained data from Apple's Mobility Trends Reports on city mobility indexes for 296 cities from Jan 13, 2020 to Feb 4, 2022. Mobility indexes represented the frequency of Apple Maps queries for driving, walking, and public transit journeys relative to a baseline value of 100 for the pre-pandemic period (defined as Jan 13, 2020). City mobility index trajectories were plotted with stratification by country income level, transportation-related city type, population density, and COVID-19 pandemic severity (SARS-CoV-2 infection rate). We also synthesised global pandemic policies and recovery actions that promoted or restricted city mobility and active transportation (walking, cycling and micromobility, and public transit) using the Shifting Streets dataset. Additionally, a natural experiment on a global scale evaluated the effects of new active transportation policies on walking and public transit use in cities around the world. We used multivariable regression with a difference-in-difference (DID) analysis to explore whether the implementation of walking or public transit promotion policies affected mobility indexes, comparing cities with and without implementation of these policies in the pre-intervention period (Jan 27 to April 12, 2020) and post-intervention period (April 13 to June 28, 2020). 
FINDINGS: Based on city mobility index trajectories, we observed an overall decline in mobility indexes for walking, driving, and public transit at the beginning of the pandemic, but these values began to increase in April, 2020. Cities with lower population densities generally had higher driving and walking indexes than cities with higher population density, while cities with higher population densities had higher public transit indexes. Cities with higher pandemic severity generally had higher driving and walking indexes than cities with lower pandemic severity, while cities with lower pandemic severity had higher public transit indexes than other cities. We identified 587 policies in the dataset that had known implementation dates and were relevant to active transportation, which included 305 policies on walking, 321 on cycling and micromobility, and 143 on public transit, across 230 cities within 33 countries (19 high-income, 11 middle-income, and three low-income countries). In the global natural experiment (including 39 cities), implementation of policy interventions promoting walking was significantly associated with a higher absolute value of the walking index (DID coefficient 20·675 [95% CI 8·778-32·572]), whereas no such effect was seen for public transit-promoting policies (0·600 [-13·293 to 14·494]). INTERPRETATION: Our results suggest that the policies implemented to mitigate the COVID-19 pandemic were effective in changing city mobility patterns, especially increasing active transportation. Given the known benefits of active transportation, such policies could be maintained, expanded, and evaluated post pandemic. The discrepancy in the interventions between countries of different incomes highlights that changes to the infrastructure to prioritise safe walking, cycling, and easy access to public transit use could help with the future-proofing of cities in low-income and middle-income countries. FUNDING: None.
A Safe Harbor for AI Evaluation and Red Teaming
arXiv (Cornell University) · 2024-03-07 · 9 citations
preprint · Open access
Independent evaluation and red teaming are critical for identifying the risks posed by generative AI systems. However, the terms of service and enforcement strategies used by prominent AI companies to deter model misuse have disincentives on good faith safety evaluations. This causes some researchers to fear that conducting such research or releasing their findings will result in account suspensions or legal reprisal. Although some companies offer researcher access programs, they are an inadequate substitute for independent research access, as they have limited community representation, receive inadequate funding, and lack independence from corporate incentives. We propose that major AI developers commit to providing a legal and technical safe harbor, indemnifying public interest safety research and protecting it from the threat of account suspensions or legal reprisal. These proposals emerged from our collective experience conducting safety, privacy, and trustworthiness research on generative AI systems, where norms and incentives could be better aligned with public interests, without exacerbating model misuse. We believe these commitments are a necessary step towards more inclusive and unimpeded community efforts to tackle the risks of generative AI.
A large-scale audit of dataset licensing and attribution in AI
Nature Machine Intelligence · 2024-08-30 · 57 citations
article · Open access
The race to train language models on vast, diverse and inconsistently documented datasets raises pressing legal and ethical concerns. To improve data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace more than 1,800 text datasets. We develop tools and standards to trace the lineage of these datasets, including their source, creators, licences and subsequent use. Our landscape analysis highlights sharp divides in the composition and focus of data licenced for commercial use. Important categories including low-resource languages, creative tasks and new synthetic data all tend to be restrictively licenced. We observe frequent miscategorization of licences on popular dataset hosting sites, with licence omission rates of more than 70% and error rates of more than 50%. This highlights a crisis in misattribution and informed use of popular datasets driving many recent breakthroughs. Our analysis of data sources also explains the application of copyright law and fair use to finetuning data. As a contribution to continuing improvements in dataset transparency and responsible use, we release our audit, with an interactive user interface, the Data Provenance Explorer, to enable practitioners to trace and filter on data provenance for the most popular finetuning data collections: www.dataprovenance.org.
Data Authenticity, Consent, and Provenance for AI Are All Broken: What Will It Take to Fix Them?
2024-03-27 · 15 citations
article · Open access · Senior author
New AI capabilities are owed in large part to massive, widely sourced, and underdocumented training data collections. Dubious collection practices have spurred crises in data transparency, authenticity, consent, privacy, representation, bias, copyright infringement, and the overall development of ethical and trustworthy AI systems. In response, AI regulation is emphasizing the need for training data transparency to understand AI model limitations. Based on a large-scale analysis of the AI training data landscape and existing solutions, we identify the missing infrastructure to facilitate responsible AI development practices. We explain why existing tools for data authenticity, consent, and documentation alone are unable to solve the core problems facing the AI community, and outline how policymakers, developers, and data creators can facilitate responsible AI development, through universal data provenance standards.
Consent in Crisis: The Rapid Decline of the AI Data Commons
2024-01-01 · 1 citation
article
Open e-commerce 1.0, five years of crowdsourced U.S. Amazon purchase histories with user demographics
Scientific Data · 2024 · 8 citations
Senior author · Corresponding
- Computer Science
- Sociology
- World Wide Web
This is a first-of-its-kind dataset containing detailed purchase histories from 5027 U.S. Amazon.com consumers, spanning 2018 through 2022, with more than 1.8 million purchases. Consumer spending data are customarily collected through government surveys to produce public datasets and statistics, which serve public agencies and researchers. Companies now collect similar data through consumers' use of digital platforms at rates superseding data collection by public agencies. We published this dataset in an effort towards democratizing access to rich data sources routinely used by companies. The data were crowdsourced through an online survey and shared with participants' informed consent. Data columns include order date, product code, title, price, quantity, and shipping address state. Each purchase history is linked to survey data with information about participants' demographics, lifestyle, and health. We validate the dataset by showing expenditure correlates with public Amazon sales data (Pearson r = 0.978, p < 0.001) and conduct analyses of specific product categories, demonstrating expected seasonal trends and strong relationships to other public datasets.
Consent in Crisis: The Rapid Decline of the AI Data Commons
arXiv (Cornell University) · 2024-07-20 · 11 citations
preprint · Open access · Senior author
General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research.

Frequent coauthors

Leandro Martin Totaro Garcia
4 shared
Shayne Longpre
4 shared
Robert Mahari
4 shared
Ruoyu Wang
Sun Yat-sen University
4 shared
Naana Obeng-Marnu
Massachusetts Institute of Technology
3 shared
Selin Akaraci
Queen's University Belfast
3 shared
Jad Kabbara
3 shared
Robert Mahari
3 shared

Labs

MIT Media Lab · PI

Awards & honors

Forbes named Sandy one of the 'seven most powerful data scie…
McKinsey Award from Harvard Business Review (2013)
