
David Donoho
Verified · Stanford University · Statistics
Active 1981–2026
About
I have studied the exploitation of sparse signals in signal recovery, including for denoising, superresolution, and solution of underdetermined equations. This research with collaborators showed that ell-1 penalization was an effective and even optimal way to exploit sparsity of the object to be recovered. Compressed sensing has impacted scientific and technical fields, including magnetic resonance imaging in medicine, where it has been implemented in FDA-approved medical imaging protocols already used for millions of patient MRIs. In recent years, my postdocs and students have been studying large-scale covariance matrix estimation, large-scale matrix denoising, detection of rare and weak signals among many pure noise non-signals, compressed sensing and related scientific imaging problems, and most recently, empirical deep learning.
Research topics
- Mathematics
- Computer science
- Algorithm
- Artificial intelligence
- Combinatorics
Selected publications
ArXiv.org · 2026-01-24
article · Open access · 1st author · Corresponding
This article presents the full, original record of the 2024 Joint Statistical Meetings (JSM) town hall, "Statistics in the Age of AI," which convened leading statisticians to discuss how the field is evolving in response to advances in artificial intelligence, foundation models, large-scale empirical modeling, and data-intensive infrastructures. The town hall was structured around open panel discussion and extensive audience Q&A, with the aim of eliciting candid, experience-driven perspectives rather than formal presentations or prepared statements. This document preserves the extended exchanges among panelists and audience members, with minimal editorial intervention, and organizes the conversation around five recurring questions concerning disciplinary culture and practices, data curation and "data work," engagement with modern empirical modeling, training for large-scale AI applications, and partnerships with key AI stakeholders. By providing an archival record of this discussion, the preprint aims to support transparency, community reflection, and ongoing dialogue about the evolving role of statistics in the data- and AI-centric future.
Genome biology · 2025-09-27
article · Open access
We introduce hybrid BAG-seq: a high-throughput, multi-omic method that simultaneously captures DNA and RNA from single nuclei. We apply this protocol to 65,499 single nuclei from samples of five uterine cancer patients and validate the clustering using RNA-only and DNA-only protocols from the same tissues. Multiple tumor genome or expression clusters are often present within a patient, with different tumor clones projecting into distinct or shared expression states, demonstrating nearly all possible genome-transcriptome correlations. We also identify mutant stroma with significant X chromosome loss in various cell types and patient-specific stromal subtypes exhibiting aberrant expression patterns.
Statistics and AI: A Fireside Conversation
UNC Libraries · 2025-05-08
article · Open access
A 3-hour webinar titled “Statistics and AI – A Fireside Conversation” was held on Sunday, March 17, 2024, attracting an online audience of approximately 1,000. The event featured three sessions aimed at engaging the statistical community on key topics in the AI era: addressing statistical challenges and opportunities (Panel I), evolving the publication process (Panel II), and advancing next-generation statistical pipelines and resources (Panel III). Panel I examined issues such as dwindling talent, shifting funding landscapes, and AI's rapid rise, highlighting the need for statistical rigor, interdisciplinary collaboration, and innovative approaches to shape the future of AI. Panel II emphasized the importance of streamlining the publication process, fostering impactful research, and prioritizing workflows and data quality. Panel III focused on modernizing statistical education by integrating AI and deep learning, promoting interdisciplinary collaboration, and maintaining foundational principles such as uncertainty and reproducibility. These discussions collectively outlined a strategic roadmap for ensuring the relevance and advancement of statistics in the age of AI. These discussions were organized by (in alphabetical order) Xihong Lin (Harvard University), Tracy Ke (Harvard University), Tian Zheng (Columbia University), Jing Zhou (University of California at Los Angeles), and Hongtu Zhu (University of North Carolina at Chapel Hill). In the dynamic landscape of statistical science, the fireside chat organized by the Stats Up AI Alliance (https://statsupai.org/) and the International Chinese Statistical Association (ICSA) emerged as a seminal event, bringing together leading experts to explore the evolving role of statistics in the era of artificial intelligence.
Statistics and AI: A Fireside Conversation
Harvard Data Science Review · 2025-04-30 · 2 citations
article · Open access
A 3-hour webinar titled “Statistics and AI – A Fireside Conversation” was held on Sunday, March 17, 2024, attracting an online audience of approximately 1,000. The event featured three sessions aimed at engaging the statistical community on key topics in the AI era: addressing statistical challenges and opportunities (Panel I), evolving the publication process (Panel II), and advancing next-generation statistical pipelines and resources (Panel III). Panel I examined issues such as dwindling talent, shifting funding landscapes, and AI's rapid rise, highlighting the need for statistical rigor, interdisciplinary collaboration, and innovative approaches to shape the future of AI. Panel II emphasized the importance of streamlining the publication process, fostering impactful research, and prioritizing workflows and data quality. Panel III focused on modernizing statistical education by integrating AI and deep learning, promoting interdisciplinary collaboration, and maintaining foundational principles such as uncertainty and reproducibility. These discussions collectively outlined a strategic roadmap for ensuring the relevance and advancement of statistics in the age of AI. These discussions were organized by (in alphabetical order) Xihong Lin (Harvard University), Tracy Ke (Harvard University), Tian Zheng (Columbia University), Jing Zhou (University of California at Los Angeles), and Hongtu Zhu (University of North Carolina at Chapel Hill). In the dynamic landscape of statistical science, the fireside chat organized by the Stats Up AI Alliance (https://statsupai.org/) and the International Chinese Statistical Association (ICSA) emerged as a seminal event, bringing together leading experts to explore the evolving role of statistics in the era of artificial intelligence.
Data Science at the Singularity
Harvard Data Science Review · 2024-01-29 · 42 citations
article · Open access · 1st author · Corresponding
Something fundamental to computation-based research has really changed in the last ten years. In certain fields, progress is simply dramatically more rapid than previously. Researchers in affected fields are living through a period of profound transformation, as the fields undergo a transition to frictionless reproducibility (FR). This transition markedly changes the rate of spread of ideas and practices, affects scientific mindsets and the goals of science, and erases memories of much that came before. The emergence of FR flows from 3 data science principles that matured together after decades of work by many technologists and numerous research communities. The mature principles involve data sharing, code sharing, and competitive challenges, however implemented in the particularly strong form of frictionless open services. Empirical Machine Learning is today's leading adherent field; its hidden superpower is adherence to frictionless reproducibility practices; these practices are responsible for the striking and surprising progress in AI that we see everywhere; they can be learned and adhered to by researchers in whatever research field, automatically increasing the rate of progress in each adherent field.
Optimal Covariance Estimation for Condition Number Loss in the Spiked model
Econometrics and Statistics · 2024-05-01 · 1 citation
article · 1st author · Corresponding
Principled and interpretable alignability testing and integration of single-cell data
Proceedings of the National Academy of Sciences · 2024-02-28 · 22 citations
article · Open access
Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional single-cell datasets are alignable (and therefore should even be aligned). Moreover, popular methods can substantially distort the data during alignment, making the aligned data and downstream analysis difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data with the same type of features. SMAI provides a statistical test to robustly assess the alignability between datasets to avoid misleading inference and is justified by high-dimensional statistical theory. On a diverse range of real and simulated benchmark datasets, it outperforms commonly used alignment methods. Moreover, we show that SMAI improves various downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI's interpretability also enables quantification and a deeper understanding of the sources of technical confounders in single-cell data.
Rejoinder to Discussion of "Data Science at the Singularity"
Harvard Data Science Review · 2024-07-11 · 1 citation
article · Open access · 1st author · Corresponding
I am impressed by the number, diversity, and seriousness of the discussions. I sense general agreement about the data science reality that has been forming over the last decades, some of the larger forces driving it, and the resulting permanent changes to research computing and scientific publishing that will ensue. I also sense concerns and important reservations, maybe not so much about what my article says, but what it does not begin to acknowledge and discuss. Each discussant makes unique and valuable points about issues exposed by these rapid changes across a broad range of fields represented and topics discussed. I can only admire and celebrate these contributions. In this rejoinder I will refer to the original
Universality of the $π^2/6$ Pathway in Avoiding Model Collapse
arXiv (Cornell University) · 2024-10-30
preprint · Open access · Senior author
Researchers in empirical machine learning recently spotlighted their fears of so-called Model Collapse. They imagined a discard workflow, where an initial generative model is trained with real data, after which the real data are discarded, and subsequently, the model generates synthetic data on which a new model is trained. They came to the conclusion that models degenerate as model-fitting generations proceed. However, other researchers considered an augment workflow, where the original real data continue to be used in each generation of training, augmented by synthetic data from models fit in all earlier generations. Empirical results on canonical datasets and learning procedures confirmed the occurrence of model collapse under the discard workflow and avoidance of model collapse under the augment workflow. Under the augment workflow, theoretical evidence also confirmed avoidance in particular instances; specifically, Gerstgrasser et al. (2024) found that for classical Linear Regression, test risk at any later generation is bounded by a moderate multiple, viz. pi-squared-over-6 of the test risk of training with the original real data alone. Some commentators questioned the generality of theoretical conclusions based on the generative model assumed in Gerstgrasser et al. (2024): could similar conclusions be reached for other task/model pairings? In this work, we demonstrate the universality of the pi-squared-over-6 augment risk bound across a large family of canonical statistical models, offering key insights into exactly why collapse happens under the discard workflow and is avoided under the augment workflow. In the process, we provide a framework that is able to accommodate a large variety of workflows (beyond discard and augment), thereby enabling an experimenter to judge the comparative merits of multiple different workflows by simulating a simple Gaussian process.
Recent grants
"Big-Data" Asymptotics: Theory and Large-Scale Experiments
NSF · $701k · 2014–2018
Frequent coauthors
- 62 shared
Jean‐Luc Starck
CEA Cadarache
- 42 shared
Iain M. Johnstone
- 27 shared
Xiaoming Huo
- 23 shared
Andrea Montanari
- 21 shared
Jared Tanner
- 21 shared
Emmanuel J. Candès
- 20 shared
Ery Arias-Castro
University of California, San Diego
- 18 shared
Matan Gavish
Hebrew University of Jerusalem
Education
- 1984
Ph.D., Statistics
Harvard University
- 1978
AB, Statistics
Princeton University
Awards & honors
- 2022 IEEE Jack S. Kilby Signal Processing Medal