
About
My group’s research lies at the intersection of statistics and machine learning, with a focus on developing methods that are mathematically principled, scalable, and useful in practice. A central theme is that uncertainty quantification should remain trustworthy even when models are imperfect and inference is approximate — challenges that are especially acute in scientific applications involving heterogeneous data, latent structure, and substantial computational constraints. Our work spans four interconnected areas: (1) scalable generalized Bayesian learning, including theory and methods for robust, reproducible inference under model misspecification; (2) automation and validation of posterior approximation algorithms; (3) discovery of interpretable latent structure in complex scientific data; and (4) large-scale data assimilation and forecasting. Current applied work is focused on developing computational methods and software tools for large-scale ecological and Earth science forecast
Research topics
- Computer Science
- Biology
- Artificial Intelligence
- Virology
- Medicine
- Computational biology
- Data Mining
- Database
- Mathematics
- Real-time computing
- Pathology
Selected publications
bioRxiv (Cold Spring Harbor Laboratory) · 2026-02-26 · 1 citation
article · Open access
The ability to accurately assess ecosystem C budgets across scales from individual sites to continents is essential for C accounting, management, and ultimately mitigating climate change. State data assimilation (SDA) provides a framework for harmonizing observations with models, while robustly accounting for and reducing multiple sources of uncertainty. In this study, we employed a hybrid SDA framework that combines process-based terrestrial biosphere modeling, hierarchical Bayesian inference, and machine learning to harmonize bottom-up and remotely-sensed data streams for 8,000 pre-selected 1 km² locations across North America within a hybrid structure. Combining bottom-up soils data (SoilGrids) with spectral (MODIS and Landsat) and microwave (SMAP) remote sensing helps constrain the major C and water stocks through space and time. Machine learning is used both to identify and correct systematic errors in the process model (SIPNET) and to interpolate the pre-selected locations onto a 1 km grid, making it computationally feasible to generate annual ensemble maps of the North American carbon budget. Furthermore, the uncertainties for each variable were reduced compared to those from observations or models alone. Spatiotemporal analysis showed a slight decrease in aboveground biomass (AGB) across the western US, a loss of leaf area across the boreal, and a slight greening of the Alaskan tundra. The uncertainty trends suggest a significant reduction in the uncertainty about soil organic carbon (SOC), the largest C reservoir. Validation results show that we accurately estimate C pools, compared to the assimilated data streams and held-out observations of AGB from GEDI, ICESat-2, and the US FIA, and SOC from the ISCN network. Our ML-debiasing algorithm further improved the accuracy of major C pools (AGB, SOC).
In general, our continental SDA framework will facilitate global C MRV (monitoring, reporting, and verification) by providing accurate and precise C-cycle estimates, along with their corresponding spatiotemporal uncertainties.
SSRN Electronic Journal · 2026-01-01
preprint · Open access
Calibrated Model Criticism Using Split Predictive Checks
Journal of the American Statistical Association · 2026-04-15 · 2 citations
preprint · Open access · Senior author
Assessing how well a Bayesian model generalizes to unobserved data is essential, yet existing general-purpose model checks are either not properly calibrated (as in posterior predictive checks) or fail to be sufficiently general for practical use (e.g., due to requiring model-specific derivations). We propose split predictive checks (SPCs) as a simple, general-purpose class of predictive checks that maintain the usability of posterior predictive checks while directly targeting predictive generalization. SPCs work by splitting the data into training and test subsets, then fitting the model to the former and evaluating predictive discrepancies on the latter. We develop an asymptotic theory for two variants (single SPCs and divided SPCs) and show that, unlike posterior predictive checks, both yield asymptotically calibrated (hence interpretable) p-values. Our results show that single SPCs work well at identifying substantial misspecification, while divided SPCs are sensitive even to subtle departures from modeling assumptions. Through simulation studies and real-data applications, we show that SPCs provide reliable, flexible, and computationally efficient assessments of Bayesian model fit, often revealing issues with predictive generalization missed by other predictive checks.
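The split-fit-evaluate recipe described in the abstract above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy conjugate model (y ~ N(mu, 1) with prior mu ~ N(0, 100)), the 50/50 split, the sample variance as the discrepancy statistic, the Monte Carlo p-value, and the hypothetical function name `single_spc_pvalue`. It does not reproduce the paper's calibration theory or the divided-SPC variant.

```python
import random
import statistics


def single_spc_pvalue(data, stat, train_frac=0.5, prior_var=100.0,
                      n_rep=2000, seed=0):
    """Toy single split predictive check for y ~ N(mu, 1), mu ~ N(0, prior_var).

    Hypothetical helper: fits the model on a training split, then compares a
    statistic of the held-out split with its posterior predictive distribution.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    train, test = shuffled[:n_train], shuffled[n_train:]

    # Conjugate posterior for mu given the training split only.
    post_var = 1.0 / (1.0 / prior_var + len(train))
    post_mean = post_var * sum(train)

    observed = stat(test)
    exceed = 0
    for _ in range(n_rep):
        # Draw mu from the posterior, then simulate a replicated test set.
        mu = rng.gauss(post_mean, post_var ** 0.5)
        replicated = [rng.gauss(mu, 1.0) for _ in test]
        if stat(replicated) >= observed:
            exceed += 1
    # One-sided Monte Carlo p-value: small values flag a too-large statistic.
    return exceed / n_rep


gen = random.Random(1)
well_specified = [gen.gauss(2.0, 1.0) for _ in range(200)]  # matches the model
overdispersed = [gen.gauss(2.0, 3.0) for _ in range(200)]   # variance is 9, not 1

p_good = single_spc_pvalue(well_specified, statistics.variance)
p_bad = single_spc_pvalue(overdispersed, statistics.variance)
print(p_good, p_bad)
```

In this toy setting the overdispersed data should produce a p-value near zero, because the held-out variance is far larger than anything the unit-variance model predicts.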
Tuning-Free Coreset Markov Chain Monte Carlo via Hot DoG.
PubMed · 2025-07-01
article · Open access
(Hot DoG), for training coreset weights in Coreset MCMC without user tuning effort. We provide a theoretical analysis of the convergence of the coreset weights produced by Hot DoG. We also provide empirical results demonstrating that Hot DoG provides higher quality posterior approximations than other learning-rate-free stochastic gradient methods, and performs competitively with optimally-tuned ADAM.
Quantitative Error Bounds for Scaling Limits of Stochastic Iterative Algorithms
ArXiv.org · 2025-01-21
preprint · Open access · Senior author
Stochastic iterative algorithms, including stochastic gradient descent (SGD) and stochastic gradient Langevin dynamics (SGLD), are widely utilized for optimization and sampling in large-scale and high-dimensional problems in machine learning, statistics, and engineering. Numerous works have bounded the parameter error in, and characterized the uncertainty of, these approximations. One common approach has been to use scaling limit analyses to relate the distribution of algorithm sample paths to a continuous-time stochastic process approximation, particularly in asymptotic setups. Focusing on the univariate setting, in this paper, we build on previous work to derive non-asymptotic functional approximation error bounds between the algorithm sample paths and the Ornstein-Uhlenbeck approximation using an infinite-dimensional version of Stein's method of exchangeable pairs. We show that this bound implies weak convergence under modest additional assumptions and leads to a bound on the error of the variance of the iterate averages of the algorithm. Furthermore, we use our main result to construct error bounds in terms of two common metrics: the Lévy-Prokhorov and bounded Wasserstein distances. Our results provide a foundation for developing similar error bounds for the multivariate setting and for more sophisticated stochastic approximation algorithms.
The PEcAn+SIPNET Terrestrial Carbon Cycle Reanalysis: Development and Validation
ARPHA Conference Abstracts · 2025-05-28
article · Open access
Improving our ability to understand and predict the dynamics of the terrestrial carbon cycle remains a pressing challenge despite a rapidly growing volume and diversity of Earth Observation data. State data assimilation represents a path forward via an iterative cycle of making process-based forecasts and then statistically reconciling these forecasts against numerous ground-based and remotely-sensed data constraints into a “reanalysis” data product that provides full spatiotemporal carbon budgets with robust uncertainty accounting. Here we report on a >100x expansion of the PEcAn+SIPNET reanalysis from 500 CONUS sites, 25 ensemble members, and 2 data constraints to 6400 sites across North America, 100 ensemble members, and 5 data constraints: GEDI and Landtrendr AGB, MODIS LAI, SoilGrids Soil C, and SMAP soil moisture. We also report on an ensemble-based machine learning (ML) downscaling to a 1 km product that preserves spatial, temporal, and across-variable covariances and demonstrate the impacts of these covariances on uncertainty accounting (Fig. 1). Synergistically, we use the same ML models to assess what climate, vegetation, and soil variables explain the spatiotemporal variability in different C pools and fluxes. In addition, we review a wide range of ongoing validation activities, comparing the outputs of the reanalysis against withheld data from: Ameriflux and NEON NEE and LE; USFS Forest Inventory biomass, biomass increment, tree rings, soil C, and litter; and NEON soil C and soil respiration. Finally, we touch on ML analyses to diagnose and correct systematic biases and emulator-based recalibration efforts.
Industrial & Engineering Chemistry Research · 2024-06-11 · 8 citations
article · Open access
Fouling in heat exchangers leads to increased pressure drop, associated with higher energy consumption, utility costs, and CO₂ emissions. However, other effects can also take place, threatening process operations and safety. This is the case of ethylene oxide operations, where unplanned outages and decomposition events pose significant safety risks. Therefore, the development of a framework for advanced monitoring and forecasting of heat exchanger fouling is both important and opportune to improve the reliability and safety of the operation. We propose a hybrid approach, where knowledge-based feature generation is integrated with data-driven methods, to forecast a key performance indicator that acts as a fouling surrogate. The forecasting model can predict one month ahead with an accuracy of R² = 0.7. We also show that long-term forecasting is possible with this model, which can be applied to optimize maintenance scheduling. The solution can be extended to other situations where fouling takes place.
Structurally Aware Robust Model Selection for Mixtures
arXiv (Cornell University) · 2024-03-01
preprint · Open access · Senior author
Mixture models are often used to identify meaningful subpopulations (i.e., clusters) in observed data such that the subpopulations have a real-world interpretation (e.g., as cell types). However, when used for subpopulation discovery, mixture model inference is usually ill-defined a priori because the assumed observation model is only an approximation to the true data-generating process. Thus, as the number of observations increases, rather than obtaining better inferences, the opposite occurs: the data is explained by adding spurious subpopulations that compensate for the shortcomings of the observation model. However, there are two important sources of prior knowledge that we can exploit to obtain well-defined results no matter the dataset size: known causal structure (e.g., knowing that the latent subpopulations cause the observed signal but not vice-versa) and a rough sense of how wrong the observation model is (e.g., based on small amounts of expert-labeled data or some understanding of the data-generating process). We propose a new model selection criterion that, while model-based, uses this available knowledge to obtain mixture model inferences that are robust to misspecification of the observation model. We provide theoretical support for our approach by proving a first-of-its-kind consistency result under intuitive assumptions. Simulation studies and an application to flow cytometry data demonstrate that our model selection criterion consistently finds the correct number of subpopulations.
Robust discovery of mutational signatures using power posteriors
bioRxiv (Cold Spring Harbor Laboratory) · 2024-10-28
preprint · Open access · Senior author
Mutational processes, such as the molecular effects of carcinogenic agents or defective DNA repair mechanisms, are known to produce different mutation types with characteristic frequency profiles, referred to as mutational signatures. Non-negative matrix factorization (NMF) has successfully been used to discover many mutational signatures, yielding novel insights into cancer etiology and targeted therapies. However, the NMF model is only a rough approximation to reality, and even small departures from this assumed model can have large negative effects on the accuracy and reliability of the results. We propose a new approach to mutational signatures analysis that improves robustness to misspecification by using a power posterior for a fully Bayesian NMF model, while employing a sparsity-inducing prior to automatically infer the number of active signatures. In extensive simulation studies, we find that our proposed approach recovers more true signatures with greater accuracy than current leading methods. On whole-genome sequencing data for six cancer types from the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, we find that our method is able to accurately recover more signatures than the current state-of-the-art.
Reproducible parameter inference using bagged posteriors
Electronic Journal of Statistics · 2024-01-01 · 1 citation
article · Open access · 1st author · Corresponding
Under model misspecification, it is known that Bayesian posteriors often do not properly quantify uncertainty about true or pseudo-true parameters. Even more fundamentally, misspecification leads to a lack of reproducibility in the sense that the same model will yield contradictory posteriors on independent data sets from the true distribution. To define a criterion for reproducible uncertainty quantification under misspecification, we consider the probability that two credible sets constructed from independent data sets have nonempty overlap, and we establish a lower bound on this overlap probability that holds whenever the credible sets are valid confidence sets. We prove that credible sets from the standard posterior can strongly violate this bound, indicating that it is not internally coherent under misspecification. To improve reproducibility in an easy-to-use and widely applicable way, we propose to apply bagging to the Bayesian posterior ("BayesBag"); that is, to use the average of posterior distributions conditioned on bootstrapped datasets. We motivate BayesBag from first principles based on Jeffrey conditionalization and show that the bagged posterior typically satisfies the overlap lower bound. Further, we prove a Bernstein-von Mises theorem for the bagged posterior, establishing its asymptotic normal distribution. We demonstrate the benefits of BayesBag via simulation experiments and an application to crime rate prediction.
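The idea of averaging posteriors over bootstrapped datasets, as stated in the abstract above, can be sketched on a toy problem. The following is an illustration under stated assumptions, not the paper's implementation: the conjugate model (y ~ N(mu, 1) with prior mu ~ N(0, 100)), the bootstrap size equal to the dataset size, the number of bootstrap replicates, and the hypothetical helper names `normal_posterior` and `bagged_posterior_moments` are all chosen here for convenience. Because each bootstrap posterior is Gaussian, the bagged posterior is an equally weighted mixture of Gaussians, and its mean and variance follow from the law of total variance.

```python
import random
import statistics


def normal_posterior(data, prior_var=100.0):
    """Posterior mean and variance of mu for y ~ N(mu, 1), mu ~ N(0, prior_var)."""
    post_var = 1.0 / (1.0 / prior_var + len(data))
    return post_var * sum(data), post_var


def bagged_posterior_moments(data, n_boot=500, seed=0):
    """Mean and variance of the average of posteriors over bootstrap resamples.

    Hypothetical helper: resamples the data with replacement, computes the
    conjugate posterior for each resample, and summarizes the resulting mixture.
    """
    rng = random.Random(seed)
    means = []
    post_var = 0.0
    for _ in range(n_boot):
        resample = rng.choices(data, k=len(data))  # bootstrap of size n (assumed)
        mean_b, post_var = normal_posterior(resample)
        means.append(mean_b)
    # Mixture variance = within-component variance + variance of component means.
    bag_mean = statistics.mean(means)
    bag_var = post_var + statistics.pvariance(means)
    return bag_mean, bag_var


gen = random.Random(1)
data = [gen.gauss(2.0, 3.0) for _ in range(200)]  # true sd is 3, model assumes 1

std_mean, std_var = normal_posterior(data)
bag_mean, bag_var = bagged_posterior_moments(data)
print(std_var, bag_var)
```

In this misspecified toy example the standard posterior variance depends only on the sample size, so it stays near 1/n regardless of the data's true spread, while the bagged variance also absorbs the sampling variability of the posterior mean and is therefore wider.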
Frequent coauthors
- Tamara Broderick, Western Caspian University · 30 shared
- Trevor Campbell · 19 shared
- Jeffrey W. Miller · 9 shared
- Aaron Chevalier · 8 shared
- Joshua D. Campbell, Boston Medical Center · 8 shared
- Raj Agrawal · 8 shared
- Masanao Yajima · 8 shared
- Zainab Khurshid, Boston University · 8 shared
Education
- B.A., Columbia University
- Ph.D., Massachusetts Institute of Technology, 2018
Awards & honors
- Data Science Faculty Fellow
See your match with Jonathan Huggins