David Ruppert

Verified

Cornell University · Operations Research and Information Engineering

Active 1971–2025

h-index: 73

Citations: 28.7k

Papers: 435 (35 in the last 5 years)

Funding: $266k



About

David Ruppert is the Andrew Schultz Jr. Professor of Engineering at the School of Operations Research and Information Engineering and also a Professor of Statistical Science at Cornell University. He holds a BA in Mathematics from Cornell University, an MA in Mathematics from the University of Vermont, and a PhD in Statistics and Probability from Michigan State University. His academic career includes positions as Assistant and Associate Professor of Statistics at the University of North Carolina, Chapel Hill, from 1977 to 1987. Professor Ruppert is a Fellow of the American Statistical Association (ASA) and the Institute of Mathematical Statistics (IMS), and he received the Wilcoxon Prize in 1986. Recognized as a highly cited researcher, he has been ranked 21st in mathematics by journal citations and has mentored 29 PhD students, many of whom are now leading researchers.

Research topics

Mathematics
Statistics
Computer science
Econometrics
Applied mathematics

Selected publications

Bayesian Functional Data Analysis in Astronomy
2025-11-04
article · Open access · Senior author
Cosmic demographics—the statistical study of populations of astrophysical objects—has long relied on tools from multivariate statistics for analyzing data comprising fixed-length vectors of properties of objects, as might be compiled in a tabular astronomical catalog (say, with sky coordinates, and brightness measurements in a fixed number of spectral passbands). But beginning with the emergence of automated digital sky surveys, ca. 2000, astronomers began producing large collections of data with more complex structures: light curves (brightness time series) and spectra (brightness vs. wavelength). These comprise what statisticians call functional data—measurements of populations of functions. Upcoming automated sky surveys will soon provide astronomers with a flood of functional data. New methods are needed to accurately and optimally analyze large ensembles of light curves and spectra, accumulating information both along individual measured functions and across a population of such functions. Functional data analysis (FDA) provides tools for statistical modeling of functional data. Astronomical data presents several challenges for FDA methodology, e.g., sparse, irregular, and asynchronous sampling, and heteroscedastic measurement error. Bayesian FDA uses hierarchical Bayesian models for function populations, and is well suited to addressing these challenges. We provide an overview of astronomical functional data and some key Bayesian FDA modeling approaches, including functional mixed effects models, and stochastic process models. We briefly describe a Bayesian FDA framework combining FDA and machine learning methods to build low-dimensional parametric models for galaxy spectra.
Bayesian analysis of regression discontinuity designs with heterogeneous treatment effects
ArXiv.org · 2025-04-14
preprint · Open access · Senior author
Regression Discontinuity Design (RDD) is a popular framework for estimating a causal effect in settings where treatment is assigned if an observed covariate exceeds a fixed threshold. We consider estimation and inference in the common setting where the sample consists of multiple known sub-populations with potentially heterogeneous treatment effects. In the applied literature, it is common to account for heterogeneity by either fitting a parametric model or considering each sub-population separately. In contrast, we develop a Bayesian hierarchical model using Gaussian process regression which allows for non-parametric regression while borrowing information across sub-populations. We derive the posterior distribution, prove posterior consistency, and develop a Metropolis-Hastings within Gibbs sampling algorithm. In extensive simulations, we show that the proposed procedure outperforms existing methods in both estimation and inferential tasks. Finally, we apply our procedure to U.S. Senate election data and discover an incumbent party advantage which is heterogeneous over different time periods.
Correction to: Dynamic Shrinkage Processes
Journal of the Royal Statistical Society Series B (Statistical Methodology) · 2024-11-11
article · Open access · Senior author
A novel approach to assessing the joint effects of mercury and fish consumption on neurodevelopment in the New Bedford Cohort
American Journal of Epidemiology · 2024-06-28 · 5 citations
article · Open access
Understanding health risks from methylmercury (MeHg) exposure is complicated by its link to fish consumption, which may confound or modify toxicities. One solution is to include fish intake and a biomarker of MeHg exposure in the same analytical model, but resulting estimates do not reflect the independent impact of accumulated MeHg or fish exposure. In fish-eating populations, this can be addressed by separating MeHg exposure into fish intake and average mercury content of the consumed fish. We assessed the joint association of prenatal MeHg exposure (maternal hair mercury level) and fish intake (among fish-eating mothers) with neurodevelopment in 361 children aged 8 years from the New Bedford Cohort (New Bedford, Massachusetts; born in 1993-1998). Neurodevelopmental assessments used standardized tests of IQ, language, memory, and attention. Covariate-adjusted regression assessed the association of maternal fish consumption, stratified by tertile of estimated average fish mercury level, with neurodevelopment. Associations between maternal fish intake and child outcomes were generally beneficial for those in the lowest average fish mercury tertile but detrimental in the highest average fish mercury tertile, where, for example, each serving of fish was associated with 1.3 fewer correct responses (95% CI, -2.2 to -0.4) on the Boston Naming Test. Standard analyses showed no outcome associations with hair mercury level or fish intake. This article is part of a Special Collection on Environmental Epidemiology.
Bayesian functional data analysis in astronomy
arXiv (Cornell University) · 2024-08-26 · 1 citation
preprint · Open access · Senior author
Cosmic demographics -- the statistical study of populations of astrophysical objects -- has long relied on multivariate statistics, providing methods for analyzing data comprising fixed-length vectors of properties of objects, as might be compiled in a tabular astronomical catalog (say, with sky coordinates, and brightness measurements in a fixed number of spectral passbands). But beginning with the emergence of automated digital sky surveys, ca. 2000, astronomers began producing large collections of data with more complex structure: light curves (brightness time series) and spectra (brightness vs. wavelength). These comprise what statisticians call functional data -- measurements of populations of functions. Upcoming automated sky surveys will soon provide astronomers with a flood of functional data. New methods are needed to accurately and optimally analyze large ensembles of light curves and spectra, accumulating information both along and across measured functions. Functional data analysis (FDA) provides tools for statistical modeling of functional data. Astronomical data presents several challenges for FDA methodology, e.g., sparse, irregular, and asynchronous sampling, and heteroscedastic measurement error. Bayesian FDA uses hierarchical Bayesian models for function populations, and is well suited to addressing these challenges. We provide an overview of astronomical functional data, and of some key Bayesian FDA modeling approaches, including functional mixed effects models, and stochastic process models. We briefly describe a Bayesian FDA framework combining FDA and machine learning methods to build low-dimensional parametric models for galaxy spectra.
Characterization of extrasolar giant planets with machine learning
Monthly Notices of the Royal Astronomical Society Letters · 2023-10-04
article · Open access · Senior author
More than 5000 extrasolar planets have already been detected. JWST and near-term ground-based telescopes like the Extremely Large Telescope (ELT), Giant Magellan Telescope (GMT), Thirty Meter Telescope (TMT), and upcoming telescopes such as the Nancy Grace Roman Space Telescope, Xuntian, and Ariel are designed to characterize the atmosphere of directly imaged Jovian planets. Here, we used five diverse machine learning algorithms to investigate how well broad-band filter photometric fluxes could initially characterize giant exoplanets. We use an established grid of 8813 reflected light model spectra of different metallicities, planet–star distances, and cloud properties to assess the performance of several machine learning algorithms on both noiseless and noisy data to provide classification and regression results as a function of signal to noise of the data. In all cases, the algorithms were tested on noisy validation data. The results show that the use of machine learning to characterize giant planets from reflected broad-band filter photometry provides a promising tool for initial characterization, with over 65 per cent accuracy in characterizing metallicity for signal-to-noise ratios (S/N) ≳ 30, over 80 per cent for cloud coverage for S/N ≳ 30. This approach will allow initial characterization for large surveys of giant exoplanets and prioritization for spectroscopy observations of a subset of these worlds.
Splines 'n Lines: Rest-frame galaxy spectral energy distributions via Bayesian functional data analysis
arXiv (Cornell University) · 2023-10-30
preprint · Open access · Senior author
Survey-based measurements of the spectral energy distributions (SEDs) of galaxies have flux density estimates on badly misaligned grids in rest-frame wavelength. The shift to rest frame wavelength also causes estimated SEDs to have differing support. For many galaxies, there are sizeable wavelength regions with missing data. Finally, dim galaxies dominate typical samples and have noisy SED measurements, many near the limiting signal-to-noise level of the survey. These limitations of SED measurements shifted to the rest frame complicate downstream analysis tasks, particularly tasks requiring computation of functionals (e.g., weighted integrals) of the SEDs, such as synthetic photometry, quantifying SED similarity, and using SED measurements for photometric redshift estimation. We describe a hierarchical Bayesian framework, drawing on tools from functional data analysis, that models SEDs as a random superposition of smooth continuum basis functions (B-splines) and line features, comprising a finite-rank, nonstationary Gaussian process, measured with additive Gaussian noise. We apply this Splines 'n Lines (SnL) model to a collection of 678,239 galaxy SED measurements comprising the Main Galaxy Sample from the Sloan Digital Sky Survey, Data Release 17, demonstrating capability to provide continuous estimated SEDs that reliably denoise, interpolate, and extrapolate, with quantified uncertainty, including the ability to predict line features where there is missing data by leveraging correlations between line features and the entire continuum.
Measurement errors in semi‐parametric generalised regression models
Australian & New Zealand Journal of Statistics · 2023-10-11
article · Senior author
Regression models that ignore measurement error in predictors may produce highly biased estimates leading to erroneous inferences. It is well known that it is extremely difficult to take measurement error into account in Gaussian non‐parametric regression. This problem becomes even more difficult when considering other families such as binary, Poisson and negative binomial regression. We present a novel method aiming to correct for measurement error when estimating regression functions. Our approach is sufficiently flexible to cover virtually all distributions and link functions regularly considered in generalised linear models. This approach depends on approximating the first and the second moment of the response after integrating out the true unobserved predictors in any semi‐parametric generalised regression model. By the latter is meant a model with both linear and non‐parametric effects that are connected to the mean response by a link function and with a response distribution in an exponential family or quasi‐likelihood model. Unlike previous methods, the method we now propose is not restricted to truncated splines and can utilise various basis functions. Moreover, it can operate without making any distributional assumption about the unobserved predictor. Through extensive simulation studies, we study the performance of our method under many scenarios.
Bayesian Functional Principal Components Analysis via Variational Message Passing with Multilevel Extensions
Bayesian Analysis · 2023-08-08 · 2 citations
article · Open access · Senior author
Standard approaches for functional principal components analysis rely on an eigendecomposition of a smoothed covariance surface in order to extract the orthonormal eigenfunctions representing the major modes of variation in a set of functional data. This approach can be a computationally intensive procedure, especially in the presence of large datasets with irregular observations. In this article, we develop a variational Bayesian approach, which aims to determine the Karhunen-Loève decomposition directly without smoothing and estimating a covariance surface. More specifically, we incorporate the notion of variational message passing over a factor graph because it removes the need for rederiving approximate posterior density functions if there is a change in the model. Instead, model changes are handled by changing specific computational units, known as fragments, within the factor graph – we demonstrate this with an extension to multilevel functional data. Indeed, this is the first article to address a functional data model via variational message passing. Our approach introduces three new fragments that are necessary for Bayesian functional principal components analysis. We present the computational details, a set of simulations for assessing the accuracy and speed of the variational message passing algorithm and an application to United States temperature data.
Maximizing Portfolio Predictability with Machine Learning
SSRN Electronic Journal · 2023-01-01
article · Open access · Senior author

Recent grants

Nonparametric Regression
NSF · $86k · 1998–2001
Asymptotic Theory of Penalized Splines and Calibration of Computationally Expensive Models
NSF · $180k · 2008–2012

Frequent coauthors

Raymond J. Carroll
University of Technology Sydney
136 shared
M. P. Wand
57 shared
David S. Matteson
Cornell University
46 shared
Hua Liang
24 shared
Ciprian M. Crainiceanu
Johns Hopkins University
23 shared
Heng Lian
City University of Hong Kong
16 shared
Christine A. Shoemaker
National University of Singapore
16 shared
Yixin Fang
AbbVie (United States)
16 shared

Education

PhD, Statistics and Probability
Michigan State University
1977
MA, Mathematics
University of Vermont
1973
BA, Mathematics
Cornell University
1970

Awards & honors

Wilcoxon Prize (1986)
Fellow of the ASA
Fellow of the IMS
