
David S. Matteson
Verified · Cornell University · Computer Science
Active 1986–2026
About
David S. Matteson is a faculty member in the Department of Computer Science at Cornell University. His selected publications listed on this page center on Bayesian time series analysis, including trend and seasonality decomposition, stochastic volatility, and dynamic shrinkage priors.
Research topics
- Mathematics
- Data Mining
- Artificial Intelligence
- Computer Science
- Machine Learning
- Econometrics
- Geometry
- Mathematical economics
- Economics
- Physics
- Economic growth
- Medicine
- Geography
- Data science
- Keynesian economics
Selected publications
BASTION: A Bayesian Framework for Trend and Seasonality Decomposition
arXiv (Cornell University) · 2026-01-26
preprint · Open access · Senior author
We introduce BASTION (Bayesian Adaptive Seasonality and Trend DecompositION), a flexible Bayesian framework for decomposing time series into trend and multiple seasonality components. We cast the decomposition as a penalized nonparametric regression and establish formal conditions under which the trend and seasonal components are uniquely identifiable, an issue only treated informally in the existing literature. BASTION offers three key advantages over existing decomposition methods: (1) accurate estimation of trend and seasonality amidst abrupt changes, (2) enhanced robustness against outliers and time-varying volatility, and (3) robust uncertainty quantification. We evaluate BASTION against established methods, including TBATS, STR, and MSTL, using both simulated and real-world datasets. By effectively capturing complex dynamics while accounting for irregular components such as outliers and heteroskedasticity, BASTION delivers a more nuanced and interpretable decomposition. To support further research and practical applications, BASTION is available as an R package at https://github.com/Jasoncho0914/BASTION
Journal of Clinical and Translational Science · 2026-04-01
article · Open access
Objectives/Goals: The objective of this study was to evaluate the performance of multimodal machine learning (ML) models trained to predict differentiated thyroid cancer (DTC) recurrence using clinical data combined with novel natural language processing (NLP) derived features extracted from patient cytopathology and surgical pathology reports.
Methods/Study Population: This was a retrospective study of adult thyroid cancer patients treated at an academic medical center. Patients were classified as having cancer recurrence or no recurrence. NLP features were extracted from cytopathology and surgical pathology reports using Term Frequency–Inverse Document Frequency (TF-IDF), latent Dirichlet allocation (LDA), and a zero-shot large language model (LLM) classification. Five multimodal ML models were trained to predict cancer recurrence utilizing a combination of NLP and LLM features and clinical variables. Model performance was evaluated using area under the receiver operating characteristic curve (ROC-AUC) and precision-recall area under the curve (PR-AUC). The top performing model was optimized with 5-fold cross-validation. Feature importance was calculated.
Results/Anticipated Results: 480 patients with differentiated thyroid cancer diagnosed on surgical pathology were included in this study. The baseline model (clinical variables only) had an F1-score of 0.52 and an AUC of 0.53. The optimized gradient boosting model utilizing all features (EMR, LDA, TF-IDF, and LLM) had an F1-score of 0.87 and an AUC of 0.86. Topic words and themes from the patient cytopathology and surgical pathology reports were generated using LDA. Topic themes in cytopathology reports include malignancy, lymph node evaluation, and molecular testing. Topic themes in surgical pathology reports include histologic subtype, orientation of nodule, and intraoperative biopsy. The LDA themes of malignancy and histologic subtype ranked the highest in terms of feature importance.
Discussion/Significance of Impact: Multimodal models utilizing novel NLP features derived from unstructured pathology reports may enable improved prediction of recurrence in patients with DTC. In our optimized model, 4 of the 6 most important features were LDA topics. Topic modeling may be a valuable tool to extract relevant information from unstructured clinical notes.
The Journal of Physical Chemistry B · 2026-01-15
article · Open access
Single-molecule Förster resonance energy transfer (smFRET) experiments have greatly contributed to the understanding of the conformational dynamics of proteins and other biomolecules. Generating high-fidelity simulated data for smFRET experiments is an important step toward developing and examining accurate and efficient smFRET data analysis techniques. Here, we use distributions of interdye distances generated using Langevin dynamics to simulate freely diffusing smFRET timestamp data for proteins and biomolecules that have conformational flexibility. We then compare analysis techniques for smFRET data to validate the new module. Langevin dynamics is used here as an illustrative example to demonstrate how modeling conformational dynamics can be integrated with molecular diffusion and photon emission statistics, all of which are essential for realistic simulation of freely diffusing smFRET data. We also discuss different ways to generalize our approach to make the simulated data more realistic, including the use of molecular dynamics (MD) simulations, which is illustrated with an example. The Langevin dynamics module provides a framework for generating timestamp data for systems with a known underlying conformational heterogeneity as a step toward the development of new analysis techniques for smFRET data dealing with flexible proteins or other biomolecular systems.
Spatial heterogeneity in machine learning-based poverty mapping: Where do models underperform?
Geography and sustainability · 2026-01-15
article · Open access
- Machine learning-based poverty mapping underperforms due to spatial heterogeneity.
- In interpolation, geographically weighted models reveal variation across regions.
- In extrapolation, models overestimate welfare in poor, rural, single-sector regions.
- Spatial models yield limited gains in underperforming areas.
- Unbiased poverty maps require improved training data and remote sensing proxies.
Accurately locating poor populations is increasingly urgent as global poverty reduction has stalled under the combined pressures of conflicts, climate shocks, rising food prices, pandemics, and growing inequality. Recent studies harnessing geospatial big data and machine learning (ML) have significantly advanced poverty mapping, enabling granular and timely welfare estimates in traditionally data-scarce regions. While much of the existing research has focused on overall out-of-sample predictive performance, there is a lack of understanding regarding where such models underperform and whether key spatial relationships might vary across places. This study investigates spatial heterogeneity in ML-based poverty mapping in East Africa, testing whether spatial regression and ML techniques produce more unbiased predictions. We find that extrapolation into unsurveyed areas suffers from biases that spatial methods do not resolve; welfare is overestimated in impoverished regions, rural areas, and single sector-dominated economies, whereas it tends to be underestimated in wealthier, urbanized, and diversified economies. Even as spatial models improve overall predictive accuracy, enhancements in traditionally underperforming areas remain marginal. This underscores the need for more representative training datasets and better remotely sensed proxies, especially for poor and rural regions, in future research related to ML-based poverty mapping.
For development agencies, the findings caution against treating ML-based outputs as neutral or universally reliable, highlighting instead the need to pair technical advances with investments in inclusive data collection, integration of spatial theory, and institutional strategies that address structural data inequalities.
Modeling Dynamic Correlation Matrices with Shrinkage Priors
ArXiv.org · 2026-05-07
article · Open access
Estimating time-varying correlation matrices is challenging because existing methods may adapt slowly to structural changes, impose insufficient regularization, or produce diffuse posterior uncertainty. In moderate dimensions, an additional difficulty is summarizing the estimated evolving dependence structure for downstream decision-making tasks. We propose a Bayesian approach based on a low-rank factor representation, with latent states evolving under a dynamic shrinkage prior and observation errors following a multivariate factor stochastic volatility model. This specification allows locally adaptive regularization of the estimated correlation structure over time and informative uncertainty quantification. We establish, to our knowledge, a first-of-its-kind posterior contraction result for dynamically regularized Bayesian models, showing contraction around the true model parameters at an explicit rate under averaged Hellinger distance. To summarize the estimated correlation matrices, we build on the information-theoretic concept of total correlation to obtain a scalar measure of cross-sectional dependence. Simulation studies show improved accuracy and responsiveness relative to competing methods in a range of challenging scenarios. We then apply our method to monitoring the correlation evolution of equity portfolios during periods of financial market stress, providing an ex post framework for assessing the changing benefits of diversification in backtesting analyses.
Smoothing Variances Across Time: Adaptive Stochastic Volatility
Figshare · 2025-12-29
dataset · Open access · Senior author
We introduce a novel Bayesian framework for estimating time-varying volatility by extending the Random Walk Stochastic Volatility (RWSV) model with Dynamic Shrinkage Processes (DSP) in log-variances. Unlike the classical Stochastic Volatility (SV) or GARCH-type models with restrictive parametric stationarity assumptions, our proposed Adaptive Stochastic Volatility (ASV) model provides smooth yet dynamically adaptive estimates of evolving volatility and its uncertainty. We further enhance the model by incorporating a nugget effect, allowing it to flexibly capture small-scale variability while preserving smoothness elsewhere. We derive the theoretical properties of the global-local shrinkage prior DSP. Simulation studies demonstrate that ASV is highly robust to misspecification, consistently recovering the latent volatility structure across a wide range of data-generating processes. Furthermore, ASV’s capacity to yield locally smooth and interpretable estimates facilitates a clearer understanding of the underlying patterns and trends in volatility. As an extension, we develop the Bayesian Trend Filter with ASV (BTF-ASV) which allows joint modeling of the mean and volatility with abrupt changes. Finally, our proposed models are applied to time series data from finance, econometrics, and environmental science, highlighting their flexibility and broad applicability.
dsp: Dynamic Shrinkage Process and Change Point Detection
2025-08-19
dataset · Open access · Senior author
Provides efficient Markov chain Monte Carlo (MCMC) algorithms for dynamic shrinkage processes, which extend global-local shrinkage priors to the time series setting by allowing shrinkage to depend on its own past. These priors yield locally adaptive estimates, useful for time series and regression functions with irregular features. The package includes full MCMC implementations for trend filtering using dynamic shrinkage on signal differences, producing locally constant or linear fits with adaptive credible bands. Also included are models with static shrinkage and normal-inverse-Gamma priors for comparison. Additional tools cover dynamic regression with time-varying coefficients and B-spline models with shrinkage on basis differences, allowing for flexible curve-fitting with unequally spaced data. Some support is included for heteroscedastic errors, outlier detection, and change point estimation. Methods in this package are described in Kowal et al. (2019) <doi:10.1111/rssb.12325>, Wu et al. (2024) <doi:10.1080/07350015.2024.2362269>, Schafer and Matteson (2024) <doi:10.1080/00401706.2024.2407316>, and Cho and Matteson (2024) <doi:10.48550/arXiv.2408.11315>.
Spatial Heterogeneity in Machine Learning-Based Poverty Mapping: Where Do Models Underperform?
SSRN Electronic Journal · 2025-01-01
preprint · Open access
Recent grants
NSF · $735k · 2019–2023
Collaborative Research: Atomic Level Structural Dynamics in Catalysts
NSF · $331k · 2019–2022
New Frontiers in Time Series Analysis
NSF · $300k · 2021–2025
CAREER: New Frontiers in Time Series Analysis
NSF · $400k · 2015–2020
HDR TRIPODS: Collaborative Research: Foundations of Greater Data Science
NSF · $686k · 2019–2023
Frequent coauthors
- David Ruppert · Cornell University · 46 shared
- Ines Wilms · 19 shared
- Jacob Bien · University of Southern California · 19 shared
- Yuchen Xu · Shandong First Medical University · 16 shared
- Peter A. Crozier · 16 shared
- Nicholas A. James · Mount Sinai Beth Israel · 15 shared
- Toryn L. J. Schafer · Cornell University · 14 shared
- Benjamin B. Risk · Emory University · 14 shared
Education
- 2008 · PhD, Statistics · University of Chicago
Awards & honors
- CAREER Award from the National Science Foundation
- Chancellor’s Award for Scholarship and Creative Activities f…
- inaugural Ann S. Bowers Research Excellence Award
- Faculty Research Awards from the Xerox/PARC Foundation and L…
- Fellow of the American Statistical Association (ASA)