
Joshua T. Vogelstein
Associate Professor · Johns Hopkins University · Radiology and Radiological Science
Active 2009–2026
About
Joshua T. Vogelstein, PhD, is an Associate Professor at Johns Hopkins University in the departments of Biomedical Engineering, Biostatistics, Applied Mathematics & Statistics, Neuroscience, and Computer Science. His research interests include big data science, machine learning, statistics, and network science. His work focuses on searching for patterns in physical worlds such as bodies and brains, as well as mental worlds including perceptions, experiences, memories, thoughts, emotions, and psychiatric conditions. He aims to understand the links between these worlds to bring them into greater alignment, with the broader goal of benefiting humans and animals by developing deeper insights into brain and body functions. All of his research products are freely available to the public.
Research topics
- Computer Science
- Statistics
- Mathematics
- Neuroscience
- Biology
- Evolutionary biology
Selected publications
Vectorized Adaptive Histograms for Sparse Oblique Forests
arXiv (Cornell University) · 2026-02-27
preprint · Open access
Classification using sparse oblique random forests provides guarantees on uncertainty and confidence while controlling for specific error types. However, they use more data and more compute than other tree ensembles because they create deep trees and need to sort or histogram linear combinations of data at runtime. We provide a method for dynamically switching between histograms and sorting to find the best split. We further optimize histogram construction using vector intrinsics. Evaluating this on large datasets, our optimizations speed up training by 1.7-2.5x compared to existing oblique forests and 1.5-2x compared to standard random forests. We also provide a GPU and hybrid CPU-GPU implementation.
Community correlations and testing independence between binary graphs
Applied Network Science · 2026-04-05
article · Open access · Senior author
Graph data has a unique structure that deviates from standard data assumptions, often necessitating modifications to existing methods or the development of new ones to ensure valid statistical analysis. In this paper, we explore the notion of correlation and dependence between two binary graphs. Given vertex communities, we propose community correlations to measure the edge association, which equals zero if and only if the two graphs are conditionally independent within a specific pair of communities. The set of community correlations naturally leads to the maximum community correlation, indicating conditional independence on all possible pairs of communities, and to the overall graph correlation, which equals zero if and only if the two binary graphs are unconditionally independent. We then compute the sample community correlations via graph encoder embedding, proving they converge to their respective population versions, and derive the asymptotic null distribution to enable a fast, valid, and consistent test for conditional or unconditional independence between two binary graphs. The theoretical results are validated through comprehensive simulations, and we provide two real-data examples: one using Enron email networks and another using mouse connectome graphs, to demonstrate the utility of the proposed correlation measures.
Compiling molecular ultrastructure into neural dynamics
ArXiv.org · 2026-03-26
article · Open access
High-resolution brain imaging can now capture not just synapse locations but their molecular composition, with the cost of such mapping falling exponentially. Yet such ultrastructural data has so far told us little about local neuronal physiology - specifically, the parameters (e.g., synaptic efficacies, local conductances) that govern neural dynamics. We propose to translate molecularly annotated ultrastructure into physiology, introducing the concept of an ultrastructure-to-dynamics compiler: a learned mapping from molecularly annotated ultrastructure to simulator-ready, uncertainty-aware physiological parameters. The requirement is paired training data, with jointly acquired ultrastructure from imaging, and dynamical responses to perturbations from physiological experiments. With this data we can train models that predict local physiology directly from structure. Such a compiler would support biophysical simulations by turning anatomical maps into models of circuit dynamics, shifting structure-to-function from a descriptive program to a predictive one and opening routes to understanding neural computation and forecasting intervention effects.
Optimal control of the future via prospective learning with control
arXiv (Cornell University) · 2025-11-11
preprint · Open access · Senior author
Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in reinforcement learning (RL). RL is mathematically distinct from supervised learning, which has been the main workhorse for the recent achievements in AI. Moreover, RL typically operates in a stationary environment with episodic resets, limiting its utility. Here, we extend supervised learning to address learning to control in non-stationary, reset-free environments. Using this framework, called "Prospective Learning with Control" (PLuC), we prove that under certain fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy. We then consider a specific instance of prospective learning with control: foraging, a canonical task relevant to both natural and artificial agents. We illustrate that modern RL algorithms, which assume stationarity, struggle in these non-stationary reset-free environments. Even with time-aware modifications, they converge orders of magnitude slower than our prospective foraging agents on a simple 1-D foraging benchmark. Code is available at: https://github.com/neurodata/procontrol.
PLoS ONE · 2025-08-12 · 1 citation
article · Open access · Corresponding
Alzheimer's disease (AD) lacks effective cures and is typically detected after substantial pathological changes have occurred, making intervention challenging. AD intervention requires early detection of risk factors and understanding their complex interactions before substantial pathological changes manifest. Current research often examines individual risk factors in isolation, limiting our understanding of their combined effects. We present a novel multivariate analytical framework to simultaneously assess multiple AD risk factors using mouse models expressing human ApoE alleles. Our methodological innovation lies in combining high-resolution magnetic resonance diffusion imaging with a comprehensive multifactorial analysis that integrates genotype, age, sex, diet, and immunity as interacting variables. This approach enables the simultaneous examination of regional brain volume and fractional anisotropy changes across multiple risk factors, providing a more holistic view than traditional univariate analyses. Our proposed method effectively identified how these factors converge on specific brain regions - with genotype influencing the caudate putamen, pons, cingulate cortex, and cerebellum; sex affecting the amygdala and piriform cortex; and immune status impacting association cortices and cerebellar nuclei. Importantly, our integrated approach revealed factor interactions that would remain undetected in single-variable studies, particularly in the amygdala, thalamus, and pons. While many findings align with previous research, our multidimensional framework offers a methodological advancement for studying AD risk factors by modeling their combined effects rather than isolated impacts. This approach creates a template for future studies to investigate mechanisms underlying coordinated changes in brain structure through network analyses of gene expression, metabolism, and structural pathways involved in neurodegeneration.
When no answer is better than a wrong answer: A causal perspective on batch effects
Imaging Neuroscience · 2025-01-01 · 3 citations
article · Open access · Senior author
Batch effects, undesirable sources of variability across multiple experiments, present significant challenges for scientific and clinical discoveries. Batch effects can (i) produce spurious signals and/or (ii) obscure genuine signals, contributing to the ongoing reproducibility crisis. Because batch effects are typically modeled as classical statistical effects, they often cannot differentiate between sources of variability due to confounding biases, which may lead them to erroneously conclude batch effects are present (or not). We formalize batch effects as causal effects, and introduce algorithms leveraging causal machinery, to address these concerns. Simulations illustrate that when non-causal methods provide the wrong answer, our methods either produce more accurate answers or "no answer," meaning they assert the data are inadequate to confidently conclude on the presence of a batch effect. Applying our causal methods to 27 neuroimaging datasets yields qualitatively similar results: in situations where it is unclear whether batch effects are present, non-causal methods confidently identify (or fail to identify) batch effects, whereas our causal methods assert that it is unclear whether there are batch effects or not. In instances where batch effects should be discernable, our techniques produce different results from prior art, each of which produces results more qualitatively similar to not applying any batch effect correction to the data at all. This work, therefore, provides a causal framework for understanding the potential capabilities and limitations of analysis of multi-site data.
Hands-On Network Machine Learning with Python
Cambridge University Press eBooks · 2025-09-18
book · Senior author
Bridging theory and practice in network data analysis, this guide offers an intuitive approach to understanding and analyzing complex networks. It covers foundational concepts, practical tools, and real-world applications using Python frameworks including NumPy, SciPy, scikit-learn, graspologic, and NetworkX. Readers will learn to apply network machine learning techniques to real-world problems, transform complex network structures into meaningful representations, leverage Python libraries for efficient network analysis, and interpret network data and results. The book explores methods for extracting valuable insights across various domains such as social networks, ecological systems, and brain connectivity. Hands-on tutorials and concrete examples develop intuition through visualization and mathematical reasoning. The book will equip data scientists, students, and researchers in applications using network data with the skills to confidently tackle network machine learning projects, providing a robust toolkit for data science applications involving network-structured data.
Simple Lifelong Learning Machines
IEEE Transactions on Pattern Analysis and Machine Intelligence · 2025-08-04 · 1 citation
article · 1st author · Corresponding
In lifelong learning, data are used to improve performance not only on the present task, but also on past and future (unencountered) tasks. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance on old tasks given new tasks. But striving to avoid forgetting sets the goal unnecessarily low. The goal of lifelong learning should be to use data to improve performance on both future tasks (forward transfer) and past tasks (backward transfer). In this paper, we show that a simple approach, representation ensembling, demonstrates both forward and backward transfer in a variety of simulated and benchmark data scenarios, including tabular, vision (CIFAR-100, 5-dataset, Split Mini-Imagenet, Food1k, and CORe50), and speech (spoken digit), in contrast to various reference algorithms, which typically failed to transfer either forward or backward, or both. Moreover, our proposed approach can flexibly operate with or without a computational budget.
Frequent coauthors
- Carey E. Priebe · 30 shared
- Randal Burns · Johns Hopkins University · 12 shared
- Michael P. Milham · 11 shared
- Daniel S. Margulies · Wellcome Centre for Integrative Neuroimaging · 10 shared
- Eric Bridgeford · Stanford University · 8 shared
- Georg Langs · Medical University of Vienna · 8 shared
- R. Jacob Vogelstein · 8 shared
- Youngser Park · 7 shared