Ali Shojaie
Professor · University of Washington · Statistics
Active 2005–2026
About
Ali Shojaie is a Professor in the Department of Biostatistics and the Department of Statistics at the University of Washington. He currently serves as the Interim Chair of the Department of Biostatistics and is the Founding Director of the Summer Institute for Statistics in Big Data (SISBID) at the University of Washington. Additionally, he leads the Data Management and Statistics (DMS) Core at the UW Alzheimer's Disease Research Center (ADRC). He holds affiliate memberships with the Fred Hutchinson Cancer Research Center, the UW Center for Statistics and Social Sciences (CSSS), and the UW eScience Institute. Professor Shojaie is actively involved in editorial service for several prestigious journals, including serving as Editor for the Section on Machine Learning and Data Mining for the New England Journal of Statistics in Data Science since 2020, Associate Editor for the Journal of the American Statistical Association (JASA) since 2013 and again from 2024, Action Editor for the Journal of Machine Learning Research (JMLR) since 2022, Associate Editor for Biometrika since 2023, and Statistical Editor for NEJM AI since 2024.
Research topics
- Biology
- Bioinformatics
- Genetics
- Computer Science
- Machine Learning
- Artificial Intelligence
- Endocrinology
- Evolutionary biology
- Medicine
- Computational biology
- Internal medicine
Selected publications
EBioMedicine · 2026-03-17
Article · Open access
BACKGROUND: The significant clinical and molecular heterogeneity of pulmonary arterial hypertension (PAH) poses challenges in identifying effective therapies. Advanced multidimensional profiling offers an opportunity to capture molecular responses and assess biomarker stability, yet its application in randomised trials remains limited. METHODS: We evaluated the multi-omic profiles of participants with PAH in a randomised, placebo-controlled trial of famotidine. Plasma metabolomic and proteomic profiling was performed at enrolment and 24 weeks. Baseline profiles were compared between treatment arms to assess randomisation balance. Intraclass correlation coefficients quantified within-subject stability over time. Linear regression models adjusting for age, sex, body mass index and PAH aetiology evaluated famotidine's molecular effects. False discovery rate was controlled for multiple comparisons. FINDINGS: For the 79 participants, baseline multi-omic profiles were similar between groups. At 24 weeks, 34 and 37 participants remained in the famotidine and placebo groups respectively. The placebo group showed high molecular stability, while greater variability was observed in the famotidine group. Famotidine treatment was associated with significant changes across 191 proteomic pathways (q-value <0.05), but no metabolomic changes remained significant after multiple-testing correction. INTERPRETATION: Integrating multi-omics into a prospective clinical trial is feasible and yields stable longitudinal profiles in the absence of intervention. While famotidine did not yield clinical benefit, associated proteomic changes illustrate how molecular profiling can reveal treatment-related biology and inform future trial design. These findings highlight the broader utility of multi-omics for evaluating drug responses and identifying molecular endotypes in PAH and beyond. FUNDING: US National Institutes of Health.
Statistical inference for high-dimensional generalized estimating equations
Biostatistics · 2026-01-01
Preprint · Open access · Senior author
Regression analysis of correlated data, where multiple correlated responses are recorded on the same unit, is ubiquitous in many scientific areas. With the advent of new technologies, in particular high-throughput omics profiling assays, such correlated data increasingly consist of a large number of variables compared with the available sample size. Motivated by recent longitudinal proteomics studies of COVID-19, we propose a novel inference procedure for linear functionals of high-dimensional regression coefficients in generalized estimating equations, which are widely used to analyze correlated data. Our estimator for this more general inferential target, obtained via constructing projected estimating equations, is shown to be asymptotically normally distributed under mild regularity conditions. We also introduce a data-driven cross-validation procedure to select the tuning parameter for estimating the projection direction, which is not addressed in the existing procedures. We illustrate the utility of the proposed procedure in providing confidence intervals for associations of individual proteins and severe COVID risk scores obtained based on high-dimensional proteomics data, and demonstrate its robust finite-sample performance, especially in estimation bias and confidence interval coverage, via extensive simulations.
A general nonparametric framework for testing hypotheses about function-valued parameters
arXiv (Cornell University) · 2026-04-21
Preprint · Open access
We present a general nonparametric approach for testing whether a statistical parameter defined through conditional distributions is constant across the conditioning variables. Such hypotheses arise naturally in problems such as assessing treatment effect heterogeneity, conditional associational effects, and conditional mean dependence. Our framework studies function-valued parameters obtained by evaluating a smooth statistical functional on conditional probability distributions. We establish an explicit connection between our test and procedures based on studying the norm of the function-valued parameter. Unlike many existing norm-based tests, which exhibit poor asymptotic behavior under the null, the proposed test statistic admits a tractable limiting null distribution. We illustrate the applicability of the proposed test through several examples, assess its operating characteristics in simulation studies, and apply it to data from a breast cancer trial to identify predictive biomarkers for response to adjuvant chemotherapy.
Differential expression analysis for spatially correlated data using smiDE
Genome biology · 2026-01-10
Article · Open access · Senior author · Corresponding
Differential expression is a key application of imaging spatial transcriptomics, moving analysis beyond cell type localization to examining cell state responses to microenvironments. However, spatial data poses new challenges to differential expression: segmentation errors cause bias in fold-change estimates, and correlation among neighboring cells leads standard models to inflate statistical significance. We find that ignoring these issues can result in considerable false discoveries that greatly outnumber true findings. We present a suite of solutions to these fundamental challenges, and implement them in the R package smiDE.
Data for "Accounting for Spatial Structure in Network Analysis of Spatial Transcriptomics Data"
Zenodo (CERN European Organization for Nuclear Research) · 2026-05-18
Dataset · Open access · Senior author
Intermediate data files for Vasconcelos et al. (2025), "Accounting for Spatial Structure in Network Analysis of Spatial Transcriptomics Data." To be used with the analysis code at https://github.com/anagpv/spacedecorrcodes. The spacedecorr R package is available at https://github.com/anagpv/spacedecorr.
Inference on function-valued parameters using a restricted score test
Journal of the Royal Statistical Society Series B (Statistical Methodology) · 2026-01-20
Article · Open access · Senior author
It is often of interest to make inference on an unknown function that is a local parameter of the data-generating mechanism, such as a density or regression function. Such estimands can typically only be estimated at a slower-than-parametric rate in nonparametric and semiparametric models, and performing calibrated inference can be challenging. In many cases, these estimands can be expressed as the minimizer of a population risk functional. Here, we propose a general framework that leverages such representation and provides a nonparametric extension of the score test for inference on an infinite-dimensional risk minimizer. We demonstrate that our framework is applicable in a wide variety of problems. As both analytic and computational examples, we describe how to use our general approach for inference on a mean regression function under (i) nonparametric and (ii) partially additive models, and evaluate the operating characteristics of the resulting procedures via simulations. Assessment of effect heterogeneity, inference on density functions, and conditional independence testing are discussed as additional examples.
medRxiv · 2026-03-18
Article · Open access
Atrial fibrillation and heart failure impose substantial health burdens worldwide, yet existing prediction models lack sufficient accuracy and generalizability. We developed CARDIAC-FM, a multimodal foundation model that learns joint representations of 12-lead electrocardiogram (ECG) and cardiac magnetic resonance imaging (MRI) through contrastive learning. We trained CARDIAC-FM on 57,609 paired ECG-cardiac MRI samples from UK Biobank and evaluated it in two external cohorts: the Cardiovascular Health Study (CHS) and the Multi-Ethnic Study of Atherosclerosis (MESA). CARDIAC-FM consistently outperformed unimodal models across all cohorts, and jointly incorporating ECG features with established clinical risk scores yielded additive gains in discrimination, indicating that ECG and traditional risk factors capture complementary dimensions of cardiovascular risk. The learned representations improved prediction across a range of cardiovascular outcomes with minimal task-specific fine-tuning, reflecting real-world settings where many diseases have limited positive samples and lack dedicated risk models. Although trained on paired ECG and MRI data, CARDIAC-FM generates predictions using ECG alone or ECG combined with established risk scores, enabling broad clinical deployment without MRI. These findings demonstrate the promise of multimodal pre-training for generalizable cardiovascular risk prediction.
bioRxiv (Cold Spring Harbor Laboratory) · 2026-05-15
Article · Open access
Human brain tissue preserved in biorepositories is foundational for the structural, cellular, and biomolecular research necessary for a mechanistic understanding of neurological diseases. Realizing the research potential of these valuable resources requires well-characterized, research-relevant tissue that can be efficiently identified by investigators and incorporated into the conceptual and computational frameworks of interdisciplinary research. Several large-scale efforts to improve research reliability and reproducibility have sought to characterize and annotate the processes by which these samples are collected, yet limited progress has been made on standardizing spatial information for these samples. Biorepositories systematically collect brain tissue according to a brain sampling protocol (BSP) that differs between institutions, yet explicit spatial information regarding the samples may not be documented in standard operating procedures (SOPs). The anatomical location details available to investigators are inconsistent across biorepositories and typically lack sufficient anatomical precision to ensure correspondence with samples from other biorepositories or research-relevant brain regions specified by neuroimaging, functional, or disease-susceptibility criteria. Here, we introduce a pipeline for developing a Spatial Atlas for Mapping Protocol Locations of Ex vivo Samples (SAMPLES), which uses a neuroimaging framework to create a 3D representation of a BSP through a metrically precise digital instantiation of the procedures for brain extraction, segmentation, slicing, and sampling on a modern digital brain template. SAMPLES incorporates modern neuroinformatics conventions to create explicit 3D labels of BSP-defined samples that can be interactively visualized with freely available neuroimaging software.
We illustrate the pipeline by developing an atlas for the protocol from the University of Washington BioRepository and Integrated Neuropathology laboratory (UW BRaIN SAMPLES). By providing an explicit, computable reference, SAMPLES atlases can support the efficient identification, referencing, and utilization of postmortem samples for interdisciplinary research. These capabilities enable biorepository workflows, data harmonization across biorepositories, and integration with antemortem neuroimaging.
Building an Interoperable Rare Disease Multi-omic Resource: The GREGoR Data Model and Dataset
bioRxiv (Cold Spring Harbor Laboratory) · 2026-05-19
Article · Open access
Rare disease research and diagnosis rely on the integration of genomic and phenotypic data generated across diverse clinical sites; however, the absence of widely adopted standards for representing genomic data and associated metadata has limited data interoperability, reuse, and cross-study analysis. The Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium was established to investigate challenging rare disease cases and evaluate emerging multi-omic technologies for clinical translation. To support coordinated data integration across distributed research sites, we developed a common Consortium Data Model in partnership with domain experts to standardize the capture of participant-, family-, phenotype- and assay-level metadata, with a particular emphasis on using a modular architecture to support linking of multiple data versions from multiple omic technologies to a single individual and attribution of a genetic finding to the specific technology used for its initial discovery. Adoption of the GREGoR Data Model has enabled continued generation and public release of a harmonized, analysis-ready Consortium Dataset. The most recent release includes phenotypic, family and multi-omic data from 12,292 participants in 5,029 families. Other rare disease data sharing efforts are beginning to adopt this data model, which will facilitate cross-consortium analyses and empower rare disease research. This work demonstrates that a collaborative, flexible, and scalable data model can enable large-scale rare disease research, facilitate cross-center data harmonization, and enable data interoperability.
Recent grants
Systems Biology Analysis of Cardiac Electrical Activity and Arrhythmias.
NIH · $968k · 2019–2023
Statistical Methods for Differential Network Biology with Applications to Aging
NSF · $1.2M · 2016–2021
Novel Statistical Inference for Biomedical Big Data
NIH · $1.7M · 2020–2025
Statistical Methods for Network-Based Integrative Analysis of CVD Epigenetic Data
NIH · $550k · 2015–2021
NSF · $300k · 2017–2021
Frequent coauthors
- 126 shared
Shahrokh F. Shariat
Medical University of Vienna
- 106 shared
George Michailidis
University of Florida
- 104 shared
Vihas T. Vasu
Maharaja Sayajirao University of Baroda
- 98 shared
Arun Sreekumar
Baylor College of Medicine
- 94 shared
Vasanta Putluri
Baylor College of Medicine
- 94 shared
Nagireddy Putluri
Baylor College of Medicine
- 90 shared
Theodore R. Sana
Kerala Agricultural University
- 88 shared
Gagan Thangjam
Labs
The Ali Shojaie Lab focuses on statistical methods and their applications in biology and medicine.
Education
- 2010
PhD, Statistics
University of Michigan
Awards & honors
- Fellow of the Institute for Mathematical Statistics (IMS)
- Fellow of the American Statistical Association (ASA)
- 2022 Leo Breiman Award from the American Statistical Associa…
- Best paper award from Statistical Learning and Data Science…
- Best paper award from the Section on Statistics in Imaging