Giovanni Parmigiani

Professor of Biostatistics · Verified

Harvard University · Biostatistics

Active 1966–2026

h-index: 141
Citations: 159.5k
Papers: 934 (187 in the last 5 years)
Funding: $114.1M (2 active)
About

Giovanni Parmigiani is a Professor of Biostatistics at Harvard University. His research interests include Bayesian decision theory, multi-study statistical methods, machine learning for precision prevention and treatment in health care, and statistical techniques in cancer biology. He is affiliated with the Department of Statistics and is involved in various academic activities, including teaching and research, within the department.

Research topics

  • Computer Science
  • Mathematics
  • Artificial Intelligence
  • Machine Learning
  • Statistics
  • Immunology
  • Biology
  • Medicine
  • Demography
  • Internal medicine
  • Environmental health
  • Theoretical computer science
  • Genetics

Selected publications

  • A Gene Expression Tumor Signature Optimizing Partial Area‐Under‐the‐Curve (pAUC) to Improve Specificity for Indolent Prostate Cancer

    The Prostate · 2026-04-29

    article · Open access

    PURPOSE: A key clinical challenge in prostate cancer is the identification and validation of biomarkers with high specificity for indolent long-term outcomes. We applied a novel statistical method to identify tumor transcriptomic biomarkers that optimally predicted patients with low metastatic potential. METHODS: Using tumor whole-transcriptome data from the Health Professionals Follow-up Study (HPFS, discovery set) and Physicians' Health Study (PHS, validation set), we compared patients who died of prostate cancer or developed metastases ("lethal," n = 113) and patients with > 8 years of metastasis-free survival ("indolent," n = 291). Whole transcriptome tumor gene expression data were generated using an Affymetrix array. We applied a novel method for optimizing a partial area under the curve (pAUC) that up-weighted indolent cases with a predefined 80%-100% specificity. This method leverages weighted logistic lasso regression, with weights chosen via cross-validation to reduce overfitting. RESULTS: Median age at cancer diagnosis was 66 years; median follow-up for outcomes was 14 years. We identified a 40-gene transcriptome signature of indolent prostate cancer, which, compared to Gleason grade groups, improved the pAUC over the predefined 80%-100% specificity range by 1.72-fold (p < 0.001) and improved overall AUC from 0.85 to 0.93 (p < 0.001). The signature improved positive predictive value for indolent tumors > 2-fold with minimal decrease in negative predictive value. Importantly, the 40-gene signature showed high discrimination among intermediate Gleason 7 tumors (Grade groups 2 and 3, AUC 0.88, 95% CI: 0.79-0.95). CONCLUSION: Incorporating pAUC into prognostic signature development improved identification of prostate tumors with low risk of metastatic potential. Its clinical application may help reduce overtreatment and overdiagnosis of indolent prostate cancers, and the pAUC may be relevant beyond prostate cancer.

  • Debiased Machine Learning for Conformal Prediction of Counterfactual Outcomes Under Runtime Confounding

    arXiv (Cornell University) · 2026-04-04

    preprint · Open access · Senior author

    Data-driven decision making frequently relies on predicting counterfactual outcomes. In practice, researchers commonly train counterfactual prediction models on a source dataset to inform decisions on a possibly separate target population. Conformal prediction has arisen as a popular method for producing assumption-lean prediction intervals for counterfactual outcomes that would arise under different treatment decisions in the target population of interest. However, existing methods require that every confounding factor of the treatment-outcome relationship used for training on the source data is additionally measured in the target population, risking miscoverage if important confounders are unmeasured in the target population. In this paper, we introduce a computationally efficient debiased machine learning framework that allows for valid prediction intervals when only a subset of confounders is measured in the target population, a common challenge referred to as runtime confounding. Grounded in semiparametric efficiency theory, we show the resulting prediction intervals achieve desired coverage rates with faster convergence compared to standard methods. Through numerous synthetic and semi-synthetic experiments, we demonstrate the utility of our proposed method.

  • bayesNMF: Fast Bayesian Poisson NMF with Automatically Learned Rank Applied to Mutational Signatures

    Journal of Computational and Graphical Statistics · 2026-04-13

    article · Open access · Senior author

    Bayesian Poisson Non-Negative Matrix Factorization (NMF) is widely used to model count data, including in cancer mutational signature analysis. However, standard Gibbs samplers rely on computationally expensive Poisson augmentation, and current software implementations learn the latent rank either through slow and potentially subjective heuristic rank selection or with automatic approaches that do not report posterior uncertainty. In this paper, we introduce bayesNMF, an MH-within-Gibbs sampler to address both of these limitations. First, we define high-overlap proposals for Metropolis-Hastings sampling to remove the need for Poisson augmentation. Second, we define a BIC-based sparsity prior to learn rank automatically within the Bayesian formulation while allowing for posterior uncertainty quantification. We provide an open-source R software package with all of the models and plotting capabilities demonstrated in this paper on GitHub at jennalandy/bayesNMF. Although our applications focus on cancer mutational signatures, our software and results can be extended to any use of Bayesian Poisson NMF.

  • BreakLoops: A New Feature for the Multi-Gene, Multi-Cancer Family History-Based Model, Fam3Pro

    ArXiv.org · 2025-05-02

    preprint · Open access

    Previously, we presented PanelPRO, now known as Fam3PRO, an open-source R package for multi-gene, multi-cancer risk modeling with pedigree data. The initial release could not handle pedigrees that contained cyclic structures called loops, which occur when relatives mate. Here, we present a graph-based function called breakloops that can detect and break loops in any pedigree. The core algorithm identifies the optimal set of loop breakers when individuals in a loop have exactly one parental mating, and extends to handle cases where individuals have multiple parental matings. The algorithm transforms complex pedigrees by strategically creating clones of key individuals to disrupt cycles while minimizing computational complexity. Our extensive testing demonstrates that this new feature can handle a wide variety of pedigree structures. The breakloops function is available in Fam3Pro version 2.0.0. This advancement enables Fam3Pro to assess cancer risk in a wider range of family structures, enhancing its applicability in clinical settings.

  • Causal Inference for Latent Outcomes Learned with Factor Models

    ArXiv.org · 2025-06-25

    preprint · Open access · Senior author

    In many fields, including genomics, epidemiology, natural language processing, social and behavioral sciences, and economics, it is increasingly important to address causal questions in the context of factor models or representation learning. In this work, we investigate causal effects on latent outcomes derived from high-dimensional observed data using nonnegative matrix factorization. To the best of our knowledge, this is the first study to formally address causal inference in this setting. A central challenge is that estimating a latent factor model can cause an individual's learned latent outcome to depend on other individuals' treatments, thereby violating the standard causal inference assumption of no interference. We formalize this issue as learning-induced interference and distinguish it from interference present in a data-generating process. To address this, we propose a novel, intuitive, and theoretically grounded algorithm to estimate causal effects on latent outcomes while mitigating learning-induced interference and improving estimation efficiency. We establish theoretical guarantees for the consistency of our estimator and demonstrate its practical utility through simulation studies and an application to cancer mutational signature analysis. All baseline and proposed methods are available in our open-source R package, causalLFO.

  • Independent and Complementary Value of RNA Expression Signatures in High-Risk Multiple Myeloma

    Clinical Lymphoma Myeloma & Leukemia · 2025-09-01

    article

  • Bayesian Probit Multi-Study Non-negative Matrix Factorization for Mutational Signatures

    ArXiv.org · 2025-02-03

    preprint · Open access

    Mutational signatures are patterns of somatic mutations in tumor genomes that provide insights into underlying mutagenic processes and cancer origin. Developing reliable methods for their estimation is of growing importance in cancer biology. Somatic mutation data are often collected for different cancer types, highlighting the need for multi-study approaches that enable joint analysis in a principled and integrative manner. Despite significant advancements, statistical models tailored for analyzing the genomes of multiple cancer types remain underexplored. In this work, we introduce a Bayesian Multi-Study Non-negative Matrix Factorization (NMF) approach that uses mixture modeling to incorporate sparsity in the exposure weights of each subject to mutational signatures, allowing for individual tumor profiles to be represented by a subset rather than all signatures, and making this subset depend on covariates. This allows for a) more precise ability to identify meaningful contributions of mutational signatures at the individual level; b) estimation of the prevalence of activity of signatures within a cancer type, defined by the proportion of tumor profiles where a certain signature is present; and c) de-novo identification of interpretable patient subtypes based on the mutational signatures present within their mutational profile. We apply our approach to the mutational profiles of tumors from seven different cancer types, demonstrating its ability to accurately estimate mutational signatures while uncovering both individual and tissue-specific differences. An R package implementing our method is available at https://github.com/blhansen/BAPmultiNMF.

  • Multivariate Causal Effects: a Bayesian Causal Regression Factor Model

    arXiv (Cornell University) · 2025-04-04

    preprint · Open access

    The impact of wildfire smoke on air quality is a growing concern, contributing to air pollution through a complex mixture of chemical species with important implications for public health. While previous studies have primarily focused on its association with total particulate matter (PM2.5), the causal relationship between wildfire smoke and the chemical composition of PM2.5 remains largely unexplored. Exposure to these chemical mixtures plays a critical role in shaping public health, yet capturing their relationships requires advanced statistical methods capable of modeling the complex dependencies among chemical species. To fill this gap, we propose a Bayesian causal regression factor model that estimates the multivariate causal effects of wildfire smoke on the concentration of 27 chemical species in PM2.5 across the United States. Our approach introduces two key innovations: (i) a causal inference framework for multivariate potential outcomes, and (ii) a novel Bayesian factor model that employs a probit stick-breaking process as prior for treatment-specific factor scores. By focusing on factor scores, our method addresses the missing data challenge common in causal inference and enables a flexible, data-driven characterization of the latent factor structure, which is crucial to capture the complex correlation among multivariate outcomes. Through Monte Carlo simulations, we show the model's accuracy in estimating the causal effects in multivariate outcomes and characterizing the treatment-specific latent structure. Finally, we apply our method to US air quality data, estimating the causal effect of wildfire smoke on 27 chemical species in PM2.5, providing a deeper understanding of their interdependencies.

  • Bayesian multi-study non-negative matrix factorization for mutational signatures

    Genome Biology · 2025-04-16 · 4 citations

    article · Open access · Senior author

    Mutational signatures are typically identified from tumor genome sequencing data using non-negative matrix factorization (NMF). However, existing NMF techniques only decompose a single dataset, limiting rigorous comparisons of signatures across conditions. We propose a Bayesian NMF method that jointly decomposes multiple datasets to identify signatures and their sharing pattern across conditions. We propose a fully unsupervised "discovery-only" model and a semi-supervised "recovery-discovery" model that simultaneously estimates known and novel signatures, and extend both to estimate covariate effects. We demonstrate our approach on extensive simulations, and apply our method to answer questions related to colorectal cancer and early-onset breast cancer.

Frequent coauthors

  • Bert Vogelstein

    Howard Hughes Medical Institute

    422 shared
  • Victor E. Velculescu

    University of Baltimore

    413 shared
  • Kenneth W. Kinzler

    Johns Hopkins University

    411 shared
  • Levi Waldron

    City University of New York

    329 shared
  • D. Williams Parsons

    Altarum Institute

    320 shared
  • Curtis Huttenhower

    Harvard University

    278 shared
  • Siân Jones

    272 shared
  • Michael J. Birrer

    Winthrop Rockefeller Foundation

    236 shared

Education

  • PhD, Statistics

    Carnegie Mellon University

    1990