Mark Gerstein

Albert L. Williams Professor of Molecular Biophysics and Biochemistry; Professor of Statistics and Data Science

Yale University · Department of Statistics and Data Science

Active 1991–2026

h-index: 217

Citations: 303.3k

Papers: 1.2k (340 in the last 5 years)

Funding: $218.6M (4 active)

About

Mark Gerstein is the Albert L. Williams Professor and Principal Investigator of the Gerstein Bioinformatics Group at Yale University. His research is in bioinformatics, with particular emphasis on developing computational methods for analyzing biological data. He takes data-driven approaches to understanding complex biological systems, with contributions spanning genomics, systems biology, and computational biology. He also mentors postdoctoral associates, graduate students, and undergraduates, and his role at Yale combines research with leadership in the bioinformatics community, bridging biology and computer science.

Research topics

Biology
Genetics
Computational biology
Computer Science
Evolutionary biology
Neuroscience
Machine Learning
Medicine
Biological system
Endocrinology
Biochemistry
Biophysics
Immunology
Mathematics
Pathology
Psychology
Materials science
Statistics
Chemistry
Cell biology
Psychiatry
Virology
Cancer research

Selected publications

Transcriptomic and phenotypic convergence of neurodevelopmental disorder risk genes in vitro and in vivo
Nature Neuroscience · 2026-04-24
article · Open access
Diverse risk genes have been identified for neurodevelopmental disorders (NDDs), but how these genes converge on similar biological pathways in neurons, and thus give rise to similar phenotypes, is unclear. Here we apply a pooled CRISPR approach to successfully target 23 NDD loss-of-function genes with roles in chromatin biology and examine convergent effects on gene expression across human induced pluripotent stem cell-derived neural progenitor cells, glutamatergic neurons and GABAergic neurons. Points of convergence vary between these cell types, with the greatest number of convergent genes and strongest convergent networks in mature glutamatergic neurons, where they broadly represent synaptic, epigenetic and, unexpectedly, mitochondrial pathways. The most convergent networks were observed between NDD genes with shared biological annotations, clinical associations and co-expression patterns in human post-mortem brain. Drugs that were predicted to reverse convergent transcriptomic signatures and/or arousal and sensory processing behaviors ameliorated behavioral phenotypes in zebrafish NDD gene mutants. These results suggest that convergent effects of NDD risk genes could provide clinically useful insights.
Interpretability and implicit model semantics in biomedicine and deep learning
Nature Machine Intelligence · 2026-03-23
article · Senior author · Corresponding
Dynamic convergence of neurodevelopment disorder risk genes: Seahorse Mito Stress and mitochondrial morphology datasets
Zenodo (CERN European Organization for Nuclear Research) · 2025-07-16
dataset · Open access
These data tables contain results and statistical analyses from the Seahorse Mito Stress assay and TOMM20 immunostaining for mitochondrial morphology. Detailed methods and final figures are available at DOI: https://doi.org/10.1101/2024.08.23.609190
The chronODE framework for modelling multi-omic time series with ordinary differential equations and machine learning
Nature Communications · 2025-08-19 · 2 citations
article · Open access · Senior author · Corresponding
Many genome-wide studies capture isolated moments in cell differentiation or organismal development. Conversely, longitudinal studies provide a more direct way to study these kinetic processes. Here, we present an approach for modeling gene-expression and chromatin kinetics from such studies: chronODE, an interpretable framework based on ordinary differential equations. chronODE incorporates two parameters that capture biophysical constraints governing the initial cooperativity and later saturation in gene expression. These parameters group genes into three major kinetic patterns: accelerators, switchers, and decelerators. Applying chronODE to bulk and single-cell time-series data from mouse brain development reveals that most genes (~87%) follow simple logistic kinetics. Among them, genes with rapid acceleration and high saturation values are rare, highlighting biochemical limitations that prevent cells from attaining both simultaneously. Early- and late-emerging cell types display distinct kinetic patterns, with essential genes ramping up faster. Extending chronODE to chromatin, we find that genes regulated by both enhancer and silencer cis-regulatory elements are enriched in brain-specific functions. Finally, we develop a bidirectional recurrent neural network to predict changes in gene expression from corresponding chromatin changes, successfully capturing the cumulative effect of multiple regulatory elements. Overall, our framework allows investigation of the kinetics of gene regulation in diverse biological systems.
FAVOR 2.0: A reengineered functional annotation of variants online resource for interpreting genomic variation
Nucleic Acids Research · 2025-10-21 · 1 citation
article · Open access
The Functional Annotation of Variants Online Resource (FAVOR), http://favor.genohub.org, is a whole genome variant annotation database and portal that provides comprehensive variant functional annotations of all possible variants across the genome. It can facilitate the analysis of whole-genome sequencing studies, support the interpretation of variant functional impacts, and help prioritize causal variants of diseases or traits. To support the growing popularity and expand the scope of FAVOR, we present here a substantial platform update. The new release features dramatically expanded annotations, a completely redesigned infrastructure powered by a newly implemented application programming interface (FAVOR-API), and a revamped web interface with advanced data-visualization capabilities and enhanced query performance. Key expansions include much more comprehensive variant annotations, including global, tissue- and cell-type-specific variant annotations; gene and protein annotations; support for both hg38 and hg19 reference genomes; and an interactive genome-browser for visualization of multi-faceted variant annotations. The updated platform also includes FAVOR-GPT, a large language model-powered interface for navigating the FAVOR database and interpreting results. FAVOR continues to evolve to keep pace with advances in research on interpreting the functional and phenotypic impact of genomic variation.
The IGVF catalog—from genetic variation to function
Nucleic Acids Research · 2025-12-08 · 3 citations
article · Open access
Genomic variation between individuals is essential for understanding how differences in the genome sequence affect molecular and cellular processes. The Impact of Genomic Variation on Function (IGVF) Consortium aims to uncover the relationships among genomic variation, genome function, and phenotypes by combining experimental techniques, such as single-cell mapping and genomic perturbation assays, with computational approaches such as machine learning-based predictive modeling. The IGVF Data and Administrative Coordinating Centers collect, analyze, and disseminate data and results from across the consortium through an open-source platform called the IGVF Catalog. This resource includes, but is not limited to, data on the effects of coding variants on protein abundance and function, noncoding variants on enhancer activity (measured by MPRA or predicted computationally), and associations between variants and quantitative traits. All data are organized within a graph database comprising over 50 types of data collections with nearly 3 billion nodes and over 7.5 billion edges. The Catalog offers public API endpoints (https://api.catalogkg.igvf.org/) and a user-friendly interface for exploring, querying, and visualizing the data at https://catalog.igvf.org. We expect that this open-access platform will support the broader scientific community to advance our understanding of how genomic variation influences biology and disease.
Epigenetic characterization of pseudogenes across human tissues
bioRxiv (Cold Spring Harbor Laboratory) · 2025-10-06
preprint · Open access · Senior author · Corresponding
Pseudogenes have historically been regarded as nonfunctional remnants of genome evolution. However, relative to other noncoding genomic elements, their promoter architecture and epigenetic regulation remain incompletely understood. Here, we systematically characterize pseudogene promoters and compare them with those of protein-coding genes and long noncoding RNAs. To do this, we integrate matched transcriptomic and epigenomic data across 26 human tissues from the EN-TEx (ENCODE-GTEx) project. We uniformly annotate promoters with chromatin features (histone modifications, chromatin accessibility, and DNA methylation), sequence motifs, and evolutionary conservation, generating an online catalog. Leveraging this catalog, we show that, across multiple tissues, transcribed, unprocessed pseudogenes exhibit chromatin patterns similar to those of active protein-coding genes. In contrast, transcribed, processed pseudogenes show a different pattern: most lack the canonical hallmarks of transcription (e.g., active histone marks) at their promoters. Instead, their promoters show increased overlap with LINE elements, enrichment for YY1-like binding motifs, and higher Hi-C contact frequency, particularly with distal enhancer-like regulatory regions. Together with their greater conservation (relative to unprocessed pseudogenes), these features suggest that the transcription of processed pseudogenes may require regulatory mechanisms distinct from canonical promoter-associated epigenetic activation.
Aerosol-based exposure to opportunistic pathogens originating from hospital sink drains
American Journal of Infection Control · 2025-11-04 · 3 citations
article
DNA shape and epigenomics distinguish the mechanistic origin of human genomic structural variations
Nucleic Acids Research · 2025-11-07 · 1 citation
article · Open access
The recent advent of long-read whole genome sequencing has enabled us to create an accurate telomere-to-telomere reference genome, construct pangenome graphs, and compile precise catalogs of genomic structural variations (SVs). These comprehensive SV repositories provide an excellent opportunity to explore the role of SVs in genotype-phenotype associations and examine the mechanisms by which SVs are introduced through double-strand break (DSB) repair. Here, we employed comprehensive SV catalogs identified through various short- and long-read whole genome sequencing efforts to infer the underlying mechanisms of SV introduction based on their genomic and epigenomic profiles. Our findings indicate that high local DNA methylation and DNA shape-related features, such as low variations in propeller twist, support the origins of homology-driven SVs. Subsequently, we utilized an active-learning-based unsupervised clustering approach, revealing that homology-dependent SVs show greater evidence of retaining ancestral recombination patterns compared to their homology-independent counterparts. Finally, our comparison of inherited and de novo SVs from healthy populations and rare disease cohorts showed distinct upstream H3K27me3 levels in de novo SVs from individuals with ultra-rare disorders. These findings highlight genome-wide characteristics that may influence the choice of repair mechanisms linked to heritable SV origins.
Complex genetic variation in nearly complete human genomes
Nature · 2025-07-23 · 67 citations
article · Open access
Diverse sets of complete human genomes are required to construct a pangenome reference and to understand the extent of complex structural variation. Here we sequence 65 diverse human genomes and build 130 haplotype-resolved assemblies (median continuity of 130 Mb), closing 92% of all previous assembly gaps [1,2] and reaching telomere-to-telomere status for 39% of the chromosomes. We highlight complete sequence continuity of complex loci, including the major histocompatibility complex (MHC), SMN1/SMN2, NBPF8 and AMY1/AMY2, and fully resolve 1,852 complex structural variants. In addition, we completely assemble and validate 1,246 human centromeres. We find up to 30-fold variation in α-satellite higher-order repeat array length and characterize the pattern of mobile element insertions into α-satellite higher-order repeat arrays. Although most centromeres predict a single site of kinetochore attachment, epigenetic analysis suggests the presence of two hypomethylated regions for 7% of centromeres. Combining our data with the draft pangenome reference [1] significantly enhances genotyping accuracy from short-read data, enabling whole-genome inference [3] to a median quality value of 45. Using this approach, 26,115 structural variants per individual are detected, substantially increasing the number of structural variants now amenable to downstream disease association studies.

Recent grants

Enhancing open data sharing for functional genomics experiments: Measures to quantify genomic information leakage and file formats for privacy preservation
NIH · $526k · 2020–2025
Genomic mosaicism in developing human brain
NIH · $3.5M · 2014–2019
Collaborative Proposal: ABI Innovation: A Graph Based Approach for the Genome Wide Prediction of Conditionally Essential Genes
NSF · $1.2M · 2017–2022
Biomedical Informatics and Data Science Training at Yale
NIH · $2.2M · 1987–2027
Methods and Software to Enhance Genomic Privacy and Sharing of RNA-Seq Data
NIH · $487k · 2016–2020

Frequent coauthors

Joel Rozowsky
Lieber Institute for Brain Development
301 shared
Mark A. Rubin
University of Bern
296 shared
M Snyder
286 shared
Andrea Sboner
Weill Cornell Medicine
215 shared
Rory Johnson
University Hospital of Bern
210 shared
Jan O. Korbel
German Cancer Research Center
207 shared
Lars Feuerbach
German Cancer Research Center
168 shared
Rajiv Dhir
Shadyside Hospital
158 shared

Labs

Gerstein Bioinformatics Group · PI

Education

PhD, Chemistry
University of Cambridge
1993
