
Mark Gerstein
Albert L. Williams Professor of Molecular Biophysics and Biochemistry; Professor of Statistics and Data Science · Yale University · Department of Statistics and Data Science
Active 1991–2026
About
Mark Gerstein is the Albert L. Williams Professor and Principal Investigator of the Gerstein Bioinformatics Group at Yale University. His research focuses on bioinformatics, particularly the development of computational methods for analyzing biological data. He uses data-driven approaches to study complex biological systems, contributing to genomics, systems biology, and computational biology. He also mentors postdoctoral associates, graduate students, and undergraduates. His role at Yale combines research with leadership in the bioinformatics community, bridging biology and computer science.
Research topics
- Biology
- Genetics
- Computational biology
- Computer Science
- Evolutionary biology
- Neuroscience
- Machine Learning
- Medicine
- Biological system
- Endocrinology
- Biochemistry
- Biophysics
- Immunology
- Mathematics
- Pathology
- Psychology
- Materials science
- Statistics
- Chemistry
- Cell biology
- Psychiatry
- Virology
- Cancer research
Selected publications
Nature Neuroscience · 2026-04-24
article · Open access
Diverse risk genes have been identified for neurodevelopmental disorders (NDDs), but how these genes converge on similar biological pathways in neurons, and thus give rise to similar phenotypes, is unclear. Here we apply a pooled CRISPR approach to successfully target 23 NDD loss-of-function genes with roles in chromatin biology and examine convergent effects on gene expression across human induced pluripotent stem cell-derived neural progenitor cells, glutamatergic neurons and GABAergic neurons. Points of convergence vary between these cell types, with the greatest number of convergent genes and strongest convergent networks in mature glutamatergic neurons, where they broadly represent synaptic, epigenetic and, unexpectedly, mitochondrial pathways. The most convergent networks were observed between NDD genes with shared biological annotations, clinical associations and co-expression patterns in human post-mortem brain. Drugs that were predicted to reverse convergent transcriptomic signatures and/or arousal and sensory processing behaviors ameliorated behavioral phenotypes in zebrafish NDD gene mutants. These results suggest that convergent effects of NDD risk genes could provide clinically useful insights.
Interpretability and implicit model semantics in biomedicine and deep learning
Nature Machine Intelligence · 2026-03-23
article · Senior author · Corresponding
Zenodo (CERN European Organization for Nuclear Research) · 2025-07-16
dataset · Open access
These data tables contain results and statistical analyses from the Seahorse Mito Stress assay and TOMM20 immunostaining for mitochondrial morphology. Detailed methods and final figures are available at DOI: https://doi.org/10.1101/2024.08.23.609190
Nature Communications · 2025-08-19 · 2 citations
article · Open access · Senior author · Corresponding
Many genome-wide studies capture isolated moments in cell differentiation or organismal development. Conversely, longitudinal studies provide a more direct way to study these kinetic processes. Here, we present an approach for modeling gene-expression and chromatin kinetics from such studies: chronODE, an interpretable framework based on ordinary differential equations. chronODE incorporates two parameters that capture biophysical constraints governing the initial cooperativity and later saturation in gene expression. These parameters group genes into three major kinetic patterns: accelerators, switchers, and decelerators. Applying chronODE to bulk and single-cell time-series data from mouse brain development reveals that most genes (~87%) follow simple logistic kinetics. Among them, genes with rapid acceleration and high saturation values are rare, highlighting biochemical limitations that prevent cells from attaining both simultaneously. Early- and late-emerging cell types display distinct kinetic patterns, with essential genes ramping up faster. Extending chronODE to chromatin, we find that genes regulated by both enhancer and silencer cis-regulatory elements are enriched in brain-specific functions. Finally, we develop a bidirectional recurrent neural network to predict changes in gene expression from corresponding chromatin changes, successfully capturing the cumulative effect of multiple regulatory elements. Overall, our framework allows investigation of the kinetics of gene regulation in diverse biological systems.
Nucleic Acids Research · 2025-10-21 · 1 citation
article · Open access
The Functional Annotation of Variants Online Resource (FAVOR), http://favor.genohub.org, is a whole genome variant annotation database and portal that provides comprehensive variant functional annotations of all possible variants across the genome. It can facilitate the analysis of whole-genome sequencing studies, support the interpretation of variant functional impacts, and help prioritize causal variants of diseases or traits. To support the growing popularity and expand the scope of FAVOR, we present here a substantial platform update. The new release features dramatically expanded annotations, a completely redesigned infrastructure powered by a newly implemented application programming interface (FAVOR-API), and a revamped web interface with advanced data-visualization capabilities and enhanced query performance. Key expansions include much more comprehensive variant annotations, including global, tissue- and cell-type-specific variant annotations; gene and protein annotations; support for both hg38 and hg19 reference genomes; and an interactive genome-browser for visualization of multi-faceted variant annotations. The updated platform also includes FAVOR-GPT, a large language model-powered interface for navigating the FAVOR database and interpreting results. FAVOR continues to evolve to keep pace with advances in research on interpreting the functional and phenotypic impact of genomic variation.
The IGVF catalog—from genetic variation to function
Nucleic Acids Research · 2025-12-08 · 3 citations
article · Open access
Genomic variation between individuals is essential for understanding how differences in the genome sequence affect molecular and cellular processes. The Impact of Genomic Variation on Function (IGVF) Consortium aims to uncover the relationships among genomic variation, genome function, and phenotypes by combining experimental techniques, such as single-cell mapping and genomic perturbation assays, with computational approaches such as machine learning-based predictive modeling. The IGVF Data and Administrative Coordinating Centers collect, analyze, and disseminate data and results from across the consortium through an open-source platform called the IGVF Catalog. This resource includes, but is not limited to, data on the effects of coding variants on protein abundance and function, noncoding variants on enhancer activity (measured by MPRA or predicted computationally), and associations between variants and quantitative traits. All data are organized within a graph database comprising over 50 types of data collections with nearly 3 billion nodes and over 7.5 billion edges. The Catalog offers public API endpoints (https://api.catalogkg.igvf.org/) and a user-friendly interface for exploring, querying, and visualizing the data at https://catalog.igvf.org. We expect that this open-access platform will support the broader scientific community to advance our understanding of how genomic variation influences biology and disease.
Epigenetic characterization of pseudogenes across human tissues
bioRxiv (Cold Spring Harbor Laboratory) · 2025-10-06
preprint · Open access · Senior author · Corresponding
Pseudogenes have historically been regarded as nonfunctional remnants of genome evolution. However, relative to other noncoding genomic elements, their promoter architecture and epigenetic regulation remain incompletely understood. Here, we systematically characterize pseudogene promoters and compare them with those of protein-coding genes and long noncoding RNAs. To do this, we integrate matched transcriptomic and epigenomic data across 26 human tissues from the EN-TEx (ENCODE-GTEx) project. We uniformly annotate promoters with chromatin features (histone modifications, chromatin accessibility, and DNA methylation), sequence motifs, and evolutionary conservation, generating an online catalog. Leveraging this catalog, we show that, across multiple tissues, transcribed, unprocessed pseudogenes exhibit chromatin patterns similar to those of active protein-coding genes. In contrast, transcribed, processed pseudogenes show a different pattern: most lack the canonical hallmarks of transcription (e.g., active histone marks) at their promoters. Instead, their promoters show increased overlap with LINE elements, enrichment for YY1-like binding motifs, and higher Hi-C contact frequency, particularly with distal enhancer-like regulatory regions. Together with their greater conservation (relative to unprocessed pseudogenes), these features suggest that the transcription of processed pseudogenes may require regulatory mechanisms distinct from canonical promoter-associated epigenetic activation.
Aerosol-based exposure to opportunistic pathogens originating from hospital sink drains
American Journal of Infection Control · 2025-11-04 · 3 citations
article
DNA shape and epigenomics distinguish the mechanistic origin of human genomic structural variations
Nucleic Acids Research · 2025-11-07 · 1 citation
article · Open access
The recent advent of long-read whole genome sequencing has enabled us to create an accurate telomere-to-telomere reference genome, construct pangenome graphs, and compile precise catalogs of genomic structural variations (SVs). These comprehensive SV repositories provide an excellent opportunity to explore the role of SVs in genotype-phenotype associations and examine the mechanisms by which SVs are introduced through double-strand break (DSB) repair. Here, we employed comprehensive SV catalogs identified through various short- and long-read whole genome sequencing efforts to infer the underlying mechanisms of SV introduction based on their genomic and epigenomic profiles. Our findings indicate that high local DNA methylation and DNA shape-related features, such as low variations in propeller twist, support the origins of homology-driven SVs. Subsequently, we utilized an active-learning-based unsupervised clustering approach, revealing that homology-dependent SVs show greater evidence of retaining ancestral recombination patterns compared to their homology-independent counterparts. Finally, our comparison of inherited and de novo SVs from healthy populations and rare disease cohorts showed distinct upstream H3K27me3 levels in de novo SVs from individuals with ultra-rare disorders. These findings highlight genome-wide characteristics that may influence the choice of repair mechanisms linked to heritable SV origins.
Complex genetic variation in nearly complete human genomes
Nature · 2025-07-23 · 67 citations
article · Open access
Diverse sets of complete human genomes are required to construct a pangenome reference and to understand the extent of complex structural variation. Here we sequence 65 diverse human genomes and build 130 haplotype-resolved assemblies (median continuity of 130 Mb), closing 92% of all previous assembly gaps [1,2] and reaching telomere-to-telomere status for 39% of the chromosomes. We highlight complete sequence continuity of complex loci, including the major histocompatibility complex (MHC), SMN1/SMN2, NBPF8 and AMY1/AMY2, and fully resolve 1,852 complex structural variants. In addition, we completely assemble and validate 1,246 human centromeres. We find up to 30-fold variation in α-satellite higher-order repeat array length and characterize the pattern of mobile element insertions into α-satellite higher-order repeat arrays. Although most centromeres predict a single site of kinetochore attachment, epigenetic analysis suggests the presence of two hypomethylated regions for 7% of centromeres. Combining our data with the draft pangenome reference [1] significantly enhances genotyping accuracy from short-read data, enabling whole-genome inference [3] to a median quality value of 45. Using this approach, 26,115 structural variants per individual are detected, substantially increasing the number of structural variants now amenable to downstream disease association studies.
Recent grants
NIH · $526k · 2020–2025
Genomic mosaicism in developing human brain
NIH · $3.5M · 2014–2019
NSF · $1.2M · 2017–2022
Biomedical Informatics and Data Science Training at Yale
NIH · $2.2M · 1987–2027
Methods and Software to Enhance Genomic Privacy and Sharing of RNA-Seq Data
NIH · $487k · 2016–2020
Frequent coauthors
- Joel Rozowsky · Lieber Institute for Brain Development · 301 shared
- Mark A. Rubin · University of Bern · 296 shared
- M Snyder · 286 shared
- Andrea Sboner · Weill Cornell Medicine · 215 shared
- Rory Johnson · University Hospital of Bern · 210 shared
- Jan O. Korbel · German Cancer Research Center · 207 shared
- Lars Feuerbach · German Cancer Research Center · 168 shared
- Rajiv Dhir · Shadyside Hospital · 158 shared
Labs
Gerstein Bioinformatics Group · PI
Education
- 1993 · PhD, Chemistry · University of Cambridge