About
Joseph Thornton, PhD, is a Professor of Ecology and Evolution and a Professor of Human Genetics at the University of Chicago. His research focuses on understanding the mechanisms by which protein functions evolve, employing phylogenetic reconstruction of ancient proteins and experimental characterization of their biological functions and physical properties. His work involves a diverse team of evolutionary biologists, biochemists, biophysicists, computational biologists, geneticists, and molecular biologists, working collaboratively to address fundamental questions about the evolution of complex molecular systems, the role of mutations and epistasis, and the structural and genetic mechanisms underlying protein function and architecture. Thornton's research aims to elucidate how evolution shapes biological systems and their components, including proteins and molecular machines, and how these insights can inform biochemistry, biophysics, and molecular biology. His studies have contributed to understanding the evolution of protein architectures, the impact of evolutionary history on protein function, and the processes driving molecular self-assembly and structural diversity. His academic background includes a PhD in Biological Sciences from Columbia University, with postdoctoral work at Columbia University and the American Museum of Natural History. He has received numerous awards and honors, including the Guggenheim Foundation Fellowship, the Hans Falk Award, and the U.S. Presidential Early Career Award for Scientists and Engineers.
Research topics
- Biology
- Evolutionary biology
- Chemistry
- Computer Science
- Biochemistry
- Genetics
- Biophysics
- Statistics
- Mathematics
- Computational biology
Selected publications
Molecular Biology and Evolution · 2025-04-01 · 5 citations
article · Open access · Senior author
Ancestral sequence reconstruction is typically performed using homogeneous evolutionary models, which assume that the same substitution propensities affect all sites and lineages. These assumptions are routinely violated: heterogeneous structural and functional constraints favor different amino acids at different sites, and these constraints often change among lineages as epistatic substitutions accrue at other sites. To evaluate how violations of the homogeneity assumption affect ancestral sequence reconstruction under realistic conditions, we developed site-specific substitution models and parameterized them using data from deep mutational scanning experiments on three protein families; we then used these models to perform ancestral sequence reconstruction on the empirical alignments and on alignments simulated under heterogeneous conditions derived from the experiments. Extensive among-site and -lineage heterogeneity is present in these datasets, but the sequences reconstructed from empirical alignments are almost identical when heterogeneous or homogeneous models are used for ancestral sequence reconstruction. Using models fit to deep mutational scanning data from distantly related proteins in which mutational effects are very different also has a minimal impact on ancestral sequence reconstruction. The rare differences occur primarily where phylogenetic signal is weak: at fast-evolving sites and nodes connected by long branches. When ancestral sequence reconstruction is performed on simulated data, errors in the reconstructed sequences become more likely as branch lengths increase, but incorporating heterogeneity into the model does not improve accuracy. These data establish that ancestral sequence reconstruction is robust to unincorporated realistic forms of evolutionary heterogeneity, because the primary determinant of ancestral sequence reconstruction is phylogenetic signal, not the substitution model. The best way to improve accuracy is therefore not to develop more elaborate models but to apply ancestral sequence reconstruction to densely sampled alignments that maximize phylogenetic signal at the nodes of interest.
Nature Ecology & Evolution · 2025-07-25 · 4 citations
article · Senior author
Proceedings of the National Academy of Sciences · 2025-01-23 · 6 citations
article · Open access · Senior author
Many proteins form paralogous multimers: molecular complexes in which evolutionarily related proteins are arranged into specific quaternary structures. Little is known about the mechanisms by which they acquired their stoichiometry (the number of total subunits in the complex) and heterospecificity (the preference of subunits for their paralogs rather than other copies of the same protein). Here, we use ancestral protein reconstruction and biochemical experiments to study historical increases in stoichiometry and specificity during the evolution of vertebrate hemoglobin (Hb), an α₂β₂ heterotetramer that evolved from a homodimeric ancestor after a gene duplication. We show that the mechanisms for this evolutionary transition were simple. One hydrophobic substitution in subunit β after the gene duplication was sufficient to cause the ancestral dimer to homotetramerize with high affinity across a new interface. During this same interval, a single-residue deletion in subunit α at the older interface conferred specificity for the heterotetrameric form and the trans-orientation of subunits within it. These sudden transitions in stoichiometry and specificity were possible because the interfaces in Hb are isologous, binding via the same surface patch on interacting subunits, but rotated 180° relative to each other. This architecture amplifies the impacts of individual mutations on stoichiometry and specificity, especially in higher-order complexes, and allows single substitutions to differentially affect heteromeric and homomeric interactions. Our findings suggest that elaborate and specific symmetrical molecular complexes may often evolve via simple genetic and physical mechanisms.
bioRxiv (Cold Spring Harbor Laboratory) · 2025-01-29
preprint · Open access · Senior author · Corresponding
Some phenotypes are more likely to be produced by mutation than others, but the causal role of these propensities in the evolution of extant phenotypic diversity remains unclear. There are two major challenges: it is difficult to separate the effect of the genotype-phenotype (GP) map from that of natural selection in causing natural patterns of diversity, and most extant phenotypes evolved long ago in species whose GP maps cannot be recovered. Using reconstructed ancestral transcription factors as a model to address this problem, we created libraries containing all possible amino acid combinations at historically variable sites in the proteins' DNA binding interface (the genotypes) and measured their capacity to bind specifically to response elements containing all possible combinations of nucleotides at historically variable sites in the DNA (the phenotypes). The ancestral proteins we used existed during an ancient phylogenetic interval when a new phenotype (specificity for a new response element) evolved. We found that the two ancestral GP maps were strongly anisotropic (the distribution of phenotypes encoded by genotypes is highly nonuniform) and heterogeneous (the phenotypes accessible around each genotype vary dramatically among genotypes), but the extent and direction of these properties differed between the maps. In both cases, these properties steered evolution toward the lineage-specific phenotypes that evolved during history. Our findings establish that ancient properties of the GP relationship were causal factors in the evolutionary process that produced the present-day patterns of functional conservation and diversity in this protein family.
Epistatic drift in protein evolution
Current Opinion in Genetics & Development · 2025-11-13 · 2 citations
review · Open access · Senior author · Corresponding
New methods are revealing the character of epistatic interactions within proteins and their impacts on evolution. Variation in biochemical phenotypes across protein sequences is determined primarily by the context-independent effects of amino acids and global nonlinearities imposed by biophysical mechanisms. Specific epistasis, primarily pairwise interactions, plays a subsidiary role, but collectively has a major impact on evolution. Every substitution in an evolving protein changes the effects of many potential mutations at epistatically coupled sites. As homologs diverge from common ancestors, the constraints that determine the accessibility of subsequent mutations gradually drift apart. Opportunities for adaptation and functional innovation also change over time, because each substitution epistatically modifies the effects of mutations on existing and new protein phenotypes. Over moderate evolutionary timescales, the outcomes of protein evolution (both their sequences and biochemical properties) thus become strongly contingent on the substitutions that happen to occur in each lineage. This interplay between random chance and each protein's epistatic architecture helps explain widely observed lineage-specific patterns of conservation and variation that are not expected under the dominant schools of thought in molecular evolution.
bioRxiv (Cold Spring Harbor Laboratory) · 2024-09-20
preprint · Open access · Senior author · Corresponding
We recently reanalyzed 20 combinatorial mutagenesis datasets using a novel reference-free analysis (RFA) method and showed that high-order epistasis contributes negligibly to protein sequence-function relationships in every case. Dupic, Phillips, and Desai (DPD) commented on a preprint of our work. In our published paper, we addressed all the major issues they raised, but we respond directly to them here. 1) DPD's claim that RFA is equivalent to estimating reference-based analysis (RBA) models by regression neglects fundamental differences in how the two formalisms dissect the causal relationship between sequence and function. It also misinterprets the observation that using regression to estimate any truncated model of genetic architecture will always yield the same predicted phenotypes and variance partition; the resulting estimates correspond to those of the RFA formalism but are inaccurate representations of the true RBA model. 2) DPD's claim that high-order epistasis is widespread and significant while somehow explaining little phenotypic variance is an artifact of two strong biases in the use of regression to estimate RBA models: this procedure underestimates the phenotypic variance explained by RBA epistatic terms while at the same time inflating the magnitude of individual terms. 3) DPD erroneously claim that RFA is "exactly equivalent" to Fourier analysis (FA) and background-averaged analysis (BA). This error arises because DPD used an incorrect mathematical definition of RFA and were misled by a simple numerical relationship among the models that holds only for the simplest kinds of datasets. 4) DPD argue that using a nonlinear transformation to account for global nonlinearities in sequence-function relationships is often unnecessary and may artifactually absorb specific epistatic interactions. We show that nonspecific epistasis caused by a limited dynamic range affects datasets of all types, even when the phenotype is represented on a free-energy scale. Moreover, using a nonlinear transformation in a joint fitting procedure does not underestimate specific epistasis under realistic conditions, even if the data are not affected by nonspecific epistasis. The conclusions of our work therefore hold: the genetic architecture of all 20 protein datasets we analyzed can be efficiently and accurately described in an RFA framework by first-order amino acid effects and pairwise interactions with a simple model of global nonlinearity. We are grateful for DPD's commentary, which helped us improve our paper.
Author response: Epistasis facilitates functional evolution in an ancient transcription factor
2024-05-20
peer-review · Open access · Senior author
A complete mutational scan of a protein-DNA interface shows that pairwise epistatic interactions among amino acids determine a transcription factor's specificity for DNA and facilitate the evolution of new functions.
The simplicity of protein sequence-function relationships
Nature Communications · 2024-09-11 · 53 citations
article · Open access · Senior author
How complex are the rules by which a protein’s sequence determines its function? High-order epistatic interactions among residues are thought to be pervasive, suggesting an idiosyncratic and unpredictable sequence-function relationship. But many prior studies may have overestimated epistasis, because they analyzed sequence-function relationships relative to a single reference sequence (which causes measurement noise and local idiosyncrasies to snowball into high-order epistasis) or they did not fully account for global nonlinearities. Here we present a reference-free method that jointly infers specific epistatic interactions and global nonlinearity using a bird’s-eye view of sequence space. This technique yields the simplest explanation of sequence-function relationships and is more robust than existing methods to measurement noise, missing data, and model misspecification. We reanalyze 20 experimental datasets and find that context-independent amino acid effects and pairwise interactions, along with a simple nonlinearity to account for limited dynamic range, explain a median of 96% of phenotypic variance and over 92% in every case. Only a tiny fraction of genotypes are strongly affected by higher-order epistasis. Sequence-function relationships are also sparse: a minuscule fraction of amino acids and interactions account for 90% of phenotypic variance. Sequence-function causality across these datasets is therefore simple, opening the way for tractable approaches to characterize proteins’ genetic architecture.
Editor's summary: Understanding protein sequence-function relationships is complicated by high-order epistatic interactions among residues, although the extent of these interactions remains uncertain. Here, the authors present a reference-free method which suggests that sequence-function relationships are relatively simple, with little influence from high-order epistatic interactions.
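The idea of measuring effects against the whole of sequence space, rather than against one reference sequence, can be illustrated with a toy calculation. The sketch below is not the authors' implementation: it assumes a complete combinatorial library, omits the global nonlinearity and the pairwise terms described in the abstract, and the function names (`first_order_effects`, `variance_explained`) are hypothetical. Each state's effect is taken as the mean phenotype of all genotypes carrying that state minus the mean over all genotypes.

```python
from statistics import mean


def first_order_effects(phenotypes):
    """Estimate reference-free first-order effects.

    phenotypes: dict mapping a genotype (tuple of states, one per site)
    to its measured phenotype. Assumes every combination is present.
    Returns the global mean and a dict {(site, state): effect}.
    """
    grand_mean = mean(phenotypes.values())
    n_sites = len(next(iter(phenotypes)))
    effects = {}
    for site in range(n_sites):
        for state in {g[site] for g in phenotypes}:
            # Average over every genotype carrying this state at this site.
            carriers = [y for g, y in phenotypes.items() if g[site] == state]
            effects[(site, state)] = mean(carriers) - grand_mean
    return grand_mean, effects


def variance_explained(phenotypes, grand_mean, effects):
    """Fraction of phenotypic variance captured by the first-order model."""
    ss_total = sum((y - grand_mean) ** 2 for y in phenotypes.values())
    ss_resid = 0.0
    for genotype, y in phenotypes.items():
        predicted = grand_mean + sum(
            effects[(site, state)] for site, state in enumerate(genotype)
        )
        ss_resid += (y - predicted) ** 2
    return 1 - ss_resid / ss_total


# Toy two-site, two-state library with a small pairwise interaction in "BB".
toy = {("A", "A"): 0.0, ("A", "B"): 2.0, ("B", "A"): 1.0, ("B", "B"): 4.0}
mu, fx = first_order_effects(toy)
print(mu, fx[(0, "B")], variance_explained(toy, mu, fx))
```

In this toy library the first-order terms alone capture most of the variance; the remainder is attributable to the interaction in genotype `("B", "B")`.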
Epistasis facilitates functional evolution in an ancient transcription factor
eLife · 2024-05-20 · 25 citations
article · Open access · Senior author
A protein’s genetic architecture – the set of causal rules by which its sequence produces its functions – also determines its possible evolutionary trajectories. Prior research has proposed that the genetic architecture of proteins is very complex, with pervasive epistatic interactions that constrain evolution and make function difficult to predict from sequence. Most of this work has analyzed only the direct paths between two proteins of interest – excluding the vast majority of possible genotypes and evolutionary trajectories – and has considered only a single protein function, leaving unaddressed the genetic architecture of functional specificity and its impact on the evolution of new functions. Here, we develop a new method based on ordinal logistic regression to directly characterize the global genetic determinants of multiple protein functions from 20-state combinatorial deep mutational scanning (DMS) experiments. We use it to dissect the genetic architecture and evolution of a transcription factor’s specificity for DNA, using data from a combinatorial DMS of an ancient steroid hormone receptor’s capacity to activate transcription from two biologically relevant DNA elements. We show that the genetic architecture of DNA recognition consists of a dense set of main and pairwise effects that involve virtually every possible amino acid state in the protein-DNA interface, but higher-order epistasis plays only a tiny role. Pairwise interactions enlarge the set of functional sequences and are the primary determinants of specificity for different DNA elements. They also massively expand the number of opportunities for single-residue mutations to switch specificity from one DNA target to another. By bringing variants with different functions close together in sequence space, pairwise epistasis therefore facilitates rather than constrains the evolution of new functions.
Author Response: Epistasis facilitates functional evolution in an ancient transcription factor
2024-03-22
peer-review · Open access · Senior author
Recent grants
NIH · $1.6M · 2017–2021
NIH · $253k · 2007
NSF · $19k · 2015–2018
CAREER: Molecular Evolution of Steroid Hormone Receptor Function And Interactions
NSF · $912k · 2006–2012
NIH · $904k · 2019–2023
Frequent coauthors
- 137 shared
Michael Berkwits
- 137 shared
Teresa Omiotek
Advisory Board Company (United States)
- 137 shared
Amy Evers
University of Pennsylvania Health System
- 137 shared
Judith Literskis
Johns Hopkins University
- 137 shared
Patricia Panek
Perspectives Charter School
- 137 shared
Annette Flanagin
- 137 shared
Maria Duda
- 137 shared
Fred Furtner
International Rescue Committee
Awards & honors
- Steenbock Distinguished Lecturer in Biochemistry University…
- Distinguished Alumni Lectureship University of Queensland Sc…
- Guggenheim Foundation Fellowship (2014)
- Hans Falk Award National Institute for Environmental Health…
- Richard Jones Investigator Award Oregon Medical Research Fou…