Jeffrey Thorne

Professor of Statistics · Verified

North Carolina State University · Statistics

Active 1992–2026

h-index: 31
Citations: 6.8k
Papers: 67 (10 in the last 5 years)
Funding: $19.7M

Research topics

  • Biology
  • Computer Science
  • Artificial Intelligence
  • Genetics
  • Evolutionary biology
  • Algorithm
  • Virology
  • Ecology
  • Computational biology

Selected publications

  • A deep-learning-based score to evaluate multiple sequence alignments

    bioRxiv (Cold Spring Harbor Laboratory) · 2026-02-05

    article · Open access

    Multiple sequence alignment (MSA) inference is a central task in molecular evolution and comparative genomics, and the reliability of downstream analyses, including phylogenetic inference, depends critically on alignment quality. Despite this importance, most widely used MSA methods optimize the sum-of-pairs (SP) score, and relatively little attention has been paid to whether this objective function accurately reflects alignment accuracy. Here, we evaluate the performance of the SP score using simulated and empirical benchmark alignments. For each dataset, we compare alternative MSAs derived from the same unaligned sequences and quantify the relationship between their SP scores and their distances from a reference alignment. We show that the alignment with the optimal SP score often does not correspond to the most accurate alignment. To address this limitation, we develop deep-learning-based scoring functions that integrate a collection of MSA features. We first introduce Model 1, a regression model that predicts the distance of a given MSA from the reference alignment. Across simulated and empirical datasets, this learned score correlates more strongly with true alignment accuracy than the SP score. However, Model 1 is less effective at identifying the best alignment among alternatives. We therefore develop Model 2, which takes as input a set of alternative MSAs generated from the same sequences and predicts their relative ranking. Model 2 more accurately identifies the top-ranking MSA than the SP score, Model 1, and several widely used alignment programs. Using simulations, we show that selecting MSAs based on our approach leads to more accurate phylogenetic reconstructions.

  • Using drift coefficients as a basis for inferring times, effective population sizes, and genetic adaptations

    Molecular Biology and Evolution · 2026-05-01

    article · Open access

    Genetic drift and gene flow can give rise to a complex population genetic structure. The inverse problem of estimating the genetic drift and gene flow in the past, based on the present-day genomic population structure, can be solved using an admixture graph. This describes differentiated local populations in terms of population splits and migrations between populations. The history and associated levels of genetic drift and admixture can be estimated based on the genome-wide single nucleotide polymorphism allele frequency data. Here, we present a set of statistical methods based on the admixture graph. Applying a prior on the stochastic variation of the effective population size decomposes the genetic drift values that are associated with the non-migration edges into the timings of the population splits and the effective population sizes at those times. This decomposition facilitates downstream analyses such as reconstruction of ancestral allele frequencies via a Brownian motion model with admixture. To trace changes in allele frequencies on a world map, we estimated the geographic locations of the ancestral populations using Brownian motion, the rate of which depends on the genetic drift values. Mapping the history of putative adaptations onto a world map can illuminate factors responsible for regional population heterogeneity. We investigated the effectiveness of detecting adaptations with a numerical simulation that mimics human population history, and by analyzing the expression quantitative trait loci of the melanocortin 1 receptor gene, which is involved in regulation of skin and hair pigmentation.

  • Likelihood-based evaluation of character recoding schemes for phylogenetic analysis

    bioRxiv (Cold Spring Harbor Laboratory) · 2025-10-16

    preprint · Open access · Senior author

    Character recoding is a common practice in evolutionary studies. For example, phylogenies can be inferred from protein-coding DNA sequences via 61-state codon substitution models. However, often the inference is instead done by adopting 20-state amino acid replacement models that do not explicitly consider synonymous substitution. When there is substantial heterogeneity of amino acid frequencies among sites and/or among lineages, another sort of character recoding is sometimes performed. In these cases, one option is to reduce the state space of the model by placing each of the 20 amino acids into one of a relatively small number (e.g., 6) of groups of amino acids and then modeling only how group membership changes at a site over evolutionary time. Unfortunately, these kinds of character recoding schemes are prone to reducing the amount of available evolutionary information. Here, we provide a likelihood framework to statistically assess recoding schemes. Although we concentrate on the recoding of 61-state codon substitution models into 20-state amino acid replacement models, the general approach is also relevant to other recoding schemes such as those that recode 20-state models into 6-state models.

  • Scalable Bayesian Divergence Time Estimation With Ratio Transformations

    Systematic Biology · 2023 · 12 citations

    • Computer Science
    • Artificial Intelligence
    • Biology

    Divergence time estimation is crucial to provide temporal signals for dating biologically important events from species divergence to viral transmissions in space and time. With the advent of high-throughput sequencing, recent Bayesian phylogenetic studies have analyzed hundreds to thousands of sequences. Such large-scale analyses challenge divergence time reconstruction by requiring inference on highly correlated internal node heights that often become computationally infeasible. To overcome this limitation, we explore a ratio transformation that maps the original $N-1$ internal node heights into a space of one height parameter and $N-2$ ratio parameters. To make the analyses scalable, we develop a collection of linear-time algorithms to compute the gradient and Jacobian-associated terms of the log-likelihood with respect to these ratios. We then apply Hamiltonian Monte Carlo sampling with the ratio transform in a Bayesian framework to learn the divergence times in 4 pathogenic viruses (West Nile virus, rabies virus, Lassa virus, and Ebola virus) and the coralline red algae. Our method both resolves a mixing issue in the West Nile virus example and improves inference efficiency by at least 5-fold for the Lassa and rabies virus examples as well as for the algae example. Our method now also makes it computationally feasible to incorporate mixed-effects molecular clock models for the Ebola virus example, confirms the findings from the original study, and reveals clearer multimodal distributions of the divergence times of some clades of interest.

  • Interlocus Gene Conversion, Natural Selection, and Paralog Homogenization

    Molecular Biology and Evolution · 2023-09-01 · 2 citations

    article · Open access

    Following a duplication, the resulting paralogs tend to diverge. While mutation and natural selection can accelerate this process, they can also slow it. Here, we quantify the paralog homogenization that is caused by point mutations and interlocus gene conversion (IGC). Among 164 duplicated teleost genes, the median percentage of postduplication codon substitutions that arise from IGC rather than point mutation is estimated to be between 7% and 8%. By differentiating between the nonsynonymous codon substitutions that homogenize the protein sequences of paralogs and the nonhomogenizing nonsynonymous substitutions, we estimate the homogenizing nonsynonymous rates to be higher for 163 of the 164 teleost data sets as well as for all 14 data sets of duplicated yeast ribosomal protein-coding genes that we consider. For all 14 yeast data sets, the estimated homogenizing nonsynonymous rates exceed the synonymous rates.

  • Correlations between alignment gaps and nucleotide substitution or amino acid replacement

    Proceedings of the National Academy of Sciences · 2022-08-16 · 1 citation

    article · Open access · Senior author

    To assess the conventional treatment in evolutionary inference of alignment gaps as missing data, we propose a simple nonparametric test of the null hypothesis that the locations of alignment gaps are independent of the nucleotide substitution or amino acid replacement process. When we apply the test to 1,390 protein alignments that are informed by protein tertiary structure and use a 5% significance level, the null hypothesis of independence between amino acid replacement and gap location is rejected for ∼65% of datasets. Via simulations that include substitution and insertion-deletion, we show that the test performs well with true alignments. When we simulate according to the null hypothesis and then apply the test to optimal alignments that are inferred by each of four widely used software packages, the null hypothesis is rejected too frequently. Via further simulations and analyses, we show that the overly frequent rejections of the null hypothesis are not solely due to weaknesses of widely used software for finding optimal alignments. Instead, our evidence suggests that optimal alignments are unrepresentative of true alignments and that biased evolutionary inferences may result from relying upon individual optimal alignments.

  • Convergent evolution of polyploid genomes from across the eukaryotic tree of life

    G3 Genes Genomes Genetics · 2022 · 21 citations

    • Biology
    • Evolutionary biology
    • Computational biology

    By modeling the homoeologous gene losses that occurred in 50 genomes deriving from ten distinct polyploidy events, we show that the evolutionary forces acting on polyploids are remarkably similar, regardless of whether they occur in flowering plants, ciliates, fishes, or yeasts. We show that many of the events show a relative rate of duplicate gene loss before the first postpolyploidy speciation that is significantly higher than in later phases of their evolution. The relatively weak selective constraint experienced by the single-copy genes these losses produced leads us to suggest that most of the purely selectively neutral duplicate gene losses occur in the immediate postpolyploid period. Nearly all of the events show strong evidence of biases in the duplicate losses, consistent with them being allopolyploidies, with 2 distinct progenitors contributing to the modern species. We also find ongoing and extensive reciprocal gene losses (alternative losses of duplicated ancestral genes) between these genomes. With the exception of a handful of closely related taxa, all of these polyploid organisms are separated from each other by tens to thousands of reciprocal gene losses. As a result, it is very unlikely that viable diploid hybrid species could form between these taxa, since matings between such hybrids would tend to produce offspring lacking essential genes. It is, therefore, possible that the relatively high frequency of recurrent polyploidies in some lineages may be due to the ability of new polyploidies to bypass reciprocal gene loss barriers.

  • Exome sequencing of hepatocellular carcinoma in lemurs identifies potential cancer drivers

    Evolution Medicine and Public Health · 2022-01-01 · 3 citations

    article · Open access

    Background and objectives: Hepatocellular carcinoma occurs frequently in prosimians, but the cause of these liver cancers in this group is unknown. Characterizing the genetic changes associated with hepatocellular carcinoma in prosimians may point to possible causes, treatments and methods of prevention, aiding conservation efforts that are particularly crucial to the survival of endangered lemurs. Although genomic studies of cancer in non-human primates have been hampered by a lack of tools, recent studies have demonstrated the efficacy of using human exome capture reagents across primates. Methodology: In this proof-of-principle study, we applied human exome capture reagents to tumor-normal pairs from five lemurs with hepatocellular carcinoma to characterize the mutational landscape of this disease in lemurs. Results: degradation and regulation. In addition to these similarities with human hepatocellular carcinoma, we also noted unique features, including six genes that contain mutations in all five lemurs. Interestingly, these genes are infrequently mutated in human hepatocellular carcinoma, suggesting potential differences in the etiology and/or progression of this cancer in lemurs and humans. Conclusions and implications: Collectively, this pilot study suggests that human exome capture reagents are a promising tool for genomic studies of cancer in lemurs and other non-human primates. Lay Summary: Hepatocellular carcinoma occurs frequently in prosimians, but the cause of these liver cancers is unknown. In this proof-of-principle study, we applied human DNA sequencing tools to tumor-normal pairs from five lemurs with hepatocellular carcinoma and compared the lemur mutation profiles to those of human hepatocellular carcinomas.

  • Pedigree-based and phylogenetic methods support surprising patterns of mutation rate and spectrum in the gray mouse lemur

    Heredity · 2021 · 54 citations

    • Biology
    • Genetics
    • Evolutionary biology

  • Measuring Phylogenetic Information of Incomplete Sequence Data

    Systematic Biology · 2021-09-01 · 3 citations

    article · Senior author

    Widely used approaches for extracting phylogenetic information from aligned sets of molecular sequences rely upon probabilistic models of nucleotide substitution or amino-acid replacement. The phylogenetic information that can be extracted depends on the number of columns in the sequence alignment and will be decreased when the alignment contains gaps due to insertion or deletion events. Motivated by the measurement of information loss, we suggest assessment of the effective sequence length (ESL) of an aligned data set. The ESL can differ from the actual number of columns in a sequence alignment because of the presence of alignment gaps. Furthermore, the estimation of phylogenetic information is affected by model misspecification. Inevitably, the actual process of molecular evolution differs from the probabilistic models employed to describe this process. This disparity means the amount of phylogenetic information in an actual sequence alignment will differ from the amount in a simulated data set of equal size, which motivated us to develop a new test for model adequacy. Via theory and empirical data analysis, we show how to disentangle the effects of gaps and model misspecification. By comparing the Fisher information of actual and simulated sequences, we identify which alignment sites and tree branches are most affected by gaps and model misspecification. [Fisher information; gaps; insertion; deletion; indel; model adequacy; goodness-of-fit test; sequence alignment.].

Frequent coauthors

  • Hirohisa Kishino

    Chuo University

    25 shared
  • Nick Goldman

    European Bioinformatics Institute

    12 shared
  • Tae‐Kun Seo

    Korea Polar Research Institute

    11 shared
  • David Jones

    Heidelberg University

    9 shared
  • Sang Chul Choi

    Sungshin Women's University

    7 shared
  • Joseph Felsenstein

    University of Washington

    6 shared
  • Jason A. Somarelli

    Duke University

    5 shared
  • Masami Hasegawa

    Toho University

    5 shared