
Spencer V. Muse
Professor of Statistics; Director of Bioinformatics Graduate Program; Director of Statistics Undergraduate Program · Verified · North Carolina State University · Statistics
Active 1992–2025
Research topics
- Computer Science
- Artificial Intelligence
- Biology
- Statistics
- Mathematics
- Machine Learning
- Econometrics
- Evolutionary biology
- Genetics
- Economics
- Paleontology
- Algorithm
- Computational biology
- Mathematical analysis
Selected publications
Minus the Error: Testing for Positive Selection in the Presence of Residual Alignment Errors
eLife · 2025-06-26 · 1 citation
Preprint · Open access
Positive selection is an evolutionary process which increases the frequency of advantageous mutations because they confer a fitness benefit. Inferring the past action of positive selection on protein-coding sequences is fundamental for deciphering phenotypic diversity and the emergence of novel traits. With the advent of genome-wide comparative genomic datasets, researchers can analyze selection not only at the level of individual genes but also globally, delivering systems-level insights into evolutionary dynamics. However, genome-scale datasets are generated with automated pipelines and imperfect curation that does not eliminate all sequencing, annotation, and alignment errors. Positive selection inference methods are highly sensitive to such errors. We present BUSTED-E: a method designed to detect positive selection for amino acid diversification while concurrently identifying some alignment errors. This method builds on the flexible branch-site random effects model (BUSTED) for fitting distributions of dN/dS, with a critical modification: it incorporates an “error-sink” component to represent an abiological evolutionary regime. Using several genome-scale biological datasets that were extensively filtered using state-of-the-art automated alignment tools, we show that BUSTED-E identifies pervasive residual alignment errors, produces more realistic estimates of positive selection, reduces bias, and improves biological interpretation. The BUSTED-E model promises to be a more stringent filter to identify positive selection in genome-wide contexts, thus enabling further characterization and validation of the most biologically relevant cases.
Minus the Error: Testing for Positive Selection in the Presence of Residual Alignment Errors
bioRxiv (Cold Spring Harbor Laboratory) · 2024 · 8 citations
- Computer Science
- Artificial Intelligence
- Statistics
PLoS ONE · 2020 · 3 citations
Senior author · Corresponding
- Computer Science
- Statistics
- Mathematics
It is standard practice to model site-to-site variability of substitution rates by discretizing a continuous distribution into a small number, K, of equiprobable rate categories. We demonstrate that the variance of this discretized distribution has an upper bound determined solely by the choice of K and the mean of the distribution. This bound can introduce biases into statistical inference, especially when estimating parameters governing site-to-site variability of substitution rates. Applications to two large collections of sequence alignments demonstrate that this upper bound is often reached in analyses of real data. When parameter estimation is of primary interest, additional rate categories or more flexible modeling methods should be considered.
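The variance bound described in this abstract can be illustrated with a short sketch. It assumes the K rates are non-negative, in which case elementary algebra gives a maximum variance of (K - 1) times the squared mean, reached when all but one category has rate zero. The abstract does not state the form of the bound, so the formula and every function name below are illustrative assumptions, not the paper's own derivation or code.

```python
import random


def discrete_variance(rates):
    """Variance of a distribution placing probability 1/K on each of K rates."""
    k = len(rates)
    mean = sum(rates) / k
    return sum((r - mean) ** 2 for r in rates) / k


def variance_upper_bound(k, mean):
    """Assumed bound: largest variance of K equiprobable non-negative rates."""
    return (k - 1) * mean ** 2


def random_rates(k, mean, rng):
    """Draw K non-negative rates and rescale them to the requested mean."""
    raw = [rng.expovariate(1.0) for _ in range(k)]
    scale = mean * k / sum(raw)
    return [r * scale for r in raw]


rng = random.Random(0)
K, MEAN = 4, 1.0
bound = variance_upper_bound(K, MEAN)

# Largest variance seen over many random category sets with the same mean.
worst = max(discrete_variance(random_rates(K, MEAN, rng)) for _ in range(10_000))

# The extreme configuration: K - 1 categories at zero, one carrying all the rate.
extreme = discrete_variance([0.0] * (K - 1) + [K * MEAN])

print(f"bound={bound}, extreme={extreme}, worst random draw={worst:.4f}")
```

Under this assumption, increasing K raises the ceiling on the variance, which is consistent with the abstract's advice to consider additional rate categories when parameter estimation is the main goal.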
Molecular Biology and Evolution · 2020 · 103 citations
Senior author · Corresponding
- Machine Learning
- Computer Science
- Biology
Most molecular evolutionary studies of natural selection maintain the decades-old assumption that synonymous substitution rate variation (SRV) across sites within genes occurs at levels that are either nonexistent or negligible. However, numerous studies challenge this assumption from a biological perspective and show that SRV is comparable in magnitude to that of nonsynonymous substitution rate variation. We evaluated the impact of this assumption on methods for inferring selection at the molecular level by incorporating SRV into an existing method (BUSTED) for detecting signatures of episodic diversifying selection in genes. Using simulated data we found that failing to account for even moderate levels of SRV in selection testing is likely to produce intolerably high false positive rates. To evaluate the effect of the SRV assumption on actual inferences we compared results of tests with and without the assumption in an empirical analysis of over 13,000 Euteleostomi (bony vertebrate) gene alignments from the Selectome database. This exercise reveals that close to 50% of positive results (i.e., evidence for selection) in empirical analyses disappear when SRV is modeled as part of the statistical analysis and are thus candidates for being false positives. The results from this work add to a growing literature establishing that tests of selection are much more sensitive to certain model assumptions than previously believed.
HyPhy 2.5—A Customizable Platform for Evolutionary Hypothesis Testing Using Phylogenies
Molecular Biology and Evolution · 2019-08-25 · 761 citations
Article · Open access · Senior author
HYpothesis testing using PHYlogenies (HyPhy) is a scriptable, open-source package for fitting a broad range of evolutionary models to multiple sequence alignments, and for conducting subsequent parameter estimation and hypothesis testing, primarily in the maximum likelihood statistical framework. It has become a popular choice for characterizing various aspects of the evolutionary process: natural selection, evolutionary rates, recombination, and coevolution. The 2.5 release (available from www.hyphy.org) includes a completely re-engineered computational core and analysis library that introduces new classes of evolutionary models and statistical tests, delivers substantial performance and stability enhancements, improves usability, streamlines end-to-end analysis workflows, makes it easier to develop custom analyses, and is mostly backward compatible with previous HyPhy releases.
Molecular Biology and Evolution · 2017-12-30 · 1041 citations
Article · Open access
Inference of how evolutionary forces have shaped extant genetic diversity is a cornerstone of modern comparative sequence analysis. Advances in sequence generation and increased statistical sophistication of relevant methods now allow researchers to extract ever more evolutionary signal from the data, albeit at an increased computational cost. Here, we announce the release of Datamonkey 2.0, a completely re-engineered version of the Datamonkey web-server for analyzing evolutionary signatures in sequence data. For this endeavor, we leveraged recent developments in open-source libraries that facilitate interactive, robust, and scalable web application development. Datamonkey 2.0 provides a carefully curated collection of methods for interrogating coding-sequence alignments for imprints of natural selection, packaged as a responsive (i.e. can be viewed on tablet and mobile devices), fully interactive, and API-enabled web application. To complement Datamonkey 2.0, we additionally release HyPhy Vision, an accompanying JavaScript application for visualizing analysis results. HyPhy Vision can also be used separately from Datamonkey 2.0 to visualize locally executed HyPhy analyses. Together, Datamonkey 2.0 and HyPhy Vision showcase how scientific software development can benefit from general-purpose open-source frameworks. Datamonkey 2.0 is freely and publicly available at http://www.datamonkey.org, and the underlying codebase is available from https://github.com/veg/datamonkey-js.
The Computational Phyloinformatics Summer Course Wikis
Zenodo (CERN European Organization for Nuclear Research) · 2015-06-18
Article · Open access
These are snapshots of the 2008-2012 course wikis for the Computational Phyloinformatics Summer Courses. (There was no course wiki for the 2007 course.)
Computational Phyloinformatics Summer Course
Computational Phyloinformatics is a 10- to 14-day intensive summer workshop established at NESCent, but often co-sponsored and hosted at other institutions. The workshop aims to give biologists practical knowledge and hands-on programming skills in phyloinformatics. The curriculum changes from year to year, but has included PERL (BioPerl, BioPhyo), SQL (BioSQL, TreeBASE), JAVA (JEBL, PAL, Mesquite), R (Ape), HyPhy, and BioRuby.
Synopsis
Biologists are faced with ever-larger datasets, more complex evolutionary models, and increasingly elaborate analytical methods. Seldom is it sufficient to run a dataset with an off-the-shelf program on a desktop PC; increasingly, biologists need to write scripts to interface with internet services and databases, build analytical pipelines, customize analyses, and distribute computation over multiple processors. This course is designed for graduate students, postdocs, and researchers in phylogenetics interested in receiving practical, hands-on training in the use of scripting languages for solving phyloinformatics problems. Students will learn how to write basic phylogenetic or comparative analysis scripts, parse various data files, traverse and compute over trees, and make practical use of phylogenetic software libraries. These skills will be learned in a biological context, touching on a diverse array of topics such as analysis of large datasets, automation of supertree assembly, scripting multiple sequence alignment processing, gene duplication inference, and querying for topological patterns in large collections of trees. Participants leave the course with their laptops filled with working software and programming libraries to apply to their own research projects.
Current and Prior Workshops
- NESCent, Durham, NC, USA, 2007
- NESCent, Durham, NC, USA, 2008
- Instituto Gulbenkian, Lisbon, Portugal, 2009
- BGI-Shenzhen, China, 2010
- Kyoto, Japan, 2011
- Moscow, Russia, 2012
Training the Next Generation of Quantitative Biologists in the Era of Big Data
2014-11-01 · 5 citations
Article
The following sections are included: Workshop Focus, Workshop Contributions, and References.
Recent grants
NIH · $45k
HyPhy: comprehensive, fast, and user-friendly software for evolutionary analysis
NIH · $2.5M · 2010–2020
Frequent coauthors
- Sergei L. Kosakovsky Pond · Temple University · 19 shared
- Laura A. Katz · Smith College · 8 shared
- Edward S. Buckler · Cornell University · 6 shared
- Sadie R Wisotsky · Temple University · 6 shared
- Brandon S. Gaut · University of California, Irvine · 6 shared
- Andrew G. Clark · Cornell University · 5 shared
- Michael T. Clegg · University of California, Irvine · 5 shared
- Simon D. W. Frost · 4 shared