
Yun William Yu
Associate Professor · Carnegie Mellon University · Ray and Stephanie Lane Computational Biology Department
Active 1998–2025
About
Yun William Yu is an Associate Professor in the Ray and Stephanie Lane Computational Biology Department at Carnegie Mellon University. His academic affiliation is within the School of Computer Science, and he is based at the Gates Hillman Center. The department offers a range of educational programs including Ph.D., M.S., and undergraduate degrees in Computational Biology, as well as an integrated master's degree in Computational Biology. The department emphasizes research and education in computational biology and automated science, providing various resources and programs to support students and faculty. Specific details about Professor Yu's research focus, background, or key contributions are not provided on the page.
Research topics
- Computer Science
- Artificial Intelligence
- Data Mining
- Medicine
- Internet privacy
- Business
- Biology
- Psychology
- Theoretical computer science
- World Wide Web
- Database
- Virology
Selected publications
Long-read reconstruction of many diverse haplotypes with devider
Genome Research · 2025-09-23 · 4 citations
article · Open access
Reconstructing exact haplotypes is important when sequencing a mixture of similar sequences. Long-read sequencing can connect distant alleles to disentangle similar haplotypes, but handling sequencing errors requires specialized techniques. Here, we present devider, an algorithm for haplotyping small sequences, such as viruses or genes, from long-read sequencing. devider uses a positional de Bruijn graph with sequence-to-graph alignment on an alphabet of informative alleles to provide a fast assembly-inspired approach compatible with various long-read sequencing technologies. On a synthetic Oxford Nanopore Technologies (ONT) long-read data set containing seven HIV strains, devider recovers 97% of the haplotype content and has the most accurate abundance estimates while taking <4 min and 1 GB of memory for >8000× coverage. Benchmarking on synthetic mixtures of antimicrobial-resistance (AMR) genes shows that devider recovers 83% of haplotypes, 23 percentage points higher than the next best method. On real Pacific Biosciences (PacBio) and ONT data sets, devider recapitulates previously known results in seconds, disentangling a bacterial community with more than 10 strains and an HIV-1 coinfection data set. We use devider to investigate the within-host diversity of a long-read bovine gut metagenome enriched for AMR genes, discovering 13 distinct haplotypes for a tet(Q) tetracycline-resistance gene with >18,000× coverage and six haplotypes for a CfxA2 beta-lactamase gene. We find clear recombination blocks for these AMR gene haplotypes, showcasing devider's ability to unveil evolutionary signals for heterogeneous mixtures.
KuPID: Kmer-based Upstream Preprocessing of Long Reads for Isoform Discovery
bioRxiv (Cold Spring Harbor Laboratory) · 2025-12-09
article · Open access · Senior author
Eukaryotic genes can encode multiple protein isoforms based on alternative splicing of their transcribed regions. Most modern novel isoform discovery methods function by identifying and assembling exon splice junctions from an RNAseq sample. However, splice junctions can only be accurately annotated with time-intensive dynamic programming alignment. This manuscript introduces KuPID, a method for preprocessing long RNAseq reads with the goal of better identifying novel isoform transcripts. KuPID utilizes kmer sketching as a pre-filter to quickly pseudo-align reads to known reference isoforms. Full alignment need only then be applied to reads that are most relevant to isoform discovery. Not only does KuPID speed up the discovery pipeline, it also increases downstream accuracy by filtering out extraneous reads. KuPID preprocessing simultaneously increases the F1 accuracy of isoform discovery pipelines by up to 16.7 points while decreasing the runtime by a factor of 2-3x. An optional mode permits a KuPID sample to be paired with both isoform discovery and transcript quantification. Code availability: https://github.com/mboro2000/KuPID.git.
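To make the pre-filtering idea in the abstract above concrete, here is a minimal Python sketch of a k-mer sketch filter. It is not KuPID's implementation. The function names (`sketch`, `containment`, `needs_full_alignment`), the choice of k, the sampling fraction, and the containment threshold are all hypothetical. It also assumes that the reads "most relevant to isoform discovery" are those that match no known isoform well, which the abstract does not state explicitly.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_hash(kmer):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=5, fraction=1.0):
    """Keep the k-mer hashes that fall below a threshold (a subsampled sketch).

    fraction=1.0 keeps every k-mer; smaller values keep a random-looking subset.
    """
    threshold = int(fraction * 2**64)
    return {h for h in map(kmer_hash, kmers(seq, k)) if h < threshold}


def containment(read_sketch, ref_sketch):
    """Fraction of the read's sketched k-mers that also occur in the reference."""
    if not read_sketch:
        return 0.0
    return len(read_sketch & ref_sketch) / len(read_sketch)


def needs_full_alignment(read, ref_sketches, k=5, fraction=1.0, min_containment=0.9):
    """Hypothetical rule: align only reads that no known isoform explains well."""
    read_sketch = sketch(read, k, fraction)
    best = max((containment(read_sketch, ref) for ref in ref_sketches), default=0.0)
    return best < min_containment
```

With a rule like this, a read that is an exact substring of a known isoform is skipped, and a read sharing no k-mers with any reference is passed on to the expensive aligner.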
Incorporating indel channels into average-case analysis of seed-chain-extend
ArXiv.org · 2025-12-04
preprint · Open access · Senior author
Given a sequence $s_1$ of $n$ letters drawn i.i.d. from an alphabet of size $σ$ and a mutated substring $s_2$ of length $m < n$, we often want to recover the mutation history that generated $s_2$ from $s_1$. Modern sequence aligners are widely used for this task, and many employ the seed-chain-extend heuristic with $k$-mer seeds. Previously, Shaw and Yu showed that optimal linear-gap cost chaining can produce a chain with $1 - O\left(\frac{1}{\sqrt{m}}\right)$ recoverability, the proportion of the mutation history that is recovered, in $O\left(mn^{2.43θ} \log n\right)$ expected time, where $θ < 0.206$ is the mutation rate under a substitution-only channel and $s_1$ is assumed to be uniformly random. However, a gap remains between theory and practice, since real genomic data includes insertions and deletions (indels), and yet seed-chain-extend remains effective. In this paper, we generalize those prior results by introducing mathematical machinery to deal with the two new obstacles introduced by indel channels: the dependence of neighboring anchors and the presence of anchors that are only partially correct. We are thus able to prove that the expected recoverability of an optimal chain is $\ge 1 - O\left(\frac{1}{\sqrt{m}}\right)$ and the expected runtime is $O\left(mn^{3.15 \cdot θ_T}\log n\right)$, when the total mutation rate given by the sum of the substitution, insertion, and deletion mutation rates ($θ_T = θ_i + θ_d + θ_s$) is less than $0.159$.
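For readers unfamiliar with seed-chain-extend, the first two stages can be sketched in a few lines of Python. This is an illustration only, not the construction analyzed in the paper. The names `find_anchors` and `chain_anchors`, the unit score per anchor, and the `gap_penalty` value are assumptions, and the quadratic-time dynamic program is the simplest possible chaining routine rather than an optimized one. The extend stage, which aligns the gaps between chained anchors, is omitted.

```python
from collections import defaultdict


def find_anchors(s1, s2, k):
    """Seed: return sorted (i, j) pairs with s1[i:i+k] == s2[j:j+k]."""
    index = defaultdict(list)
    for i in range(len(s1) - k + 1):
        index[s1[i:i + k]].append(i)
    anchors = []
    for j in range(len(s2) - k + 1):
        for i in index.get(s2[j:j + k], []):
            anchors.append((i, j))
    return sorted(anchors)


def chain_anchors(anchors, gap_penalty=0.1):
    """Chain: pick the best-scoring increasing subsequence of anchors.

    Each anchor scores 1; linking anchor a to a later anchor b costs
    gap_penalty * ((ib - ia) + (jb - ja)), a linear gap cost.
    Expects anchors sorted by (i, j), as returned by find_anchors.
    """
    if not anchors:
        return []
    n = len(anchors)
    score = [1.0] * n
    prev = [-1] * n
    for b in range(n):
        ib, jb = anchors[b]
        for a in range(b):
            ia, ja = anchors[a]
            if ia < ib and ja < jb:  # b must follow a in both sequences
                gap = (ib - ia) + (jb - ja)
                candidate = score[a] + 1.0 - gap_penalty * gap
                if candidate > score[b]:
                    score[b] = candidate
                    prev[b] = a
    best = max(range(n), key=score.__getitem__)
    chain = []
    while best != -1:  # backtrack from the best-scoring anchor
        chain.append(anchors[best])
        best = prev[best]
    return chain[::-1]
```

A substitution in the query destroys every k-mer that overlaps it, so the chain simply skips that stretch. The recoverability discussed in the abstract measures how much of the true alignment such a chain still covers.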
Devider: Long-Read Reconstruction of Many Diverse Haplotypes
Lecture notes in computer science · 2025-01-01
book-chapter
skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements
arXiv (Cornell University) · 2024-06-17
preprint · Open access · Senior author
Mobile genetic elements (MGEs) are as ubiquitous in nature as they are varied in type, ranging from viral insertions to transposons to incorporated plasmids. Horizontal transfer of MGEs across bacterial species may also pose a significant threat to global health due to their capability to harbour antibiotic resistance genes. However, despite cheap and rapid whole genome sequencing, the varied nature of MGEs makes it difficult to fully characterize them, and existing methods for detecting MGEs often don't agree on what should count. In this manuscript, we first define and argue in favor of a divergence-based characterization of mobile-genetic elements. Using that paradigm, we present skandiver, a tool designed to efficiently detect MGEs from whole genome assemblies without the need for gene annotation or markers. skandiver determines mobile elements via genome fragmentation, average nucleotide identity (ANI), and divergence time. By building on the scalable skani software for ANI computation, skandiver can query hundreds of complete assemblies against >65,000 representative genomes in a few minutes and 19 GB memory, providing a scalable and efficient method for elucidating mobile element profiles in incomplete, uncharacterized genomic sequences. For isolated and integrated large plasmids (>10 kbp), skandiver's recall was 48% and 47%, MobileElementFinder was 59% and 17%, and geNomad was 86% and 32%, respectively. For isolated large plasmids, skandiver's recall (48%) is lower than state-of-the-art reference-based methods geNomad (86%) and MobileElementFinder (59%). However, skandiver achieves higher recall on integrated plasmids and, unlike other methods, without comparing against a curated database, making skandiver suitable for discovery of novel MGEs. Availability: https://github.com/YoukaiFromAccounting/skandiver
Floria: Fast and accurate strain haplotyping in metagenomes
bioRxiv (Cold Spring Harbor Laboratory) · 2024-01-31
preprint · Open access · Senior author · Corresponding
Shotgun metagenomics allows for direct analysis of microbial community genetics, but scalable computational methods for the recovery of bacterial strain genomes from microbiomes remain a key challenge. We introduce Floria, a novel method designed for rapid and accurate recovery of strain haplotypes from short and long-read metagenome sequencing data, based on minimum error correction (MEC) read clustering and a strain-preserving network flow model. Floria can function as a standalone haplotyping method, outputting alleles and reads that co-occur on the same strain, as well as an end-to-end read-to-assembly pipeline (Floria-PL) for strain-level assembly. Benchmarking evaluations on synthetic metagenomes showed that Floria is >3× faster and recovers 21% more strain content than base-level assembly methods (Strainberry), while being over an order of magnitude faster when only phasing is required. Applying Floria to a set of 109 deeply sequenced nanopore metagenomes took <20 minutes on average per sample, and identified several species that have consistent strain heterogeneity. Applying Floria's short-read haplotyping to a longitudinal gut metagenomics dataset revealed a dynamic multi-strain Anaerostipes hadrus community with frequent strain loss and emergence events over 636 days. With Floria, accurate haplotyping of metagenomic datasets takes mere minutes on standard workstations, paving the way for extensive strain-level metagenomic analyses. Availability: Floria is available at https://github.com/bluenote-1577/floria, and the Floria-PL pipeline is available at https://github.com/jsgounot/Floria_analysis_workflow.
Fairy: fast approximate coverage for multi-sample metagenomic binning
Microbiome · 2024-08-14 · 14 citations
article · Open access · Senior author
Background: Metagenomic binning, the clustering of assembled contigs that belong to the same genome, is a crucial step for recovering metagenome-assembled genomes (MAGs). Contigs are linked by exploiting consistent signatures along a genome, such as read coverage patterns. Using coverage from multiple samples leads to higher-quality MAGs; however, standard pipelines require all-to-all read alignments for multiple samples to compute coverage, becoming a key computational bottleneck. Results: We present fairy (https://github.com/bluenote-1577/fairy), an approximate coverage calculation method for metagenomic binning. Fairy is a fast k-mer-based alignment-free method. For multi-sample binning, fairy can be >250× faster than read alignment and accurate enough for binning. Fairy is compatible with several existing binners on host and non-host-associated datasets. Using MetaBAT2, fairy recovers 98.5% of MAGs with >50% completeness and <5% contamination relative to alignment with BWA. Notably, multi-sample binning with fairy is always better than single-sample binning using BWA (>1.5× more >50% complete MAGs on average) while still being faster. For a public sediment metagenome project, we demonstrate that multi-sample binning recovers higher quality Asgard archaea MAGs than single-sample binning and that fairy's results are indistinguishable from read alignment. Conclusions: Fairy is a new tool for approximately and quickly calculating multi-sample coverage for binning, resolving a computational bottleneck for metagenomics.
devider: long-read reconstruction of many diverse haplotypes
bioRxiv (Cold Spring Harbor Laboratory) · 2024-11-08 · 1 citation
preprint · Open access
Reconstructing haplotypes is important when sequencing a mixture of similar sequences. Long-read sequencing can connect distant alleles to disentangle similar haplotypes, but handling sequencing errors requires specialized techniques. We present devider, an algorithm for haplotyping small sequences, such as viruses or genes, from long-read sequencing. devider uses a positional de Bruijn graph with sequence-to-graph alignment on an alphabet of informative alleles to provide a fast assembly-inspired approach compatible with various long-read sequencing technologies. On a synthetic Nanopore dataset containing seven HIV strains, devider recovered 97% of the haplotype content compared to 86% for the next best method while taking <4 minutes and 1 GB of memory for >8000× coverage. Benchmarking on synthetic mixtures of antimicrobial resistance (AMR) genes showed that devider recovered 83% of haplotypes, 23 percentage points higher than the next best method. On real PacBio and Nanopore datasets, devider recapitulates previously known results in seconds, disentangling a bacterial community with >10 strains and an HIV-1 co-infection dataset. We used devider to investigate the within-host diversity of a long-read bovine gut metagenome enriched for AMR genes, discovering 13 distinct haplotypes for a tet(Q) tetracycline resistance gene with >18,000× coverage and 6 haplotypes for a CfxA2 beta-lactamase gene. We found clear recombination blocks for these AMR gene haplotypes, showcasing devider's ability to unveil ecological signals for heterogeneous mixtures.
skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements
Bioinformatics · 2024-09-01 · 3 citations
article · Open access · Senior author
Motivation: Mobile genetic elements (MGEs) are as ubiquitous in nature as they are varied in type, ranging from viral insertions to transposons to incorporated plasmids. Horizontal transfer of MGEs across bacterial species may also pose a significant threat to global health due to their capability to harbor antibiotic resistance genes. However, despite cheap and rapid whole-genome sequencing, the varied nature of MGEs makes it difficult to fully characterize them, and existing methods for detecting MGEs often do not agree on what should count. In this manuscript, we first define and argue in favor of a divergence-based characterization of mobile-genetic elements. Results: Using that paradigm, we present skandiver, a tool designed to efficiently detect MGEs from whole-genome assemblies without the need for gene annotation or markers. skandiver determines mobile elements via genome fragmentation, average nucleotide identity (ANI), and divergence time. By building on the scalable skani software for ANI computation, skandiver can query hundreds of complete assemblies against >65 000 representative genomes in a few minutes and 19 GB memory, providing a scalable and efficient method for elucidating mobile element profiles in incomplete, uncharacterized genomic sequences. For isolated and integrated large plasmids (>10 kb), skandiver's recall was 48% and 47%, MobileElementFinder was 59% and 17%, and geNomad was 86% and 32%, respectively. For isolated large plasmids, skandiver's recall (48%) is lower than state-of-the-art reference-based methods geNomad (86%) and MobileElementFinder (59%). However, skandiver achieves higher recall on integrated plasmids and, unlike other methods, without comparing against a curated database, making skandiver suitable for discovery of novel MGEs. Availability and implementation: https://github.com/YoukaiFromAccounting/skandiver.
Rapid species-level metagenome profiling and containment estimation with sylph
Nature Biotechnology · 2024-10-08 · 82 citations
article · Open access · Senior author
Profiling metagenomes against databases allows for the detection and quantification of microorganisms, even at low abundances where assembly is not possible. We introduce sylph, a species-level metagenome profiler that estimates genome-to-metagenome containment average nucleotide identity (ANI) through zero-inflated Poisson k-mer statistics, enabling ANI-based taxa detection. On the Critical Assessment of Metagenome Interpretation II (CAMI2) Marine dataset, sylph was the most accurate profiling method of seven tested. For multisample profiling, sylph took >10-fold less central processing unit time compared to Kraken2 and used 30-fold less memory. Sylph's ANI estimates provided an orthogonal signal to abundance, allowing for an ANI-based metagenome-wide association study for Parkinson disease (PD) against 289,232 genomes while confirming known butyrate-PD associations at the strain level. Sylph took <1 min and 16 GB of random-access memory to profile metagenomes against 85,205 prokaryotic and 2,917,516 viral genomes, detecting 30-fold more viral sequences in the human gut compared to RefSeq. Sylph offers precise, efficient profiling with accurate containment ANI estimation even for low-coverage genomes.
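The link between k-mer containment and ANI mentioned in the abstract above can be illustrated with a short Python sketch. It relies on a standard simplification: if mutations are independent, a k-mer survives unmutated with probability ANI^k, so ANI is roughly containment^(1/k). The coverage correction shown is a simplified stand-in for the zero-inflated Poisson model. It assumes the effective coverage `lam` is already known, whereas a real profiler has to estimate it from k-mer multiplicities. The function names are hypothetical and this is not sylph's code.

```python
import math


def naive_containment_ani(shared, total, k):
    """ANI estimate from the fraction of genome k-mers seen in the sample."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (shared / total) ** (1.0 / k)


def coverage_adjusted_ani(shared, total, k, lam):
    """Correct containment for k-mers missed purely because coverage is low.

    Under Poisson(lam) coverage a present k-mer is sampled at least once with
    probability 1 - exp(-lam); dividing by that undoes the downward bias.
    """
    if total <= 0 or lam <= 0:
        raise ValueError("total and lam must be positive")
    detect_prob = 1.0 - math.exp(-lam)
    adjusted = min(1.0, (shared / total) / detect_prob)
    return adjusted ** (1.0 / k)
```

At an effective coverage of 0.5, only about 39% of a genome's k-mers are expected to appear even when the genome is present at 100% identity. The naive estimate then falls to roughly 97% ANI for k = 31, while the adjusted estimate stays close to 100%.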
Frequent coauthors
- 41 shared
Jim Shaw
University of Toronto
- 17 shared
Griffin M. Weber
- 14 shared
Bonnie Berger
Massachusetts Institute of Technology
- 12 shared
Niranjan Nagarajan
Agency for Science, Technology and Research
- 10 shared
Andrew Zheng
Emory University
- 8 shared
Daphne Ippolito
- 8 shared
Jean-Sébastien Gounot
Agency for Science, Technology and Research
- 8 shared
Hanrong Chen
Agency for Science, Technology and Research
Education
- 2017
PhD, Mathematics / Computer Science and AI Lab
Massachusetts Institute of Technology
- 2014
MPhil, Mathematics
Imperial College London
- 2010
MRes, Chemistry
Imperial College London
- 2009
BS/BA, Math, chemistry, Germanic studies
Indiana University Bloomington