Yun William Yu

Associate Professor · Verified

Carnegie Mellon University · Ray and Stephanie Lane Computational Biology Department

Active 1998–2025

h-index: 22
Citations: 1.8k
Papers: 98 · 72 last 5y

About

Yun William Yu is an Associate Professor in the Ray and Stephanie Lane Computational Biology Department at Carnegie Mellon University. His academic affiliation is within the School of Computer Science, and he is based at the Gates Hillman Center. The department offers a range of educational programs including Ph.D., M.S., and undergraduate degrees in Computational Biology, as well as an integrated Masters Degree in Computational Biology. The department emphasizes research and education in computational biology and automated science, providing various resources and programs to support students and faculty. Specific details about Professor Yu's research focus, background, or key contributions are not provided on the page.

Research topics

  • Computer Science
  • Artificial Intelligence
  • Data Mining
  • Medicine
  • Internet privacy
  • Business
  • Biology
  • Psychology
  • Theoretical computer science
  • World Wide Web
  • Database
  • Virology

Selected publications

  • Long-read reconstruction of many diverse haplotypes with devider

    Genome Research · 2025-09-23 · 4 citations

    article · Open access

    Reconstructing exact haplotypes is important when sequencing a mixture of similar sequences. Long-read sequencing can connect distant alleles to disentangle similar haplotypes, but handling sequencing errors requires specialized techniques. Here, we present devider, an algorithm for haplotyping small sequences, such as viruses or genes, from long-read sequencing. devider uses a positional de Bruijn graph with sequence-to-graph alignment on an alphabet of informative alleles to provide a fast assembly-inspired approach compatible with various long-read sequencing technologies. On a synthetic Oxford Nanopore Technologies (ONT) long-read data set containing seven HIV strains, devider recovers 97% of the haplotype content and has the most accurate abundance estimates while taking <4 min and 1 GB of memory for >8000× coverage. Benchmarking on synthetic mixtures of antimicrobial-resistance (AMR) genes shows that devider recovers 83% of haplotypes, 23 percentage points higher than the next best method. On real Pacific Biosciences (PacBio) and ONT data sets, devider recapitulates previously known results in seconds, disentangling a bacterial community with more than 10 strains and an HIV-1 coinfection data set. We use devider to investigate the within-host diversity of a long-read bovine gut metagenome enriched for AMR genes, discovering 13 distinct haplotypes for a tet(Q) tetracycline-resistance gene with >18,000× coverage and six haplotypes for a CfxA2 beta-lactamase gene. We find clear recombination blocks for these AMR gene haplotypes, showcasing devider's ability to unveil evolutionary signals for heterogeneous mixtures.

  • KuPID: Kmer-based Upstream Preprocessing of Long Reads for Isoform Discovery

    bioRxiv (Cold Spring Harbor Laboratory) · 2025-12-09

    article · Open access · Senior author

    Eukaryotic genes can encode multiple protein isoforms based on alternative splicing of their transcribed regions. Most modern novel isoform discovery methods function by identifying and assembling exon splice junctions from an RNAseq sample. However, splice junctions can only be accurately annotated with time-intensive dynamic programming alignment. This manuscript introduces KuPID, a method for preprocessing long RNAseq reads with the goal of better identifying novel isoform transcripts. KuPID utilizes kmer sketching as a pre-filter to quickly pseudo-align reads to known reference isoforms. Full alignment need only then be applied to reads that are most relevant to isoform discovery. Not only does KuPID speed up the discovery pipeline, it also increases downstream accuracy by filtering out extraneous reads. KuPID preprocessing simultaneously increases the f1 accuracy of isoform discovery pipelines by up to 16.7 points while decreasing the runtime by a factor of 2-3x. An optional mode permits a KuPID sample to be paired with both isoform discovery and transcript quantification. Code availability: https://github.com/mboro2000/KuPID.git.

  • Incorporating indel channels into average-case analysis of seed-chain-extend

    ArXiv.org · 2025-12-04

    preprint · Open access · Senior author

    Given a sequence $s_1$ of $n$ letters drawn i.i.d. from an alphabet of size $σ$ and a mutated substring $s_2$ of length $m < n$, we often want to recover the mutation history that generated $s_2$ from $s_1$. Modern sequence aligners are widely used for this task, and many employ the seed-chain-extend heuristic with $k$-mer seeds. Previously, Shaw and Yu showed that optimal linear-gap cost chaining can produce a chain with $1 - O\left(\frac{1}{\sqrt{m}}\right)$ recoverability, the proportion of the mutation history that is recovered, in $O\left(mn^{2.43θ} \log n\right)$ expected time, where $θ< 0.206$ is the mutation rate under a substitution-only channel and $s_1$ is assumed to be uniformly random. However, a gap remains between theory and practice, since real genomic data includes insertions and deletions (indels), and yet seed-chain-extend remains effective. In this paper, we generalize those prior results by introducing mathematical machinery to deal with the two new obstacles introduced by indel channels: the dependence of neighboring anchors and the presence of anchors that are only partially correct. We are thus able to prove that the expected recoverability of an optimal chain is $\ge 1 - O\Bigl(\frac{1}{\sqrt{m}}\Bigr)$ and the expected runtime is $O(mn^{3.15 \cdot θ_T}\log n)$, when the total mutation rate given by the sum of the substitution, insertion, and deletion mutation rates ($θ_T = θ_i + θ_d + θ_s$) is less than $0.159$.

  • Devider: Long-Read Reconstruction of Many Diverse Haplotypes

    Lecture notes in computer science · 2025-01-01

    book-chapter
  • skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements

    arXiv (Cornell University) · 2024-06-17

    preprint · Open access · Senior author

    Mobile genetic elements (MGEs) are as ubiquitous in nature as they are varied in type, ranging from viral insertions to transposons to incorporated plasmids. Horizontal transfer of MGEs across bacterial species may also pose a significant threat to global health due to their capability to harbour antibiotic resistance genes. However, despite cheap and rapid whole genome sequencing, the varied nature of MGEs makes it difficult to fully characterize them, and existing methods for detecting MGEs often don't agree on what should count. In this manuscript, we first define and argue in favor of a divergence-based characterization of mobile-genetic elements. Using that paradigm, we present skandiver, a tool designed to efficiently detect MGEs from whole genome assemblies without the need for gene annotation or markers. skandiver determines mobile elements via genome fragmentation, average nucleotide identity (ANI), and divergence time. By building on the scalable skani software for ANI computation, skandiver can query hundreds of complete assemblies against >65,000 representative genomes in a few minutes and 19 GB memory, providing scalable and efficient method for elucidating mobile element profiles in incomplete, uncharacterized genomic sequences. For isolated and integrated large plasmids (>10kbp), skandiver's recall was 48% and 47%, MobileElementFinder was 59% and 17%, and geNomad was 86% and 32%, respectively. For isolated large plasmids, skandiver's recall (48%) is lower than state-of-the-art reference-based methods geNomad (86%) and MobileElementFinder (59%). However, skandiver achieves higher recall on integrated plasmids and, unlike other methods, without comparing against a curated database, making skandiver suitable for discovery of novel MGEs. Availability: https://github.com/YoukaiFromAccounting/skandiver

  • Floria: Fast and accurate strain haplotyping in metagenomes

    bioRxiv (Cold Spring Harbor Laboratory) · 2024-01-31

    preprint · Open access · Senior author · Corresponding

    Shotgun metagenomics allows for direct analysis of microbial community genetics, but scalable computational methods for the recovery of bacterial strain genomes from microbiomes remains a key challenge. We introduce Floria, a novel method designed for rapid and accurate recovery of strain haplotypes from short and long-read metagenome sequencing data, based on minimum error correction (MEC) read clustering and a strain-preserving network flow model. Floria can function as a standalone haplotyping method, outputting alleles and reads that co-occur on the same strain, as well as an end-to-end read-to-assembly pipeline (Floria-PL) for strain-level assembly. Benchmarking evaluations on synthetic metagenomes showed that Floria is >3× faster and recovers 21% more strain content than base-level assembly methods (Strainberry), while being over an order of magnitude faster when only phasing is required. Applying Floria to a set of 109 deeply sequenced nanopore metagenomes took <20 minutes on average per sample, and identified several species that have consistent strain heterogeneity. Applying Floria’s short-read haplotyping to a longitudinal gut metagenomics dataset revealed a dynamic multi-strain Anaerostipes hadrus community with frequent strain loss and emergence events over 636 days. With Floria, accurate haplotyping of metagenomic datasets takes mere minutes on standard workstations, paving the way for extensive strain-level metagenomic analyses. Availability: Floria is available at https://github.com/bluenote-1577/floria, and the Floria-PL pipeline is available at https://github.com/jsgounot/Floria_analysis_workflow.

  • Fairy: fast approximate coverage for multi-sample metagenomic binning

    Microbiome · 2024-08-14 · 14 citations

    article · Open access · Senior author

    Background: Metagenomic binning, the clustering of assembled contigs that belong to the same genome, is a crucial step for recovering metagenome-assembled genomes (MAGs). Contigs are linked by exploiting consistent signatures along a genome, such as read coverage patterns. Using coverage from multiple samples leads to higher-quality MAGs; however, standard pipelines require all-to-all read alignments for multiple samples to compute coverage, becoming a key computational bottleneck. Results: We present fairy (https://github.com/bluenote-1577/fairy), an approximate coverage calculation method for metagenomic binning. Fairy is a fast k-mer-based alignment-free method. For multi-sample binning, fairy can be >250× faster than read alignment and accurate enough for binning. Fairy is compatible with several existing binners on host and non-host-associated datasets. Using MetaBAT2, fairy recovers 98.5% of MAGs with >50% completeness and <5% contamination relative to alignment with BWA. Notably, multi-sample binning with fairy is always better than single-sample binning using BWA (>1.5× more >50% complete MAGs on average) while still being faster. For a public sediment metagenome project, we demonstrate that multi-sample binning recovers higher quality Asgard archaea MAGs than single-sample binning and that fairy’s results are indistinguishable from read alignment. Conclusions: Fairy is a new tool for approximately and quickly calculating multi-sample coverage for binning, resolving a computational bottleneck for metagenomics.

  • devider: long-read reconstruction of many diverse haplotypes

    bioRxiv (Cold Spring Harbor Laboratory) · 2024-11-08 · 1 citation

    preprint · Open access

    Reconstructing haplotypes is important when sequencing a mixture of similar sequences. Long-read sequencing can connect distant alleles to disentangle similar haplotypes, but handling sequencing errors requires specialized techniques. We present devider, an algorithm for haplotyping small sequences, such as viruses or genes, from long-read sequencing. devider uses a positional de Bruijn graph with sequence-to-graph alignment on an alphabet of informative alleles to provide a fast assembly-inspired approach compatible with various long-read sequencing technologies. On a synthetic Nanopore dataset containing seven HIV strains, devider recovered 97% of the haplotype content compared to 86% for the next best method while taking <4 minutes and 1 GB of memory for >8000× coverage. Benchmarking on synthetic mixtures of antimicrobial resistance (AMR) genes showed that devider recovered 83% of haplotypes, 23 percentage points higher than the next best method. On real PacBio and Nanopore datasets, devider recapitulates previously known results in seconds, disentangling a bacterial community with >10 strains and an HIV-1 co-infection dataset. We used devider to investigate the within-host diversity of a long-read bovine gut metagenome enriched for AMR genes, discovering 13 distinct haplotypes for a tet(Q) tetracycline resistance gene with >18,000× coverage and 6 haplotypes for a CfxA2 beta-lactamase gene. We found clear recombination blocks for these AMR gene haplotypes, showcasing devider’s ability to unveil ecological signals for heterogeneous mixtures.

  • skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements

    Bioinformatics · 2024-09-01 · 3 citations

    article · Open access · Senior author

    Motivation: Mobile genetic elements (MGEs) are as ubiquitous in nature as they are varied in type, ranging from viral insertions to transposons to incorporated plasmids. Horizontal transfer of MGEs across bacterial species may also pose a significant threat to global health due to their capability to harbor antibiotic resistance genes. However, despite cheap and rapid whole-genome sequencing, the varied nature of MGEs makes it difficult to fully characterize them, and existing methods for detecting MGEs often do not agree on what should count. In this manuscript, we first define and argue in favor of a divergence-based characterization of mobile-genetic elements. Results: Using that paradigm, we present skandiver, a tool designed to efficiently detect MGEs from whole-genome assemblies without the need for gene annotation or markers. skandiver determines mobile elements via genome fragmentation, average nucleotide identity (ANI), and divergence time. By building on the scalable skani software for ANI computation, skandiver can query hundreds of complete assemblies against >65 000 representative genomes in a few minutes and 19 GB memory, providing scalable and efficient method for elucidating mobile element profiles in incomplete, uncharacterized genomic sequences. For isolated and integrated large plasmids (>10 kb), skandiver's recall was 48% and 47%, MobileElementFinder was 59% and 17%, and geNomad was 86% and 32%, respectively. For isolated large plasmids, skandiver's recall (48%) is lower than state-of-the-art reference-based methods geNomad (86%) and MobileElementFinder (59%). However, skandiver achieves higher recall on integrated plasmids and, unlike other methods, without comparing against a curated database, making skandiver suitable for discovery of novel MGEs. AVAILABILITY AND IMPLEMENTATION: https://github.com/YoukaiFromAccounting/skandiver.

  • Rapid species-level metagenome profiling and containment estimation with sylph

    Nature Biotechnology · 2024-10-08 · 82 citations

    article · Open access · Senior author

    Profiling metagenomes against databases allows for the detection and quantification of microorganisms, even at low abundances where assembly is not possible. We introduce sylph, a species-level metagenome profiler that estimates genome-to-metagenome containment average nucleotide identity (ANI) through zero-inflated Poisson k-mer statistics, enabling ANI-based taxa detection. On the Critical Assessment of Metagenome Interpretation II (CAMI2) Marine dataset, sylph was the most accurate profiling method of seven tested. For multisample profiling, sylph took >10-fold less central processing unit time compared to Kraken2 and used 30-fold less memory. Sylph's ANI estimates provided an orthogonal signal to abundance, allowing for an ANI-based metagenome-wide association study for Parkinson disease (PD) against 289,232 genomes while confirming known butyrate-PD associations at the strain level. Sylph took <1 min and 16 GB of random-access memory to profile metagenomes against 85,205 prokaryotic and 2,917,516 viral genomes, detecting 30-fold more viral sequences in the human gut compared to RefSeq. Sylph offers precise, efficient profiling with accurate containment ANI estimation even for low-coverage genomes.
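
Several of the abstracts above (sylph, skandiver, fairy, KuPID) rest on comparing k-mer sets rather than running full alignments. As a rough illustration only, the sketch below computes k-mer containment between two sequences and converts it to an approximate ANI under the common simplifying assumption that substitutions hit each base independently, so that containment is roughly ANI^k. The function names are hypothetical and this is not the code of any tool listed here; in particular it omits the subsampled sketching and the zero-inflated Poisson coverage correction that the sylph abstract describes.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query: str, reference: str, k: int) -> float:
    """Fraction of the query's distinct k-mers that also occur in the reference."""
    q = kmers(query, k)
    if not q:
        # Query shorter than k: no k-mers to compare.
        return 0.0
    return len(q & kmers(reference, k)) / len(q)


def containment_to_ani(c: float, k: int) -> float:
    """Approximate ANI from containment, assuming independent substitutions.

    A k-mer survives unmutated with probability ANI**k, so ANI ~ c**(1/k).
    """
    return c ** (1.0 / k)


if __name__ == "__main__":
    c = containment("ACGTTT", "ACGTAC", k=3)
    print(c, containment_to_ani(c, k=3))
```

In this toy case two of the query's four 3-mers appear in the reference, giving a containment of 0.5 and an approximate ANI of about 0.79. Real tools use much larger k and subsample k-mers so that the comparison scales to whole genomes.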

Frequent coauthors

  • Jim Shaw

    University of Toronto

    41 shared
  • Griffin M. Weber

    17 shared
  • Bonnie Berger

    Massachusetts Institute of Technology

    14 shared
  • Niranjan Nagarajan

    Agency for Science, Technology and Research

    12 shared
  • Andrew Zheng

    Emory University

    10 shared
  • Daphne Ippolito

    8 shared
  • Jean-Sébastien Gounot

    Agency for Science, Technology and Research

    8 shared
  • Hanrong Chen

    Agency for Science, Technology and Research

    8 shared

Education

  • PhD, Mathematics / Computer Science and AI Lab

    Massachusetts Institute of Technology

    2017
  • MPhil, Mathematics

    Imperial College London

    2014
  • MRes, Chemistry

    Imperial College London

    2010
  • BS/BA, Math, chemistry, Germanic studies

    Indiana University Bloomington

    2009