Resume-aware faculty matching

Find professors who actually fit you

Upload your resume. Four AI agents analyze your background, rank the faculty who fit, inspect their recent research, and help you draft outreach — grounded in their actual work, not templates.

Free to startNo credit cardCancel anytime
Top matches Balanced preset
Dr. Sarah Chen
Stanford · Interpretability · NLP
91
Dr. Marcus Holloway
MIT · Robotics · RL
84
Dr. Aisha Okonkwo
CMU · Fairness · HCI
82
Nova · Professor Researcher · re-ranking top 20…
Carl Kingsford

Carl Kingsford

· Herbert A. Simon Professor and Co-Director of the Joint Carnegie Mellon-University of Pittsburgh Ph.D Program in Computational Biology

Carnegie Mellon University · Ray and Stephanie Lane Computational Biology Department

Active 2000–2026

h-index40
Citations18.9k
Papers19857 last 5y
Funding$7.2M2 active
See your match with Carl Kingsford — sign in to PhdFit.Sign in

About

Carl Kingsford is the Herbert A. Simon Professor of Computer Science in the Ray and Stephanie Lane Computational Biology Department at Carnegie Mellon University. He is recognized as a trailblazer in computational molecular biology, showcasing sustained innovation in scalable algorithmic approaches. His research focuses on the development of computational methods and algorithms for biological data analysis, including genome graph construction, sequence analysis, and the study of genomic variation. Kingsford's contributions have significantly advanced the understanding of genome structure and function through algorithmic innovations, and he has been honored as an ISCB Fellow for his impactful work in the field.

Research topics

  • Computer Science
  • Biology
  • Artificial Intelligence
  • Computational biology
  • Data Mining
  • Anatomy
  • Cell biology
  • Algorithm

Selected publications

  • Data-driven AI system for learning how to run transcript assemblers

    Genome biology · 2026-05-12

    articleOpen accessSenior authorCorresponding

    We introduce AutoTuneX, a data-driven, AI system designed to automatically predict optimal parameters for transcript assemblers - tools for reconstructing transcripts from the reads in a given RNA-seq sample. AutoTuneX is built by learning parameter knowledge from existing RNA-seq samples and transferring this knowledge to unseen samples. On 1588 human RNA-seq samples tested with two transcript assemblers, AutoTuneX predicts parameters that resulted in 98% of samples achieving more accurate transcript assembly compared to using default parameters, with some samples experiencing up to a 600% improvement in AUC. AutoTuneX offers a new strategy for automatically optimizing use of sequence analysis tools.

  • CodonMoE: DNA Language Models for Codon-Dependent mRNA Prediction

    Bioinformatics · 2026-05-05

    articleOpen accessSenior author

    MOTIVATION: Genomic language models (gLMs) face a fundamental efficiency challenge: one must either maintain separate specialized models for each biological modality (DNA and RNA) or develop large multi-modal architectures. Both approaches impose significant computational burdens-modality-specific models require redundant infrastructure despite inherent biological connections, while multi-modal architectures demand increased parameter counts and extensive cross-modality pretraining. RESULTS: To address this limitation, we introduce CodonMoE (Adaptive Mixture of Codon Reformative Experts), a lightweight adapter that transforms DNA language models into effective RNA analyzers without RNA-specific pretraining. Our theoretical analysis establishes CodonMoE as a universal approximator at the codon level, capable of mapping arbitrary functions from codon sequences to codon-dependent RNA properties given sufficient expert capacity. Across four RNA prediction tasks spanning stability, expression, and regulation, DNA models augmented with CodonMoE significantly outperform their unmodified counterparts, with the HyenaDNA+CodonMoE series achieving state-of-the-art results using 80% fewer parameters than specialized RNA models. By maintaining sub-quadratic complexity while achieving superior performance, our approach provides a principled path toward unifying genomic language modeling, leveraging more abundant DNA data and reducing computational overhead while preserving modality-specific performance advantages. AVAILABILITY: Source code for the method and to reproduce the results is available at https://github.com/Kingsford-Group/CodonMoE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

  • Turnpike with Uncertain Measurements: Triangle-Equality ILP with a Deterministic Recovery Guarantee

    ArXiv.org · 2026-03-18

    articleOpen accessSenior author

    We study Turnpike with uncertain measurements: reconstructing a one-dimensional point set from an unlabeled multiset of pairwise distances under bounded noise and rounding. We give a combinatorial characterization of realizability via a multi-matching that labels interval indices by distinct distance values while satisfying all triangle equalities. This yields an ILP based on the triangle equality whose constraint structure depends only on the two-partition set $\mathcal{P}_y=\{(r,s,t): y_r+y_s=y_t\}$ and a natural LP relaxation with $\{0,1\}$-coefficient constraints. Integral solutions certify realizability and output an explicit assignment matrix, enabling an assignment-first, regression-second pipeline for downstream coordinate estimation. Under bounded noise followed by rounding, we prove a deterministic separation condition under which $\mathcal{P}_y$ is recovered exactly, so the ILP/LP receives the same combinatorial input as in the noiseless case. Experiments illustrate integrality behavior and degradation outside the provable regime.

  • The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

    ArXiv.org · 2026-05-08

    articleOpen access

    Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model--game settings, a pattern we term the memory curse. We isolate the underlying mechanism through three analyses. First, lexical analysis of 378,000 reasoning traces associates this breakdown with eroding forward-looking intent rather than rising paranoia. We validate this using targeted fine-tuning as a cognitive probe: a LoRA adapter trained exclusively on forward-looking traces mitigates the decay and transfers zero-shot to distinct games. Second, memory sanitization holds prompt length fixed while replacing visible history with synthetic cooperative records, which restores cooperation substantially, proving the trigger is memory content, not length alone. Finally, ablating explicit Chain-of-Thought reasoning often reduces the collapse, showing that deliberation paradoxically amplifies the memory curse. Together, these results recast memory as an active determinant of multi-agent behavior: longer recall can either destabilize or support cooperation depending on the reasoning patterns it elicits.

  • Why Search When You Can Transfer? Amortized Agentic Workflow Design from Structural Priors

    ArXiv.org · 2026-04-27

    articleOpen accessSenior author

    Automated agentic workflow design currently relies on per-task iterative search, which is computationally prohibitive and fails to reuse structural knowledge across tasks. We observe that optimized workflows converge to a small family of domain-specific topologies, suggesting that this combinatorial search is largely redundant. Building on this insight, we propose SWIFT (Synthesizing Workflows via Few-shot Transfer), a framework that amortizes workflow design into reusable structural priors. SWIFT first distills compositional heuristics and output-interface contracts from contrastive analysis of prior search trajectories across source tasks. At inference time, it conditions a single LLM generation pass on these priors together with cross-task workflow demonstrations to synthesize a complete, executable workflow for an unseen target task, bypassing iterative search entirely. On five benchmarks, SWIFT outperforms the state-of-the-art search-based method while reducing marginal per-task optimization cost by three orders of magnitude. It further generalizes to four additional unseen benchmarks and transfers successfully from GPT-4o-mini to three additional foundation models (Grok, Qwen, Gemma). Controlled ablations reveal that workflow demonstrations primarily transfer topological structure rather than surface semantics: replacing all operator names with random strings still retains over 93% of the full system's average performance.

  • Turnpike with Uncertain Measurements: Triangle-Equality ILP with a Deterministic Recovery Guarantee

    arXiv (Cornell University) · 2026-03-18

    preprintOpen accessSenior author

    We study Turnpike with uncertain measurements: reconstructing a one-dimensional point set from an unlabeled multiset of pairwise distances under bounded noise and rounding. We give a combinatorial characterization of realizability via a multi-matching that labels interval indices by distinct distance values while satisfying all triangle equalities. This yields an ILP based on the triangle equality whose constraint structure depends only on the two-partition set $\mathcal{P}_y=\{(r,s,t): y_r+y_s=y_t\}$ and a natural LP relaxation with $\{0,1\}$-coefficient constraints. Integral solutions certify realizability and output an explicit assignment matrix, enabling an assignment-first, regression-second pipeline for downstream coordinate estimation. Under bounded noise followed by rounding, we prove a deterministic separation condition under which $\mathcal{P}_y$ is recovered exactly, so the ILP/LP receives the same combinatorial input as in the noiseless case. Experiments illustrate integrality behavior and degradation outside the provable regime.

  • The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

    arXiv (Cornell University) · 2026-05-08

    preprintOpen access

    Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model--game settings, a pattern we term the memory curse. We isolate the underlying mechanism through three analyses. First, lexical analysis of 378,000 reasoning traces associates this breakdown with eroding forward-looking intent rather than rising paranoia. We validate this using targeted fine-tuning as a cognitive probe: a LoRA adapter trained exclusively on forward-looking traces mitigates the decay and transfers zero-shot to distinct games. Second, memory sanitization holds prompt length fixed while replacing visible history with synthetic cooperative records, which restores cooperation substantially, proving the trigger is memory content, not length alone. Finally, ablating explicit Chain-of-Thought reasoning often reduces the collapse, showing that deliberation paradoxically amplifies the memory curse. Together, these results recast memory as an active determinant of multi-agent behavior: longer recall can either destabilize or support cooperation depending on the reasoning patterns it elicits.

  • Why Search When You Can Transfer? Amortized Agentic Workflow Design from Structural Priors

    arXiv (Cornell University) · 2026-04-27

    preprintOpen accessSenior author

    Automated agentic workflow design currently relies on per-task iterative search, which is computationally prohibitive and fails to reuse structural knowledge across tasks. We observe that optimized workflows converge to a small family of domain-specific topologies, suggesting that this combinatorial search is largely redundant. Building on this insight, we propose SWIFT (Synthesizing Workflows via Few-shot Transfer), a framework that amortizes workflow design into reusable structural priors. SWIFT first distills compositional heuristics and output-interface contracts from contrastive analysis of prior search trajectories across source tasks. At inference time, it conditions a single LLM generation pass on these priors together with cross-task workflow demonstrations to synthesize a complete, executable workflow for an unseen target task, bypassing iterative search entirely. On five benchmarks, SWIFT outperforms the state-of-the-art search-based method while reducing marginal per-task optimization cost by three orders of magnitude. It further generalizes to four additional unseen benchmarks and transfers successfully from GPT-4o-mini to three additional foundation models (Grok, Qwen, Gemma). Controlled ablations reveal that workflow demonstrations primarily transfer topological structure rather than surface semantics: replacing all operator names with random strings still retains over 93% of the full system's average performance.

  • Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation

    ArXiv.org · 2026-04-14

    articleOpen accessSenior author

    Pretrained language models often rely on superficial features that appear predictive during training yet fail to generalize at test time, a phenomenon known as shortcut learning. Existing mitigation methods generally operate at training time and require heavy supervision such as access to the original training data or prior knowledge of shortcut type. We propose Shortcut Guardrail, a deployment-time framework that mitigates token-level shortcuts without access to the original training data or shortcut annotations. Our key insight is that gradient-based attribution on a biased model highlights shortcut tokens. Building on this finding, we train a lightweight LoRA-based debiasing module with a Masked Contrastive Learning (MaskCL) objective that encourages consistent representations with or without individual tokens. Across sentiment classification, toxicity detection, and natural language inference under both naturally occurring and controlled shortcuts, Shortcut Guardrail improves overall accuracy and worst-group accuracy over the unmitigated model under distribution shifts while preserving in-distribution performance.

  • Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation

    arXiv (Cornell University) · 2026-04-14

    preprintOpen accessSenior author

    Pretrained language models often rely on superficial features that appear predictive during training yet fail to generalize at test time, a phenomenon known as shortcut learning. Existing mitigation methods generally operate at training time and require heavy supervision such as access to the original training data or prior knowledge of shortcut type. We propose Shortcut Guardrail, a deployment-time framework that mitigates token-level shortcuts without access to the original training data or shortcut annotations. Our key insight is that gradient-based attribution on a biased model highlights shortcut tokens. Building on this finding, we train a lightweight LoRA-based debiasing module with a Masked Contrastive Learning (MaskCL) objective that encourages consistent representations with or without individual tokens. Across sentiment classification, toxicity detection, and natural language inference under both naturally occurring and controlled shortcuts, Shortcut Guardrail improves overall accuracy and worst-group accuracy over the unmitigated model under distribution shifts while preserving in-distribution performance.

Recent grants

Frequent coauthors

  • Rob Patro

    University of Maryland, College Park

    42 shared
  • Guillaume Marçais

    Carnegie Mellon University

    38 shared
  • Mingfu Shao

    Pennsylvania State University

    25 shared
  • Dan DeBlasio

    Carnegie Mellon University

    21 shared
  • Cong Ma

    Northwestern Polytechnical University

    19 shared
  • Geet Duggal

    DNAnexus (United States)

    19 shared
  • Charlotte Soneson

    SIB Swiss Institute of Bioinformatics

    17 shared
  • Darya Filippova

    16 shared

Labs

  • Resume-aware match score
  • Save to shortlist
  • AI-drafted outreach

See your match with Carl Kingsford

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

  • Free to start
  • No credit card
  • 30-second signup