Carl Kingsford

Herbert A. Simon Professor and Co-Director of the Joint Carnegie Mellon-University of Pittsburgh Ph.D. Program in Computational Biology

Carnegie Mellon University · Ray and Stephanie Lane Computational Biology Department

Active 2000–2026

h-index: 40

Citations: 18.9k

Papers: 198 · 57 last 5y

Funding: $7.2M · 2 active


About

Carl Kingsford is the Herbert A. Simon Professor of Computer Science in the Ray and Stephanie Lane Computational Biology Department at Carnegie Mellon University. His research develops computational methods and algorithms for biological data analysis, including genome graph construction, sequence analysis, and the study of genomic variation, with an emphasis on scalable algorithmic approaches. He has been named an ISCB Fellow for his work in computational molecular biology.

Research topics

Computer Science
Biology
Artificial Intelligence
Computational biology
Data Mining
Anatomy
Cell biology
Algorithm

Selected publications

Data-driven AI system for learning how to run transcript assemblers
Genome biology · 2026-05-12
article · Open access · Senior author · Corresponding
We introduce AutoTuneX, a data-driven AI system designed to automatically predict optimal parameters for transcript assemblers, which are tools for reconstructing transcripts from the reads in a given RNA-seq sample. AutoTuneX is built by learning parameter knowledge from existing RNA-seq samples and transferring this knowledge to unseen samples. On 1588 human RNA-seq samples tested with two transcript assemblers, AutoTuneX predicts parameters that result in 98% of samples achieving more accurate transcript assembly than with default parameters, with some samples showing up to a 600% improvement in AUC. AutoTuneX offers a new strategy for automatically optimizing the use of sequence analysis tools.
CodonMoE: DNA Language Models for Codon-Dependent mRNA Prediction
Bioinformatics · 2026-05-05
article · Open access · Senior author
MOTIVATION: Genomic language models (gLMs) face a fundamental efficiency challenge: one must either maintain separate specialized models for each biological modality (DNA and RNA) or develop large multi-modal architectures. Both approaches impose significant computational burdens: modality-specific models require redundant infrastructure despite inherent biological connections, while multi-modal architectures demand increased parameter counts and extensive cross-modality pretraining. RESULTS: To address this limitation, we introduce CodonMoE (Adaptive Mixture of Codon Reformative Experts), a lightweight adapter that transforms DNA language models into effective RNA analyzers without RNA-specific pretraining. Our theoretical analysis establishes CodonMoE as a universal approximator at the codon level, capable of mapping arbitrary functions from codon sequences to codon-dependent RNA properties given sufficient expert capacity. Across four RNA prediction tasks spanning stability, expression, and regulation, DNA models augmented with CodonMoE significantly outperform their unmodified counterparts, with the HyenaDNA+CodonMoE series achieving state-of-the-art results using 80% fewer parameters than specialized RNA models. By maintaining sub-quadratic complexity while achieving superior performance, our approach provides a principled path toward unifying genomic language modeling, leveraging more abundant DNA data and reducing computational overhead while preserving modality-specific performance advantages. AVAILABILITY: Source code for the method and to reproduce the results is available at https://github.com/Kingsford-Group/CodonMoE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Turnpike with Uncertain Measurements: Triangle-Equality ILP with a Deterministic Recovery Guarantee
ArXiv.org · 2026-03-18
article · Open access · Senior author
We study Turnpike with uncertain measurements: reconstructing a one-dimensional point set from an unlabeled multiset of pairwise distances under bounded noise and rounding. We give a combinatorial characterization of realizability via a multi-matching that labels interval indices by distinct distance values while satisfying all triangle equalities. This yields an ILP based on the triangle equality whose constraint structure depends only on the two-partition set $\mathcal{P}_y=\{(r,s,t): y_r+y_s=y_t\}$ and a natural LP relaxation with $\{0,1\}$-coefficient constraints. Integral solutions certify realizability and output an explicit assignment matrix, enabling an assignment-first, regression-second pipeline for downstream coordinate estimation. Under bounded noise followed by rounding, we prove a deterministic separation condition under which $\mathcal{P}_y$ is recovered exactly, so the ILP/LP receives the same combinatorial input as in the noiseless case. Experiments illustrate integrality behavior and degradation outside the provable regime.
The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents
ArXiv.org · 2026-05-08
article · Open access
Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model-game settings, a pattern we term the memory curse. We isolate the underlying mechanism through three analyses. First, lexical analysis of 378,000 reasoning traces associates this breakdown with eroding forward-looking intent rather than rising paranoia. We validate this using targeted fine-tuning as a cognitive probe: a LoRA adapter trained exclusively on forward-looking traces mitigates the decay and transfers zero-shot to distinct games. Second, memory sanitization holds prompt length fixed while replacing visible history with synthetic cooperative records, which restores cooperation substantially, proving the trigger is memory content, not length alone. Finally, ablating explicit Chain-of-Thought reasoning often reduces the collapse, showing that deliberation paradoxically amplifies the memory curse. Together, these results recast memory as an active determinant of multi-agent behavior: longer recall can either destabilize or support cooperation depending on the reasoning patterns it elicits.
Why Search When You Can Transfer? Amortized Agentic Workflow Design from Structural Priors
ArXiv.org · 2026-04-27
article · Open access · Senior author
Automated agentic workflow design currently relies on per-task iterative search, which is computationally prohibitive and fails to reuse structural knowledge across tasks. We observe that optimized workflows converge to a small family of domain-specific topologies, suggesting that this combinatorial search is largely redundant. Building on this insight, we propose SWIFT (Synthesizing Workflows via Few-shot Transfer), a framework that amortizes workflow design into reusable structural priors. SWIFT first distills compositional heuristics and output-interface contracts from contrastive analysis of prior search trajectories across source tasks. At inference time, it conditions a single LLM generation pass on these priors together with cross-task workflow demonstrations to synthesize a complete, executable workflow for an unseen target task, bypassing iterative search entirely. On five benchmarks, SWIFT outperforms the state-of-the-art search-based method while reducing marginal per-task optimization cost by three orders of magnitude. It further generalizes to four additional unseen benchmarks and transfers successfully from GPT-4o-mini to three additional foundation models (Grok, Qwen, Gemma). Controlled ablations reveal that workflow demonstrations primarily transfer topological structure rather than surface semantics: replacing all operator names with random strings still retains over 93% of the full system's average performance.
Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation
ArXiv.org · 2026-04-14
article · Open access · Senior author
Pretrained language models often rely on superficial features that appear predictive during training yet fail to generalize at test time, a phenomenon known as shortcut learning. Existing mitigation methods generally operate at training time and require heavy supervision such as access to the original training data or prior knowledge of shortcut type. We propose Shortcut Guardrail, a deployment-time framework that mitigates token-level shortcuts without access to the original training data or shortcut annotations. Our key insight is that gradient-based attribution on a biased model highlights shortcut tokens. Building on this finding, we train a lightweight LoRA-based debiasing module with a Masked Contrastive Learning (MaskCL) objective that encourages consistent representations with or without individual tokens. Across sentiment classification, toxicity detection, and natural language inference under both naturally occurring and controlled shortcuts, Shortcut Guardrail improves overall accuracy and worst-group accuracy over the unmitigated model under distribution shifts while preserving in-distribution performance.

Recent grants

Algorithms for Managing Uncertainty in Chromosome Conformation Capture Data
NIH · $1.3M · 2013–2017
CAREER: Model-based Reconstruction of Ancient Biological Networks
NSF · $177k · 2011–2012
Data Discovery: Computational Methods for Searching Short-Read Sequencing Experiments - Administrative Supplement
NIH · $1.2M · 2017–2022
CAREER: Model-based Reconstruction of Ancient Biological Networks
NSF · $359k · 2012–2017
Improved genomic sketching for MUMmer and metagenomics
NIH · $1.7M · 2022–2027

Frequent coauthors

Rob Patro
University of Maryland, College Park
42 shared
Guillaume Marçais
Carnegie Mellon University
38 shared
Mingfu Shao
Pennsylvania State University
25 shared
Dan DeBlasio
Carnegie Mellon University
21 shared
Cong Ma
Northwestern Polytechnical University
19 shared
Geet Duggal
DNAnexus (United States)
19 shared
Charlotte Soneson
SIB Swiss Institute of Bioinformatics
17 shared
Darya Filippova
16 shared

Labs

Kingsford Group · PI
