Cathy Wu

Class of 1954 Career Development Professor · Verified

Massachusetts Institute of Technology · Civil and Environmental Engineering

Active 1968–2026

h-index: 106

Citations: 92.5k

Papers: 592 (131 in the last 5 years)

Funding: $163.5M (4 active grants)

About

Cathy Wu is a Class of 1954 Career Development Professor at the Massachusetts Institute of Technology, affiliated with the Department of Civil and Environmental Engineering and the Institute for Data, Systems, and Society. Her research intersects machine learning, optimization, and large-scale societal systems, with a recent focus on mixed autonomy systems in mobility. This involves studying the complex integration of automation, such as self-driving cars, into urban transportation systems. She aims to develop principled computational tools to enable reliable and complex decision-making for critical societal systems. Cathy Wu has collaborated broadly across fields including transportation, computer science, electrical engineering, mechanical engineering, urban planning, and public policy. Her industry collaborations include Microsoft Research, OpenAI, Google X Self-Driving Car Team, AT&T, Caltrans, Facebook, and Dropbox. She is also the founder and Chair of the Interdisciplinary Research Initiative within the ACM Future of Computing Academy, actively working to build international programs that promote interdisciplinary research in computing.

Research topics

Computer Science
Information Retrieval
Bioinformatics
World Wide Web
Biology
Computational biology
Data science
Genetics
Software engineering
Internet privacy
Engineering

Selected publications

Measuring the State of Open Science in Transportation Using Large Language Models
ArXiv.org · 2026-01-20
article · Open access · Senior author
Open science initiatives have strengthened scientific integrity and accelerated research progress across many fields, but the state of their practice within transportation research remains under-investigated. Key features of open science, defined here as data and code availability, are difficult to extract due to the inherent complexity of the field. Previous work has either been limited to small-scale studies due to the labor-intensive nature of manual analysis or has relied on large-scale bibliometric approaches that sacrifice contextual richness. This paper introduces an automatic and scalable feature-extraction pipeline to measure data and code availability in transportation research. We employ Large Language Models (LLMs) for this task and validate their performance against a manually curated dataset and through an inter-rater agreement analysis. We applied this pipeline to examine 10,724 research articles published in the Transportation Research Part series of journals between 2019 and 2024. Our analysis found that only 5% of quantitative papers shared a code repository, 4% of quantitative papers shared a data repository, and about 3% of papers shared both, with trends differing across journals, topics, and geographic regions. We found no significant difference in citation counts or review duration between papers that provided data and code and those that did not, suggesting a misalignment between open science efforts and traditional academic metrics. Consequently, encouraging these practices will likely require structural interventions from journals and funding agencies to supplement the lack of direct author incentives. The pipeline developed in this study can be readily scaled to other journals, representing a critical step toward the automated measurement and monitoring of open science practices in transportation research.
Pretraining effective T5 generative models for clinical and biomedical applications
PLoS ONE · 2026-04-17
article · Open access
This paper presents a study of the impact of corpus selection and vocabulary design on the performance of T5-based language models in clinical and biomedical domains. We introduce five different T5-EHR models, each pretrained from scratch using different combinations of clinical and biomedical corpora alongside domain-specific vocabularies. We evaluated these models across a variety of clinical and biomedical tasks to quantify the impact of pretraining data and vocabulary tokenization choices on downstream performance. Our findings reveal the importance of aligning both pretraining corpus and vocabulary with the target domain. Models pretrained exclusively on clinical data achieve superior performance on clinical tasks, while adding biomedical data contributes only marginal gains in most cases, with a few exceptions. Similarly, the choice of vocabulary significantly influences model performance, with clinical-specific vocabularies outperforming general biomedical vocabularies in tasks requiring a deeper understanding of clinical language. Also, the T5 generative models perform competitively with state-of-the-art discriminative models on several biomedical benchmarks, demonstrating strong generalization to the biomedical domain. Overall, these results emphasize that task-specific selection of corpus and vocabulary is essential for optimizing model performance in clinical and biomedical natural language processing (NLP).
AlphaOPT: Formulating Optimization Programs with Self-Improving LLM Experience Library
arXiv (Cornell University) · 2025-10-21
preprint · Open access
Optimization modeling underlies critical decision-making across industries, yet remains difficult to automate: natural-language problem descriptions must be translated into precise mathematical formulations and executable solver code. Existing LLM-based approaches typically rely on brittle prompting or costly retraining, both of which offer limited generalization. Recent work suggests that large models can improve via experience reuse, but how to systematically acquire, refine, and reuse such experience in structurally constrained settings remains unclear. We present AlphaOPT, a self-improving experience library that enables LLMs to learn optimization modeling knowledge from limited supervision, including answer-only feedback without gold-standard programs, annotated reasoning traces, or parameter updates. AlphaOPT operates in a continual two-phase cycle: a Library Learning phase that extracts solver-verified, structured insights from failed attempts, and a Library Evolution phase that refines the applicability of stored insights based on aggregate evidence across tasks. This design allows the model to accumulate reusable modeling principles, improve transfer across problem instances, and maintain bounded library growth over time. Evaluated on multiple optimization benchmarks, AlphaOPT steadily improves as more training data become available (65% → 72% from 100 to 300 training items) and outperforms the strongest baseline by 9.1% and 8.2% on two out-of-distribution datasets. These results demonstrate that structured experience learning, grounded in solver feedback, provides a practical alternative to retraining for complex reasoning tasks requiring precise formulation and execution. All code and data are available at: https://github.com/Minw913/AlphaOPT.
Differential Cellular Responses to ω-3 and ω-6 Fatty Acids Mediated by FFAR4 Receptor in Large Yellow Croaker (Larimichthys crocea)
Acta Hydrobiologica Sinica · 2025-11-25
article · Open access · 1st author · Corresponding author
Free fatty acid receptor 4 (FFAR4), a G protein-coupled receptor (GPCR), plays a key role in sensing long-chain polyunsaturated fatty acids (LC-PUFAs) in mammals. However, its ligand selectivity and downstream signaling mechanisms in fish remain poorly understood. In this study, we investigated the ligand recognition and signaling characteristics of FFAR4 from the large yellow croaker (Larimichthys crocea), with particular emphasis on ω-3 and ω-6 fatty acids. A stable HEK293T cell line expressing LcFFAR4 was established to systematically assess ligand-induced receptor internalization, intracellular calcium flux, and the cyclic adenosine monophosphate (cAMP) signaling pathway. Subcellular localization analysis revealed predominant plasma membrane distribution of LcFFAR4. Ligand stimulation assays demonstrated that both eicosapentaenoic acid (EPA, ω-3) and linoleic acid (LA, ω-6) induced receptor internalization, with EPA showing a significantly stronger effect. Calcium imaging revealed that low concentrations of EPA markedly increased intracellular calcium levels, indicating activation of the calcium signaling pathway, whereas LA elicited no significant change. Dual-luciferase reporter assays further confirmed that both EPA and LA activated the LcFFAR4-mediated cAMP signaling pathway, with EPA displaying a more potent effect. Functional analyses revealed that EPA more effectively mitigated lipopolysaccharide (LPS)-induced inflammatory responses via LcFFAR4. These results collectively demonstrate that LcFFAR4 exhibits distinct ligand selectivity and signaling bias. EPA, an ω-3 fatty acid, functions as a more potent agonist than LA, inducing stronger receptor internalization and more robust activation of both calcium and cAMP signaling pathways. This study provides the first evidence of teleost FFAR4 differential responses to ω-3 and ω-6 fatty acids from a GPCR signaling perspective, offering new insights into the fatty acid sensing mechanisms and their potential roles in nutrient sensing and metabolic regulation in fish.
Leveraging Social Determinants of Health (SDoH) Knowledge Graph to Identify Latent Patterns in Veteran Suicide Risk
2025-10-26
article
While Social Determinants of Health (SDoH) are widely acknowledged as critical factors influencing health outcomes, particularly in vulnerable populations, their complex relationships and systemic impacts remain insufficiently examined. This study presents the development and systematic analysis of a comprehensive knowledge graph (KG) framework designed to elucidate the complex relationships between SDoH and mental health outcomes in a high-risk population: veterans with documented histories of suicide attempts or suicidal ideation. Leveraging a comprehensive electronic health records dataset from the U.S. Veterans Health Administration, we generated synthetic data that accurately preserves the statistical properties of the original dataset. We also constructed a specialized SDoH knowledge graph to enable multidimensional analysis. Using topological link prediction and node classification algorithms, we systematically analyzed structural patterns across critical SDoH domains to uncover latent relationships within the KG. Our KG-based approach enables privacy-preserving health disparities research by combining synthetic data generation with graph-based analytics. Our results demonstrate the viability of this approach for deriving clinically meaningful insights while maintaining strict confidentiality protections, establishing a scalable paradigm for future population health studies.
When Context Is Not Enough: Modeling Unexplained Variability in Car-Following Behavior
ArXiv.org · 2025-07-09
preprint · Open access
Modeling car-following behavior is fundamental to microscopic traffic simulation, yet traditional deterministic models often fail to capture the full extent of variability and unpredictability in human driving. While many modern approaches incorporate context-aware inputs (e.g., spacing, speed, relative speed), they frequently overlook structured stochasticity that arises from latent driver intentions, perception errors, and memory effects -- factors that are not directly observable from context alone. To fill the gap, this study introduces an interpretable stochastic modeling framework that captures not only context-dependent dynamics but also residual variability beyond what context can explain. Leveraging deep neural networks integrated with nonstationary Gaussian processes (GPs), our model employs a scenario-adaptive Gibbs kernel to learn dynamic temporal correlations in acceleration decisions, where the strength and duration of correlations between acceleration decisions evolve with the driving context. This formulation enables a principled, data-driven quantification of uncertainty in acceleration, speed, and spacing, grounded in both observable context and latent behavioral variability. Comprehensive experiments on the naturalistic vehicle trajectory dataset collected from the German highway, i.e., the HighD dataset, demonstrate that the proposed stochastic simulation method within this framework surpasses conventional methods in both predictive performance and interpretable uncertainty quantification. The integration of interpretability and accuracy makes this framework a promising tool for traffic analysis and safety-critical applications.
Route Recommendations for Traffic Management Under Learned Partial Driver Compliance
ArXiv.org · 2025-04-03
preprint · Open access
In this paper, we aim to mitigate congestion in traffic management systems by guiding travelers along system-optimal (SO) routes. However, we recognize that most theoretical approaches assume perfect driver compliance, which often does not reflect reality, as drivers tend to deviate from recommendations to fulfill their personal objectives. Therefore, we propose a route recommendation framework that explicitly learns partial driver compliance and optimizes traffic flow under realistic adherence. We first compute an SO edge flow through flow optimization techniques. Next, we train a compliance model based on historical driver decisions to capture individual responses to our recommendations. Finally, we formulate a stochastic optimization problem that minimizes the gap between the target SO flow and the realized flow under conditions of imperfect adherence. Our simulations conducted on a grid network reveal that our approach significantly reduces travel time compared to baseline strategies, demonstrating the practical advantage of incorporating learned compliance into traffic management.
KSMoFinder - Knowledge graph embedding of proteins and motifs for predicting kinases of human phosphosites
bioRxiv (Cold Spring Harbor Laboratory) · 2025-10-23
preprint · Open access · Senior author
Motivation: Protein kinases regulate cellular signaling pathways through a cascade of phosphorylation activity, selectively targeting specific residues on substrate proteins (phosphosites). The characteristics of kinases that phosphorylate specific substrates have been extensively studied. Most tools utilize amino acid sequence motifs around phosphosites but do not consider the biological characteristics of substrate proteins. Results: We present KSMoFinder, a kinase-substrate-motif prediction model that learns factors beyond motif similarities by integrating the biological contexts of proteins. We learn the semantics in a knowledge graph containing contextual relationships of proteins, kinase-specific motifs and motif composition, and represent the proteins and motifs as embedded vectors. Using the representations as features, we train a supervised deep learning classifier to identify kinase-phosphosite relationships. We use a ground-truth kinase-substrate-motif dataset from iPTMnet and PhosphositePlus and evaluate the prediction performance of KSMoFinder. Pairwise comparative assessments with prior kinase-substrate prediction tools demonstrate the superior performance of KSMoFinder. KSMoFinder trained using our knowledge graph embeddings surpasses the prediction performances using embeddings of popular protein language models such as ProtT5, ESM2 and ESM3 with a ROC-AUC of 0.851 and PR-AUC of 0.839 on a testing dataset with equal number of positives and negatives. Unlike most existing tools, KSMoFinder can be utilized to predict at the motif and at the substrate protein level. Availability and implementation: All code to reproduce the results is available at https://github.com/manju-anandakrishnan/KSMoFinder. All data and KSMoFinder predictions are deposited at https://doi.org/10.5281/zenodo.15730847.
Desiderata for a biomedical knowledge network: opportunities, challenges and future directions
PubMed · 2025-09-26
article · Open access · Senior author
Knowledge graphs, collectively as a knowledge network, have become critical tools for knowledge discovery in computable and explainable knowledge systems. Due to the semantic and structural complexities of biomedical data, these knowledge graphs need to enable dynamic reasoning over large evolving graphs and support fit-for-purpose abstraction, while establishing standards, preserving provenance and enforcing policy constraints for actionable discovery. A recent meeting of leading scientists discussed the opportunities, challenges and future directions of a biomedical knowledge network. Here we present six desiderata inspired by the meeting: (1) inference and reasoning in biomedical knowledge graphs need domain-centric approaches; (2) harmonized and accessible standards are required for knowledge graph representation and metadata; (3) robust validation of biomedical knowledge graphs needs multi-layered, context-aware approaches that are both rigorous and scalable; (4) the evolving and synergistic relationship between knowledge graphs and large language models is essential in empowering AI-driven biomedical discovery; (5) integrated development environments, public repositories, and governance frameworks are essential for secure and reproducible knowledge graph sharing; and (6) robust validation, provenance, and ethical governance are critical for trustworthy biomedical knowledge graphs. Addressing these key issues will be essential to realize the promises of a biomedical knowledge network in advancing biomedicine.

Recent grants

UniProt: A Protein Sequence and Function Resource for Biomedical Science
NIH · $3.8M · 2014–2026
NIH Grant R29LM005524
NIH · $461k · 1999
Delaware INBRE CSR Core
NIH · $100.8M · 2001–2030
Semantic Literature Annotation and Integrative Panomics Analysis for PTM-Disease Knowledge Network Discovery
NIH · $370k · 2016–2019
PRO: A Protein Ontology in OBO Foundry for Scalable Integration of Biomedical Knowledge
NIH · $8.9M · 2007–2020

Frequent coauthors

Darren A. Natale
Georgetown University Medical Center
404 shared
Hongzhan Huang
National Taiwan University Hospital
376 shared
Cecilia Arighi
University of Delaware
313 shared
Alex Bateman
289 shared
Robert Finn
European Bioinformatics Institute
281 shared
Rolf Apweiler
European Bioinformatics Institute
239 shared
Christian Sigrist
223 shared
Alex Mitchell
Pennsylvania State University
217 shared

Education

Ph.D.
Purdue University
1984
M.S.
Purdue University
1982

Awards & honors

Ole Madsen Mentoring Award (2025)
NSF Faculty Early Career Development (CAREER) Award (2023)
CUTC Milton Pikarsky Memorial Dissertation Award (2018)
Outstanding Graduate Student Instructor Award, UC Berkeley (…
ACM Future of Computing Academy, Interdisciplinary Research…
