
Danqi Chen
Associate Director of Princeton Language and Intelligence · Princeton University · Philosophy
Active 2008–2026
About
Danqi Chen is an associate professor of Computer Science at Princeton University, where she co-leads the Princeton NLP Group and serves as an associate director of Princeton Language and Intelligence (PLI). Her research in natural language processing and machine learning covers the full life cycle of language models: how they are built, aligned, and understood, with an emphasis on methods that democratize their creation and deployment. She is currently on sabbatical leave from Princeton, working as a member of technical staff at Thinking Machines Lab, and was previously a visiting scientist at Facebook AI Research in Seattle. Chen received her Ph.D. in Computer Science from Stanford University in 2018, where she was advised by Christopher Manning and was part of the Stanford NLP Group. She holds an undergraduate degree in Computer Science from Tsinghua University, where she was part of the Special Pilot CS Class supervised by Andrew Yao.
Research topics
- Computer Science
- Artificial Intelligence
- Natural Language Processing
- Machine Learning
- Computer Security
- Cell biology
- Biology
- Data science
- Theoretical computer science
- Psychology
- Molecular biology
- Genetics
Selected publications
ChemRxiv · 2026-05-10
article · Open access
Hematopoietic progenitor kinase 1 (HPK1) is an intracellular negative regulator of immune responses, particularly in T cell receptor (TCR) signaling. Compelling genetic evidence indicates that HPK1 impairs multiple stages of antitumor immunity largely via its kinase activity, thereby positioning it as a promising target for cancer immunotherapy. Starting from two 3-aminopyrazole hits identified through in-house screening, we initiated a structure-guided medicinal chemistry campaign. Through multi-stage optimization to enhance potency by engaging the Asp101 residue and to improve kinase selectivity via molecular hybridization, we developed a novel series of 1H-pyrazolo[3,4-c]pyridin-3-amine derivatives, from which D5 emerged as a key representative. Compound D5 exhibited potent HPK1 inhibitory activity (IC50 = 26.3 nM), and inhibited SLP76 phosphorylation and promoted IL-2 secretion in cell-based assays. Importantly, D5 showed favorable selectivity across diversity and immune-focused kinase panels, representing a marked improvement over its earlier analogue C4. Furthermore, D5 significantly suppressed tumor growth in the CT26 syngeneic mouse model without causing observable body weight loss. Overall, this work provides both a promising lead and a novel chemical scaffold, offering valuable insights and a concrete starting point for HPK1-targeted drug discovery.
Animal models of developmental toxicity induced by early life electronic-cigarettes exposure
Critical Reviews in Toxicology · 2025-04-21 · 1 citation
review
The rising prevalence of electronic-cigarette (E-cig) use during pregnancy and lactation can be attributed, in part, to advertising campaigns promoting their safety. Nevertheless, the integrity of E-cigs as a secure substitute for conventional cigarettes necessitates further exploration. Some studies emphasize the toxic role of nicotine in E-cigs, while others underscore the significance of other distinct components whose toxicity cannot be disregarded. Increasingly, researchers are employing rodent models to elucidate the potential toxicological implications of E-cig use. Various paradigms of E-cig exposure in early life frequently yield divergent health outcomes for offspring. This review first presents different animal model approaches to E-cig exposure during pregnancy and lactation, referring to E-cig liquid, E-cig devices, puff topography, and inhalation methods, which would be related to the health outcomes. Moreover, the mechanisms underlying the hazardous impacts of maternal E-cig exposure on offspring are also elucidated. Maternal exposure to E-cigs has been found to induce adverse effects on lung function, neurobehavior, glycolipid metabolism and energy homeostasis in offspring, which are probably mediated through inflammation, oxidative stress, and epigenetic modifications.
Formaldehyde Exposure Induces Systemic Epigenetic Alterations in Histone Methylation and Acetylation
bioRxiv (Cold Spring Harbor Laboratory) · 2025-03-01 · 3 citations
preprint · Open access
Formaldehyde (FA) is a pervasive environmental organic pollutant and a Group 1 human carcinogen. While FA has been implicated in various cancers, its genotoxic effects, including DNA damage and DNA-protein crosslinking, have proven insufficient to fully explain its role in carcinogenesis, suggesting the involvement of epigenetic mechanisms. Histone post-translational modifications (PTMs) on H3 and H4, critical for regulating gene expression, may contribute to FA-induced pathogenesis as lysine and arginine residues serve as targets for FA-protein adduct formation. This study aimed to elucidate the effects of FA on histone methylation and acetylation patterns. Human bronchial epithelial cells (BEAS-2B) were exposed to low-dose (100 μM) and high-dose (500 μM) FA for 1 hour, and their histone extracts were analyzed using high-resolution liquid chromatography-tandem mass spectrometry-based proteomics, followed by PTM-combined peptide analysis and single PTM site/type comparisons. We identified 40 peptides on histone H3 and 16 on histone H4 bearing epigenetic marks. Our findings revealed that FA exposure induced systemic alterations in H3 and H4 methylation and acetylation, including hypomethylation of H3K4 and H3K79 and changes in H3K9, H3K14, H3K18, H3K23, H3K27, H3K36, H3K37, and H3R40, as well as modifications in H4K5, H4K8, H4K12, and H4K16. These FA-induced histone modifications exhibited strong parallels with epigenetic alterations observed in cancers, leukemia, and Alzheimer's disease. This study provides novel evidence of FA epigenetic toxicity, offering new insights into the potential mechanisms underlying FA-driven pathogenesis.
ArXiv.org · 2025-03-21
preprint · Open access
The latest Audio Language Models (Audio LMs) process speech directly instead of relying on a separate transcription step. This shift preserves detailed information, such as intonation or the presence of multiple speakers, that would otherwise be lost in transcription. However, it also introduces new safety risks, including the potential misuse of speaker identity cues and other sensitive vocal attributes, which could have legal implications. In this paper, we urge a closer examination of how these models are built and deployed. Our experiments show that end-to-end modeling, compared with cascaded pipelines, creates socio-technical safety risks such as identity inference, biased decision-making, and emotion detection. This raises concerns about whether Audio LMs store voiceprints and function in ways that create uncertainty under existing legal regimes. We then argue that the Principle of Least Privilege should be considered to guide the development and deployment of these models. Specifically, evaluations should assess (1) the privacy and safety risks associated with end-to-end modeling; and (2) the appropriate scope of information access. Finally, we highlight related gaps in current Audio LM benchmarks and identify key open research questions, both technical and policy-related, that must be addressed to enable the responsible deployment of end-to-end Audio LMs.
Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking
ArXiv.org · 2025-06-11
preprint · Open access
Recent work has identified retrieval heads, a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHead (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHead by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QRRetriever, an efficient and effective retriever that uses the accumulated attention mass of QRHead as retrieval scores. We use QRRetriever for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QRRetriever as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHead with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
arXiv (Cornell University) · 2025-02-14
preprint · Open access
Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web. The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods will implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.
Frontiers in Plant Science · 2025-02-04
article · Open access
Optimized triketone dioxygenase (TDO) variants with enhanced temperature stability parameters were engineered to enable robust triketone tolerance in transgenic cotton and soybean crops. This herbicide tolerance trait, which can metabolize triketone herbicides such as mesotrione and tembotrione, could be useful for weed management systems and provide additional tools for farmers to control weeds. TDO has a low melting point (~39°C–40°C). We designed an optimization scheme using a hypothesis-based rational design to improve the temperature stability of TDO. Temperature stabilization resulted in enzymes with Kcat values less than half of wild-type TDO. The best variant TDO had a Kcat of 1.2 min⁻¹ compared to wild-type TDO, which had a Kcat of 2.7 min⁻¹. However, Km values did not change much due to temperature stabilization. Recovery of the Kcat without losing heat stability was the focus of additional optimization. Multiple variants were found that had better heat stability in vitro and efficacies against mesotrione equaling the wild-type (WT) TDO in greenhouse and field tests.
Representing Rule-based Chatbots with Transformers
2025-01-01
article · Open access · Senior author
Dan Friedman, Abhishek Panigrahi, Danqi Chen. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025.
Precise Information Control in Long-Form Text Generation
ArXiv.org · 2025-06-06
preprint · Open access
A central challenge in language models (LMs) is faithfulness hallucination: the generation of information unsubstantiated by input context. To study this problem, we propose Precise Information Control (PIC), a new task formulation that requires models to generate long-form outputs grounded in a provided set of short self-contained statements, without adding any unsupported ones. PIC includes a full setting that tests a model's ability to include exactly all input claims, and a partial setting that requires the model to selectively incorporate only relevant claims. We present PIC-Bench, a benchmark of eight long-form generation tasks (e.g., summarization, biography generation) adapted to the PIC setting, where LMs are supplied with well-formed, verifiable input claims. Our evaluation of a range of open and proprietary LMs on PIC-Bench reveals that, surprisingly, state-of-the-art LMs still hallucinate against user-provided input in over 70% of generations. To alleviate this lack of faithfulness, we introduce a post-training framework that uses a weakly supervised preference data construction method to train an 8B PIC-LM with stronger PIC ability, improving from 69.1% to 91.0% F1 in the full PIC setting. When integrated into end-to-end factual generation pipelines, PIC-LM improves exact match recall by 17.1% on ambiguous QA with retrieval, and factual precision by 30.5% on a birthplace fact-checking task, underscoring the potential of precisely grounded generation.
Environmental Technology & Innovation · 2025-07-07
article · Open access
Solid-state fermentation (SSF) of green tea waste (GTW) and black tea waste (BTW) was evaluated to assess microbial effects on extractable caffeine content. A significant caffeine increase was observed, with GTW fermented at 45°C and 55% water content for 6 days yielding the highest caffeine content (~13.7134 mg/g dry mass via high-temperature extraction, ~5.098 mg/g dry mass higher than the control, unfermented GTW on day 0). SSF also enhanced caffeine extraction at room temperature with minimal solvent, offering an energy-efficient advantage. Volatile organic compound (VOC) analysis by proton transfer reaction time-of-flight mass spectrometry (PTR-ToF-MS) and microbial profiling by DNA sequencing analysis revealed key VOCs (m/z 59, m/z 61, m/z 89) linked to microbial activity. Dominant bacteria (Bacillus, Paenibacillus) and fungi (Aspergillus) in fermented GTW likely contributed to improved caffeine extractability. These findings highlight microbial transformations in tea waste SSF and its potential for sustainable caffeine extraction as a value-added product.
- Green tea waste fermented at 45°C and 55% moisture yielded the highest extracted caffeine.
- Solid-state fermentation enhanced extracted caffeine content.
- Extraction improvement is especially evident at low energy with minimal solvent.
- Key volatile compounds at m/z 59, 61, and 89 are linked to microbial activity.
- Bacillus, Paenibacillus, and Aspergillus correlate with improved caffeine content.
Frequent coauthors
- Bing Xiong, Chinese Academy of Sciences (136 shared)
- Jingkang Shen (61 shared)
- Xin Wang, Shandong University (49 shared)
- Meiyu Geng, Chinese Center For Disease Control and Prevention (47 shared)
- Danyan Cao, Shanghai Institute of Materia Medica (37 shared)
- Yanlian Li, Xuzhou Medical College (33 shared)
- Jian Xu, Huashan Hospital (30 shared)
- Chunyuan Jin, NYU Langone Health (29 shared)
Education
- 1999
Master of Engineering, Department of Computer Science
Harbin Institute of Technology