Jiafeng Chen

Verified

Stanford University · Economics

Active 1997–2026

h-index 47

Citations 11.2k

Papers 389 · 216 last 5y

Funding $1.0M

Faculty page

About

Jiafeng Chen is an Assistant Professor in the Department of Economics at Stanford University. His research focuses on applied microeconomics and microeconomic theory. The department's mission includes training undergraduate and graduate students in modern economic methods and ideas, and conducting basic and applied research to advance the field. Chen's research contributes to that mission.

Research topics

Computer Science
Medicine
Artificial Intelligence
Machine Learning
Internal medicine
Natural Language Processing
Medical education
Computer Security
Computational biology
Political Science
Engineering
Biological system
Psychology
Biochemistry
Biology
Pediatrics
Medical physics
Virology
Psychiatry
World Wide Web
Family medicine
Gerontology
Pharmacology
Environmental health

Selected publications

Micro-randomization trial design under operational constraints
Contemporary Clinical Trials · 2026-05-01
article
Publisher DOI
Microfluidization—Applications in Food Industry
2026-01-01
book-chapter · 1st author
Publisher DOI
P-417. Machine Learning Prediction of Pediatric Bacteremia: Development of EHR-Based Models for Diagnostic and Clinical Decision Support
Open Forum Infectious Diseases · 2026-01-01
article · Open access · Senior author
Abstract
Background: Pediatric blood cultures are frequently ordered but have low positivity rates (< 4%) in emergency departments (EDs), highlighting the need for better-targeted testing. Accurate prediction can reduce unnecessary cultures, conserve resources, and support stewardship, particularly during the global blood culture bottle shortage. Models developed for adults perform poorly in children due to physiological and clinical differences; in prior work, applying an adult model to pediatric data yielded an AUC of 0.61. We excluded infants < 90 days, who have distinct risk factors (e.g., perinatal history), and developed machine learning models to predict bacteremia in children aged > 90 days to ≤ 18 years using electronic health record (EHR) data.
Methods: We analyzed 26,829 blood culture orders from 9,362 pediatric ED encounters at Stanford Medicine | Children's Health. To preserve temporal validity, data were split chronologically, with the most recent encounters used as the test set. We developed two models: PedsBactoRisk, a logistic regression model, and PedsBactoScore, a simplified point-based tool derived from the most influential PedsBactoRisk predictors. The PedsBactoScore rubric is shown in Table 1.
Results: Table 2 summarizes model performance. We evaluated sensitivity, specificity, PPV, and NPV, focusing on thresholds achieving 90% sensitivity. PedsBactoRisk achieved an AUC-ROC of 0.75; PedsBactoScore, 0.64. While PedsBactoRisk showed superior performance, PedsBactoScore allows easier implementation via its interpretable scoring system. PedsBactoScore performance is shown across thresholds to illustrate sensitivity–specificity trade-offs.
Conclusion: PedsBactoRisk demonstrated the highest overall performance (AUC: 0.75), but PedsBactoScore offers a pragmatic, interpretable bedside tool with strong sensitivity. Both models support more judicious blood culture use by identifying low-risk patients with high sensitivity. Future work will focus on integrating provider notes using large language models to enhance predictive accuracy and extending this approach to infants < 90 days by incorporating maternal and delivery data.
Table 1: PedsBactoScore Point-Based Scoring System Derived from Logistic Regression Coefficients. Each feature contributes a fixed number of points based on clinically meaningful thresholds. The total score is used to stratify risk of bacteremia at the point-of-care.
Table 2: Performance Metrics of Pediatric Bacteremia Prediction Models. Comparison of PedsBactoRisk and PedsBactoScore models on the pediatric test set. Metrics include AUC with 95% confidence intervals, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) at pre-specified thresholds.
Disclosures: Jonathan H. Chen, MD, PhD, Reaction Explorer: Ownership Interest
Publisher DOI
P-2045. Riding the Storm: Leveraging a Best Practice Advisory to Mitigate IV Shortages Post-Hurricane
Open Forum Infectious Diseases · 2026-01-01
article · Open access
Abstract
Background: In September 2024, Hurricane Helene severely damaged Baxter International’s North Carolina manufacturing facility, leading to severe intravenous (IV) fluid shortages. Stanford Medicine implemented warnings in the electronic medical record (EMR) to shift providers away from IV medications. We sought to understand whether there was increased acceptance of the pre-existing Best Practice Advisory (BPA) with additional notifications in the EMR.
Methods: We analyzed BPA acceptance rates from September 2023 through February 2025 for Stanford Hospital and Stanford Tri-Valley Hospital. The Epic EMR displayed additional notifications about the IV fluid shortage as: (a) Message of the Day at sign-in from Oct. 11, 2024 through Jan. 15, 2025 and (b) Homepage Banner in Summary Report from Nov. 21, 2024 through Feb. 25, 2025. A chi-square test or Fisher’s exact test was used to determine statistical differences between the pre-shortage and shortage periods.
Results: Overall BPA acceptance increased from an average of 25% pre-shortage to 32% during the IV fluid shortage (p < 0.001). The increase in BPA acceptance was driven mostly by antibiotics and not antifungals, with the biggest changes observed for azithromycin (from 34% acceptance rate to 52%, p < 0.001), metronidazole (from 29% to 36%, p = 0.019), linezolid (from 28% to 44%, p = 0.046), and isavuconazole (from 15% to 38%, p = 0.012), which are often used in oral formulation in more stable patients. BPA acceptance decreased to baseline in February 2025, at a 26% acceptance rate, when the Message of the Day was taken down, suggesting it is more effective than the Homepage Banner.
Conclusion: Additional EMR notifications were associated with increased acceptance of the BPA shifting IV to oral medications. Future studies must evaluate how to maximize this benefit while minimizing alert fatigue.
Figure 1: Monthly BPA Acceptance Rate
Table 1: Acceptance Rates by Medication Pre-Shortage and Shortage
Disclosures: Jonathan H. Chen, MD, PhD, Reaction Explorer: Ownership Interest
Publisher DOI
Valorization of plant-based protein-rich byproducts through physical processing techniques: A review
International Journal of Biological Macromolecules · 2026-01-15 · 1 citation
article · 1st author
Publisher DOI
P-416. From Broad to Best: A Structured, Automated, and Scalable EHR Approach to Evaluate Empiric Antibiotic Appropriateness
Open Forum Infectious Diseases · 2026-01-01
article · Open access · Senior author
Abstract
Background: Current antimicrobial stewardship metrics emphasize reducing overall antibiotic use but rarely assess patient-level appropriateness. Tools like the SAAR and EHR alerts benchmark use or trigger rules but do not evaluate whether an agent was clinically appropriate. We developed an automated, scalable metric using SQL, widely adopted and optimized for querying data, to implement the DOOR MAT (Desirability of Outcome Ranking for the Management of Antimicrobial Therapy) framework, which ranks empiric antibiotics by spectrum, favoring narrower agents when susceptible.
Methods: We used the ARMD EHR dataset to identify presumptive UTI cases based on urine culture orders and empiric antibiotics, excluding patients with antibiotic exposure in the prior 30 days. Cohorts were stratified by care setting (Figure 1). Antibiotics were categorized into six spectrum tiers adapted from WHO AWaRe and validated by an infectious diseases physician (Figure 2). Using the DOOR MAT framework, we applied SQL logic to compare empiric therapy with culture and antimicrobial susceptibility testing (AST) results, classifying each case as optimal, over-treatment, under-treatment, unnecessary, or not assessable. All unique antibiotics and organisms were retained for full assessment.
Results: Of 73,881 adult ED prescriptions, 58.0% were unnecessary. Among culture-positive cases, 3.3% were optimal, 53.3% over-treated, 17.4% under-treated, and 26.0% lacked AST. In 7,213 pediatric ED cases, 64.4% were unnecessary; among positives, 7.4% were optimal, 29.5% over-treated, and 47.2% lacked AST. Among 47,109 adult outpatient prescriptions, 68.1% were unnecessary; among positives, 10.1% were optimal, 56.3% over-treated, and 23.8% lacked AST (Figure 3). Nitrofurantoin and ciprofloxacin were the most overused agents in the outpatient setting (Figure 4).
Conclusion: This structured, SQL-based framework enables standardized assessment of empiric antibiotic appropriateness using only EHR data. By determining appropriateness along a spectrum relative to culture and susceptibility results, it offers a scalable alternative to manual audit or rule-based alerts. These reproducible measures can support real-time and longitudinal stewardship, particularly in high-volume or outpatient settings.
Figure 1: Cohort Construction Flowchart for Adult Emergency Department (ED) Urinary Tract Infection Cases. This flow diagram illustrates the cohort generation process for adult ED patients with presumed urinary tract infection (UTI), based on urine culture orders and empiric antibiotic prescriptions. Starting from all urine culture orders, sequential filters were applied to isolate cases from the ED with associated empiric antibiotic treatment, while excluding those with recent antibiotic exposure (within 30 days) or non-relevant encounters. This figure provides a visual example of inclusion and exclusion criteria, culminating in the final analytic sample used in the appropriateness analysis. The same logic was applied to generate the pediatric ED and adult outpatient cohorts (not shown). As a proof-of-concept study using a large real-world dataset, limitations include potential misclassification, missing data, and variability in documentation or data completeness across care settings.
Figure 2: Antibiotic Spectrum Tiering Used for Appropriateness Classification in the DOOR MAT Framework. This figure presents selected examples of how empiric antibiotics were categorized into six hierarchical spectrum tiers, adapted from the World Health Organization’s AWaRe classification. AWaRe groups antibiotics into three categories to guide stewardship prioritization: Access (first-line agents with low resistance potential), Watch (higher resistance potential), and Reserve (last-resort agents for multidrug-resistant infections). These categories were expanded to improve granularity for spectrum-based assessment. Narrow-spectrum agents appear in lower tiers, while broader-spectrum and last-resort agents occupy higher tiers. This new tiering system, developed and validated by an infectious diseases physician, served as the basis for evaluating whether a prescribed agent was optimal, broader than necessary (over-treatment), or lacked adequate activity (under-treatment) relative to culture and AST data. The tiers were used in conjunction with SQL-based cohort construction and join logic to apply the DOOR MAT framework. This is not a comprehensive list; antibiotics shown here are representative examples from the full tiering system.
Figure 3: Appropriateness of Empiric Antibiotic Prescribing for Urinary Tract Infections Across Care Settings. This figure displays a spectrum-based histogram of empiric antibiotic appropriateness for culture-positive UTI cases across adult ED, pediatric ED, and adult outpatient settings. Prescriptions are classified as optimal (green), indicating the narrowest agent with full in vitro activity based on AST or implied susceptibility; over-treatment (yellow to red gradient), where the agent was active but broader than necessary, with color intensity reflecting the number of spectrum tiers above the optimal choice; under-treatment (dark red), where the agent lacked activity against the cultured organism(s); and not assessable (gray), where AST was unavailable and no intrinsic resistance or predictable susceptibility could be inferred. The histogram includes only culture-positive cases to maintain interpretability. The percentage and count of unnecessary prescriptions (defined as empiric antibiotics given for negative cultures) are shown separately to the right, as inclusion in the main histogram would distort the visual scale. This color-coded format enables intuitive assessment of antibiotic use across departments, where green reflects appropriate prescribing and red indicates increasingly inappropriate use.
Figure 4: Spectrum Deviation of Commonly Over-Treated Antibiotics in the Adult Outpatient Cohort. This bar chart displays the five most frequently prescribed empiric antibiotics in the adult outpatient cohort that were classified as over-treatment based on final urine culture and AST results. Each antibiotic is labeled on the x-axis with the number of spectrum tier deviations above the optimal agent, based on the adapted WHO AWaRe classification system. For example, nitrofurantoin was the most commonly prescribed agent but typically deviated by only one tier from the optimal choice, whereas ciprofloxacin was two tiers broader than necessary in most cases, where an agent such as amoxicillin (lowest tier) would have provided adequate coverage. This figure illustrates how over-treatment varies not only by drug selection but also by degree of unnecessary spectrum, providing insight into stewardship opportunities beyond binary classification alone.
Disclosures: Hayden T. Schwenk, MD, MPH, Bristol Myers Squibb: Stocks/Bonds (Public Company); Karius, Inc.: Consultant, Medical Affairs. Jonathan H. Chen, MD, PhD, Reaction Explorer: Ownership Interest
Publisher DOI
Retrieval-Augmented Guardrails for AI-Drafted Patient-Portal Messages: Error Taxonomy Construction and Large-Scale Evaluation
2025-12-01 · 1 citation
article · Open access
Asynchronous patient-clinician messaging via EHR portals is a growing source of clinician workload, prompting interest in large language models (LLMs) to assist with draft responses. However, LLM outputs may contain clinical inaccuracies, omissions, or tone mismatches, making robust evaluation essential. Our contributions are threefold: (1) we introduce a clinically grounded error ontology comprising 5 domains and 59 granular error codes, developed through inductive coding and expert adjudication; (2) we develop a Retrieval-Augmented Error Checking (RAEC) pipeline that leverages semantically similar historical message-response pairs to improve judgment quality; and (3) we provide a two-stage prompting architecture using DSPy to enable scalable, interpretable, and hierarchical error detection. Our approach assesses the quality of drafts both in isolation and with reference to similar past message-response pairs retrieved from institutional archives. Using a two-stage DSPy pipeline, we compared baseline and reference-enhanced evaluations on over 1,500 patient messages. Retrieval context improved error identification in domains such as clinical completeness and workflow appropriateness. Human validation on 100 messages demonstrated superior agreement (concordance = 50% vs. 33%) and performance (F1 = 0.500 vs. 0.256) of context-enhanced labels vs. baseline, supporting the use of our RAEC pipeline as AI guardrails for patient messaging.
Publisher DOI
Monitoring Deployed AI Systems in Health Care
ArXiv.org · 2025-12-09
preprint · Open access
Post-deployment monitoring of artificial intelligence (AI) systems in health care is essential to ensure their safety, quality, and sustained benefit, and to support governance decisions about which systems to update, modify, or decommission. Motivated by these needs, we developed a framework for monitoring deployed AI systems grounded in the mandate to take specific actions when they fail to behave as intended. This framework, which is now actively used at Stanford Health Care, is organized around three complementary principles: system integrity, performance, and impact. System integrity monitoring focuses on maximizing system uptime, detecting runtime errors, and identifying when changes to the surrounding IT ecosystem have unintended effects. Performance monitoring focuses on maintaining accurate system behavior in the face of changing health care practices (and thus input data) over time. Impact monitoring assesses whether a deployed system continues to have value in the form of benefit to clinicians and patients. Drawing on examples of deployed AI systems at our academic medical center, we provide practical guidance for creating monitoring plans based on these principles that specify which metrics to measure, when those metrics should be reviewed, who is responsible for acting when metrics change, and what concrete follow-up actions should be taken, for both traditional and generative AI. We also discuss challenges to implementing this framework, including the effort and cost of monitoring for health systems with limited resources and the difficulty of incorporating data-driven monitoring practices into complex organizations where conflicting priorities and definitions of success often coexist. This framework offers a practical template and starting point for health systems seeking to ensure that AI deployments remain safe and effective over time.
Publisher OA PDF DOI
Automated Evaluation of Large Language Model Response Concordance with Human Specialist Responses on Physician-to-Physician eConsult Cases
2025-12-01
article · Open access · Senior author
Specialist consults in primary care and inpatient settings typically address complex clinical questions beyond standard guidelines. eConsults have been developed as a way for specialist physicians to review cases asynchronously and provide clinical answers without a formal patient encounter. Meanwhile, large language models (LLMs) have approached human-level performance on structured clinical tasks, but their real-world effectiveness requires evaluation, which is bottlenecked by time-intensive manual physician review. To address this, we evaluate two automated methods: LLM-as-judge and a decompose-then-verify framework that breaks down AI answers into verifiable claims against human eConsult responses. Using 40 real-world physician-to-physician eConsults, we compared AI-generated responses to human answers using both physician raters and automated tools. LLM-as-judge outperformed decompose-then-verify, achieving human-level concordance assessment with an F1-score of 0.89 (95% CI: 0.750, 0.960) and Cohen's kappa of 0.75 (95% CI: 0.47, 0.90), comparable to physician inter-rater agreement (κ = 0.69-0.90; 95% CI: 0.43-1.0).
Publisher DOI
MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries
2025-12-01 · 1 citation
article · Open access · Senior author
Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an "LLM Jury" (a multi-LLM majority vote) assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen's κ = 81%), a performance statistically non-inferior to that of a single human expert (κ = 67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.
Publisher DOI
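Several of the abstracts above report agreement between an automated majority vote and a human panel using Cohen's kappa. The snippet below is a minimal sketch of that calculation, not the authors' code: the function names `majority_vote` and `cohens_kappa`, the tie-breaking rule, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(votes):
    """Return True if strictly more voters said True than False.

    Ties resolve to False; this tie rule is an assumption for the sketch.
    """
    counts = Counter(votes)
    return counts[True] > counts[False]


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum((count_a[l] / n) * (count_b[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): three jurors vote on whether each of four key facts
# appears in a summary; the panel column is the human reference label.
jury_votes = [
    [True, True, False],
    [True, False, True],
    [False, False, True],
    [False, False, False],
]
jury_labels = [majority_vote(v) for v in jury_votes]
panel_labels = [True, False, False, False]
kappa = cohens_kappa(jury_labels, panel_labels)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it can be much lower than the simple percentage of matching labels when one label dominates.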

Recent grants

NIH Grant R03CA136071
NIH · $153k · 2012
Data-Mining Clinical Decision Support from Electronic Health Records
NIH · $893k · 2015–2020

Frequent coauthors

Jason Hom
Palo Alto University
81 shared
David Ouyang
Cedars-Sinai Smidt Heart Institute
79 shared
Jeffrey Chi
Stanford Medicine
62 shared
Steven M. Asch
56 shared
Mary K. Goldstein
Stanford University
27 shared
Russ B. Altman
Stanford University
26 shared
Michael Baiocchi
Stanford University
26 shared
Robert J. Gallo
VA Palo Alto Health Care System
26 shared

Education

Fellow, Medical Informatics
Veterans Affairs Palo Alto, Stanford
2016
Resident, Internal Medicine
Stanford University Hospital
2014
Intern, Internal Medicine
Stanford University Hospital
2012
M.D., Medical Scientist Training Program
University of California Irvine
2011
Ph.D., School of Information & Computer Science
University of California Irvine
2009
Bachelor of Science, Cybernetics with Specialization in Computer Studies
University of California Los Angeles
2000
