
Lu Tian
Stanford University · Statistics
Active 1993–2024
Research topics
- Medicine
- Artificial Intelligence
- Internal medicine
- Political Science
- Computer Science
- World Wide Web
- Physical therapy
- Applied psychology
- Medical education
- Physical medicine and rehabilitation
- Psychology
- Immunology
Selected publications
Red Teaming Large Language Models in Medicine: Real-World Insights on Model Behavior
medRxiv (Cold Spring Harbor Laboratory) · 2024 · 9 citations
- Computer Science
- Artificial Intelligence
- Political Science
Abstract. Background: The integration of large language models (LLMs) in healthcare offers immense opportunity to streamline healthcare tasks, but also carries risks such as response accuracy and bias perpetration. To address this, we conducted a red-teaming exercise to assess LLMs in healthcare and developed a dataset of clinically relevant scenarios for future teams to use. Methods: We convened 80 multi-disciplinary experts to evaluate the performance of popular LLMs across multiple medical scenarios. Teams composed of clinicians, medical and engineering students, and technical professionals stress-tested LLMs with real world clinical use cases. Teams were given a framework comprising four categories to analyze for inappropriate responses: Safety, Privacy, Hallucinations, and Bias. Prompts were tested on GPT-3.5, GPT-4.0, and GPT-4.0 with the Internet. Six medically trained reviewers subsequently reanalyzed the prompt-response pairs, with dual reviewers for each prompt and a third to resolve discrepancies. This process allowed for the accurate identification and categorization of inappropriate or inaccurate content within the responses. Results: There were a total of 382 unique prompts, with 1146 total responses across three iterations of ChatGPT (GPT-3.5, GPT-4.0, GPT-4.0 with Internet). 19.8% of the responses were labeled as inappropriate, with GPT-3.5 accounting for the highest percentage at 25.7% while GPT-4.0 and GPT-4.0 with internet performing comparably at 16.2% and 17.5% respectively. Interestingly, 11.8% of responses were deemed appropriate with GPT-3.5 but inappropriate in updated models, highlighting the ongoing need to evaluate evolving LLMs. Conclusion: The red-teaming exercise underscored the benefits of interdisciplinary efforts, as this collaborative model fosters a deeper understanding of the potential limitations of LLMs in healthcare and sets a precedent for future red teaming events in the field. Additionally, we present all prompts and outputs as a benchmark for future LLM model evaluations.
1-2 Sentence Description: As a proof-of-concept, we convened an interactive “red teaming” workshop in which medical and technical professionals stress-tested popular large language models (LLMs) through publicly available user interfaces on clinically relevant scenarios. Results demonstrate a significant proportion of inappropriate responses across GPT-3.5, GPT-4.0, and GPT-4.0 with Internet (25.7%, 16.2%, and 17.5%, respectively) and illustrate the valuable role that non-technical clinicians can play in evaluating models.
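The headline figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes each of the three models answered every one of the 382 prompts exactly once, so the overall inappropriate rate is the unweighted mean of the per-model rates. The variable names are hypothetical.

```python
# Figures quoted in the abstract above.
unique_prompts = 382
inappropriate_pct = {
    "GPT-3.5": 25.7,
    "GPT-4.0": 16.2,
    "GPT-4.0 with Internet": 17.5,
}

# Assumption: every model received every prompt once.
total_responses = unique_prompts * len(inappropriate_pct)

# Under that assumption the overall rate is the plain average of the three rates.
overall_pct = round(sum(inappropriate_pct.values()) / len(inappropriate_pct), 1)

print(total_responses)  # 1146, matching the reported response count
print(overall_pct)      # 19.8, matching the reported overall rate
```

Both values agree with the abstract, which is consistent with an equal number of responses per model.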
JAMA · 2021 · 180 citations
- Medicine
- Physical therapy
- Physical medicine and rehabilitation
Importance: Supervised high-intensity walking exercise that induces ischemic leg symptoms is the first-line therapy for people with lower-extremity peripheral artery disease (PAD), but adherence is poor. Objective: To determine whether low-intensity home-based walking exercise at a comfortable pace significantly improves walking ability in people with PAD vs high-intensity home-based walking exercise that induces ischemic leg symptoms and vs a nonexercise control. Design, Setting, and Participants: Multicenter randomized clinical trial conducted at 4 US centers and including 305 participants. Enrollment occurred between September 25, 2015, and December 11, 2019; final follow-up was October 7, 2020. Interventions: Participants with PAD were randomized to low-intensity walking exercise (n = 116), high-intensity walking exercise (n = 124), or nonexercise control (n = 65) for 12 months. Both exercise groups were asked to walk for exercise in an unsupervised setting 5 times per week for up to 50 minutes per session wearing an accelerometer to document exercise intensity and time. The low-intensity group walked at a pace without ischemic leg symptoms. The high-intensity group walked at a pace eliciting moderate to severe ischemic leg symptoms. Accelerometer data were viewable to a coach who telephoned participants weekly for 12 months and helped them adhere to their prescribed exercise. The nonexercise control group received weekly educational telephone calls for 12 months. Main Outcomes and Measures: The primary outcome was mean change in 6-minute walk distance at 12 months (minimum clinically important difference, 8-20 m). Results: Among 305 randomized patients (mean age, 69.3 [SD, 9.5] years, 146 [47.9%] women, 181 [59.3%] Black patients), 250 (82%) completed 12-month follow-up. 
The 6-minute walk distance changed from 332.1 m at baseline to 327.5 m at 12-month follow-up in the low-intensity exercise group (within-group mean change, -6.4 m [95% CI, -21.5 to 8.8 m]; P = .34) and from 338.1 m to 371.2 m in the high-intensity exercise group (within-group mean change, 34.5 m [95% CI, 20.1 to 48.9 m]; P < .001) and the mean change for the between-group comparison was -40.9 m (97.5% CI, -61.7 to -20.0 m; P < .001). The 6-minute walk distance changed from 328.1 m at baseline to 317.5 m at 12-month follow-up in the nonexercise control group (within-group mean change, -15.1 m [95% CI, -35.8 to 5.7 m]; P = .10), which was not significantly different from the change in the low-intensity exercise group (between-group mean change, 8.7 m [97.5% CI, -17.0 to 34.4 m]; P = .44). Of 184 serious adverse events, the event rate per participant was 0.64 in the low-intensity group, 0.65 in the high-intensity group, and 0.46 in the nonexercise control group. One serious adverse event in each exercise group was related to study participation. Conclusions and Relevance: Among patients with PAD, low-intensity home-based exercise was significantly less effective than high-intensity home-based exercise and was not significantly different from the nonexercise control for improving 6-minute walk distance. These results do not support the use of low-intensity home-based walking exercise for improving objectively measured walking performance in patients with PAD. Trial Registration: ClinicalTrials.gov Identifier: NCT02538900.
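The between-group results quoted above can be checked against the within-group changes. This is a minimal sketch and not the trial's analysis: it assumes each between-group figure is simply the difference of the two within-group mean changes, which the reported numbers happen to match. The variable names are hypothetical.

```python
# Within-group mean changes in 6-minute walk distance (metres), from the abstract.
low_intensity = -6.4
high_intensity = 34.5
control = -15.1

# Assumption: between-group change = difference of within-group mean changes.
low_vs_high = round(low_intensity - high_intensity, 1)
low_vs_control = round(low_intensity - control, 1)

# Upper bound of the minimum clinically important difference stated as 8-20 m.
mcid_upper = 20

print(low_vs_high)                    # -40.9, as reported
print(low_vs_control)                 # 8.7, as reported
print(abs(low_vs_high) > mcid_upper)  # True: the gap exceeds the 8-20 m range
```

The low-intensity versus high-intensity gap is about twice the upper end of the stated minimum clinically important difference, while the low-intensity versus control gap falls within that range and was not statistically significant.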
Annals of the Rheumatic Diseases · 2021 · 101 citations
- Medicine
- Immunology
- Internal medicine
Recent grants
NIH · $42.6M · 2020–2030
Core E: Outreach, Recruitment and Education Core
NIH · $15.2M · 2020
Statistical Learning for Precision Medicine Based on Multi-Source Data
NIH · $322k · 2008–2018
Statistical Learning for Precision Medicine Based on Multi-Source Data
NIH · $4.2M · 2008–2028
Statistical Learning for Precision Medicine Based on Multi-Source Data
NIH · $257k · 2008–2018
Frequent coauthors
- 269 shared
Mary McDermott
Northwestern University
- 232 shared
Luigi Ferrucci
National Institutes of Health
- 227 shared
Jack M. Guralnik
University of Maryland, Baltimore
- 226 shared
Michael H. Criqui
University of California, San Diego
- 196 shared
Kiang Liu
Northwestern University
- 144 shared
Guanglin Li
Chinese Academy of Sciences
- 111 shared
Yihua Liao
- 104 shared
Xiangxin Li
University of Chinese Academy of Sciences