
Nigam Shah
Associate Professor of Medicine and of Biomedical Data Science · Stanford University · Biomedical Data Science
Active 2012–2024
About
Nigam Shah is a faculty member involved in AI for Health at Stanford University. His research interests include precision healthcare, healthcare delivery, real-world data, causality, and value-based care. He is associated with Stanford Engineering and is part of the AI for Health team, contributing to advancements in applying artificial intelligence to improve health outcomes and healthcare systems.
Research topics
- Computer Science
- Natural Language Processing
- Data science
- Database
- Multimedia
- Marketing
- Philosophy
- Reliability engineering
- Linguistics
- Chemistry
- Medicine
- Engineering
- Business
Selected publications
Red Teaming Large Language Models in Medicine: Real-World Insights on Model Behavior
medRxiv (Cold Spring Harbor Laboratory) · 2024 · 9 citations
- Computer Science
- Artificial Intelligence
- Political Science
BACKGROUND: The integration of large language models (LLMs) in healthcare offers immense opportunity to streamline healthcare tasks, but also carries risks such as response accuracy and bias perpetration. To address this, we conducted a red-teaming exercise to assess LLMs in healthcare and developed a dataset of clinically relevant scenarios for future teams to use. METHODS: We convened 80 multi-disciplinary experts to evaluate the performance of popular LLMs across multiple medical scenarios. Teams composed of clinicians, medical and engineering students, and technical professionals stress-tested LLMs with real-world clinical use cases. Teams were given a framework comprising four categories to analyze for inappropriate responses: Safety, Privacy, Hallucinations, and Bias. Prompts were tested on GPT-3.5, GPT-4.0, and GPT-4.0 with the Internet. Six medically trained reviewers subsequently reanalyzed the prompt-response pairs, with dual reviewers for each prompt and a third to resolve discrepancies. This process allowed for the accurate identification and categorization of inappropriate or inaccurate content within the responses. RESULTS: There were a total of 382 unique prompts, with 1146 total responses across three iterations of ChatGPT (GPT-3.5, GPT-4.0, GPT-4.0 with Internet). 19.8% of the responses were labeled as inappropriate, with GPT-3.5 accounting for the highest percentage at 25.7%, while GPT-4.0 and GPT-4.0 with Internet performed comparably at 16.2% and 17.5%, respectively. Interestingly, 11.8% of responses were deemed appropriate with GPT-3.5 but inappropriate in updated models, highlighting the ongoing need to evaluate evolving LLMs. CONCLUSION: The red-teaming exercise underscored the benefits of interdisciplinary efforts, as this collaborative model fosters a deeper understanding of the potential limitations of LLMs in healthcare and sets a precedent for future red teaming events in the field. Additionally, we present all prompts and outputs as a benchmark for future LLM model evaluations.
1-2 SENTENCE DESCRIPTION: As a proof-of-concept, we convened an interactive “red teaming” workshop in which medical and technical professionals stress-tested popular large language models (LLMs) through publicly available user interfaces on clinically relevant scenarios. Results demonstrate a significant proportion of inappropriate responses across GPT-3.5, GPT-4.0, and GPT-4.0 with Internet (25.7%, 16.2%, and 17.5%, respectively) and illustrate the valuable role that non-technical clinicians can play in evaluating models.
To do no harm — and the most good — with AI in health care
Nature Medicine · 2024 · 69 citations
- Political Science
- Medicine
- Nursing
FactEHR: A Dataset for Evaluating Factuality in Clinical Notes Using LLMs
arXiv (Cornell University) · 2024 · 1 citation
Senior author · Corresponding
- Computer Science
- Natural Language Processing
Verifying and attributing factual claims is essential for the safe and effective use of large language models (LLMs) in healthcare. A core component of factuality evaluation is fact decomposition, the process of breaking down complex clinical statements into fine-grained atomic facts for verification. Recent work has proposed fact decomposition, which uses LLMs to rewrite source text into concise sentences conveying a single piece of information, to facilitate fine-grained fact verification. However, clinical documentation poses unique challenges for fact decomposition due to dense terminology and diverse note types and remains understudied. To address this gap and explore these challenges, we present FactEHR, an NLI dataset consisting of document fact decompositions for 2,168 clinical notes spanning four types from three hospital systems, resulting in 987,266 entailment pairs. We assess the generated facts on different axes, from entailment evaluation of LLMs to a qualitative analysis. Our evaluation, including review by the clinicians, reveals substantial variability in LLM performance for fact decomposition. For example, Gemini-1.5-Flash consistently generates relevant and accurate facts, while Llama-3 8B produces fewer and less consistent outputs. The results underscore the need for better LLM capabilities to support factual verification in clinical text.
JCO Oncology Practice · 2022 · 31 citations
- Medicine
- Family medicine
- Nursing
PURPOSE: Patients with metastatic cancer benefit from advance care planning (ACP) conversations. We aimed to improve ACP using a computer model to select high-risk patients, with shorter predicted survival, for conversations with providers and lay care coaches. Outcomes included ACP documentation frequency and end-of-life quality measures. METHODS: In this study of a quality improvement initiative, providers in four medical oncology clinics received Serious Illness Care Program training. Two clinics (thoracic/genitourinary) participated in an intervention, and two (cutaneous/sarcoma) served as controls. ACP conversations were documented in a centralized form in the electronic medical record. In the intervention, providers and care coaches received weekly e-mails highlighting upcoming clinic patients with < 2 year computer-predicted survival and no prior prognosis documentation. Care coaches contacted these patients for an ACP conversation (excluding prognosis). Providers were asked to discuss and document prognosis. RESULTS: = .04). CONCLUSION: Combining a computer prognosis model with care coaches increased ACP documentation.
An open repository of real-time COVID-19 indicators
Proceedings of the National Academy of Sciences · 2021 · 71 citations
- Computer Science
- Internet privacy
- Data science
The COVID-19 pandemic presented enormous data challenges in the United States. Policy makers, epidemiological modelers, and health researchers all require up-to-date data on the pandemic and relevant public behavior, ideally at fine spatial and temporal resolution. The COVIDcast API is our attempt to fill this need: Operational since April 2020, it provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from deidentified medical claims data, massive online surveys, cell phone mobility data, and internet search trends. These are available at a fine geographic resolution (mostly at the county level) and are updated daily. The COVIDcast API also tracks all revisions to historical data, allowing modelers to account for the frequent revisions and backfill that are common for many public health data sources. All of the data are available in a common format through the API and accompanying R and Python software packages. This paper describes the data sources and signals, and provides examples demonstrating that the auxiliary signals in the COVIDcast API present information relevant to tracking COVID activity, augmenting traditional public health reporting and empowering research and decision-making.
Summarizing patients like mine via an on-demand consultation service
Proceedings of the VLDB Endowment · 2021
1st author · Corresponding
- Computer Science
- Data science
Using evidence derived from previously collected medical records to guide patient care has been a long-standing vision of clinicians and informaticians, and one with the potential to transform medical practice. We offered an on-demand consultation service to derive evidence from millions of other patients' data to answer clinician questions and support their bedside decision making. We describe the design and implementation of the service as well as a summary of our experience in responding to the first 100 requests. We also review a new paradigm for scalable, time-aware clinical data search and describe the design, implementation, and use of a search engine realizing this paradigm.
Treatment and Monitoring Variability in US Metastatic Breast Cancer Care
JCO Clinical Cancer Informatics · 2021 · 19 citations
- Medicine
- Oncology
- Internal medicine
PURPOSE: Treatment and monitoring options for patients with metastatic breast cancer (MBC) are increasing, but little is known about variability in care. We sought to improve understanding of MBC care and its correlates by analyzing real-world claims data using a search engine with a novel query language to enable temporal electronic phenotyping. METHODS: Using the Advanced Cohort Engine, we identified 6,180 women who met criteria for having estrogen receptor-positive, human epidermal growth factor receptor 2-negative MBC from IBM MarketScan US insurance claims (2007-2014). We characterized treatment, monitoring, and hospice usage, along with clinical and nonclinical factors affecting care. RESULTS: < .0001). CONCLUSION: Variability in US MBC care is explained by patient and disease factors and by nonclinical factors such as geographic region, suggesting that treatment decisions are influenced by local practice patterns and/or resources. A search engine designed to express complex electronic phenotypes from longitudinal patient records enables the identification of variability in patient care, helping to define disparities and areas for improvement.
Rates of Co-infection Between SARS-CoV-2 and Other Respiratory Pathogens
JAMA · 2020 · 819 citations
- Medicine
- Virology
- Intensive care medicine
This study describes the prevalence of SARS-CoV-2 co-infection with noncoronavirus respiratory pathogens in a sample of symptomatic patients undergoing PCR testing in March 2020.
Journal of Clinical Virology · 2020 · 62 citations
- Medicine
- Emergency medicine
- Internal medicine
Data Quality Assessment of Laboratory Data
AMIA · 2020
Senior author · Corresponding
- Computer Science
- Reliability engineering
Frequent coauthors
- 4 shared
Suzanne Tamang
New York State Office of Mental Health
- 3 shared
Sepideh Modrek
San Francisco State University
- 3 shared
Mark R. Cullen
- 2 shared
Julie Kuznetsov
Stanford University
- 2 shared
K. Bretonnel Cohen
University of Colorado Anschutz Medical Campus
- 2 shared
Douglas W. Blayney
Stanford University
- 2 shared
Manali I. Patel
Stanford University
- 1 shared
Casey S. Greene
Labs
Shah Lab · PI