Anoop Mayampurath

Assistant Professor · Verified

University of Wisconsin-Madison · Biostatistics and Medical Informatics

Active 2004–2026

h-index: 27

Citations: 3.3k

Papers: 144 (84 in last 5y)

Funding: $738k

About

Anoop Mayampurath is an Assistant Professor in the Division of Pulmonary Medicine within the Department of Medicine at the University of Wisconsin-Madison School of Medicine and Public Health. His research focuses on developing models to predict patient outcomes using electronic health record data and machine learning techniques. He co-leads the ICU Data Science Lab with Drs. Churpek and Afshar, where the team utilizes structured, unstructured, and image data to address various clinical problems. His particular interest lies in creating explainable models for predicting outcomes in hospitalized children.

Research topics

Medicine
Computer Science
Nursing
Internal medicine
Machine Learning
Emergency medicine
Political Science
Physical therapy
Artificial Intelligence
Medical emergency
Internet privacy
Pediatrics
Psychiatry
Family medicine
Data science
Intensive care medicine
Medical education
Environmental health
Business

Selected publications

Explainable multimodal deep learning models for variable-length sequences in critically ill patients
Journal of Biomedical Informatics · 2026-02-24
article · Open access
Letter: Provider Intuition Predicts Future Retention in Care Among People with HIV
AIDS Patient Care and STDs · 2026-03-14
article · Senior author
Comparing prognostic performance and reasoning between large language models and physicians
medRxiv · 2026-04-25
article · Open access
IMPORTANCE: Physicians routinely prognosticate to guide care delivery and shared decision making, particularly when caring for patients with critical illnesses. Yet, these physician estimates are prone to inaccuracy and uncertainty. Artificial intelligence, including large language models (LLMs), shows promise in supporting or improving this prognostication. However, the performance of contemporary LLMs in prognosticating for the heterogeneous population of critically ill patients remains poorly understood. OBJECTIVE: To characterize and compare the performance of LLMs and physicians when predicting 6-month mortality for hospitalized adults who survived critical illness. DESIGN: Embedded mixed methods study with elicitation and comparison of prognostic estimates and reasoning from LLMs and practicing physicians. SETTING: The publicly available, deidentified Medical Information Mart for Intensive Care (MIMIC)-IV v2.2 dataset. PARTICIPANTS: We randomly selected 100 hospitalizations of adult survivors of critical illness. Four contemporary LLMs (OpenAI GPT-4o, o3- and o4-mini, and DeepSeek-R1) and 7 physicians provided independent prognostic estimates for each case (1,100 total estimates; 400 LLM and 700 physician). MAIN OUTCOMES AND MEASURES: For each case, LLMs and physicians used the hospital discharge summary and demographics to predict 6-month mortality (yes/no) and provide their reasoning (free text). We assessed prognostic performance using accuracy, sensitivity, and specificity, and used inductive, qualitative content analysis to characterize reasonings. RESULTS: Mean physician accuracy for predicting mortality was 70.1% (95% CI 63.7-76.4%), with sensitivity of 59.7% (95% CI 50.6-68.8%) and specificity of 80.6% (95% CI 71.7-88.2%). The top-performing LLM (OpenAI o4-mini) accuracy was 78.0% (95% CI 70.0-86.0%), with sensitivity of 80.0% (95% CI 67.4-90.2%) and specificity of 76.0% (95% CI 63.3-88.0%). The difference between mean physician and top-performing LLM accuracy was not statistically significant (p = 0.5). Qualitative analysis revealed similar patterns in LLM and physician expressed reasoning, except that physicians regularly and explicitly reported uncertainty while LLMs did not. CONCLUSION AND RELEVANCE: In this study, LLMs and physicians achieved comparable, moderate performance in predicting 6-month mortality after critical illness, with similar patterns in expressed reasoning. Our findings suggest LLMs could be used to support prognostication in clinical practice but also raise safety concerns due to the lack of LLM uncertainty expression. KEY POINTS: Question: How does large language model (LLM) prognostic accuracy and reasoning compare to physicians when predicting 6-month mortality for adult survivors of critical illness? Findings: In this embedded mixed methods study, physicians and large language models had comparable, moderate prognostic accuracy with similar expressed reasoning patterns except that LLMs did not explicitly express uncertainty. Meaning: Large language models may be able to support physician prognostication, although the inability of LLMs to express uncertainty poses an important safety consideration.
Detecting stigmatizing language in clinical notes with large language models for addiction care
npj Health Systems · 2026-02-02 · 1 citation
article · Open access
Intensive care units (ICU) produce numerous progress notes that may contain stigmatizing language that perpetuate negative biases and punitive approaches against patients. Patients with substance use disorders are particularly vulnerable to stigma. This study examined the performance of Large Language Models (LLMs) in the identification of stigmatizing language. We annotated a dataset with over 77,000 stigmatizing and non-stigmatizing notes from the MIMIC-III database. We utilized Meta's Llama-3 8B Instruct LLM to run the following experiments for stigma detection: zero-shot; in-context learning; in-context learning with a selective retrieval; supervised fine-tuning (SFT); and keyword search. All approaches were evaluated on a held-out test set and external validation (University of Wisconsin Health System). SFT had the best performance with 97.2% accuracy, followed by in-context learning. The LLMs with in-context learning and SFT provided appropriate reasoning for false positives during human review. Both approaches identified clinical notes with stigmatizing language that were missed during annotation. SFT achieved 97.9% accuracy on external validation dataset. LLMs, particularly SFT and in-context learning, effectively identify stigmatizing language in ICU notes with high accuracy while explaining their reasoning in an asynchronous fashion and demonstrated the ability to identify novel stigmatizing language, not explicitly in training data nor existing guidelines.
Detecting Stigmatizing Language in Clinical Notes with Large Language Models for Addiction Care
medRxiv · 2025-08-12
preprint · Open access
RATIONALE: Recent studies have found that stigmatizing terms can incline physicians to pursue punitive approaches to patient care. The intensive care unit (ICU) contains large volumes of progress notes that may contain stigmatizing language, which could perpetuate negative biases against patients and affect healthcare delivery. Patients with substance use disorders (alcohol, opioid, and non-opioid drugs) are particularly vulnerable to stigma. This study aimed to examine the performance of Large Language Models (LLMs) in the identification of stigmatizing language from ICU progress notes of patients with substance use disorders (SUD). METHODS: Clinical notes were sampled from the Medical Information Mart for Intensive Care (MIMIC)-III, which contains 2,083,180 ICU notes. These 2,083,180 notes were passed into a rule-based labeling approach followed by manual verification for more ambiguous cases. The labeling approach followed the NIH guidelines on stigma in SUD. The labeling process resulted in identifying 38,552 stigmatizing encounters. To design our cohort, we randomly sampled an equivalent amount of non-stigmatizing encounters to create a dataset with 77,104 notes. This cohort was organized into train/development/test datasets (70/15/15). We utilized Meta's Llama-3 8B Instruct LLM to run the following experiments for stigma detection: (1) prompts with instructions that adhere to the NIH terms (Zero-Shot); (2) prompts with instructions and examples (in-context learning); (3) in-context learning with a selective retrieval system for the NIH terms (Retrieval Augmented Generation-RAG); and (4) supervised fine-tuning (SFT). We also created a baseline model using keyword search. Evaluation was performed on the held-out test set for accuracy, macro F1 score, and error analysis. The LLM-based approaches were prompted to provide their reasoning for label prediction. 
Additionally, all approaches were evaluated on an external validation dataset from the University of Wisconsin (UW) Health System with 288,130 ICU notes. RESULTS: SFT had the best performance with 97.2% accuracy, followed by in-context learning. The LLMs with in-context learning and SFT provided appropriate reasoning for false positives during human review. Both approaches identified clinical notes with stigmatizing language that were missed during annotation (10/93 false positives for SFT and 22/186 false positives for the in-context learning approach were considered valid after human review). SFT maintained its accuracy at 97.9% on a similarly balanced external validation dataset. CONCLUSION: Our findings demonstrate that LLMs, particularly using SFT and in-context learning, effectively identify stigmatizing language in ICU notes with high accuracy while explaining their reasoning in an asynchronous fashion without needing rigorous and time-intensive manual verification involved in labeling. These models also demonstrated the ability to identify novel stigmatizing language not explicitly in training data nor existing guidelines. This study highlights the potential of LLMs in reducing stigma in clinical documentation, especially for patients with SUD. These LLMs enable identification of stigmatizing language in clinical notes that can perpetuate negative stigma towards patients and encourage rewriting of notes.
Comparison of Multimodal Deep Learning Approaches for Predicting Clinical Deterioration in Ward Patients: An Observational Cohort Study (Preprint)
2025-04-01
preprint · Open access
BACKGROUND: Implementing machine learning models to identify clinical deterioration on the wards is associated with improved outcomes. However, these models have high false positive rates and only use structured data. OBJECTIVE: We aim to compare models with and without information from clinical notes for predicting deterioration. METHODS: Adults admitted to the wards at the University of Chicago (development cohort) and University of Wisconsin-Madison (external validation cohort) were included. Predictors consisted of structured and unstructured variables extracted from notes as Concept Unique Identifiers (CUIs). We parameterized CUIs in five ways: Standard Tokenization (ST), ICD Rollup using Tokenization (ICDR-T), ICD Rollup using Binary Variables (ICDR-BV), CUIs as SapBERT Embeddings (SE), and CUI Clustering using SapBERT embeddings (CC). Each parameterization method combined with structured data and structured data-only were compared for predicting intensive care unit transfer or death in the next 24 hours using deep recurrent neural networks. RESULTS: The study included 506,076 ward patients, 4.9% of whom experienced the outcome. The SE model achieved the highest AUPRC (0.208), followed by CC (0.199) and the structured-only model (0.199), ICDR-BV (0.194), ICDR-T (0.166), and ST (0.158). The CC and structured-only models achieved the highest AUROC (0.870), followed by ICDR-T (0.867), ICDR-BV (0.866), ST (0.860), and SE (0.859). CONCLUSIONS: A multimodal model combining structured data with embeddings using SapBERT had the highest AUPRC, but performance was similar between models with and without CUIs. The addition of CUIs from notes to structured data did not meaningfully improve model performance for predicting clinical deterioration.
Age and Saving Lives in Crisis Standards of Care: A Multicenter Cohort Study of Triage Score Prognostic Accuracy
Critical Care Explorations · 2025-05-01 · 1 citation
article · Open access
IMPORTANCE: Current protocols to triage life support use scores that are biased and inaccurate. OBJECTIVES: To determine if adding age to triage protocols used in disaster scenarios improves the identification of critically ill patients likely to survive. DESIGN, SETTING, AND PARTICIPANTS: Observational cohort study from March 1, 2020, to March 1, 2022, at 22 hospitals in three networks, divided into derivation (12 hospitals) and validation cohorts (ten hospitals). Participants were critically ill adults (90% COVID-19 positive) who would have needed life support during an overwhelming case surge. Life support was defined as vasoactive medications for shock, invasive or noninvasive mechanical ventilation, or oxygen therapy with Pao2/Fio2 less than 200. MAIN OUTCOMES AND MEASURES: The primary outcome was death in the intensive care unit. We fit logistic regression models using a modified Sequential Organ Failure Assessment (SOFA) score with and without age in the derivation cohort and assessed predictive performance in the validation cohort using area under the receiver operating characteristic curves (AUCs) and compared observed and predicted mortality. RESULTS: The final analysis contained 7,660 patients with 16,711 life-support episodes. In the validation cohort, the AUC for age plus SOFA was significantly higher than the AUC for SOFA alone (0.66 vs. 0.54; p < 0.001). SOFA score substantially overpredicted mortality (13% predicted vs. 5% observed) for younger patients (< 40 yr) and underestimated mortality (14% predicted vs. 31% observed) for older patients (> 80 yr). In contrast, age plus SOFA had good calibration overall and across age groups. The addition of age improved but did not eliminate differences between observed and predicted mortality across racial-ethnic groups. CONCLUSIONS AND RELEVANCE: Age-inclusive triage better identifies ICU survivors than SOFA alone and is more equitable. 
Incorporating age into prioritization algorithms could save more lives in a crisis scenario.
Automating Evaluation of AI Text Generation in Healthcare with a Large Language Model (LLM)-as-a-Judge
medRxiv · 2025-04-22 · 16 citations
preprint · Open access
Electronic Health Records (EHRs) store vast amounts of clinical information that are difficult for healthcare providers to summarize and synthesize relevant details to their practice. To reduce cognitive load on providers, generative AI with Large Language Models have emerged to automatically summarize patient records into clear, actionable insights and offload the cognitive burden for providers. However, LLM summaries need to be precise and free from errors, making evaluations on the quality of the summaries necessary. While human experts are the gold standard for evaluations, their involvement is time-consuming and costly. Therefore, we introduce and validate an automated method for evaluating real-world EHR multi-document summaries using an LLM as the evaluator, referred to as LLM-as-a-Judge. Benchmarking against the validated Provider Documentation Summarization Quality Instrument (PDSQI)-9 for human evaluation, our LLM-as-a-Judge framework demonstrated strong inter-rater reliability with human evaluators. GPT-o3-mini achieved the highest intraclass correlation coefficient of 0.818 (95% CI 0.772, 0.854), with a median score difference of 0 from human evaluators, and completes evaluations in just 22 seconds. Overall, the reasoning models excelled in inter-rater reliability, particularly in evaluations that require advanced reasoning and domain expertise, outperforming non-reasoning models, those trained on the task, and multi-agent workflows. Cross-task validation on the Problem Summarization task similarly confirmed high reliability. By automating high-quality evaluations, medical LLM-as-a-Judge offers a scalable, efficient solution to rapidly identify accurate and safe AI-generated summaries in healthcare settings.
MoMA: a mixture-of-multimodal-agents architecture for enhancing clinical prediction modelling
npj Digital Medicine · 2025-12-09 · 1 citation
article · Open access
Multimodal electronic health record (EHR) data provide richer, complementary insights into patient health compared to single-modality data. However, effectively integrating diverse data modalities for clinical prediction modeling remains challenging due to the substantial data requirements. We introduce a novel architecture, Mixture-of-Multimodal-Agents (MoMA), designed to leverage multiple large language model (LLM) agents for clinical prediction tasks using multimodal EHR data. MoMA employs specialized LLM agents ("specialist agents") to convert non-textual modalities, such as medical images and laboratory results, into structured textual summaries. These summaries, together with clinical notes, are combined by another LLM ("aggregator agent") to generate a unified multimodal summary, which is then used by a third LLM ("predictor agent") to produce clinical predictions. Evaluating MoMA with different modality combinations and prediction settings, MoMA outperforms existing methods on three prediction tasks using private datasets, highlighting its enhanced accuracy and flexibility across various tasks.
An Information-Theoretic Perspective on Multi-LLM Uncertainty Estimation
medRxiv · 2025-07-10
preprint · Open access
Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naïve ensemble baselines.

Recent grants

Using Machine Learning to Predict Clinical Deterioration in Hospitalized Children
NIH · $738k · 2019–2024

Frequent coauthors

Samuel L. Volchenboum
39 shared
Matthew M. Churpek
University of Wisconsin–Madison
37 shared
Richard Smith
Pacific Northwest National Laboratory
26 shared
Jagoda Jasielec
Janssen (United States)
22 shared
Majid Afshar
22 shared
Mattina M. Alonge
22 shared
Shaun Rosebeck
22 shared
Andrzej Jakubowiak
22 shared

Education

Ph.D., Biostatistics
University of Wisconsin-Madison
2006
M.S., Biostatistics
University of Wisconsin-Madison
2003
B.S., Mathematics
University of Kerala
2001
