
Peter Szolovits
Massachusetts Institute of Technology · Electrical Engineering & Computer Science
Active 1899–2025
About
Peter Szolovits is a Professor of Computer Science and Engineering at MIT, specializing in Artificial Intelligence and Decision-making. His research areas include AI for Healthcare and Life Sciences, Artificial Intelligence and Machine Learning, and Biological and Medical Devices and Systems. He combines expertise from computer science and electrical engineering to develop systems that interact with the external world through perception, communication, and action, focusing on learning, decision-making, and adaptation in changing environments. Szolovits's work leverages computational, theoretical, and experimental tools to address shared human challenges, particularly in healthcare applications.
Research topics
- Computer science
- Artificial intelligence
- Medicine
- Machine learning
- Natural language processing
Selected publications
UNC Libraries · 2025-03-18
Article · Open access
Despite recent methodology advancements in clinical natural language processing (NLP), the adoption of clinical NLP models within the translational research community remains hindered by process heterogeneity and human factor variations. Concurrently, these factors also dramatically increase the difficulty in developing NLP models in multi-site settings, which is necessary for algorithm robustness and generalizability. Here, we reported on our experience developing an NLP solution for Coronavirus Disease 2019 (COVID-19) signs and symptom extraction in an open NLP framework from a subset of sites participating in the National COVID Cohort (N3C). We then empirically highlight the benefits of multi-site data for both symbolic and statistical methods, as well as highlight the need for federated annotation and evaluation to resolve several pitfalls encountered in the course of these efforts.
Toward responsible AI governance: Balancing multi-stakeholder perspectives on AI in healthcare
International Journal of Medical Informatics · 2025-06-19 · 14 citations
Article
arXiv.org · 2025-10-17
Preprint · Open access
The 6th Annual Conference on Health, Inference, and Learning (CHIL 2025), hosted by the Association for Health Learning and Inference (AHLI), was held in person on June 25-27, 2025, at the University of California, Berkeley, in Berkeley, California, USA. As part of this year's program, we hosted Research Roundtables to catalyze collaborative, small-group dialogue around critical, timely topics at the intersection of machine learning and healthcare. Each roundtable was moderated by a team of senior and junior chairs who fostered open exchange, intellectual curiosity, and inclusive engagement. The sessions emphasized rigorous discussion of key challenges, exploration of emerging opportunities, and collective ideation toward actionable directions in the field. In total, eight roundtables were held by 19 roundtable chairs on topics of "Explainability, Interpretability, and Transparency," "Uncertainty, Bias, and Fairness," "Causality," "Domain Adaptation," "Foundation Models," "Learning from Small Medical Data," "Multimodal Methods," and "Scalable, Translational Healthcare Solutions."
Large Language Models Seem Miraculous, but Science Abhors Miracles
NEJM AI · 2024-05-09 · 9 citations
Article · 1st author · Corresponding
Generative artificial intelligence models exhibit amazing abilities but make serious errors. We have a very limited understanding of why they work well at all or of the circumstances under which they give incorrect responses. This suggests the need for additional research and great caution in deploying such models for critical applications. Since the availability of ChatGPT in late 2022, based on OpenAI's GPT 3.5 large language model, those of us who have explored its capabilities have been amazed by its facility with language and its abilities to generate coherent (and even insightful) synopses; answer questions about everything from general knowledge to domain-specific topics; offer advice on how to accomplish tasks, including for medical diagnosis, therapy, and prognosis; deduce consequences of assumptions; and even write effective computer programs. Nevertheless, I would urge great caution in adopting such methods in health care, mainly because of our lack of understanding of how they accomplish the miraculous-seeming things they are able to do.
Journal of the American Medical Informatics Association · 2023-08-08 · 24 citations
Article · Open access
Despite recent methodology advancements in clinical natural language processing (NLP), the adoption of clinical NLP models within the translational research community remains hindered by process heterogeneity and human factor variations. Concurrently, these factors also dramatically increase the difficulty in developing NLP models in multi-site settings, which is necessary for algorithm robustness and generalizability. Here, we reported on our experience developing an NLP solution for Coronavirus Disease 2019 (COVID-19) signs and symptom extraction in an open NLP framework from a subset of sites participating in the National COVID Cohort (N3C). We then empirically highlight the benefits of multi-site data for both symbolic and statistical methods, as well as highlight the need for federated annotation and evaluation to resolve several pitfalls encountered in the course of these efforts.
Using Machine Learning to Develop Smart Reflex Testing Protocols
arXiv (Cornell University) · 2023-02-01 · 2 citations
Preprint · Open access
Objective: Reflex testing protocols allow clinical laboratories to perform second line diagnostic tests on existing specimens based on the results of initially ordered tests. Reflex testing can support optimal clinical laboratory test ordering and diagnosis. In current clinical practice, reflex testing typically relies on simple "if-then" rules; however, this limits their scope since most test ordering decisions involve more complexity than a simple rule will allow. Here, using the analyte ferritin as an example, we propose an alternative machine learning-based approach to "smart" reflex testing with a wider scope and greater impact than traditional rule-based approaches. Methods: Using patient data, we developed a machine learning model to predict whether a patient getting CBC testing will also have ferritin testing ordered, consider applications of this model to "smart" reflex testing, and evaluate the model by comparing its performance to possible rule-based approaches. Results: Our underlying machine learning models performed moderately well in predicting ferritin test ordering and demonstrated greater suitability to reflex testing than rule-based approaches. Using chart review, we demonstrate that our model may improve ferritin test ordering. Finally, as a secondary goal, we demonstrate that ferritin test results are missing not at random (MNAR), a finding with implications for unbiased imputation of missing test results. Conclusions: Machine learning may provide a foundation for new types of reflex testing with enhanced benefits for clinical diagnosis and laboratory utilization management.
Right, No Matter Why: AI Fact-checking and AI Authority in Health-related Inquiry Settings
arXiv (Cornell University) · 2023-10-22 · 1 citation
Preprint · Open access · Senior author
Previous research on expert advice-taking shows that humans exhibit two contradictory behaviors: on the one hand, people tend to overvalue their own opinions undervaluing the expert opinion, and on the other, people often defer to other people's advice even if the advice itself is rather obviously wrong. In our study, we conduct an exploratory evaluation of users' AI-advice accepting behavior when evaluating the truthfulness of a health-related statement in different "advice quality" settings. We find that even feedback that is confined to just stating that "the AI thinks that the statement is false/true" results in more than half of people moving their statement veracity assessment towards the AI suggestion. The different types of advice given influence the acceptance rates, but the sheer effect of getting a suggestion is often bigger than the suggestion-type effect.
Coding Inequity: Assessing GPT-4’s Potential for Perpetuating Racial and Gender Biases in Healthcare
medRxiv · 2023-07-16 · 30 citations
Preprint · Open access
Background: Large language models (LLMs) such as GPT-4 hold great promise as transformative tools in healthcare, ranging from automating administrative tasks to augmenting clinical decision-making. However, these models also pose a serious danger of perpetuating biases and delivering incorrect medical diagnoses, which can have a direct, harmful impact on medical care. Methods: Using the Azure OpenAI API, we tested whether GPT-4 encodes racial and gender biases and examined the impact of such biases on four potential applications of LLMs in the clinical domain, namely medical education, diagnostic reasoning, plan generation, and patient assessment. We conducted experiments with prompts designed to resemble typical use of GPT-4 within clinical and medical education applications. We used clinical vignettes from NEJM Healer and from published research on implicit bias in healthcare. GPT-4 estimates of the demographic distribution of medical conditions were compared to true U.S. prevalence estimates. Differential diagnosis and treatment planning were evaluated across demographic groups using standard statistical tests for significance between groups. Findings: We find that GPT-4 does not appropriately model the demographic diversity of medical conditions, consistently producing clinical vignettes that stereotype demographic presentations. The differential diagnoses created by GPT-4 for standardized clinical vignettes were more likely to include diagnoses that stereotype certain races, ethnicities, and gender identities. Assessment and plans created by the model showed significant association between demographic attributes and recommendations for more expensive procedures as well as differences in patient perception. Interpretation: Our findings highlight the urgent need for comprehensive and transparent bias assessments of LLM tools like GPT-4 for every intended use case before they are integrated into clinical care. We discuss the potential sources of these biases and potential mitigation strategies prior to clinical implementation.
Structure-inducing pre-training
Nature Machine Intelligence · 2023-06-01 · 20 citations
Article · Open access
Language model pre-training and the derived general-purpose methods have reshaped machine learning research. However, there remains considerable uncertainty regarding why pre-training improves the performance of downstream tasks. This challenge is pronounced when using language model pre-training in domains outside of natural language. Here we investigate this problem by analysing how pre-training methods impose relational structure in induced per-sample latent spaces, that is, what constraints do pre-training methods impose on the distance or geometry between the pre-trained embeddings of samples. A comprehensive review of pre-training methods reveals that this question remains open, despite theoretical analyses showing the importance of understanding this form of induced structure. Based on this review, we introduce a pre-training framework that enables a granular and comprehensive understanding of how relational structure can be induced. We present a theoretical analysis of the framework from the first principles and establish a connection between the relational inductive bias of pre-training and fine-tuning performance. Empirical studies spanning three data modalities and ten fine-tuning tasks confirm theoretical analyses, inform the design of novel pre-training methods and establish consistent improvements over a compelling suite of methods.
Do We Still Need Clinical Language Models?
arXiv (Cornell University) · 2023-02-16 · 50 citations
Preprint · Open access
Although recent advances in scaling large language models (LLMs) have resulted in improvements on many NLP tasks, it remains unclear whether these models trained primarily with general web text are the right tool in highly specialized, safety critical domains such as clinical text. Recent results have suggested that LLMs encode a surprising amount of medical knowledge. This raises an important question regarding the utility of smaller domain-specific language models. With the success of general-domain LLMs, is there still a need for specialized clinical models? To investigate this question, we conduct an extensive empirical analysis of 12 language models, ranging from 220M to 175B parameters, measuring their performance on 3 different clinical tasks that test their ability to parse and reason over electronic health records. As part of our experiments, we train T5-Base and T5-Large models from scratch on clinical notes from MIMIC III and IV to directly investigate the efficiency of clinical tokens. We show that relatively small specialized clinical models substantially outperform all in-context learning approaches, even when finetuned on limited annotated data. Further, we find that pretraining on clinical tokens allows for smaller, more parameter-efficient models that either match or outperform much larger language models trained on general text. We release the code and the models used under the PhysioNet Credentialed Health Data license and data use agreement.
Recent grants
NIH · $146k · 1987
NIH · $6.2M · 1994
NIH · $697k · 1997
NIH · $1.4M · 2012
NIH · $3.5M · 1998
Frequent coauthors
- 128 shared
Isaac S. Kohane
Harvard University
- 83 shared
Shawn N. Murphy
- 76 shared
Katherine P. Liao
Harvard University
- 75 shared
Tianxi Cai
Harvard University
- 49 shared
Stanley Y. Shaw
- 48 shared
Susanne Churchill
Harvard University
- 48 shared
Ashwin N. Ananthakrishnan
Harvard University
- 48 shared
Vivian S. Gainer
Labs
MIT EECS Artificial Intelligence + Decision-making Lab · PI
Education
- 1974
PhD, Information Science
California Institute of Technology
- 1970
BS, Physics
California Institute of Technology
Awards & honors
- 2021 EECS Awards