Kyle Alexander Bolo
Assistant Professor of Ophthalmology · University of Southern California · Ophthalmology
Active 2018–2026
About
Kyle Alexander Bolo, MD, is an assistant professor of clinical ophthalmology at the Keck School of Medicine of USC and a clinician scientist specializing in glaucoma at the USC Roski Eye Institute, part of Keck Medicine of USC. He diagnoses and treats patients with a variety of eye diseases, with extensive expertise in glaucoma and cataract surgery. His research focuses on the interface of population eye health, bioinformatics, and machine learning, aiming to investigate the impact of glaucoma on different patient populations and health systems, understand the effects of glaucoma screening, and modernize detection and care through data-driven and machine learning tools. Dr. Bolo's work promotes reproducible data science research and advances the use of innovative technologies in ophthalmology. He is also dedicated to educating the next generation of physicians in his role at the Keck School of Medicine.
Research topics
- Ophthalmology
- Artificial Intelligence
- Medicine
- Computer Science
- Anatomy
- Statistics
- Optometry
- Mathematics
Selected publications
Ophthalmology Science · 2026-03-07
article · Open access
Objective: To compare the performance of state-of-the-art Gemini and GPT models on ophthalmology board-style questions and examine variation by subspecialty, cognitive complexity, and question type. Design: A cross-sectional evaluation of 12 distinct large language model (LLM) configurations using a standardized ophthalmology question set. Subjects: Five hundred multiple-choice questions (250 from the American Academy of Ophthalmology's Basic and Clinical Science Course [BCSC]; 250 StatPearls). Methods: Twelve configurations of the following LLMs: Gemini 3 Pro, Gemini 2.5 Pro, GPT-5.1 Pro, GPT-5 Pro, GPT-5.2, GPT-5.1, and GPT-5, interpreted the questions using standardized prompting procedures. Questions were categorized by subspecialty, multimodal content (image vs. text-only), and cognitive complexity (first, second, or third order). Accuracy, paired discordance (McNemar tests), and one-way analysis of variance with Tukey correction were used to compare performance. Human benchmarking used BCSC percent-correct data. Main Outcome Measures: Overall accuracy, subspecialty accuracy, image vs. nonimage accuracy, cognitive-complexity accuracy, and paired model-level discordance. Results: […] < 0.001), but most Tukey-corrected pairwise differences were nonsignificant. McNemar tests demonstrated significantly more correct paired responses for Gemini 3 Pro High Reasoning than for GPT-5.2 and all GPT-5/5.1 variants. Models performed markedly better on BCSC (mean 94.4%) than StatPearls (81.9%); human BCSC mean accuracy was 64.5%. Image-based items produced a 10- to 22-point accuracy decrement across all systems. Accuracy declined with increasing cognitive complexity, with the clearest separation on third-order management questions. Conclusions: Gemini 3 Pro had the best general-purpose LLM performance on ophthalmology board-style questions, providing near-perfect accuracy, while outperforming all GPT-5 family variants across domains and complexity levels. Significant deficits on image-based and third-order questions highlight persistent multimodal limitations and the need for ongoing benchmarking using challenging, clinically grounded datasets. Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
Scientific Reports · 2025-07-02 · 15 citations
article · Open access
The ability of large language models (LLMs) to accurately answer medical board-style questions reflects their potential to benefit medical education and real-time clinical decision-making. With the recent advance to reasoning models, the latest LLMs excel at addressing complex problems in benchmark math and science tests. This study assessed the performance of first-generation reasoning models (DeepSeek's R1 and R1-Lite, OpenAI's o1 Pro, and Grok 3) on 493 ophthalmology questions sourced from the StatPearls and EyeQuiz question banks. o1 Pro achieved the highest overall accuracy (83.4%), significantly outperforming DeepSeek R1 (72.5%), DeepSeek-R1-Lite (76.5%), and Grok 3 (69.2%) (p < 0.001 for all pairwise comparisons). o1 Pro also demonstrated superior performance on questions from eight of nine ophthalmologic subfields, on questions of second- and third-order cognitive complexity, and on image-based questions. DeepSeek-R1-Lite performed second best despite relatively small memory requirements, while Grok 3 had the lowest overall accuracy. These findings demonstrate that the strong performance of the first-generation reasoning models extends beyond benchmark tests to high-complexity ophthalmology questions. While these findings suggest a potential role for reasoning models in medical education and clinical practice, further research is needed to understand their performance with real-world data, their integration into educational and clinical settings, and human-AI interactions.
Ophthalmology Science · 2025-11-15
article · Open access · 1st author · Corresponding
Purpose: To compare the performance of a vision transformer-based foundation model (RETFound) and a supervised convolutional neural network (VGG-19) for detecting referable glaucoma from fundus photographs. Design: An evaluation of diagnostic technology. Participants: Six thousand one hundred sixteen participants from the Los Angeles County Department of Health Services Teleretinal Screening Program. Methods: Fundus photographs were labeled for referable glaucoma (cup-to-disc ratio ≥0.6) by certified optometrists. Four deep learning models were trained on cropped and uncropped images (training N = 8996; validation N = 3002) using 2 architectures: RETFound, a vision transformer with self-supervised pretraining on fundus photographs, and VGG-19. Models were evaluated on a held-out test set (N = 1000) labeled by glaucoma specialists and an external test set (N = 300) from University of Southern California clinics. Performance was assessed while varying training set size and stratifying by demographic factors. xRAI was used for saliency mapping. Main Outcome Measures: Area under the receiver operating characteristic curve (AUC-ROC) and threshold-specific metrics. Results: […] < 0.04). Performance did not vary by age or gender. Saliency maps for both architectures consistently included the optic nerve. Conclusions: Although both RETFound and VGG-19 models performed well for classification of referable glaucoma, foundation models may be preferable when training data are limited and when domain shift is expected. Training models using images cropped to the region of the optic nerve improves performance regardless of architecture but may reduce model generalizability. Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
Ophthalmology Science · 2025-02-25 · 8 citations
article · Open access
Purpose: Develop and test a deep learning (DL) algorithm for detecting referable glaucoma. Design: Retrospective cohort study. Participants: A total of 6116 patients from the Los Angeles County (LAC) Department of Health Services (DHS) were included. Methods: Fundus photographs and patient-level labels of referable glaucoma (cup-to-disc ratio ≥0.6) provided by 21 certified optometrists. A DL algorithm based on the Visual Geometry Group-19 architecture was trained using patient-level labels generalized to images from both eyes. Area under the receiver operating curve (AUROC), sensitivity, and specificity were calculated to assess algorithm performance using an independent test set that was also graded by 13 clinicians with 0 to 10 years of experience. Algorithm performance was tested using reference labels provided by either LAC DHS optometrists or an expert panel of 3 glaucoma specialists. Main Outcome Measures: Area under the receiver operating curve, sensitivity, and specificity. Results: The DL algorithm was trained using 12 998 images from 5616 patients (2086 referable glaucoma, 3530 nonglaucoma). In this data set, the mean age was 56.8 ± 10.5 years with 54.8% women, 68.2% Latinos, 8.9% Blacks, 6.0% Asians, and 2.7% Whites. One thousand images from 500 patients (250 referable glaucoma, 250 nonglaucoma) with similar demographics (P ≥ 0.57) were used to test the algorithm. Algorithm performance matched or exceeded that of all independent clinician graders in detecting patient-level referable glaucoma based on LAC DHS optometrist (AUROC = 0.92) or expert panel (AUROC = 0.93) reference labels. Clinician grader sensitivity (range, 0.33–0.99) and specificity (range, 0.68–0.98) ranged widely and did not correlate with years of experience (P ≥ 0.49). Algorithm performance (AUROC = 0.93) also matched or exceeded the sensitivity (range, 0.78–1.00) and specificity (range, 0.32–0.87) of 6 certified LAC DHS optometrists in the subsets of the test data set they graded. Conclusions: A DL algorithm for detecting referable glaucoma trained using patient-level data provided by certified LAC DHS optometrists approximates or exceeds performance by ophthalmologists and optometrists, who exhibit variable sensitivity and specificity unrelated to experience level. Implementation of this algorithm in screening workflows could help reallocate resources and provide more reproducible and timely glaucoma care. Financial Disclosure(s): Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
Taiwan Journal of Ophthalmology · 2025-07-01 · 2 citations
review · Open access · 1st author
Glaucoma is an optic neuropathy and the leading cause of irreversible blindness worldwide. Imaging of the ganglion cell complex and retinal nerve fiber layer with optical coherence tomography (OCT) is a noninvasive, high-resolution means of diagnosing and quantitatively monitoring glaucoma. In the anterior segment, OCT can also be used to assess the anterior chamber angle and identify angle closure, a risk factor for glaucoma. The interpretation of OCT images for accurate diagnosis requires expert-level knowledge of both the technology and glaucoma. Deep learning (DL) is a subfield of artificial intelligence (AI), which is gaining prominence in health care for its ability to interpret images and approximate clinician judgment. This review summarizes recent research that demonstrates how DL can contribute to the analysis of OCT images in glaucoma. Deep neural networks can assist clinicians in checking the quality of OCT scans, quantifying the thickness of optic nerve tissues, evaluating the anterior chamber angle, diagnosing glaucoma, and detecting the progression of existing glaucoma. As further work expands on the generalizability, equity, and explainability of these DL techniques, AI-driven clinical support tools may become available for glaucoma diagnostics.
Ophthalmology Science · 2025-06-05 · 9 citations
article · Open access
Purpose: To evaluate and compare the performance of human test takers and three AI models (OpenAI o1, ChatGPT-4o, and Gemini 1.5 Flash) on ophthalmology board-style questions, focusing on overall accuracy and performance stratified by ophthalmic subspecialty and cognitive complexity level. Design: Cross-sectional study. Subjects: 500 questions sourced from the Basic and Clinical Science Course (BCSC) and EyeQuiz question banks. Methods: Three large language models (LLMs) interpreted the questions using standardized prompting procedures. Subanalysis was performed stratifying the questions by subspecialty and complexity defined by the Buckwalter Taxonomic Schema. Statistical analysis, including the analysis of variance (ANOVA) and McNemar's test, was conducted to assess performance differences. Main Outcome Measures: Accuracy of responses for each model and human test takers, stratified by subspecialty and cognitive complexity. Results: OpenAI o1 achieved the highest overall accuracy (423/500, 84.6%), significantly outperforming GPT-4o (331/500, 66.2%; P < 0.001) and Gemini (301/500, 60.2%; P < 0.001). o1 demonstrated superior performance on both BCSC (228/250, 91.2%) and EyeQuiz (195/250, 78.0%) questions compared to GPT-4o (BCSC: 183/250, 73.2%; EyeQuiz: 148/250, 59.2%) and Gemini (BCSC: 163/250, 65.2%; EyeQuiz: 137/250, 54.8%). On BCSC questions, human performance was lower (64.5%) than Gemini 1.5 Flash (65.2%), GPT-4o (73.2%), and OpenAI o1 (91.2%) (P < 0.001). OpenAI o1 outperformed other models in each of the nine ophthalmic subfields and three cognitive complexity levels. Conclusions: OpenAI o1 outperformed GPT-4o, Gemini, and human test takers in answering ophthalmology board-style questions from two question banks and across three complexity levels. These findings highlight advances in AI technology and OpenAI o1's growing potential as an adjunct in ophthalmic education and care.
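The abstract above names McNemar's test for comparing two models on the same questions. The sketch below illustrates the exact (binomial) form of that test. It is not the authors' analysis code: the function name `mcnemar_exact` is hypothetical, the abstract does not say which variant of the test was used, and the discordant counts in the example are made up because the abstract reports only overall accuracies.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: questions model A answered correctly and model B incorrectly
    c: questions model B answered correctly and model A incorrectly
    Questions both models got right (or both got wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts, NOT taken from the study.
p_value = mcnemar_exact(5, 15)
print(f"p = {p_value:.4f}")
```

Only the questions on which the two models disagree carry information in this test, which is why it suits paired comparisons on a shared question bank.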
medRxiv · 2025-06-28
preprint · Open access
Importance: Pharmacologic dilation is vital for eye disease screening but is often avoided due to concerns about triggering acute angle closure (AAC), a sight-threatening ophthalmic emergency. Objective: To assess AAC incidence after dilation and validate the use of International Classification of Diseases (ICD) codes for identifying AAC cases. Design: Retrospective cohort study. Setting: Primary care-based teleretinal diabetic retinopathy screening (TDRS) program. Participants: Eligible participants were Los Angeles County (LAC) Department of Health Services (DHS) patients who underwent teleretinal screening by dilated fundus photography between August 23, 2013, and March 1, 2024. Potential AAC cases were identified using ICD codes for angle closure, including acute angle closure glaucoma (AACG), primary angle closure glaucoma (PACG), and anatomical narrow angle (ANA), within three months of dilation. All urgent care, emergency department, and eye clinic encounters within the next calendar day after TDRS and encounters with Current Procedural Terminology (CPT) codes for iridectomy/iridotomy or lens extraction within 14 calendar days of TDRS were also identified. Manual chart review was conducted to verify AAC cases and extract clinical information. Exposures: Dilation with 1.0% or 0.5% tropicamide. Main Outcomes and Measures: Cumulative incidence of AAC after dilation. Results: 84,008 patients received 168,796 dilations with a mean of 2.01 ± 1.50 (mean ± standard deviation) dilations per patient. 55.1% were female. Mean age was 55.4 ± 10.7 (mean ± standard deviation) years. The cohort was 67.7% Hispanic, 8.2% Black, 6.3% Asian, 4.1% White, and 2.4% Other. Manual chart review confirmed four AAC cases after dilation: 3 coded as AACG and 1 as ANA. The AAC risk was 2.4 (95% CI 0.05-4.69) per 100,000 dilations (0.0024%) or 4.8 (95% CI 0.10-9.43) per 100,000 patients (0.0048%). All four cases were female, had narrow angles in the non-presenting eye on gonioscopy, and presented within one day with AAC symptoms, including eye pain and blurry vision. Conclusions and Relevance: AAC risk was less than 1 in 40,000 per dilation in a high-volume TDRS program serving a diverse, safety net population, supporting the overall safety of dilation in this setting. Further discussion about AAC risk as a contraindication to dilation is warranted.
Acute Angle Closure Incidence in a Large Countywide Safety Net Teleretinal Screening Program
JAMA Ophthalmology · 2025-09-18 · 1 citation
article · Open access
Importance: Pharmacologic pupillary dilation is vital for eye disease screening but is often avoided due to concerns about triggering acute angle closure (AAC), a sight-threatening ophthalmic emergency. Objective: To assess AAC incidence after dilation and validate the use of International Classification of Diseases (ICD) codes for identifying AAC cases. Design, Setting, and Participants: This retrospective cohort study used data from a primary care-based teleretinal diabetic retinopathy screening (TDRS) program. Eligible participants were Los Angeles County Department of Health Services patients who underwent teleretinal screening by dilated fundus photography between August 23, 2013, and March 1, 2024. Potential AAC cases were identified using ICD codes for angle closure, including AAC glaucoma, primary angle-closure glaucoma, and anatomical narrow angle, within 3 months of dilation. All urgent care, emergency department, and eye clinic encounters within the next calendar day after TDRS and encounters with Current Procedural Terminology codes for iridectomy/iridotomy or lens extraction within 14 calendar days of TDRS were also identified. Manual medical record review was conducted to verify AAC cases and extract clinical information. Data were analyzed from July 2024 to June 2025. Exposures: Dilation with tropicamide, 1.0%, or tropicamide, 0.5%. Main Outcomes and Measures: Cumulative incidence of AAC after dilation. Results: Of 84 008 included patients, 46 255 (55.1%) were female, and the mean (SD) age was 55.4 (10.7) years. There were a total of 168 796 dilations, with a mean (SD) of 2.01 (1.50) dilations per patient. Manual medical record review confirmed 4 AAC cases after dilation: 3 coded as AAC glaucoma and 1 as anatomical narrow angle. The AAC risk was 2.4 (95% CI, 0.05-4.69) per 100 000 dilations (0.002%) or 4.8 (95% CI, 0.10-9.43) per 100 000 patients (0.005%). All 4 AAC cases occurred in female patients, who had narrow angles in the nonpresenting eye on gonioscopy and presented within 1 day with AAC symptoms, including eye pain and blurry vision. Conclusions and Relevance: AAC risk was less than 1 in 40 000 per dilation in a high-volume TDRS program serving a diverse safety net population, supporting the overall safety of dilation in this setting. Further discussion about AAC risk as a contraindication to dilation is warranted.
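The risk figures in this abstract can be checked from the reported counts (4 confirmed cases, 168 796 dilations, 84 008 patients). The sketch below is a worked check, not the authors' code. It assumes a normal-approximation (Wald) interval, which the abstract does not name but which reproduces the published bounds after rounding. The function name `incidence_per_100k` is illustrative.

```python
import math

def incidence_per_100k(events, n, z=1.96):
    """Cumulative incidence per 100,000 with a Wald (normal-approximation) 95% CI.

    Assumption: the abstract does not state the interval method; the Wald
    interval is used here only because it matches the reported bounds.
    """
    p = events / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of a proportion
    scale = 100_000
    return p * scale, (p - z * se) * scale, (p + z * se) * scale

per_dilation = incidence_per_100k(4, 168_796)
per_patient = incidence_per_100k(4, 84_008)
dilations_per_case = 168_796 / 4  # basis for "less than 1 in 40 000"

print("per 100,000 dilations: %.1f (%.2f-%.2f)" % per_dilation)
print("per 100,000 patients:  %.1f (%.2f-%.2f)" % per_patient)
print("one case per %.0f dilations" % dilations_per_case)
```

With only 4 events the Wald interval is a rough approximation, and an exact binomial or Poisson interval would be wider and asymmetric. It is used here only to show where the published numbers come from.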
medRxiv · 2025-08-24
preprint · Open access · 1st author
Purpose: To compare the performance of a foundation model and a supervised learning-based model for detecting referable glaucoma from fundus photographs. Design: Evaluation of diagnostic technology. Participants: 6,116 participants from the Los Angeles County Department of Health Services Teleretinal Screening Program. Methods: Fundus photographs were labeled for referable glaucoma (cup-to-disc ratio ≥ 0.6) by certified optometrists. Four deep learning models were trained on cropped and uncropped images (Training N = 8,996; Validation N = 3,002) using two architectures: a vision transformer with self-supervised pretraining on fundus photographs (RETFound) and a convolutional neural network (VGG-19). Models were evaluated on a held-out test set (N = 1,000) labeled by glaucoma specialists and an external test set (N = 300) from University of Southern California clinics. Performance was assessed while varying training set size and stratifying by demographic factors. xRAI was used for saliency mapping. Main Outcome Measures: Area under the receiver operating characteristic curve (AUC-ROC) and threshold-specific metrics. Results: The cropped image VGG-19 model achieved the highest AUC-ROC (0.924 [0.907-0.940]), which was comparable (p = 0.07) to the cropped image RETFound model (0.911 [0.892-0.930]), which achieved the highest Youden-optimal performance (sensitivity 82.6%, specificity 88.2%) and F1 score (0.801). Cropped image models outperformed their uncropped counterparts within each architecture (p < 0.001 for AUC-ROC comparisons). RETFound models had a performance advantage when trained on smaller datasets (N < 2000 images), and the uncropped image RETFound model performed best on external data (p < 0.001 for AUC-ROC comparisons). The cropped image RETFound model performed consistently across ethnic groups (p = 0.20), while the others did not (p < 0.04); performance did not vary by age or gender. Saliency maps for both architectures consistently included the optic nerve. Conclusion: While both RETFound and VGG-19 models performed well for classification of referable glaucoma, foundation models may be preferable when training data is limited and when domain shift is expected. Training models using images cropped to the region of the optic nerve improves performance regardless of architecture but may reduce model generalizability.
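"Youden-optimal" in this abstract refers to the operating point that maximizes Youden's J statistic, J = sensitivity + specificity - 1. The sketch below shows how such a threshold can be chosen from model scores. It is an illustration only: the function name `youden_optimal_threshold` is hypothetical, the toy scores and labels are invented, and nothing here reflects how the study computed its metrics beyond the definition of J.

```python
def youden_optimal_threshold(scores, labels):
    """Return (threshold, sensitivity, specificity) maximizing J = sens + spec - 1.

    A case is called positive when its score is >= the threshold.
    Ties in J are resolved in favor of the lowest threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens, spec = tp / positives, tn / negatives
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best[1], best[2], best[3]

# Hypothetical toy scores and labels, NOT data from the study.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
threshold, sens, spec = youden_optimal_threshold(scores, labels)

# J implied by the sensitivity and specificity reported in the abstract.
reported_j = 0.826 + 0.882 - 1
print(threshold, sens, spec, round(reported_j, 3))
```

Applied to the reported operating point of the cropped image RETFound model (sensitivity 82.6%, specificity 88.2%), the definition gives J of about 0.708.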
From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience
2025-09-15 · 1 citation
article
Reproducibility remains a central challenge in machine learning (ML), especially in collaborative eScience projects where teams iterate over data, features, and models. Current ML workflows are often dynamic yet fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This fragmentation impedes transparency, reproducibility, and the adaptability of experiments over time. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts: Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary. These artifacts formalize the relationships between data, code, and decisions, enabling ML experiments to be versioned, interpretable, and traceable over time. The approach is demonstrated through a clinical ML use case of glaucoma detection, illustrating how the system supports iterative exploration, improves reproducibility, and preserves the provenance of collaborative decisions across the ML lifecycle.
Frequent coauthors
- 9 shared
Bruce Burkemper
University of Southern California
- 8 shared
Galo Apolo
University of Southern California
- 8 shared
Michael Chiang
University of Southern California
- 8 shared
Anmol A. Pardeshi
University of Southern California
- 8 shared
Benjamin Y. Xu
University of Southern California
- 6 shared
Alex S. Huang
University of California, San Diego
- 4 shared
Martin Simonovsky
- 4 shared
Xiaobin Xie
Nanchang University
Awards & honors
- NIH/SC-CTSI: Mentored Career Development Award in Clinical a…
- American Glaucoma Society: Mentoring for Advancement of Phys…
- Columbia University: Albert B. Knapp Scholarship (2019)
- Columbia University: Alpha Omega Alpha (2018)