Suma Bhat
ADJ ASST PROF · Verified · University of Illinois Urbana-Champaign · Computer Science
Active 2007–2026
About
Suma Bhat is an Assistant Professor at the Siebel School of Computing and Data Science at the University of Illinois Urbana-Champaign. Her research interests include Natural Language Processing, Human-AI Interaction, and Computational Social Science. She has been recognized for her teaching excellence, receiving the ECE Ronald W. Pratt Faculty Outstanding Teaching Award in 2021. Her work focuses on advancing understanding and development in artificial intelligence, particularly in language processing and human-AI collaboration.
Research topics
- Computer science
- Artificial intelligence
- Natural language processing
- Linguistics
- Psychology
Selected publications
Examining Students' Code Comprehension with LLMs in Block- and Text-Based Programming
2026-02-13
article · Open access
Understanding how students reason about code is essential for providing tailored scaffolding in computer science (CS) education. Prior work has used think-aloud protocols with the Structure of the Observed Learning Outcomes (SOLO) taxonomy to examine students' code comprehension and programming levels. However, analyzing such data is labor-intensive and requires expert judgment. Recent advances in large language models (LLMs) offer a promising avenue for scaling this analysis, though their reliability for fine-grained coding remains uncertain. To address this gap, our study investigates the extent to which GPT-5 and 4o can classify SOLO levels and identify code-comprehension strategies from think-aloud transcripts of 27 high-school students working on block-based and text-based tasks. Results show modest alignment with human ratings for SOLO, with one-shot prompting improving agreement over zero-shot, though distinctions between adjacent lower levels (e.g., Prestructural 1 vs. 2) remained difficult. Strategy detection demonstrated stronger performance, achieving accuracies of 75–77% (block) and 62–67% (text), particularly for surface-visible strategies such as 'walkthroughs', 'control-structure identification', and 'pattern recognition', but weaker for less frequent, abstract, meta-cognitive strategies such as 'strategizing' (planning an approach) or 'thoroughness' (systematically checking work). These findings highlight both the potential and the limitations of using GPT-5 and 4o to analyze think-aloud data. While this work represents an initial step, with plans to examine more models, our preliminary results indicate that a human-in-the-loop approach is essential to ensure reliability and interpretive depth. Future work will extend this evaluation to other LLMs to better understand their role in supporting instructional decision-making.
2026-02-13
article · Open access
Understanding how students comprehend code is essential for designing effective instructional support in computer science (CS). While prior studies have often relied on written responses, few have examined students' reasoning processes through think-aloud data. In this study, we analyzed the verbal reasoning of 27 high school students as they completed block-based and text-based code comprehension tasks targeting loops and conditional statements. Using an adapted SOLO taxonomy framework, we found that most students were classified at lower levels, with performance declining as they transitioned from block-based to text-based code. Students' strategy use, informed by prior work on code comprehension, showed that walkthroughs and identifying program structures were the most common approaches. Text-based tasks more often led students to use pattern-recognition strategies, such as interpreting operators or identifying numerical patterns, whereas block-based tasks occasionally prompted them to articulate broader problem-solving approaches. Overall, these findings demonstrate the value of applying the SOLO taxonomy to evaluate students' programming levels and highlight how programming modality impacts both the depth of understanding and the strategies students employ during code comprehension.
Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models
2026-04-30
preprint · Open access
While large models pre-trained on high-quality data exhibit excellent performance on mathematical reasoning (e.g., GSM8k, MultiArith), it remains challenging to specialize smaller models for these tasks. Common approaches to address this challenge include knowledge distillation from large teacher models and data augmentation (e.g., rephrasing questions and generating synthetic solutions). Despite these efforts, smaller models struggle with arithmetic computations, leading to errors in mathematical reasoning. In this work, we leverage a synthetic arithmetic dataset generated programmatically to enhance the reasoning capabilities of smaller models. We investigate two key approaches to incorporate this dataset: (1) intermediate fine-tuning, in which a model is fine-tuned on the arithmetic dataset before training it on a reasoning dataset, and (2) integrating the arithmetic dataset into an instruction-tuning mixture, allowing the model to learn arithmetic skills alongside general instruction-following abilities. Our experiments on multiple reasoning benchmarks demonstrate that incorporating an arithmetic dataset, whether through targeted fine-tuning or within an instruction-tuning mixture, enhances models' arithmetic capabilities, thereby improving their mathematical reasoning performance.
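As a rough illustration of what "a synthetic arithmetic dataset generated programmatically" could look like, the sketch below produces prompt/answer pairs for basic operations. The function name `make_arithmetic_examples`, the prompt format, the operand range, and the choice of operators are all assumptions made for illustration; the abstract does not describe the authors' actual generator.

```python
import random


def make_arithmetic_examples(n, seed=0, max_operand=999):
    """Return n prompt/answer pairs for +, -, and * (hypothetical format)."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
    }
    examples = []
    for _ in range(n):
        a = rng.randint(0, max_operand)
        b = rng.randint(0, max_operand)
        symbol = rng.choice(sorted(ops))
        examples.append(
            {"prompt": f"{a} {symbol} {b} =", "answer": str(ops[symbol](a, b))}
        )
    return examples


if __name__ == "__main__":
    for example in make_arithmetic_examples(3, seed=42):
        print(example["prompt"], example["answer"])
```

Under this assumed setup, pairs like these could be used either for an intermediate fine-tuning stage or mixed into an instruction-tuning set, which are the two options the abstract compares.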
A hybrid decision tree with rule-based and deep learning nodes for automated medical coding
2025-01-01
dissertation · Senior author
With the growing digitization of healthcare data, automating the medical coding process has become increasingly important. Recent advances in machine learning and natural language processing (NLP) have led to promising approaches for automated medical coding using clinical notes and discharge summaries. Among these, deep learning excels at extracting complex patterns from unstructured text. However, it often requires large annotated datasets, significant computational resources, and lacks interpretability, which are key concerns in clinical settings. In this thesis, we adopt a hierarchical classification structure that mirrors the tree-like organization of the ICD coding system. To offer a scalable and efficient solution, we propose a hybrid decision tree (HDT) framework for automated ICD coding, which combines the efficiency of rule-based methods with the predictive power of deep learning models. Rather than relying on a single paradigm, the HDT approach determines, at each decision node, whether a lightweight rule-based classifier is sufficient or whether a more complex deep learning model is needed. For simpler nodes, where distinguishing features such as specific symptoms or keywords are easily identifiable, we classify medical codes using rule-based methods that apply statistical feature scoring based on term frequency and class-specific relevance. For more complex cases, where textual overlap between conditions makes rule-based classification unreliable, we employ deep learning models, particularly Long Short-Term Memory (LSTM) networks, to capture subtle semantic patterns in clinical text. We evaluate our approach using clinical notes and discharge summaries from the MIMIC-IV dataset. The results demonstrate that HDT offers a favorable trade-off by maintaining high prediction accuracy while significantly reducing inference time and resource consumption. Furthermore, its modular design facilitates system scalability and adaptation to updates in the ICD coding system, making it well-suited for real-world deployment.
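The per-node routing idea described in this abstract can be sketched as a small tree in which each internal node carries either a keyword-scoring rule or a learned model. This is only a minimal sketch under stated assumptions: the `Node` class, the `classify` function, and the category labels are hypothetical, the choice between rule and model is fixed by hand per node rather than determined automatically, and a plain Python callable stands in for the LSTM classifier mentioned in the thesis.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    # Rule-based option: child label -> {term: weight} (illustrative weights).
    keyword_scores: Optional[Dict[str, Dict[str, float]]] = None
    # Model option: any callable mapping text to a child label
    # (a stand-in for a trained deep learning classifier).
    model: Optional[Callable[[str], str]] = None

    def choose_child(self, text: str) -> "Node":
        if self.keyword_scores is not None:
            # Cheap path: sum term weights per child, pick the highest total.
            tokens = text.lower().split()
            totals = {
                child: sum(weights.get(tok, 0.0) for tok in tokens)
                for child, weights in self.keyword_scores.items()
            }
            best = max(totals, key=totals.get)
        else:
            # Expensive path: defer to the model attached to this node.
            best = self.model(text)
        return next(c for c in self.children if c.label == best)


def classify(root: Node, text: str) -> List[str]:
    """Walk from the root to a leaf and return the labels visited."""
    node, path = root, [root.label]
    while node.children:
        node = node.choose_child(text)
        path.append(node.label)
    return path


# Hypothetical two-level tree; labels are placeholders, not real ICD codes.
root = Node(
    label="root",
    keyword_scores={
        "cardiac": {"chest": 1.0, "heart": 1.0},
        "respiratory": {"cough": 1.0, "wheeze": 1.0},
    },
    children=[
        Node(
            label="cardiac",
            model=lambda text: "cardiac_b" if "radiating" in text.lower() else "cardiac_a",
            children=[Node("cardiac_a"), Node("cardiac_b")],
        ),
        Node(label="respiratory"),
    ],
)

if __name__ == "__main__":
    print(classify(root, "Chest pain radiating to arm"))
```

In this toy tree the root is resolved by keyword scoring alone, and only the "cardiac" branch pays the cost of calling a model, which mirrors the inference-time trade-off the abstract reports.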
2025-02-18 · 1 citation
article · Senior author
Novice programmers often face challenges in designing computational artifacts and fixing code errors, which can lead to task abandonment and over-reliance on external support. While research has explored effective meta-cognitive strategies to scaffold novice programmers' learning, it is essential to first understand and assess students' conceptual, procedural, and strategic/conditional programming knowledge at scale. To address this issue, we propose a three-model framework that leverages Large Language Models (LLMs) to simulate, classify, and correct student responses to programming questions based on the SOLO Taxonomy. The SOLO Taxonomy provides a structured approach for categorizing student understanding into four levels: Pre-structural, Uni-structural, Multi-structural, and Relational. Our results showed that GPT-4o achieved high accuracy in generating and classifying responses for the Relational category, with moderate accuracy in the Uni-structural and Pre-structural categories, but struggled with the Multi-structural category. The model successfully corrected responses to the Relational level. Although further refinement is needed, these findings suggest that LLMs hold significant potential for supporting computer science education by assessing programming knowledge and guiding students toward deeper cognitive engagement.
Medical Students' Perception of Automated Note Feedback After Simulated Encounters
The Clinical Teacher · 2025-11-17
article · Open access · Senior author
BACKGROUND: Grading medical student patient notes (PNs) is resource-intensive. Natural language processing (NLP) offers a promising solution to automatically grade PNs. We deployed an automated grading system that uses NLP and explored the perceived value of PN feedback. APPROACH: The automated system graded written notes after two standardized patient encounters by third-year medical students. The system generated an individualized report on 'items found' and 'items not found' in the history, physical examination, and diagnosis sections, which was shared with students for feedback via a web-based interface. By rotation, block students received either the automated case feedback first or the faculty-written model note feedback first (the pre-intervention baseline). EVALUATION: After reviewing feedback, students completed surveys for both automated feedback and model note feedback and participated in follow-up focus groups. In total, 44 students received feedback, 37 completed surveys, and 28 participated in focus groups. Qualitative themes that emerged suggested the automated feedback was visually appealing and allowed for easy comparison of items found vs. missing, which would help improve students' documentation skills. Model note appeared trustworthy. IMPLICATIONS: We found automated systems can be a potential tool for formative feedback on note writing activity although in terms of quality it does not surpass the pre-existing feedback methods, such as model note feedback used in our study. Order effects may have influenced these perceptions and the small sample size limits generalizability. Tested software had occasional errors in recognizing a phrase or showing a false positive.
Study Partners Matter: Impacts on Inclusion and Outcomes
2021 ASEE Virtual Annual Conference Content Access Proceedings · 2024-02-20 · 1 citation
article · Open access · Senior author
Her research contributes to the understanding of how young students learn mathematics and the classroom contexts for learning. Her detailed work on teaching practices, teacher learning, and discourse practices in elementary mathematics classrooms has yielded important insights on teaching practices that are linked to student understanding.
Long-Form Analogy Evaluation Challenge
2024-01-01 · 2 citations
article · Open access
Given the practical applications of analogies, recent work has studied analogy generation to explain concepts. However, not all generated analogies are of high quality and it is unclear how to measure the quality of this new kind of generated text. To address this challenge, we propose a shared task on automatically evaluating the quality of generated analogies based on seven comprehensive criteria. For this, we will set up a leaderboard based on our dataset annotated with manual ratings along the seven criteria, and provide a baseline solution leveraging GPT-4. We hope that this task would advance the progress in development of new evaluation metrics and methods for analogy generation in natural language, particularly for education.
The Relation Among Gender, Language, and Posting Type in Online Chemistry Course Discussion Forums
2024-03-05 · 1 citation
article · Senior author
This study explored gendered language used in an online chemistry course’s discussion forums, to understand how using gendered language might help or hinder learning outcomes, while considering the goal of various posting structures required in the course. Findings revealed that although gendered-language use did not differ between men and women, gendered forms of language were widely used throughout the forums. The use of gendered language appeared strategic, however, and reliably varied by the goal of the discussion post (i.e., posting a solution to a homework problem, asking a question, or answering a question). Ultimately, gender, language and posting type were found to be related to final grade.
ElectroVizQA: How well do Multi-modal LLMs perform in Electronics Visual Question Answering?
arXiv (Cornell University) · 2024-11-27 · 1 citation
preprint · Open access · Senior author
Multi-modal Large Language Models (MLLMs) are gaining significant attention for their ability to process multi-modal data, providing enhanced contextual understanding of complex problems. MLLMs have demonstrated exceptional capabilities in tasks such as Visual Question Answering (VQA); however, they often struggle with fundamental engineering problems, and there is a scarcity of specialized datasets for training on topics like digital electronics. To address this gap, we propose a benchmark dataset called ElectroVizQA specifically designed to evaluate MLLMs' performance on digital electronic circuit problems commonly found in undergraduate curricula. This dataset, the first of its kind tailored for the VQA task in digital electronics, comprises approximately 626 visual questions, offering a comprehensive overview of digital electronics topics. This paper rigorously assesses the extent to which MLLMs can understand and solve digital electronic circuit questions, providing insights into their capabilities and limitations within this specialized domain. By introducing this benchmark dataset, we aim to motivate further research and development in the application of MLLMs to engineering education, ultimately bridging the performance gap and enhancing the efficacy of these models in technical fields.
Frequent coauthors
- 31 shared
Hongyu Gong
- 22 shared
Pramod Viswanath
- 17 shared
Michelle Perry
University of Illinois Urbana-Champaign
- 15 shared
Ziheng Zeng
- 15 shared
Jianing Zhou
- 14 shared
Tarek Sakakini
University of Illinois Urbana-Champaign
- 12 shared
Wanzheng Zhu
University of Leeds
- 11 shared
Jiaqi Mu
Education
- 2010
Ph.D., Electrical and Computer Engineering
University of Illinois Urbana-Champaign
- 2000
MA, South and Southeast Asian Studies
University of California, Berkeley
- 1996
M.E., Electrical Engineering
Indian Institute of Science Bangalore
- 1992
BS, Statistics
Mangalore University
Awards & honors
- ECE Ronald W. Pratt Faculty Outstanding Teaching Award (07/0…