Jana Diesner
Department Affiliate · Verified · University of Illinois Urbana-Champaign · Computer Science
Active 2004–2026
About
Jana Diesner is a faculty member affiliated with the Siebel School of Computing and Data Science at the University of Illinois Urbana-Champaign. Her contact email is jdiesner@illinois.edu. She teaches courses including Data Governance, Network Analysis, Responsible Data Science & AI, and Conflict-Related Displacement. Her research centers on data governance, network analysis, and responsible data science and AI, and she contributes to the school's teaching and research.
Research topics
- Computer science
- Artificial intelligence
- Natural language processing
- Data science
- Information retrieval
Selected publications
GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts
ArXiv.org · 2026-04-14
Article · Open access
Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: https://github.com/cisnlp/glotocr-bench, Benchmark: https://hf.co/datasets/cis-lmu/glotocr-bench.
A probabilistic model of hype reveals a rhetorical shift in biomedicine
Quantitative Science Studies · 2026-05-08
Article · Open access
Promotional language, or ‘hype,’ characterized by words like novel, outstanding, and promising, has been increasingly observed in scientific writing. To systematically analyze how hype is employed and evolves in structured biomedical abstracts, we introduce a probabilistic model to estimate the propensity of hype in selected candidate words in biomedical abstracts. Our analysis examines the positional distribution of 43 candidate hype words across approximately 14.5 million PubMed abstracts within the IMRaD (Introduction, Methods, Results, and Discussion) framework of structured writing. The model accounts for context, filtering out technical concepts (e.g., ‘major histocompatibility complex’ or ‘vital capacity’) and non-hyping uses (e.g., ‘outstanding questions’). We find that the degree of hype varies depending on the word and its context: words such as promising and noteworthy frequently convey hype, whereas others like major and central typically remain neutral. Temporal trends suggest that the increased hype is not due to a shift in the words’ propensity to hype but rather authors’ strategic rhetorical choices, particularly in the ‘Introduction’ and ‘Discussion’ sections. This analytical approach enhances the identification and understanding of hype’s role, and we provide a labeled dataset annotated with abstract-level hype probabilities to facilitate further research into its impact on scientific communication.
Peer review: https://www.webofscience.com/api/gateway/wos/peer-review/10.1162/QSS.a.482
ArXiv.org · 2025-11-05
Preprint · Open access · Senior author
Large language models (LLMs) excel in generating fluent utterances but can lack reliable grounding in verified information. At the same time, knowledge-graph-based fact-checkers deliver precise and interpretable evidence, yet suffer from limited coverage or latency. By integrating LLMs with knowledge graphs and real-time search agents, we introduce a hybrid fact-checking approach that leverages the individual strengths of each component. Our system comprises three autonomous steps: 1) a Knowledge Graph (KG) Retrieval for rapid one-hop lookups in DBpedia, 2) an LM-based classification guided by a task-specific labeling prompt, producing outputs with internal rule-based logic, and 3) a Web Search Agent invoked only when KG coverage is insufficient. Our pipeline achieves an F1 score of 0.93 on the FEVER benchmark on the Supported/Refuted split without task-specific fine-tuning. To address Not Enough Information (NEI) cases, we conduct a targeted reannotation study showing that our approach frequently uncovers valid evidence for claims originally labeled as NEI, as confirmed by both expert annotators and LLM reviewers. With this paper, we present a modular, open-source fact-checking pipeline with fallback strategies and generalization across datasets.
PLoS ONE · 2025-01-31 · 3 citations
Article · Open access · Senior author · Corresponding
Multiple studies have linked diversity in scientific collaborations to innovative and impactful research. Here, we explore how different diversity indices (ethnicity, gender, academic age, and topical expertise) interact and thereby influence scientific impact. Leveraging nearly 900,000 biomedical journal articles from PubMed, published in major journals between 1991 and 2014, we investigate the nuanced relationships among these diversity indices and their collective influence on research outcomes. By systematically varying model parametrizations, we assess the robustness of the observed relationships and examine multiple methodological choices. Our findings reveal a consistent pattern of demographic homophily, where scientists tend to collaborate with others who share similar ethnic and gender backgrounds. While each diversity index correlates significantly with impact when considered individually, gender diversity and topical expertise emerge as the strongest positive predictors of impact after accounting for key covariates. However, the association between diversity and impact is moderated by the number of collaborating authors, with larger teams sometimes showing opposite trends due to interactions between the computed diversity indices and team size. Despite this complexity, the practical drivers of scientific impact for an article remain the journal of publication, authors' prior citation rate, and the number of co-authors. Further examining expertise diversity along three separate dimensions (variety, balance, and disparity), we find that impactful teams balance a wide range of subject matter expertise while maintaining a focused connection on closely related topics. These findings highlight the importance of strategic team composition and underline the significance of team diversity in scientific research.
ArXiv.org · 2025-01-30
Preprint · Open access · Senior author
Gender biases in scholarly metrics remain a persistent concern, despite numerous bibliometric studies exploring their presence and absence across productivity, impact, acknowledgment, and self-citations. However, methodological inconsistencies, particularly in author name disambiguation and gender identification, limit the reliability and comparability of these studies, potentially perpetuating misperceptions and hindering effective interventions. A review of 70 relevant publications over the past 12 years reveals a wide range of approaches, from name-based and manual searches to more algorithmic and gold-standard methods, with no clear consensus on best practices. This variability, compounded by challenges such as accurately disambiguating Asian names and managing unassigned gender labels, underscores the urgent need for standardized and robust methodologies. To address this critical gap, we propose the development and implementation of "Scholarly Data Analysis (SoDA) Cards." These cards will provide a structured framework for documenting and reporting key methodological choices in scholarly data analysis, including author name disambiguation and gender identification procedures. By promoting transparency and reproducibility, SoDA Cards will facilitate more accurate comparisons and aggregations of research findings, ultimately supporting evidence-informed policymaking and enabling the longitudinal tracking of analytical approaches in the study of gender and other social biases in academia.
2025-01-01
Article · Open access · Senior author
Large language models (LLMs) excel in generating fluent utterances but can lack reliable grounding in verified information. At the same time, knowledge-graph-based fact-checkers deliver precise and interpretable evidence, yet suffer from limited coverage or latency. By integrating LLMs with knowledge graphs and real-time search agents, we introduce a hybrid fact-checking approach that leverages the individual strengths of each component. Our system comprises three autonomous steps: 1) a Knowledge Graph (KG) Retrieval for rapid one-hop lookups in DBpedia, 2) an LM-based classification guided by a task-specific labeling prompt, producing outputs with internal rule-based logic, and 3) a Web Search Agent invoked only when KG coverage is insufficient. Our pipeline achieves an F1 score of 0.93 on the FEVER benchmark on the Supported/Refuted split without task-specific fine-tuning. To address Not Enough Information (NEI) cases, we conduct a targeted reannotation study showing that our approach frequently uncovers valid evidence for claims originally labeled as NEI, as confirmed by both expert annotators and LLM reviewers. With this paper, we present a modular, open-source fact-checking pipeline with fallback strategies and generalization across datasets.
CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs
2025-01-01
Article · Open access · Senior author
Warning: This paper contains content that may be offensive or upsetting. Improvements in model construction, including fortified safety guardrails, allow large language models (LLMs) to increasingly pass standard safety checks. However, LLMs sometimes slip into revealing harmful behavior, such as expressing racist viewpoints, during conversations. To analyze this systematically, we introduce CoBia, a suite of lightweight adversarial attacks that allow us to refine the scope of conditions under which LLMs depart from normative or ethical behavior in conversations. CoBia creates a constructed conversation where the model utters a biased claim about a social group. We then evaluate whether the model can recover from the fabricated bias claim and reject biased follow-up questions. We evaluate 11 open-source as well as proprietary LLMs for their outputs related to six socio-demographic categories that are relevant to individual safety and fair treatment, i.e., gender, race, religion, nationality, sexual orientation, and others. Our evaluation is based on established LLM-based bias metrics, and we compare the results against human judgments to scope out the LLMs' reliability and alignment. The results suggest that purposefully constructed conversations reliably reveal bias amplification and that LLMs often fail to reject biased follow-up questions during dialogue. This form of stress-testing highlights deeply embedded biases that can be surfaced through interaction. Code and artifacts are available at github.com/nafisenik/CoBia.
Author response for "A probabilistic model of hype reveals a rhetorical shift in biomedicine"
2025-03-14
Peer review
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
2025-01-01
Article · Open access
English-centric large language models (LLMs) often show strong multilingual capabilities. However, their multilingual performance remains unclear and is under-evaluated for many other languages. Most benchmarks for multilinguality focus on classic NLP tasks or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a pivot language in their intermediate layers. MEXA computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in different languages. We conduct controlled experiments using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves an average Pearson correlation of 0.90 between its predicted scores and actual task performance across languages. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs.
Frequent coauthors
- Kathleen M. Carley (Carnegie Mellon University) · 26 shared
- Rezvaneh Rezapour · 26 shared
- Shubhanshu Mishra · 22 shared
- Ly Dinh (University of South Florida) · 21 shared
- Jinseok Kim (Konkuk University) · 16 shared
- Sullam Jeoung · 11 shared
- Pingjing Yang (University of Illinois Urbana-Champaign) · 7 shared
- Rania Al-Sabbagh (University of Sharjah) · 7 shared
Education
PhD, School of Computer Science
Carnegie Mellon University