Renzhe Yu
Assistant Professor, Learning Analytics / Educational Data Mining · Verified · Columbia University · Curriculum & Teaching
Active 2016–2026
About
Renzhe Yu is an Assistant Professor specializing in Learning Analytics and Educational Data Mining at Teachers College, Columbia University. He is a faculty member at the Data Science Institute and a research affiliate at the Community College Research Center. His academic background includes a Ph.D. in Educational Data Science from the University of California, Irvine, a Master’s degree in Economics of Education from Peking University, and both a Bachelor’s degree in Artificial Intelligence and a Bachelor’s degree in Economics from Peking University. Yu’s scholarly interests focus on studying the affordances and challenges of artificial intelligence and data science in education, with particular attention to issues of equity. He is engaged in exploring how digital innovation and educational technology impact higher education and community colleges. For more detailed and updated information, he directs interested parties to his personal website.
Research topics
- Computer Science
- Data science
- Machine Learning
- Data Mining
- Artificial Intelligence
- Medicine
- World Wide Web
- Human–computer interaction
- Developmental psychology
- Multimedia
- Clinical psychology
- Social psychology
- Psychology
- Environmental health
- Psychiatry
Selected publications
Evaluating 21st-Century Competencies in Postsecondary Curricula with Large Language Models
Journal of Learning Analytics · 2026-02-25
article · Open access · Senior author
The growing emphasis on 21st-century competencies in postsecondary education, intensified by the transformative impact of generative artificial intelligence (GenAI) on the economy and society, underscores the urgent need to evaluate how they are embedded in curricula and how effectively academic programs align with evolving workforce and societal demands. Curricular analytics, particularly recent advancements powered by GenAI, offer a promising data-driven approach to this challenge. However, the analysis of 21st-century competencies requires pedagogical reasoning beyond surface-level information retrieval, and the capabilities of large language models (LLMs) in this context remain underexplored. In this study, we extend prior research on curricular analytics of 21st-century competencies across a broader range of curriculum documents, competency frameworks, and models. Using 7,600 manually annotated curriculum-competency alignment scores (38 competencies and 200 courses across five curriculum document types), we evaluate the informativeness of different curriculum document sources, benchmark the performance of general-purpose LLMs on mapping curricula to competencies, and analyze error patterns. We further introduce a reasoning-based prompting strategy, curricular chain-of-thought (CoT), to strengthen LLMs’ pedagogical reasoning. Our results show that detailed instructional activity descriptions are the most informative type of curriculum document for competency analytics. Open-weight LLMs achieve accuracy comparable to proprietary models on coarse-grained tasks, demonstrating their scalability and cost-effectiveness for institutional use. However, no model reaches human-level precision in fine-grained pedagogical reasoning. Our proposed curricular CoT yields modest improvements by reducing bias in instructional keyword inference and improving the detection of nuanced pedagogical evidence in long text. Together, these findings highlight the untapped potential of institutional curriculum documents and provide an empirical foundation for advancing AI-driven curricular analytics.
Enhancing LLM-Based Data Annotation with Error Decomposition
ArXiv.org · 2026-01-17
article · Open access · Senior author
Large language models offer a scalable alternative to human coding for data annotation tasks, enabling the scale-up of research across data-intensive domains. While LLMs are already achieving near-human accuracy on objective annotation tasks, their performance on subjective annotation tasks, such as those involving psychological constructs, is less consistent and more prone to errors. Standard evaluation practices typically collapse all annotation errors into a single alignment metric, but this simplified approach may obscure different kinds of errors that affect final analytical conclusions in different ways. Here, we propose a diagnostic evaluation paradigm that incorporates a human-in-the-loop step to separate task-inherent ambiguity from model-driven inaccuracies and assess annotation quality in terms of their potential downstream impacts. We refine this paradigm on ordinal annotation tasks, which are common in subjective annotation. The refined paradigm includes: (1) a diagnostic taxonomy that categorizes LLM annotation errors along two dimensions: source (model-specific vs. task-inherent) and type (boundary ambiguity vs. conceptual misidentification); (2) a lightweight human annotation test to estimate task-inherent ambiguity from LLM annotations; and (3) a computational method to decompose observed LLM annotation errors following our taxonomy. We validate this paradigm on four educational annotation tasks, demonstrating both its conceptual validity and practical utility. Theoretically, our work provides empirical evidence for why excessively high alignment is unrealistic in specific annotation tasks and why single alignment metrics inadequately reflect the quality of LLM annotations. In practice, our paradigm can be a low-cost diagnostic tool that assesses the suitability of a given task for LLM annotation and provides actionable insights for further technical optimization.
AI-exposed jobs deteriorated before ChatGPT
arXiv (Cornell University) · 2026-01-05
preprint · Open access · Senior author
Public debate links worsening job prospects for AI-exposed occupations to the release of ChatGPT in late 2022. Using monthly U.S. unemployment insurance records, we measure occupation- and location-specific unemployment risk and find that risk rose in AI-exposed occupations beginning in early 2022, months before ChatGPT. Analyzing millions of LinkedIn profiles, we show that graduate cohorts from 2021 onward entered AI-exposed jobs at lower rates than earlier cohorts, with gaps opening before late 2022. Finally, from millions of university syllabi, we find that graduates taking more AI-exposed curricula had higher first-job pay and shorter job searches after ChatGPT. Together, these results point to forces pre-dating generative AI and to the ongoing value of LLM-relevant education.
ArXiv.org · 2026-01-16
article · Open access · Senior author
The growing emphasis on 21st-century competencies in postsecondary education, intensified by the transformative impact of generative AI, underscores the need to evaluate how these competencies are embedded in curricula and how effectively academic programs align with evolving workforce and societal demands. Curricular Analytics, particularly recent generative AI-powered approaches, offer a promising data-driven pathway. However, analyzing 21st-century competencies requires pedagogical reasoning beyond surface-level information retrieval, and the capabilities of large language models in this context remain underexplored. In this study, we extend prior curricular analytics research by examining a broader range of curriculum documents, competency frameworks, and models. Using 7,600 manually annotated curriculum-competency alignment scores, we assess the informativeness of different curriculum sources, benchmark general-purpose LLMs for curriculum-to-competency mapping, and analyze error patterns. We further introduce a reasoning-based prompting strategy, Curricular CoT, to strengthen LLMs' pedagogical reasoning. Our results show that detailed instructional activity descriptions are the most informative type of curriculum document for competency analytics. Open-weight LLMs achieve accuracy comparable to proprietary models on coarse-grained tasks, demonstrating their scalability and cost-effectiveness for institutional use. However, no model reaches human-level precision in fine-grained pedagogical reasoning. Our proposed Curricular CoT yields modest improvements by reducing bias in instructional keyword inference and improving the detection of nuanced pedagogical evidence in long text. Together, these findings highlight the untapped potential of institutional curriculum documents and provide an empirical foundation for advancing AI-driven curricular analytics.
Tracing Need for Cognition in Digital Learning
2026-02-16
article · Open access
Need for Cognition (NFC) reflects individuals’ tendencies to engage in and enjoy effortful cognitive activities and has been linked to positive academic outcomes (e.g., higher academic achievement). However, the extent to which NFC is expressed in actual learning behaviors remains unclear. Therefore, we investigated how NFC manifests in digital learning behaviors by analyzing behavioral trace data from undergraduate students in an online chemistry course across 4 periods (before Midterms 1 and 2, respectively, and before and after the final exam). We identified 20 behavioral indicators that were most strongly correlated with NFC and used them in supervised machine learning (ML) models to predict achievement (final course grade) and motivation (continued course interest). Interestingly, whereas NFC was correlated with interest but not achievement, NFC-related behaviors moderately predicted achievement but explained only little variance in interest. Building on the full set of > 700 behavioral indicators, ML models more strongly predicted academic achievement but were also less successful in predicting interest. Generally, greater overall activity, self-testing behavior, and lecture engagement were predictive of better performance. Interest was predicted by indicators reflecting behavioral variability. Importantly, NFC-related behaviors were not among the most predictive features for either outcome. This finding suggests that, although NFC has previously been linked to better academic functioning, its behavioral expressions might not align with the most effective digital learning patterns. Our study offers novel insights into NFC in digital learning and highlights the importance and challenges of using trace data to predict motivational outcomes, such as interest.
Enhancing LLM-Based Data Annotation with Error Decomposition
2026-04-25
article · Open access · Senior author
Large language models (LLMs) offer a scalable alternative to human coding for data annotation tasks, enabling the scale-up of research across data-intensive domains such as learning analytics. While LLMs are already achieving near-human accuracy on objective annotation tasks, their performance on subjective annotation tasks, such as those involving psychological constructs, is less consistent and more prone to errors. Standard evaluation practices typically collapse all annotation errors into a single alignment metric, but this simplified approach may obscure different kinds of errors that affect final analytical conclusions in different ways. Here, we propose a diagnostic evaluation paradigm that incorporates a human-in-the-loop step to separate task-inherent ambiguity from model-driven inaccuracies and assess annotation quality in terms of their potential downstream impacts. We refine this paradigm on ordinal annotation tasks, which are common in subjective annotation. The refined paradigm includes: (1) a diagnostic taxonomy that categorizes LLM annotation errors along two dimensions: source (model-specific vs. task-inherent) and type (boundary ambiguity vs. conceptual misidentification); (2) a lightweight human annotation test to estimate task-inherent ambiguity from LLM annotations; and (3) a computational method to decompose observed LLM annotation errors following our taxonomy. We validate this paradigm on four educational annotation tasks, demonstrating both its conceptual validity and practical utility. Theoretically, our work provides empirical evidence for why excessively high alignment is unrealistic in specific annotation tasks and why single alignment metrics inadequately reflect the quality of LLM annotations. In practice, our paradigm can be a low-cost diagnostic tool that assesses the suitability of a given task for LLM annotation and provides actionable insights for further technical optimization.
2025-07-17
article
Procrastination has been linked to lower academic performance and sociodemographic achievement gaps in a variety of educational contexts, posing challenges to student success and educational equity. While prior research acknowledges that learning environments play a crucial role in shaping student procrastination alongside personal traits, there is a lack of solid empirical evidence on the connection between specific variations in learning environments and academic procrastination. This study provides a large-scale evaluation of the relationship between course and assignment characteristics and student procrastination behavior using a sample of 33,514 students across 3,169 courses at a US university. Using fixed effects linear regression models, we find that students tend to procrastinate less in courses with larger enrollment, non-introductory content, and well-structured deadlines. Procrastination is also lower for assignments with spaced-out deadlines, weekend deadlines, and a quiz or discussion post format. However, these patterns do not apply equally across all student groups. Male, ethnic minority, and first-generation college students exhibit higher levels of procrastination than their peers, especially for courses and assignments with specific characteristics. We suggest two instructional design strategies to help manage procrastination across student populations: (1) allowing more time before the first assignment deadline, and (2) ensuring adequate spacing between deadlines. This study provides large-scale evidence of the complex relationship between learning environment design, student characteristics, and procrastination.
Frequent coauthors
- Christian Fischer, University of Tübingen (24 shared)
- Hye Rin Lee, Pusan National University Yangsan Hospital (20 shared)
- Luise von Keyserlingk, University of Tübingen (20 shared)
- Hanna Gaspard, TU Dortmund University (18 shared)
- Katsumi Yamaguchi‐Pedroza (18 shared)
- Marion Spengler, MSB Medical School Berlin (17 shared)
- Julia Moeller, Leipzig University (17 shared)
- René F. Kizilcec (10 shared)