Kevin Chenchuan Chang
Professor · University of Illinois Urbana-Champaign · Computer Science
Active 1996–2025
About
Kevin Chen-Chuan Chang is a Professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He received a Bachelor of Science degree from National Taiwan University and a Ph.D. in Electrical Engineering from Stanford University in 2001. His research addresses large-scale information access and knowledge acquisition, focusing on search, mining, and integration across structured and unstructured big data, with the aim of bridging the two to enable semantically rich access to vast amounts of information. His current research emphasizes Web search and mining, as well as social media analytics. He leads the FORWARD Data Lab group within the Data and Information Systems Laboratories at UIUC. His research spans natural language processing, data mining, data management, information retrieval, and machine learning, with a focus on applications in Web and social media-based knowledge organization. He has received numerous awards, including the ICDE 10-Year Test of Time Award in 2022, the NSF CAREER Award in 2002, and multiple teaching awards at the University of Illinois. He also co-founded a startup, Cazoodle, and developed GrantForward.com, a funding discovery service used by over 200 institutions.
Research topics
- Computer Science
- Artificial Intelligence
- Mathematics
- Theoretical computer science
- Management science
- Combinatorics
- Cognitive science
- Psychology
- Engineering
Selected publications
Understanding Cross-Domain Adaptation in Low-Resource Topic Modeling
ArXiv.org · 2025-06-09
preprint · Open access · Senior author
Topic modeling plays a vital role in uncovering hidden semantic structures within text corpora, but existing models struggle in low-resource settings where limited target-domain data leads to unstable and incoherent topic inference. We address this challenge by formally introducing domain adaptation for low-resource topic modeling, where a high-resource source domain informs a low-resource target domain without overwhelming it with irrelevant content. We establish a finite-sample generalization bound showing that effective knowledge transfer depends on robust performance in both domains, minimizing latent-space discrepancy, and preventing overfitting to the data. Guided by these insights, we propose DALTA (Domain-Aligned Latent Topic Adaptation), a new framework that employs a shared encoder for domain-invariant features, specialized decoders for domain-specific nuances, and adversarial alignment to selectively transfer relevant information. Experiments on diverse low-resource datasets demonstrate that DALTA consistently outperforms state-of-the-art methods in terms of topic coherence, stability, and transferability.
Conversion of the Treatise on Invertebrate Paleontology volumes into a FAIR database
Abstracts with programs - Geological Society of America · 2025-01-01
article
Understanding Cross-Domain Adaptation in Low-Resource Topic Modeling
2025-01-01
article · Open access · Senior author
Topic modeling plays a vital role in uncovering hidden semantic structures within text corpora, but existing models struggle in low-resource settings where limited target-domain data leads to unstable and incoherent topic inference. We address this challenge by formally introducing domain adaptation for low-resource topic modeling, where a high-resource source domain informs a low-resource target domain without overwhelming it with irrelevant content. We establish a finite-sample generalization bound showing that effective knowledge transfer depends on robust performance in both domains, minimizing latent-space discrepancy, and preventing overfitting to the data. Guided by these insights, we propose DALTA (Domain-Aligned Latent Topic Adaptation), a new framework that employs a shared encoder for domain-invariant features, specialized decoders for domain-specific nuances, and adversarial alignment to selectively transfer relevant information. Experiments on diverse low-resource datasets demonstrate that DALTA consistently outperforms state-of-the-art methods in terms of topic coherence, stability, and transferability.
ERU-KG: Efficient Reference-aligned Unsupervised Keyphrase Generation
2025-01-01
article · Open access · Senior author
Unsupervised keyphrase prediction has gained growing interest in recent years. However, existing methods typically rely on heuristically defined importance scores, which may lead to inaccurate informativeness estimation. In addition, they lack consideration for time efficiency. To solve these problems, we propose ERU-KG, an unsupervised keyphrase generation (UKG) model that consists of an informativeness and a phraseness module. The former estimates the relevance of keyphrase candidates, while the latter generates those candidates. The informativeness module innovates by learning to model informativeness through references (e.g., queries, citation contexts, and titles) and at the term-level, thereby 1) capturing how the key concepts of documents are perceived in different contexts and 2) estimating informativeness of phrases more efficiently by aggregating term informativeness, removing the need for explicit modeling of the candidates. ERU-KG demonstrates its effectiveness on keyphrase generation benchmarks by outperforming unsupervised baselines and achieving on average 89% of the performance of a supervised model for top 10 predictions. Additionally, to highlight its practical utility, we evaluate the model on text retrieval tasks and show that keyphrases generated by ERU-KG are effective when employed as query and document expansions. Furthermore, inference speed tests reveal that ERU-KG is the fastest among baselines of similar model sizes. Finally, our proposed model can switch between keyphrase generation and extraction by adjusting hyperparameters, catering to diverse application requirements.
MuSha: Subgraph Matching by Multilevel Sharing
2025-05-19
article
Subgraph matching (SM) is a fundamental problem in graph data analysis. Real-world patterns used in graph analysis are often symmetric and contain isomorphic substructures, but existing SM algorithms fail to explore such properties. To fill this gap, we propose MuSha, a multi-objective optimization framework for SM, leveraging multilevel sharing of isomorphic substructure results to speed up SM and symmetry breaking to avoid directly computing symmetric results. To efficiently compute and cache intermediate results for sharing, MuSha applies worst-case optimal joins (WCOJs) and utilizes trie data structures to compress and index results. To enable multilevel sharing, MuSha solves a multi-objective optimization problem involving pattern decomposition, symmetry breaking, WCOJ orders, and trie structural orders. Experimental results demonstrate that MuSha outperforms the state of the art by up to two orders of magnitude on graphs of millions of vertices.
Query Optimization for Parametric Knowledge Refinement in Retrieval-Augmented Large Language Models
2025-01-01
article · Open access · Senior author
We introduce the Extract-Refine-Retrieve-Read (ERRR) framework, a novel approach designed to bridge the pre-retrieval information gap in Retrieval-Augmented Generation (RAG) systems through query optimization tailored to meet the specific knowledge requirements of Large Language Models (LLMs). Unlike conventional query optimization techniques used in RAG, the ERRR framework begins by extracting parametric knowledge from LLMs, followed by using a specialized query optimizer for refining these queries. This process ensures the retrieval of only the most pertinent information essential for generating accurate responses. Moreover, to enhance flexibility and reduce computational costs, we propose a trainable scheme for our pipeline that utilizes a smaller, tunable model as the query optimizer, which is refined through knowledge distillation from a larger teacher model. Our evaluations on various question-answering (QA) datasets and with different retrieval systems show that ERRR consistently outperforms existing baselines, proving to be a versatile and cost-effective module for improving the utility and accuracy of RAG systems.
CASPER: Concept-integrated Sparse Representation for Scientific Retrieval
ArXiv.org · 2025-08-18
preprint · Open access · Senior author
Identifying relevant research concepts is crucial for effective scientific search. However, primary sparse retrieval methods often lack concept-aware representations. To address this, we propose CASPER, a sparse retrieval model for scientific search that utilizes both tokens and keyphrases as representation units (i.e., dimensions in the sparse embedding space). This enables CASPER to represent queries and documents via research concepts and match them at both granular and conceptual levels. Furthermore, we construct training data by leveraging abundant scholarly references (including titles, citation contexts, author-assigned keyphrases, and co-citations), which capture how research concepts are expressed in diverse settings. Empirically, CASPER outperforms strong dense and sparse retrieval baselines across eight scientific retrieval benchmarks. We also explore the effectiveness-efficiency trade-off via representation pruning and demonstrate CASPER's interpretability by showing that it can serve as an effective and efficient keyphrase generation model.
RL-based Query Rewriting with Distilled LLM for online E-Commerce Systems
ArXiv.org · 2025-01-29
preprint · Open access · Senior author
Query rewriting (QR) is a critical technique in e-commerce search, addressing the lexical gap between user queries and product descriptions to enhance search performance. Existing QR approaches typically fall into two categories: discriminative models and generative methods leveraging large language models (LLMs). Discriminative models often struggle with natural language understanding and offer limited flexibility in rewriting, while generative LLMs, despite producing high-quality rewrites, face high inference latency and cost in online settings. These limitations force offline deployment, making them vulnerable to issues like information staleness and semantic drift. To overcome these challenges, we propose a novel hybrid pipeline for QR that balances efficiency and effectiveness. Our approach combines offline knowledge distillation to create a lightweight but efficient student model with online reinforcement learning (RL) to refine query rewriting dynamically using real-time feedback. A key innovation is the use of LLMs as simulated human feedback, enabling scalable reward signals and cost-effective evaluation without manual annotations. Experimental results on the Amazon ESCI dataset demonstrate significant improvements in query relevance, diversity, and adaptability, as well as positive feedback from the LLM simulation. This work contributes to advancing LLM capabilities for domain-specific applications, offering a robust solution for dynamic and complex e-commerce search environments.
Social Media and Orthopaedics: Establishing Your Online Reputation.
PubMed · 2025-01-01
review · 1st author · Corresponding
With the rise of internet and social media usage in the 21st century, patients have increasingly been looking to online resources for information regarding their health care. It is imperative for physicians to recognize the trends and role of these tools in clinical orthopaedic practice, and to harness these tools to educate users, connect with other physicians, and interact with current and potential patients. It is important to review the current literature regarding social media in orthopaedics; some commonly used social media platforms and their individual characteristics; and general guidelines for creating content and managing an online reputation.
MiniELM: A Lightweight and Adaptive Query Rewriting Framework for E-Commerce Search Optimization
2025-01-01
article · Open access · Senior author
Query rewriting (QR) is a critical technique in e-commerce search, addressing the lexical gap between user queries and product descriptions to enhance search performance. Existing QR approaches typically fall into two categories: discriminative models and generative methods leveraging large language models (LLMs). Discriminative models often struggle with natural language understanding and offer limited flexibility in rewriting, while generative LLMs, despite producing high-quality rewrites, face high inference latency and cost in online settings. These limitations force offline deployment, making them vulnerable to issues like information staleness and semantic drift. To overcome these challenges, we propose a novel hybrid pipeline for QR that balances efficiency and effectiveness. Our approach combines offline knowledge distillation to create a lightweight but efficient student model with online reinforcement learning (RL) to refine query rewriting dynamically using real-time feedback. A key innovation is the use of LLMs as simulated human feedback, enabling scalable reward signals and cost-effective evaluation without manual annotations. Experimental results on the Amazon ESCI dataset demonstrate significant improvements in query relevance, diversity, and adaptability, as well as positive feedback from the LLM simulation. This work contributes to advancing LLM capabilities for domain-specific applications, offering a robust solution for dynamic and complex e-commerce search environments.
Recent grants
NSF · $500k · 2016–2020
CAREER: MetaQuerier: Dynamic Ad Hoc Information Integration Across the Internet
NSF · $300k · 2002–2007
NSF · $500k · 2010–2014
NSF · $1.8M · 2016–2022
ITR: Shallow Integration over the Deep Web: A Holistic Approach
NSF · $306k · 2003–2006
Frequent coauthors
- Jie Huang · Chinese University of Hong Kong · 32 shared
- Vincent W. Zheng · Rutgers, The State University of New Jersey · 25 shared
- Wen-mei Hwu · University of Illinois Urbana-Champaign · 18 shared
- Pritom Saha Akash · 16 shared
- Bin He · University of Illinois Urbana-Champaign · 15 shared
- Jinjun Xiong · 15 shared
- Yuan Fang · Singapore Management University · 12 shared
- Jie Huang · 11 shared
Education
- 2005 · Ph.D., Computer Science · University of Illinois at Urbana-Champaign
- 2001 · M.S., Computer Science · University of Illinois at Urbana-Champaign
- 1998 · B.S., Computer Science · University of Science and Technology of China
Awards & honors
- ICDE 10-Year Test of Time Award (2022)
- Best Paper Selection/Awards in VLDB 2000 and 2013
- ASONAM 2019
- NSF CAREER Award (2002)
- NCSA Faculty Fellow Award (2003)