Muhao Chen
Assistant Professor of Computer Science · University of California, Davis · Biomedical Engineering
Active 2015–2026
About
Dr. Muhao Chen is the director of the Language Understanding and Knowledge Acquisition (LUKA) Lab, also known as the Capital NLP Group of California. His research primarily focuses on developing robust and accountable machine learning methods tailored for natural language processing and multi-modal data processing. Recently, his work has concentrated on addressing robustness and safety challenges associated with large multi-modal language models and foundation model agents. The overarching goal of Dr. Chen's research is to create robust, generalizable, and trustworthy learning systems that enable machines to better understand the natural world.
Research topics
- Computer Science
- Artificial Intelligence
- Machine Learning
- Data Mining
- Natural Language Processing
- Theoretical computer science
- Engineering
Selected publications
DebugLM: Learning Traceable Training Data Provenance for LLMs
arXiv (Cornell University) · 2026-03-18
preprint · Open access · Senior author
Large language models (LLMs) are trained through multi-stage pipelines over heterogeneous data sources, yet developers lack a principled way to pinpoint the specific data responsible for an observed behavior. This lack of observability reduces debugging to reactive patching and makes failures prone to recur under distribution shift or subsequent model updates. To address this limitation, we propose DebugLM, a framework that equips LLMs with built-in data provenance, enabling them to explicitly trace the origins of their behaviors to specific training data sources. Specifically, the model learns to associate its responses with unique provenance tags that indicate the responsible dataset, empowering developers to precisely identify where undesirable behaviors are learned. Building on this capability, DebugLM further supports targeted test-time remediation, enabling developers to selectively trigger targeted refusal for specified data sources without retraining or modifying model parameters. Experiments demonstrate that DebugLM provides accurate behavior tracing in multi-stage training pipelines and effective test-time remediation while preserving the general utility of the model.
Code Execution as Grounded Supervision for LLM Reasoning
ArXiv.org · 2025-06-12
preprint · Open access · Senior author
Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural-language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, the ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by reducing meaningless repetition and overthinking.
Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models
ArXiv.org · 2025-05-26
preprint · Open access
Multimodal Large Language Models demonstrate strong performance on multimodal benchmarks, yet often exhibit poor robustness when exposed to spurious modality interference, such as irrelevant text in vision understanding, or irrelevant visual content in question answering. At its core, modality interference refers to cases where spurious signals from non-essential modalities distort model decisions, which we systematically analyze through causal, perturbation-based diagnostic experiments. To address this problem, we propose a unified finetuning framework that combines heuristic and adversarial perturbation-based data augmentation with output-level consistency regularization between original and perturbed inputs. Extensive experiments across image-heavy, text-heavy, and multimodal benchmarks, spanning multiple MLLM architectures and model scales, demonstrate consistent improvements in unimodal robustness and generalization, while improving standard multimodal performance.
Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation
ArXiv.org · 2025-05-29
preprint · Open access
Recent decoding methods improve the factuality of large language models (LLMs) by refining how the next token is selected during generation. These methods typically operate at the token level, leveraging internal representations to suppress superficial patterns. Nevertheless, LLMs remain prone to hallucinations, especially over longer contexts. In this paper, we propose Active Layer-Contrastive Decoding (ActLCD), a novel decoding strategy that actively decides when to apply contrasting layers during generation. By casting decoding as a sequential decision-making problem, ActLCD employs a reinforcement learning policy guided by a reward-aware classifier to optimize factuality beyond the token level. Our experiments demonstrate that ActLCD surpasses state-of-the-art methods across five benchmarks, showcasing its effectiveness in mitigating hallucinations in diverse generation scenarios.
Heliyon · 2025-01-22 · 3 citations
article · Open access · 1st author · Corresponding
L.) have been in high demand and planted in large quantities due to their nutritional value and appealing organoleptic properties. The management mode and species characteristics in the tillage process lead to a decrease in soil quality, and the stability of soil aggregates and decrease in soil nutrients indicate this. However, the effects of different sweet cherry varieties and increasing planting ages on soil quality remain unknown. In this study, soil samples were quantitatively analyzed at different soil depths (0-20 cm, 20-40 cm, 40-60 cm) in cherry orchards of different varieties and ages. The results demonstrated that the particle size content of soil aggregates differed among the varieties of sweet cherry in different soil layers. The mechanical stability of soil aggregates was found to be the lowest in Jimei cherry orchard, where the mass ratio of aggregates with particle sizes exceeding 0.25 mm (R > 0.25) was below the highest 20.99 %, geometric mean diameter (GMD) was below 22.52 %, and mean weight diameter (MWD) was below 17.46 %. In contrast, lower ages demonstrated superior performance in aggregate water stability. The stability of soil aggregates was found to be affected by sweet cherry cultivation, with changes observed in the content of SOC and TN in the surface soil. Principal component analysis indicated that soil quality deteriorated increasingly with ages, while pass-through analysis demonstrated that ages and soil aggregate stability were key factors influencing soil quality. In conclusion, in addition to economic benefits, soil quality should also be protected. This study can help to improve the scientific theoretical basis for the introduction of sweet cherry planting on the Loess Plateau and the management of soil quality.
ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation
ArXiv.org · 2025-10-09
preprint · Open access · Senior author
Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demonstrate true generalization, which inflates scores, distorts cross-model comparisons, and misrepresents progress. We introduce ArenaBencher, a model-agnostic framework for automatic benchmark evolution that updates test cases while preserving comparability. Given an existing benchmark and a diverse pool of models to be evaluated, ArenaBencher infers the core ability of each test case, generates candidate question-answer pairs that preserve the original objective, verifies correctness and intent with an LLM as a judge, and aggregates feedback from multiple models to select candidates that expose shared weaknesses. The process runs iteratively with in-context demonstrations that steer generation toward more challenging and diagnostic cases. We apply ArenaBencher to math problem solving, commonsense reasoning, and safety domains and show that it produces verified, diverse, and fair updates that uncover new failure modes, increase difficulty while preserving test objective alignment, and improve model separability. The framework provides a scalable path to continuously evolve benchmarks in step with the rapid progress of foundation models.
Exploring Spontaneous Social Interaction Swarm Robotics Powered by Large Language Models
2025-10-19
article
Traditional swarm robots rely on specific communication and planning strategies to coordinate particular tasks. Human swarms exhibit distinctive characteristics due to their capacity for language-based communication and active reasoning. This paper presents an exploratory approach to robotic swarm intelligence that leverages Large Language Models (LLMs) to emulate human-like active problem-solving behaviors. We introduce a decentralized multi-robot system where each robot initially only has its local information and does not know of the existence of the other robots. The robots utilize LLMs for reasoning and natural language for inter-robot communication, enabling them to discover peers, share information, and coordinate actions dynamically. In a series of experiments in zero-shot settings, we observed human-like social behaviors, including mutual discovery, identification, information exchange, collaboration, negotiation, and error correction. While the technical approach is straightforward, the main contribution lies in exploring the interactive societies that LLM-driven robots form – a form of robot social dynamics (or robotic social behavior analysis), examining how human-like communication protocols and collaborative structures emerge among robots through language-based interaction. In this context, we use the term "robot social dynamics" to describe the interaction patterns that arise within robot collectives, inspired by, but distinct from traditional human anthropology.
False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
ArXiv.org · 2025-09-04
preprint · Open access · Senior author
Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussions in the hope of suggesting responsible further research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.
DiscoSum: Discourse-aware News Summarization
ArXiv.org · 2025-06-07
preprint · Open access · Senior author
Recent advances in text summarization have predominantly leveraged large language models to generate concise summaries. However, language models often do not maintain long-term discourse structure, especially in news articles, where organizational flow significantly influences reader engagement. We introduce a novel approach to integrating discourse structure into summarization processes, focusing specifically on news articles across various media. We present a novel summarization dataset where news articles are summarized multiple times in different ways across different social media platforms (e.g., LinkedIn and Facebook). We develop a novel news discourse schema to describe summarization structures and a novel algorithm, DiscoSum, which employs a beam search technique for structure-aware summarization, enabling the transformation of news stories to meet different stylistic and structural demands. Both human and automatic evaluation results demonstrate the efficacy of our approach in maintaining narrative fidelity and meeting structural requirements.
Recent grants
CRII: III: Knowledge Graph Completion with Transferable Representation Learning
NSF · $175k · 2021–2024
Frequent coauthors
- Wenxuan Zhou · 103 shared
- Fei Wang · 59 shared
- Kai-Wei Chang · 28 shared
- Dan Roth · 27 shared
- Carlo Zaniolo (University of California, Los Angeles) · 25 shared
- Hongming Zhang · 22 shared
- Zekun Li · 21 shared
- Ehsan Qasemi · 21 shared