
Guang Cheng
Professor · University of California, Los Angeles · Computer Science
Active 1994–2025
About
Guang Cheng is a professor in the Department of Computer Science at the UCLA Samueli School of Engineering. His research interests include generative data science, machine and deep learning algorithms and theory, and the statistical foundations of data science. He holds a PhD in Statistics from the University of Wisconsin-Madison (2006) and a BA in Economics and Management from Tsinghua University (2002). His awards and recognitions include IMS Fellow (2020), the Adobe Data Science Faculty Award (2020), University Faculty Scholar (2018), the Simons Fellowship in Mathematics (2014), the Noether Young Scholar Award (2012), the NSF CAREER Award (2012), and the Facebook X Instagram award. His work advances both the theoretical and practical understanding of data science and machine learning.
Research topics
- Computer Science
- Artificial Intelligence
- Political Science
- Data Mining
- Computer Security
- Machine Learning
- Data Science
- Algorithm
- Mathematics
Selected publications
BTG-RF: Recognizing Douyin payment behaviors based on behavioral traffic graph analysis
Computer Networks · 2025-06-02
article
ArXiv.org · 2025-05-27
preprint · Open access · Senior author
The growing complexity of encrypted network traffic presents dual challenges for modern network management: accurate multiclass classification of known applications and reliable detection of unknown traffic patterns. Although deep learning models show promise in controlled environments, their real-world deployment is hindered by data scarcity, concept drift, and operational constraints. This paper proposes M3S-UPD, a novel Multi-Stage Self-Supervised Unknown-aware Packet Detection framework that synergistically integrates semi-supervised learning with representation analysis. Our approach eliminates artificial segregation between classification and detection tasks through a four-phase iterative process: 1) probabilistic embedding generation, 2) clustering-based structure discovery, 3) distribution-aligned outlier identification, and 4) confidence-aware model updating. Key innovations include a self-supervised unknown detection mechanism that requires neither synthetic samples nor prior knowledge, and a continuous learning architecture that is resistant to performance degradation. Experimental results show that M3S-UPD not only outperforms existing methods on the few-shot encrypted traffic classification task, but also simultaneously achieves competitive performance on the zero-shot unknown traffic discovery task.
MW3F: Improved multi-tab website fingerprinting attacks with Transformer-based feature fusion
Journal of Network and Computer Applications · 2025-01-30 · 3 citations
article · Senior author · Corresponding
ArXiv.org · 2025-08-04
preprint · Open access
With the continuous development of network environments and technologies, ensuring cyber security and governance is increasingly challenging. Network traffic classification (ETC) can analyze attributes such as application categories and malicious intent, supporting network management services like QoS optimization, intrusion detection, and targeted billing. As the prevalence of traffic encryption increases, deep learning models are relied upon for content-agnostic analysis of packet sequences. However, the emergence of new services and attack variants often leads to incremental tasks for ETC models. To ensure model effectiveness, incremental learning techniques are essential; however, recent studies indicate that neural networks experience declining plasticity as tasks increase. We identified plasticity issues in existing incremental learning methods across diverse traffic samples and proposed the PRIME framework. By observing the effective rank of model parameters and the proportion of inactive neurons, the PRIME architecture can appropriately increase the parameter scale when the model's plasticity deteriorates. Experiments show that in multiple encrypted traffic datasets and different category increment scenarios, the PRIME architecture performs significantly better than other incremental learning algorithms with minimal increase in parameter scale.
Exploring Reasoning-Infused Text Embedding with Large Language Models for Zero-Shot Dense Retrieval
ArXiv.org · 2025-08-29
preprint · Open access
Transformer-based models such as BERT and E5 have significantly advanced text embedding by capturing rich contextual representations. However, many complex real-world queries require sophisticated reasoning to retrieve relevant documents beyond surface-level lexical matching, where encoder-only retrievers often fall short. Decoder-only large language models (LLMs), known for their strong reasoning capabilities, offer a promising alternative. Despite this potential, existing LLM-based embedding methods primarily focus on contextual representation and do not fully exploit the reasoning strength of LLMs. To bridge this gap, we propose Reasoning-Infused Text Embedding (RITE), a simple but effective approach that integrates logical reasoning into the text embedding process using generative LLMs. RITE builds upon existing language model embedding techniques by generating intermediate reasoning texts in the token space before computing embeddings, thereby enriching representations with inferential depth. Experimental results on BRIGHT, a reasoning-intensive retrieval benchmark, demonstrate that RITE significantly enhances zero-shot retrieval performance across diverse domains, underscoring the effectiveness of incorporating reasoning into the embedding process.
GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction
2025-05-19 · 1 citation
article · Senior author
Tabular data synthesis involves not only multi-table synthesis but also generating multi-modal data (e.g., strings and categories), which enables diverse knowledge synthesis. However, separating numerical and categorical data has limited the effectiveness of tabular data generation. The GReaT (Generate Realistic Tabular Data) framework uses Large Language Models (LLMs) to encode entire rows, eliminating the need to partition data types. Despite this, the framework's performance is constrained by two issues: (1) tabular data entries lack sufficient semantic meaning, limiting the LLM's ability to leverage pre-trained knowledge for in-context learning, and (2) complex multi-table datasets struggle to establish effective relationships for collaboration. To address these, we propose GReaTER (Generate Realistic Tabular Data after data Enhancement and Reduction), which includes: (1) a data semantic enhancement system that improves the LLM's understanding of tabular data through mapping, enabling better in-context learning, and (2) a cross-table connecting method to establish efficient relationships across complex tables. Experimental results show that GReaTER outperforms the GReaT framework.
Privacy Auditing Synthetic Data Release through Local Likelihood Attacks
ArXiv.org · 2025-08-28
preprint · Open access · Senior author
Auditing the privacy leakage of synthetic data is an important but unresolved problem. Existing privacy auditing frameworks for synthetic data rely on heuristics and unrealistic assumptions about model access, offering limited ability to describe or detect the privacy exposure of training data through synthetic data release. In this paper, we study designing membership inference attacks (MIAs) that specifically exploit the observation that tabular generative models tend to significantly overfit to certain regions of the training distribution. We propose Generative Likelihood Ratio Attack (Gen-LRA), a novel, computationally efficient No-Box MIA that, with no assumption of model knowledge or access, formulates its attack by evaluating the influence a test observation has on a surrogate model's estimate of a local likelihood ratio over the synthetic data. We develop a theoretical framework for the attack: we show that the Gen-LRA score admits a closed-form characterization as a localized density-ratio statistic, and we prove that under a general model of local overfitting it produces a provable mean-score gap between members and non-members, yielding testable predictions for when the attack should succeed. We validate these predictions in a controlled simulation study and assess Gen-LRA against a comprehensive benchmark spanning diverse datasets, generative model architectures, and attack parameters. Across metrics, Gen-LRA consistently dominates competing MIAs, with especially strong gains at low false positive rates. These results underscore Gen-LRA's effectiveness as a privacy auditing tool for the release of synthetic data, and highlight the significant privacy risks posed by generative model overfitting in real-world applications.
A Probabilistic Perspective on Model Collapse
ArXiv.org · 2025-05-20
preprint · Open access · Senior author
In recent years, model collapse has become a critical issue in language model training, making it essential to understand the underlying mechanisms driving this phenomenon. In this paper, we investigate recursive parametric model training from a probabilistic perspective, aiming to characterize the conditions under which model collapse occurs and, crucially, how it can be mitigated. We conceptualize the recursive training process as a random walk of the model estimate, highlighting how the sample size influences the step size and how the estimation procedure determines the direction and potential bias of the random walk. Under mild conditions, we rigorously show that progressively increasing the sample size at each training step is necessary to prevent model collapse. In particular, when the estimation is unbiased, the required growth rate follows a superlinear pattern. This rate needs to be accelerated even further in the presence of substantial estimation bias. Building on this probabilistic framework, we also investigate the probability that recursive training on synthetic data yields models that outperform those trained solely on real data. Moreover, we extend these results to general parametric model families in an asymptotic regime. Finally, we validate our theoretical results through extensive simulations and a real-world dataset.
Rate-Optimal Rank Aggregation with Private Pairwise Rankings
Journal of the American Statistical Association · 2025-04-03 · 1 citation
article · Senior author
Breaking Distortion-free Watermarks in Large Language Models
ArXiv.org · 2025-02-25
preprint · Open access
In recent years, LLM watermarking has emerged as an attractive safeguard against AI-generated content, with promising applications in many real-world domains. However, there are growing concerns that the current LLM watermarking schemes are vulnerable to expert adversaries wishing to reverse-engineer the watermarking mechanisms. Prior work in breaking or stealing LLM watermarks mainly focuses on the distribution-modifying algorithm of Kirchenbauer et al. (2023), which perturbs the logit vector before sampling. In this work, we focus on reverse-engineering the other prominent LLM watermarking scheme, distortion-free watermarking (Kuditipudi et al. 2024), which preserves the underlying token distribution by using a hidden watermarking key sequence. We demonstrate that, even under a more sophisticated watermarking scheme, it is possible to compromise the LLM and carry out a spoofing attack, i.e. generate a large number of (potentially harmful) texts that can be attributed to the original watermarked LLM. Specifically, we propose using adaptive prompting and a sorting-based algorithm to accurately recover the underlying secret key for watermarking the LLM. Our empirical findings on LLAMA-3.1-8B-Instruct, Mistral-7B-Instruct, Gemma-7b, and OPT-125M challenge the current theoretical claims on the robustness and usability of the distortion-free watermarking techniques.
Recent grants
Collaborative Research: Nonparametric Bayesian Aggregation for Massive Data
NSF · $140k · 2017–2020
General Semiparametric Inference via Bootstrap Sampling
NSF · $100k · 2009–2012
Collaborative Research: Semiparametric ODE Models for Complex Gene Regulatory Networks
NSF · $46k · 2014–2017
CAREER: Bootstrap M-estimation in Semi-Nonparametric Models
NSF · $400k · 2012–2018
Frequent coauthors
- Zuofeng Shang · 49 shared
- Shilian Kan · Tianjin Hospital · 25 shared
- Shuangle Zong · Second Hospital of Tangshan · 25 shared
- Weidong Liang · First Affiliated Hospital of Gannan Medical University · 25 shared
- Lidong Li · University of Science and Technology Beijing · 25 shared
- Ligeng Li · Second Hospital of Tangshan · 25 shared
- Aijun Wang · Qilu Hospital of Shandong University · 25 shared
- Qiutao Zheng · 25 shared
Awards & honors
- IMS Fellow (2020)
- Adobe Data Science Faculty Award (2020)
- University Faculty Scholar (2018)
- Simons Fellowship in Mathematics (2014)
- Noether Young Scholar Award (2012)