Anru Zhang
· Eugene Anson Stead, Jr. M.D. Associate ProfessorVerifiedDuke University · Civil & Environmental Engineering
Active 1997–2026
About
I am a tenured faculty member and the Eugene Anson Stead, Jr. M.D. Associate Professor with joint primary appointments in the Department of Biostatistics & Bioinformatics and the Department of Computer Science at Duke University. My research focuses on generative AI, tensor learning, electronic health records, healthcare, microbiome studies, high-dimensional statistics, and more.
Research topics
- Biology
- Mathematics
- Genetics
- Mathematical optimization
- Computational biology
- Cell biology
- Algorithm
- Statistics
Selected publications
Statistical Inference for Fuzzy Clustering
ArXiv.org · 2026-01-06
articleOpen accessSenior authorClustering is a central tool in biomedical research for discovering heterogeneous patient subpopulations, where group boundaries are often diffuse rather than sharply separated. Traditional methods produce hard partitions, whereas soft clustering methods such as fuzzy $c$-means (FCM) allow mixed memberships and better capture uncertainty and gradual transitions. Despite the widespread use of FCM, principled statistical inference for fuzzy clustering remains limited. We develop a new framework for weighted fuzzy $c$-means (WFCM) for settings with potential cluster size imbalance. Cluster-specific weights rebalance the classical FCM criterion so that smaller clusters are not overwhelmed by dominant groups, and the weighted objective induces a normalized density model with scale parameter $σ$ and fuzziness parameter $m$. Estimation is performed via a blockwise majorize--minimize (MM) procedure that alternates closed-form membership and centroid updates with likelihood-based updates of $(σ,\bw)$. The intractable normalizing constant is approximated by importance sampling using a data-adaptive Gaussian mixture proposal. We further provide likelihood ratio tests for comparing cluster centers and bootstrap-based confidence intervals. We establish consistency and asymptotic normality of the maximum likelihood estimator, validate the method through simulations, and illustrate it using single-cell RNA-seq and Alzheimer disease Neuroimaging Initiative (ADNI) data. These applications demonstrate stable uncertainty quantification and biologically meaningful soft memberships, ranging from well-separated cell populations under imbalance to a graded AD versus non-AD continuum consistent with disease progression.
Don't Retrain, Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment
ArXiv.org · 2026-05-07
articleOpen accessDiffusion language models (DLMs) have recently demonstrated capabilities that complement standard autoregressive (AR) models, particularly in non-sequential generation and bidirectional editing. Although recent work has shown that pretrained autoregressive checkpoints can be converted into diffusion language models, existing recipes primarily transfer parameters through continued denoising training with objective- and attention-level modifications. We instead ask whether the internal representation geometry learned by next-token prediction can be explicitly preserved during AR-to-DLM conversion. We hypothesize that much of the semantic structure learned by AR pretraining can transfer across generation orders, and thus DLM training should be viewed as relearning the decoding path rather than relearning language representations. To investigate this, we introduce REPR-ALIGN, a representation alignment objective that adapts a bidirectional masked diffusion model to reuse representations from a pretrained AR model of identical architecture. Concretely, we align the hidden states of the DLM to the frozen AR model at every layer using cosine similarity, while optimizing the standard masked denoising objective. This simple alignment, with no adapters and no architectural changes beyond the attention mask, yields up to 4x training acceleration in our setting and is particularly effective in low-data regimes. Our results suggest that linguistic representations can transfer across generation order, and that representation alignment provides a simple and effective technique for training diffusion language models. Code is available at https://github.com/pengzhangzhi/Open-dLLM.
Tensor Decomposition with Unaligned Observations
SIAM Journal on Matrix Analysis and Applications · 2026-02-03
articleSenior authorCoupling Models for One-Step Discrete Generation
ArXiv.org · 2026-05-08
articleOpen accessGenerative modeling over discrete structures underpins applications across deep learning, from biological sequence design and code generation to large language models, yet generation often remains sequential, relying on autoregressive decoding or iterative refinement. In this work, we introduce Coupling Models(Coupling Models), a one-step discrete generative model that learns a direct coupling between discrete sequences and Gaussian latents. Unlike recent distillation methods that compress a pretrained multi-step sampler into a few steps, Coupling Model trains a purpose-built decoder to invert this coupling and generate samples in a single step. The model also avoids complex continuous flows over the simplex and hand-specified data-to-noise couplings. Empirically,Coupling Model improves the strongest one-step baselines in each domain: it reduces LM1B text-generation perplexity by 33% at its lowest-perplexity operating point, Fly Brain enhancer-design FBD by 18%, and MNIST-Binary FID by 46%. These results suggest that effective one-step discrete generation depends strongly on how data and noise are coupled before decoding. Code is available at https://github.com/pengzhangzhi/Coupling-Models.
A General Framework for Computational Lower Bounds in Nontrivial Norm Approximation
ArXiv.org · 2026-04-01
articleOpen accessSenior authorIn this note, we propose a general framework for proving computational lower bounds in norm approximation by leveraging a reverse detection--estimation gap. The starting point is a testing problem together with an estimator whose error is significantly smaller than the corresponding computational detection threshold. We show that such a gap yields a lower bound on the approximation distortion achievable by any algorithm in the underlying computational class. In this way, reverse detection--estimation gaps can be turned into a general mechanism for certifying the hardness of approximating nontrivial norms. We apply this framework to the spectral norm of order-$d$ symmetric tensors in $\mathbb{R}^{p^d}$. Using a recently established low-degree hardness result for detecting nonzero high-order cumulant tensors, together with an efficiently computable estimator whose error is below the low-degree detection threshold, we prove that any degree-$D$ low-degree algorithm with $D \le c_d(\log p)^2$ must incur distortion at least $p^{d/4-1/2}/\operatorname{polylog}(p)$ for the tensor spectral norm. Under the low-degree conjecture, the same conclusion extends to all polynomial-time algorithms. In several important settings, this lower bound matches the best known upper bounds up to polylogarithmic factors, suggesting that the exponent $d/4-1/2$ captures a genuine computational barrier. Our results provide evidence that the difficulty of approximating tensor spectral norm is not merely an artifact of existing techniques, but reflects a broader computational barrier.
Synthetic Augmentation in Imbalanced Learning: When It Helps, When It Hurts, and How Much to Add
ArXiv.org · 2026-01-22
articleOpen accessSenior authorImbalanced classification often causes standard training procedures to prioritize the majority class and perform poorly on rare but important cases. A classic and widely used remedy is to augment the minority class with synthetic samples, but two basic questions remain under-resolved: when does synthetic augmentation actually help, and how many synthetic samples should be generated? We develop a unified statistical framework for synthetic augmentation in imbalanced learning, studying models trained on imbalanced data augmented with synthetic minority samples. Our theory shows that synthetic data is not always beneficial. In a "local symmetry" regime, imbalance is not the dominant source of error, so adding synthetic samples cannot improve learning rates and can even degrade performance by amplifying generator mismatch. When augmentation can help ("local asymmetry"), the optimal synthetic size depends on generator accuracy and on whether the generator's residual mismatch is directionally aligned with the intrinsic majority-minority shift. This structure can make the best synthetic size deviate from naive full balancing. Practically, we recommend Validation-Tuned Synthetic Size (VTSS): select the synthetic size by minimizing balanced validation loss over a range centered near the fully balanced baseline, while allowing meaningful departures. Extensive simulations and real data analysis further support our findings.
Subtype-Aware Registration of Longitudinal Electronic Health Records
Figshare · 2026-01-13
datasetOpen accessSenior authorElectronic Health Records (EHRs) contain extensive patient information that can inform downstream clinical decisions, such as mortality prediction, disease phenotyping, and disease onset prediction. A key challenge in EHR data analysis is the temporal gap between when a condition is first recorded and its actual onset time. Such timeline misalignment can lead to artificially distinct biomarker trends among patients with similar disease progression, undermining the reliability of downstream analyses and complicating tasks such as disease subtyping and outcome prediction. To address this challenge, we provide a subtype-aware timeline registration method that leverages data projection and discrete optimization to correct timeline misalignment. Through simulation and real-world data analyses, we demonstrate that the proposed method effectively aligns distorted observed records with the true disease progression patterns, enhancing subtyping clarity and improving performance in downstream clinical analyses. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
Statistical Inference for Fuzzy Clustering
arXiv (Cornell University) · 2026-01-06
preprintOpen accessSenior authorClustering is a central tool in biomedical research for discovering heterogeneous patient subpopulations, where group boundaries are often diffuse rather than sharply separated. Traditional methods produce hard partitions, whereas soft clustering methods such as fuzzy $c$-means (FCM) allow mixed memberships and better capture uncertainty and gradual transitions. Despite the widespread use of FCM, principled statistical inference for fuzzy clustering remains limited. We develop a new framework for weighted fuzzy $c$-means (WFCM) for settings with potential cluster size imbalance. Cluster-specific weights rebalance the classical FCM criterion so that smaller clusters are not overwhelmed by dominant groups, and the weighted objective induces a normalized density model with scale parameter $σ$ and fuzziness parameter $m$. Estimation is performed via a blockwise majorize--minimize (MM) procedure that alternates closed-form membership and centroid updates with likelihood-based updates of $(σ,\bw)$. The intractable normalizing constant is approximated by importance sampling using a data-adaptive Gaussian mixture proposal. We further provide likelihood ratio tests for comparing cluster centers and bootstrap-based confidence intervals. We establish consistency and asymptotic normality of the maximum likelihood estimator, validate the method through simulations, and illustrate it using single-cell RNA-seq and Alzheimer disease Neuroimaging Initiative (ADNI) data. These applications demonstrate stable uncertainty quantification and biologically meaningful soft memberships, ranging from well-separated cell populations under imbalance to a graded AD versus non-AD continuum consistent with disease progression.
Coupling Models for One-Step Discrete Generation
arXiv (Cornell University) · 2026-05-08
preprintOpen accessGenerative modeling over discrete structures underpins applications across deep learning, from biological sequence design and code generation to large language models, yet generation often remains sequential, relying on autoregressive decoding or iterative refinement. In this work, we introduce Coupling Models(Coupling Models), a one-step discrete generative model that learns a direct coupling between discrete sequences and Gaussian latents. Unlike recent distillation methods that compress a pretrained multi-step sampler into a few steps, Coupling Model trains a purpose-built decoder to invert this coupling and generate samples in a single step. The model also avoids complex continuous flows over the simplex and hand-specified data-to-noise couplings. Empirically,Coupling Model improves the strongest one-step baselines in each domain: it reduces LM1B text-generation perplexity by 33% at its lowest-perplexity operating point, Fly Brain enhancer-design FBD by 18%, and MNIST-Binary FID by 46%. These results suggest that effective one-step discrete generation depends strongly on how data and noise are coupled before decoding. Code is available at https://github.com/pengzhangzhi/Coupling-Models.
Synthetic Augmentation in Imbalanced Learning: When It Helps, When It Hurts, and How Much to Add
Open MIND · 2026-01-22
preprintSenior authorImbalanced classification often causes standard training procedures to prioritize the majority class and perform poorly on rare but important cases. A classic and widely used remedy is to augment the minority class with synthetic samples, but two basic questions remain under-resolved: when does synthetic augmentation actually help, and how many synthetic samples should be generated? We develop a unified statistical framework for synthetic augmentation in imbalanced learning, studying models trained on imbalanced data augmented with synthetic minority samples. Our theory shows that synthetic data is not always beneficial. In a "local symmetry" regime, imbalance is not the dominant source of error, so adding synthetic samples cannot improve learning rates and can even degrade performance by amplifying generator mismatch. When augmentation can help ("local asymmetry"), the optimal synthetic size depends on generator accuracy and on whether the generator's residual mismatch is directionally aligned with the intrinsic majority-minority shift. This structure can make the best synthetic size deviate from naive full balancing. Practically, we recommend Validation-Tuned Synthetic Size (VTSS): select the synthetic size by minimizing balanced validation loss over a range centered near the fully balanced baseline, while allowing meaningful departures. Extensive simulations and real data analysis further support our findings.
Recent grants
NIH · $1.5M · 2024–2028
Dimension reduction for high-dimensional high-order data
NSF · $100k · 2018–2021
NSF · $314k · 2021–2025
NSF · $156k · 2020–2021
Integrated Detection and Classification of Sepsis via Tensor Methods Using EHR
NIH · $1.7M · 2024–2029
Frequent coauthors
- 31 shared
Rungang Han
Duke University
- 27 shared
Yuetian Luo
University of Chicago
- 27 shared
Tommaso Cai
University of Oslo
- 20 shared
Wennan Chang
University of Indianapolis
- 20 shared
Changlin Wan
Purdue University West Lafayette
- 19 shared
Chi Zhang
- 17 shared
Xiaojuan Wang
- 15 shared
Pixu Shi
Labs
Anru Zhang's LabPI
Education
- 2015
Ph.D.
University of Pennsylvania
- 2010
B.S.
Peking University
Awards & honors
- Leo Breiman Junior Award (2026)
- COPSS Emerging Leader Award (2024)
- IMS Tweedie Award (2022)
- ASA Gottfried E. Noether Junior Award (2021)
- NSF CAREER Award (2020)
- Resume-aware match score
- Save to shortlist
- AI-drafted outreach
See your match with Anru Zhang
PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.
- Free to start
- No credit card
- 30-second signup