
Yang Ni
Assistant Professor · Texas A&M University · Statistics
Active 2008–2026
Research topics
- Artificial Intelligence
- Computer Science
- Data Mining
- Machine Learning
- Biology
- Bioinformatics
- Accounting
- Finance
- Computational biology
- Economics
- Monetary economics
- Business
- Financial system
- Data science
- Theoretical computer science
- Algorithm
- Genetics
Selected publications
Zenodo (CERN European Organization for Nuclear Research) · 2026-04-21
dataset · Open access · Senior author
Preprocessed data from the OneK1K cohort (Yazar et al., Science 2022) used in the MR-CCC analysis of causal cell-cell communication. B_T_NK_Monocytes.rda contains pseudo-bulk gene expression count matrices (genes × donors) for five cell types: B cells, CD4+ T cells, CD8+ T cells, NK cells, and monocytes. Each matrix was aggregated from single-cell counts per donor and library-size normalized. donor.rda contains donor-level metadata and genotype information, including: donor covariates (age, sex, ancestry principal components), a SNP genotype matrix (dosage-encoded), and GRanges objects for SNP and gene genomic coordinates used for cis-eQTL instrument construction. These files are intended for use with the MR-CCC analysis scripts available at: https://github.com/bitansa/MR-CCC
Original OneK1K data: https://onek1k.org
Statistical Methods in Medical Research · 2026-01-23
article
Increasing epidemiologic evidence suggests that the diversity and composition of the gut microbiome can predict infection risk in cancer patients. Infections remain a major cause of morbidity and mortality during chemotherapy. Analyzing microbiome data to identify associations with infection pathogenesis for proactive treatment has become a critical research focus. However, the high-dimensional nature of the data necessitates the use of dimension-reduction methods to facilitate inference and interpretation. Traditional dimension reduction methods, which assume Gaussianity, perform poorly with skewed and zero-inflated microbiome data. To address these challenges, we propose a semiparametric principal component analysis method based on a truncated latent Gaussian copula model that accommodates both skewness and zero inflation. Simulation studies demonstrate that the proposed method outperforms existing approaches by providing more accurate estimates of scores and loadings across various copula transformation settings. We apply our method, along with competing approaches, to gut microbiome data from pediatric patients with acute lymphoblastic leukemia. The principal scores derived from the proposed method reveal the strongest associations between pre-chemotherapy microbiome composition and adverse events during subsequent chemotherapy, offering valuable insights for improving patient outcomes.
Bayesian latent ising model for joint microbial and metabolomic network inference
Journal of Applied Statistics · 2026-03-06
article · 1st author
OneK1K B-cell dataset used for MR.RGM real-data analysis
Open MIND · 2026-02-05
dataset
This record contains the real-data resources used in the OneK1K-based analyses of the MR.RGM and MR.RGM+ methods. The upload includes one R data archive: donor_b_cell.rda. This file contains derived and curated data objects for B-cell samples from the OneK1K project, including:
- donor-level covariates (age, sex, genotype PCs),
- genotype dosage matrices aligned to donor IDs,
- gene-level RNA count matrices aggregated at the donor level, and
- genomic annotations for variants and genes.
The data have been preprocessed to enable direct use in the real-data analysis scripts provided in the associated code repository. The files in this record are provided to ensure full reproducibility of the OneK1K real-data analyses reported in the associated manuscript. Original OneK1K data were generated by the OneK1K Consortium. This record redistributes derived and reorganized data products for methodological reproducibility only.
GTEx v7 muscle skeletal tissue data used for MR.RGM real-data analysis
Open MIND · 2026-02-04
dataset · Senior author
This record contains the real-data resources used in the GTEx-based analyses of the MR.RGM and MR.RGM+ methods. The upload includes two archives:
1) GTEx.zip. This archive contains preprocessed genotype and gene expression matrices derived from the GTEx v7 project for muscle skeletal tissue. These files were prepared for direct use in the real-data analysis scripts provided in the associated code repository.
2) GTEx_Analysis_v7_eQTL.tar.gz. This archive contains publicly available GTEx v7 eQTL summary files for muscle skeletal tissue, including significant variant–gene pairs and eGenes, downloaded from the GTEx Portal.
The files in this record are provided to enable full reproducibility of the real-data analyses reported in the associated manuscript. Users can download and extract the archives and run the provided R scripts without additional preprocessing. Original GTEx data were generated by the GTEx Consortium. This record redistributes derived and reorganized data products for methodological reproducibility only.
GTEx v7 muscle skeletal tissue data used for MR.RGM real-data analysis
Zenodo (CERN European Organization for Nuclear Research) · 2026-02-04
dataset · Open access · Senior author
This record contains the real-data resources used in the GTEx-based analyses of the MR.RGM and MR.RGM+ methods. The upload includes two archives:
1) GTEx.zip. This archive contains preprocessed genotype and gene expression matrices derived from the GTEx v7 project for muscle skeletal tissue. These files were prepared for direct use in the real-data analysis scripts provided in the associated code repository.
2) GTEx_Analysis_v7_eQTL.tar.gz. This archive contains publicly available GTEx v7 eQTL summary files for muscle skeletal tissue, including significant variant–gene pairs and eGenes, downloaded from the GTEx Portal.
The files in this record are provided to enable full reproducibility of the real-data analyses reported in the associated manuscript. Users can download and extract the archives and run the provided R scripts without additional preprocessing. Original GTEx data were generated by the GTEx Consortium. This record redistributes derived and reorganized data products for methodological reproducibility only.
Open MIND · 2026-04-21
dataset · Open access · Senior author
Preprocessed data from the OneK1K cohort (Yazar et al., Science 2022) used in the MR-CCC analysis of causal cell-cell communication. B_T_NK_Monocytes.rda contains pseudo-bulk gene expression count matrices (genes × donors) for five cell types: B cells, CD4+ T cells, CD8+ T cells, NK cells, and monocytes. Each matrix was aggregated from single-cell counts per donor and library-size normalized. donor.rda contains donor-level metadata and genotype information, including: donor covariates (age, sex, ancestry principal components), a SNP genotype matrix (dosage-encoded), and GRanges objects for SNP and gene genomic coordinates used for cis-eQTL instrument construction. These files are intended for use with the MR-CCC analysis scripts available at: https://github.com/bitansa/MR-CCC
Original OneK1K data: https://onek1k.org
Molecular Neurodegeneration · 2026-01-29 · 1 citation
article · Open access · 1st author
Parkinson’s disease (PD) is the second most prevalent neurodegenerative disorder worldwide. The pathogenesis of PD is driven by multifactorial mechanisms involving a complex interplay among environmental exposures, genetic susceptibility, and aging-related processes. Among genetic contributors, heterozygous pathogenic variants in the GBA1 gene represent the most significant heritable risk factor for PD. The disease mechanisms of GBA1 defects in PD remain incompletely understood. It has been proposed that a partial loss-of-function of the lysosomal enzyme glucocerebrosidase, or potential toxic gain-of-function effects (e.g., endoplasmic reticulum stress), might contribute to the disease. These processes initiate a cascade of pathophysiological events, including dysregulated sphingolipid metabolism, compromised lysosomal-autophagic function, mitochondrial dysfunction, and accelerated α-synuclein aggregation. Subsequent dopaminergic neurodegeneration and sustained neuroinflammatory cascades ultimately drive PD progression. Nevertheless, the precise molecular mechanisms linking GBA1 mutations to PD pathogenesis remain incompletely elucidated, and clinically validated early diagnostic biomarkers for GBA1-associated PD (GBA1-PD) are still lacking. This review summarizes the distinct clinical phenotypes and mechanistic underpinnings of GBA1-PD, with particular emphasis on omics-derived stratification biomarkers (identified through genomics, transcriptomics, proteomics, and lipidomics approaches) coupled with neuroimaging signatures. This review advances our understanding of GBA1-mediated PD pathogenesis while providing a framework for developing precision diagnostic strategies and targeted therapeutic interventions addressing PD heterogeneity.
OneK1K B-cell dataset used for MR.RGM real-data analysis
Zenodo (CERN European Organization for Nuclear Research) · 2026-02-05
dataset · Open access
This record contains the real-data resources used in the OneK1K-based analyses of the MR.RGM and MR.RGM+ methods. The upload includes one R data archive: donor_b_cell.rda. This file contains derived and curated data objects for B-cell samples from the OneK1K project, including:
- donor-level covariates (age, sex, genotype PCs),
- genotype dosage matrices aligned to donor IDs,
- gene-level RNA count matrices aggregated at the donor level, and
- genomic annotations for variants and genes.
The data have been preprocessed to enable direct use in the real-data analysis scripts provided in the associated code repository. The files in this record are provided to ensure full reproducibility of the OneK1K real-data analyses reported in the associated manuscript. Original OneK1K data were generated by the OneK1K Consortium. This record redistributes derived and reorganized data products for methodological reproducibility only.
PACKETCLIP: multi-modal embedding of network traffic and language for cybersecurity reasoning
Frontiers in Artificial Intelligence · 2025-07-28 · 6 citations
article · Open access
Traffic classification is vital for cybersecurity, yet encrypted traffic poses significant challenges. We introduce PACKETCLIP, a multi-modal framework combining packet data with natural language semantics through contrastive pre-training and hierarchical Graph Neural Network (GNN) reasoning. PACKETCLIP integrates semantic reasoning with efficient classification, enabling robust detection of anomalies in encrypted network flows. By aligning textual descriptions with packet behaviors, PACKETCLIP offers enhanced interpretability, scalability, and practical applicability across diverse security scenarios. With a 95% mean AUC, an 11.6% improvement over baselines, and a 92% reduction in intrusion detection training parameters, it is ideally suited for real-time anomaly detection. By bridging advanced machine-learning techniques and practical cybersecurity needs, PACKETCLIP provides a foundation for scalable, efficient, and interpretable solutions to tackle encrypted traffic classification and network intrusion detection challenges in resource-constrained environments.
Recent grants
Frequent coauthors
- Yuan Ji · University of Chicago · 32 shared
- Francesco C. Stingo · University of Florence · 20 shared
- Peter Müller · GeoSphere Austria · 20 shared
- Veerabhadran Baladandayuthapani · University of Michigan–Ann Arbor · 19 shared
- Ruijie Gong · Anhui University · 17 shared
- Suping Wang · Shanxi Medical University · 14 shared
- Yong Cai · Shanghai Jiao Tong University · 12 shared
- Jeffrey Pittman · Virginia Tech · 10 shared
Labs
Statistics Department of Texas A&M University · PI
Education
- 2015 · PhD, Statistics · Rice University