Marten Wegkamp

Professor · Verified

Cornell University · Mathematics

Active 1995–2024

h-index: 29

Citations: 4.7k

Papers: 102 (24 in the last 5 years)

Funding: $443k

About

Marten Wegkamp is a Professor in the Department of Mathematics at Cornell University. He received his PhD in Mathematics from Leiden University in 1996. His research spans applied mathematics, probability, and statistics, with particular emphasis on developing new methodology in statistics and machine learning. His contributions cover mathematical statistics, empirical process theory, high-dimensional statistics, and statistical learning theory, and he has published extensively on latent factor regression models, topic models, and empirical processes.

Research topics

Computer Science
Statistics
Mathematical optimization
Data Mining
Machine Learning
Artificial Intelligence
Mathematics
Algorithm
Combinatorics
Discrete mathematics
Biology
Applied mathematics

Selected publications

A New Regression Lens on Multi-Class Classification
arXiv (Cornell University) · 2024-02-22
preprint · Open access · Senior author
Linear Discriminant Analysis (LDA) is a fundamental method for classification. Its simple linear structure facilitates interpretation, and it is naturally suited to multi-class settings. LDA is also closely connected to several classical multivariate techniques, including Fisher's discriminant analysis, canonical correlation analysis, and linear regression. In this paper, we strengthen the connection between LDA and multivariate response regression by establishing an explicit relationship between discriminant directions and regression coefficients. This characterization yields a new regression-based framework for multi-class classification that accommodates structured, regularized, and even non-parametric regression methods. In contrast to existing regression-based approaches, our formulation is particularly amenable to theoretical analysis: we develop a general strategy for deriving bounds on the excess misclassification risk of the proposed classifier across all such regression procedures. As concrete applications, we provide complete theoretical guarantees for two widely used methods -- $\ell_1$-regularization and reduced-rank regression -- neither of which has previously been fully analyzed in the LDA context. The theoretical results are supported by extensive simulation studies and empirical evaluations on real data.
SLIDE: Significant Latent Factor Interaction Discovery and Exploration across biological domains
Nature Methods · 2024 · 25 citations
- Computer Science
- Machine Learning
Learning large softmax mixtures with warm start EM
arXiv (Cornell University) · 2024-09-16
preprint · Open access · Senior author
Softmax mixture models (SMMs) are discrete $K$-mixtures introduced to model the probability of choosing an attribute $x_j \in \mathbb{R}^L$ from $p$ candidates in heterogeneous populations. They are known as mixed multinomial logits in the econometrics literature, and are gaining traction in the LLM literature, where single softmax models are routinely used in the final layer of a neural network. This paper provides a comprehensive analysis of the EM algorithm for SMMs in high dimensions. Its population-level theoretical analysis forms the basis for proving (i) local identifiability in SMMs with generic features and, further, via a stochastic argument, (ii) full identifiability in SMMs with random features, when $p$ is large enough. These are the first results in this direction for SMMs with $L > 1$. The population-level EM analysis characterizes the initialization radius for algorithmic convergence. This also guides the construction of warm starts for the sample-level EM. Under suitable initialization, the EM algorithm is shown to recover the mixture atoms of the SMM at a near-parametric rate. We provide two main directions for warm start construction, both based on a new method for estimating the moments of the mixing measure underlying an SMM with random design. First, we construct a method of moments (MoM) estimator of the mixture parameters, and provide its first theoretical analysis. While MoM can enjoy parametric rates of convergence, and thus can serve as a warm start, the estimator's quality degrades exponentially in $K$. Our recommendation, when $K$ is not small, is to run the EM algorithm several times with random initializations. We again make use of the novel latent moments estimation method to estimate the $K$-dimensional subspace of the mixture atoms. Sampling from this subspace substantially reduces the number of required draws.
Detecting approximate replicate components of a high-dimensional random vector with latent structure
Bernoulli · 2023-02-20 · 2 citations
article · Senior author
High-dimensional feature vectors are likely to contain sets of measurements that are approximate replicates of one another. In complex applications, or automated data collection, these feature sets are not known a priori, and need to be determined. This work proposes a class of latent factor models on the observed, high-dimensional, random vector $X \in \mathbb{R}^p$, for defining, identifying and estimating the index set of its approximately replicate components. The model class is parametrized by a $p \times K$ loading matrix $A$ that contains a hidden sub-matrix whose rows can be partitioned into groups of parallel vectors. Under this model class, a set of approximate replicate components of $X$ corresponds to a set of parallel rows in $A$: these entries of $X$ are, up to scale and additive error, the same linear combination of the $K$ latent factors; the value of $K$ is itself unknown. The problem of finding approximate replicates in $X$ reduces to identifying, and estimating, the location of the hidden sub-matrix within $A$, and of the partition $\mathcal{H}$ of its row index set $H$. Both $\mathcal{H}$ and $H$ can be fully characterized in terms of a new family of criteria based on the correlation matrix of $X$, and their identifiability, as well as that of the unknown latent dimension $K$, are obtained as consequences. The constructive nature of the identifiability arguments enables computationally efficient procedures, with consistency guarantees. Furthermore, when the loading matrix $A$ has a particular sparse structure, provided by the errors-in-variable parametrization, the difficulty of the problem is elevated. The task becomes that of separating out groups of parallel rows that are proportional to canonical basis vectors from other, possibly dense, parallel rows in $A$. This is met under a scale assumption, via a principled way of selecting the target row indices, guided by the successive maximization of Schur complements of appropriate covariance matrices. The resulting procedure is an enhanced version of that developed for recovering general parallel rows in $A$. It is also computationally efficient and consistent. It has immediate applications to latent space overlapping clustering and the estimation of loading matrices that satisfy a canonical parametrization.
Interpolating discriminant functions in high-dimensional Gaussian latent mixtures
Biometrika · 2023-06-08 · 1 citation
article · Senior author · Corresponding author
This paper considers binary classification of high-dimensional features under a postulated model with a low-dimensional latent Gaussian mixture structure and nonvanishing noise. A generalized least-squares estimator is used to estimate the direction of the optimal separating hyperplane. The estimated hyperplane is shown to interpolate on the training data. While the direction vector can be consistently estimated, as could be expected from recent results in linear regression, a naive plug-in estimate fails to consistently estimate the intercept. A simple correction, which requires an independent hold-out sample, renders the procedure minimax optimal in many scenarios. The interpolation property of the latter procedure can be retained, but surprisingly depends on the way the labels are encoded.
Optimal discriminant analysis in high-dimensional latent factor models
The Annals of Statistics · 2023-06-01
article · Senior author
In high-dimensional classification problems, a commonly used approach is to first project the high-dimensional features into a lower-dimensional space, and base the classification on the resulting lower-dimensional projections. In this paper, we formulate a latent-variable model with a hidden low-dimensional structure to justify this two-step procedure and to guide which projection to choose. We propose a computationally efficient classifier that takes certain principal components (PCs) of the observed features as projections, with the number of retained PCs selected in a data-driven way. A general theory is established for analyzing such two-step classifiers based on any projections. We derive explicit rates of convergence of the excess risk of the proposed PC-based classifier. The obtained rates are further shown to be optimal up to logarithmic factors in the minimax sense. Our theory allows the lower dimension to grow with the sample size and is also valid even when the feature dimension (greatly) exceeds the sample size. Extensive simulations corroborate our theoretical findings. The proposed method also performs favorably relative to other existing discriminant methods on three real data examples.
Interpolating Discriminant Functions in High-Dimensional Gaussian Latent Mixtures
arXiv (Cornell University) · 2022-10-25
preprint · Open access · Senior author
This paper considers binary classification of high-dimensional features under a postulated model with a low-dimensional latent Gaussian mixture structure and non-vanishing noise. A generalized least squares estimator is used to estimate the direction of the optimal separating hyperplane. The estimated hyperplane is shown to interpolate on the training data. While the direction vector can be consistently estimated, as could be expected from recent results in linear regression, a naive plug-in estimate fails to consistently estimate the intercept. A simple correction, which requires an independent hold-out sample, renders the procedure minimax optimal in many scenarios. The interpolation property of the latter procedure can be retained, but surprisingly depends on the way the labels are encoded.
SLIDE: Significant Latent Factor Interaction Discovery and Exploration across biological domains
bioRxiv (Cold Spring Harbor Laboratory) · 2022-11-27
preprint · Open access
Modern multi-omic technologies can generate deep multi-scale profiles. However, differences in data modalities, multicollinearity of the data, and large numbers of irrelevant features make the analyses and integration of high-dimensional omic datasets challenging. Here, we present Significant Latent factor Interaction Discovery and Exploration (SLIDE), a first-in-class interpretable machine learning technique for identifying significant interacting latent factors underlying outcomes of interest from high-dimensional omic datasets. SLIDE makes no assumptions regarding data-generating mechanisms, comes with theoretical guarantees regarding identifiability of the latent factors/corresponding inference, outperforms/performs at least as well as state-of-the-art approaches in terms of prediction, and provides inference beyond prediction. Using SLIDE on scRNA-seq data from systemic sclerosis (SSc) patients, we first uncovered significant interacting latent factors underlying SSc pathogenesis. In addition to accurately predicting SSc severity and outperforming existing benchmarks, SLIDE uncovered significant factors that included well-elucidated altered transcriptomic states in myeloid cells and fibroblasts, an intriguing keratinocyte-centric signature validated by protein staining, and a novel mechanism involving altered HLA signaling in myeloid cells that has support in genetic data. SLIDE also worked well on spatial transcriptomic data and was able to accurately identify significant interacting latent factors underlying immune cell partitioning by 3D location within lymph nodes. Finally, SLIDE leveraged paired scRNA-seq and TCR-seq data to elucidate latent factors underlying extents of clonal expansion of CD4 T cells in a nonobese diabetic model of T1D. The latent factors uncovered by SLIDE included well-known activation markers, inhibitory receptors and intracellular regulators of receptor signaling, but also homed in on several novel naïve and memory states that standard analyses missed. Overall, SLIDE is a versatile engine for biological discovery from modern multi-omic datasets.
Inference in latent factor regression with clusterable features
Bernoulli · 2022-03-04 · 5 citations
article · Senior author
Regression models, in which the observed features $X \in \mathbb{R}^p$ and the response $Y \in \mathbb{R}$ depend, jointly, on a lower dimensional, unobserved, latent vector $Z \in \mathbb{R}^K$, with $K \ll p$, are popular in a large array of applications, and mainly used for predicting a response from correlated features. In contrast, methodology and theory for inference on the regression coefficient $\beta \in \mathbb{R}^K$ relating $Y$ to $Z$ are scarce, since typically the unobservable factor $Z$ is hard to interpret. Furthermore, the determination of the asymptotic variance of an estimator of $\beta$ is a long-standing problem, with solutions known only in a few particular cases. To address some of these outstanding questions, we develop inferential tools for $\beta$ in a class of factor regression models in which the observed features are signed mixtures of the latent factors. The model specifications are practically desirable in a large array of applications, render the components of $Z$ interpretable, and are sufficient for parameter identifiability. Without assuming that the number of latent factors $K$ or the structure of the mixture is known in advance, we construct computationally efficient estimators of $\beta$, along with estimators of other important model parameters. We benchmark the rate of convergence of $\beta$ by first establishing its $\ell_2$-norm minimax lower bound, and show that our proposed estimator $\hat{\beta}$ is minimax-rate adaptive. Our main contribution is the provision of a unified analysis of the component-wise Gaussian asymptotic distribution of $\hat{\beta}$ and, especially, the derivation of a closed form expression of its asymptotic variance, together with consistent variance estimators. The resulting inferential tools can be used when both $K$ and $p$ are independent of the sample size $n$, and also when both, or either, $p$ and $K$ vary with $n$, while allowing for $p > n$. This complements the only asymptotic normality results obtained for a particular case of the model under consideration, in the regime $K = O(1)$ and $p \to \infty$, but without a variance estimate. As an application, we provide, within our model specifications, a statistical platform for inference in regression on latent cluster centers, thereby increasing the scope of our theoretical results. We benchmark the newly developed methodology on a recently collected data set for the study of the effectiveness of a new SIV vaccine. Our analysis enables the determination of the top latent antibody-centric mechanisms associated with the vaccine response.
Optimal Discriminant Analysis in High-Dimensional Latent Factor Models
arXiv (Cornell University) · 2022-10-23
preprint · Open access · Senior author
In high-dimensional classification problems, a commonly used approach is to first project the high-dimensional features into a lower-dimensional space, and base the classification on the resulting lower-dimensional projections. In this paper, we formulate a latent-variable model with a hidden low-dimensional structure to justify this two-step procedure and to guide which projection to choose. We propose a computationally efficient classifier that takes certain principal components (PCs) of the observed features as projections, with the number of retained PCs selected in a data-driven way. A general theory is established for analyzing such two-step classifiers based on any projections. We derive explicit rates of convergence of the excess risk of the proposed PC-based classifier. The obtained rates are further shown to be optimal up to logarithmic factors in the minimax sense. Our theory allows the lower dimension to grow with the sample size and is also valid even when the feature dimension (greatly) exceeds the sample size. Extensive simulations corroborate our theoretical findings. The proposed method also performs favorably relative to other existing discriminant methods on three real data examples.

Recent grants

Sparsity oracle inequalities via l_1 regularization in nonparametric models
NSF · $243k · 2007–2010
Estimation of High Dimensional Matrices of Low Effective Rank with Applications to Structural Copula Models
NSF · $200k · 2013–2016

Frequent coauthors

Florentina Bunea
Cornell University
88 shared
Adrian Barbu
Florida State University
36 shared
Alexandre B. Tsybakov
24 shared
Xin Bing
23 shared
Dragan Radulović
Institute for Technology of Nuclear and other Mineral Raw Materials
14 shared
Jean‐David Fermanian
9 shared
Seth Strimas-Mackey
Cornell University
8 shared
Yiyuan She
Florida State University
6 shared

Education

Ph.D., Mathematics
Leiden University
