
Joan Bruna
Professor of Computer Science and Data Science · Verified · New York University · Computer Science
Active 2008–2025
About
Joan Bruna is a professor who advises students through the Computer Science department at the Courant Institute (CILVR Group), the Data Science department at the Center for Data Science (MaD group), and the Mathematics department at the Courant Institute. He provides guidance to prospective PhD students and encourages applications to the respective programs that best fit individual profiles. While he cannot address all requests from prospective MSc or undergraduate students seeking internships, he invites those with compelling stories and concrete links to his research to reach out. Currently, he is not taking any summer interns and advises prospective students to consult the program websites for further information.
Research topics
- Artificial Intelligence
- Computer Science
- Mathematics
- Machine Learning
- Theoretical computer science
- Statistical physics
- Statistics
- Combinatorics
- Physics
- Algorithm
- Astrophysics
- Quantum mechanics
Selected publications
Compositional Reasoning with Transformers, RNNs, and Chain of Thought
ArXiv.org · 2025-03-03
preprint · Open access · Senior author
It is well understood that different neural network architectures are suited to different tasks, but is there always a single best architecture for a given task? We compare the expressive power of transformers, RNNs, and transformers with chain of thought tokens on a simple and natural class of tasks we term Compositional Reasoning Questions (CRQ). This family captures multi-step problems with tree-like compositional structure, such as evaluating Boolean formulas. We prove that under standard hardness assumptions, none of these three architectures is capable of solving CRQs unless some hyperparameter (depth, embedding dimension, and number of chain of thought tokens, respectively) grows with the size of the input. We then provide constructions for solving CRQs with each architecture. For transformers, our construction uses depth that is logarithmic in the problem size. For RNNs, logarithmic embedding dimension is necessary and sufficient, so long as the inputs are provided in a certain order. For transformers with chain of thought, our construction uses $n$ CoT tokens for input size $n$. These results show that, while CRQs are inherently hard, there are several different ways for language models to overcome this hardness. Even for a single class of problems, each architecture has strengths and weaknesses, and none is strictly better than the others.
Propagation of Chaos in One-hidden-layer Neural Networks beyond Logarithmic Time
ArXiv.org · 2025-04-17
preprint · Open access · Senior author
We study the approximation gap between the dynamics of a polynomial-width neural network and its infinite-width counterpart, both trained using projected gradient descent in the mean-field scaling regime. We demonstrate how to tightly bound this approximation gap through a differential equation governed by the mean-field dynamics. A key factor influencing the growth of this ODE is the local Hessian of each particle, defined as the derivative of the particle's velocity in the mean-field dynamics with respect to its position. We apply our results to the canonical feature learning problem of estimating a well-specified single-index model; we permit the information exponent to be arbitrarily large, leading to convergence times that grow polynomially in the ambient dimension $d$. We show that, due to a certain "self-concordance" property in these problems -- where the local Hessian of a particle is bounded by a constant times the particle's velocity -- polynomially many neurons are sufficient to closely approximate the mean-field dynamics throughout training.
Communications on Pure and Applied Mathematics · 2025-07-15
article
We study gradient flow on the multi‐index regression problem for high‐dimensional Gaussian data. Multi‐index functions consist of a composition of an unknown low‐rank linear projection and an arbitrary unknown, low‐dimensional link function. As such, they constitute a natural template for feature learning in neural networks. We consider a two‐timescale algorithm, whereby the low‐dimensional link function is learnt with a non‐parametric model infinitely faster than the subspace parametrizing the low‐rank projection. By appropriately exploiting the matrix semigroup structure arising over the subspace correlation matrices, we establish global convergence of the resulting Grassmannian gradient flow dynamics, and provide a quantitative description of its associated “saddle‐to‐saddle” dynamics. Notably, the timescales associated with each saddle can be explicitly characterized in terms of an appropriate Hermite decomposition of the target link function.
Generative Modeling from Black-box Corruptions via Self-Consistent Stochastic Interpolants
ArXiv.org · 2025-12-11
preprint · Open access · Senior author
Transport-based methods have emerged as a leading paradigm for building generative models from large, clean datasets. However, in many scientific and engineering domains, clean data are often unavailable: instead, we only observe measurements corrupted through a noisy, ill-conditioned channel. A generative model for the original data thus requires solving an inverse problem at the level of distributions. In this work, we introduce a novel approach to this task based on Stochastic Interpolants: we iteratively update a transport map between corrupted and clean data samples using only access to the corrupted dataset as well as black box access to the corruption channel. Under appropriate conditions, this iterative procedure converges towards a self-consistent transport map that effectively inverts the corruption channel, thus enabling a generative model for the clean data. We refer to the resulting method as the self-consistent stochastic interpolant (SCSI). It (i) is computationally efficient compared to variational alternatives, (ii) is highly flexible, handling arbitrary nonlinear forward models with only black-box access, and (iii) enjoys theoretical guarantees. We demonstrate superior performance on inverse problems in natural image processing and scientific reconstruction, and establish convergence guarantees of the scheme under appropriate assumptions. Our source code is publicly available at https://github.com/modichirag/SCSI
Survey on Algorithms for Multi-Index Models
Statistical Science · 2025-08-01
article · 1st author · Corresponding
We review the literature on algorithms for estimating the index space in a multi-index model. The primary focus is on computationally efficient (polynomial-time) algorithms in Gaussian space, the assumptions under which consistency is guaranteed by these methods, and their sample complexities. In many cases, a gap is observed between the sample complexity of the best known computationally efficient methods and the information-theoretical minimum. We also review algorithms based on estimating the span of gradients using nonparametric methods, and algorithms based on fitting neural networks using gradient descent.
Data-driven multiscale modeling for correcting dynamical systems
Machine Learning Science and Technology · 2025-10-31
article · Open access · Senior author
We propose a multiscale approach for predicting quantities in dynamical systems which is explicitly structured to extract information in both fine-to-coarse and coarse-to-fine directions. Our approach improves model accuracy and stability with minimally increased computation compared to non-multiscale approaches with analogous network architecture. We evaluate our approach on an idealized fluid subgrid parameterization (known as closure) task in which our multiscale networks correct chaotic underlying models to reflect the contributions of unresolved, fine-scale dynamics.
Axial Neural Networks for Dimension-Free Foundation Models
ArXiv.org · 2025-10-15
preprint · Open access
The advent of foundation models in AI has significantly advanced general-purpose learning, enabling remarkable capabilities in zero-shot inference and in-context learning. However, training such models on physics data, including solutions to partial differential equations (PDEs), poses a unique challenge due to varying dimensionalities across different systems. Traditional approaches either fix a maximum dimension or employ separate encoders for different dimensionalities, resulting in inefficiencies. To address this, we propose a dimension-agnostic neural network architecture, the Axial Neural Network (XNN), inspired by parameter-sharing structures such as Deep Sets and Graph Neural Networks. XNN generalizes across varying tensor dimensions while maintaining computational efficiency. We convert existing PDE foundation models into axial neural networks and evaluate their performance across three training scenarios: training from scratch, pretraining on multiple PDEs, and fine-tuning on a single PDE. Our experiments show that XNNs perform competitively with original models and exhibit superior generalization to unseen dimensions, highlighting the importance of multidimensional pretraining for foundation models.
The Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models
ArXiv.org · 2025-06-05
preprint · Open access · Senior author
In this work we consider generic Gaussian Multi-index models, in which the labels only depend on the (Gaussian) $d$-dimensional inputs through their projection onto a low-dimensional $r = O_d(1)$ subspace, and we study efficient agnostic estimation procedures for this hidden subspace. We introduce the generative leap exponent $k^\star$, a natural extension of the generative exponent from [Damian et al.'24] to the multi-index setting. We first show that a sample complexity of $n = \Theta(d^{1 \vee k^\star/2})$ is necessary in the class of algorithms captured by the Low-Degree-Polynomial framework. We then establish that this sample complexity is also sufficient, by giving an agnostic sequential estimation procedure (that is, requiring no prior knowledge of the multi-index model) based on a spectral U-statistic over appropriate Hermite tensors. We further compute the generative leap exponent for several examples including piecewise linear functions (deep ReLU networks with bias), and general deep neural networks (with $r$-dimensional first hidden layer).
Emergence of Linear Truth Encodings in Language Models
ArXiv.org · 2025-10-17
preprint · Open access
Recent probing studies reveal that large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise. We study one simple setting in which truth encoding can emerge: a data distribution where factual statements co-occur with other factual statements (and vice-versa), encouraging the model to learn this distinction in order to lower the LM loss on future tokens. We corroborate this pattern with experiments in pretrained language models. Finally, in the toy setting we observe a two-phase learning dynamic: networks first memorize individual factual associations in a few steps, then -- over a longer horizon -- learn to linearly separate true from false, which in turn lowers language-modeling loss. Together, these results provide both a mechanistic demonstration and an empirical motivation for how and why linear truth representations can emerge in language models.
Survey on Algorithms for multi-index models
ArXiv.org · 2025-04-07
preprint · Open access · 1st author · Corresponding
We review the literature on algorithms for estimating the index space in a multi-index model. The primary focus is on computationally efficient (polynomial-time) algorithms in Gaussian space, the assumptions under which consistency is guaranteed by these methods, and their sample complexity. In many cases, a gap is observed between the sample complexity of the best known computationally efficient methods and the information-theoretical minimum. We also review algorithms based on estimating the span of gradients using nonparametric methods, and algorithms based on fitting neural networks using gradient descent.
Frequent coauthors
- Yann LeCun (New York University) · 37 shared
- Stéphane Mallat · 34 shared
- Anastasiia Gorbunova (Institut des Géosciences de l'Environnement) · 33 shared
- Julien Le Sommer (Université Grenoble Alpes) · 27 shared
- Samy Jelassi · 25 shared
- Julie Deshayes (Sorbonne Université) · 25 shared
- Denis Zorin (New York University) · 19 shared
- Rob Fergus · 16 shared