
Stefano Soatto
· ProfessorVerifiedUniversity of California, Los Angeles · Computer Science
Active 1993–2026
About
Stefano Soatto is a professor in the Department of Computer Science and Electrical and Computer Engineering at UCLA Samueli School of Engineering. His research interests include computer vision, machine learning, and robotics. He holds a PhD from the California Institute of Technology, earned in 1996. Dr. Soatto has been recognized as an ACM Fellow in 2023 and an IEEE Fellow in 2013. His contributions to the field have been acknowledged through awards such as the David Marr Prize in 1999. He is actively involved in advancing knowledge in artificial intelligence and visual perception, contributing to the understanding of how humans and machines interpret visual data.
Research topics
- Computer Science
- Artificial Intelligence
- Machine Learning
- Mathematics
- Algorithm
- Computer vision
Selected publications
Learning When to Attend: Conditional Memory Access for Long-Context LLMs
arXiv (Cornell University) · 2026-03-18
preprintOpen accessSenior authorLanguage models struggle to generalize beyond pretraining context lengths, limiting long-horizon reasoning and retrieval. Continued pretraining on long-context data can help but is expensive due to the quadratic scaling of Attention. We observe that most tokens do not require (Global) Attention over the entire sequence and can rely on local context. Based on this, we propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke global attention. We evaluate L2A on Qwen 2.5 and Qwen 3 models, extending their effective context length from 32K to 128K tokens. L2A matches the performance of standard long-context training to within 3% while skipping Global Attention for $\sim$80% of tokens, outperforming prior baselines. We also design custom Triton kernels to efficiently implement this token-wise conditional Attention on GPUs, achieving up to $\sim$2x improvements in training throughput and time-to-first-token over FlashAttention. Moreover, L2A enables post-training pruning of highly sparse Global Attention layers, reducing KV cache memory by up to 50% with negligible performance loss.
Evolutionary Generation of Multi-Agent Systems
ArXiv.org · 2026-02-06
articleOpen accessSenior authorLarge language model (LLM)-based multi-agent systems (MAS) show strong promise for complex reasoning, planning, and tool-augmented tasks, but designing effective MAS architectures remains labor-intensive, brittle, and hard to generalize. Existing automatic MAS generation methods either rely on code generation, which often leads to executability and robustness failures, or impose rigid architectural templates that limit expressiveness and adaptability. We propose Evolutionary Generation of Multi-Agent Systems (EvoMAS), which formulates MAS generation as structured configuration generation. EvoMAS performs evolutionary generation in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedback-conditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. We evaluate EvoMAS on diverse benchmarks, including BBEH, SWE-Bench, and WorkBench, covering reasoning, software engineering, and tool-use tasks. EvoMAS consistently improves task performance over both human-designed MAS and prior automatic MAS generation methods, while producing generated systems with higher executability and runtime robustness. EvoMAS outperforms the agent evolution method EvoAgent by +10.5 points on BBEH reasoning and +7.1 points on WorkBench. With Claude-4.5-Sonnet, EvoMAS also reaches 79.1% on SWE-Bench-Verified, matching the top of the leaderboard. Code is available at https://github.com/amazon-science/EvoMAS
Learning When to Attend: Conditional Memory Access for Long-Context LLMs
ArXiv.org · 2026-03-18
articleOpen accessSenior authorLanguage models struggle to generalize beyond pretraining context lengths, limiting long-horizon reasoning and retrieval. Continued pretraining on long-context data can help but is expensive due to the quadratic scaling of Attention. We observe that most tokens do not require (Global) Attention over the entire sequence and can rely on local context. Based on this, we propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke global attention. We evaluate L2A on Qwen 2.5 and Qwen 3 models, extending their effective context length from 32K to 128K tokens. L2A matches the performance of standard long-context training to within 3% while skipping Global Attention for $\sim$80% of tokens, outperforming prior baselines. We also design custom Triton kernels to efficiently implement this token-wise conditional Attention on GPUs, achieving up to $\sim$2x improvements in training throughput and time-to-first-token over FlashAttention. Moreover, L2A enables post-training pruning of highly sparse Global Attention layers, reducing KV cache memory by up to 50% with negligible performance loss.
PICASO: Permutation-Invariant Context Composition with State Space Models
ArXiv.org · 2025-02-24
preprintOpen accessSenior authorProviding Large Language Models with relevant contextual knowledge at inference time has been shown to greatly improve the quality of their generations. This is often achieved by prepending informative passages of text, or 'contexts', retrieved from external knowledge bases to their input. However, processing additional contexts online incurs significant computation costs that scale with their length. State Space Models (SSMs) offer a promising solution by allowing a database of contexts to be mapped onto fixed-dimensional states from which to start the generation. A key challenge arises when attempting to leverage information present across multiple contexts, since there is no straightforward way to condition generation on multiple independent states in existing SSMs. To address this, we leverage a simple mathematical relation derived from SSM dynamics to compose multiple states into one that efficiently approximates the effect of concatenating raw context tokens. Since the temporal ordering of contexts can often be uninformative, we enforce permutation-invariance by efficiently averaging states obtained via our composition algorithm across all possible context orderings. We evaluate our resulting method on WikiText and MSMARCO in both zero-shot and fine-tuned settings, and show that we can match the strongest performing baseline while enjoying on average 5.4x speedup.
MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation
arXiv (Cornell University) · 2025-11-11
preprintOpen accessSenior authorReinforcement Learning with Verifiable Rewards (RLVR) has become a standard recipe for post-training LLMs on reasoning tasks, with Group Relative Policy Optimization (GRPO) emerging as a leading approach. However, GRPO and its variants are inherently single-turn: they optimize from terminal rewards on isolated prompt-response pairs, leaving them poorly suited to agentic settings where models must iteratively refine solutions in response to environmental feedback. We introduce MURPHY, a multi-turn extension of GRPO for self-correcting code generation. MURPHY constructs feedback-conditioned rollout trees in which failed candidate solutions are paired with executor feedback and expanded into subsequent turns, and propagates rewards backward through the tree so that later successful refinements credit earlier attempts that surfaced informative feedback. We study two propagation strategies, Max Reward (MARS) and Mean Reward (MERS), and introduce post-rollout pruning mechanisms that reduce multi-turn optimization cost. Across three code generation benchmarks (HumanEval, MBPP, LiveCodeBench-v6) and two model families (Qwen3-1.7B/4B, OLMo-2-7B), MURPHY delivers up to 6% absolute pass@1 gains over the strongest prior multi-turn execution-feedback methods. Gains are largest on the Medium/Hard subset (+4.38/+4.20 at Iter-5), where iterative self-correction matters more.
AI Agents as Universal Task Solvers
arXiv (Cornell University) · 2025-10-14
preprintOpen accessSenior authorWe describe AI agents as stochastic dynamical systems and frame the problem of learning to reason as in transductive inference: Rather than approximating the distribution of past data as in classical induction, the objective is to capture its algorithmic structure so as to reduce the time needed to solve new tasks. In this view, information from past experience serves not only to reduce a model's uncertainty - as in Shannon's classical theory - but to reduce the computational effort required to find solutions to unforeseen tasks. Working in the verifiable setting, where a checker or reward function is available, we establish three main results. First, we show that the optimal speed-up on a new task is tightly related to the algorithmic information it shares with the training data, yielding a theoretical justification for the power-law scaling empirically observed in reasoning models. Second, while the compression view of learning, rooted in Occam's Razor, favors simplicity, we show that transductive inference yields its greatest benefits precisely when the data-generating mechanism is most complex. Third, we identify a possible failure mode of naive scaling: in the limit of unbounded model size and compute, models with access to a reward signal can behave as savants - brute-forcing solutions without acquiring transferable reasoning strategies. Accordingly, we argue that a critical quantity to optimize when scaling reasoning models is time, whose role in learning has remained largely unexplored.
Descriminative-Generative Custom Tokens for Vision-Language Models
ArXiv.org · 2025-02-17
preprintOpen accessSenior authorThis paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries. The targeted concept is specified in terms of a small set of images and a parent concept described using text. We operate on CLIP text features and propose to use a combination of a textual inversion loss and a classification loss to ensure that text features of the learned token are aligned with image features of the concept in the CLIP embedding space. We restrict the learned token to a low-dimensional subspace spanned by tokens for attributes that are appropriate for the given super-class. These modifications improve the quality of compositions of the learned token with natural language for generating new scenes. Further, we show that learned custom tokens can be used to form queries for text-to-image retrieval task, and also have the important benefit that composite queries can be visualized to ensure that the desired concept is faithfully encoded. Based on this, we introduce the method of Generation Aided Image Retrieval, where the query is modified at inference time to better suit the search intent. On the DeepFashion2 dataset, our method improves Mean Reciprocal Retrieval (MRR) over relevant baselines by 7%.
Robust Planning for Autonomous Driving via Mixed Adversarial Diffusion Predictions
2025-05-19
articleSenior authorWe describe a robust planning method for autonomous driving that mixes normal and adversarial agent predictions output by a diffusion model trained for motion prediction. We first train a diffusion model to learn an unbiased distribution of normal agent behaviors. We then generate a distribution of adversarial predictions by biasing the diffusion model at test time to generate predictions that are likely to collide with a candidate plan. We score plans using expected cost with respect to a mixture distribution of normal and adversarial predictions, leading to a planner that is robust against adversarial behaviors but not overly conservative when agents behave normally. Unlike current approaches, we do not use risk measures that over-weight adversarial behaviors while placing little to no weight on low-cost normal behaviors or use hard safety constraints that may not be appropriate for all driving scenarios. We show the effectiveness of our method on single-agent and multi-agent jaywalking scenarios as well as a red light violation scenario.
Scaling up Image Segmentation across Data and Tasks
2025-06-10 · 1 citations
articleSenior authorTraditional segmentation models, while effective in isolated tasks, often fail to generalize to more complex and open-ended segmentation problems, such as free-form, open-vocabulary, and in-the-wild scenarios. To bridge this gap, we propose to scale up image segmentation across diverse datasets and tasks such that the knowledge across different tasks and datasets can be integrated while improving the generalization ability. Mixed-Query Transformer (MQ-Former), a novel segmentation framework, is introduced and designed to scale seamlessly across both data size and task diversity. It is built upon a dynamic object query mechanism called mixed query, which fuses different types of queries using cross-attention. This hybrid approach enables the model to balance between instance- and stuff-level segmentation, providing enhanced scalability for handling diverse object types. We further enhance scalability by leveraging synthetic data-generating segmentation masks and captions for pixel-level and open-vocabulary tasks-drastically reducing the need for costly human annotations. By training on multiple datasets and tasks at scale, MQ-Former continuously improves performance as the volume and diversity of data and tasks increase. It exhibits strong generalization capabilities, boosting performance in open-set segmentation tasks SeginW by 7 points. These advancements mark a key step toward universal, scalable segmentation models capable of addressing the demands of real-world applications.
LATTS: Locally Adaptive Test-Time Scaling
ArXiv.org · 2025-09-16
preprintOpen accessSenior authorOne common strategy for improving the performance of Large Language Models (LLMs) on downstream tasks involves using a \emph{verifier model} to either select the best answer from a pool of candidates or to steer the auto-regressive generation process towards better outputs. This class of methods typically results in improved accuracy at the cost of increased computation at test-time, a paradigm known as \emph{test-time scaling}. However, most existing approaches increase computation uniformly across all samples and generation steps, without considering the complexity of individual instances, leading to inefficient resource use. We address this limitation by proposing an approach, called \emph{Locally Adaptive Test-Time Scaling (LATTS)}, that allocates variable compute across generation steps. Specifically, at each generation step, LATTS employs a verifier-based acceptance criterion to decide whether to resample, backtrack, restart, or stop the generation process. This criterion effectively adjusts the per-step computational effort based on a precise notion of \emph{local difficulty} derived from the verifier model. Empirical results show that LATTS achieves significantly superior accuracy--compute tradeoffs compared to standard verifier-based methods.
Recent grants
Frequent coauthors
- 102 shared
S. Shankar Sastry
- 100 shared
Alessandro Achille
- 99 shared
Yi Ma
Shaoyang University
- 98 shared
Jana Košecká
- 61 shared
Avinash Ravichandran
- 49 shared
Alex Wong
- 45 shared
Pietro Perona
- 38 shared
Anthony Yezzi
Georgia Institute of Technology
Education
- 1995
Ph.D., Computer Science
University of California, Los Angeles
- 1991
M.S., Computer Science
University of California, Los Angeles
- 1987
B.S., Computer Science
University of Rome 'La Sapienza'
Awards & honors
- ACM Fellow, 2023
- IEEE Fellow, 2013
- David Marr Prize, 1999
- Resume-aware match score
- Save to shortlist
- AI-drafted outreach
See your match with Stefano Soatto
PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.
- Free to start
- No credit card
- 30-second signup