
Lei Cao
Assistant Professor · University of Arizona · Computer Science
Active 1932–2026
About
Lei Cao is an Assistant Professor in the Computer Science department at the University of Arizona. He holds a research affiliation at MIT CSAIL, where he spent several years as a Postdoctoral Associate and then as a Research Scientist. During his time at MIT, he actively collaborated with prominent researchers including Prof. Samuel Madden, Prof. Michael Stonebraker, Prof. Tim Kraska, and Dr. Michael Cafarella. Prior to his academic career, Lei Cao worked as a Research Staff Member at IBM T.J. Watson Research Center. His research spans broad areas of data systems and data science, covering topics from low-level core database performance optimization to the design of high-level, application-specific machine learning techniques. His recent work focuses on the emerging area of "Systems for AI and AI for Systems," aiming to build data management and analytics tools that satisfy the SAUL properties: Scalable, Automatic, and Human-in-the-loop. Lei Cao's group actively welcomes research interns to contribute to projects developing next-generation AI-powered data systems.
Research topics
- Medicine
- Internal medicine
- Environmental health
- Demography
- Geography
- Gerontology
- Emergency medicine
- Surgery
- Medical emergency
- Physical therapy
Selected publications
A Bitter Lesson for Retail Demand Forecasting: Evidence from Fine-Tuning Foundation Models
SSRN Electronic Journal · 2026-01-01
preprint · Open access · Senior author
BRIEF: Bi-Level Coreset Selection for Efficient Instruction Tuning in LLMs
Proceedings of the VLDB Endowment · 2026-02-01
article · Senior author
Instruction tuning is a key step in adapting large language models (LLMs) to understand and follow human instructions. It enables LLMs to transform general knowledge into task-specific responses that align with user intent. Although many high-quality instruction tuning datasets have been released, using them efficiently during supervised fine-tuning (SFT) matters, because training on the full high-quality corpus can be computationally expensive. To address this inefficiency, we explore whether a compact, high-quality subset of instruction data can match the performance of full-dataset SFT, thereby reducing training cost without sacrificing effectiveness. To this end, this work proposes to select such a subset (a.k.a. coreset) of instruction examples that maintains comparable downstream performance while improving training efficiency. The key idea builds on a decomposition we identify: in instruction tuning, the training loss can be decomposed into two components that quantify the contribution of an instruction to the two fundamental capabilities of LLMs, namely knowledge-related capability and instruction-following capability. We then revisit the objective of the classical coreset approaches to balance the two capabilities when selecting instruction examples. Based on a bi-level formulation and a composite gradient distance that makes the objective submodular, we design an effective algorithm that achieves a bounded approximation error. Experiments on 4 datasets across 9 downstream tasks demonstrate that BRIEF reduces computational costs by 3× while improving accuracy by 5% on Llama-3.1-8B, Qwen3-4B and Mistral-Nemo-12B.
Buildings · 2026-05-11
article · Open access · Senior author · Corresponding
The conservation and restoration of architectural heritage face dual challenges from natural erosion and human interference, necessitating the adoption of efficient and non-contact digital technologies to achieve sustainable preservation. Virtual reality (VR) technology, with its advantages of immersion, interactivity, and visualization, provides a novel technological pathway for digital documentation, conservation decision-making, and public presentation of architectural heritage. Taking the Fuliang Red Pagoda in Jingdezhen, Jiangxi Province, as the research object, this study constructs a high-precision digital reconstruction and VR interactive application workflow based on the integration of terrestrial laser scanning and close-range photogrammetry. Through point cloud denoising, Iterative Closest Point (ICP) registration, and Poisson surface reconstruction algorithms, a refined three-dimensional model of the pagoda is achieved, and an immersive VR system is developed with functions including component information query, virtual restoration scheme switching, and interactive exploration. The results demonstrate that this technical workflow not only enables non-contact digital archiving of the Fuliang Red Pagoda but also provides a visual decision-support tool for conservation interventions. Under full-scene operation, the system achieves an average rendering frame rate of 92 FPS and maintains motion-to-photon latency below 20 ms, ensuring good real-time performance and interaction stability. The findings indicate that VR-based digital technologies can enhance the scientific rigor of conservation planning and promote public engagement while adhering to the principles of authenticity and minimum intervention. This study provides a replicable technical pathway and practical reference for high-precision digital reconstruction and sustainable conservation of historic buildings.
KEN: An Execution Engine for Unstructured Database Systems
Proceedings of the VLDB Endowment · 2026-01-01
article
Unstructured database management systems (UDBMSes) leverage machine learning to apply the relational model to modalities beyond tables, such as documents, images and videos. Queries in a UDBMS consist of logical operators for which the UDBMS chooses physical implementations (e.g., different models) with the goal of optimizing both query latency and accuracy. However, many operators only expose a coarse-grained set of implementations, forcing the UDBMS to excessively sacrifice either accuracy or latency without middle-ground options. For example, an entity matching operator can be implemented through either small, specialized models or large, general-purpose models (e.g., Large Language Models); the former struggle on challenging inputs, while the latter are more accurate but incur orders of magnitude more computation. In this work, we aim to address this issue with model cascades, which seek to process "easy" inputs with small models and only resort to large models when necessary. However, cascades incur higher memory usage and additional data transfer between GPU memory and arithmetic units, which often slows queries compared to single models. To address this issue, we introduce Ken, a dedicated UDBMS execution engine that dynamically adapts its use of cascades to the query load, and optimizes the GPU placement and invocation scheduling of the cascade models. Compared to baselines, Ken achieves 1.7×–3.3× latency reductions when combining similar models for a single operator, and 122× latency reductions when combining models with orders of magnitude size differences in a multi-operator query.
Not All Documents Are What You Need for Extracting Instruction Tuning Data
ArXiv.org · 2025-05-18
preprint · Open access · Senior author
Instruction tuning improves the performance of large language models (LLMs), but it heavily relies on high-quality training data. Recently, LLMs have been used to synthesize instruction data using seed question-answer (QA) pairs. However, these synthesized instructions often lack diversity and tend to be similar to the input seeds, limiting their applicability in real-world scenarios. To address this, we propose extracting instruction tuning data from web corpora that contain rich and diverse knowledge. A naive solution is to retrieve domain-specific documents and extract all QA pairs from them, but this faces two key challenges: (1) extracting all QA pairs using LLMs is prohibitively expensive, and (2) many extracted QA pairs may be irrelevant to the downstream tasks, potentially degrading model performance. To tackle these issues, we introduce EQUAL, an effective and scalable data extraction framework that iteratively alternates between document selection and high-quality QA pair extraction to enhance instruction tuning. EQUAL first clusters the document corpus based on embeddings derived from contrastive learning, then uses a multi-armed bandit strategy to efficiently identify clusters that are likely to contain valuable QA pairs. This iterative approach significantly reduces computational cost while boosting model performance. Experiments on AutoMathText and StackOverflow across four downstream tasks show that EQUAL reduces computational costs by 5-10x and improves accuracy by 2.5 percent on LLaMA-3.1-8B and Mistral-7B.
bioRxiv (Cold Spring Harbor Laboratory) · 2025-05-11
preprint · Open access
Standardizing cell type annotations across single-cell RNA-seq datasets remains a major challenge due to inconsistencies in nomenclature, variation in annotation granularity, and the presence of rare or previously unseen populations. We present UniCell, a hierarchical annotation framework that combines Cell Ontology structure with transcriptomic data for scalable, interpretable, and ontology-aware cell identity inference. UniCell leverages a multi-task architecture that jointly optimizes local and global classifiers, yielding coherent predictions across multiple levels of the ontology-defined hierarchy. When benchmarked across 20 human and mouse datasets, UniCell consistently outperformed state-of-the-art tools, including CellTypist, scANVI, OnClass, and SingleR, in annotation performance and sensitivity to low-abundance populations. In disease settings, UniCell effectively identified previously unseen cell types through confidence-guided novelty detection. Applied to 45 human and 23 mouse tissue atlases, UniCell enabled cross-dataset and cross-species harmonization by embedding cells into a unified latent space aligned with Cell Ontology structure. Moreover, when used to supervise single-cell foundation models, UniCell substantially improved downstream annotation accuracy, rare cell detection, and hierarchical consistency. Together, these results establish UniCell as a generalizable framework that supports high-resolution annotation, nomenclature standardization, and atlas-level integration, providing a scalable and biologically grounded solution for single-cell transcriptomic analysis across diverse biological systems.
The power of peers: how common ownership networks shape corporate digitalization
Chinese Management Studies · 2025-09-13 · 1 citation
article
Purpose: This study aims to investigate peer influence mechanisms in corporate digital transformation within common ownership networks, and extends social network theory in strategic management by examining how these interconnected ownership structures shape firms’ transformation strategies. Design/methodology/approach: Using panel data from Chinese A-share listed companies (2014–2023), this study uses social network analysis to construct common ownership networks and applies econometric models to test for peer effects. The research further examines network centrality as a moderator and the influence of industry leaders’ demonstration effects on follower firms. Findings: The results confirm robust peer effects on digital transformation decisions within common ownership networks. Network centrality enhances these effects, rendering centrally located firms more susceptible to peer influence. Industry leaders accelerate transformation among follower firms through demonstration effects. Information diffusion via network ties, competitive pressure and organizational learning are identified as key underlying mechanisms. The study also documents significant heterogeneity in these effects across ownership structures, geographical concentrations and industrial characteristics, and finds that innovation capability mediates the relationship between digital transformation and corporate productivity. Originality/value: This research contributes to network governance literature by empirically demonstrating the influence of common ownership networks on corporate digital transformation. It offers a framework identifying key peer effect mechanisms (organizational learning, information diffusion and competitive pressure) and clarifies the moderating role of network centrality. These findings deepen theoretical understanding and provide practical insights for the strategic management of digital transformation.
2025-09-01 · 1 citation
article
Multivariate time series forecasting is a critical focus across many fields. Existing transformer-based models have overlooked the explicit modeling of inter-variable correlations. Similarly, graph-based methods have failed to address the dynamic nature of multivariate correlations and the noise in correlation modeling. To overcome these challenges, we propose a novel Dynamic Graph Learning Guided Multi-Scale Transformer (DGraFormer) for multivariate time series forecasting. Specifically, our method consists of two main components: dynamic correlation-aware graph learning (DCGL) and a multi-scale temporal transformer (MTT). The former captures dynamic correlations across different time windows, filters out noise, and selects key weights to guide the aggregation of relevant feature representations. The latter extracts temporal patterns from patch data at varying scales. As a result, the proposed method can capture rich local correlation graph structures and multi-scale global temporal features. Experimental results demonstrate that DGraFormer significantly outperforms existing state-of-the-art models on ten real-world datasets, achieving the best performance across multiple evaluation metrics. The source code of our model is available at https://anonymous.4open.science/r/DGraFormer.
Two Birds with One Stone: Efficient Deep Learning over Mislabeled Data through Subset Selection
Proceedings of the ACM on Management of Data · 2025-06-17
article
Using a large training dataset to train a big and powerful model, a typical practice in modern deep learning, often suffers from two major problems: the expensive and slow training process and the error-prone labels. The existing approaches, which target either speeding up training by selecting a subset of representative training instances (subset selection) or eliminating the negative effect of mislabels during training (mislabel detection), do not perform well in this scenario because each overlooks one of these two problems. To fill this gap, we propose Deem, a novel data-efficient framework that selects a subset of representative training instances under label uncertainty. The key idea is to leverage the metadata produced during deep learning training, e.g., training losses and gradients, to estimate the label uncertainty and select the representative instances. In particular, we model the problem of subset selection under uncertainty as a problem of finding a subset that closely approximates the gradient of the whole training data set derived on soft labels. We show that it is an NP-hard problem with a submodular property and propose a low-complexity algorithm that solves it with an approximation ratio. Training on this small subset thus improves the training efficiency while guaranteeing the model's accuracy. Moreover, we propose an efficient strategy to dynamically refine this subset during the iterative training process. Extensive experiments on 6 datasets and 10 baselines demonstrate that Deem accelerates the training process up to 10X without sacrificing the model accuracy.
Multiple cosmic strings in Chern–Simons–Higgs theory with gravity
Nonlinear Analysis · 2025-07-08
article · 1st author · Corresponding
Frequent coauthors
- Samuel Madden · 34 shared
- Elke A. Rundensteiner · 31 shared
- Wen‐Jun Tu · Capital Medical University · 20 shared
- Nan Tang · Hong Kong University of Science and Technology · 15 shared
- Yizhou Yan · Tsinghua University · 14 shared
- Longde Wang · National Health and Family Planning Commission · 12 shared
- Wei Zhang · KK Women's and Children's Hospital · 10 shared
- Chongjing Lv · Shandong Marine Resource and Environment Research Institute · 9 shared
Labs
Research in data systems and data science, focusing on systems for AI and AI for systems.