Andrew Blumberg

Professor of Mathematics and of Computer Science; Herbert and Florence Irving Professor of Cancer Data Research

Columbia University · Joint Programs

Active 1996–2025

h-index: 6
Citations: 161
Papers3318 last 5y
Funding: $656k

About

Andrew J. Blumberg is the Herbert and Florence Irving Professor of Cancer Data Research at Columbia University, holding positions in the Herbert and Florence Irving Institute of Cancer Dynamics and the Herbert Irving Comprehensive Cancer Center. He is also a Professor of Mathematics and Computer Science at Columbia University. His work focuses on cancer data research, integrating mathematical and computational approaches to advance understanding in this field.

Research topics

  • Computer Science
  • Artificial Intelligence
  • Algorithm
  • Mathematics
  • Statistics
  • Pure mathematics
  • Machine Learning
  • Biology
  • Physics
  • Combinatorics

Selected publications

  • Multi-parameter Module Approximation: an efficient and interpretable invariant for multi-parameter persistence modules with guarantees

    Journal of Applied and Computational Topology · 2025-10-25

    article · Open access · Senior author
  • A statistical framework for analyzing shape in a time series of random geometric objects

    The Annals of Statistics · 2025-04-01 · 1 citation

    article · Senior author

    We introduce a new framework to analyze shape descriptors that capture the geometric features of an ensemble of point clouds. At the core of our approach is the point of view that the data arises as sampled recordings from a metric space-valued stochastic process, possibly of nonstationary nature, thereby integrating geometric data analysis into the realm of functional time series analysis. Our framework allows for natural incorporation of spatial-temporal dynamics, heterogeneous sampling, and the study of convergence rates. Further, we derive complete invariants for classes of metric space-valued stochastic processes in the spirit of Gromov, and relate these invariants to so-called ball volume processes. Under mild dependence conditions, a weak invariance principle in D([0,1]×[0,R]) is established for sequential empirical versions of the latter, assuming the probabilistic structure possibly changes over time. Finally, we use this result to introduce novel test statistics for topological change, which are distribution-free in the limit under the hypothesis of stationarity. We explore these test statistics on time series of single-cell mRNA expression data, using shape descriptors coming from topological data analysis.

  • Measures determined by their values on balls and Gromov-Wasserstein convergence

    arXiv (Cornell University) · 2024-01-20

    preprint · Open access · Senior author

    A classical question about a metric space is whether Borel measures on the space are determined by their values on balls. We show that for any given measure this property is stable under Gromov-Wasserstein convergence of metric measure spaces. We then use this result to show that suitable bounded subspaces of the space of persistence diagrams have the property that any Borel measure is determined by its values on balls. This justifies the use of empirical ball volumes for statistical testing in topological data analysis (TDA). Our intended application is to deploy the statistical foundations of van Delft and Blumberg (2023) for time series of random geometric objects in the context of TDA invariants, specifically persistent homology and zigzag persistence.

  • Automated Cell Type Annotation with Reference Cluster Mapping

    bioRxiv (Cold Spring Harbor Laboratory) · 2024-12-05

    preprint · Open access · Senior author · Corresponding

    Single-cell RNA sequencing has transformed the field of cellular biology by providing unprecedented insights into cellular heterogeneity. However, characterizing scRNA-seq datasets remains a significant challenge. We introduce RefCM, a novel computational method that combines optimal transport and integer programming to enhance the annotation of scRNA clusters using established reference datasets. Our method produces highly accurate cross-technology, cross-tissue, and cross-species mappings while remaining tractable at atlas scale, outperforming existing methods across all these tasks. By providing precise annotations, RefCM can enable the discovery of new cell types, states, and relationships in single-cell transcriptomic data.

  • Statistical estimation of sparsity and efficiency for molecular codes

    bioRxiv (Cold Spring Harbor Laboratory) · 2024-08-15 · 2 citations

    preprint · Open access · Senior author

    A fundamental biological question is to understand how cell types and functions are determined by genomic and proteomic coding. A basic form of this question is to ask if small families of genes or proteins code for cell types. For example, it has been shown that the collection of homeodomain proteins can uniquely delineate all 118 neuron classes in the nematode C. elegans. However, unique characterization is neither robust nor rare. Our goal in this paper is to develop a rigorous methodology to characterize molecular codes. We show that in fact for information-theoretic reasons almost any sufficiently large collection of genes is able to disambiguate cell types, and that this property is not robust to noise. To quantify the discriminative properties of a molecular codebook in a more refined way, we develop new statistics - partition cardinality and partition entropy - borrowing ideas from coding theory. We prove these are robust to data perturbations, and then apply these in the C. elegans example and in cancer. In the worm, we show that the homeodomain transcription factor family is distinguished by coding for cell types sparsely and efficiently compared to a control of randomly selected family of genes. Furthermore, the resolution of cell type identities defined using molecular features increases as the worm embryo develops. In cancer, we perform a pan-cancer study where we use our statistics to quantify interpatient tumor heterogeneity and we identify the chromosome containing the HLA family as sparsely and efficiently coding for melanoma.

  • Resampling and averaging coordinates on data

    arXiv (Cornell University) · 2024-08-02

    preprint · Open access · 1st author · Corresponding

    We introduce algorithms for robustly computing intrinsic coordinates on point clouds. Our approach relies on generating many candidate coordinates by subsampling the data and varying hyperparameters of the embedding algorithm (e.g., manifold learning). We then identify a subset of representative embeddings by clustering the collection of candidate coordinates and using shape descriptors from topological data analysis. The final output is the embedding obtained as an average of the representative embeddings using generalized Procrustes analysis. We validate our algorithm on both synthetic data and experimental measurements from genomics, demonstrating robustness to noise and outliers.

  • Cellstitch: 3D cellular anisotropic image segmentation via optimal transport

    BMC Bioinformatics · 2023-12-15 · 13 citations

    article · Open access · Senior author

    BACKGROUND: Spatial mapping of transcriptional states provides valuable biological insights into cellular functions and interactions in the context of the tissue. Accurate 3D cell segmentation is a critical step in the analysis of this data towards understanding diseases and normal development in situ. Current approaches designed to automate 3D segmentation include stitching masks along one dimension, training a 3D neural network architecture from scratch, and reconstructing a 3D volume from 2D segmentations on all dimensions. However, the applicability of existing methods is hampered by inaccurate segmentations along the non-stitching dimensions, the lack of high-quality diverse 3D training data, and inhomogeneity of image resolution along orthogonal directions due to acquisition constraints; as a result, they have not been widely used in practice. METHODS: To address these challenges, we formulate the problem of finding cell correspondence across layers with a novel optimal transport (OT) approach. We propose CellStitch, a flexible pipeline that segments cells from 3D images without requiring large amounts of 3D training data. We further extend our method to interpolate internal slices from highly anisotropic cell images to recover isotropic cell morphology. RESULTS: We evaluated the performance of CellStitch through eight 3D plant microscopic datasets with diverse anisotropic levels and cell shapes. CellStitch substantially outperforms the state-of-the-art methods on anisotropic images, and achieves comparable segmentation quality against competing methods in isotropic setting. We benchmarked and reported 3D segmentation results of all the methods with instance-level precision, recall and average precision (AP) metrics. CONCLUSIONS: The proposed OT-based 3D segmentation pipeline outperformed the existing state-of-the-art methods on different datasets with nonzero anisotropy, providing high fidelity recovery of 3D cell morphology from microscopic images.

  • A Framework for Fast and Stable Representations of Multiparameter Persistent Homology Decompositions

    arXiv (Cornell University) · 2023-06-19 · 4 citations

    preprint · Open access · Senior author

    Topological data analysis (TDA) is an area of data science that focuses on using invariants from algebraic topology to provide multiscale shape descriptors for geometric data sets such as point clouds. One of the most important such descriptors is persistent homology, which encodes the change in shape as a filtration parameter changes; a typical parameter is the feature scale. For many data sets, it is useful to simultaneously vary multiple filtration parameters, for example feature scale and density. While the theoretical properties of single parameter persistent homology are well understood, less is known about the multiparameter case. In particular, a central question is the problem of representing multiparameter persistent homology by elements of a vector space for integration with standard machine learning algorithms. Existing approaches to this problem either ignore most of the multiparameter information to reduce to the one-parameter case or are heuristic and potentially unstable in the face of noise. In this article, we introduce a new general representation framework that leverages recent results on decompositions of multiparameter persistent homology. This framework is rich in information, fast to compute, and encompasses previous approaches. Moreover, we establish theoretical stability guarantees under this framework as well as efficient algorithms for practical computation, making this framework an applicable and versatile tool for analyzing geometric and point cloud data. We validate our stability results and algorithms with numerical experiments that demonstrate statistical convergence, prediction accuracy, and fast running times on several real data sets.

  • CellStitch: 3D Cellular Anisotropic Image Segmentation via Optimal Transport

    bioRxiv (Cold Spring Harbor Laboratory) · 2023-06-22 · 2 citations

    preprint · Open access · Senior author · Corresponding

    BACKGROUND: Spatial mapping of transcriptional states provides valuable biological insights into cellular functions and interactions in the context of the tissue. Accurate 3D cell segmentation is a critical step in the analysis of this data towards understanding diseases and normal development in situ. Current approaches designed to automate 3D segmentation include stitching masks along one dimension, training a 3D neural network architecture from scratch, and reconstructing a 3D volume from 2D segmentations on all dimensions. However, the applicability of existing methods is hampered by inaccurate segmentations along the non-stitching dimensions, the lack of high-quality diverse 3D training data, and inhomogeneity among different dimensions; as a result, they have not been widely used in practice. METHODS: To address these challenges, we formulate the problem of finding cell correspondence across layers with a novel optimal transport (OT) approach. We propose CellStitch, a flexible pipeline that segments cells from 3D images without requiring large amounts of 3D training data. We further extend our method to interpolate internal slices from highly anisotropic cell images to recover isotropic cell morphology. RESULTS: We evaluated the performance of CellStitch through eight 3D plant microscopic datasets with diverse anisotropic levels and cell shapes. CellStitch substantially outperforms the state-of-the-art methods on anisotropic images, and achieves comparable segmentation quality against competing methods in isotropic setting. We benchmarked and reported 3D segmentation results of all the methods with instance-level precision, recall and average precision (AP) metrics. CONCLUSION: The proposed OT-based 3D segmentation pipeline outperformed the existing state-of-the-art methods on different datasets with nonzero anisotropy, providing high fidelity recovery of 3D cell morphology from microscopic images.

  • A statistical framework for analyzing shape in a time series of random geometric objects

    arXiv (Cornell University) · 2023-04-04 · 1 citation

    preprint · Open access · Senior author

    We introduce a new framework to analyze shape descriptors that capture the geometric features of an ensemble of point clouds. At the core of our approach is the point of view that the data arises as sampled recordings from a metric space-valued stochastic process, possibly of nonstationary nature, thereby integrating geometric data analysis into the realm of functional time series analysis. Our framework allows for natural incorporation of spatial-temporal dynamics, heterogeneous sampling, and the study of convergence rates. Further, we derive complete invariants for classes of metric space-valued stochastic processes in the spirit of Gromov, and relate these invariants to so-called ball volume processes. Under mild dependence conditions, a weak invariance principle in $D([0,1]\times [0,\mathscr{R}])$ is established for sequential empirical versions of the latter, assuming the probabilistic structure possibly changes over time. Finally, we use this result to introduce novel test statistics for topological change, which are distribution-free in the limit under the hypothesis of stationarity. We explore these test statistics on time series of single-cell mRNA expression data, using shape descriptors coming from topological data analysis.
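
Several of the abstracts above refer to empirical ball volumes as the basis for statistical testing on point clouds. As a rough illustration only, and not the authors' test statistic, the sketch below computes the fraction of sample points that fall inside a closed ball of a given radius around a chosen center. The function name `empirical_ball_volume`, the Euclidean metric, and the toy point cloud are all assumptions made for this example.

```python
import math


def empirical_ball_volume(points, center, radius):
    """Return the fraction of sample points within `radius` of `center`.

    Hypothetical helper for illustration: it uses the Euclidean metric and
    a closed ball, which the source abstracts do not specify.
    """
    if not points:
        raise ValueError("points must be non-empty")
    inside = sum(1 for p in points if math.dist(p, center) <= radius)
    return inside / len(points)


# Toy point cloud (assumed data): corners of the unit square plus its midpoint.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]

# Ball volumes around the midpoint at increasing radii.
radii = (0.1, 0.5, 1.0)
profile = [empirical_ball_volume(cloud, (0.5, 0.5), r) for r in radii]
print(profile)
```

Evaluating the same function over a grid of centers and radii gives a simple shape summary of a sample; the papers listed above study the statistical behavior of such quantities in far more general metric spaces.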

Frequent coauthors

  • Mathieu Carrière

    Inria Saclay - Île de France

    11 shared
  • Jun Hou Fung

    Columbia University

    10 shared
  • Raúl Rabadán

    Columbia University

    9 shared
  • Michael A. Mandell

    5 shared
  • Soledad Villar

    4 shared
  • Yining Liu

    3 shared
  • Elham Azizi

    Columbia University

    3 shared
  • Sakellarios Zairis

    3 shared

Education

  • Ph.D.

    Columbia University
