Resume-aware faculty matching

Find professors who actually fit you

Upload your resume. Four AI agents analyze your background, rank the faculty who fit, inspect their recent research, and help you draft outreach — grounded in their actual work, not templates.


Ugur Cetintemel

Khosrowshahi University Professor of Computer Science

Brown University · Computer Science

Active 1998–2025

h-index: 39
Citations: 7.2k
Papers: 179 (15 in the last 5 years)
Funding: $4.0M

Research topics

  • Artificial Intelligence
  • Computer Science
  • Data Mining
  • Algorithm
  • Theoretical computer science
  • Natural Language Processing
  • Programming language
  • Medicine
  • Mathematics
  • Parallel computing
  • Database
  • Cardiology
  • Radiology

Selected publications

  • Continuous Prompts: LLM-Augmented Pipeline Processing over Unstructured Streams

    ArXiv.org · 2025-12-03

    preprint · Open access · Senior author

    Monitoring unstructured streams increasingly requires persistent, semantics-aware computation, yet today's LLM frameworks remain stateless and one-shot, limiting their usefulness for long-running analytics. We introduce Continuous Prompts (CPs), the first framework that brings LLM reasoning into continuous stream processing. CPs extend RAG to streaming settings, define continuous semantic operators, and provide multiple implementations, primarily focusing on LLM-based approaches but also reporting one embedding-based variant. Furthermore, we study two LLM-centric optimizations, tuple batching and operator fusion, to significantly improve efficiency while managing accuracy loss. Because these optimizations inherently trade accuracy for speed, we present a dynamic optimization framework that uses lightweight shadow executions and cost-aware multi-objective Bayesian optimization (MOBO) to learn throughput-accuracy frontiers and adapt plans under probing budgets. We implement CPs in the VectraFlow stream processing system. Using operator-level microbenchmarks and streaming pipelines on real datasets, we show that VectraFlow can adapt to workload dynamics, navigate accuracy-efficiency trade-offs, and sustain persistent semantic queries over evolving unstructured streams.

  • Large language model-based multi-source integration pipeline for automated diagnostic classification and zero-shot prognoses for brain tumor

    Meta-Radiology · 2025-04-29 · 3 citations

    article · Open access

    In this study, we use large language models (LLMs) to integrate information from multi-source medical reports to enhance the accuracy of automated diagnostic classification and prognosis for brain tumors. Brain MRI reports from a cohort of 426 brain tumor patients were manually labeled for tumor presence and stability. Pathology reports from the same cohort were incorporated as an additional information source. A pre-trained LLM was used to extract features from the multi-source reports, and a multi-layer perceptron (MLP) was trained for classification tasks. Model performance was evaluated on the test set using Micro F1 scores and AUROCs. The model’s zero-shot prognostic capability was validated on an independent cohort of 33 glioblastoma patients. The model reached Micro F1 scores of 0.849 (95% CI: 0.814, 0.880) for tumor presence classification and 0.929 (95% CI: 0.904, 0.954) for tumor stability classification. Compared to using solely radiology reports, the developed model showed improvements on Micro F1 of 10.4% for tumor presence and 5.6% for stability classification. Log-rank tests confirmed significant distinction between the high- and low-risk patient groups stratified by model-predicted “Tumor Stability” label (p-value = 0.017), confirming the prognostic value of the model-generated labels. This study developed a multi-source integration model based on LLMs for automated diagnostic classification and zero-shot prognosis of brain tumors. The integration of multi-source reports improved classification accuracy compared to single-source reports. Predicted tumor stability labels demonstrated survival prognostic capabilities. These findings confirm the potential of LLMs in brain tumor research, supporting precision diagnostics and prognosis.

    • The study integrates MRI and pathology data, highlighting the value of multi-source cancer information.
    • The study demonstrates LLMs' ability to bridge modalities in cancer diagnosis and prognosis applications.
    • The study confirms the prognostic value of automated diagnostic labels using survival correlation and log-rank tests.

  • Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems

    Proceedings of the VLDB Endowment · 2025-07-01 · 2 citations

    article · Open access · Senior author

    AI-augmented data processing systems (DPSs) integrate large language models (LLMs) into query pipelines, allowing powerful semantic operations on structured and unstructured data. However, the reliability (a.k.a. trust) of these systems is fundamentally challenged by the potential for LLMs to produce errors, limiting their adoption in critical domains. To help address this reliability bottleneck, we introduce semantic integrity constraints (SICs), a declarative abstraction for specifying and enforcing correctness conditions over LLM outputs in semantic queries. SICs generalize traditional database integrity constraints to semantic settings, supporting common types of constraints, such as grounding, soundness, and exclusion, with both reactive and proactive enforcement strategies. We argue that SICs provide a foundation for building reliable and auditable AI-augmented data systems. Specifically, we present a system design for integrating SICs into query planning and runtime execution and discuss its realization in AI-augmented DPSs. To guide and evaluate our vision, we outline several design goals (covering criteria around expressiveness, runtime semantics, integration, performance, and enterprise-scale applicability) and discuss how our framework addresses each, along with open research challenges.

  • Leveraging Large Language Models to Detect Protected Health Information: Does Context Matter?

    SSRN Electronic Journal · 2024-01-01

    preprint · Open access

  • End-to-end artificial intelligence platform for the management of large vessel occlusions: A preliminary study

    Journal of Stroke and Cerebrovascular Diseases · 2022-09-14 · 10 citations

    article

  • The Case for In-Memory OLAP on "Wimpy" Nodes

    2021-04-01 · 4 citations

    article · Senior author

    Research projects will often use the latest hardware to achieve orders-of-magnitude performance improvements while ignoring the (usually hefty) associated price tag. Real-world deployments typically follow suit, requiring expensive computing infrastructures that cost even more to power and cool. In this paper, we challenge the conventional wisdom that high-end hardware is absolutely necessary for state-of-the-art performance and instead advocate for a radically different approach based on cheap single-board computers (SBCs). While others have previously explored similar ideas for computationally simple and easily partitionable use cases (e.g., key-value stores), so-called "wimpy" nodes have traditionally been rejected as unsuitable for more complex workloads. We believe, however, that recent hardware advancements driven by the mobile computing market call this orthodoxy into question. For example, our microbenchmarks show that one popular SBC, the Raspberry Pi 3B+, offers single-core compute performance that is surprisingly competitive with many server-grade Intel Xeon and ARM-based CPUs at a fraction of the cost and energy consumption. To make our case, we conducted an extensive experimental study, beginning with a series of microbenchmarks to identify the strengths and weaknesses of SBCs relative to server-grade CPUs. Then, to evaluate the ability of SBCs to handle more complex use cases, we analyzed the performance of an in-memory OLAP workload in both single-node and distributed settings. Overall, our results demonstrate up to several orders of magnitude in cost reductions coupled with substantial energy savings when compared to traditional on-premises and cloud deployments, all without a significant increase in absolute runtimes.

  • Odlaw: A Tool for Retroactive GDPR Compliance

    2021-04-01 · 6 citations

    article · Senior author

    In this demo, we present ODLAW, a new tool for retroactive compliance with privacy laws like the European Union's General Data Protection Regulation (GDPR). The GDPR enumerates the explicit rights of individuals regarding the use of their personal data, and regulators can impose strict penalties for organizations that fail to comply. While others have advocated for a completely new class of systems to address these regulations, ODLAW takes a different approach by achieving GDPR compliance while allowing an organization to keep its existing data management infrastructure intact. Using a variety of realistic datasets, the demo will show the specific ways that ODLAW can help with GDPR compliance, as well as highlight some of the key challenges that arise in real-world settings.

  • Dynamic Query Refinement for Interactive Data Exploration

    Movebank · 2020-01-01 · 1 citation

    article · Open access

  • DeepSqueeze: Deep Semantic Compression for Tabular Data

    2020 · 28 citations

    Senior author · Corresponding
    • Computer Science
    • Data Mining

    With the rapid proliferation of large datasets, efficient data compression has become more important than ever. Columnar compression techniques (e.g., dictionary encoding, run-length encoding, delta encoding) have proved highly effective for tabular data, but they typically compress individual columns without considering potential relationships among columns, such as functional dependencies and correlations. Semantic compression techniques, on the other hand, are designed to leverage such relationships to store only a subset of the columns necessary to infer the others, but existing approaches cannot effectively identify complex relationships across more than a few columns at a time. We propose DeepSqueeze, a novel semantic compression framework that can efficiently capture these complex relationships within tabular data by using autoencoders to map tuples to a lower-dimensional representation. DeepSqueeze also supports guaranteed error bounds for lossy compression of numerical data and works in conjunction with common columnar compression formats. Our experimental evaluation uses real-world datasets to demonstrate that DeepSqueeze can achieve over a 4x size reduction compared to state-of-the-art alternatives.

  • The Case for a Learned Sorting Algorithm

    2020 · 44 citations

    • Computer Science
    • Algorithm

    Sorting is one of the most fundamental algorithms in Computer Science and a common operation in databases not just for sorting query results but also as part of joins (i.e., sort-merge-join) or indexing. In this work, we introduce a new type of distribution sort that leverages a learned model of the empirical CDF of the data. Our algorithm uses a model to efficiently get an approximation of the scaled empirical CDF for each record key and map it to the corresponding position in the output array. We then apply a deterministic sorting algorithm that works well on nearly-sorted arrays (e.g., Insertion Sort) to establish a totally sorted order. We compared this algorithm against common sorting approaches and measured its performance for up to 1 billion normally-distributed double-precision keys. The results show that our approach yields an average 3.38x performance improvement over C++ STL sort, which is an optimized Quicksort hybrid, 1.49x improvement over sequential Radix Sort, and 5.54x improvement over a C++ implementation of Timsort, which is the default sorting function for Java and Python.

Frequent coauthors

  • Yanif Ahmad

    Johns Hopkins University

    47 shared
  • Stan Zdonik

    Brown University

    46 shared
  • Tim Kraska

    Amazon (United States)

    34 shared
  • Mert Akdere

    Brown University

    28 shared
  • Eli Upfal

    28 shared
  • Carsten Binnig

    Technical University of Darmstadt

    27 shared
  • Olga Papaemmanouil

    27 shared
  • Jeong-Hyon Hwang

    University at Albany, State University of New York

    26 shared