Sudeepa Roy
Associate Professor of Computer Science · Duke University · Computer Science
Active 2006–2025
About
Sudeepa Roy is an Associate Professor of Computer Science at Duke University, having joined the department in Fall 2015. Her research broadly focuses on data and information management, with an emphasis on foundational aspects of big data analysis. Her work aims to assist users with heterogeneous backgrounds and interests in leveraging the maximum benefit from available data. She is engaged in ongoing research on explanations in databases to help users gain deep insights into data by providing rich explanations to their questions. Additionally, her research explores data and workflow provenance, probabilistic databases, and crowd-sourcing, addressing fundamental questions related to the processing and analysis of unstructured, noisy, and unreliable data while preserving its context. Prior to her appointment at Duke, she was a postdoctoral research associate at the University of Washington, where she worked with Prof. Dan Suciu and the database group. She earned her Ph.D. in Computer and Information Science from the University of Pennsylvania, advised by Prof. Susan Davidson and Prof. Sanjeev Khanna, and completed internships at IBM Research, Almaden. She holds a master's degree from the Indian Institute of Technology, Kanpur, and a bachelor's degree from Jadavpur University.
Research topics
- Computer Science
- Artificial Intelligence
- Data science
- Engineering
- Political Science
- World Wide Web
- Data Mining
- Machine Learning
- Theoretical computer science
- Mathematics
- Epistemology
- Mechanical engineering
- Econometrics
- Marketing
- Business
- Management science
- Knowledge management
- Psychology
Selected publications
Hint-QPT: Hints for Robust Query Performance Tuning
Proceedings of the VLDB Endowment · 2025-08-01 · 1 citation
article
Query optimizers rely heavily on selectivity estimates to choose efficient execution plans, but inaccuracies in these estimates often result in poor query performance. We introduce Hint-QPT (Hints for Robust Query Performance Tuning), an interactive tool designed to help users diagnose and improve query performance. Hint-QPT proactively recommends robust plans that are resilient to uncertainty in selectivity estimates, identifies sensitive subqueries for which selectivity estimation errors greatly affect plan quality, and provides intuitive interfaces for targeted selectivity adjustments. Users can either choose the recommended robust plans for execution, or acquire additional statistics on the identified sensitive subqueries to tune query performance. Moreover, Hint-QPT visualizes the alternative execution plans and their costs under uncertainty, helping users to better understand their robustness.
A Double Machine Learning Approach for Combining Experimental and Observational Studies
Observational Studies · 2025-01-01
article · Open access
Experimental and observational studies often lack validity due to untestable assumptions. We propose a double machine learning approach to combine experimental and observational studies, allowing practitioners to test for assumption violations and estimate treatment effects consistently. Our framework proposes a falsification test for external validity and ignorability under milder assumptions. We provide consistent treatment effect estimators even when one of the assumptions is violated. However, our no-free-lunch theorem highlights the necessity of accurately identifying the violated assumption for consistent treatment effect estimation. Through comparative analyses, we show our framework's superiority over existing data fusion methods. The practical utility of our approach is further exemplified by three real-world case studies, underscoring its potential for widespread application in empirical research.
Differentially private explanations for aggregate query answers
The VLDB Journal · 2025-01-27 · 2 citations
article · Open access · Senior author
Differential privacy (DP) is the state-of-the-art and rigorous notion of privacy for answering aggregate database queries while preserving the privacy of sensitive information in the data. In today’s era of data analysis, however, it poses new challenges for users to understand the trends and anomalies observed in the query results: Is the unexpected answer due to the data itself, or is it due to the extra noise that must be added to preserve DP? In the second case, even the observation made by the users on query results may be wrong. In the first case, can we still mine interesting explanations from the sensitive data while protecting its privacy? To address these challenges, we present a three-phase framework DPXPlain, which is the first system to the best of our knowledge for explaining group-by aggregate query answers with DP. In its three phases, DPXPlain (a) answers a group-by aggregate query with DP, (b) allows users to compare aggregate values of two groups and with high probability assesses whether this comparison holds or is flipped by the DP noise, and (c) eventually provides an explanation table containing the approximately ‘top-k’ explanation predicates along with their relative influences and ranks in the form of confidence intervals, while guaranteeing DP in all steps. We perform an extensive experimental analysis of DPXPlain with multiple use-cases on real and synthetic data showing that DPXPlain efficiently provides insightful explanations with good accuracy and utility.
Journal of Statistical Software · 2025-01-01 · 1 citation
article · Open access
dame-flame is a Python package for performing matching for observational causal inference on datasets containing discrete covariates. This package implements the dynamic almost matching exactly (DAME) and fast, large-scale almost matching exactly (FLAME) algorithms, which match treatment and control units on subsets of the covariates. The resulting matched groups are interpretable, because the matches are made directly on covariates, and high-quality, because machine learning is used to determine which covariates are important to match on instead of human inputs. The package provides several adjustable parameters to adapt the algorithms to specific applications, and can calculate treatment effects after matching. The most recent source code of the implementation is available at https://github.com/almost-matching-exactly/DAME-FLAME-Python-Package.
Proceedings of the AAAI Conference on Artificial Intelligence · 2024-03-24 · 1 citation
article · Open access
After a person is arrested and charged with a crime, they may be released on bail and required to participate in a community supervision program while awaiting trial. These 'pre-trial programs' are common throughout the United States, but very little research has demonstrated their effectiveness. Researchers have emphasized the need for more rigorous program evaluation methods, which we introduce in this article. We describe a program evaluation pipeline that uses recent interpretable machine learning techniques for observational causal inference, and demonstrate these techniques in a study of a pre-trial program in Durham, North Carolina. Our findings show no evidence that the program either significantly increased or decreased the probability of new criminal charges. If these findings replicate, the criminal-legal system needs to either improve pre-trial programs or consider alternatives to them. The simplest option is to release low-risk individuals back into the community without subjecting them to any restrictions or conditions. Another option is to assign individuals to pre-trial programs that incentivize pro-social behavior. We believe that the techniques introduced here can provide researchers the rigorous tools they need to evaluate these programs.
Graph Machine Learning based Doubly Robust Estimator for Network Causal Effects
arXiv (Cornell University) · 2024-03-17
preprint · Open access
We address the challenge of inferring causal effects in social network data. This results in challenges due to interference -- where a unit's outcome is affected by neighbors' treatments -- and network-induced confounding factors. While there is extensive literature focusing on estimating causal effects in social network setups, a majority of them make prior assumptions about the form of network-induced confounding mechanisms. Such strong assumptions are rarely likely to hold especially in high-dimensional networks. We propose a novel methodology that combines graph machine learning approaches with the double machine learning framework to enable accurate and efficient estimation of direct and peer effects using a single observational social network. We demonstrate the semiparametric efficiency of our proposed estimator under mild regularity conditions, allowing for consistent uncertainty quantification. We demonstrate that our method is accurate, robust, and scalable via an extensive simulation study. We use our method to investigate the impact of Self-Help Group participation on financial risk tolerance.
Qr-Hint: Actionable Hints Towards Correcting Wrong SQL Queries
arXiv (Cornell University) · 2024-04-05
preprint · Open access
We describe a system called Qr-Hint that, given a (correct) target query Q* and a (wrong) working query Q, both expressed in SQL, provides actionable hints for the user to fix the working query so that it becomes semantically equivalent to the target. It is particularly useful in an educational setting, where novices can receive help from Qr-Hint without requiring extensive personal tutoring. Since there are many different ways to write a correct query, we do not want to base our hints completely on how Q* is written; instead, starting with the user's own working query, Qr-Hint purposefully guides the user through a sequence of steps that provably lead to a correct query, which will be equivalent to Q* but may still "look" quite different from it. Ideally, we would like Qr-Hint's hints to lead to the "smallest" possible corrections to Q. However, optimality is not always achievable in this case due to some foundational hurdles such as the undecidability of SQL query equivalence and the complexity of logic minimization. Nonetheless, by carefully decomposing and formulating the problems and developing principled solutions, we are able to provide provably correct and locally optimal hints through Qr-Hint. We show the effectiveness of Qr-Hint through quality and performance experiments as well as a user study in an educational setting.
Evaluating Datalog over Semirings: A Grounding-based Approach
arXiv (Cornell University) · 2024-03-19 · 1 citation
preprint · Open access
Datalog is a powerful yet elegant language that allows expressing recursive computation. Although Datalog evaluation has been extensively studied in the literature, so far, only loose upper bounds are known on how fast a Datalog program can be evaluated. In this work, we ask the following question: given a Datalog program over a naturally-ordered semiring $σ$, what is the tightest possible runtime? To this end, our main contribution is a general two-phase framework for analyzing the data complexity of Datalog over $σ$: first ground the program into an equivalent system of polynomial equations (i.e. grounding) and then find the least fixpoint of the grounding over $σ$. We present algorithms that use structure-aware query evaluation techniques to obtain the smallest possible groundings. Next, efficient algorithms for fixpoint evaluation are introduced over two classes of semirings: (1) finite-rank semirings and (2) absorptive semirings of total order. Combining both phases, we obtain state-of-the-art and new algorithmic results. Finally, we complement our results with a matching fine-grained lower bound.
DP-PQD: Privately Detecting Per-Query Gaps in Synthetic Data Generated by Black-Box Mechanisms
Proceedings of the VLDB Endowment · 2023-09-01 · 2 citations
article · Senior author
Synthetic data generation methods, and in particular, private synthetic data generation methods, are gaining popularity as a means to make copies of sensitive databases that can be shared widely for research and data analysis. Some of the fundamental operations in data analysis include analyzing aggregated statistics, e.g., count, sum, or median, on a subset of data satisfying some conditions. When synthetic data is generated, users may be interested in knowing if their aggregated queries generating such statistics can be reliably answered on the synthetic data, for instance, to decide if the synthetic data is suitable for specific tasks. However, the standard data generation systems do not provide "per-query" quality guarantees on the synthetic data, and the users have no way of knowing how much the aggregated statistics on the synthetic data can be trusted. To address this problem, we present a novel framework named DP-PQD (differentially-private per-query decider) to detect if the query answers on the private and synthetic datasets are within a user-specified threshold of each other while guaranteeing differential privacy. We give a suite of private algorithms for per-query deciders for count, sum, and median queries, analyze their properties, and evaluate them experimentally.
Scientific Knowledge Graph Creation and Analysis
2023-04-07 · 1 citation
article · Senior author
Reviewing the internet for research often tends to be a time-intensive process. With a rapid increase in data available on the internet, one’s research can easily be swayed as there is a high probability of less important papers being displayed as search results. A significant percentage of research papers published lack credibility, thus leading to an overall decline in the standard of research. With the objective of making the methodology of research straightforward and providing academia with essential and accurate information, we propose a system with the application of Knowledge Graphs. We utilized the Aminer (v13) dataset for the creation of the Knowledge Graph. We have introduced a novel scoring approach, “Node Importance Score” (NIS), for research papers using graph centrality measures, i.e., PageRank, EigenVector Centrality, and Betweenness Centrality, which is combined with the H-Index of the publication of the paper. This score is used to find the top-k influential and important papers in a given domain. Validation of the obtained NIS shows that our approach overcomes certain flaws present in the state-of-the-art scoring system.
Recent grants
NSF · $408k · 2017–2022
CAREER: FIREFLY - Rich Explanations for Database Queries
NSF · $566k · 2016–2023
III: Student Travel Fellowships for SIGMOD 2017
NSF · $20k · 2017–2017
Frequent coauthors
- Susan B. Davidson · University of Pennsylvania · 21 shared
- Cynthia Rudin · 20 shared
- Alexander Volfovsky · Duke University · 20 shared
- Sanjeev Khanna · 17 shared
- Amir Gilad · Hebrew University of Jerusalem · 15 shared
- Tova Milo · 14 shared
- Dan Suciu · University of Washington · 13 shared
- Zhengjie Miao · Yangzhou University · 12 shared
Awards & honors
- Google PhD Fellowship in Structured Data (2011)