Susan Davidson · Professor · Verified

University of Pennsylvania · Computer and Information Science

Active 1943–2025

h-index: 47
Citations: 9.0k
Papers: 252 · 25 last 5y
Funding: $4.2M

Research topics

  • Artificial Intelligence
  • Machine Learning
  • Computer Science
  • Business

Selected publications

  • SHARQ: Explainability Framework for Association Rules on Relational Data

    Proceedings of the ACM on Management of Data · 2025-02-10 · 2 citations

    article

    Association rules are an important technique for gaining insights over large relational datasets consisting of tuples of elements (i.e. attribute-value pairs). However, it is difficult to explain the relative importance of data elements with respect to the rules in which they appear. This paper develops a measure of an element's contribution to a set of association rules based on Shapley values, denoted SHARQ (ShApley Rules Quantification). As is the case with many Shapley-based computations, the cost of a naive calculation of the score is exponential in the number of elements. To that end, we present an efficient framework for computing the exact SHARQ value of a single element whose running time is practically linear in the number of rules. Going one step further, we develop an efficient multi-element SHARQ algorithm which amortizes the cost of the single element SHARQ calculation over a set of elements. Based on the definition of SHARQ for elements we describe two additional use-cases for association rules explainability: rule importance and attribute importance. Extensive experiments over a novel benchmark dataset containing 67 instances of mined rule sets show the effectiveness of our approach.

  • Holistic Query Approximation via RL Modeling

    Proceedings of the VLDB Endowment · 2025-02-01

    article · 1st author · Corresponding

    In data exploration, executing queries over a large database can be time-consuming. Previous work has proposed approximate query processing as a way to speed up aggregate queries in this context, but does not address non-aggregate queries. Our paper introduces a novel holistic approach to handle both types of queries by finding an optimized subset of data, referred to as an approximation set. The goal is to maximize query result quality while using a smaller set of data, thereby significantly reducing the query execution time. We formalize this problem as Holistic Approximate Query Processing and establish its NP-completeness. To tackle this, we propose an approximate solution using Reinforcement Learning, termed HARLM. While HARLM does not provide theoretical guarantees due to its reliance on Reinforcement Learning, it effectively overcomes challenges related to the large action space and the need for generalization beyond a known query workload. Experimental results on both non-aggregate and aggregate benchmarks show that HARLM significantly outperforms the baselines both in terms of accuracy (30% improvement) and efficiency (10–35X).

  • ASQP-RL Demo: Learning Approximation Sets for Exploratory Queries

    2024-05-23

    article · Open access · 1st author · Corresponding

    We demonstrate the Approximate Selection Query Processing (ASQP-RL) system, which uses Reinforcement Learning to select a subset of a large external dataset to process locally in a notebook during data exploration. Given a query workload over an external database and notebook memory size, the system translates the workload to select-project-join (non-aggregate) queries and finds a subset of each relation such that the data subset - called the approximation set - fits into the notebook memory and maximizes query result quality. The data subset can then be loaded into the notebook, and rapidly queried by the analyst. Our demonstration shows how ASQP-RL can be used during data exploration and achieve comparable results to external queries over the large dataset at significantly reduced query times. It also shows how ASQP-RL can be used for aggregation queries, achieving surprisingly good results compared to state-of-the-art techniques.

  • SHARQ: Explainability Framework for Association Rules on Relational Data

    arXiv (Cornell University) · 2024-12-24 · 1 citation

    preprint · Open access

    Association rules are an important technique for gaining insights over large relational datasets consisting of tuples of elements (i.e. attribute-value pairs). However, it is difficult to explain the relative importance of data elements with respect to the rules in which they appear. This paper develops a measure of an element's contribution to a set of association rules based on Shapley values, denoted SHARQ (ShApley Rules Quantification). As is the case with many Shapley-based computations, the cost of a naive calculation of the score is exponential in the number of elements. To that end, we present an efficient framework for computing the exact SHARQ value of a single element whose running time is practically linear in the number of rules. Going one step further, we develop an efficient multi-element SHARQ algorithm which amortizes the cost of the single element SHARQ calculation over a set of elements. Based on the definition of SHARQ for elements we describe two additional use cases for association rules explainability: rule importance and attribute importance. Extensive experiments over a novel benchmark dataset containing 45 instances of mined rule sets show the effectiveness of our approach.

  • Learning Approximation Sets for Exploratory Queries

    arXiv (Cornell University) · 2024-01-30

    preprint · Open access · 1st author · Corresponding

    In data exploration, executing complex non-aggregate queries over large databases can be time-consuming. Our paper introduces a novel approach to address this challenge, focusing on finding an optimized subset of data, referred to as the approximation set, for query execution. The goal is to maximize query result quality while minimizing execution time. We formalize this problem as Approximate Non-Aggregates Query Processing (ANAQP) and establish its NP-completeness. To tackle this, we propose an approximate solution using an advanced Reinforcement Learning architecture, termed ASQP-RL. This approach overcomes challenges related to the large action space and the need for generalization beyond a known query workload. Experimental results on two benchmarks demonstrate the superior performance of ASQP-RL, outperforming baselines by 30% in accuracy and achieving efficiency gains of 10-35X. Our research sheds light on the potential of reinforcement learning techniques for advancing data management tasks.

  • Selecting Sub-tables for Data Exploration

    2023-04-01 · 5 citations

    article

    Data scientists frequently examine the raw content of large tables when exploring an unknown dataset. In such cases, small subsets of the full tables (sub-tables) that accurately capture table contents are useful. We present a framework which, given a large data table T, creates a sub-table of small, fixed dimensions by selecting a subset of T’s rows and projecting them over a subset of T’s columns. The question is: Which rows and columns should be selected to yield an informative sub-table? Our first contribution is an informativeness metric for sub-tables with two complementary dimensions: cell coverage, which measures how well the sub-table captures prominent data patterns in T, and diversity. We use association rules as the patterns captured by sub-tables, and show that computing optimal sub-tables directly using this metric is infeasible. We then develop an efficient algorithm that indirectly accounts for association rules using table embedding. The resulting framework produces sub-tables for the full table as well as for the results of queries over the table, enabling the user to quickly understand results and determine subsequent queries. Experimental results show that high-quality sub-tables can be efficiently computed, and verify the soundness of our metrics as well as the usefulness of selected sub-tables through user studies.

  • Selecting Sub-tables for Data Exploration

    arXiv (Cornell University) · 2022-03-05 · 1 citation

    preprint · Open access

    We present a framework for creating small, informative sub-tables of large data tables to facilitate the first step of data science: data exploration. Given a large data table T, the goal is to create a sub-table of small, fixed dimensions, by selecting a subset of T's rows and projecting them over a subset of T's columns. The question is: which rows and columns should be selected to yield an informative sub-table? We formalize the notion of "informativeness" based on two complementary metrics: cell coverage, which measures how well the sub-table captures prominent association rules in T, and diversity. Since computing optimal sub-tables using these metrics is shown to be infeasible, we give an efficient algorithm which indirectly accounts for association rules using table embedding. The resulting framework can be used for visualizing the complete sub-table, as well as for displaying the results of queries over the sub-table, enabling the user to quickly understand the results and determine subsequent queries. Experimental results show that we can efficiently compute high-quality sub-tables as measured by our metrics, as well as by feedback from user studies.

  • ShapGraph: An Holistic View of Explanations through Provenance Graphs and Shapley Values

    Proceedings of the 2022 International Conference on Management of Data · 2022-06-10 · 10 citations

    article · 1st author · Corresponding

    Explaining query results is an essential tool for enhancing the transparency and quality of data processing, and has been extensively studied in recent years. In particular, Data Provenance -- the tracking of transformations that data undergoes in query evaluation -- has been shown to be a key component of explanations. A hurdle that remains is that data provenance itself is often too large and complex to be presented in its entirety. To that end, we propose to leverage novel advancements on quantifying and computing the contributions of individual input tuples to query answers, based on the game-theoretic notion of the Shapley value. Our proposed prototype solution, called ShapGraph, combines the global view of explanations through provenance graphs with a local quantification of contributions through Shapley values. The graphical interface allows users to switch between and combine these two views to obtain a deeper understanding of the most influential parts of the database and how they interact to yield query answers.

  • SubTab: Data Exploration with Informative Sub-Tables

    Proceedings of the 2022 International Conference on Management of Data · 2022-06-10 · 7 citations

    article

    We demonstrate SubTab, a framework for creating small, informative sub-tables of large data tables to speed up data exploration. Given a table with n rows and m columns where n and m are large, SubTab creates a sub-table T_sub with k<n rows and l<m columns, i.e. a subset of k rows of the table projected over a subset of l columns. The rows and columns are chosen as representatives of prominent data patterns within and across columns in the input table. SubTab can also be used for query results, enabling the user to quickly understand the results and determine subsequent queries.

  • Credit distribution in relational scientific databases

    Information Systems · 2022-05-10 · 5 citations

    article · Open access

    Digital data is a basic form of research product for which citation, and the generation of credit or recognition for authors, are still not well understood. The notion of data credit has therefore recently emerged as a new measure, defined and based on data citation groundwork. Data credit is a real value representing the importance of data cited by a research entity. We can use credit to annotate data contained in a curated scientific database and then as a proxy of the significance and impact of that data in the research world. It is a method that, together with citations, helps recognize the value of data and its creators. In this paper, we explore the problem of Data Credit Distribution, the process by which credit is distributed to the database parts responsible for producing data being cited by a research entity. We adopt as use case the IUPHAR/BPS Guide to Pharmacology (GtoPdb), a widely-used curated scientific relational database. We focus on Select-Project-Join (SPJ) queries under bag semantics, and we define three distribution strategies based on how-provenance, responsibility, and the Shapley value. Using these distribution strategies, we show how credit can highlight frequently used database areas and how it can be used as a new bibliometric measure for data and their curators. In particular, credit rewards data and authors based on their research impact, not only on the citation count. We also show how these distribution strategies vary in their sensitivity to the role of an input tuple in the generation of the output data and reward input tuples differently.
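
Several of the publications above (SHARQ, ShapGraph, and the data credit paper) build on the Shapley value, and the SHARQ abstract notes that a naive calculation is exponential in the number of elements. The sketch below shows that naive calculation for a generic cooperative game, to make the cost visible. It is an illustration only and not the SHARQ algorithm: the function `shapley_values`, the toy rule set `RULES`, and the value function `toy_value` are all hypothetical names and assumptions introduced here.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition.

    `value` maps a frozenset of players to a number. The loop visits all
    2^(n-1) coalitions per player, which is the exponential cost that
    motivates more efficient methods.
    """
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


# Hypothetical toy game (an assumption, not taken from any paper above):
# a coalition's value is the number of "rules" whose elements it fully contains.
RULES = [{"a", "b"}, {"b"}, {"a", "c"}]


def toy_value(coalition):
    return sum(1 for rule in RULES if rule <= coalition)


if __name__ == "__main__":
    for element, score in shapley_values(["a", "b", "c"], toy_value).items():
        print(element, round(score, 3))
```

In this toy game the scores add up to the value of the full coalition (three covered rules), which is the efficiency property of the Shapley value. With n elements the inner loops touch 2^(n-1) coalitions per element, so this approach only works for very small inputs.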
