Resume-aware faculty matching

Find professors who actually fit you

Upload your resume. Four AI agents analyze your background, rank the faculty who fit, inspect their recent research, and help you draft outreach — grounded in their actual work, not templates.

Free to startNo credit cardCancel anytime
Top matches Balanced preset
Dr. Sarah Chen
Stanford · Interpretability · NLP
91
Dr. Marcus Holloway
MIT · Robotics · RL
84
Dr. Aisha Okonkwo
CMU · Fairness · HCI
82
Nova · Professor Researcher · re-ranking top 20…
Walid G. Aref

Walid G. Aref

Verified

Purdue University · Computer Science

Active 1990–2026

h-index51
Citations10.3k
Papers35452 last 5y
Funding$2.4M
See your match with Walid G. Aref — sign in to PhdFit.Sign in

About

Walid G. Aref is a professor of computer science at Purdue University. His research interests are in extending the functionality of database systems to support emerging applications, including spatial, spatio-temporal, graph, biological, and sensor databases. He is also interested in query processing, indexing, data streaming, and geographic information systems (GIS). Walid's research has been supported by various organizations such as the National Science Foundation, the National Institute of Health, Purdue Research Foundation, CERIAS, Panasonic, and Microsoft Corp. He received his BSc and MSc in Computer Science from Alexandria University, Egypt, in 1983 and 1986 respectively, and his PhD from the University of Maryland at College Park in 1993. Walid has been a faculty member at Purdue since Fall 1999, and he has received several awards including the NSF CAREER Award in 2001 and a Purdue University Faculty Scholar award in 2004. He is a member of Purdue's CERIAS, serves as the Editor-in-Chief of the ACM Transactions of Spatial Algorithms and Systems, and has held editorial positions in other prominent journals. Walid is a Fellow of the IEEE and a member of the ACM, and he has served as the chair of the ACM Special Interest Group on Spatial Information (SIGSPATIAL) from 2011 to 2014.

Research topics

  • Computer Science
  • Data Mining
  • Theoretical computer science
  • Data science

Selected publications

  • P-MOSS: Scheduling Main-Memory Indexes Over NUMA Servers Using Next Token Prediction

    Proceedings of the ACM on Management of Data · 2026-04-02

    preprintOpen accessSenior author

    Ever since the Dennard scaling broke down in the early 2000s and the frequency of the CPUs stalled, vendors have started to increase the core count in each CPU chip at the expense of introducing heterogeneity, thus ushering the era of NUMA and Chiplet processors. Since then, the heterogeneity in the design space of hardware has only increased to the point that DBMS performance may vary significantly up to an order of magnitude in modern servers. An important factor that affects performance includes the location of the logical cores where the DBMS queries execute, and the location where the data resides. This paper introduces P-MOSS, a learned spatial scheduling framework that schedules query execution to specific logical cores, and co-locates data on the corresponding NUMA node. For cross-hardware and workload adaptability, P-MOSS leverages core principles from Large Language Models, such as Next Token prediction, Generative Pre-training, and Fine-tuning. In the spirit of hardware-software synergy, P-MOSS guides its scheduling decision solely based on the low-level hardware statistics collected from the hardware Performance Monitoring Unit with the aid of a Decision Transformer. Experimental evaluation is performed in the context of the B^+-Tree index. Performance results demonstrate that P-MOSS offers an improvement of up to 6× over traditional schedules in terms of query throughput.

  • Gen-DBA: Generative Database Agents

    ArXiv.org · 2026-01-23

    articleOpen accessSenior author

    Leveraging Machine Learning to optimize database systems, referred to as Machine Learning for Databases (ML4DB, for short), dates back to the early 1990s, spanning indexing techniques, selectivity estimation, and query optimization. However, the idea has gained mainstream traction following the introduction of learned indexes in 2018, triggering a surge of research spanning learned indexes and cardinality estimators to learned query optimizers, storage layout design, resource management, and database tuning. The current ML4DB optimization landscape is dominated by narrow specialist ML models that are small and are trained on limited training data. Each specialist ML model targets a single database learning task on a fixed database engine, hardware platform, query workload, and optimization objective. As a result, they fall short in real-world settings, where these factors can vary significantly and evolve over time. This leads to an exponential number of ML models with limited portability and generalization capability, thus limiting the utility of existing ML4DB approaches. We address this limitation with Gen-DBA, a single general-purpose foundation model for optimizing databases with agentic capabilities. This paper presents the vision for Gen-DBA, provides a sketch design of how to realize it, and highlights several research challenges that need to be addressed to fully realize Gen-DBA.

  • Gen-DBA: Generative Database Agents

    arXiv (Cornell University) · 2026-01-23

    preprintOpen accessSenior author

    Leveraging Machine Learning to optimize database systems, referred to as Machine Learning for Databases (ML4DB, for short), dates back to the early 1990s, spanning indexing techniques, selectivity estimation, and query optimization. However, the idea has gained mainstream traction following the introduction of learned indexes in 2018, triggering a surge of research spanning learned indexes and cardinality estimators to learned query optimizers, storage layout design, resource management, and database tuning. The current ML4DB optimization landscape is dominated by narrow specialist ML models that are small and are trained on limited training data. Each specialist ML model targets a single database learning task on a fixed database engine, hardware platform, query workload, and optimization objective. As a result, they fall short in real-world settings, where these factors can vary significantly and evolve over time. This leads to an exponential number of ML models with limited portability and generalization capability, thus limiting the utility of existing ML4DB approaches. We address this limitation with Gen-DBA, a single general-purpose foundation model for optimizing databases with agentic capabilities. This paper presents the vision for Gen-DBA, provides a sketch design of how to realize it, and highlights several research challenges that need to be addressed to fully realize Gen-DBA.

  • GTX: A Write-Optimized Latch-free Graph Data System with Transactional Support

    Proceedings of the ACM on Management of Data · 2025-06-17 · 2 citations

    articleOpen accessSenior author

    This paper introduces GTX, a standalone main-memory write-optimized graph data system that specializes in structural and graph property updates while enabling concurrent reads and graph analytics through ACID transactions. Recent graph systems target concurrent read and write support while guaranteeing transaction semantics. However, their performance suffers from updates with real-world temporal locality over the same vertices and edges due to vertex-centric lock contentions. GTX has an adaptive delta-chain locking protocol on top of a carefully designed latch-free graph storage. It eliminates vertex-level locking contention, and adapts to real-life workloads while maintaining sequential access to the graph's adjacency lists storage. GTX's transactions further support cache-friendly block-level concurrency control, and cooperative group commit and garbage collection. This combination of features ensures high update throughput and provides low latency graph analytics. Based on experimental evaluation, in addition to not sacrificing the performance of read-heavy analytical workloads, and having competitive performance similar to state-of-the-art systems, GTX has high read-write transaction throughput. For write-heavy transactional workloads, GTX achieves up to 11X better transaction throughput than the best-performing state-of-the-art system.

  • An Adaptive Index for Oscillating Write-Heavy and Read-Heavy Workloads

    2025-06-20

    articleOpen accessSenior author
  • Lory: Location-aware Data Discovery from Data Lakes

    2025-11-03

    articleOpen accessSenior author

    Data lakes are becoming increasingly prevalent across public and organizational settings, driven by the "store now, query later" paradigm. However, querying these lakes—particularly when composed of heterogeneous tabular data lacking a unified schema—remains a major challenge. Many real-world queries require integrating multiple tables through joins and unions, yet existing data discovery systems largely ignore spatial aspects, despite the presence of geographic attributes such as coordinates and region names.

  • Exploring Next Token Prediction For Optimizing Databases

    ArXiv.org · 2025-03-25

    preprintOpen accessSenior author

    The Next Token Prediction paradigm (NTP, for short) lies at the forefront of modern large foundational models that are pre-trained on diverse and large datasets. These models generalize effectively, and have proven to be very successful in Natural Language Processing (NLP). Inspired by the generalization capabilities of Large Language Models (LLMs), we investigate whether the same NTP paradigm can be applied to DBMS design and optimization tasks. Adopting NTP directly for database optimization is non-trivial due to the fundamental differences between the domains. In this paper, we present a framework, termed Probe and Learn (PoLe), for applying NTP to optimize database systems. PoLe leverages Decision Transformers and hardware-generated tokens to effectively incorporate NTP into database systems. As a proof of concept, we demonstrate PoLe in the context of the index scheduling task over NUMA servers in main-memory database systems. Preliminary results for this scheduling task demonstrate that adopting NTP and PoLe can improve both performance and generalizability.

  • The Ubiquitous Skiplist: A Survey of What Cannot be Skipped About the Skiplist and its Applications in Data Systems

    ACM Computing Surveys · 2025-05-22 · 2 citations

    reviewOpen accessSenior author

    Skiplists have become prevalent in systems. The main advantages of skiplists are their simplicity and ease of implementation, and the ability to support operations in the same asymptotic complexities as their tree-based counterparts. In this survey, we explore skiplists and their many variants. We highlight many scenarios about how skiplists are useful, and how they fit well in these usage scenarios. We also compare skiplists with other data structures, especially tree-based structures. Extensions to skiplists include structural modifications, as well as algorithmic enhancements and operations. We categorize the existing extensions, and summarize the skiplist variants that belong to each category. We present how data systems incorporate skiplist variants into many different application scenarios to serve various purposes. These data systems cover a wide range of applications, from data indexing to block-chain, from network algorithms to deterministic skiplists, and so on. It illustrates an impactful and diverse applications of skiplists in various domains of data systems.

  • BMTree: Designing, Learning, and Updating Piecewise Space-Filling Curves for Multi-Dimensional Data Indexing

    ArXiv.org · 2025-05-03

    preprintOpen access

    Space-filling curves (SFC, for short) have been widely applied to index multi-dimensional data, which first maps the data to one dimension, and then a one-dimensional indexing method, e.g., the B-tree indexes the mapped data. Existing SFCs adopt a single mapping scheme for the whole data space. However, a single mapping scheme often does not perform well on all the data space. In this paper, we propose a new type of SFC called piecewise SFCs that adopts different mapping schemes for different data subspaces. Specifically, we propose a data structure termed the Bit Merging tree (BMTree) that can generate data subspaces and their SFCs simultaneously, and achieve desirable properties of the SFC for the whole data space. Furthermore, we develop a reinforcement learning-based solution to build the BMTree, aiming to achieve excellent query performance. To update the BMTree efficiently when the distributions of data and/or queries change, we develop a new mechanism that achieves fast detection of distribution shifts in data and queries, and enables partial retraining of the BMTree. The retraining mechanism achieves performance enhancement efficiently since it avoids retraining the BMTree from scratch. Extensive experiments show the effectiveness and efficiency of the BMTree with the proposed learning-based methods.

  • A Survey of Learned Indexes for the Multi-dimensional Space

    ACM Computing Surveys · 2025-09-17 · 4 citations

    reviewOpen accessSenior author

    A recent research trend involves treating database index structures as Machine Learning (ML) models. In this domain, single or multiple ML models are trained to learn the mapping from keys to positions inside a dataset. This class of indexes is known as “Learned Indexes.” Learned indexes have demonstrated improved search performance and reduced space requirements for one-dimensional data. The concept of one-dimensional learned indexes has naturally been extended to multi-dimensional (e.g., spatial) data, leading to the development of “Learned Multi-dimensional Indexes.” This survey presents a taxonomy that classifies and categorizes both learned one- and multi-dimensional indexes, and surveys the existing literature on learned indexes according to this taxonomy with an emphasis on learned multi-dimensional index structures. Specifically, it reviews the current state of this research area, explains the core concepts behind each proposed method, and classifies these methods based on several well-defined criteria. Additionally, we present a timeline to illustrate the evolution of research on learned indexes. Finally, we highlight several open challenges and future research directions in this emerging and highly active field.

Recent grants

Frequent coauthors

Education

  • Ph.D., Computer Science

    The University of Maryland

    1993

Awards & honors

  • CAREER Award from the National Science Foundation (2001)
  • Purdue University Faculty Scholar award (2004)
  • VLDB ten-year best paper award (2016)
  • Resume-aware match score
  • Save to shortlist
  • AI-drafted outreach

See your match with Walid G. Aref

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

  • Free to start
  • No credit card
  • 30-second signup