Walid G. Aref

Verified

Purdue University · Computer Science

Active 1990–2026

h-index51

Citations10.3k

Papers35452 last 5y

Funding$2.4M

Faculty page

See your match with Walid G. Aref — sign in to PhdFit.Sign in

About

Walid G. Aref is a professor of computer science at Purdue University. His research interests are in extending the functionality of database systems to support emerging applications, including spatial, spatio-temporal, graph, biological, and sensor databases. He is also interested in query processing, indexing, data streaming, and geographic information systems (GIS). Walid's research has been supported by various organizations such as the National Science Foundation, the National Institute of Health, Purdue Research Foundation, CERIAS, Panasonic, and Microsoft Corp. He received his BSc and MSc in Computer Science from Alexandria University, Egypt, in 1983 and 1986 respectively, and his PhD from the University of Maryland at College Park in 1993. Walid has been a faculty member at Purdue since Fall 1999, and he has received several awards including the NSF CAREER Award in 2001 and a Purdue University Faculty Scholar award in 2004. He is a member of Purdue's CERIAS, serves as the Editor-in-Chief of the ACM Transactions of Spatial Algorithms and Systems, and has held editorial positions in other prominent journals. Walid is a Fellow of the IEEE and a member of the ACM, and he has served as the chair of the ACM Special Interest Group on Spatial Information (SIGSPATIAL) from 2011 to 2014.

Research topics

Computer Science
Data Mining
Theoretical computer science
Data science

Selected publications

P-MOSS: Scheduling Main-Memory Indexes Over NUMA Servers Using Next Token Prediction
Proceedings of the ACM on Management of Data · 2026-04-02
preprintOpen accessSenior author
Ever since the Dennard scaling broke down in the early 2000s and the frequency of the CPUs stalled, vendors have started to increase the core count in each CPU chip at the expense of introducing heterogeneity, thus ushering the era of NUMA and Chiplet processors. Since then, the heterogeneity in the design space of hardware has only increased to the point that DBMS performance may vary significantly up to an order of magnitude in modern servers. An important factor that affects performance includes the location of the logical cores where the DBMS queries execute, and the location where the data resides. This paper introduces P-MOSS, a learned spatial scheduling framework that schedules query execution to specific logical cores, and co-locates data on the corresponding NUMA node. For cross-hardware and workload adaptability, P-MOSS leverages core principles from Large Language Models, such as Next Token prediction, Generative Pre-training, and Fine-tuning. In the spirit of hardware-software synergy, P-MOSS guides its scheduling decision solely based on the low-level hardware statistics collected from the hardware Performance Monitoring Unit with the aid of a Decision Transformer. Experimental evaluation is performed in the context of the B^+-Tree index. Performance results demonstrate that P-MOSS offers an improvement of up to 6× over traditional schedules in terms of query throughput.
Publisher OA PDF DOI
Gen-DBA: Generative Database Agents
ArXiv.org · 2026-01-23
articleOpen accessSenior author
Leveraging Machine Learning to optimize database systems, referred to as Machine Learning for Databases (ML4DB, for short), dates back to the early 1990s, spanning indexing techniques, selectivity estimation, and query optimization. However, the idea has gained mainstream traction following the introduction of learned indexes in 2018, triggering a surge of research spanning learned indexes and cardinality estimators to learned query optimizers, storage layout design, resource management, and database tuning. The current ML4DB optimization landscape is dominated by narrow specialist ML models that are small and are trained on limited training data. Each specialist ML model targets a single database learning task on a fixed database engine, hardware platform, query workload, and optimization objective. As a result, they fall short in real-world settings, where these factors can vary significantly and evolve over time. This leads to an exponential number of ML models with limited portability and generalization capability, thus limiting the utility of existing ML4DB approaches. We address this limitation with Gen-DBA, a single general-purpose foundation model for optimizing databases with agentic capabilities. This paper presents the vision for Gen-DBA, provides a sketch design of how to realize it, and highlights several research challenges that need to be addressed to fully realize Gen-DBA.
Publisher OA PDF
Gen-DBA: Generative Database Agents
arXiv (Cornell University) · 2026-01-23
preprintOpen accessSenior author
Leveraging Machine Learning to optimize database systems, referred to as Machine Learning for Databases (ML4DB, for short), dates back to the early 1990s, spanning indexing techniques, selectivity estimation, and query optimization. However, the idea has gained mainstream traction following the introduction of learned indexes in 2018, triggering a surge of research spanning learned indexes and cardinality estimators to learned query optimizers, storage layout design, resource management, and database tuning. The current ML4DB optimization landscape is dominated by narrow specialist ML models that are small and are trained on limited training data. Each specialist ML model targets a single database learning task on a fixed database engine, hardware platform, query workload, and optimization objective. As a result, they fall short in real-world settings, where these factors can vary significantly and evolve over time. This leads to an exponential number of ML models with limited portability and generalization capability, thus limiting the utility of existing ML4DB approaches. We address this limitation with Gen-DBA, a single general-purpose foundation model for optimizing databases with agentic capabilities. This paper presents the vision for Gen-DBA, provides a sketch design of how to realize it, and highlights several research challenges that need to be addressed to fully realize Gen-DBA.
Publisher DOI
GTX: A Write-Optimized Latch-free Graph Data System with Transactional Support
Proceedings of the ACM on Management of Data · 2025-06-17 · 2 citations
articleOpen accessSenior author
This paper introduces GTX, a standalone main-memory write-optimized graph data system that specializes in structural and graph property updates while enabling concurrent reads and graph analytics through ACID transactions. Recent graph systems target concurrent read and write support while guaranteeing transaction semantics. However, their performance suffers from updates with real-world temporal locality over the same vertices and edges due to vertex-centric lock contentions. GTX has an adaptive delta-chain locking protocol on top of a carefully designed latch-free graph storage. It eliminates vertex-level locking contention, and adapts to real-life workloads while maintaining sequential access to the graph's adjacency lists storage. GTX's transactions further support cache-friendly block-level concurrency control, and cooperative group commit and garbage collection. This combination of features ensures high update throughput and provides low latency graph analytics. Based on experimental evaluation, in addition to not sacrificing the performance of read-heavy analytical workloads, and having competitive performance similar to state-of-the-art systems, GTX has high read-write transaction throughput. For write-heavy transactional workloads, GTX achieves up to 11X better transaction throughput than the best-performing state-of-the-art system.
Publisher DOI
An Adaptive Index for Oscillating Write-Heavy and Read-Heavy Workloads
2025-06-20
articleOpen accessSenior author
Publisher DOI
Lory: Location-aware Data Discovery from Data Lakes
2025-11-03
articleOpen accessSenior author
Data lakes are becoming increasingly prevalent across public and organizational settings, driven by the "store now, query later" paradigm. However, querying these lakes—particularly when composed of heterogeneous tabular data lacking a unified schema—remains a major challenge. Many real-world queries require integrating multiple tables through joins and unions, yet existing data discovery systems largely ignore spatial aspects, despite the presence of geographic attributes such as coordinates and region names.
Publisher OA PDF DOI
Exploring Next Token Prediction For Optimizing Databases
ArXiv.org · 2025-03-25
preprintOpen accessSenior author
The Next Token Prediction paradigm (NTP, for short) lies at the forefront of modern large foundational models that are pre-trained on diverse and large datasets. These models generalize effectively, and have proven to be very successful in Natural Language Processing (NLP). Inspired by the generalization capabilities of Large Language Models (LLMs), we investigate whether the same NTP paradigm can be applied to DBMS design and optimization tasks. Adopting NTP directly for database optimization is non-trivial due to the fundamental differences between the domains. In this paper, we present a framework, termed Probe and Learn (PoLe), for applying NTP to optimize database systems. PoLe leverages Decision Transformers and hardware-generated tokens to effectively incorporate NTP into database systems. As a proof of concept, we demonstrate PoLe in the context of the index scheduling task over NUMA servers in main-memory database systems. Preliminary results for this scheduling task demonstrate that adopting NTP and PoLe can improve both performance and generalizability.
Publisher OA PDF DOI
The Ubiquitous Skiplist: A Survey of What Cannot be Skipped About the Skiplist and its Applications in Data Systems
ACM Computing Surveys · 2025-05-22 · 2 citations
reviewOpen accessSenior author
Skiplists have become prevalent in systems. The main advantages of skiplists are their simplicity and ease of implementation, and the ability to support operations in the same asymptotic complexities as their tree-based counterparts. In this survey, we explore skiplists and their many variants. We highlight many scenarios about how skiplists are useful, and how they fit well in these usage scenarios. We also compare skiplists with other data structures, especially tree-based structures. Extensions to skiplists include structural modifications, as well as algorithmic enhancements and operations. We categorize the existing extensions, and summarize the skiplist variants that belong to each category. We present how data systems incorporate skiplist variants into many different application scenarios to serve various purposes. These data systems cover a wide range of applications, from data indexing to block-chain, from network algorithms to deterministic skiplists, and so on. It illustrates an impactful and diverse applications of skiplists in various domains of data systems.
Publisher DOI
BMTree: Designing, Learning, and Updating Piecewise Space-Filling Curves for Multi-Dimensional Data Indexing
ArXiv.org · 2025-05-03
preprintOpen access
Space-filling curves (SFC, for short) have been widely applied to index multi-dimensional data, which first maps the data to one dimension, and then a one-dimensional indexing method, e.g., the B-tree indexes the mapped data. Existing SFCs adopt a single mapping scheme for the whole data space. However, a single mapping scheme often does not perform well on all the data space. In this paper, we propose a new type of SFC called piecewise SFCs that adopts different mapping schemes for different data subspaces. Specifically, we propose a data structure termed the Bit Merging tree (BMTree) that can generate data subspaces and their SFCs simultaneously, and achieve desirable properties of the SFC for the whole data space. Furthermore, we develop a reinforcement learning-based solution to build the BMTree, aiming to achieve excellent query performance. To update the BMTree efficiently when the distributions of data and/or queries change, we develop a new mechanism that achieves fast detection of distribution shifts in data and queries, and enables partial retraining of the BMTree. The retraining mechanism achieves performance enhancement efficiently since it avoids retraining the BMTree from scratch. Extensive experiments show the effectiveness and efficiency of the BMTree with the proposed learning-based methods.
Publisher OA PDF DOI
A Survey of Learned Indexes for the Multi-dimensional Space
ACM Computing Surveys · 2025-09-17 · 4 citations
reviewOpen accessSenior author
A recent research trend involves treating database index structures as Machine Learning (ML) models. In this domain, single or multiple ML models are trained to learn the mapping from keys to positions inside a dataset. This class of indexes is known as “Learned Indexes.” Learned indexes have demonstrated improved search performance and reduced space requirements for one-dimensional data. The concept of one-dimensional learned indexes has naturally been extended to multi-dimensional (e.g., spatial) data, leading to the development of “Learned Multi-dimensional Indexes.” This survey presents a taxonomy that classifies and categorizes both learned one- and multi-dimensional indexes, and surveys the existing literature on learned indexes according to this taxonomy with an emphasis on learned multi-dimensional index structures. Specifically, it reviews the current state of this research area, explains the core concepts behind each proposed method, and classifies these methods based on several well-defined criteria. Additionally, we present a timeline to illustrate the evolution of research on learned indexes. Finally, we highlight several open challenges and future research directions in this emerging and highly active field.
Publisher DOI

Recent grants

III-COR-Small: Collaborative Research: Preference- And Context-Aware Query Processing for Location-based Database Servers
NSF · $193k · 2008–2012
III:Small: Commugrate -- A Community-based Data Integration System
NSF · $498k · 2009–2014
CAREER: Research and Development of Database Technologies for Modern Applications
NSF · $300k · 2001–2007
III: Small: On the Conceptual Evaluation and Optimization of Queries in Spatiotemporal Data Systems
NSF · $497k · 2011–2016
III: Small: In-memory, Distributed, and Adaptive Spatio-textual Query Processing
NSF · $432k · 2018–2022

Frequent coauthors

Mohamed F. Mokbel
University of Minnesota System
80 shared
Ahmed K. Elmagarmid
77 shared
Mourad Ouzzani
Hamad bin Khalifa University
46 shared
Ahmed R. Mahmood
Google (United States)
43 shared
Moustafa A. Hammad
39 shared
Ahmed M. Aly
Future University in Egypt
27 shared
Mingjie Tang
Sichuan University
23 shared
Arif Ghafoor
22 shared

Education

Ph.D., Computer Science
The University of Maryland
1993

Awards & honors

CAREER Award from the National Science Foundation (2001)
Purdue University Faculty Scholar award (2004)
VLDB ten-year best paper award (2016)

Resume-aware match score
Save to shortlist
AI-drafted outreach

See your match with Walid G. Aref

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

Join the waitlist How it works

Free to start
No credit card
30-second signup

Find professors who actually fit you