Leman Akoglu

Associate Professor

Carnegie Mellon University · Heinz College

Active 2006–2025

h-index: 45

Citations: 8.8k

Papers: 220 (87 in the last 5 years)

Funding: $1.7M


About

Leman Akoglu is the Heinz College Dean's Associate Professor of Information Systems at Carnegie Mellon University, a tenured position, and directs the Data Analytics Techniques Algorithms (DATA) Lab at Heinz College. Akoglu also holds courtesy appointments in the Machine Learning Department and the Computer Science Department of the School of Computer Science. Akoglu's research spans data mining, graph mining, machine learning, and knowledge discovery, with a specific focus on anomalies: identifying and characterizing 'what stands out' in large-scale, time-varying, multi-modal data sources through scalable computational methods. This work includes contributions to anomaly and outlier detection, hyperparameter sensitivity analysis, graph neural networks, and foundation models, with applications in healthcare, finance, and social networks. Akoglu holds a Ph.D. in Computer Science from Carnegie Mellon University (2012) and a B.S. in Computer Science from Bilkent University (2007), and is active in research collaborations, keynote talks, and the organization of workshops and conferences.

Research topics

Computer Science
Artificial Intelligence
Computer Security
Machine Learning
Data Mining
Mathematics
Database
Data science
Human–computer interaction
Theoretical computer science
World Wide Web

Selected publications

Trajectory Anomaly Detection with By-Design Complementary Detectors
Society for Industrial and Applied Mathematics eBooks · 2025-01-01 · 2 citations
book-chapter · Senior author
Trajectory anomaly detection is critical across a wide range of applications, from traffic control and wildlife conservation to public transportation optimization. However, detecting anomalies in trajectory data is challenging due to the diverse nature of anomalies. In this paper, we propose CETrajAD, an ensemble method for trajectory anomaly detection that integrates complementary detectors, each targeting different aspects of trajectory anomalies. Our approach leverages three types of trajectory embeddings (Route, Speed, and Shape) that vary in their sensitivity to length, direction, shape, and speed, enabling the detection of diverse anomaly types. We combine detectors from both the embedding and input spaces and show how their complementary nature improves anomaly detection performance. Through theoretical analysis, we demonstrate the conditions under which the proposed ensemble design outperforms traditional ensemble methods. Experiments on multiple real-world datasets, containing both simulated and ground-truth anomalies, show that the proposed model consistently outperforms existing baselines.
End-To-End Self-Tuning Self-Supervised Time Series Anomaly Detection
Society for Industrial and Applied Mathematics eBooks · 2025-01-01 · 1 citation
book-chapter · Senior author
Time series anomaly detection (TSAD) finds many applications such as monitoring environmental sensors, industry KPIs, and patient biomarkers. A two-fold challenge for TSAD is building a model that is both versatile and unsupervised, able to detect various types of time series anomalies (spikes, discontinuities, trend shifts, etc.) without any labeled data. Modern neural networks have outstanding ability in modeling complex time series. Self-supervised models in particular tackle unsupervised TSAD by transforming the input via various augmentations to create pseudo anomalies for training. However, their performance is sensitive to the choice of augmentation, which is hard to make in practice, and no prior work in the literature addresses data augmentation tuning for TSAD without labels. Our work aims to fill this gap. We introduce TSAP for TSA “on autoPilot”, which can (self-)tune augmentation hyperparameters end-to-end. It stands on two key components: a differentiable augmentation architecture and an unsupervised validation loss to effectively assess the alignment between augmentation type and anomaly type. Case studies show TSAP’s ability to effectively select the (discrete) augmentation type and associated (continuous) hyperparameters. In turn, it outperforms established baselines, including SOTA self-supervised models, on diverse TSAD tasks exhibiting different anomaly types.
Can Machine Learning Target Health Care Fraud? Evidence From Medicare Hospitalizations
Journal of Policy Analysis and Management · 2025-12-25 · 1 citation
article · Open access · Senior author
The US spends more than $4 trillion per year on health care, largely conducted by private providers and reimbursed by insurers. A major concern in this system is overbilling and fraud by hospitals, which face incentives to misreport their claims to receive higher payments. In this work, we develop novel machine learning tools to identify hospitals that overbill insurers, which can be used to guide investigations and auditing of suspicious hospitals for both public and private health insurance systems. Using large-scale claims data from Medicare, the US federal health insurance program for the elderly and disabled, we identify patterns consistent with fraud among inpatient hospitalizations. Our proposed approach for fraud detection is fully unsupervised, not relying on any labeled training data, and is explainable to end users, providing interpretations for which diagnosis, procedure, and billing codes lead to hospitals being labeled suspicious. Using newly collected data from the Department of Justice on hospitals facing anti-fraud lawsuits, and case studies of suspicious hospitals, we validate our approach and findings. Our method provides a nearly 5-fold lift over random targeting of hospitals. We also perform a post-analysis to understand which hospital characteristics, not used for detection, are associated with suspiciousness.
Pard: Permutation-Invariant Autoregressive Diffusion for Graph Generation
arXiv (Cornell University) · 2024-02-06 · 1 citation
preprint · Open access · Senior author
Graph generation has been dominated by autoregressive models due to their simplicity and effectiveness, despite their sensitivity to ordering. Yet diffusion models have garnered increasing attention, as they offer comparable performance while being permutation-invariant. Current graph diffusion models generate graphs in a one-shot fashion, but they require extra features and thousands of denoising steps to achieve optimal performance. We introduce PARD, a Permutation-invariant Auto Regressive Diffusion model that integrates diffusion models with autoregressive methods. PARD harnesses the effectiveness and efficiency of the autoregressive model while maintaining permutation invariance without ordering sensitivity. Specifically, we show that contrary to sets, elements in a graph are not entirely unordered and there is a unique partial order for nodes and edges. With this partial order, PARD generates a graph in a block-by-block, autoregressive fashion, where each block's probability is conditionally modeled by a shared diffusion model with an equivariant network. To ensure efficiency while being expressive, we further propose a higher-order graph transformer, which integrates transformer with PPGN. Like GPT, we extend the higher-order graph transformer to support parallel training of all blocks. Without any extra features, PARD achieves state-of-the-art performance on molecular and non-molecular datasets, and scales to large datasets like MOSES containing 1.9M molecules. Pard is open-sourced at https://github.com/LingxiaoShawn/Pard.
Outlier Detection Bias Busted: Understanding Sources of Algorithmic Bias through Data-centric Factors
arXiv (Cornell University) · 2024-08-24
preprint · Open access · Senior author
The astonishing successes of ML have raised growing concern for the fairness of modern methods when deployed in real-world settings. However, studies on fairness have mostly focused on supervised ML, while unsupervised outlier detection (OD), with numerous applications in finance, security, etc., has attracted little attention. While a few studies have proposed fairness-enhanced OD algorithms, they remain agnostic to the underlying driving mechanisms or sources of unfairness. Even within the supervised ML literature, there is debate on whether unfairness stems solely from algorithmic biases (i.e. design choices) or from the biases encoded in the data on which models are trained. To close this gap, this work aims to shed light on the possible sources of unfairness in OD by auditing detection models under different data-centric factors. By injecting various known biases into the input data (sample size disparity, under-representation, feature measurement noise, and group membership obfuscation), we find that the OD algorithms under study all exhibit fairness pitfalls, although they differ in which types of data bias they are more susceptible to. The most notable finding of our study is that OD algorithm bias is not merely a data bias problem. A key realization is that the data properties that emerge from bias injection could just as well be organic, arising from natural group differences w.r.t. sparsity, base rate, variance, and multi-modality. Whether natural or biased, such data properties can give rise to unfairness as they interact with certain algorithmic design choices.
Descriptive Kernel Convolution Network with Improved Random Walk Kernel
2024-05-08 · 1 citation
article · Open access · Senior author
Graph kernels used to be the dominant approach to feature engineering for structured data, but they have been superseded by modern GNNs because they lack learnability. Recently, a suite of Kernel Convolution Networks (KCNs) successfully revitalized graph kernels by introducing learnability: a KCN convolves the input with learnable hidden graphs using a certain graph kernel. The random walk kernel (RWK) has been used as the default kernel in many KCNs, gaining increasing attention. In this paper, we first revisit the RWK and its current usage in KCNs, revealing several shortcomings of the existing designs, and propose an improved graph kernel, RWK+, by introducing color-matching random walks and deriving its efficient computation. We then propose RWK+CN, a KCN that uses RWK+ as the core kernel to learn descriptive graph features with an unsupervised objective, which cannot be achieved by GNNs. Further, by unrolling RWK+, we discover its connection with a regular GCN layer and propose a novel GNN layer, RWK+Conv. In the first part of the experiments, we demonstrate the descriptive learning ability of RWK+CN with the improved random walk kernel RWK+ on unsupervised pattern mining tasks; in the second part, we show the effectiveness of RWK+ for a variety of KCN architectures and supervised graph learning tasks, and demonstrate the expressiveness of the RWK+Conv layer, especially on graph-level tasks. RWK+ and RWK+Conv adapt to various real-world applications, including web applications such as bot detection in a web-scale Twitter social network and community classification in Reddit social interaction networks.
Machine Learning in Finance
2024-08-24
article · 1st author · Corresponding
This workshop aims to explore the intersection of Generative AI with the rich tapestry of financial data types, seeking to uncover new methodologies and techniques that can enhance predictive analytics, fraud detection, and customer insights across the sector. By harnessing these advancements in AI, we can pave the way to not only understand customer behavior but also anticipate their needs more effectively, leading to superior customer outcomes and more personalized services. Our objective is to shed light on the challenges and opportunities presented by the diverse data formats in finance. We aim to bridge the gap between the dominance of traditional models for tabular data analysis and the emerging potential of Generative AI to revolutionize the treatment of time series, click streams, and other unstructured data forms.
End-To-End Self-Tuning Self-Supervised Time Series Anomaly Detection
arXiv (Cornell University) · 2024-04-03
preprint · Open access · Senior author
FoMo-0D: A Foundation Model for Zero-shot Tabular Outlier Detection
arXiv (Cornell University) · 2024-09-09
preprint · Open access · Senior author
Outlier detection (OD) has a vast literature as it finds numerous real-world applications. Being an unsupervised task, model selection is a key bottleneck for OD without label supervision. Despite a long list of available OD algorithms with tunable hyperparameters, the lack of systematic approaches for unsupervised algorithm and hyperparameter selection limits their effective use in practice. In this paper, we present FoMo-0D, a pre-trained Foundation Model for zero/0-shot OD on tabular data, which bypasses the hurdle of model selection altogether. Having been pre-trained on synthetic data, FoMo-0D can directly predict the (outlier/inlier) label of test samples without parameter fine-tuning -- requiring no labeled data, and no additional training or hyperparameter tuning when given a new task. Extensive experiments on 57 real-world datasets against 26 baselines show that FoMo-0D is highly competitive; outperforming the majority of the baselines with no statistically significant difference from the 2nd best method. Further, FoMo-0D is efficient in inference time requiring only 7.7 ms per sample on average, with at least 7x speed-up compared to previous methods. To facilitate future research, our implementations for data synthesis and pre-training as well as model checkpoints are openly available at https://github.com/A-Chicharito-S/FoMo-0D.
Descriptive Kernel Convolution Network with Improved Random Walk Kernel
arXiv (Cornell University) · 2024-02-08
preprint · Open access · Senior author

Recent grants

CAREER: A General Framework for Methodical and Interpretable Anomaly Mining
NSF · $503k · 2016–2022
III: Medium: Collaborative Research: Collective Opinion Fraud Detection: Identifying and Integrating Cues from Language, Behavior, and Networks
NSF · $417k · 2016–2019
CAREER: A General Framework for Methodical and Interpretable Anomaly Mining
NSF · $192k · 2015–2016
III: Medium: Collaborative Research: Collective Opinion Fraud Detection: Identifying and Integrating Cues from Language, Behavior, and Networks
NSF · $600k · 2014–2017

Frequent coauthors

Christos Faloutsos · Carnegie Mellon University · 50 shared
Lingxiao Zhao · Dalian Maritime University · 28 shared
Tina Eliassi‐Rad · 24 shared
Bart Baesens · 24 shared
Hanghang Tong · 23 shared
Véronique Van Vlasselaer · 21 shared
Monique Snoeck · 21 shared
Bryan Hooi · 17 shared

Education

Ph.D., Computer Science
Carnegie Mellon University
2004
M.S., Computer Science
Carnegie Mellon University
2000
B.S., Computer Engineering
Middle East Technical University
1996

Awards & honors

Heinz College Dean's Professor, Feb 2019-2022
Best Research Paper Award, SIAM SDM 2019
Best Student Machine Learning Paper Runner-up Award, ECML PK…
NSF CAREER Award, 2015-2020
Best Research Paper Runner-up Award, SIAM SDM 2016
