Youakim Badr

Professor of Data Analytics and Artificial Intelligence

Pennsylvania State University · Artificial Intelligence

Active 2001–2026

h-index: 21

Citations: 2.4k

Papers: 244 (45 in the last 5 years)

Funding: not listed

About

Dr. Youakim Badr is a Professor of Data Analytics and Artificial Intelligence at The Pennsylvania State University, Great Valley School of Graduate Professional Studies. He serves as Professor-in-Charge of the Master of Artificial Intelligence program across Penn State Great Valley and World Campus. He is also the Founding Director of the Trustworthy Intelligence and Thinking Machine Lab (THINK Lab) and a Research Fellow with Penn State’s Center for Socially Responsible Artificial Intelligence and Clinical. Dr. Badr received his Ph.D. in Computer Science from the National Institute of Applied Sciences of Lyon (INSA-Lyon), France, in 2003, and held a tenured faculty appointment at INSA-Lyon before joining Penn State. His research advances trustworthy, secure, scalable, and composable AI systems, with emphasis on agentic AI, AI analytics systems, and trustworthy AI. He has advised and co-advised more than 20 doctoral students and numerous graduate students internationally, leading or contributing to funded research projects supported by agencies and partners in the United States, France, Europe, and industry. Dr. Badr has played a significant role in designing, launching, and scaling Penn State’s Master of Artificial Intelligence program, developing multiple stackable certificates and graduate AI courses, and expanding interdisciplinary AI pathways in collaboration with programs in data analytics, software engineering, computer science, and business. He has authored or co-authored over 150 peer-reviewed publications and has received multiple awards for research, teaching, innovation, service, and student engagement from Penn State and the French Ministry of Higher Education and Research. 
His professional activities include serving as a court-appointed scientific expert in artificial intelligence, evaluating AI and data-intensive systems for national and international funding panels, and participating actively in professional communities such as ACM, EDUCAUSE, INFORMS, Linux Foundation AI & Data, and the Agentic AI Foundation.

Research topics

Computer Science
Artificial Intelligence
Machine Learning
Natural Language Processing
Statistics
Control engineering
Engineering
Medicine
Surgery
Mathematics
Mechanical engineering
World Wide Web
Simulation
Automotive engineering
Electrical engineering

Selected publications

ParaKab – Many Languages, One Kabyle: A Multilingual Parallel Corpus for a Low-Resource Language
Open MIND · 2026-03-05
dataset · Senior author
Description of the Dataset: This dataset consists of three parallel corpora involving the Kabyle language, covering the following language pairs: Kabyle-Arabic, Kabyle-French, and Kabyle-English. Together, these corpora comprise approximately one million (1M) aligned sentence pairs, forming a multilingual Kabyle-centric parallel dataset. The resource is intended for research in natural language processing (NLP), particularly for low-resource languages (LRLs), machine translation, and cross-linguistic studies. The data were compiled from multiple sources, primarily from publicly available resources distributed via the OPUS platform, as well as selected public websites. All texts underwent a rigorous pipeline including dataset selection, cleaning, normalization, alignment, verification, and deduplication, resulting in a ready-to-use dataset for research and machine learning applications. Academic Context: This dataset was developed as part of a university research project on low-resource languages, focusing on Kabyle, a northern Tamazight (Berber) language. The project aims to contribute to the digital inclusion of low-resource languages and to support their integration into modern NLP systems. Licensing and Data Rights: Each sentence pair in the dataset is linked to its original data source and, where applicable, its corresponding license or rights holder, ensuring transparency and compliance with source-specific terms of use. For the publication of this corpus, a general license is provided for the dataset as a whole. Users are responsible for respecting the original licenses and usage conditions associated with each source when reusing the data.
Beyond All-Reduce: Event-Driven Model Parallelism Without Collective Communication Primitives (EBD2N)
Research Square · 2026-03-05
preprint · Open access
ParaKab – Many Languages, One Kabyle: A Multilingual Parallel Corpus for a Low-Resource Language
Zenodo (CERN European Organization for Nuclear Research) · 2026-03-05
dataset · Open access · Senior author
Editorial: Journal of Cyber Security and Risk Auditing
Journal of Cyber Security and Risk Auditing · 2025-05-01
editorial · Open access · 1st author · Corresponding author
Dear Readers, It is with great pleasure that we introduce to you our upcoming journal, "Journal of Cyber Security and Risk Auditing." This journal is dedicated to exploring the advancements in the field of cybersecurity and providing a platform for researchers and scholars to exchange ideas, fostering progress in the area of cybersecurity and risk auditing. On behalf of the editorial team, I extend our heartfelt gratitude and a warm welcome to the scholars, experts, researchers, and readers who support and follow our journal. Purpose of the Journal: The Journal of Cyber Security and Risk Auditing aims to promote the development of cybersecurity fields, enhance the research level of cybersecurity technologies, and strengthen academic exchanges on an international scale. We are committed to building an open, inclusive, and innovative platform for researchers in the field of cybersecurity to present their findings, share experiences, and exchange ideas.
Federated Learning in Healthcare: A Bibliometric Analysis of Privacy, Security, and Adversarial Threats (2021-2024)
Shifra. · 2025-01-17 · 28 citations
article · Open access
Federated Learning (FL) has rapidly emerged as a transformative machine learning approach, enabling healthcare institutions to collaboratively build predictive models without compromising patient data privacy. As healthcare increasingly adopts digital technologies, federated learning offers promising solutions to critical issues such as data privacy, security, data poisoning, and adversarial attacks. Despite the recognized potential of FL, significant gaps persist in existing research, particularly concerning comprehensive security frameworks and practical healthcare applications. This bibliometric analysis systematically explores the research landscape from 2021 to 2024, explicitly focusing on data privacy, security threats, and adversarial attacks within federated learning in healthcare. Utilizing bibliometric data from the Scopus database, the study identifies key thematic trends, evaluates global collaborative networks, and assesses contributions from leading institutions and countries. Findings reveal rapidly growing scholarly interest, robust international collaboration, and notable institutional contributions, with a specific emphasis on privacy-preserving techniques, healthcare-specific applications, and emerging technologies such as blockchain and edge computing. The analysis also highlights critical limitations due to incomplete bibliographic metadata. This research provides a comprehensive understanding of current trends and identifies future directions to enhance the security and privacy framework of federated learning in healthcare.
Leveraging Graph Neural Networks for Attack Detection in IoT Systems
Lecture notes in computer science · 2025-01-01
book-chapter · Open access
Enhancing Trust in Central Differential Privacy Using zk-SNARKs and Cryptographic Hashes
Lecture notes on data engineering and communications technologies · 2025-01-01 · 1 citation
book-chapter · Open access
A Survey on Acoustic Side-Channel Attacks: An Artificial Intelligence Perspective
Journal of Cybersecurity and Privacy · 2025-12-29
article · Open access · Senior author
Acoustic Side-Channel Attacks (ASCAs) exploit the sound produced by keyboards and other devices to infer sensitive information without breaching software or network defenses. Recent advances in deep learning, large language models, and signal processing have greatly expanded the feasibility and accuracy of these attacks. To clarify the evolving threat landscape, this survey systematically reviews ASCA research published between January 2020 and February 2025. We categorize modern ASCA methods into three levels of text reconstruction (individual keystrokes, short text such as words or phrases, and long-text regeneration) and analyze the signal processing, machine learning, and language-model decoding techniques that enable them. We also evaluate how environmental factors such as microphone placement, ambient noise, and keyboard design influence attack performance, and we examine the challenges of generalizing laboratory-trained models to real-world settings. This survey makes three primary contributions: (1) it provides the first structured taxonomy of ASCAs based on text generation granularity and decoding methodology; (2) it synthesizes cross-study evidence on environmental and hardware factors that fundamentally shape ASCA performance; and (3) it consolidates emerging countermeasures, including Generative Adversarial Network-based noise masking, cryptographic defenses, and environmental mitigation, while identifying open research gaps and future threats posed by voice-enabled IoT and prospective quantum side-channels. Together, these insights underscore the need for interdisciplinary, multi-layered defenses against rapidly advancing ASCA techniques.
Digital Signature Quantification in the Bitcoin Blockchain: A Statistical Approach
2025-06-18
article
Digital signatures are crucial for blockchain security, ensuring transaction authenticity, integrity, and nonrepudiation. This reliance on secure digital signatures also extends to emerging blockchain applications such as the Internet of Vehicles (IoV) and the Internet of Devices (IoD). However, the emergence of Post-Quantum Cryptography (PQC) poses a potential threat to blockchain performance by significantly increasing the size of digital signatures and public keys. This paper examines the number of signatures per block in the Bitcoin network using a real-world data-driven analysis. We propose an efficient signature counting algorithm that processes transaction data and accounts for all acceptable digital signature formats. To evaluate the methodology and the quality of the data, we apply the Shapiro-Wilk test for normality and use the Coefficient of Variation (CV) to assess data distribution and variability. This research advances blockchain analytics by offering a systematic approach to quantifying digital signatures.
SLIM: Stateless-Based Lightweight Sharding Mechanism for Secure Data Interchange in Blockchain-Enabled Internet of Vehicles
IEEE Transactions on Intelligent Transportation Systems · 2025-12-24
article
The emergence of Blockchain-enabled Internet of Vehicles stands as a critical method for secure data exchange between vehicles and traffic infrastructures. Yet, significant challenges persist, particularly in terms of limited fault tolerance and constraints associated with storage and computational resources. In this study, we introduce the Stateless-based Lightweight Sharding Mechanism (SLIM) for enhanced secure data exchange. This mechanism is built on three key strategies: 1) SLIM incorporates a robust sharding protocol which includes identity establishment and verification, shard formation, and intra-shard consensus to mitigate the impact of Byzantine nodes within each shard. This protocol is designed to reduce the influence of malicious nodes in each shard, thereby improving fault tolerance and security. 2) To address storage challenges, we employ a stateless verification algorithm to allow service nodes in the Internet of Vehicles to authenticate vehicular data transactions without the need to store the entire historical blockchain data. 3) SLIM also integrates a Stackelberg game model for efficient resource distribution in collaborative networks. By offloading the mining process and data storage to the cloud and allowing roadside units to adjust their storage and computing strategies through the game model, the computing resource requirement of roadside units is thus substantially reduced, leading to optimized revenue generation. The security effectiveness of SLIM is theoretically proven. The simulations and experimental results demonstrate that SLIM’s block size is significantly smaller (66.9 times less) than Ethereum’s, and its block verification time is 67% faster compared to conventional stateless blockchains.

Frequent coauthors

Frédérique Biennier
66 shared
Arthur Gatouillat
Laboratoire d'Informatique en Images et Systèmes d'Information
44 shared
Maroun Abi Assaf
40 shared
Youssef Amghar
Institut National des Sciences Appliquées de Lyon
36 shared
Bertrand Massot
Centre National de la Recherche Scientifique
27 shared
Zakaria Maamar
Qatar Science and Technology Park
21 shared
Xiaoyang Zhu
Institut National des Sciences Appliquées de Lyon
20 shared
Robin G. Qiu
Pennsylvania State University
18 shared

Labs

Trustworthy Intelligence and Thinking Machine Lab (THINK Lab) · PI

Awards & honors

2022-2023 Distinguished Research and Scholarship Award, Penn…
2022-2023 Excellence in Teaching Award, Penn State Great Val…
2022 Arthur L. Glenn Award for Excellence in Student Engagem…
2020-2021 Award for Faculty Service, Penn State Great Valley…
2019-2020 Arthur L. Glenn Award for Faculty Innovation, Penn…
