Jian Pei

Verified

Duke University · Electrical and Computer Engineering

Active 1987–2025

h-index: 96
Citations: 81.6k
Papers: 836 (272 in the last 5 years)

About

Jian Pei is a professor in the field of computer science with research interests that include large language model (LLM)-based agents. His work focuses on the generalizability of these agents, which refers to their ability to maintain consistently high performance across varied instructions, tasks, environments, and domains, especially those different from the agent’s fine-tuning data. Pei has contributed to advancing the understanding of generalizability by providing comprehensive reviews that clarify its definition and boundaries, review existing benchmarks, and categorize strategies for improving generalizability. These strategies include methods targeting the backbone LLM, agent components, and their interactions. Furthermore, his research distinguishes between generalizable frameworks and generalizable agents, outlining how frameworks can be translated into agent-level generalizability. Pei’s work aims to establish a foundation for principled research on building LLM-based agents that generalize reliably across diverse real-world applications, identifying future directions such as standardized evaluation frameworks, variance- and cost-based metrics, and hybrid approaches integrating methodological innovations with agent architecture-level designs.

Research topics

  • Computer Science
  • Data Mining
  • Artificial Intelligence
  • Machine Learning
  • Data science

Selected publications

  • Safety and Hemodynamic Efficacy of the LVIS Stent in the Endovascular Treatment of Intracranial Wide-Necked Aneurysms: A Single-Center Retrospective Study

    Research Square · 2025-12-16

    preprint · Open access

  • The 2nd Workshop on Large Language Models for E-Commerce

    2025-08-03

    article

    Large Language Models (LLMs) are revolutionizing E-Commerce by enabling product recommendation, search, classification, question answering, and advertising applications. Their increasing adoption in real-world systems underscores their potential; however, challenges persist in ensuring accuracy, efficiency, fairness, and privacy. This workshop aims to bring together researchers and industry practitioners to explore both the limitations and opportunities of LLMs in e-commerce. The workshop seeks to foster collaboration, bridge the gap between academia and industry, and drive innovation in the application of LLMs to E-Commerce through discussions on model design, algorithmic advancements, and practical deployment.

  • AI4DE: The 1st International Workshop on AI for Data Editing

    2025-08-03

    article · Open access · Senior author

    Machine learning traditionally emphasizes developing models for given datasets, but real-world data is often messy, making model improvement insufficient for enhancing performance. AI for data editing (AI4DE) is an emerging field that systematically improves datasets, leading to significant practical ML advancements. While experienced data scientists have manually refined datasets through trial-and-error and intuition, AI4DE approaches data enhancement as a systematic engineering discipline. AI4DE represents a shift from focusing on models to the underlying data used for training and evaluation. Despite the dominance of common model architectures and predictable scaling rules, building and using datasets remain labor-intensive and costly, lacking infrastructure and best practices. The AI4DE movement aims to develop efficient, high-productivity open data engineering tools for modern ML systems. This workshop seeks to foster an interdisciplinary AI4DE community to address practical data challenges, including data collection, generation, labeling, preprocessing, augmentation, quality evaluation, debt, and governance. By defining and shaping the AI4DE movement, this workshop aims to influence the future of AI and ML, inviting interested parties to contribute through paper submissions.

  • Research on a lightweight traffic sign detection algorithm based on GCL-YOLOv8

    2025-09-19

    article · 1st author · Corresponding

    A lightweight traffic sign detection algorithm, GCL-YOLOv8, is proposed as an improvement on YOLOv8n to address low detection accuracy and a large parameter count, which lead to long computation times and high complexity. First, a new module, GRA, is designed using GhostModule, RepConv, and ECA channel attention to replace the original C2f module, balancing inference efficiency against feature expression ability and reducing computational complexity while preserving accuracy. Second, the lightweight CARAFE upsampling module replaces ordinary upsampling, applying a content-aware reassembly mechanism to amplify and preserve feature details. Third, an LADH detection head is added to reduce the parameter count and computational complexity. Finally, Wise MPDIOU replaces the original CIoU loss function, simplifying the loss calculation and improving convergence speed and regression accuracy. Compared with the baseline YOLOv8n on the TT100K dataset, the algorithm improves precision (P) by 2.5% and mAP50 by 2.7%, while reducing the parameter count (Params) by 42.3% and GFLOPs by 48.8%. These results demonstrate that the improved algorithm is both lightweight and accurate.

  • CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets

    2025-05-19

    article · Senior author

    This paper introduces the novel concept of cost-sensitive data acquisition (CDA), a desirable addition to the data preparation process in a data science pipeline that focuses on strategically acquiring data from various priced sources, such as data markets, under budget constraints. CDA improves data quality by identifying the best set of values to acquire and integrating them into incomplete datasets, optimizing a particular objective defined in the resulting tables (data products). This paper focuses on CDA for a single relational table while also exploring possible extensions to multi-table contexts. First, we introduce an algorithm that utilizes conformal risk control to select rows likely to be included in the data product with probabilistic guarantees. We then investigate ways to acquire data to complete these rows under various CDA scenarios. We start with a scenario where data records are available on a row-wise basis, which proves to be an NP-hard problem. To solve this problem, we introduce an efficient row-wise greedy algorithm (RGreedy), which approaches an approximation ratio of 1. Subsequently, we explore a more generic scenario where each unit of data for acquisition may involve multiple records with a subset of the attributes. We propose a coverage minimum option selection (CMOS) algorithm for its solution, focusing on scalability. Through empirical evaluations on three real-world datasets and one synthetic dataset, we demonstrate that our methods yield performance improvements of 20 % to 40 % over applicable baselines.

  • Computing Shapley Values in Preference Queries

    2025-05-19

    article

    This paper tackles the novel problem of computing Shapley values when multiple data owners collaborate to answer preference queries. Despite extensive existing research on preference queries and Shapley value computation separately, the evaluation of data owners' contributions to cooperatively answering such queries has not been systematically explored. To address this gap, we first establish that, for a linear preference utility function with one data point per owner, the Shapley value can be computed in polynomial time. This finding is applicable to attribute weight spaces that are subsets of a simplex and represent various linear preference utility functions. For scenarios involving multiple data points per owner, we observe that only the locally optimal points from each data owner can make non-zero marginal contributions. Thus, we partition the attribute weight space into a polynomial number of subsets, ensuring that in each subset, only one data point per owner needs to be considered. Experimental results on real Airbnb Listing data and synthetic data sets validate the effectiveness and efficiency of our algorithms, which significantly outperform baseline methods.

  • A Survey on Small Language Models in the Era of Large Language Models: Architecture, Capabilities, and Trustworthiness

    2025-08-03 · 5 citations

    article · Open access

    Large language models (LLMs) based on the Transformer architecture are powerful but face challenges with deployment, inference latency, and costly fine-tuning. These limitations highlight the emerging potential of small language models (SLMs), which can either replace LLMs through innovative architectures and technologies, or assist them as efficient proxy or reward models. Emerging architectures such as Mamba and xLSTM address the quadratic scaling of inference with window length in Transformers by enabling linear scaling. To maximize SLM performance, test-time compute scaling strategies reduce the performance gap with LLMs by allocating extra compute budget during test time. Beyond standalone usage, SLMs could also assist LLMs via weak-to-strong learning, proxy tuning, and guarding, fostering secure and efficient LLM deployment. Lastly, the trustworthiness of SLMs remains a critical yet underexplored research area. However, there is a lack of tutorials on cutting-edge SLM technologies, prompting us to conduct one.

  • LILRB4 in tumor-associated macrophage regulates macrophage polarization and glioblastoma progression via STAT3/IL10 axis

    Gene · 2025-10-01 · 6 citations

    article · 1st author

  • Sensitive immunosensing of melanoma biomarker based on enhanced electrochemiluminescence via electronic metal-support interactions

    Frontiers in Chemistry · 2025-12-12 · 3 citations

    article · Open access · 1st author

    Developing a highly sensitive and convenient immunosensor for biomarker detection is important for enhancing the effectiveness of melanoma prevention and control measures. In this work, an immunosensor was fabricated for sensitive detection of the melanoma biomarker S100B based on enhanced electrochemiluminescence (ECL) via electronic metal-support interactions. CoAl-layered double hydroxide (LDH) was selected to modify the low-cost indium tin oxide (ITO) electrode due to its high surface area and tunable structure. To improve its conductivity and electron transfer capability, oxygen vacancies (Ov) were introduced on LDH through an alkaline etching process, resulting in the LDH-Ov structure. Platinum nanoparticles (Pt) were then loaded in situ onto the LDH-Ov surface (Pt@LDH-Ov/ITO). The electronic metal-support interaction (EMSI) between LDH-Ov and Pt nanoparticles played a critical role in improving the catalytic activity, leading to an enhanced ECL signal in the luminol-dissolved oxygen (DO) system. The immunorecognition interface was fabricated on Pt@LDH-Ov/ITO, enabling selective detection of S100B. The constructed immunosensor exhibited a linear detection range for S100B from 100 fg/mL to 100 ng/mL, with a limit of detection (LOD) of 65 fg/mL. The high performance and enhanced sensitivity of the immunosensor make it a promising tool for the early diagnosis, monitoring of recurrence, and personalized treatment of melanoma.

  • Finding Non-Redundant Simpson's Paradox from Multidimensional Data

    ArXiv.org · 2025-11-02

    preprint · Open access

    Simpson's paradox, a long-standing statistical phenomenon, describes the reversal of an observed association when data are disaggregated into sub-populations. It has critical implications across statistics, epidemiology, economics, and causal inference. Existing methods for detecting Simpson's paradox overlook a key issue: many paradoxes are redundant, arising from equivalent selections of data subsets, identical partitioning of sub-populations, and correlated outcome variables, which obscure essential patterns and inflate computational cost. In this paper, we present the first framework for discovering non-redundant Simpson's paradoxes. We formalize three types of redundancy - sibling child, separator, and statistic equivalence - and show that redundancy forms an equivalence relation. Leveraging this insight, we propose a concise representation framework for systematically organizing redundant paradoxes and design efficient algorithms that integrate depth-first materialization of the base table with redundancy-aware paradox discovery. Experiments on real-world datasets and synthetic benchmarks show that redundant paradoxes are widespread, on some real datasets constituting over 40% of all paradoxes, while our algorithms scale to millions of records, reduce run time by up to 60%, and discover paradoxes that are structurally robust under data perturbation. These results demonstrate that Simpson's paradoxes can be efficiently identified, concisely summarized, and meaningfully interpreted in large multidimensional datasets.

Frequent coauthors

  • Yang Yu

    666 shared
  • Enhong Chen

    University of Science and Technology of China

    666 shared
  • Zhi‐Hua Zhou

    Nanjing University

    652 shared
  • João Gama

    INESC TEC

    651 shared
  • Chengqi Zhang

    651 shared
  • Geoffrey I. Webb

    650 shared
  • Hiroshi Motoda

    Osaka University

    650 shared
  • Jaideep Srivastava

    649 shared

Education

  • Ph.D., Computer Science

    University of California, Berkeley

    1994
  • M.S., Computer Science

    University of California, Berkeley

    1991
  • B.S., Computer Science

    University of Science and Technology of China

    1988

Awards & honors

  • IEEE Fellow