Jimeng Sun

Professor · Verified

University of Illinois Urbana-Champaign · Computer Science

Active 1999–2026

h-index: 82
Citations: 25.5k
Papers: 549 (290 in the last 5 years)
Funding: $2.3M (1 active)
About

Jimeng Sun is a professor at the Siebel School of Computing and Data Science at the University of Illinois Urbana-Champaign. His research interests focus on artificial intelligence (AI) for healthcare, including deep learning for drug discovery, clinical trial optimization, computational phenotyping, clinical predictive modeling, treatment recommendation, and health monitoring. He is involved in data and information systems, bioinformatics, and computational biology, applying AI techniques to advance healthcare solutions. Sun has contributed to the development of deep learning methods specifically tailored for healthcare applications and has been recognized for his influence in the field. He has taught courses related to deep learning and AI in medicine and actively engages in research that leverages AI to improve clinical outcomes and accelerate medical research processes.

Research topics

  • Computer Science
  • Artificial Intelligence
  • Data Mining
  • Machine Learning
  • Political Science
  • Data science
  • Economics
  • Geography
  • Econometrics
  • Medicine
  • Business
  • Operations research
  • Engineering
  • Actuarial science
  • Cognitive science
  • Bioinformatics
  • Programming language
  • Meteorology
  • Mathematics
  • Environmental health
  • Statistics
  • World Wide Web
  • Biology

Selected publications

  • SocialStep: Fast Prediction of Social Determinants of Health

    2026-04-30

    article · Senior author
  • Accelerating clinical evidence synthesis with large language models

    npj Digital Medicine · 2025-08-07 · 29 citations

    article · Open access · Senior author

    Clinical evidence synthesis largely relies on systematic reviews (SR) of clinical studies from medical literature. Here, we propose a generative artificial intelligence (AI) pipeline named TrialMind to streamline study search, study screening, and data extraction tasks in SR. We chose published SRs to build TrialReviewBench, which contains 100 SRs and 2,220 clinical studies. For study search, it achieves high recall rates (Ours 0.711-0.834 vs. Human baseline 0.138-0.232). For study screening, TrialMind beats previous document ranking methods in a 1.5-2.6 fold change. For data extraction, it outperforms GPT-4's accuracy by 16-32%. In a pilot study, human-AI collaboration with TrialMind improved recall by 71.4% and reduced screening time by 44.2%, while in data extraction, accuracy increased by 23.5% with a 63.4% time reduction. Medical experts preferred TrialMind's synthesized evidence over GPT-4's in 62.5%-100% of cases. These findings show the promise of accelerating clinical evidence synthesis driven by human-AI collaboration.

  • s3: You Don't Need That Much Data to Train a Search Agent via RL

    ArXiv.org · 2025-05-20

    preprint · Open access

    Retrieval-augmented generation (RAG) systems empower large language models (LLMs) to access external knowledge during inference. Recent advances have enabled LLMs to act as search agents via reinforcement learning (RL), improving information acquisition through multi-turn interactions with retrieval engines. However, existing approaches either optimize retrieval using search-only metrics (e.g., NDCG) that ignore downstream utility or fine-tune the entire LLM to jointly reason and retrieve, entangling retrieval with generation and limiting the real search utility and compatibility with frozen or proprietary models. In this work, we propose s3, a lightweight, model-agnostic framework that decouples the searcher from the generator and trains the searcher using a Gain Beyond RAG reward: the improvement in generation accuracy over naive RAG. s3 requires only 2.4k training samples to outperform baselines trained on over 70x more data, consistently delivering stronger downstream performance across six general QA and five medical QA benchmarks.

  • BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research

    ArXiv.org · 2025-05-22

    preprint · Open access · Senior author

    Validating scientific hypotheses is a central challenge in biomedical research, and remains difficult for artificial intelligence (AI) agents due to the complexity of real-world data analysis and evidence interpretation. In this work, we present BioDSA-1K, a benchmark designed to evaluate AI agents on realistic, data-driven biomedical hypothesis validation tasks. BioDSA-1K consists of 1,029 hypothesis-centric tasks paired with 1,177 analysis plans, curated from over 300 published biomedical studies to reflect the structure and reasoning found in authentic research workflows. Each task includes a structured hypothesis derived from the original study's conclusions, expressed in the affirmative to reflect the language of scientific reporting, and one or more pieces of supporting evidence grounded in empirical data tables. While these hypotheses mirror published claims, they remain testable using standard statistical or machine learning methods. The benchmark enables evaluation along four axes: (1) hypothesis decision accuracy, (2) alignment between evidence and conclusion, (3) correctness of the reasoning process, and (4) executability of the AI-generated analysis code. Importantly, BioDSA-1K includes non-verifiable hypotheses: cases where the available data are insufficient to support or refute a claim, reflecting a common yet underexplored scenario in real-world science. We propose BioDSA-1K as a foundation for building and evaluating generalizable, trustworthy AI agents for biomedical discovery.

  • Regulatory Effects of Paclobutrazol and Uniconazole Mixture on the Morphology and Biomass Allocation of Amorpha fruticosa Seedlings

    Plants · 2025-12-03 · 1 citation

    article · Open access

    Global climate change has intensified land desertification in the arid and semi-arid regions of northwestern China, highlighting the urgent need to cultivate plant species with ideal architecture and well-developed root systems to combat ecosystem degradation. Amorpha fruticosa is widely used as a windbreak and sand-fixation shrub; however, its rapid growth and high transpiration during the early planting stage often result in excessive water loss, low survival rates, and limited vegetation restoration effectiveness. Plant growth retardants (PGRts) are known to suppress apical dominance and promote branching. In this study, one-year-old A. fruticosa seedlings were treated with different combinations of paclobutrazol (PP333) and uniconazole (S3307) to investigate their effects on plant morphology and biomass allocation; it aims to determine the optimal formula for cultivating shrub structures with excellent windbreak and sand-fixation effects in land desertification areas. The results showed that both PP333 and S3307 significantly inhibited plant height while promoting basal stem diameter, branching, and root development. Among all treatments, the S3307 200 mg·L−1 + PP333 200 mg·L−1 combination (SD3) was the most effective, resulting in the greatest increases in basal diameter, branch number, total root length, and root-to-shoot ratio, while significantly reducing height increment, leaf length and leaf area (p < 0.05). Under the S3307 200 mg·L−1 + PP333 300 mg·L−1 treatment (SD4), leaf width and specific leaf area were reduced by 17.92% and 38.89%, respectively, compared with the control. Correlation analysis revealed significant positive or negative relationships among most growth traits, with leaf length negatively correlated with other morphological indicators. Fresh and dry weights of both aboveground and root tissues were significantly positively correlated with basal diameter (R = 0.38) and branch basal diameter (R = 0.33). 
    Principal component analysis demonstrated that the SD3 treatment achieved the highest comprehensive score (2.91), indicating its superiority in promoting a compact yet robust plant architecture. Overall, the SD3 treatment improved drought resistance and sand-fixation capacity of A. fruticosa by “dwarfing and strengthening plants while optimizing root–shoot allocation.” These findings provide theoretical support for large-scale cultivation and vegetation restoration in arid and semi-arid regions and offer a technical reference for growth regulation and windbreak and sand-fixation capacity in other xerophytic shrub species.

  • MRI2PET: Realistic PET Image Synthesis from MRI for Automated Inference of Brain Atrophy and Alzheimer’s

    medRxiv · 2025-04-25 · 3 citations

    preprint · Open access

    Background: Positron Emission Tomography (PET) scans are a crucial tool in the diagnosing and monitoring of a number of complex conditions, including cancer, heart health, and especially cognitive brain function. However, they are also often much more expensive than comparable imaging modalities such as X-Ray and magnetic resonance imaging (MRI), which can limit their availability and the impact of their use in both medical and machine learning settings. We propose to address this problem by using generative models to simulate the PET scan results based on prior MRI. Methods: While recent work has yielded impressive realism in image generation, this PET synthesis task presents a series of technical challenges based on the scarcity of paired data as well as the complexity and nuance of the 3D images. So, we propose MRI2PET to generate AV45-PET scans from T1-weighted MRI images. MRI2PET is a 3D diffusion-based method which makes use of style transferred pre-training and a Laplacian pyramid loss to address these challenges by utilizing larger available unpaired MRI datasets and structural similarities between the MRI and PET images while simultaneously emphasizing the crucial details. Findings: We evaluate MRI2PET through a series of studies on the ADNI dataset where we show that it both generates realistic images and improves clinically-based disease classification. When compared to training on only the original AV45-PET data, MRI2PET augmentation increases AUROC of brain scan classification to 0.780 ± 0.005 from 0.688 ± 0.014 when classifying brain scans into one of three clinically defined groups: cognitively normal, mild cognitive impairment, and Alzheimer's Disease. Interpretation: The capability to generate high quality, clinically relevant PET scans from MRI has the potential to expand the utility of cost-effective and accessible imaging workflows and improve both image-based machine learning capabilities and patient care. 
    Funding: US National Institute on Aging, US National Institutes of Health, US National Science Foundation.

  • CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making

    ArXiv.org · 2025-06-15

    preprint · Open access

    In medical visual question answering (Med-VQA), achieving accurate responses relies on three critical steps: precise perception of medical imaging data, logical reasoning grounded in visual input and textual questions, and coherent answer derivation from the reasoning process. Recent advances in general vision-language models (VLMs) show that large-scale reinforcement learning (RL) could significantly enhance both reasoning capabilities and overall model performance. However, their application in medical domains is hindered by two fundamental challenges: 1) misalignment between perceptual understanding and reasoning stages, and 2) inconsistency between reasoning pathways and answer generation, both compounded by the scarcity of high-quality medical datasets for effective large-scale RL. In this paper, we first introduce Med-Zero-17K, a curated dataset for pure RL-based training, encompassing over 30 medical image modalities and 24 clinical tasks. Moreover, we propose a novel large-scale RL framework for Med-VLMs, Consistency-Aware Preference Optimization (CAPO), which integrates rewards to ensure fidelity between perception and reasoning, consistency in reasoning-to-answer derivation, and rule-based accuracy for final responses. Extensive experiments on both in-domain and out-of-domain scenarios demonstrate the superiority of our method over strong VLM baselines, showcasing strong generalization capability to 3D Med-VQA benchmarks and R1-like training paradigms.

  • RDMA: Cost Effective Agent-Driven Rare Disease Mining from Electronic Health Records

    ArXiv.org · 2025-07-14

    preprint · Open access · Senior author

    Rare diseases affect 1 in 10 Americans yet remain systematically underdocumented in clinical records. ICD-based systems cannot capture their breadth: over 50% of Orphanet codes lack a direct ICD mapping and only 2.2% of HPO codes have matching ICD codes, leaving patient populations invisible and delaying diagnosis. Mining unstructured clinical notes offers a direct path forward, but real notes are long, noisy, and abbreviation-dense, and limited annotations make fine-tuning infeasible, demanding approaches that generalize without task-specific training. We present Rare Disease Mining Agents (RDMA), an agentic framework equipping smaller quantized LLMs with tools for abbreviation resolution, implicit phenotype reasoning, and ontology grounding against Orphanet and HPO. RDMA substantially outperforms fine-tuned and RAG-based baselines across benchmarks with different data characteristics, without any task-specific training. A small quantized model achieves maximal performance, reducing inference costs by up to 10x and local hardware costs by up to 17x, enabling private deployment on standard hardware without cloud-based PHI exposure. RDMA's uncertainty-flagging mechanism further reduces expert annotation burden while preserving agreement quality, supporting scalable rare disease documentation in clinical practice. Available at https://github.com/jhnwu3/RDMA.

  • MediSim: Multi-granular simulation for enriching longitudinal, multi-modal electronic health records

    Patterns · 2025-05-08 · 1 citation

    article · Open access · Senior author

    We introduce MediSim, a multi-modal generative model for simulating and augmenting electronic health records across multiple modalities, including structured codes, clinical notes, and medical imaging. MediSim employs a multi-granular, autoregressive architecture to simulate missing modalities and visits and iterative, reinforcement learning-based training to improve simulation in low-data settings. Additionally, it utilizes encoder-decoder model pairs to handle complex modalities like notes and images. Experiments on outpatient claims and inpatient ICU datasets have demonstrated MediSim's superiority over baselines in predicting missing codes, creating enriched data, and improving downstream predictive modeling. Specifically, MediSim improved over 74% on missing code prediction, enabled up to 65% better downstream predictive performance compared to original deficient records missing either some visits or entire data modalities, and successfully produced realistic note and X-ray samples for use in downstream tasks. MediSim's ability to generate comprehensive, high-dimensional EHR data has the potential to significantly improve AI applications throughout healthcare.

  • MEDS: Building Models and Tools in a Reproducible Health AI Ecosystem

    2025-08-03

    article · Open access

    KDD ’25, Toronto, ON, Canada

Recent grants

Frequent coauthors

  • Cao Xiao

    162 shared
  • Lucas M. Glass

    IQVIA (United States)

    106 shared
  • M. Brandon Westover

    Harvard University

    44 shared
  • Tianfan Fu

    Rensselaer Polytechnic Institute

    37 shared
  • Shenda Hong

    36 shared
  • Walter F. Stewart

    35 shared
  • Chaoqi Yang

    29 shared
  • Christos Faloutsos

    Carnegie Mellon University

    28 shared

Labs

  • Siebel School of Computing and Data Science · PI

Education

  • Ph.D., Computer Science

    University of Illinois at Urbana-Champaign

    2006
  • M.S., Computer Science

    University of Illinois at Urbana-Champaign

    2002
  • B.S., Computer Science

    University of Science and Technology of China

    1999

Awards & honors

  • Jimeng Sun and Kaiyu Guan rank among 12 Illinois scientists…