David Carlson
Yoh Family Associate Professor of Civil and Environmental Engineering · Duke University · Biostatistics and Bioinformatics
Active 1964–2026
About
Professor David Carlson is associated with the Duke Center for Combinatorial Gene Regulation, which is an NIH Center of Excellence in Genomic Science. The center aims to make combinatorial studies of the function of regulatory elements and noncoding variants routine. Its goal is to develop a systematic and comprehensive framework for determining the effects of combinations of regulatory elements on gene expression and downstream phenotypes. This work is expected to transform human disease studies by prioritizing genes for targeted study and drug design, and by predicting the pathogenicity of noncoding variants that have not been observed, which is crucial for whole genome sequencing studies of both rare and common diseases.
Research topics
- Computer Science
- Artificial Intelligence
- Genetics
- Psychology
- Biology
- Neuroscience
- Environmental science
- Meteorology
- Geography
- Computer vision
- Remote sensing
- Internal medicine
- Physiology
- Cognitive psychology
- Engineering
- Medicine
Selected publications
Journal of Neurology Neurosurgery & Psychiatry · 2026-04-08
article
BACKGROUND: Despite advances in epilepsy surgery, seizure freedom is achieved in only ~50-70% of cases, highlighting the need to better understand factors driving surgical success. METHODS: A preregistered systematic review and individual patient data meta-analysis was conducted on studies reporting clinical outcomes in epilepsy surgery, based on a comprehensive literature search through August 2024. Data were extracted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. Unique patient data from 385 studies were pooled, yielding 5588 patients with outcomes, localisation, demographics, pathology and other findings. Surgical success rates (% Engel 1/ILAE 1-2) were reported with 95% Wald CIs. Associations with patient- and disease-specific factors were assessed using chi-squared tests (p<0.05), effect sizes with Cramer's V, and post hoc comparisons adjusted using the false discovery rate. RESULTS: Surgical success varied by lobar anatomy (χ²=52, p<0.001, V=0.12), with the highest success rates in temporal (68.6% (67.0% to 70.1%)) and insular lobes (66.2% (55.4% to 77.0%)). Multilobar resections had lower success rates, with outcomes varying by lobar combination (χ²=25, p=0.02, V=0.22). Variability in outcomes was influenced by histopathology and MRI findings (χ²=121, p<0.001, V=0.16; highest success in tumours (78.2% (74.9% to 81.6%))) and by surgical intervention (χ²=30.5, p<0.001, V=0.07; lowest success with corpus callosotomy (43.4% (35.4% to 51.5%))). Overall surgical success rates remained stable over time (r=0.25, p=0.13), despite surgery being extended to more complex patients. CONCLUSIONS: These findings inform surgical planning for drug-resistant epilepsy, emphasising individual patient characteristics to guide personalised treatment, improve outcomes and reflect the growing complexity of intersecting factors. PROSPERO REGISTRATION NUMBER: CRD42024530397.
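The 95% Wald intervals quoted in this abstract use the normal-approximation formula for a proportion, p ± z·sqrt(p(1 − p)/n). A minimal sketch follows; the helper name `wald_ci` and the counts in the example are hypothetical, since the per-group patient counts are not given here.

```python
import math


def wald_ci(successes, n, z=1.96):
    """Return (lower, upper) Wald interval for a binomial proportion.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clip to [0, 1]: the Wald interval can overshoot for extreme proportions.
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts, for illustration only (not taken from the study).
low, high = wald_ci(50, 100)
print(f"{low:.1%} to {high:.1%}")
```

The Wald interval is simple but can be unreliable for small groups or proportions near 0 or 1, which is one reason intervals for small subgroups tend to be wide.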
CD4+ mucosal-associated invariant T cells express highly diverse T cell receptors
The Journal of Immunology · 2025-11-09 · 3 citations
article · Open access
Mucosal-associated invariant T (MAIT) cells are highly conserved innate-like T cells in mammals recognized for their high baseline frequency in human blood and cytotoxic effector functions during infectious diseases, autoimmunity, and cancer. While the majority of these cells in humans express a conserved CD8αβ+ TRAV1-2 T cell receptor (TCR) recognizing microbially derived vitamin B2 intermediates presented by the evolutionarily conserved major histocompatibility complex class I-related molecule, MR1, there is an emerging appreciation for diverse MAIT cell subsets that possess distinct functions including CD4+ MAIT cells that remain underexplored. In this study, we adopted an unbiased single-cell TCR-sequencing approach in MR1-5-OP-RU-tetramer-reactive T cells. We discovered that CD4+ MAIT cells are enriched with highly diverse TRAV1-2- TCRs. To specifically characterize this TCR repertoire, we analyzed VDJ sequences across 2 datasets and identified distinct TCR usage among CD4+ MAIT cells including TRAV21, TRAV8 (TRAV8-1, TRAV8-2, TRAV8-3), and TRAV12 families (TRAV12-2, TRAV12-3), as well as more variable J segment, CDR3α, and TRBV sequences. TRAV1-2- MAIT cell TCRs were also enriched after in vitro culture with interleukin-2 and Mycobacterium tuberculosis. These results indicate that mature human CD4+ MAIT cells adopt distinct TCR usage from the canonical TRAV1-2+ CD8+ subset and suggest that alternative MR1 ligands in addition to riboflavin intermediates may select for them.
AJE Advances Research in Epidemiology · 2025-09-24
article · Open access · Senior author
Childhood obesity is a major risk factor for adult cardiovascular disease. Current obesity-prediction models were not developed in diverse populations and do not include heterogeneous social, environmental, and climate factors that may impact body mass index across the full pediatric spectrum. Additionally, they consider only the immediate neighborhood within which a child lives, ignoring contextual factors from expanded (ie, distal) neighborhoods. This study uses expanded neighborhoods’ social, environmental, and climate data to improve individual-level body mass index prediction, from underweight through obesity, using a novel machine learning approach. We obtained demographic and clinical data from the electronic health records of the Duke University Health System, identifying 12,226 children aged 6–18 years in Durham County, North Carolina, with body mass index data from 2014 to 2022. Participants’ data were linked to socioeconomic and environmental information at the census block group level. We captured expanded neighborhood effects with a graph neural network and combined this information with individual-level factors to predict body mass index. Our model predicted body mass index more accurately than simpler models for children aged 6–11 (R² = 0.234, mean absolute error = 3.352, root mean square error = 4.370) and 12–18 (R² = 0.147, mean absolute error = 4.980, root mean square error = 6.343) using all features. Key predictive factors identified included rent burden, poverty rate, and tree coverage. This research highlights the value of including broader socioeconomic and environmental factors in body mass index prediction, offering insights that could guide targeted, neighborhood-level interventions.
ArXiv.org · 2025-11-20
preprint · Open access · Senior author
Estimating counterfactual outcomes from time-series observations is crucial for effective decision-making, e.g. when to administer a life-saving treatment, yet remains significantly challenging because (i) the counterfactual trajectory is never observed and (ii) confounders evolve with time and distort estimation at every step. To address these challenges, we propose a novel framework that synergistically integrates two complementary approaches: Sub-treatment Group Alignment (SGA) and Random Temporal Masking (RTM). Instead of the coarse practice of aligning marginal distributions of the treatments in latent space, SGA uses iterative treatment-agnostic clustering to identify fine-grained sub-treatment groups. Aligning these fine-grained groups achieves improved distributional matching, thus leading to more effective deconfounding. We theoretically demonstrate that SGA optimizes a tighter upper bound on counterfactual risk and empirically verify its deconfounding efficacy. RTM promotes temporal generalization by randomly replacing input covariates with Gaussian noise during training. This encourages the model to rely less on potentially noisy or spuriously correlated covariates at the current step and more on stable historical patterns, thereby improving its ability to generalize across time and better preserve underlying causal relationships. Our experiments demonstrate that while applying SGA and RTM individually improves counterfactual outcome estimation, their synergistic combination consistently achieves state-of-the-art performance. This success comes from their distinct yet complementary roles: RTM enhances temporal generalization and robustness across time steps, while SGA improves deconfounding at each specific time point.
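The masking idea described for RTM can be illustrated with a short sketch. This is not the authors' implementation: the function name `random_temporal_mask`, the per-time-step masking granularity, and the `mask_prob` and `noise_std` parameters are all assumptions made for illustration, since the abstract states only that input covariates are randomly replaced with Gaussian noise during training.

```python
import random


def random_temporal_mask(sequence, mask_prob=0.3, noise_std=1.0, rng=None):
    """Return a copy of `sequence` with some time steps replaced by noise.

    sequence: list of time steps, each a list of float covariates.
    mask_prob: hypothetical probability of masking a given time step.
    noise_std: hypothetical standard deviation of the zero-mean noise.
    """
    rng = rng or random.Random()
    masked = []
    for covariates in sequence:
        if rng.random() < mask_prob:
            # Replace the whole covariate vector at this step with noise.
            masked.append([rng.gauss(0.0, noise_std) for _ in covariates])
        else:
            # Keep the observed covariates unchanged (copied, not aliased).
            masked.append(list(covariates))
    return masked


# Three time steps with two covariates each; values are made up.
series = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
augmented = random_temporal_mask(series, mask_prob=0.5, rng=random.Random(0))
print(augmented)
```

In a training loop, a transformation like this would be applied to each input sequence before the forward pass and disabled at evaluation time.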
Scientific Reports · 2025-08-22
article · Open access
Electroencephalography (EEG) recordings with visual stimuli require detailed coding to determine the periods of participants' attention. Here we propose to use a supervised machine learning model and off-the-shelf video cameras only. We extract computer vision-based features such as head pose, gaze, and face landmarks from the video of the participant, and train the machine learning model (multi-layer perceptron) on an initial dataset, then adapt it with a small subset of data from a new participant. Using a sample size of 23 autistic children with and without co-occurring ADHD (attention-deficit/hyperactivity disorder) aged 49-95 months, and training on an additional 2560 labeled frames (equivalent to 85.3 s of the video) of a new participant, the median area under the receiver operating characteristic curve for inattention detection was 0.989 (IQR 0.984-0.993) and the median inter-rater reliability (Cohen's kappa) with a trained human annotator was 0.888. Agreement with human annotations for nine participants was in the 0.616-0.944 range. Our results demonstrate the feasibility of automatic tools to detect inattention during EEG recordings, and their potential to reduce the subjectivity and time burden of human attention coding. The tool for model adaptation and visualization of the computer vision features is made publicly available to the research community.
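Cohen's kappa, the agreement measure reported in this abstract, corrects raw agreement for the agreement expected by chance: kappa = (p_o − p_e) / (1 − p_e). A minimal sketch, assuming per-frame binary labels; the helper name `cohens_kappa` and the label values are hypothetical and not taken from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Compute Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1 - expected)


# Made-up labels: 1 = inattentive frame, 0 = attentive frame.
model_labels = [1, 1, 0, 0]
human_labels = [1, 0, 0, 0]
print(cohens_kappa(model_labels, human_labels))
```

Here the two raters agree on 3 of 4 frames (p_o = 0.75) and chance agreement is p_e = 0.5, so kappa is 0.5.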
Open Forum Infectious Diseases · 2025-01-29
article · Open access
Background: Babesiosis is a potentially life-threatening disease transmitted by ticks. However, host and parasite determinants of human babesiosis are completely unknown. In the current study we evaluated the impact of B. microti infection, the primary cause of human babesiosis in the US, on the peripheral blood transcriptome.
Methods: Patients with confirmed B. microti acute infection were enrolled into a 1-year follow-up cohort study at Stony Brook University Hospital, NY. Total RNA was isolated from blood using PaxGene RNA tubes and library preparation performed using Illumina's Stranded Total RNAPrep with Ribo-zero plus kit to remove human ribosomal RNA and globin RNA. 100 M RNA paired end reads were sequenced (2 x 151 bps). Low expression RNAs with fewer than 10 reads in 17 samples were filtered out prior to identification of differentially expressed genes (DEGs) (p < 0.05).
Results: The peripheral blood transcriptome from 20 healthy controls was compared to 42 patients at presentation for B. microti infection and 1, 3, 6, 9 and 12 months after treatment for patients who completed the longitudinal protocol after the first visit. A total of 1860 genes or noncoding RNAs were differentially expressed in patients with babesiosis at presentation (V0) compared to healthy controls; 370 were decreased in patients with babesiosis and 1490 were increased (adj p < 0.01; expression fold change > +/- 1). PCA analysis (Fig. 1) and the Volcano Plot (Fig. 2) demonstrate that patients with babesiosis have a distinct peripheral blood gene expression profile compared to healthy control individuals. The heat map of DEGs in Fig. 3 shows the peripheral blood transcriptome in patients with babesiosis largely returned to baseline levels by 1 month post-treatment. The Qiagen IPA Graphical Summary based on DEGs is shown in Fig. 4 and highlights activation of diverse immune mediators, including TNF and interferon families, as well as growth factors and vascularization pathways, in response to B. microti infection.
Conclusion: Infection with B. microti results in marked changes in expression of genes and noncoding RNA in peripheral blood that largely return to baseline by 1 month post-treatment.
Figure 1. PCA of top 500 differentially expressed genes. PCA distinguishes Babesia patients at presentation from healthy controls.
Figure 2. Volcano Plot comparing differential gene and noncoding RNA expression between healthy controls and patients with babesiosis at presentation. Red circles represent genes whose expression is increased or decreased in patients with babesiosis at presentation compared to healthy controls. Blue circles represent genes whose expression is not significantly different.
Figure 3. Heat map of differentially expressed genes of patients with babesiosis at presentation (0) and 1, 3, 6, 9 and 12 months post-treatment compared to healthy control individuals. Gene expression in patients with babesiosis is markedly different at presentation but is similar to levels for healthy controls by 1 month post-treatment.
Figure 4. Qiagen IPA Graphical Summary of pathways likely increased (orange) or decreased (blue) in patients presenting with babesiosis.
Disclosures: Paul Arnaboldi, PhD, Biopeptides, Corp: Patent Holder | Biopeptides, Corp: Employee | DiaSorin: Advisor/Consultant
CARE: Turning LLMs Into Causal Reasoning Expert
ArXiv.org · 2025-11-20
preprint · Open access · Senior author
Large language models (LLMs) have recently demonstrated impressive capabilities across a range of reasoning and generation tasks. However, research studies have shown that LLMs lack the ability to identify causal relationships, a fundamental cornerstone of human intelligence. We first conduct an exploratory investigation of LLMs' behavior when asked to perform a causal-discovery task and find that they mostly rely on the semantic meaning of variable names, ignoring the observation data. This is unsurprising, given that LLMs were never trained to process structural datasets. To first tackle this challenge, we prompt the LLMs with the outputs of established causal discovery algorithms designed for observational datasets. These algorithm outputs effectively serve as the sufficient statistics of the observation data. However, quite surprisingly, we find that prompting the LLMs with these sufficient statistics decreases the LLMs' performance in causal discovery. To address this current limitation, we propose CARE, a framework that enhances LLMs' causal-reasoning ability by teaching them to effectively utilize the outputs of established causal-discovery algorithms through supervised fine-tuning. Experimental results show that a finetuned Qwen2.5-1.5B model produced by CARE significantly outperforms both traditional causal-discovery algorithms and state-of-the-art LLMs with over a thousand times more parameters, demonstrating effective utilization of its own knowledge and the external algorithmic clues.
An RF-CNN pipeline for predicting PM2.5 concentration in Sri Lanka
Journal of Hazardous Materials Advances · 2025-06-09 · 2 citations
article · Open access
- The RF-CNN pipeline can serve as a robust and comprehensive method for precisely forecasting the spatial and temporal fluctuations in PM2.5 concentrations.
- The RF-CNN pipeline yielded satisfactory outcomes by employing the RF component as an error model.
- The RF-CNN pipeline is intended to enhance air quality forecasting and guide policymakers in mitigating air pollution impacts worldwide.
Air pollution is a considerable global public health threat, requiring efficient monitoring and forecasting to guide decision-making. This study introduces a cascaded model of enhanced Random Forest with Convolutional Neural Network (RF-CNN) that predicts spatiotemporal fluctuations in PM2.5 concentrations throughout Sri Lanka. The K-Nearest Neighbors method is employed to impute missing data, and the model utilizes data from 24 low-cost PM2.5 sensors that are distributed throughout the country. The Convolutional Neural Network (CNN) derives spatial features from four-band PlanetScope satellite images (3 m/pixel resolution, 1 km² spatial coverage), while the Random Forest (RF) component models the relationship between PM2.5 levels and four meteorological parameters. These features, combined with meteorological, spatial, and temporal inputs, produce the final forecasting results. The dataset comprises 1934 satellite images that were collected between December 2022 and February 2024, with an average PM2.5 concentration of approximately 15 μg/m³. The RF-CNN model exhibited robust performance metrics across a variety of climate zones, including a normalized root mean square error of approximately 32.4%, a mean absolute percentage error of approximately 25.7%, a normalized mean absolute error of approximately 22.8%, a Spearman r of 0.871, and a Pearson r of 0.873. Two metrics, the Input Data Quality Score (IDQS) and the Testing Data Quality Score (TDQS), were implemented to evaluate the effects of imputation. Performance was minimally impacted by imputation within acceptable ranges, while exceeding limits resulted in increased uncertainty. This research emphasizes the efficacy of the RF-CNN approach, which integrates satellite imagery and low-cost sensor data, as a scalable solution for predicting spatiotemporal PM2.5 variations. It provides valuable insights for regions that lack extensive monitoring.
Scientific Reports · 2025-04-29 · 5 citations
article · Open access
Natural Killer (NK) cells can recognize and kill Mycobacterium tuberculosis (Mtb)-infected cells in vitro; however, their role after natural human exposure has not been well studied. To identify Mtb-responsive NK cell populations, we analyzed the peripheral blood of healthy household contacts of active Tuberculosis (TB) cases and source community donors in an endemic region of Port-au-Prince, Haiti by flow cytometry. We observed higher CD8α expression on NK cells in putative resistors (Interferon γ release assay negative; IGRA− contacts) with a loss of CD8α surface expression during household-associated exposure and active TB disease. In vitro assays and CITE-seq analysis of CD8α+ NK cells demonstrated enhanced maturity, cytotoxic gene expression, and response to cytokine stimulation relative to CD8α− NK cells. CD8α+ NK cells also displayed dynamic surface expression dependent on MHC class I in contrast to conventional CD8+ T cells. Together, these results support a specialized role for CD8α+ NK cell populations during Mtb infection correlating with disease resistance.
CD4+ Mucosal-associated Invariant T (MAIT) cells express highly diverse T cell receptors
bioRxiv (Cold Spring Harbor Laboratory) · 2025-02-08
preprint · Open access
Mucosal-associated invariant T cells are highly conserved innate-like T cells in mammals recognized for their high baseline frequency in human blood and cytotoxic effector functions during infectious diseases, autoimmunity, and cancer. While the majority of these cells express a conserved CD8αβ+ TRAV1-2 T cell receptor recognizing microbially-derived Vitamin B2 intermediates presented by the evolutionarily conserved major histocompatibility complex I-related molecule, MR1, there is an emerging appreciation for diverse subsets that may be selected for in humans with distinct functions, including subpopulations that co-express CD4. Prior work has not examined T cell receptor (TCR) heterogeneity in CD4+ MAIT cells, largely due to the bias of identifying human MAIT cells as CD8+ TRAV1-2+ cells. In this study, we adopted an unbiased single-cell TCR-sequencing approach of total MR1-5-OP-RU-tetramer-reactive T cells and discovered that CD4+ MAIT cells express highly diverse TRAV1-2 negative TCRs. To specifically characterize this TCR repertoire, we analyzed VDJ sequences of single MR1-5-OP-RU tetramer+ MAIT cells across two datasets and identified distinct TCR usage among CD4+ MAIT cells including TRAV21, TRAV8 (TRAV8-1, TRAV8-2, TRAV8-3), and TRAV12 families (TRAV12-2, TRAV12-3), as well as more variable J chain and CDR3 sequences. Non-TRAV1-2 MAIT cell TCRs were also enriched after in vitro expansion, including with Mycobacterium tuberculosis. These results indicate that mature human CD4+ MAIT cells adopt distinct TCR usage from the canonical TRAV1-2+ CD8+ subset and suggest that alternative MR1 ligands in addition to riboflavin intermediates may select them.
Recent grants
MULTIREGIONAL ELECTRICAL ENCODING OF SOCIAL AGGRESSION
NIH · $2.9M · 2021–2026
Uncovering Population-Level Cellular Relationships to Behavior via Mesoscale Networks
NIH · $1.1M · 2019–2023
Frequent coauthors
- 94 shared
Kafui Dzirasa
Duke University Hospital
- 81 shared
Neil M. Gallagher
Cornell University
- 53 shared
Stephen D. Mague
- 52 shared
Lawrence Carin
King Abdullah University of Science and Technology
- 50 shared
Austin Talbot
- 42 shared
Staci D. Bilbo
Duke University
- 38 shared
Cameron Blount
Walter Reed Army Institute of Research
- 36 shared
Nkemdilim Ndubuizu
Duke University
Education
- 2015
Ph.D., Electrical and Computer Engineering
Duke University
- 2014
M.S., Electrical and Computer Engineering
Duke University
- 2010
B.S.E., Electrical and Computer Engineering
Duke University