Simone Marini
Assistant Professor · Verified · University of Florida · Epidemiology
Active 1994–2026
About
Simone Marini, PhD, is listed in the UF Health Directory, which indicates a professional affiliation with the University of Florida Health Science Center and Shands hospitals. The directory listing does not describe his research focus, background, or key contributions.
Research topics
- Biology
- Cell biology
- Genetics
- Cancer research
- Data Mining
- Chemistry
- Computer Science
- Medicine
- Internal medicine
- Pathology
- Database
- Anatomy
- Immunology
- Computational biology
Selected publications
bioRxiv (Cold Spring Harbor Laboratory) · 2026-04-29
Article
Multidrug-resistant and extensively drug-resistant Mycobacterium tuberculosis (MTB) represents a growing global health crisis, characterized by limited treatment options and high mortality rates. Rapid and accurate prediction of resistance profiles is critical to guide effective therapy and curb transmission. Whole-genome sequencing (WGS) offers promise for individualized resistance profiling, yet existing computational tools remain constrained by predefined mutation catalogs and prohibitive resource requirements for large-scale analyses. Here, we present AURA, a GPU-accelerated, pangenome-scale machine learning framework for de novo resistance prediction. Trained on 12,185 globally diverse MTB isolates, AURA predicts resistance to 13 first-line, second-line, and repurposed antibiotics with high precision and identifies 59 novel resistance-associated loci, including variants in katG, pncA, rpoC, and members of the PE/PGRS gene family. By enabling model training on an unprecedented genomic scale, AURA provides new insights into the genetic architecture of resistance and establishes a scalable platform for precision-guided therapy and global surveillance of MTB.
Transcriptional programs diverge in aging mouse and human skeletal muscle
Aging · 2026-02-12
Article · Open access
Animal models provide a crucial scientific substrate for medical innovation, yet findings in these models do not always translate directly to humans. Although murine models are extensively employed to study skeletal muscle aging, the extent to which they diverge from the human aging process remains poorly understood. This study examined transcriptional changes with aging in mouse and human skeletal muscle. RNA bulk-sequencing was performed on gastrocnemius muscles from young and old C57BL/6 mice and compared to transcriptomic data from young and old healthy human vastus lateralis muscles obtained from the GESTALT study (NIA/NIH) via the Gene Expression Omnibus database. Cross-species comparison revealed substantial divergence in age-associated transcriptional profiles, with fewer than 5% of significant GO and KEGG terms shared between species. Hypoxia signaling, VEGFA, and inflammatory pathways showed concordant downregulation with aging in both species; however, angiogenesis, neurogenesis, and myogenesis demonstrated opposing or non-significant trends. These findings caution against direct extrapolation of murine aging transcriptomics to human skeletal muscle biology, though select conserved pathways may represent viable cross-species targets for future investigation.
ESMO Open · 2025-09-01
Article · Open access
Head and neck squamous cell carcinoma (HNSCC) is a clinically and molecularly heterogeneous group of tumors primarily arising from the mucosal epithelium of the oral cavity, pharynx, and larynx. Over the past decade, despite the marked improvement in clinical outcome of many other tumor types, prognosis remains poor for patients with HNSCC. This study aimed to characterize the mutational landscape of HNSCC across anatomical subsites in order to identify genomic alterations with potential predictive and prognostic significance, guide novel personalized therapeutic strategies, and improve clinical outcomes.
bioRxiv (Cold Spring Harbor Laboratory) · 2025-04-03
Preprint · Open access
Abstract: Pseudomonas aeruginosa infects immunocompromised and hospitalized individuals, resulting in over 500,000 annual deaths. With emerging multidrug resistance and stagnating antibiotic development, alternative antimicrobials are desperately needed. Bacteriophages (phages) offer a promising, effective, and safe alternative. We developed and optimized a high-content liquid assay screen and a stringent assessment of efficacy to isolate and characterize seven novel P. aeruginosa phages. Phages were screened individually and in cocktail, inhibiting the growth of over 90% (50/55) of multidrug-resistant clinical strains and ∼75% (102/137) of animal, environmental, and human isolates. When tested in a mouse bacteremia model, the phage cocktail successfully eradicated P. aeruginosa. A proteome-wide bi-directional BLAST identified eight proteins that influenced phage infection. The functional analysis of the corresponding genes reveals their putative roles involving genome modification and transcriptional regulation, metabolic processes, and structural components essential for phage docking. Collectively, we have developed a rigorous high-content approach to identify effective phages, which, coupled with functional genomics, revealed genes that affect phage-bacteria interaction.
Author Summary: In this study, we explored the potential of bacteriophages (phages) isolated from municipal and hospital wastewater sources for combating multidrug-resistant Pseudomonas aeruginosa, an opportunistic pathogen known for posing significant clinical challenges. A rigorous stepwise screen aimed at enhancing specificity against a broad set of 55 clinical P. aeruginosa strains allowed us to isolate diverse class phages that can target over 90% of the clinical isolates. Our phage efficacy assessments employed a colorimetric MTT assay to measure the metabolic activity of P. aeruginosa strains in response to phage exposure.
Notably, the phages demonstrated broad coverage against the P. aeruginosa library, with individual phages showing varying degrees of efficacy and a cocktail exhibiting superior inhibitory properties. Further validation using a mouse bacteremia model confirmed the exceptional efficacy of the cocktail, supported by a complete attenuation of clinical signs of infection and a significant reduction of bacterial loads across all organs, supporting their utility as potential phage therapy. Finally, a comprehensive comparative genomic analysis of target bacteria combined with phage efficacy revealed novel genes that are potentially involved in phage infection. These findings provide a foundation for understanding phage-host interactions and pave the way for the development of targeted phage therapies against antibiotic-resistant bacterial infections.
vir2vec: A Viral Genome-Wide Viral Embedding
bioRxiv (Cold Spring Harbor Laboratory) · 2025-12-13
Article · Open access · Senior author · Corresponding
Genomic language models (gLMs) have recently emerged as powerful numerical surrogates for DNA, but existing architectures are largely focused on human DNA or trained on limited viral references, and no dedicated benchmark currently exists for viral genome understanding. Here we introduce vir2vec, a 422-million-parameter, decoder-only gLM obtained by continual pretraining of Mistral-DNA on a rigorously curated pan-viral corpus of 565,747 complete genomes spanning 295 viral species. vir2vec operates on byte-pair-encoded DNA subwords and exposes fixed-length genome-level embeddings that are reused across tasks. Additionally, we present vGUE, a unified benchmark for viral representation learning. In vGUE, we precompute vir2vec embeddings and feed them to simple classifiers (logistic regression, support vector machines, random forests) trained under nested cross-validation, to quantify how well they capture biologically motivated axes of viral variation. Using this framework, vGUE assesses genomic viral prediction tasks across: (i) organism-level discrimination (virus vs non-virus genomes and reads), (ii) genome-wide evolutionary fingerprints (DNA vs RNA viruses, host-range prediction), (iii) intra-genus species separation (HIV-1 vs HIV-2), (iv) fine-grained variant and subtype typing (SARS-CoV-2 lineages), and (v) phenotypic context signal detection (HIV-1 brain vs plasma tropism). vir2vec attains the highest balanced accuracy across seven out of eight heterogeneous classification tasks, consistently outperforming both a human-trained genomic foundation model and a viral-specific one. By coupling a pan-viral gLM with a standardized evaluation suite, vir2vec and vGUE provide an open foundation for future viral genomic models, surveillance tools, and discovery pipelines.
vir2vec is released as a controlled-access resource with the understanding that it is designed for discriminative/classification embedding tasks, and not generative ones; responsible deployment of viral genomic models requires consideration of dual-use implications and appropriate ethical and governance oversight.
JMIR Public Health and Surveillance · 2025-06-25 · 1 citation
Article · Open access · Senior author
Background: To complete the Ending the HIV Epidemic initiative in areas with high HIV incidence, there needs to be a greater understanding of the demographic, behavioral, and geographic factors that influence the rate of new HIV diagnoses. This information will aid the creation of targeted prevention and intervention efforts. Objective: This study aims to identify the geographic distribution of risk groups and their role within potential transmission networks in Florida. Methods: Public data from the Florida Department of Health and behavioral data from the Surveillance Tools and Reporting System between 2012 and 2022 were used in these analyses. We analyzed records as a combination of variables of interest (gender, age, race or ethnicity, and HIV risk group) to create demographic-behavioral profiles (DBPs) that represent the profiles of people newly diagnosed with HIV. We then used the resulting DBPs to characterize Florida counties and HIV coordination areas and calculated the county-to-county and area-to-area rank (Spearman) correlation. We then drew a dendrogram based on the correlation matrix and identified clusters of similar counties and areas. Lastly, network analysis used HIV contact tracing data from the Surveillance Tools and Reporting System to identify HIV transmission and exposure contact networks and characterized large networks by DBPs and geolocation. Results: We identified 37 DBPs. The largest DBPs were Hispanic and non-Hispanic Black males aged 25-49 reporting male-to-male sexual contact (n=7539 and n=4329, respectively), non-Hispanic White males aged 25-49 reporting male-to-male sexual contact (n=4221), and non-Hispanic Black females aged 25-49 reporting heterosexual contact (n=3371). The state could be broken up generally into 2 transmission and exposure clusters by region: Northwestern or Northern and Central or Southern.
We identified several counties with similar DBPs that were not in the same HIV coordination area. A total of 3097 contact networks were identified among 7944 people with HIV contact tracing data. Most (n=2508, 81%) networks involve only 2 people, 11% (n=349) involve 3 people, 7% (n=224) involve 4 to 19 people, and 6 networks involve 20 or more people. As network size increases, the proportion of people within the network who identify as female, non-Hispanic Black, aged older than 50 years, and exposed to HIV via heterosexual contact decreases. Conclusions: We identified distinct risk groups and clusters of transmission and exposure throughout Florida. These results can help regions identify health disparities and allocate their HIV prevention and intervention resources accordingly. The goal of this work was to highlight areas of need in a high-incidence setting, not to contribute to existing stigma against vulnerable groups, and it is important to consider the ethics and possible harm of advanced methodologies such as contact network analysis when addressing public health problems.
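The Methods above mention a county-to-county rank (Spearman) correlation over demographic-behavioral profile counts. As a rough illustration only, and not the study's code, the following Python sketch computes a Spearman correlation between two count vectors using average ranks for ties. The county names and counts are hypothetical values invented for the example.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either vector is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical counts of new diagnoses per profile (same profile order).
county_a = [120, 80, 40, 10]
county_b = [60, 45, 20, 5]   # same ordering of profiles as county_a
county_c = [5, 20, 45, 60]   # reversed ordering

print(spearman(county_a, county_b))
print(spearman(county_a, county_c))
```

Because only the ordering of profiles matters, two counties of very different size can still correlate strongly if the same profiles dominate in both.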
Predicting Pneumococcal Meningitis from Demographics and Early AMR Signals
2025-10-12
Article · Open access · Senior author
Bacterial meningitis progresses rapidly, so treatment is often initiated before full laboratory results are available. We benchmark machine-learning models for predicting meningitis among invasive pneumococcal disease (IPD) cases using routinely obtainable information on a realistic clinical timeline. Using 6,049 Global Pneumococcal Sequencing (GPS) records (1,646 meningitis), we define four feature tiers: T1 demographics and PCV type; T2 adds serotype group; T3 adds molecular markers available early (MLST/PBP encodings and a small panel of AMR genes/loci); T4 adds AST signal summarized as label-agnostic log2 MIC tertiles. Logistic regression and random forest models with 5-fold stratified CV show step-wise gains: LR 54.0%→62.0% and RF 59.5%→65.1% balanced accuracy from T1→T4; the largest jump comes from adding serotype (T1→T2). Results indicate that (i) clinical metadata alone are insufficient; (ii) early serotyping provides the biggest boost; and (iii) a small AMR panel and coarse AST summaries yield incremental improvements. This tiered, reproducible setup mirrors data availability in practice and can guide which early tests to prioritize for meningitis risk triage, especially where comprehensive culture/NGS are delayed.
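The results above are reported as balanced accuracy, which is the mean of per-class recall and is therefore less flattering than plain accuracy on imbalanced data. The short Python sketch below is an illustration rather than the paper's evaluation code. It uses only the class sizes stated in the abstract (1,646 meningitis cases among 6,049 records) and a hypothetical classifier that always predicts the majority class.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Class sizes from the abstract: 1,646 meningitis (1) out of 6,049 records.
y_true = [1] * 1646 + [0] * (6049 - 1646)
# Hypothetical baseline that never predicts meningitis.
y_pred = [0] * len(y_true)

print(accuracy(y_true, y_pred))           # about 0.73
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Under this metric a trivial majority-class predictor scores 0.5, which is the baseline against which the reported 54.0% to 65.1% figures can be read.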
SSRN Electronic Journal · 2025-01-01
Preprint · Open access
SARITA: a large language model for generating the S1 subunit of the SARS-CoV-2 spike protein
Briefings in Bioinformatics · 2025-07-01 · 6 citations
Article · Open access · Senior author
BACKGROUND: The COVID-19 pandemic has caused over 776 million infections and 7 million deaths globally between December 2019 and November 2024. Since the emergence of the original Wuhan strain, SARS-CoV-2 has evolved into multiple variants (including Alpha, Delta, and Omicron), primarily through mutations in the Spike glycoprotein. The S1 subunit, which binds the human angiotensin-converting enzyme 2 (ACE2) receptor, mutates frequently and plays a key role in infectivity and immune escape, while the more conserved S2 subunit mediates membrane fusion. Anticipating future mutations is essential for guiding vaccine design and therapeutic strategies. Generative Large Language Models (LLMs) have shown promise in protein sequence modeling due to their capacity to produce realistic and functional synthetic sequences. Here, we introduce SARITA, a GPT-3-based LLM with up to 1.2 billion parameters, fine-tuned via continual learning on the protein model RITA trained on 107 017 high-quality SARS-CoV-2 Spike sequences (up to March 1st 2021) to generate high-quality synthetic SARS-CoV-2 Spike S1 subunits. RESULTS: SARITA is able to generate realistic, full-length synthetic S1 subunits starting from a 14-amino-acid prompt. When evaluated on unseen sequences collected between March 2021 and November 2023, including major Variants of Concern (VOCs) such as Delta and Omicron and Variants of Interest such as Iota, SARITA outperforms baseline and state-of-the-art LLMs in terms of sequence quality, biological plausibility, and similarity to real-world viral evolution. SARITA generates high-quality sequences in over 97% of cases, with markedly lower False Mutation Rate and higher similarity scores (PAM30, Levenshtein distance) compared to alternative approaches.
It also accurately reproduces key mutations characteristic of future variants, such as L212I, R158L, T95P, and E406K, which were not present in the training data but emerged later in VOCs like Omicron and Delta. Structure-based analysis confirms the functional plausibility of these substitutions, with ΔΔG values within experimentally supported thresholds for ACE2 and antibody binding. Furthermore, SARITA anticipates immune-evasive mutations and accurately captures the positional and statistical distribution of mutations found in post-March 1st 2021 variants, highlighting its potential as a predictive tool for viral evolution. CONCLUSION: These results indicate the potential of SARITA to predict future SARS-CoV-2 S1 evolution, potentially aiding in the development of adaptable vaccines and treatments.
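One of the similarity measures named above is the Levenshtein distance, the minimum number of single-character insertions, deletions, and substitutions needed to turn one sequence into another. The sketch below is a standard dynamic-programming version in Python, shown for illustration only. It is not SARITA's evaluation code, and the short peptide strings are toy examples rather than real Spike sequences.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (unit cost per edit)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Toy amino-acid strings differing by one substitution (F -> L).
generated = "MFVFLV"
observed = "MFVLLV"
print(levenshtein(generated, observed))  # 1
```

Keeping only two rows of the table holds memory at O(len(b)), which helps when comparing many sequences that are each several hundred residues long.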
Bioinformatics · 2025-07-01
Article · Open access
MOTIVATION: Antibiotic resistance in Mycobacterium tuberculosis (MTB) poses a significant challenge to global public health. Rapid and accurate prediction of antibiotic resistance can inform treatment strategies and mitigate the spread of resistant strains. In this study, we present a novel approach leveraging large language models (LLMs) to predict antibiotic resistance in MTB (LLMTB). Our model is trained and evaluated on genomic data from 12 185 CRyPTIC isolates and their associated resistance profiles, utilizing natural language processing techniques to capture patterns and mutations linked to resistance. The model's architecture integrates state-of-the-art transformer-based LLMs, enabling the analysis of complex genomic sequences and the extraction of critical features relevant to antibiotic resistance. RESULTS: We evaluate our model's performance using a comprehensive dataset of MTB strains, demonstrating its ability to achieve high performance in predicting resistance to various antibiotics. Unlike traditional machine learning methods, fine-tuning or few-shot learning opens avenues for LLMs to adapt to new or emerging drugs, thereby reducing reliance on extensive data curation. Beyond predictive accuracy, LLMTB uncovers deeper biological insights, identifying critical genes, intergenic regions, and novel resistance mechanisms. This method marks a transformative shift in resistance prediction and offers significant potential for enhancing diagnostic capabilities and guiding personalized treatment plans, ultimately contributing to the global effort to combat tuberculosis and antibiotic resistance. AVAILABILITY AND IMPLEMENTATION: All source code is publicly available at https://github.com/ctestagrose/LLMTB.
Frequent coauthors
- Riccardo Bellazzi · 50 shared
- Mattia Prosperi · University of Florida · 44 shared
- Marco Salemi · University of Florida · 42 shared
- Brittany Rife Magalis · University of Louisville · 23 shared
- Shannan N. Rich · 23 shared
- Carla Mavian · University of Florida · 22 shared
- Chase A. Pagani · The University of Texas Southwestern Medical Center · 19 shared
- Benjamin Lévi · The University of Texas Southwestern Medical Center · 18 shared