Eric Slud

Professor · Verified

University of Maryland, College Park · Statistics

Active 1977–2026

h-index: 30

Citations: 2.3k

Papers: 152 (20 in the last 5 years)

Funding: not listed



About

Eric V. Slud is a Professor in the Statistics Program within the Department of Mathematics at the University of Maryland, College Park. His primary research interests lie in mathematical statistics and probability. One major area of his work is census statistics, where he concentrates on demographic modeling of nonresponse to national surveys, particularly in applications to weighting adjustment and small area estimation (SAE). He has contributed extensively to the Small Area Income and Poverty Estimation (SAIPE) program of the Census Bureau, including methodological research on small-area estimation and mean squared error estimation from survey data under nonlinear and left-censored Fay-Herriot models. His work also addresses internal evaluation of biases due to weighting adjustment for nonresponse in longitudinal surveys such as the Survey of Income and Program Participation (SIPP), and he has developed methods for simultaneous nonresponse adjustment and calibration of weights in complex surveys.

Research topics

Computer Science
Artificial Intelligence
Data Mining
Mathematics
Econometrics
Statistics

Selected publications

Nonidentifiability of Within‐Cluster Dependence Parameters in Analytic Survey Inference
Wiley Interdisciplinary Reviews Computational Statistics · 2026-02-09
article · Open access · 1st author · Corresponding
In analytic survey inference, the attributes of units in a target or frame survey population are idealized as a sample from a superpopulation statistical model, and model parameters are estimated from survey data drawn from a probability sample of the frame population. The data structure in such survey inference consists of relevant attribute data together with survey weights associated with all survey respondents. These weights relate to the probability of inclusion of each unit within the respondent set, and they enable consistent estimation in large populations and samples of all frame‐population averages of functions of the unit attributes. However, even when this is assumed correct, model parameters such as those for within‐cluster dependence between survey attributes from distinct respondents may not be identifiable from survey data with weights. That is, even assuming the superpopulation model, with a parametric dependence structure for attributes within clusters, if sampled data are observed with precisely correct weights equal to the reciprocals of single‐inclusion probabilities or of conditional probabilities of inclusion given unit data, multiple distinct values of the parameters of the superpopulation model may yield the same likelihood for the data for some sample designs compatible with the weights. This article first describes the background and existing methods for the design‐based estimation of cluster‐level model parameters from survey data on a clustered superpopulation using single‐inclusion weights. Nonidentifiability results are presented rigorously as mathematical examples, proving that large‐sample consistent estimation of within‐cluster dependence parameters from survey data with single‐inclusion weights is not always possible. This article is categorized under: Statistical and Graphical Methods of Data Analysis > Sampling Algorithms and Computational Methods > Maximum Likelihood Methods
CooccurrenceAffinity: An R package for computing a novel metric of affinity in co-occurrence data that corrects for pervasive errors in traditional indices
PLoS ONE · 2025-01-16 · 6 citations
article · Open access · Senior author · Corresponding
1. Analysis of co-occurrence data with traditional indices has led to many problems, such as sensitivity of the indices to prevalence and the same value representing either a strong positive or a strong negative association across different datasets. In our recent study (Mainali et al 2022), we revealed the source of the problems that make the traditional indices fundamentally flawed and unreliable (namely, that the indices in common use have no target of estimation quantifying degree of association in the non-null case), and we further developed a novel parameter of association, alpha, with complete formulation of the null distribution for estimating the mechanism of affinity. We also developed the maximum likelihood estimate (MLE) of alpha in our previous study. 2. Here, we introduce the CooccurrenceAffinity R package that computes the MLE for alpha. We provide functions to perform the analysis based on a 2×2 contingency table of occurrence/co-occurrence counts as well as an m×n presence-absence matrix (e.g., species by site matrix). The flexibility of the function allows a user to compute the alpha MLE for entity pairs on matrix columns based on presence-absence states recorded in the matrix rows, or for entity pairs on matrix rows based on presence-absence recorded in columns. We also provide functions for plotting the computed indices. 3. As novel components of this software paper not reported in the original study, we present theoretical discussion of a median interval and of four types of confidence intervals. We further develop functions (a) to compute those intervals, (b) to evaluate their true coverage probability of enclosing the population parameter, and (c) to generate figures. 4. CooccurrenceAffinity is a practical and efficient R package with user-friendly functions for end-to-end analysis and plotting of co-occurrence data in various formats, making it possible to compute the recently developed metric of alpha MLE as well as its median and confidence intervals introduced in this paper. The package supplements its main output of the novel metric of association with the three most common traditional indices of association in co-occurrence data: Jaccard, Sørensen-Dice, and Simpson.
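The three traditional indices named in this abstract can be written down from the 2×2 counts alone. The sketch below is in Python rather than R and is not the package's code; the function names `counts_from_presence` and `traditional_indices` are hypothetical, and the formulas are the standard textbook definitions of the Jaccard, Sørensen-Dice, and Simpson indices.

```python
def counts_from_presence(a, b):
    """Reduce two 0/1 presence vectors over the same sites to (x, m_a, m_b, n)."""
    if len(a) != len(b):
        raise ValueError("both entities must be scored on the same sites")
    x = sum(1 for ai, bi in zip(a, b) if ai and bi)  # co-occurrence count
    return x, sum(map(bool, a)), sum(map(bool, b)), len(a)


def traditional_indices(x, m_a, m_b):
    """Standard definitions, from co-occurrence x and prevalences m_a, m_b."""
    if m_a <= 0 or m_b <= 0 or not (0 <= x <= min(m_a, m_b)):
        raise ValueError("need positive prevalences and 0 <= x <= min(m_a, m_b)")
    return {
        "jaccard": x / (m_a + m_b - x),       # shared sites / sites with either
        "sorensen_dice": 2 * x / (m_a + m_b),
        "simpson": x / min(m_a, m_b),         # shared sites / rarer entity's sites
    }


# Hypothetical species-by-site data: two species scored at eight sites.
species_a = [1, 1, 1, 0, 0, 1, 0, 0]
species_b = [1, 1, 0, 0, 1, 1, 0, 1]
x, m_a, m_b, n = counts_from_presence(species_a, species_b)
print(x, m_a, m_b, n, traditional_indices(x, m_a, m_b))
```

None of the three formulas uses the total number of sites, which is one way to see why such indices can react to prevalence rather than to a target degree of association.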
Small area estimates for Voting Rights Act Section 203(b) coverage determinations
Calcutta Statistical Association Bulletin · 2024-02-13 · 1 citation
article · 1st author · Corresponding
Every five years, under the Voting Rights Act (VRA) Section 203(b), the Director of the US Census Bureau determines which states and political subdivisions must provide ballot assistance in languages other than English. These rule-based determinations use small-area estimates of the proportions of voting-age citizen members of racial/ethnic Language Minority Groups (LMGs) who are limited English-proficient (LEP) and of LEP LMG persons who are also illiterate. This large-scale Small Area Estimation (SAE) effort by the US Census Bureau is based on American Community Survey five-year data. This paper focuses on the unique attributes of this SAE problem, including the predominance in each LMG of tiny domains along with relatively few large domains that account for the bulk of LMG citizens. The data and small area models are treated separately for distinct LMGs, and the spirit of the law requires that the domain estimates should be produced in a similar way across LMGs for each type of geography (Jurisdictions, which are mostly counties, plus two types of American Indian and Alaskan Native areas). The paper describes and assesses the SAE models developed under unique constraints, including novel methodological aspects and the use of hybrid frequentist and Bayesian computational techniques; situates these modelling choices in the broader literature on SAE; and discusses the strengths and weaknesses of the resulting models.
Optimal Stopping for Clinical Trials with Economic Costs: A Simulation-Based Approach
2024-12-15
article · Senior author
We consider the problem of designing an early stopping clinical trial investigating the efficacy of a medical intervention against the available standard of care. The standard approach is to determine a stopping rule minimizing the expected number of patients required, subject to error rate constraints, not considering costs explicitly depending on the magnitude of the treatment effect. In this paper we formulate an optimal stopping problem for clinical trials with instantaneous continuous response, the objective being minimizing an overall risk comprising loss functions accounting for costs involving the treatment effect that might model ethical and economic costs. To solve the optimization problem we propose a feasible directions simulation-based algorithm requiring new stochastic gradient estimators which we derive using Smoothed Perturbation Analysis. We conduct simulation experiments to test the effectiveness of the simulation optimization algorithm and to obtain insights on the effects of the various risk factors on the optimal solution.
Characterization of tumor evolution by functional clonality and phylogenetics in hepatocellular carcinoma
Communications Biology · 2024-03-29 · 4 citations
article · Open access
Hepatocellular carcinoma (HCC) is a molecularly heterogeneous solid malignancy, and its fitness may be shaped by how its tumor cells evolve. However, ability to monitor tumor cell evolution is hampered by the presence of numerous passenger mutations that do not provide any biological consequences. Here we develop a strategy to determine the tumor clonality of three independent HCC cohorts of 524 patients with diverse etiologies and race/ethnicity by utilizing somatic mutations in cancer driver genes. We identify two main types of tumor evolution, i.e., linear, and non-linear models where non-linear type could be further divided into classes, which we call shallow branching and deep branching. We find that linear evolving HCC is less aggressive than other types. GTF2IRD2B mutations are enriched in HCC with linear evolution, while TP53 mutations are the most frequent genetic alterations in HCC with non-linear models. Furthermore, we observe significant B cell enrichment in linear trees compared to non-linear trees suggesting the need for further research to uncover potential variations in immune cell types within genomically determined phylogeny types. These results hint at the possibility that tumor cells and their microenvironment may collectively influence the tumor evolution process.
Null model analyses are not adequate to summarize strong associations: Rebuttal to Ulrich et al. (2022)
Journal of Biogeography · 2023-10-21 · 2 citations
article · Open access · Senior author
We recently developed a novel metric of association in pairwise co-occurrence data (Mainali et al., 2022) to address fundamental flaws in traditional indices, as elaborately discussed and conclusively shown in our published paper. Our new metric, the maximum likelihood estimator (MLE) alpha-hat of a statistical parameter alpha, quantifies the degree of association between species occupancy at ecological sites, and it is insensitive to the species prevalences and number of sites. In contrast, we showed that classic indices of co-occurrence (Jaccard, Simpson, Sørensen–Dice) can be highly sensitive to fixed margins of 2 × 2 contingency tables, estimating wildly variable degrees of association and even reversing the direction of association for tables with different margins but the same degree-of-association alpha. Ulrich et al. (2022), hereafter USSG, adversely commented on our paper, claiming that our metric lacked novelty and insights beyond null-hypothesis standardization techniques. In this commentary, we address each of USSG's specific claims reflecting their view that test statistics for null association adequately summarize strongly non-null association in co-occurrence data. We show that standardized co-occurrence behaves differently in different datasets with the same strongly non-null degree of association, while alpha-hat exhibits reliable performance. If the counts of two species are $m_A$ and $m_B$, respectively, out of $N$ total sites, then the hypergeometric distribution for their co-occurrence count $X$ underlies Fisher's famous ‘exact test’ (Fisher, 1934) to analyse a contingency table (Mainali et al., 2022) for the possible dependence of its row and column categories. Recently, the hypergeometric was introduced in ecology (Griffith et al., 2016; Veech, 2013) as a null distribution for co-occurrence analysis. The FE null model of Gotelli (2000) specifies the same distribution, expressed in terms of stochastic simulations rather than mathematical formulas.
While the null model enables the hypothesis test of no association, much of the scientific interest in co-occurrence analysis in Ecology and Biogeography concerns the degree of association in non-null settings. Null-hypothesis test statistics do not reliably capture the degree of association under diverse fixed table margins. A statistical analogy crystallizes our view of this problem. Suppose an investigator wants to summarize the results of many coin-toss experiments, in each of which one fair or biased coin is tossed repeatedly. Each such experiment records the number $n$ of tosses and the number $X$ of heads. The coin used in each experiment is different, some coins being fair and others biased in different ways. Motivated by the standard hypothesis test that the coin is fair, one could ‘summarize’ these experiments by the standardized count $Z = (X - n/2)/\sqrt{n/4}$ (or the Binomial($n$, ½) probability of $X$ or more heads) used to test the fair-coin hypothesis. But is there any researcher, living in a world in which many coins are biased positively and many negatively, who would not instead parameterize the unknown heads probability $p$ in each experiment and estimate it by $\hat{p} = X/n$ (or some similar Bayesian estimator) along with a confidence interval or estimate of variability of $\hat{p}$? The estimator has a target and meaning regardless of $n$, but $Z$ does not when $p$ differs from ½. The situation is very similar for co-occurrences. Statistical methods estimating a target parameter reliably group like degrees of association regardless of differing 2 × 2 table margins.
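The coin-toss analogy can be checked numerically. The following Python sketch is illustrative only: the helper names `z_stat` and `p_hat` and the choice of a 70% heads rate are assumptions, not taken from the article. It holds the observed heads fraction fixed and varies the number of tosses.

```python
import math


def z_stat(x, n):
    """Standardized heads count for testing a fair coin: (X - n/2) / sqrt(n/4)."""
    return (x - n / 2) / math.sqrt(n / 4)


def p_hat(x, n):
    """Estimated heads probability X / n."""
    return x / n


# Same observed heads fraction (7 in 10) at three different experiment sizes.
for n in (20, 200, 2000):
    x = 7 * n // 10
    print(f"n={n:5d}  X={x:5d}  p_hat={p_hat(x, n):.3f}  Z={z_stat(x, n):.2f}")
```

Under these assumptions `p_hat` stays at the same value in every row while `Z` grows roughly like the square root of `n`, which is the sense in which `Z` has no fixed target once the coin is not fair.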
USSG claim our metric suffers from three ‘problems’: (1) for fixed 2 × 2 table margins, our alpha-hat is essentially equivalent to null-hypothesis standardized counts or P-values; (2) our affinity model shares with the null model the assumption that all $N$ sites are equally likely to be occupied (separately) by each species once the prevalences $m_A$, $m_B$ are fixed; and (3) the alpha-hat metric is too complicated and numerically unstable as implemented in our R code (now a CRAN package ‘CooccurrenceAffinity’). By far, the largest part of USSG's commentary focusses on their comment (1), interspersed with reanalyses of real and artificial datasets comparing null-standardized statistics with alpha-hat. We address all their comments sequentially (as presented in USSG), showing that USSG's data analyses reinforce the merits of alpha-hat rather than of null-standardized indices and provide exhibits showing the centring at alpha and stability of the distribution of alpha-hat across a variety of margins $(m_A, m_B, N)$. In their first comment, USSG assume the common (hypergeometric) form of null model. We (Mainali et al., 2022) provided four ‘classic assumptions’ that are precisely stated and imply the hypergeometric null hypothesis under fixed marginals for a co-occurrence table. We did not examine the equivalence of our assumptions with the less precisely stated assumptions of the prior studies, including Gotelli (2000) and Wright et al. (1998). But our disagreements with USSG all relate to non-null degrees of association. USSG's first comment analysed a previously published dataset (Wright et al., 1998). We reproduced the example in which USSG argues that our ‘affinity and Veech (2013)'s probabilistic occurrence yield very similar results on a large set of empirical species pairs’, as seen in the blue cluster of Figure 1a.
This observation is correct for values of Veech's probability (pv, the probability of the observed or higher co-occurrences) ranging from approximately 0.05 to 0.95, that is, the nonsignificant ones (roughly corresponding with standardized co-occurrence from −1.65 to +1.65). However, the most interesting associations lie outside these ranges, revealing the incompleteness of USSG's view and the novelty of affinity. When co-occurrence counts fall in the statistically significant range, it is not sufficiently informative simply to declare them significant; instead, we should estimate a quantity measuring the strength of associations. For this purpose, affinity serves as a more reliable tool than null-hypothesis tail probabilities or standardized co-occurrence count $Z$. The correspondence between pv and affinity among the blue points in Figure 1a is best expressed by removing the curvature and compression of pv values in the tails through the transformation $\Phi^{-1}(\mathrm{pv})$ = qnorm(pv) (the standard-normal quantile function), upon which the blue points exhibit a linear decrease with slope approximately −1.16 (Figure S1a). In all moderate-to-large 2 × 2 tables, $Z$ and qnorm(pv) are approximately equivalent (Figure S1f), illustrating a general principle of approximate normality of $Z$ in large 2 × 2 tables, as explained in the caption of Figure S1f. Therefore, throughout the paper we argue interchangeably in terms of qnorm(pv) and $Z$. USSG indicated that a separate cluster (red points in Figure 1a) ‘mostly stem from fully nested species pairs, where the occurrences of the less abundant species are a proper subsample of those of the more abundant species’. Indeed, we confirmed that every point of this cluster comes from a fully nested species pair. This behaviour has been discussed in mathematical detail under the ‘ML estimation of α’ section of ‘Materials and Methods’ in Mainali et al. (2022).
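The quantities pv, qnorm(pv), and the standardized count Z discussed above can be computed from first principles. This is a minimal Python sketch, assuming the usual hypergeometric mean and variance; the function names and the example margins (20, 25, 50) are hypothetical, and R's `qnorm` is replaced by `statistics.NormalDist().inv_cdf`.

```python
import math
from statistics import NormalDist


def hypergeom_pmf(k, m_a, m_b, n):
    """Null P(X = k): species B occupies m_b of n sites, m_a of which hold species A."""
    return math.comb(m_a, k) * math.comb(n - m_a, m_b - k) / math.comb(n, m_b)


def upper_tail_pv(x, m_a, m_b, n):
    """pv = P(X >= x) under the hypergeometric null (x assumed inside the support)."""
    return sum(hypergeom_pmf(k, m_a, m_b, n) for k in range(x, min(m_a, m_b) + 1))


def standardized_count(x, m_a, m_b, n):
    """Z = (x - mean) / sd, using the hypergeometric mean and variance."""
    mean = m_a * m_b / n
    var = m_a * m_b * (n - m_a) * (n - m_b) / (n * n * (n - 1))
    return (x - mean) / math.sqrt(var)


m_a, m_b, n, x = 20, 25, 50, 14           # hypothetical margins and count
pv = upper_tail_pv(x, m_a, m_b, n)
q = NormalDist().inv_cdf(pv)               # plays the role of qnorm(pv)
z = standardized_count(x, m_a, m_b, n)
print(f"pv={pv:.4f}  qnorm(pv)={q:.3f}  Z={z:.3f}")
```

Because pv is an upper-tail probability in this sketch, a count above its null mean gives a small pv and a negative qnorm(pv) alongside a positive Z, which matches the decreasing relationship with affinity described in the text.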
The key point is that whenever the co-occurrence count equals its largest or smallest logically possible value, alpha-hat is positively or negatively infinite (observed in 83% of species pairs in this dataset; pie chart in Figure 1a). In such situations, our software caps the MLE at $\pm\log(2N^2)$, with mathematical justification given at Equation 8 in Mainali et al. (2022); see red and purple points in Figure 1a. This is a small-sample phenomenon requiring care in reporting the results because the (positive or negative) strengths of association compatible with the data are unboundedly large. The truncated affinity of species pairs with infinite affinity plotted against the number of sites, as shown in USSG's Figure 1a inset and our Figure 1b, reveals a deterministic logarithmic relation, an artefact of the way our software truncates positively infinite affinity. It is a feature of log odds ratios and not our software that requires nested species counts (a co-occurrence count coinciding with one of the species prevalence counts) to be treated as a special case. The recommended way to report nested counts is with the lower endpoint of a one-sided 95% confidence interval when the co-occurrence count is at its highest extreme and the upper endpoint when the co-occurrence count is at its lowest extreme (Figure 1c shows only lower confidence interval endpoints for nested species counts). USSG created an artificial ‘compartmented matrix’ of 22 species variously occupying 50 sites, thereby generating 231 species pairs. Since USSG provided no information about how the matrix entries were generated, we fixed the $N = 50$ sites and the counts of occupied sites by each of the 22 species and then created four co-occurrence counts for each pair of species.
For each species pair and triple $(m_A, m_B, N)$ of margins for species and total site counts in the compartmented matrix, we took as four observed co-occurrence counts the medians of the extended hypergeometric distribution (Harkness, 1965) with alpha fixed, respectively, at log(2), log(2.75), log(3.5) and log(4.25) (hereafter ‘α-specific co-occurrences’). Extended hypergeometric is the exact distribution of $X$ under the affinity model described in Mainali et al. (2022); the values log(2) = 0.69, log(2.75) = 1.01, log(3.5) = 1.25 and log(4.25) = 1.45 represent a range of moderate-to-strong positive associations among species site occurrences. For these pooled data, the relationship between affinity and qnorm(pv) (Figure 2a) is roughly linear, corresponding to a curvilinear affinity vs. pv relationship (Figure S1b), but among single-colour species pairs with shared value of alpha, the relationship is diffuse, especially in the purple and orange colour groups with larger true alpha. The relationship between affinity and $Z$ or standardized Jaccard index is much the same (Figure S1c,d). So, we see that affinity and $Z$ are not ‘equivalent’. We demonstrate the superiority of affinity over standardization by evaluating the probability mass function of both indices in settings with known nonzero alpha. We generated data with alpha fixed at 1.5 and 3 for each of four $(m_A, m_B, N)$ combinations. In all examples, the true probability masses $P(X = k)$ of each possible co-occurrence count $X$ are known. When the corresponding alpha-hat is plotted against probability mass, we observe a behaviour expected for a reliable estimate irrespective of $(m_A, m_B, N)$: centring at the respective true alpha (Figure 2d). However, the corresponding standardized co-occurrence count exhibited a complete lack of centring (Figure 2c). This behaviour remains the same for negative associations (Figure S2).
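The extended hypergeometric distribution, its median, and a capped maximum likelihood estimate of alpha can be sketched as follows. This Python code is an illustration and not the CooccurrenceAffinity implementation: it assumes the standard form in which the null hypergeometric weights are tilted by exp(alpha·k), with alpha the log odds ratio, it finds the MLE by bisection on the mean, and every function name is hypothetical. Only the cap at ±log(2N²) is taken from the text above.

```python
import math


def support(m_a, m_b, n):
    """Logically possible co-occurrence counts for fixed margins."""
    return range(max(0, m_a + m_b - n), min(m_a, m_b) + 1)


def ext_hypergeom_pmf(m_a, m_b, n, alpha):
    """P(X = k) proportional to C(m_a, k) * C(n - m_a, m_b - k) * exp(alpha * k)."""
    ks = list(support(m_a, m_b, n))
    logw = [math.log(math.comb(m_a, k)) + math.log(math.comb(n - m_a, m_b - k))
            + alpha * k for k in ks]
    top = max(logw)                           # subtract the max to avoid overflow
    w = [math.exp(v - top) for v in logw]
    total = sum(w)
    return {k: wi / total for k, wi in zip(ks, w)}


def mean_x(m_a, m_b, n, alpha):
    return sum(k * p for k, p in ext_hypergeom_pmf(m_a, m_b, n, alpha).items())


def median_x(m_a, m_b, n, alpha):
    """Smallest k whose cumulative probability reaches 1/2."""
    cum = 0.0
    for k, p in ext_hypergeom_pmf(m_a, m_b, n, alpha).items():
        cum += p
        if cum >= 0.5:
            return k
    return max(support(m_a, m_b, n))          # guard against rounding


def alpha_mle(x, m_a, m_b, n, tol=1e-10):
    """Solve mean_x(alpha) = x by bisection, capped at +/- log(2 n^2)."""
    ks = support(m_a, m_b, n)
    if len(ks) < 2 or x not in ks:
        raise ValueError("x must lie in a support with at least two points")
    cap = math.log(2 * n * n)
    if x == ks[-1]:
        return cap                            # MLE is +infinity, report the cap
    if x == ks[0]:
        return -cap                           # MLE is -infinity, report the cap
    lo, hi = -cap, cap
    while hi - lo > tol:                      # the mean increases with alpha
        mid = (lo + hi) / 2
        if mean_x(m_a, m_b, n, mid) < x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


m_a, m_b, n = 20, 25, 50                      # hypothetical margins
for a in (math.log(2), math.log(2.75), math.log(3.5), math.log(4.25)):
    x = median_x(m_a, m_b, n, a)
    print(f"alpha={a:.2f}  median X={x}  alpha_hat={alpha_mle(x, m_a, m_b, n):.2f}")
```

Setting alpha to zero in `ext_hypergeom_pmf` recovers the ordinary hypergeometric null, so the same function covers both the null model and the non-null affinity model in this sketch.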
The standardized $X$ values depend sensitively on absolute and relative magnitudes of $(m_A, m_B, N)$ and could mislead investigators into thinking that the degrees of association between species A and B occurrences differed across the four combinations of $(m_A, m_B, N)$ with the same underlying association. In conclusion, alpha-hat estimates a true degree-of-association target which standardized co-occurrence cannot. The same applies for standardized Jaccard (Figure S1e) and qnorm-transformed pv (Figure S1f) because of their linear mapping with standardized co-occurrence. USSG analysed their compartmented matrix data and concluded that ‘the affinity index proved to be equivalent to the standardized effect size of the traditional Jaccard metric’ (Appendix S2) by showing in their Figure 1b an almost linear relationship between the two, with all points nearly overlaying the trendline. This approximately linear relationship is observed mainly in the severely limited range of moderate strengths of association seen in this artificial matrix: 87% of the affinity values were between −1 and 1.5 in USSG's original ‘compartmented matrix’. Using our ‘α-specific co-occurrence’ data, we observed that each value of affinity corresponds to a wide range of standardized co-occurrence (Figure S1c) or standardized Jaccard values (Figure S1d) when the associations are strong, making the standardized indices imprecise measures of the departure from nullity (i.e. of $\alpha \neq 0$). Figure S1e confirms that the standardized $X$ and $J$ indices are essentially deterministically related and effectively equivalent. Null-hypothesis standardization is a useful summary of the null distribution (here, hypergeometric) primarily when the standardized distribution is close to a standard reference distribution, namely the normal distribution, regardless of the fixed margins ($m_A$, $m_B$, and $N$).
However, with small sample sizes, the distribution of the standardized co-occurrence count may become very non-normal and may instead be asymmetric and have large jumps, making standardization unreliable. USSG's second comment claims that affinity exhibits a curved relationship with number of sites in their compartmented matrix, with highly significant $R^2$ of 0.26 in quadratic regression (USSG Figure 1c). Their claim is invalid for two reasons: the quirky and unspecified nature of the ‘compartmented matrix’ co-occurrences they generated, along with the general invalidity of interpreting linear regression P-values as they do for occurrence matrix data. Using our ‘α-specific co-occurrence’ data, we show that USSG's observation of affinity being sensitive to number of sites completely disappears ($R^2$ < 0.013; quadratic regression) for each of the four α-specific scenarios (Figure 2b). Although we show Figure 2b for the sake of comparison with USSG, the P-values arising from linear or quadratic regressions like this with ecology/biogeography datasets are invalid. There are two reasons for this: (a) discreteness of the extended hypergeometric data and small samples imply that the P-values (generated from normal distributions in all standard software) are way off, and more importantly, (b) the data points for all species pairs in a measured set of species are not independent, a key assumption of standard regression software. The dependence arises because each species is reused in multiple species pairs. Of these two reasons, ecologists can probably tolerate non-normality of error distributions, although reasoning with those P-values on small data samples is sloppy science. But reason (b) implies that doing such regressions at all is misguided. See Appendix S3 for a third argument against the method of USSG's Figure 1c.
USSG's second comment also criticizes our affinity model for assuming that all sites are equally suitable for each species considered separately. That is a validly expressed limitation of the model. Yet, their strong preference for continuing to use null model P-values and standardized indices begs the question, since the hypergeometric null model that they along with Griffith et al. (2016) and Veech (2013) use within the FE model also assumes this same exchangeability. Our affinity model presents improvements on the FE model and standardized co-occurrence/indices. Appendix S4 addresses USSG's comments about our R script/package that we believe were misguided. Throughout their Commentary, and particularly in their summary, USSG persistently ignore the fact that their statistical toolkit contains no model of joint species occurrence under which pairwise occurrence shows strong association, and this is the whole point of our innovation of affinity. Without conditioning on prevalence counts $m_A$, $m_B$, mainstream statistics also treats essentially the same model within the class of loglinear models (Agresti, 2013): Mainali et al. (2022) presents our affinity model and supplies an ecology-relevant interpretation of it, and along with Mainali and Slud (2022) shows how to use our CooccurrenceAffinity R package to achieve scientifically sensible analyses under this model. KM was supported by the Grayce B. Kerr Fund, Inc. No permits were needed to carry out this research. R script and data for complete analysis and plots are available at https://github.com/kpmainali/Affinity_JBiogeo_Rejoinder. Supporting information: Appendix S1–S4.
Kumar Mainali is an ecologist and statistician who develops novel mathematical/statistical indices for pressing scientific challenges and uses cutting-edge AI technology for precision conservation. His research and predictive analytics operate at the intersection of ecology, conservation biology, biogeography, and climate change. He frequently works with species distribution models and expert maps. Eric Slud is a mathematical statistician working on problems of biostatistical survival analysis, survey sampling inference, and inference for stochastic processes, with recent forays into ecology, spatial statistics, and genomics. The connecting threads in these research interests are: formulation of statistical models, theoretical study of identifiability of parameters, and derivations of mathematical properties of estimators. Author Contributions: Kumar P. Mainali and Eric Slud conceived the ideas; Eric Slud led the development of new functions for standardized quantities. Kumar P. Mainali and Eric Slud analyzed the data and wrote the manuscript.
CooccurrenceAffinity: Affinity in Co-Occurrence Data
2023-05-03 · 1 citation
dataset · Open access · Senior author
Computes a novel metric of affinity between two entities based on their co-occurrence (using binary presence/absence data). The metric and its MLE, alpha hat, were advanced in Mainali, Slud, et al, 2021 <doi:10.1126/sciadv.abj9204>. Various types of confidence intervals and median interval were developed in Mainali and Slud, 2022 <doi:10.1101/2022.11.01.514801>. The 'finches' dataset is now bundled internally (no longer pulled via the cooccur package, which has been dropped).
CooccurrenceAffinity: An R package for computing a novel metric of affinity in co-occurrence data that corrects for pervasive errors in traditional indices
bioRxiv (Cold Spring Harbor Laboratory) · 2022-11-03 · 5 citations
preprint · Open access · Senior author
Analysis of co-occurrence data in ecology and biogeography with traditional indices has led to many problems. In our recent study (Mainali, Slud, Singer, & Fagan, 2022), we revealed the source of the problem that makes the traditional indices fundamentally flawed and completely unreliable, and we further developed a novel metric of association, alpha, with complete formulation of the null distribution for estimating the mechanism of affinity. We also developed the maximum likelihood estimate (MLE) of alpha in our prior study. Here, we introduce the CooccurrenceAffinity R package that computes alpha MLE. We provide functions to perform the analysis based on a 2×2 contingency table of occurrence/co-occurrence counts as well as an m×n presence-absence matrix (e.g., species by site matrix). The flexibility of the function allows a user to compute the alpha MLE for entity pairs on matrix columns based on presence-absence states recorded in the matrix rows, or for entities on matrix rows based on presence-absence recorded in columns. We also provide functions for plotting the computed indices. As novel components of this software paper not reported in the original study, we present theoretical discussions about median interval and four types of confidence intervals; we further develop functions (a) to compute those intervals, (b) to evaluate their true coverage probability of enclosing the population parameter, and (c) to generate figures. CooccurrenceAffinity is a practical and efficient R package with user-friendly functions for end-to-end analysis and plotting of co-occurrence data in various formats, making it possible to compute the recently developed metric of alpha MLE as well as its median and confidence intervals introduced in this paper. The package supplements its main output of the novel metric of association with the three most common traditional indices of association in co-occurrence data: Jaccard, Sørensen–Dice, and Simpson.
Differential richness inference for 16S rRNA marker gene surveys
Genome Biology · 2022-08-01 · 9 citations
article · Open access
BACKGROUND: Individual and environmental health outcomes are frequently linked to changes in the diversity of associated microbial communities. Thus, deriving health indicators based on microbiome diversity measures is essential. While microbiome data generated using high-throughput 16S rRNA marker gene surveys are appealing for this purpose, 16S surveys also generate a plethora of spurious microbial taxa. RESULTS: When this artificial inflation in the observed number of taxa is ignored, we find that changes in the abundance of detected taxa confound current methods for inferring differences in richness. Experimental evidence, theory-guided exploratory data analyses, and existing literature support the conclusion that most sub-genus discoveries are spurious artifacts of clustering 16S sequencing reads. We proceed to model a 16S survey's systematic patterns of sub-genus taxa generation as a function of genus abundance to derive a robust control for false taxa accumulation. These controls unlock classical regression approaches for highly flexible differential richness inference at various levels of the surveyed microbial assemblage: from sample groups to specific taxa collections. The proposed methodology for differential richness inference is available through an R package, Prokounter. CONCLUSIONS: False species discoveries bias richness estimation and confound differential richness inference. In the case of 16S microbiome surveys, supporting evidence indicates that most sub-genus taxa are spurious. Based on this finding, a flexible method is proposed and is shown to overcome the confounding problem noted with current approaches for differential richness inference. Package availability: https://github.com/mskb01/prokounter.
A better index for analysis of co-occurrence and similarity
Science Advances · 2022 · 57 citations
- Computer Science
- Data Mining
- Statistics
contradicted predictions of the island biogeography theory finding that community stability increased with increasing physical isolation. Reanalysis of the same dataset with the estimator $\hat{\alpha}$ reversed that result and supported theoretical predictions. We found similarly marked effects in reanalyses of antibiotic cross-resistance and human disease biomarkers. Our index α is not merely an improvement; its use changes data interpretation in fundamental ways.

Frequent coauthors

M. S. Binoj Kumar
20 shared
Rafael A. Irizarry
15 shared
Yves Thibaudeau
United States Census Bureau
13 shared
Kumar P. Mainali
12 shared
Carolina Franco
National Opinion Research Center
12 shared
Meiyu Shen
Renmin Hospital of Wuhan University
10 shared
Estelle Russek‐Cohen
10 shared
Stephanie C. Hicks
Johns Hopkins University
9 shared

Labs

Eric V. Slud's Lab · PI
Research in mathematical statistics and probability, including census statistics, survival data analysis, meta-analysis, pharmaceutical statistical methods, large-scale data problems, and stochastic processes.
