60 research outputs found

    Bioconductor workflow for microbiome data analysis: from raw reads to community analyses [version 1; referees: 2 approved]

    High-throughput sequencing of PCR-amplified taxonomic markers (such as the 16S rRNA gene) has enabled a new level of analysis of complex bacterial communities known as microbiomes. Many tools exist to quantify and compare abundance levels or microbial composition of communities in different conditions. The sequencing reads have to be denoised and assigned to the closest taxa from a reference database. Common approaches use a notion of 97% similarity and normalize the data by subsampling to equalize library sizes. In this paper, we show that statistical models allow more accurate abundance estimates. By providing a complete workflow in R, we enable the user to do sophisticated downstream statistical analyses, including both parametric and nonparametric methods. We provide examples of using the R packages dada2, phyloseq, DESeq2, ggplot2 and vegan to filter, visualize and test microbiome data. We also provide examples of supervised analyses using random forests, partial least squares and linear models, as well as nonparametric testing using community networks and the ggnetwork package.

    Localized Plasticity in the Streamlined Genomes of Vinyl Chloride Respiring Dehalococcoides

    Vinyl chloride (VC) is a human carcinogen and widespread priority pollutant. Here we report the first, to our knowledge, complete genome sequences of microorganisms able to respire VC, Dehalococcoides sp. strains VS and BAV1. Notably, the respective VC reductase encoding genes, vcrAB and bvcAB, were found embedded in distinct genomic islands (GEIs) with different predicted integration sites, suggesting that these genes were acquired horizontally and independently by distinct mechanisms. A comparative analysis that included two previously sequenced Dehalococcoides genomes revealed a contextually conserved core that is interrupted by two high plasticity regions (HPRs) near the Ori. These HPRs contain the majority of GEIs and strain-specific genes identified in the four Dehalococcoides genomes, an elevated number of repeated elements including insertion sequences (IS), as well as 91 of 96 rdhAB, genes that putatively encode terminal reductases in organohalide respiration. Only three core rdhA orthologous groups were identified, and only one of these groups is supported by synteny. The low number of core rdhAB, contrasted with the high rdhAB numbers per genome (up to 36 in strain VS), as well as their colocalization with GEIs and other signatures for horizontal transfer, suggests that niche adaptation via organohalide respiration is a fundamental ecological strategy in Dehalococcoides. This adaptation has been exacted through multiple mechanisms of recombination that are mainly confined within HPRs of an otherwise remarkably stable, syntenic, streamlined genome among the smallest of any free-living microorganism.

    Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics

    Background: The Pan-African bioinformatics network, H3ABioNet, comprises 27 research institutions in 17 African countries. H3ABioNet is part of the Human Heredity and Health in Africa program (H3Africa), an African-led research consortium funded by the US National Institutes of Health and the UK Wellcome Trust, aimed at using genomics to study and improve the health of Africans. A key role of H3ABioNet is to support H3Africa projects by building bioinformatics infrastructure such as portable and reproducible bioinformatics workflows for use on heterogeneous African computing environments. Processing and analysis of genomic data is an example of a big data application requiring complex interdependent data analysis workflows. Such bioinformatics workflows take the primary and secondary input data through several computationally intensive processing steps using different software packages, where some of the outputs form inputs for other steps. Implementing scalable, reproducible, portable and easy-to-use workflows is particularly challenging. Results: H3ABioNet has built four workflows to support (1) the calling of variants from high-throughput sequencing data; (2) the analysis of microbial populations from 16S rDNA sequence data; (3) genotyping and genome-wide association studies; and (4) single nucleotide polymorphism imputation. A week-long hackathon was organized in August 2016 with participants from six African bioinformatics groups, and US and European collaborators. Two of the workflows are built using the Common Workflow Language framework (CWL) and two using Nextflow. All the workflows are containerized for improved portability and reproducibility using Docker, and are publicly available for use by members of the H3Africa consortium and the international research community. Conclusion: The H3ABioNet workflows have been implemented with a view to offering ease of use for the end user and high levels of reproducibility and portability, all while following modern state-of-the-art bioinformatics data processing protocols. The H3ABioNet workflows will service the H3Africa consortium projects and are currently in use. All four workflows are also publicly available for research scientists worldwide to use and adapt for their respective needs. The H3ABioNet workflows will help develop bioinformatics capacity and assist genomics research within Africa and serve to increase the scientific output of H3Africa and its Pan-African Bioinformatics Network.

    phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data

    Background: The analysis of microbial communities through DNA sequencing brings many challenges: the integration of different types of data with methods from ecology, genetics, phylogenetics, multivariate statistics, visualization and testing. With the increased breadth of experimental designs now being pursued, project-specific statistical analyses are often needed, and these analyses are often difficult (or impossible) for peer researchers to independently reproduce. The vast majority of the requisite tools for performing these analyses reproducibly are already implemented in R and its extensions (packages), but with limited support for high throughput microbiome census data. Results: Here we describe a software project, phyloseq, dedicated to the object-oriented representation and analysis of microbiome census data in R. It supports importing data from a variety of common formats, as well as many analysis techniques. These include calibration, filtering, subsetting, agglomeration, multi-table comparisons, diversity analysis, parallelized Fast UniFrac, ordination methods, and production of publication-quality graphics; all in a manner that is easy to document, share, and modify. We show how to apply functions from other R packages to phyloseq-represented data, illustrating the availability of a large number of open source analysis techniques. We discuss the use of phyloseq with tools for reproducible research, a practice common in other fields but still rare in the analysis of highly parallel microbiome census data. We have made available all of the materials necessary to completely reproduce the analysis and figures included in this article, an example of best practices for reproducible research. Conclusions: The phyloseq project for R is a new open-source software package, freely available on the web from both GitHub and Bioconductor.

    Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible

    Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
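The cost of rarefying described in this abstract can be illustrated with a minimal sketch. The code below is not taken from the paper or from phyloseq; the function name `rarefy`, the toy count vectors, and the depth of 100 reads are all assumptions chosen for illustration. It subsamples each sample's reads without replacement to a common depth and drops any sample that falls below it.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample OTU counts to exactly `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads, which
    mirrors how rarefying discards shallow samples entirely.
    """
    if sum(counts) < depth:
        return None
    rng = random.Random(seed)
    # Expand the count vector into one label per read, then draw reads.
    pool = [otu for otu, n in enumerate(counts) for _ in range(n)]
    kept = rng.sample(pool, depth)
    out = [0] * len(counts)
    for otu in kept:
        out[otu] += 1
    return out


# Hypothetical toy data: three samples with very different library sizes.
samples = {"A": [50, 30, 20], "B": [500, 300, 200], "C": [5, 3, 1]}
depth = 100

rarefied = {name: rarefy(c, depth) for name, c in samples.items()}

reads_before = sum(sum(c) for c in samples.values())
reads_after = sum(sum(r) for r in rarefied.values() if r is not None)
discarded_reads = reads_before - reads_after
```

In this toy case sample C is dropped outright and most of sample B's reads are thrown away, which is the loss of information the abstract argues against.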

    Normalization by rarefying only, dependency on library size threshold.

    Unlike the analytical methods represented in Figure 4, here rarefying is the only normalization method used, but at varying values of the minimum library size threshold, shown as library-size quantile (horizontal axis). Panel columns, panel rows, and point/line shading indicate effect size (ES), median library size (), and distance method applied after rarefying, respectively. Because discarded samples cannot be accurately clustered, the line is the maximum achievable accuracy.

    Examples of overdispersion in microbiome data.

    Common-Scale Variance versus Mean for Microbiome Data. Each point in each panel represents a different OTU's mean/variance estimate for a biological replicate and study. The data in this figure come from the Global Patterns survey [48] and the Long-Term Dietary Patterns study [75], with results from additional studies included in Protocol S1. (Right) Variance versus mean abundance for rarefied counts. (Left) Common-scale variances and common-scale means, estimated according to Equations 6 and 7 from Anders and Huber [13], implemented in the DESeq package (Text S1). The dashed gray line denotes the σ² = μ case (Poisson; φ = 0). The cyan curve denotes the fitted variance estimate using DESeq [13], with method = ‘pooled’, sharingMode = ‘fit-only’, fitType = ‘local’.
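To make the Poisson reference line in this caption concrete: under the usual Negative Binomial parameterization the variance is σ² = μ + φμ², so φ = 0 recovers the Poisson case σ² = μ. The sketch below is an assumption-laden illustration, not the DESeq fitting procedure named in the caption. It uses a simple method-of-moments estimate, φ̂ = max(0, (s² − m) / m²), and the function name `moment_dispersion` and the toy count vectors are hypothetical.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments estimate of the dispersion phi.

    Assumes variance = mean + phi * mean**2 and solves for phi,
    clipping at zero so Poisson-like data report no overdispersion.
    This is a simplification, not the estimator used by DESeq.
    """
    m = mean(counts)
    if m == 0:
        return 0.0
    s2 = variance(counts)  # sample variance (n - 1 denominator)
    return max(0.0, (s2 - m) / (m * m))


# Hypothetical replicate counts for two OTUs.
poisson_like = [9, 10, 11, 10, 10]   # variance below the mean
overdispersed = [0, 2, 50, 1, 47]    # variance far above the mean

phi_poisson = moment_dispersion(poisson_like)
phi_over = moment_dispersion(overdispersed)
```

An OTU whose estimate is well above zero would sit above the dashed σ² = μ line in a variance-versus-mean plot of this kind.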

    Performance of differential abundance detection with and without rarefying.

    Performance is summarized here by the “Area Under the Curve” (AUC) metric of a Receiver Operating Characteristic (ROC) curve [59] (vertical axis). Briefly, the AUC value varies from 0.5 (random) to 1.0 (perfect), incorporating both sensitivity and specificity. The horizontal axis indicates the effect size, shown as the actual multiplication factor applied to OTU counts in the test class to simulate a differential abundance. Each curve traces the respective normalization method's mean performance of that panel, with a vertical bar indicating a standard deviation in performance across all replicates and microbiome templates. The right-hand side of the panel rows indicates the median library size, , while the darkness of line shading indicates the number of samples per simulated experiment. Color shade and shape indicate the normalization method. See the Methods section for the definitions of each normalization and testing method. For all methods, detection among multiple tests was defined using a False Discovery Rate (Benjamini-Hochberg [52]) significance threshold of 0.05.
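The Benjamini-Hochberg threshold mentioned at the end of this caption can be sketched as follows. This is a generic textbook version of the step-up procedure, not code from the paper; the function name `benjamini_hochberg` and the example p-values are assumptions made for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: find the largest rank k (1-based, p-values sorted
    ascending) with p_(k) <= (k / m) * alpha, then reject the k
    smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for four OTUs, in their original order.
pvals = [0.03, 0.001, 0.035, 0.9]
calls = benjamini_hochberg(pvals, alpha=0.05)
```

Note that 0.03 is rejected here even though it exceeds its own rank threshold of 0.025, because a larger p-value (0.035) passes at rank 3; this is what distinguishes the step-up rule from comparing each p-value to its threshold in isolation.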