60 research outputs found

Bioconductor workflow for microbiome data analysis: from raw reads to community analyses [version 1; referees: 2 approved]

Author: Ben J. Callahan
Julia A. Fukuyama
Kris Sankaran
Paul J. McMurdie
Susan P. Holmes
Publication venue: 'F1000 Research Ltd'
Publication date: 01/06/2016

High-throughput sequencing of PCR-amplified taxonomic markers (like the 16S rRNA gene) has enabled a new level of analysis of complex bacterial communities known as microbiomes. Many tools exist to quantify and compare abundance levels or microbial composition of communities in different conditions. The sequencing reads have to be denoised and assigned to the closest taxa from a reference database. Common approaches use a notion of 97% similarity and normalize the data by subsampling to equalize library sizes. In this paper, we show that statistical models allow more accurate abundance estimates. By providing a complete workflow in R, we enable the user to do sophisticated downstream statistical analyses, including both parametric and nonparametric methods. We provide examples of using the R packages dada2, phyloseq, DESeq2, ggplot2 and vegan to filter, visualize and test microbiome data. We also provide examples of supervised analyses using random forests, partial least squares and linear models as well as nonparametric testing using community networks and the ggnetwork package.

Directory of Open Access Journals

Localized Plasticity in the Streamlined Genomes of Vinyl Chloride Respiring Dehalococcoides

Author: A Drummond
A Dufresne
A Wagner
AC Darling
AC Frank
AL Delcher
Alfred M. Spormann
Alla Lapidus
AM Cupples
AS Waller
B Rosner
BG Rahm
C Mao
C Regeard
D Charif
D Cheng
D Gordon
D Laslett
DE Fennell
DR Johnson
EPC Rocha
ER Hendrickson
Eugene Goltsman
F Choulet
Frank E. Löffler
H Nonaka
H Smidt
J He
J Kielhorn
J Maillard
J Schultz
JA Müller
JH Badger
JK Magnuson
JM Fung
Jochen A. Müller
Jonathan Göke
Josep Casadesús
K Mathee
K West
KC Keiler
Kirsti M. Ritalahti
KP Williams
L Adrian
M Berriman
M Bunge
M Kanehisa
M Krzywinski
M Kube
M Lynch
M Valens
ML Coleman
P Siguier
Paul J. McMurdie
PJ McMurdie
PL McCarty
R Krajmalnik-Brown
R Seshadri
RC Edgar
RM Morris
Ryan Wagner
S Behrens
S Casjens
S Cuadros-Orellana
S Guindon
S Karlin
S Kumar
S Kurtz
S Rozen
SA van Hijum
Sebastian F. Behrens
SF Altschul
SJ Giovannoni
Susan Holmes
TM Lowe
VF Holmes
VM Markowitz
X Maymó-Gatell
Publication venue: Public Library of Science
Publication date: 30/06/2009

Vinyl chloride (VC) is a human carcinogen and widespread priority pollutant. Here we report the first, to our knowledge, complete genome sequences of microorganisms able to respire VC, Dehalococcoides sp. strains VS and BAV1. Notably, the respective VC reductase encoding genes, vcrAB and bvcAB, were found embedded in distinct genomic islands (GEIs) with different predicted integration sites, suggesting that these genes were acquired horizontally and independently by distinct mechanisms. A comparative analysis that included two previously sequenced Dehalococcoides genomes revealed a contextually conserved core that is interrupted by two high plasticity regions (HPRs) near the Ori. These HPRs contain the majority of GEIs and strain-specific genes identified in the four Dehalococcoides genomes, an elevated number of repeated elements including insertion sequences (IS), as well as 91 of 96 rdhAB, genes that putatively encode terminal reductases in organohalide respiration. Only three core rdhA orthologous groups were identified, and only one of these groups is supported by synteny. The low number of core rdhAB, contrasted with the high rdhAB numbers per genome (up to 36 in strain VS), as well as their colocalization with GEIs and other signatures for horizontal transfer, suggests that niche adaptation via organohalide respiration is a fundamental ecological strategy in Dehalococcoides. This adaptation has been exacted through multiple mechanisms of recombination that are mainly confined within HPRs of an otherwise remarkably stable, syntenic, streamlined genome among the smallest of any free-living microorganism.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

UNT Digital Library

Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics

Author: A. McKenna
Anthony M. Bolger
AV Aho
Ayton Meintjes
Azza E. Ahmed
Bo Liu
Brian D. O’Connor
Bryan N. Howie
C. Victor Jongeneel
Don Armstrong
Enis Afgan
Eugene de Beste
F. Muniz-Fernandez
Faisal M. Fadlelmola
Fourie Joubert
Geir Kjetil Sandve
Gerrit Botha
Hocine Bendou
Jared O'Connell
Jennie Zermeno
Jeremy Goecks
Jia-Nee Foo
Katherine Wolstencroft
Lerato E. Magosi
Liudmila Sergeevna Mainzer
Long Yi
Mamana Mbiyavanga
Mark A DePristo
Martin Kircher
Melissa J. Landrum
Michael C. Nelson
Michael Crusoe
Mohamed Abouelhoda
Mustafa Alghali
Nicola J. Mulder
Nicola Mulder
Oussema Souiai
Pablo Cingolani
Paul J. McMurdie
Peter van Heusden
Phelelani T. Mpangase
Sally R. Ellingson
Scott Hazelhurst
Shakuntala Baichoo
Shaun Aron
Stephen Turner
Sumir Panji
Wade L Schulz
Yaping Yang
Yassine Souilmi
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

Background: The Pan-African bioinformatics network, H3ABioNet, comprises 27 research institutions in 17 African countries. H3ABioNet is part of the Human Health and Heredity in Africa program (H3Africa), an African-led research consortium funded by the US National Institutes of Health and the UK Wellcome Trust, aimed at using genomics to study and improve the health of Africans. A key role of H3ABioNet is to support H3Africa projects by building bioinformatics infrastructure such as portable and reproducible bioinformatics workflows for use on heterogeneous African computing environments. Processing and analysis of genomic data is an example of a big data application requiring complex interdependent data analysis workflows. Such bioinformatics workflows take the primary and secondary input data through several computationally-intensive processing steps using different software packages, where some of the outputs form inputs for other steps. Implementing scalable, reproducible, portable and easy-to-use workflows is particularly challenging. Results: H3ABioNet has built four workflows to support (1) the calling of variants from high-throughput sequencing data; (2) the analysis of microbial populations from 16S rDNA sequence data; (3) genotyping and genome-wide association studies; and (4) single nucleotide polymorphism imputation. A week-long hackathon was organized in August 2016 with participants from six African bioinformatics groups, and US and European collaborators. Two of the workflows are built using the Common Workflow Language framework (CWL) and two using Nextflow. All the workflows are containerized for improved portability and reproducibility using Docker, and are publicly available for use by members of the H3Africa consortium and the international research community. 
Conclusion: The H3ABioNet workflows have been implemented in view of offering ease of use for the end user and high levels of reproducibility and portability, all while following modern state-of-the-art bioinformatics data processing protocols. The H3ABioNet workflows will service the H3Africa consortium projects and are currently in use. All four workflows are also publicly available for research scientists worldwide to use and adapt for their respective needs. The H3ABioNet workflows will help develop bioinformatics capacity and assist genomics research within Africa and serve to increase the scientific output of H3Africa and its Pan-African Bioinformatics Network.

Cape Town University OpenUCT

Crossref

Adelaide Research & Scholarship

Directory of Open Access Journals

University of the Western Cape Research Repository

UPSpace at the University of Pretoria

Localized Plasticity in the Streamlined Genomes of Vinyl Chloride Respiring Dehalococcoides

Author: Paul J. McMurdie
Publication venue: eScholarship, University of California
Publication date: 06/11/2009

Ezid

eScholarship - University of California

phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data

Author: Paul J. McMurdie
Susan Holmes
Publication date: 22/04/2013

Background: The analysis of microbial communities through DNA sequencing brings many challenges: the integration of different types of data with methods from ecology, genetics, phylogenetics, multivariate statistics, visualization and testing. With the increased breadth of experimental designs now being pursued, project-specific statistical analyses are often needed, and these analyses are often difficult (or impossible) for peer researchers to independently reproduce. The vast majority of the requisite tools for performing these analyses reproducibly are already implemented in R and its extensions (packages), but with limited support for high throughput microbiome census data. Results: Here we describe a software project, phyloseq, dedicated to the object-oriented representation and analysis of microbiome census data in R. It supports importing data from a variety of common formats, as well as many analysis techniques. These include calibration, filtering, subsetting, agglomeration, multi-table comparisons, diversity analysis, parallelized Fast UniFrac, ordination methods, and production of publication-quality graphics; all in a manner that is easy to document, share, and modify. We show how to apply functions from other R packages to phyloseq-represented data, illustrating the availability of a large number of open source analysis techniques. We discuss the use of phyloseq with tools for reproducible research, a practice common in other fields but still rare in the analysis of highly parallel microbiome census data. We have made available all of the materials necessary to completely reproduce the analysis and figures included in this article, an example of best practices for reproducible research. Conclusions: The phyloseq project for R is a new open-source software package, freely available on the web from both GitHub and Bioconductor.

CiteSeerX

Directory of Open Access Journals

PubMed Central

FigShare

Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible

Author: Paul J. McMurdie
Susan Holmes
Publication date: 12/12/2013

Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
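To make the two normalizations criticized in this abstract concrete, the following is a minimal Python sketch, not the authors' code (their tooling is in R). It assumes rarefying means subsampling reads without replacement down to the smallest library size; the function names `rarefy` and `proportions` and the two example samples are hypothetical.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a vector of OTU counts."""
    total = sum(counts)
    if depth > total:
        raise ValueError("depth exceeds library size; the sample would be discarded")
    # Expand the counts into one label per read, then draw without replacement.
    reads = [otu for otu, n in enumerate(counts) for _ in range(n)]
    picked = random.Random(seed).sample(reads, depth)
    rarefied = [0] * len(counts)
    for otu in picked:
        rarefied[otu] += 1
    return rarefied

def proportions(counts):
    """Divide each count by the library size (simple proportions)."""
    total = sum(counts)
    return [n / total for n in counts]

# Two hypothetical samples with identical composition but different library sizes.
small = [30, 15, 5]        # 50 reads
large = [3000, 1500, 500]  # 5000 reads

depth = min(sum(small), sum(large))
rarefied_large = rarefy(large, depth)
discarded_reads = sum(large) - depth  # reads thrown away from the larger library
```

In this toy case rarefying discards 4950 of the 5000 reads in the larger sample, while simple proportions make the two samples look identical and hide the fact that one estimate rests on 100 times more data. Both effects are the kind of inefficiency the abstract describes.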

arXiv.org e-Print Archive

Directory of Open Access Journals

PubMed Central

FigShare

Normalization by rarefying only, dependency on library size threshold.

Author: Paul J. McMurdie
Susan Holmes

Unlike the analytical methods represented in Figure 4 of the main article (doi:10.1371/journal.pcbi.1003531), here rarefying is the only normalization method used, but at varying values of the minimum library size threshold, shown as library-size quantile (horizontal axis). Panel columns, panel rows, and point/line shading indicate effect size (ES), median library size, and distance method applied after rarefying, respectively. Because discarded samples cannot be accurately clustered, the line is the maximum achievable accuracy.

FigShare

Examples of overdispersion in microbiome data.

Author: Paul J. McMurdie
Susan Holmes

Common-Scale Variance versus Mean for Microbiome Data. Each point in each panel represents a different OTU's mean/variance estimate for a biological replicate and study. The data in this figure come from the Global Patterns survey [48] and the Long-Term Dietary Patterns study [75], with results from additional studies included in Protocol S1 of the main article (doi:10.1371/journal.pcbi.1003531). (Right) Variance versus mean abundance for rarefied counts. (Left) Common-scale variances and common-scale means, estimated according to Equations 6 and 7 from Anders and Huber [13], implemented in the DESeq package (Text S1). The dashed gray line denotes the σ² = μ case (Poisson; φ = 0). The cyan curve denotes the fitted variance estimate using DESeq [13], with method = ‘pooled’, sharingMode = ‘fit-only’, fitType = ‘local’.
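The caption's Poisson reference line (σ² = μ, φ = 0) is the special case of the usual Negative Binomial mean-variance relation σ² = μ + φμ². The sketch below illustrates that relation with a simple method-of-moments estimate of φ. This is an assumption made for illustration and is not the DESeq estimator referred to in the caption; the function names and replicate counts are hypothetical.

```python
from statistics import mean, variance

def nb_variance(mu, phi):
    """Negative Binomial variance mu + phi * mu**2; phi = 0 gives the Poisson case."""
    return mu + phi * mu ** 2

def moment_dispersion(counts):
    """Method-of-moments estimate of phi from replicate counts, floored at 0."""
    m = mean(counts)
    s2 = variance(counts)  # sample variance (n - 1 denominator)
    return max(0.0, (s2 - m) / m ** 2)

# Hypothetical counts of one OTU across four replicates, assumed already on a common scale.
replicates = [10, 40, 25, 85]
phi_hat = moment_dispersion(replicates)

poisson_var = nb_variance(mean(replicates), 0.0)    # what the dashed line would predict
fitted_var = nb_variance(mean(replicates), phi_hat) # recovers the sample variance
```

Here the sample variance (1050) is far above the mean (40), so the estimated φ is well above zero. A point like this would sit above the dashed Poisson line in the figure, which is what overdispersion means.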

FigShare

Performance of differential abundance detection with and without rarefying.

Author: Paul J. McMurdie
Susan Holmes

Performance is summarized here by the “Area Under the Curve” (AUC) metric of a Receiver Operating Characteristic (ROC) curve [59] (vertical axis). Briefly, the AUC value varies from 0.5 (random) to 1.0 (perfect), incorporating both sensitivity and specificity. The horizontal axis indicates the effect size, shown as the actual multiplication factor applied to OTU counts in the test class to simulate a differential abundance. Each curve traces the respective normalization method's mean performance of that panel, with a vertical bar indicating a standard deviation in performance across all replicates and microbiome templates. The right-hand side of the panel rows indicates the median library size, while the darkness of line shading indicates the number of samples per simulated experiment. Color shade and shape indicate the normalization method. See the Methods section of the main article (doi:10.1371/journal.pcbi.1003531) for the definitions of each normalization and testing method. For all methods, detection among multiple tests was defined using a False Discovery Rate (Benjamini-Hochberg [52]) significance threshold of 0.05.
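The two evaluation ingredients named in this caption, a Benjamini-Hochberg threshold at 0.05 and the AUC, can be sketched in a few lines of Python. This is a textbook illustration under assumed inputs, not the simulation code behind the figure; the p-values and truth labels are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return booleans marking hypotheses rejected at false discovery rate `alpha`."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

def auc(scores, labels):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical p-values for six OTUs; the first three are truly differentially abundant.
pvals = [0.001, 0.008, 0.039, 0.041, 0.27, 0.6]
truth = [True, True, True, False, False, False]

calls = benjamini_hochberg(pvals)          # which OTUs are called significant
score_auc = auc([-p for p in pvals], truth)  # smaller p-value = stronger evidence
```

In this example the ranking is perfect (AUC of 1.0), yet only two of the three true positives pass the 0.05 FDR threshold. The AUC measures how well a method ranks OTUs, whereas the threshold determines which ones are actually called.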

FigShare