524 research outputs found
Metagenomic microbial community profiling using unique clade-specific marker genes
Metagenomic shotgun sequencing data can identify microbes populating a microbial community and their proportions, but existing taxonomic profiling methods are inefficient for increasingly large datasets. We present an approach that uses clade-specific marker genes to unambiguously assign reads to microbial clades more accurately and >50× faster than current approaches. We validated MetaPhlAn on terabases of short reads and provide the largest metagenomic profiling to date of the human gu
Data and Statistical Methods To Analyze the Human Microbiome
The Waldron lab for computational biostatistics bridges the areas of cancer genomics and microbiome studies for public health, developing methods to exploit publicly available data resources and to integrate -omics studies
Metagenomic biomarker discovery and explanation
This study describes and validates a new method for metagenomic biomarker discovery by way of class comparison, tests of biological consistency and effect size estimation. This addresses the challenge of finding organisms, genes, or pathways that consistently explain the differences between two or more microbial communities, which is a central problem to the study of metagenomics. We extensively validate our method on several microbiomes and a convenient online interface for the method is provided at http://huttenhower.sph.harvard.edu/lefse/. National Institute of Dental and Craniofacial Research (U.S.) (grant DE017106); National Institutes of Health (U.S.) (NIH grant AI078942); Burroughs Wellcome Fund; National Institutes of Health (U.S.) (NIH 1R01HG005969
Cross-study validation for the assessment of prediction algorithms
Motivation: Numerous competing algorithms for prediction in high-dimensional settings have been developed in the statistical and machine-learning literature. Learning algorithms and the prediction models they generate are typically evaluated on the basis of cross-validation error estimates in a few exemplary datasets. However, in most applications, the ultimate goal of prediction modeling is to provide accurate predictions for independent samples obtained in different settings. Cross-validation within exemplary datasets may not adequately reflect performance in the broader application context. Methods: We develop and implement a systematic approach to ‘cross-study validation’, to replace or supplement conventional cross-validation when evaluating high-dimensional prediction models in independent datasets. We illustrate it via simulations and in a collection of eight estrogen-receptor positive breast cancer microarray gene-expression datasets, where the objective is predicting distant metastasis-free survival (DMFS). We computed the C-index for all pairwise combinations of training and validation datasets. We evaluate several alternatives for summarizing the pairwise validation statistics, and compare these to conventional cross-validation. Results: Our data-driven simulations and our application to survival prediction with eight breast cancer microarray datasets, suggest that standard cross-validation produces inflated discrimination accuracy for all algorithms considered, when compared to cross-study validation. Furthermore, the ranking of learning algorithms differs, suggesting that algorithms performing best in cross-validation may be suboptimal when evaluated through independent validation. Availability: The survHD: Survival in High Dimensions package (http://www.bitbucket.org/lwaldron/survhd) will be made available through Bioconductor. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online
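The pairwise scheme described in the abstract above (train on one dataset, validate on each other dataset, record the C-index) can be sketched as follows. This is a minimal illustration in plain Python, not the survHD implementation: the names `c_index`, `cross_study_matrix`, `fit` and `predict` are hypothetical, the "learning algorithm" is a toy one-feature rule, the data are made up, and pairs with tied survival times are simply skipped.

```python
from itertools import combinations


def c_index(times, events, scores):
    """Concordance index: among comparable pairs, the fraction in which the
    subject with the shorter observed time has the higher risk score.
    Simplification (assumption): pairs with tied times are skipped."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue
        # a is the subject with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # earlier subject was censored: pair is not comparable
        comparable += 1
        if scores[a] > scores[b]:
            concordant += 1.0
        elif scores[a] == scores[b]:
            concordant += 0.5
    return concordant / comparable if comparable else float("nan")


def fit(dataset):
    """Toy 'learning algorithm' (hypothetical): learn only the direction in
    which the single feature x relates to survival time."""
    n = len(dataset["x"])
    mean_x = sum(dataset["x"]) / n
    mean_t = sum(dataset["time"]) / n
    cov = sum((x - mean_x) * (t - mean_t)
              for x, t in zip(dataset["x"], dataset["time"]))
    # Negative covariance: higher x goes with shorter survival, so risk = +x.
    return 1.0 if cov <= 0 else -1.0


def predict(model, x):
    """Risk scores from the toy model."""
    return [model * v for v in x]


def cross_study_matrix(datasets, fit, predict):
    """C-index for every ordered (training, validation) pair of distinct datasets."""
    results = {}
    for train_name, train in datasets.items():
        model = fit(train)
        for test_name, test in datasets.items():
            if test_name == train_name:
                continue  # the diagonal would be resubstitution, not cross-study
            scores = predict(model, test["x"])
            results[(train_name, test_name)] = c_index(
                test["time"], test["event"], scores)
    return results


# Made-up data for illustration only.
datasets = {
    "A": {"x": [1, 2, 3, 4], "time": [8, 6, 4, 2], "event": [1, 1, 1, 1]},
    "B": {"x": [0.5, 1.5, 2.5], "time": [9, 5, 1], "event": [1, 1, 0]},
}
Z = cross_study_matrix(datasets, fit, predict)
```

The off-diagonal entries of `Z` are the pairwise validation statistics; how best to summarize them is the question the abstract says the authors evaluate.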
Relating the metatranscriptome and metagenome of the human gut
Although the composition of the human microbiome is now well studied, the microbiota’s >8 million genes and their regulation remain largely uncharacterized. This knowledge gap is in part because of the difficulty of acquiring large numbers of samples amenable to functional studies of the microbiota. We conducted what is, to our knowledge, one of the first human microbiome studies in a well-phenotyped prospective cohort incorporating taxonomic, metagenomic, and metatranscriptomic profiling at multiple body sites using self-collected samples. Stool and saliva were provided by eight healthy subjects, with the former preserved by three different methods (freezing, ethanol, and RNAlater) to validate self-collection. Within-subject microbial species, gene, and transcript abundances were highly concordant across sampling methods, with only a small fraction of transcripts (<5%) displaying between-method variation. Next, we investigated relationships between the oral and gut microbial communities, identifying a subset of abundant oral microbes that routinely survive transit to the gut, but with minimal transcriptional activity there. Finally, systematic comparison of the gut metagenome and metatranscriptome revealed that a substantial fraction (41%) of microbial transcripts were not differentially regulated relative to their genomic abundances. Of the remainder, consistently underexpressed pathways included sporulation and amino acid biosynthesis, whereas up-regulated pathways included ribosome biogenesis and methanogenesis. Across subjects, metatranscriptional profiles were significantly more individualized than DNA-level functional profiles, but less variable than microbial composition, indicative of subject-specific whole-community regulation. The results thus detail relationships between community genomic potential and gene expression in the gut, and establish the feasibility of metatranscriptomic investigations in subject-collected and shipped samples
A Guide to Enterotypes across the Human Body: Meta-Analysis of Microbial Community Structures in Human Microbiome Datasets
Recent analyses of human-associated bacterial diversity have categorized individuals into ‘enterotypes’ or clusters based on the abundances of key bacterial genera in the gut microbiota. There is a lack of consensus, however, on the analytical basis for enterotypes and on the interpretation of these results. We tested how the following factors influenced the detection of enterotypes: clustering methodology, distance metrics, OTU-picking approaches, sequencing depth, data type (whole genome shotgun (WGS) vs. 16S rRNA gene sequence data), and 16S rRNA region. We included 16S rRNA gene sequences from the Human Microbiome Project (HMP) and from 16 additional studies and WGS sequences from the HMP and MetaHIT. In most body sites, we observed smooth abundance gradients of key genera without discrete clustering of samples. Some body habitats displayed bimodal (e.g., gut) or multimodal (e.g., vagina) distributions of sample abundances, but not all clustering methods and workflows accurately highlight such clusters. Because identifying enterotypes in datasets depends not only on the structure of the data but is also sensitive to the methods applied to identifying clustering strength, we recommend that multiple approaches be used and compared when testing for enterotypes
Composition of the Adult Digestive Tract Bacterial Microbiome Based on Seven Mouth Surfaces, Tonsils, Throat and Stool Samples
Background: To understand the relationship between our bacterial microbiome and health, it is essential to define the microbiome in the absence of disease. The digestive tract includes diverse habitats and hosts the human body's greatest bacterial density. We describe the bacterial community composition of ten digestive tract sites from more than 200 normal adults enrolled in the Human Microbiome Project, and metagenomically determined metabolic potentials of four representative sites. Results: The microbiota of these diverse habitats formed four groups based on similar community compositions: buccal mucosa, keratinized gingiva, hard palate; saliva, tongue, tonsils, throat; sub- and supra-gingival plaques; and stool. Phyla initially identified from environmental samples were detected throughout this population, primarily TM7, SR1, and Synergistetes. Genera with pathogenic members were well-represented among this disease-free cohort. Tooth-associated communities were distinct, but not entirely dissimilar, from other oral surfaces. The Porphyromonadaceae, Veillonellaceae and Lachnospiraceae families were common to all sites, but the distributions of their genera varied significantly. Most metabolic processes were distributed widely throughout the digestive tract microbiota, with variations in metagenomic abundance between body habitats. These included shifts in sugar transporter types between the supragingival plaque, other oral surfaces, and stool; hydrogen and hydrogen sulfide production were also differentially distributed. Conclusions: The microbiomes of ten digestive tract sites separated into four types based on composition. A core set of metabolic pathways was present across these diverse digestive tract habitats. These data provide a critical baseline for future studies investigating local and systemic diseases affecting human health
Report on emerging technologies for translational bioinformatics: a symposium on gene expression profiling for archival tissues
Background: With over 20 million formalin-fixed, paraffin-embedded (FFPE) tissue samples archived each year in the United States alone, archival tissues remain a vast and under-utilized resource in the genomic study of cancer. Technologies have recently been introduced for whole-transcriptome amplification and microarray analysis of degraded mRNA fragments from FFPE samples, and studies of these platforms have only recently begun to enter the published literature
BAYESIAN NONPARAMETRIC CROSS-STUDY VALIDATION OF PREDICTION METHODS
We consider comparisons of statistical learning algorithms using multiple data sets, via leave-one-in cross-study validation: each of the algorithms is trained on one data set; the resulting model is then validated on each remaining data set. This poses two statistical challenges that need to be addressed simultaneously. The first is the assessment of study heterogeneity, with the aim of identifying a subset of studies within which algorithm comparisons can be reliably carried out. The second is the comparison of algorithms using the ensemble of data sets. We address both problems by integrating clustering and model comparison. We formulate a Bayesian model for the array of cross-study validation statistics, which defines clusters of studies with similar properties and provides the basis for meaningful algorithm comparison in the presence of study heterogeneity. We illustrate our approach through simulations involving studies with varying severity of systematic errors, and in the context of medical prognosis for patients diagnosed with cancer, using high-throughput measurements of the transcriptional activity of the tumor’s genes
curatedOvarianData: clinically annotated data for the ovarian cancer transcriptome
This article introduces a manually curated data collection for gene expression meta-analysis of patients with ovarian cancer and software for reproducible preparation of similar databases. This resource provides uniformly prepared microarray data for 2970 patients from 23 studies with curated and documented clinical metadata. It allows users to efficiently identify studies and patient subgroups of interest for analysis and to perform meta-analysis immediately without the challenges posed by harmonizing heterogeneous microarray technologies, study designs, expression data processing methods and clinical data formats. We confirm that the recently proposed biomarker CXCL12 is associated with patient survival, independently of stage and optimal surgical debulking, which was possible only through meta-analysis owing to insufficient sample sizes of the individual studies. The database is implemented as the curatedOvarianData Bioconductor package for the R statistical computing language, providing a comprehensive and flexible resource for clinically oriented investigation of the ovarian cancer transcriptome. The package and pipeline for producing it are available from http://bcb.dfci.harvard.edu/ovariancancer. Database URL: http://bcb.dfci.harvard.edu/ovariancance