464 research outputs found

    Spaced seeds improve k-mer-based metagenomic classification

    Metagenomics is a powerful approach to studying the genetic content of environmental samples, and it has been strongly promoted by NGS technologies. To cope with the massive data involved in modern metagenomic projects, recent tools [4, 39] rely on the analysis of k-mers shared between the read to be classified and sampled reference genomes. Within this general framework, we show in this work that spaced seeds provide a significant improvement in classification accuracy over traditional contiguous k-mers. We support this thesis through a series of different computational experiments, including simulations of large-scale metagenomic projects. Scripts and programs used in this study, as well as supplementary material, are available from http://github.com/gregorykucherov/spaced-seeds-for-metagenomics. Comment: 23 pages
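
To illustrate the idea of a spaced seed, here is a minimal Python sketch. It is not taken from the paper or its repository: the function names `spaced_kmers` and `shared_hits`, the seed patterns, and the toy sequences are all assumptions made for illustration. A seed is written as a binary mask in which `1` marks a position that must match and `0` marks a position that is ignored.

```python
def spaced_kmers(seq, seed):
    """Return the set of spaced k-mers of `seq` under a binary seed mask.

    `seed` is a string such as "1101": positions marked "1" are kept,
    positions marked "0" are skipped. A contiguous k-mer is the special
    case where the mask contains only "1".
    """
    keep = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    return {
        "".join(seq[start + i] for i in keep)
        for start in range(len(seq) - span + 1)
    }


def shared_hits(read, reference, seed):
    """Count distinct (spaced) k-mers shared by a read and a reference."""
    return len(spaced_kmers(read, seed) & spaced_kmers(reference, seed))


# Toy example (hypothetical data): the read differs from the reference
# by a single substitution at its third position.
reference = "ACGT"
read = "ACCT"

# Both seeds have weight 3, i.e. three matching positions.
print(shared_hits(read, reference, "111"))   # contiguous 3-mers: 0 shared
print(shared_hits(read, reference, "1101"))  # spaced seed: 1 shared
```

In this toy case the substitution falls on the ignored position of the spaced seed, so the read still shares a k-mer with the reference, whereas every contiguous 3-mer overlapping the substitution is lost.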

    Novel Methods for Metagenomic Analysis

    By sampling the genetic content of microbes at the nucleotide level, metagenomics has rapidly established itself as the standard in characterizing the taxonomic diversity and functional capacity of microbial populations throughout nature. The decreasing cost of sequencing technologies and the simultaneous increase in throughput per run have given scientists the ability to deeply sample highly diverse communities on a reasonable budget. The Human Microbiome Project is representative of the flood of sequence data that will arrive in the coming years. Despite these advancements, there remains the significant challenge of analyzing massive metagenomic datasets to draw appropriate biological conclusions. This dissertation is a collection of novel methods developed for improved analysis of metagenomic data: (1) We begin with Figaro, a statistical algorithm that quickly and accurately infers and trims vector sequence from large Sanger-based read sets without prior knowledge of the vector used in library construction. (2) Next, we perform a rigorous evaluation of methodologies used to cluster environmental 16S rRNA sequences into species-level operational taxonomic units, and discover that many published studies utilize highly stringent parameters, resulting in overestimation of microbial diversity. (3) To assist in comparative metagenomics studies, we have created Metastats, a robust statistical methodology for comparing large-scale clinical datasets with up to thousands of subjects. Given a collection of annotated metagenomic features (e.g. taxa, COGs, or pathways), Metastats determines which features are differentially abundant between two populations. (4) Finally, we report on a new methodology that employs the generalized Lotka-Volterra model to infer microbe-microbe interactions from longitudinal 16S rRNA data.
    It is our hope that these methods will enhance standard metagenomic analysis techniques to provide better insight into the human microbiome and microbial communities throughout our world. To assist metagenomics researchers and those developing methods, all software described in this thesis is open-source and available online.
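
For readers unfamiliar with the generalized Lotka-Volterra model mentioned in item (4), the following Python sketch simulates its standard textbook form, dx_i/dt = x_i (mu_i + sum_j A_ij x_j), with forward-Euler steps. It only illustrates the model itself and is not the dissertation's inference procedure; the function names `glv_step` and `simulate`, the growth rates `mu`, the interaction matrix `A`, and the step size are all hypothetical values chosen for illustration.

```python
def glv_step(x, mu, A, dt):
    """One forward-Euler step of dx_i/dt = x_i * (mu_i + sum_j A[i][j] * x_j)."""
    n = len(x)
    return [
        x[i] + dt * x[i] * (mu[i] + sum(A[i][j] * x[j] for j in range(n)))
        for i in range(n)
    ]


def simulate(x0, mu, A, dt=0.01, steps=1000):
    """Return the list of abundance vectors visited, starting from x0."""
    trajectory = [list(x0)]
    for _ in range(steps):
        trajectory.append(glv_step(trajectory[-1], mu, A, dt))
    return trajectory


# Hypothetical two-taxon community (all numbers are made up):
# mu[i] is the intrinsic growth rate of taxon i, and A[i][j] is the
# effect of taxon j on the growth of taxon i.
mu = [1.0, 0.5]
A = [
    [-1.0, -0.5],  # taxon 0 limits itself and is inhibited by taxon 1
    [0.2, -1.0],   # taxon 1 benefits from taxon 0 and limits itself
]

trajectory = simulate([0.1, 0.1], mu, A)
print(trajectory[-1])  # abundances after 1000 steps of size 0.01
```

Inferring interactions from longitudinal data amounts to the reverse problem: estimating `mu` and `A` from observed abundance trajectories rather than simulating trajectories from known parameters.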

    Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3

    Culture-independent analyses of microbial communities have progressed dramatically in the last decade, particularly due to advances in methods for biological profiling via shotgun metagenomics. Opportunities for improvement continue to accelerate, with greater access to multi-omics, microbial reference genomes, and strain-level diversity. To leverage these, we present bioBakery 3, a set of integrated, improved methods for taxonomic, strain-level, functional, and phylogenetic profiling of metagenomes newly developed to build on the largest set of reference sequences now available. Compared to current alternatives, MetaPhlAn 3 increases the accuracy of taxonomic profiling, and HUMAnN 3 improves that of functional potential and activity. These methods detected novel disease-microbiome links in applications to CRC (1262 metagenomes) and IBD (1635 metagenomes and 817 metatranscriptomes). Strain-level profiling of an additional 4077 metagenomes with StrainPhlAn 3 and PanPhlAn 3 unraveled the phylogenetic and functional structure of the common gut microbe Ruminococcus bromii, previously described by only 15 isolate genomes. With open-source implementations and cloud-deployable reproducible workflows, the bioBakery 3 platform can help researchers deepen the resolution, scale, and accuracy of multi-omic profiling for microbial community studies.
    Authors: Beghini, Francesco; McIver, Lauren J; Blanco-Míguez, Aitor; Dubois, Leonard; Asnicar, Francesco; Maharjan, Sagun; Mailyan, Ana; Manghi, Paolo; Scholz, Matthias; Thomas, Andrew Maltez; Valles-Colomer, Mireia; Weingart, George; Zhang, Yancong; Zolfo, Moreno; Huttenhower, Curtis; Franzosa, Eric A.; Segata, Nicola

    Dispersal strategies shape persistence and evolution of human gut bacteria

    Human gut bacterial strains can co-exist with their hosts for decades, but little is known about how these microbes persist, disperse, and thereby evolve. Here, we examined these processes in 5,278 adult and infant fecal metagenomes, longitudinally sampled in individuals and families. Our analyses revealed that a subset of gut species is extremely persistent in individuals, families, and geographic regions, often represented by locally successful strains of the phylum Bacteroidota. These "tenacious" bacteria show high levels of genetic adaptation to the human host but a high probability of loss upon antibiotic interventions. By contrast, heredipersistent bacteria, notably Firmicutes, often rely on dispersal strategies with weak phylogeographic patterns but strong family transmissions, likely related to sporulation. These analyses describe how different dispersal strategies can lead to the long-term persistence of human gut microbes, with implications for gut flora modulations.

    Use of Whole Genome Shotgun Sequencing for the Analysis of Microbial Communities in Arabidopsis thaliana Leaves

    Microorganisms, such as all Bacteria, Archaea, and some Eukaryotes, inhabit all imaginable habitats on the planet, from water vents in the deep ocean to extreme environments of high temperature and salinity. Microbes also constitute the most diverse group of organisms in terms of genetic information, metabolic function, and taxonomy. Furthermore, many of these microbes establish complex interactions with each other and with many multicellular organisms. The collection of microbes that share a body space with a plant or animal is called the microbiota, and their genetic information is called the microbiome. The microbiota has emerged as a crucial determinant of a host's overall health, and understanding it has become crucial in many biological fields. In mammals, the gut microbiota has been linked to important diseases such as diabetes, inflammatory bowel disease, and dementia. In plants, the microbiota can provide protection against certain pathogens or confer resistance against harsh environmental conditions such as drought. Furthermore, the leaves of plants represent one of the largest surface areas that can potentially be colonized by microbes. The advent of sequencing technologies has allowed researchers to study microbial communities at unprecedented resolution and scale. By targeting individual loci such as the 16S rDNA locus in bacteria, many species can be studied simultaneously, along with properties such as their relative abundance, without the need to isolate each target taxon individually. Decreasing costs of DNA sequencing have also led to whole-genome shotgun sequencing, where random fragments of DNA are sequenced instead of a single locus or a small number of loci. This effectively renders the entire microbiome accessible to study, an approach referred to as metagenomics. Consequently, many more areas of investigation are open, such as the exploration of within-host genetic diversity, functional analysis, or assembly of individual genomes from metagenomes.
    In this study, I describe the analysis of metagenomic sequencing data from microbial communities in leaves of wild Arabidopsis thaliana individuals from southwest Germany. As a model organism, A. thaliana is not only accessible in the wild but also has a rich body of previous research in plant-microbe interactions. In the first section, I describe how whole-genome shotgun sequencing of leaf DNA extracts can be used to accurately describe the taxonomic composition of the microbial community of individual hosts. The nature of whole-genome shotgun sequencing is used to estimate true microbial abundances, which cannot be done with amplicon sequencing. I show that this community varies across hosts, although some trends are seen, such as the dominance of the bacterial genera Pseudomonas and Sphingomonas. Moreover, even though there is variation between individuals, I explore the influence of site of origin and host genotype. Finally, metagenomic assembly is applied to individual samples, showing the limitations of WGS in plant leaves. In the second section, I explore the genomic diversity of the most abundant genera, Pseudomonas and Sphingomonas. I use a core genome approach in which a set of common genes is obtained from previously sequenced and assembled genomes. Thereafter, the gene sequences of the core genome are used as a reference for short-read mapping. From these mappings, individual strain mixtures are inferred from the frequency distribution of non-reference bases at each detected single nucleotide polymorphism (SNP). Finally, SNPs are used to derive the population structure of strain mixtures across samples and with known reference genomes. In conclusion, this thesis provides insights into the use of metagenomic sequencing to study microbial populations in wild plants. I identify the strengths and weaknesses of using whole genome sequencing for this purpose, as well as a way to study strain-level dynamics of prevalent taxa within a single host.